ORC tables loading
Hi, I am using Hive 1.2 with ORC tables on Hadoop 2.6 on a cluster. I load data into an ORC table by reading the data from an external table over raw text files, using an insert statement:

INSERT INTO TABLE myorctab SELECT * FROM mytxttab;

I ran a simple scale-up test to find out how the loading time increases as I double the size of the data and the number of nodes. I realized that the total time remains more or less the same (it scales properly). I am just wondering why this is happening: naively, I would think that if I double the number of partitions and the size of the data, the time should also roughly double, since the system needs to partition twice the amount of data across twice the number of partitions. Am I missing something here? Thanks
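For context, the load path described above follows this DDL pattern (a rough sketch; the column names, delimiter, and HDFS location are illustrative placeholders, not my actual schema):

```sql
-- Hypothetical sketch: external table over raw text, copied into ORC.
CREATE EXTERNAL TABLE mytxttab (id INT, payload STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  LOCATION '/data/raw/mytxttab';

CREATE TABLE myorctab (id INT, payload STRING)
  STORED AS ORC;

-- The load itself: a full scan of the text table rewritten as ORC.
INSERT INTO TABLE myorctab SELECT * FROM mytxttab;
```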
Re: Getting dot files for DAGs
Thanks for the suggestion. I had never used the Tez UI before and only learned about it yesterday; I am trying to find out how to enable and use it. Apparently it needs some changes to the binary I am using (I had built the binary for Tez 0.7 almost 2 months ago).

On Wed, Sep 30, 2015 at 10:27 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> Why not use the Tez UI?
>
> On Thu, Oct 1, 2015 at 2:29 AM, James Pirz <james.p...@gmail.com> wrote:
>
>> I am using Tez 0.7.0 on Hadoop 2.6 to run Hive queries.
>> I am interested in inspecting the DAGs for my queries visually, and I
>> realized that I can do that with Graphviz once I have the "dot" files of
>> my DAGs. My issue is that I cannot find those files; they are not in the
>> log directories of YARN or Hadoop, nor under /tmp.
>>
>> Any hint as to where I can find those files would be great. Do I need to
>> add any settings to my tez-site.xml in order to enable generating them?
>>
>> Thanks.
Getting dot files for DAGs
I am using Tez 0.7.0 on Hadoop 2.6 to run Hive queries. I am interested in inspecting the DAGs for my queries visually, and I realized that I can do that with Graphviz once I have the "dot" files of my DAGs. My issue is that I cannot find those files; they are not in the log directories of YARN or Hadoop, nor under /tmp. Any hint as to where I can find those files would be great. Do I need to add any settings to my tez-site.xml in order to enable generating them? Thanks.
Re: Getting dot files for DAGs
Thanks. I was able to locate them in the proper container's log directory and visualize them. I was on the wrong node, assuming they would be available on any of the nodes, but they are actually dumped on just one of them.

On Wed, Sep 30, 2015 at 7:00 PM, Hitesh Shah <hit...@apache.org> wrote:
> The .dot file is generated into the Tez Application Master’s container log
> dir. Firstly, you need to figure out the yarn application in which the
> query/Tez DAG ran. Once you have the applicationId, you can use one of
> these 2 approaches:
>
> 1) Go to the YARN ResourceManager UI, find the application and click
> through to the Application Master logs. The .dot file for the dag should be
> visible there.
> 2) Using the applicationId (if the application has completed), get the
> yarn logs using “bin/yarn logs -applicationId ” - once you have the
> logs, you will be able to find the contents of the .dot file within them.
> This approach only works if you have YARN log aggregation enabled.
>
> thanks
> — Hitesh
>
>> On Sep 30, 2015, at 5:29 PM, James Pirz <james.p...@gmail.com> wrote:
>> I am using Tez 0.7.0 on Hadoop 2.6 to run Hive queries.
>> I am interested in inspecting the DAGs for my queries visually, and I
>> realized that I can do that with Graphviz once I have the "dot" files of
>> my DAGs. My issue is that I cannot find those files; they are not in the
>> log directories of YARN or Hadoop, nor under /tmp.
>>
>> Any hint as to where I can find those files would be great. Do I need to
>> add any settings to my tez-site.xml in order to enable generating them?
>>
>> Thanks.
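For anyone finding this thread later, here is a rough sketch of Hitesh's second approach. The applicationId, the .dot file name, and the log-dump layout below are fabricated for illustration only; the real values come from the ResourceManager UI and the actual `yarn logs` output:

```shell
# Step 1 (on a real cluster, requires YARN log aggregation):
#   yarn logs -applicationId <your application id> > am_logs.txt
# Here we fabricate a minimal am_logs.txt so the extraction step can be shown:
cat > am_logs.txt <<'EOF'
LogType:dag_1443650000000_0001_1.dot
Log Contents:
digraph Example { Map1 -> Reducer2; }
End of LogType:dag_1443650000000_0001_1.dot
EOF

# Step 2: pull the .dot body out of the aggregated log dump.
sed -n '/^Log Contents:/,/^End of LogType/p' am_logs.txt | sed '1d;$d' > dag.dot
cat dag.dot

# Step 3 (requires Graphviz installed): dot -Tpng dag.dot -o dag.png
```

The same extraction works against the real dump once Step 1 is run with a genuine applicationId.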
Checking the number of Readers
I am using Hive 1.2.0 on Hadoop 2.6 (on a cluster with 10 machines) and I am trying to understand the performance of a full-table scan. I am running the following query:

SELECT * FROM LINEITEM WHERE L_LINENUMBER < 0;

and I am measuring its performance in different scenarios: MR vs. Tez, and with different table types/formats (an external table on text data, or ORC). My question is: what is the best way to check the number of readers (scanners) that Hive uses in parallel to read the data? My data is in HDFS, and on each node I have one DataNode process running, which writes its blocks to 3 separate paths (each path persists its data on a separate disk). I tried to get this information from "explain" or from the available consoles, but I could not find it. Counting the established connections to the DataNode's data-transfer port (using the command below) gives me 12, but I am not sure if I am looking at the correct metric:

netstat -anp | grep -w 50010 | grep ESTABLISHED | wc -l

Any help would be appreciated. Thanks
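To illustrate what that pipeline actually counts, here is a self-contained simulation run against fabricated netstat output (the addresses and PIDs are invented; 50010 is the default DataNode data-transfer port):

```shell
# Fabricated `netstat -anp`-style output: two ESTABLISHED DataNode
# transfers, one lingering TIME_WAIT, and one unrelated ssh connection.
cat > netstat_sample.txt <<'EOF'
tcp  0  0 10.0.0.1:50010  10.0.0.2:41234  ESTABLISHED 1234/java
tcp  0  0 10.0.0.1:50010  10.0.0.3:41235  ESTABLISHED 1234/java
tcp  0  0 10.0.0.1:50010  10.0.0.4:41236  TIME_WAIT   -
tcp  0  0 10.0.0.1:22     10.0.0.9:55555  ESTABLISHED 999/sshd
EOF

# Same filter as in the question: only ESTABLISHED connections touching
# port 50010 are counted (here: 2).
grep -w 50010 netstat_sample.txt | grep ESTABLISHED | wc -l
```

Note this counts all established connections to that port on one node, so it mixes readers from concurrently running tasks (and any other HDFS clients) into one number.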
Aggregated Expression not in GROUP BY key
Hi, I am using Hive 1.2, and I am trying to run some queries based on the TPC-H schema. My original query is:

SELECT N_NAME, AVERAGE(C_ACCTBAL) FROM customer JOIN nation ON C_NATIONKEY=N_NATIONKEY GROUP BY N_NAME;

for which I get:

FAILED: SemanticException [Error 10025]: Line 1:15 Expression not in GROUP BY key 'C_ACCTBAL'

This does not really make sense, as I am running an aggregation on an attribute that is not part of the group-by clause, which ensures that each group eventually produces a single output value. Hive's language manual ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy ) says:

… When using group by clause, the select statement can only include columns included in the group by clause. Of course, you can have as many aggregation functions (e.g. count) in the select statement as well.

and the example there is similar to what I have. I even simplified the query and dropped the join, but it did not make a difference:

SELECT C_NATIONKEY, AVERAGE(C_ACCTBAL) FROM customer GROUP BY C_NATIONKEY;

FAILED: SemanticException [Error 10025]: Line 1:20 Expression not in GROUP BY key 'C_ACCTBAL'

Can you please let me know whether I am missing something here, or whether this behavior is expected? In case you need it, the schema for the tables looks like:

hive> describe customer;
OK
c_custkey      int
c_name         string
c_address      string
c_phone        string
c_acctbal      double
c_mktsegment   string
c_comment      string
c_nationkey    int

hive> describe nation;
OK
n_nationkey    int
n_name         string
n_regionkey    int
n_comment      string

Thanks.
Re: Aggregated Expression not in GROUP BY key
Just a follow-up on the issue: it was happening because of using AVERAGE() instead of AVG(). Sorry, but the error was misleading (it did not say that the function name was invalid). I had borrowed the query from a benchmark spec, which used AVERAGE in its SQL statements, and I failed to adapt it for HiveQL.

On Wed, Jul 29, 2015 at 5:03 PM, James Pirz <james.p...@gmail.com> wrote:
> Hi, I am using Hive 1.2, and I am trying to run some queries based on the
> TPC-H schema. My original query is:
> SELECT N_NAME, AVERAGE(C_ACCTBAL) FROM customer JOIN nation ON C_NATIONKEY=N_NATIONKEY GROUP BY N_NAME;
> for which I get:
> FAILED: SemanticException [Error 10025]: Line 1:15 Expression not in GROUP BY key 'C_ACCTBAL'
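For the record, the corrected queries, with AVG() substituted for AVERAGE():

```sql
-- AVG() is the valid HiveQL aggregate; AVERAGE() is not a built-in function.
SELECT N_NAME, AVG(C_ACCTBAL)
FROM customer JOIN nation ON C_NATIONKEY = N_NATIONKEY
GROUP BY N_NAME;

-- The simplified variant without the join:
SELECT C_NATIONKEY, AVG(C_ACCTBAL)
FROM customer
GROUP BY C_NATIONKEY;
```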
Re: Hive 1.2.0 Unable to start metastore
Thanks! It was a similar problem: conflicting jars, but between Hive and Spark. My eventual goal is running Spark against Hive's tables, and with Spark's libraries on my path as well, there were conflicting jar files. I removed the Spark libraries from my PATH, and Hive's services (remote metastore) started up fine. For now I am good, but I am just wondering what the correct way to fix this is: once I want to start Spark, I need to include its libraries in the PATH again, and the conflicts seem inevitable.

On Mon, Jun 8, 2015 at 12:09 PM, Slava Markeyev <slava.marke...@upsight.com> wrote:

It sounds like you are running into a jar conflict between the Hive-packaged Derby and the Hadoop-distro-packaged Derby. Look for Derby jars on your system to confirm. In the meantime, try adding this to your hive-env.sh or hadoop-env.sh file:

export HADOOP_USER_CLASSPATH_FIRST=true

On Mon, Jun 8, 2015 at 11:52 AM, James Pirz <james.p...@gmail.com> wrote:

I am trying to run Hive 1.2.0 on Hadoop 2.6.0 (on a cluster, running CentOS). I am able to start the Hive CLI and run queries, but once I try to start Hive's metastore (I am trying to use the built-in Derby) using:

hive --service metastore

I keep getting "class not found" exceptions for org.apache.derby.jdbc.EmbeddedDriver (see below). I have exported $HIVE_HOME and added $HIVE_HOME/bin and $HIVE_HOME/lib to the $PATH, and I see that there is a derby-10.11.1.1.jar file under $HIVE_HOME/lib. In my hive-site.xml (under $HIVE_HOME/conf) I have:

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

So I am not sure why it cannot find it. Any suggestion or hint would be highly appreciated.
Here is the error:

javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
...
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.derby.jdbc.EmbeddedDriver
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at java.lang.Class.newInstance(Class.java:379)
	at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
	at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
	at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
	at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
	at org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:85)

--
Slava Markeyev | Engineering | Upsight
Find me on LinkedIn: http://www.linkedin.com/in/slavamarkeyev
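Slava's suggestion to look for Derby jars can be scripted. Below is a hedged sketch with a fabricated directory layout (the paths and the second Derby version are invented for illustration; on a real node you would point `find` at the actual installations):

```shell
# On a real node, the check would simply be:
#   find "$HIVE_HOME" "$HADOOP_HOME" -name 'derby*.jar'
# Here we build a fake layout to show what a conflict looks like:
mkdir -p demo/hive/lib demo/hadoop/share/hadoop/common/lib
touch demo/hive/lib/derby-10.11.1.1.jar
touch demo/hadoop/share/hadoop/common/lib/derby-10.10.2.0.jar

# Two different Derby versions reachable from the classpath is the
# conflict; HADOOP_USER_CLASSPATH_FIRST=true controls which one wins.
find demo -name 'derby*.jar' | sort
```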
Re: Hive 1.2.0 Unable to start metastore
Thanks for sharing the issue. Currently I am using two different sets of environment parameters to run my sessions, one for Hive and one for Spark (without the conflicting jars being present at the same time), and this seems to solve my issues, although I have seen some problems, especially when I need to restart my metastore server.

On Mon, Jun 8, 2015 at 1:11 PM, Slava Markeyev <slava.marke...@upsight.com> wrote:

Sounds like you ran into this: https://issues.apache.org/jira/browse/HIVE-9198

On Mon, Jun 8, 2015 at 1:06 PM, James Pirz <james.p...@gmail.com> wrote:

Thanks! It was a similar problem: conflicting jars, but between Hive and Spark. My eventual goal is running Spark against Hive's tables, and with Spark's libraries on my path as well, there were conflicting jar files. I removed the Spark libraries from my PATH, and Hive's services (remote metastore) started up fine. For now I am good, but I am just wondering what the correct way to fix this is: once I want to start Spark, I need to include its libraries in the PATH again, and the conflicts seem inevitable.