Loading ORC tables

2015-11-16 Thread James Pirz
Hi,

I am using Hive 1.2 with ORC tables on Hadoop 2.6 on a cluster.
I load data into an ORC table by reading from an external table over raw
text files, using an INSERT statement:

INSERT INTO TABLE myorctab SELECT * FROM mytxttab;
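For context, the two tables look roughly like this (a minimal sketch; the
actual schema, field delimiter, and HDFS location are simplified here):

CREATE EXTERNAL TABLE mytxttab (
  id INT,
  val STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/data/raw/mytxttab';

CREATE TABLE myorctab (
  id INT,
  val STRING
)
STORED AS ORC;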

I ran a simple scale-up test to find out how the loading time increases as
I double the size of the data and the number of nodes. I realized that the
total time remains more or less the same (i.e., it scales properly).

I am just wondering why this happens: naively, I would think that if I
double the number of partitions and the size of the data, the time should
also roughly double, as the system needs to partition twice the amount of
data among twice the number of partitions. Am I missing something here?

Thnx


Re: Getting dot files for DAGs

2015-10-01 Thread James Pirz
Thanks for the suggestion; I had never used the Tez UI before and only
learned about it yesterday.
I am trying to find out how I can enable and use it. Apparently it needs
some changes in the binary that I am using (I built the Tez 0.7 binary
almost 2 months ago).
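From what I have read so far, it apparently needs the YARN Timeline Server
running, plus entries along these lines in tez-site.xml (the host and port
here are placeholders, and I have not verified this setup myself yet):

<property>
  <name>tez.history.logging.service.class</name>
  <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
</property>
<property>
  <name>tez.tez-ui.history-url.base</name>
  <value>http://webserver-host:9999/tez-ui/</value>
</property>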




On Wed, Sep 30, 2015 at 10:27 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Why not use the Tez UI?
>
> On Thu, Oct 1, 2015 at 2:29 AM, James Pirz <james.p...@gmail.com> wrote:
>
>> I am using Tez 0.7.0 on Hadoop 2.6 to run Hive queries.
>> I am interested in visually checking the DAGs for my queries, and I realized
>> that I can do that with Graphviz once I have the ".dot" files for my DAGs. My
>> issue is that I cannot find those files; they are not in the log directories
>> of YARN or Hadoop, or under /tmp.
>>
>> Any hint as to where I can find those files would be great. Do I need to add
>> any settings to my tez-site.xml in order to enable generating them?
>>
>> Thanks.
>>
>


Getting dot files for DAGs

2015-09-30 Thread James Pirz
I am using Tez 0.7.0 on Hadoop 2.6 to run Hive queries.
I am interested in visually checking the DAGs for my queries, and I realized
that I can do that with Graphviz once I have the ".dot" files for my DAGs. My
issue is that I cannot find those files; they are not in the log directories
of YARN or Hadoop, or under /tmp.

Any hint as to where I can find those files would be great. Do I need to add
any settings to my tez-site.xml in order to enable generating them?

Thanks.


Re: Getting dot files for DAGs

2015-09-30 Thread James Pirz
Thanks. I could locate them in the proper container's log directory and
visualize them.
I was at the wrong node, assuming they would be available on any of the
nodes, but they are actually dumped on just one node (the one that ran the
Application Master).
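In case it helps anyone else, a quick way to locate them is to search the
YARN container-log directories on each node, for example (the log root
varies by setup, so this path is just a guess):

find $HADOOP_HOME/logs/userlogs -name "*.dot" 2>/dev/null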



On Wed, Sep 30, 2015 at 7:00 PM, Hitesh Shah <hit...@apache.org> wrote:

> The .dot file is generated into the Tez Application Master’s container log
> dir. First, you need to figure out the YARN application in which the
> query/Tez DAG ran. Once you have the applicationId, you can use one of
> these two approaches:
>
> 1) Go to the YARN ResourceManager UI, find the application, and click
> through to the Application Master logs. The .dot file for the DAG should be
> visible there.
> 2) Using the applicationId (if the application has completed), get the
> YARN logs via "bin/yarn logs -applicationId <applicationId>"; once you have
> the logs, you will be able to find the contents of the .dot file within
> them. This approach only works if you have YARN log aggregation enabled.
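> For example (with a made-up applicationId, just to illustrate):
>
> bin/yarn logs -applicationId application_1443674642958_0001 > am_logs.txt
> grep -n "\.dot" am_logs.txt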
>
> thanks
> — Hitesh
>
>
> On Sep 30, 2015, at 5:29 PM, James Pirz <james.p...@gmail.com> wrote:
>
> > I am using Tez 0.7.0 on Hadoop 2.6 to run Hive queries.
> > I am interested in visually checking the DAGs for my queries, and I
> > realized that I can do that with Graphviz once I have the ".dot" files for
> > my DAGs. My issue is that I cannot find those files; they are not in the
> > log directories of YARN or Hadoop, or under /tmp.
> >
> > Any hint as to where I can find those files would be great. Do I need to
> > add any settings to my tez-site.xml in order to enable generating them?
> >
> > Thanks.
>
>


Checking the number of Readers

2015-09-11 Thread James Pirz
I am using Hive 1.2.0 on Hadoop 2.6 (on a cluster with 10 machines) and I
am trying to understand the performance of a full-table scan. I am running
the following query:

SELECT * FROM LINEITEM
WHERE L_LINENUMBER < 0;

and I am measuring its performance in different scenarios: using MR vs. Tez
as the execution engine, and with different table types/formats (an
external table over text data, or ORC).

My question is:
What is the best way to check the number of readers (scanners) that Hive
uses in parallel to read the data?

My data is in HDFS, and on each node I have one datanode process running,
which writes its blocks to 3 separate paths (each path persisting its data
on a separate disk).
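For reference, this layout corresponds to a datanode configuration roughly
like the following in hdfs-site.xml (the paths here are placeholders, not
my actual ones):

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>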

I tried to get this information from "explain" output or from the available
consoles, but I could not find it. Checking the number of established
connections to the datanode's data-transfer port (using the command below)
gives me 12, but I am not sure if I am looking at the correct metric:

netstat -anp | grep -w 50010 | grep ESTABLISHED | wc -l
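The same check can be run across all the nodes with a small loop like this
(the hostnames are placeholders):

for h in node01 node02 node03; do
  echo -n "$h: "
  ssh $h 'netstat -anp 2>/dev/null | grep -w 50010 | grep ESTABLISHED | wc -l'
done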


Any help would be appreciated.

Thnx


Aggregated Expression not in GROUP BY key

2015-07-29 Thread James Pirz
Hi,

I am using Hive 1.2, and I am trying to run some queries based on the TPC-H
schema. My original query is:

SELECT N_NAME, AVERAGE(C_ACCTBAL)
FROM customer JOIN nation
on C_NATIONKEY=N_NATIONKEY
GROUP BY N_NAME;

for which I get:
FAILED: SemanticException [Error 10025]: Line 1:15 Expression not in GROUP
BY key 'C_ACCTBAL'

It does not really make sense to me, as I am running an aggregation on an
attribute that is not part of the GROUP BY clause, and the aggregation
ensures that each group eventually gets a single value in the output. In
Hive's language manual we see that:
( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy )

 … When using group by clause, the select statement can only include
columns included in the group by clause. Of course, you can have as many
aggregation functions (e.g. count) in the select statement as well.

and the example there is similar to what I have.

I even simplified the query, and dropped the join, but it did not make a
difference:

SELECT C_NATIONKEY, AVERAGE(C_ACCTBAL)
FROM customer
GROUP BY C_NATIONKEY;

FAILED: SemanticException [Error 10025]: Line 1:20 Expression not in GROUP
BY key 'C_ACCTBAL'

Can you please let me know whether I am missing something here, or whether
this behavior is expected?

In case you need it, the schema for the tables looks like:

hive> describe customer;
OK
c_custkey   int
c_name   string
c_address   string
c_phone string
c_acctbal   double
c_mktsegment string
c_comment   string
c_nationkey int

hive> describe nation;
OK
n_nationkey int
n_name   string
n_regionkey int
n_comment   string

Thanks.


Re: Aggregated Expression not in GROUP BY key

2015-07-29 Thread James Pirz
Just a follow-up on the issue.
It was happening because I used AVERAGE() instead of AVG().
Sorry, but the error was misleading (it did not tell me that the function
name was invalid).
I had borrowed the query from a benchmark spec, which used AVERAGE in its
SQL statements, and I failed to adapt it for HiveQL.
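For anyone hitting the same thing, the working form of the original query
is simply:

SELECT N_NAME, AVG(C_ACCTBAL)
FROM customer JOIN nation
ON C_NATIONKEY = N_NATIONKEY
GROUP BY N_NAME;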



On Wed, Jul 29, 2015 at 5:03 PM, James Pirz <james.p...@gmail.com> wrote:

 Hi,

 I am using Hive 1.2, and I am trying to run some queries based on the
 TPC-H schema. My original query is:

 SELECT N_NAME, AVERAGE(C_ACCTBAL)
 FROM customer JOIN nation
 on C_NATIONKEY=N_NATIONKEY
 GROUP BY N_NAME;

 for which I get:
 FAILED: SemanticException [Error 10025]: Line 1:15 Expression not in GROUP
 BY key 'C_ACCTBAL'

 It does not really make sense to me, as I am running an aggregation on an
 attribute that is not part of the GROUP BY clause, and the aggregation
 ensures that each group eventually gets a single value in the output. In
 Hive's language manual we see that:
 ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy
 )

  … When using group by clause, the select statement can only include
 columns included in the group by clause. Of course, you can have as many
 aggregation functions (e.g. count) in the select statement as well.

 and the example there is similar to what I have.

 I even simplified the query, and dropped the join, but it did not make a
 difference:

 SELECT C_NATIONKEY, AVERAGE(C_ACCTBAL)
 FROM customer
 GROUP BY C_NATIONKEY;

 FAILED: SemanticException [Error 10025]: Line 1:20 Expression not in GROUP
 BY key 'C_ACCTBAL'

 Can you please let me know whether I am missing something here, or whether
 this behavior is expected?

 In case you need it, the schema for the tables looks like:

 hive> describe customer;
 OK
 c_custkey   int
 c_name   string
 c_address   string
 c_phone string
 c_acctbal   double
 c_mktsegment string
 c_comment   string
 c_nationkey int

 hive> describe nation;
 OK
 n_nationkey int
 n_name   string
 n_regionkey int
 n_comment   string

 Thanks.



Re: Hive 1.2.0 Unable to start metastore

2015-06-08 Thread James Pirz
Thanks!
It was a similar problem: conflicting JARs, but between Hive and Spark.
My eventual goal is to run Spark against Hive's tables, and since Spark's
libraries were on my path as well, there were conflicting JAR files.
I removed the Spark libraries from my PATH, and Hive's services (remote
metastore) started up fine.
For now I am good, but I am just wondering what the correct way to fix this
is: once I want to start Spark, I need to add its libraries to the PATH,
and the conflicts seem inevitable.



On Mon, Jun 8, 2015 at 12:09 PM, Slava Markeyev <slava.marke...@upsight.com>
wrote:

 It sounds like you are running into a jar conflict between the
 Hive-packaged Derby and the Hadoop-distro-packaged Derby. Look for Derby
 jars on your system to confirm.
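 For example, something along these lines (just a sketch; the find over /
 is deliberately broad):

 ls $HIVE_HOME/lib/derby*.jar
 find / -name "derby*.jar" 2>/dev/null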

 In the meantime, try adding this to your hive-env.sh or hadoop-env.sh file:

 export HADOOP_USER_CLASSPATH_FIRST=true

 On Mon, Jun 8, 2015 at 11:52 AM, James Pirz <james.p...@gmail.com> wrote:

 I am trying to run Hive 1.2.0 on Hadoop 2.6.0 (on a cluster, running
 CentOS). I am able to start the Hive CLI and run queries. But once I try to
 start Hive's metastore (I am trying to use the built-in Derby) using:

 hive --service metastore

 I keep getting class-not-found errors for
 org.apache.derby.jdbc.EmbeddedDriver (see below).

 I have exported $HIVE_HOME and added $HIVE_HOME/bin and $HIVE_HOME/lib to
 the $PATH, and I see that there is a derby-10.11.1.1.jar file under
 $HIVE_HOME/lib.

 In my hive-site.xml (under $HIVE_HOME/conf) I have:

 <property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>org.apache.derby.jdbc.EmbeddedDriver</value>
   <description>Driver class name for a JDBC metastore</description>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
   <description>JDBC connect string for a JDBC metastore</description>
 </property>

 So I am not sure why it cannot find it.
 Any suggestion or hint would be highly appreciated.


 Here is the error:

 javax.jdo.JDOFatalInternalException: Error creating transactional
 connection factory
 ...
 Caused by: java.lang.NoClassDefFoundError: Could not initialize class
 org.apache.derby.jdbc.EmbeddedDriver
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at java.lang.Class.newInstance(Class.java:379)
 at
 org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
 at
 org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:85)




 --

 Slava Markeyev | Engineering | Upsight

 Find me on LinkedIn http://www.linkedin.com/in/slavamarkeyev



Re: Hive 1.2.0 Unable to start metastore

2015-06-08 Thread James Pirz
Thanks for sharing the issue.
Currently I am using two different environment params to run my sessions:
One for Hive and one for Spark (wout conflicting Jars being present at the
same time), and this seemed to solve my issues. Although I have seen some
issues, specially once I need to restart my metastore server.

On Mon, Jun 8, 2015 at 1:11 PM, Slava Markeyev <slava.marke...@upsight.com>
wrote:

 Sounds like you ran into this:
 https://issues.apache.org/jira/browse/HIVE-9198


 On Mon, Jun 8, 2015 at 1:06 PM, James Pirz <james.p...@gmail.com> wrote:

 Thanks!
 It was a similar problem: conflicting JARs, but between Hive and Spark.
 My eventual goal is to run Spark against Hive's tables, and since Spark's
 libraries were on my path as well, there were conflicting JAR files.
 I removed the Spark libraries from my PATH, and Hive's services (remote
 metastore) started up fine.
 For now I am good, but I am just wondering what the correct way to fix
 this is: once I want to start Spark, I need to add its libraries to the
 PATH, and the conflicts seem inevitable.



 On Mon, Jun 8, 2015 at 12:09 PM, Slava Markeyev <
 slava.marke...@upsight.com> wrote:

 It sounds like you are running into a jar conflict between the
 Hive-packaged Derby and the Hadoop-distro-packaged Derby. Look for Derby
 jars on your system to confirm.

 In the meantime, try adding this to your hive-env.sh or hadoop-env.sh
 file:

 export HADOOP_USER_CLASSPATH_FIRST=true

 On Mon, Jun 8, 2015 at 11:52 AM, James Pirz <james.p...@gmail.com>
 wrote:

 I am trying to run Hive 1.2.0 on Hadoop 2.6.0 (on a cluster, running
 CentOS). I am able to start the Hive CLI and run queries. But once I try to
 start Hive's metastore (I am trying to use the built-in Derby) using:

 hive --service metastore

 I keep getting class-not-found errors for
 org.apache.derby.jdbc.EmbeddedDriver (see below).

 I have exported $HIVE_HOME and added $HIVE_HOME/bin and $HIVE_HOME/lib
 to the $PATH, and I see that there is a derby-10.11.1.1.jar file under
 $HIVE_HOME/lib.

 In my hive-site.xml (under $HIVE_HOME/conf) I have:

 <property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>org.apache.derby.jdbc.EmbeddedDriver</value>
   <description>Driver class name for a JDBC metastore</description>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
   <description>JDBC connect string for a JDBC metastore</description>
 </property>

 So I am not sure why it cannot find it.
 Any suggestion or hint would be highly appreciated.


 Here is the error:

 javax.jdo.JDOFatalInternalException: Error creating transactional
 connection factory
 ...
 Caused by: java.lang.NoClassDefFoundError: Could not initialize class
 org.apache.derby.jdbc.EmbeddedDriver
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at java.lang.Class.newInstance(Class.java:379)
 at
 org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
 at
 org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:85)




 --

 Slava Markeyev | Engineering | Upsight

 Find me on LinkedIn http://www.linkedin.com/in/slavamarkeyev





 --

 Slava Markeyev | Engineering | Upsight

 Find me on LinkedIn http://www.linkedin.com/in/slavamarkeyev