Re: Measuring the execution time of Hive queries through Ambari

2020-06-30 Thread Mich Talebzadeh
Many thanks to all. We will consider these options one by one.

Regards,









Re: Measuring the execution time of Hive queries through Ambari

2020-06-30 Thread Julien Tane
Hi Mich,

Again, Ambari is only a cluster management framework, not a complete GUI. It can 
have plugins, such as Views.


On Ambari you can have different stacks, which correspond to the services 
available for a given stack, and which you can install, start and stop.


First of all, check which stack is installed (Manage Ambari > Versions has the 
list of stacks).

On HDP 3.1, Tez is there and the Tez UI is available in the code, but the Tez UI 
(Ambari View) is no longer supported by Ambari.


What we did is install a Tomcat, deploy the tez-ui WAR in it, and then set the 
right values in the YARN configuration... and it worked.


Under Manage Ambari > Views, you will find the Views which are installed.
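
For reference, a rough sketch of the wiring the Tomcat/tez-ui setup described above 
typically touches (property names follow the usual standalone Tez UI / YARN Timeline 
Server setup; hosts, ports and values are placeholders, not something given in this 
thread):

  # yarn-site.xml: the Tez UI reads its data from the YARN Timeline Server
  yarn.timeline-service.enabled = true
  yarn.timeline-service.webapp.address = <timeline-host>:8188

  # tez-site.xml: publish DAG history to ATS and point jobs at the deployed tez-ui war
  tez.history.logging.service.class = org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService
  tez.tez-ui.history-url.base = http://<tomcat-host>:8080/tez-ui/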














Re: Measuring the execution time of Hive queries through Ambari

2020-06-30 Thread Zoltan Haindrich

Hey Mich!

I don't know which version you use (HDP 3+?), but you might want to see if "Data 
Analytics Studio" is available for that version; it could give similar insights to 
what the Tez UI provided.

cheers,
Zoltan




Re: Measuring the execution time of Hive queries through Ambari

2020-06-22 Thread Mich Talebzadeh
Hi Julien,

As far as I can see it is standard Ambari. It has the Tez UI, but when I run the query and
check the Tez UI it says the Tez view is not deployed!

Thanks









Re: Measuring the execution time of Hive queries through Ambari

2020-06-22 Thread Julien Tane
Mich,


When you say that you are using Ambari to connect to Hive, what do you mean by 
that?

Unless you added a view in Ambari to run queries (as far as I know, that is not in 
the vanilla Ambari).


One thing you could more or less do is use the Tez UI (assuming you are using 
Tez), but here again this is not in the standard Ambari (at least not in the newer 
versions).

One other possibility (depending on how you configured YARN) would be to use 
the YARN UI, which should be accessible in the YARN tab from your Ambari... but 
here it kind of depends on how you configured your system.


Kind Regards,


J














Measuring the execution time of Hive queries through Ambari

2020-06-22 Thread Mich Talebzadeh
Hi,

Using Ambari to connect to Hive, is there any way of measuring the query
time?

Please be aware that this is through Ambari not through beeline etc.

The tool we have at the moment to get to Prod is Ambari.

We do not have any other luxury!

Thanks






Re: Running Hive queries from Ambari or from edge node via beeline

2020-06-16 Thread Mich Talebzadeh
Many thanks, Julien, much appreciated.










Re: Running Hive queries from Ambari or from edge node via beeline

2020-06-15 Thread Julien Tane
Ambari is not a GUI interface to Hive... it is a Hadoop cluster management tool.


If you need a command-line-compatible interface, you can for instance use 
beeline.


You can use a JDBC-based GUI (like DBeaver) to access the data as long as the 
port is accessible.
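
As a concrete illustration (the host, port and user are placeholders; 10000 is only 
the common HiveServer2 default), connecting from an edge node could look like:

  beeline -u "jdbc:hive2://<hs2-host>:10000/default" -n <user>

The same JDBC URL is what a GUI client such as DBeaver would be pointed at, and beeline 
also prints the elapsed time of each statement (e.g. "N rows selected (12.3 seconds)").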


In older versions of Ambari you could add some Ambari Views which had a query 
interface, though I do not know which ones work with the current versions of Ambari.


Kind Regards,

J












Running Hive queries from Ambari or from edge node via beeline

2020-06-15 Thread Mich Talebzadeh
Hi,

I am not a user of Ambari but I believe it is a GUI interface to Hive. It
can be run on your laptop and connect to Hive via ODBC or JDBC.

There is also another tool, DB Visualizer Pro, that uses JDBC to connect to the
Hive Thrift server.

My view is that if one is a developer the best bet would be to have access
to the edge node and connect through beeline (the Hive Thrift server). This
could be through PuTTY or Tectia SSH SecureShell, but crucially, since one
is running on the Hadoop cluster (the edge node is part of the cluster on
the same VLAN), the performance is expected to be better?

Also, both Tectia SSH and PuTTY are thin clients, so you are effectively
running the code on the edge node as opposed to going through client-server.

Does this make sense?

Thanks



Mich





Re: Clustering and Large-scale analysis of Hive Queries

2018-08-03 Thread Gopal Vijayaraghavan


> I am interested in working on a project that takes a large number of Hive 
> queries (as well as their meta data like amount of resources used etc) and 
> find out common sub queries and expensive query groups etc.

This was roughly the central research topic of one of the Hive CBO devs, except 
it was implemented for Pig (not Hive).

https://hal.inria.fr/hal-01353891
+
https://github.com/jcamachor/pigreuse

I think there's a lot of interest in this topic for ETL workloads and the goal 
is to pick this up as ETL becomes the target problem.

There's a recent SIGMOD paper which talks about the same sort of reuse.

https://www.microsoft.com/en-us/research/uploads/prod/2018/03/cloudviews-sigmod2018.pdf

If you are interested in looking into this using existing infra in Hive, I 
recommend looking at Zoltan's recent work which tracks query plans + runtime 
statistics from the RUNTIME_STATS table in the metastore.

You can debug through what this does by doing

"explain reoptimization <query>;"

Cheers,
Gopal
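
For illustration, a hedged sketch of how the command above might be used (the table 
and query are placeholders, not taken from this thread):

  -- run the query once so runtime statistics land in RUNTIME_STATS
  select count(*) from orders o join customers c on o.cust_id = c.id;

  -- then ask Hive for the plan it would build using those recorded statistics
  explain reoptimization
  select count(*) from orders o join customers c on o.cust_id = c.id;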




Re: Clustering and Large-scale analysis of Hive Queries

2018-07-26 Thread Thai Bui
I don’t see any project especially tuned for Hive doing what you described.
I have encountered this problem recently as the number of users and queries
grew exponentially in my company and I’m interested.

We are currently collecting Hive internal metrics in order to do certain
analysis (don't know what yet) to suggest better settings and/or
better querying patterns for our users, mostly involving really large
queries that cause OOM errors.

Hive also has an existing cost-based optimizer (CBO) that
can perform query rewrites (mostly of joins) to speed up queries based on
table/column statistics.

Another feature that could be beneficial is to identify common patterns in
existing queries to suggest a materialized view to build (materialized views are a
new feature of Hive 3.0). I think the Hive team has this supporting
feature on the roadmap as well.
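
As a rough sketch of that Hive 3.0 feature (the names are invented, and the base 
table generally needs to be a transactional/ORC table for rewriting to apply):

  -- capture a commonly repeated aggregation as a materialized view
  create materialized view mv_daily_sales as
  select store_id, sale_date, sum(amount) as total_amount
  from sales
  group by store_id, sale_date;

  -- with rewriting enabled, matching queries can be answered from the view
  set hive.materializedview.rewriting=true;
  select store_id, sum(amount)
  from sales
  where sale_date = '2018-07-01'
  group by store_id;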

Thai


Re: Clustering and Large-scale analysis of Hive Queries

2018-07-25 Thread Johannes Alberti
Did you guys already look at Dr Elephant?

https://engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark

Not sure if there is anything you might find useful, but I would be interested 
in hearing about the good and the bad of Dr Elephant with Hive.

Sent from my iPhone



Clustering and Large-scale analysis of Hive Queries

2018-07-25 Thread Zheng Shao
Hi,

I am interested in working on a project that takes a large number of Hive
queries (as well as their metadata, like the amount of resources used, etc.) and
finds common sub-queries, expensive query groups, etc.

Is there any existing work in this domain? Happy to collaborate as well
if there are shared interests.

Zheng


Any hooks to invoke the custom database's statistics for aggregate hive queries

2017-09-12 Thread Amey Barve
Hi All,

We have developed a custom storage handler implementing HiveStorageHandler.
We also have APIs/statistics for totalCount, max, min, etc. for the data
stored in our database.

See the example queries below:
1. select count(*) from my_table;
2. select max(id_column) from my_table;

So for the above queries, instead of a full table scan, the storage handler should be
able to invoke our totalCount, max, min, etc. methods.

So are there any hooks to invoke these statistics APIs for aggregate Hive
queries, which could do a simple lookup of these statistics?
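
For comparison, a sketch of the closest built-in mechanism (this is Hive's generic 
statistics path, not a storage-handler hook; the property and commands assume a 
standard Hive setup): once statistics exist, the planner can answer such aggregates 
without scanning.

  -- gather table and column statistics
  analyze table my_table compute statistics;
  analyze table my_table compute statistics for columns id_column;

  -- allow count/min/max to be answered from stored statistics where possible
  set hive.compute.query.using.stats=true;
  select count(*) from my_table;
  select max(id_column) from my_table;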

Thanks,
Amey


Re: Maintaining big and complex Hive queries

2016-12-21 Thread Edward Capriolo
I have been contemplating attaching metadata for the query lineage to each
table, such that I can know where the data came from and have a one-click
regenerate button.
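
A minimal sketch of that idea, assuming lineage is recorded as table properties 
(the property keys and table name are made up for illustration):

  -- tag a derived table with where it came from and how to rebuild it
  alter table daily_summary set tblproperties (
    'lineage.sources'='raw_events,dim_stores',
    'lineage.script'='hdfs:///etl/daily_summary.hql'
  );

  -- read it back when you want to regenerate the table
  show tblproperties daily_summary;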



Re: Maintaining big and complex Hive queries

2016-12-21 Thread Stephen Sprague
my 2 cents. :)

As soon as you say "complex query" I would submit you've lost the upper hand
and you're behind the eight-ball right off the bat. And you know this too,
otherwise you wouldn't have posted here. Ha!

I use cascading CTAS statements so that I can examine the intermediate
tables (a sketch follows below). Another approach is to use CTEs, but while that
makes things easier to read, it's still one big query and you don't get insight into
the "work" tables.

Yes, it could take longer to execute if those intermediate tables can't
be run in parallel, but that's a small price to pay compared to human debug time, in
my book anyway.
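
A small sketch of the cascading-CTAS style (the table and column names are invented):

  -- stage 1: materialize an intermediate result you can inspect on its own
  create table work_orders_enriched as
  select o.*, c.segment
  from orders o join customers c on o.cust_id = c.id;

  -- stage 2: build the final result from the staged "work" table
  create table report_by_segment as
  select segment, count(*) as order_cnt, sum(amount) as revenue
  from work_orders_enriched
  group by segment;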

thoughts?

Cheers,
Stephen.





>


Re: Maintaining big and complex Hive queries

2016-12-21 Thread Saumitra Shahapure
Hi Elliot,

Thanks for letting me know. HPL/SQL sounded particularly interesting, but
in the documentation I could not see any way to pass output generated by
one Hive query to the next one. The tool looks good as a homogeneous PL/SQL
platform for multiple big-data systems (http://www.hplsql.org/about).

However, in order to break up a single complex Hive query, DDLs look to be the only
way in HPL/SQL too. Or is there an alternative way that I might have missed?

-- Saumitra S. Shahapure



Re: Maintaining big and complex Hive queries

2016-12-15 Thread Elliot West
I notice that HPL/SQL is not mentioned on the page I referenced, however I
expect that is another approach that you could use to modularise:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
http://www.hplsql.org/doc

>


Re: Maintaining big and complex Hive queries

2016-12-15 Thread Elliot West
Some options are covered here, although there is no definitive guidance as
far as I know:

https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL#UnitTestingHiveSQL-Modularisation



Maintaining big and complex Hive queries

2016-12-15 Thread Saumitra Shahapure
Hello,

We are running and maintaining a quite big and complex Hive SELECT query
right now. It's basically a single SELECT query which performs a JOIN of
about ten other SELECT query outputs.

The simplest way to refactor it that we can think of is to break this query down
into multiple views and then join the views. A similar possibility is
to create intermediate tables.

However, creating multiple DDLs in order to maintain a single DML is not
very smooth. We would end up polluting the metadata database by creating views
/ intermediate tables which are used only in this ETL.

What are the other efficient ways to maintain complex SQL queries written
in Hive? Are there better ways to break a Hive query into multiple modules?

-- Saumitra S. Shahapure


Re: Hive queries rejected under heavy load

2016-09-28 Thread Stephen Sprague
Gotta start by looking at the logs and running the local client to eliminate
HS2. Perhaps run hive as such:

$ hive -hiveconf hive.root.logger=DEBUG,console

do you see any smoking gun?



Hive queries rejected under heavy load

2016-09-28 Thread Jose Rozanec
Hi,

We have a Hive cluster (Hive 2.1.0 + Tez 0.8.4) which works well for most
queries. However, for some heavy ones we observe that they sometimes are able to
execute and sometimes get rejected. We are not sure why we get a rejection
instead of having them queued to wait for execution until resources in the
cluster are available again. We notice that the connection waits for a
minute, and if it fails to get resources assigned, it drops the query.
Looking at the configuration parameters, it is not clear to us whether this can be
changed. Has anyone had a similar experience and could provide us some
guidance?

Thank you in advance,

Joze.


Running multiple hive queries in the same jvm

2016-09-22 Thread rahul challapalli
Team,

I want to know whether there is any way in which I can run three Hive queries
sequentially in a single JVM. From the docs, I found that setting
"mapreduce.framework.name=local" might achieve what I am looking for. Can
someone confirm?
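
For what it's worth, a hedged sketch of what that looks like in a single Hive CLI 
session (the queries are placeholders, and local mode is only sensible for small inputs):

  -- run the MapReduce work locally, inside the client JVM
  set mapreduce.framework.name=local;
  -- or let Hive choose local mode automatically for small jobs
  set hive.exec.mode.local.auto=true;

  select count(*) from t1;
  select count(*) from t2;
  select count(*) from t3;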

- Rahul


Re: How to run large Hive queries in PySpark 1.2.1

2016-05-26 Thread Nikolay Voronchikhin
Hi Jörn,

We will be upgrading to MapR 5.1, Hive 1.2, and Spark 1.6.1 at the end of
June.

In the meantime, can this still be done with these versions?
There is no firewall issue, since we have edge nodes and cluster nodes
hosted in the same location with the same NFS mount.




Re: How to run large Hive queries in PySpark 1.2.1

2016-05-26 Thread Jörn Franke
Both have outdated versions; usually one can support you better if you upgrade 
to the newest.
A firewall could be an issue here.




Fwd: How to run large Hive queries in PySpark 1.2.1

2016-05-26 Thread Nikolay Voronchikhin
Hi PySpark users,

We need to be able to run large Hive queries in PySpark 1.2.1. Users run
PySpark on an edge node and submit jobs to a cluster that
allocates YARN resources to the clients.
We are using MapR as the Hadoop distribution, on top of Hive 0.13 and Spark
1.2.1.


Currently, our process for writing queries works only for small result
sets, for example:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
results = sqlContext.sql("select column from database.table limit 10").collect()
results



How do I save the HiveQL query to RDD first, then output the results?

This is the error I get when running a query that requires output of
400,000 rows:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
results = sqlContext.sql("select column from database.table").collect()
results
...

/path/to/mapr/spark/spark-1.2.1/python/pyspark/sql.py in collect(self)
   1976 """
   1977 with SCCallSiteSync(self.context) as css:
-> 1978     bytesInJava = self._jschema_rdd.baseSchemaRDD().collectToPython().iterator()
   1979     cls = _create_cls(self.schema())
   1980     return map(cls, self._collect_iterator_through_file(bytesInJava))

/path/to/mapr/spark/spark-1.2.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536 answer = self.gateway_client.send_command(command)
    537 return_value = get_return_value(answer, self.gateway_client,
--> 538     self.target_id, self.name)
    539
    540 for temp_arg in temp_args:

/path/to/mapr/spark/spark-1.2.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298     raise Py4JJavaError(
    299         'An error occurred while calling {0}{1}{2}.\n'.
--> 300         format(target_id, '.', name), value)
    301 else:
    302     raise Py4JError(
Py4JJavaError: An error occurred while calling o76.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Exception while getting task result: java.io.IOException: Failed to
connect to cluster_node/IP_address:port
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




For this example, ideally, this query should output the full 400,000-row
result set.
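
A minimal sketch of one way around this, assuming the Spark 1.2 RDD API: keep the
result as a SchemaRDD and write it out from the executors instead of collect()-ing
400,000 rows back to the driver (the output path below is purely illustrative):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
# SchemaRDD; nothing is executed until an action is called
results = sqlContext.sql("select column from database.table")
# write the rows to HDFS in parallel rather than pulling them all to the driver
results.map(lambda row: row[0]).saveAsTextFile("/user/hypothetical/query_output")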


Thanks for your help,
*Nikolay Voronchikhin*
https://www.linkedin.com/in/nvoronchikhin

*E-mail: nvoronchik...@gmail.com <nvoronchik...@gmail.com>*

* <nvoronchik...@gmail.com>*


Varying vcores/ram for hive queries running Tez engine

2016-04-25 Thread Nitin Kumar
I was trying to benchmark some hive queries. I am using the tez execution
engine. I varied the values of the following properties:

   1. hive.tez.container.size
   2. tez.task.resource.memory.mb
   3. tez.task.resource.cpu.vcores

Changes in values for property 1 are reflected properly. However, it seems
that hive does not respect changes in values of property 3; it always
allocates one vcore per requested container (the RM is configured to use the
DominantResourceCalculator). This got me thinking about the precedence of
property values in hive and tez.

I have the following questions with respect to these configurations

   1. Does hive respect the set values for the properties 2 and 3 at all?
   2. If I set property 1 to a value say 2048 MB and property 2 is set to a
      value of say 1024 MB does this mean that I am wasting about a GB of
      memory for each spawned container?
   3. Is there a property in hive similar to property 1 that allows me to use
      the 'set' command in the .hql file to specify the number of vcores to
      use per container?
   4. Changes in value for the property tez.am.resource.cpu.vcores are
      reflected at runtime. However I do not observe the same behaviour with
      property 3. Are there other configurations that take precedence over it?

Your inputs and suggestions would be highly appreciated.

Thanks!


PS: Tests conducted on a 5 node cluster running HDP 2.3.0
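
For reference, the kind of per-script override being discussed in questions 1-3
above would look something like the following in an .hql file. The values and
table name are purely illustrative, and, as noted above, whether the vcore
request is honoured also depends on the RM's resource calculator:

set hive.tez.container.size=2048;
set tez.task.resource.memory.mb=2048;
set tez.task.resource.cpu.vcores=2;
select count(*) from some_db.some_table;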


RE: Mappers spawning Hive queries

2016-04-18 Thread Ryan Harris
I'm not aware of any particular reason that this shouldn't "inherently" work,
but for debugging purposes I'd be wondering about the nested environment
variables related to the hadoop job: the bash shell where you are trying to
launch subsequent hive queries already has pre-existing hadoop job environment
variables declared in the environment from the parent streaming job, and I can't
say for sure that there wouldn't be conflicts there. So while I don't know of
any reason that it definitely won't work, I know that you are venturing into
uncharted territory and you may uncover unexpected edge cases.


From: Shirish Tatikonda [mailto:shirish.tatiko...@gmail.com]
Sent: Monday, April 18, 2016 3:44 PM
To: user@hive.apache.org
Subject: Re: Mappers spawning Hive queries

I am using Hive 1.2.1 with MR backend.

Ryan, I hear you. I totally agree. This is not the best approach, and I am in 
fact restructuring the approach.

However, I would like to understand what is going on. In my test run, each hive 
query is computing distinct on a toy table of 10 records -- so, we are 
definitely not running into problems like resource contention. Also, I 
increased (streaming) mappers' task timeout value (to 1hr) so that I give ample 
time for shell script (i.e., hive query) to finish. So, architecturally, is 
there something that limits us spawning such nested MR jobs -- a streaming MR 
job spawning multiple hive queries that in turn spawn mr jobs.

Shirish


On Mon, Apr 18, 2016 at 1:31 PM, Ryan Harris 
<ryan.har...@zionsbancorp.com<mailto:ryan.har...@zionsbancorp.com>> wrote:
My $0.02

If you are running multiple concurrent queries on the data, you are probably
doing it wrong (or at least inefficiently)... although this somewhat depends on
what type of files are backing your hive warehouse...

Let's assume that your data is NOT backed by ORC/parquet files, and that you
are NOT using Tez/Spark as your execution engine.

Generally with HDFS, data I/O is going to be the slowest piece... so, with your
workflow, each hive query is going to need to read ALL of the source data to
resolve the query.  It would be much more efficient if you could write a more
complex query that could read the source data 1 time (instead of however many
parallel operations you are running). Additionally, from an efficiency
perspective, running queries in parallel might only help improve performance if
each of your queries requires fewer map tasks than the total capacity of your
cluster; otherwise it would generally be more efficient to run your queries
in series.

If you schedule the work in series, and things get backed up, the job will 
still eventually complete.  If you attempt to do TOO much work in parallel, all 
of the jobs will start timing out and then everything will fail.

There may be a valid reason for approaching the problem the way that you are, 
but I'd encourage you to look at restructuring your approach to the problem to 
save you more headaches down the road.

From: Shirish Tatikonda 
[mailto:shirish.tatiko...@gmail.com<mailto:shirish.tatiko...@gmail.com>]
Sent: Monday, April 18, 2016 2:00 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Mappers spawning Hive queries

Hi John,

2) The shell script is invoked in the mappers of a Hadoop streaming job.

1) The use case is that I have to process multiple entities in parallel. Each 
entity is associated with its own data set. The processing involves a few hive 
queries to do joins and aggregations, which is followed by some code in Python. 
My thought process is to put the hive queries and python invocation in a shell 
script, and invoke the shell script on multiple entities in parallel through a 
streaming mapreduce job.

Shirish


On Sat, Apr 16, 2016 at 12:10 AM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:
Just out of curiosity, what is the use case behind this?

How do you call the shell script?

> On 16 Apr 2016, at 00:24, Shirish Tatikonda 
> <shirish.tatiko...@gmail.com<mailto:shirish.tatiko...@gmail.com>> wrote:
>
> Hello,
>
> I am trying to run multiple hive queries in parallel by submitting them 
> through a map-reduce job.
> More specifically, I have a map-only hadoop streaming job where each mapper 
> runs a shell script that does two things -- 1) parses input lines obtained 
> via streaming; and 2) submits a very simple hive query (via hive -e ...) with 
> parameters computed from step-1.
>
> Now, when I run the streaming job, the mappers seem to be stuck and I don't 
> know what is going on. When I looked on resource manager web UI, I don't see 
> any new MR Jobs (triggered from the hive query). I am trying to understand 
> this behavior.
>
> This may be a bad idea to begin with, and there may be better ways to 
> accomplish the same tas

Re: Mappers spawning Hive queries

2016-04-18 Thread Shirish Tatikonda
I am using Hive 1.2.1 with MR backend.

Ryan, I hear you. I totally agree. This is not the best approach, and I am
in fact restructuring it.

However, I would like to understand what is going on. In my test run, each
hive query is computing *distinct* on a toy table of 10 records -- so we
are definitely not running into problems like resource contention. Also, I
increased the (streaming) mappers' task timeout value (to 1 hr) so that I give
ample time for the shell script (i.e., the hive query) to finish. So,
architecturally, is there something that limits us from spawning such nested
MR jobs -- a streaming MR job spawning multiple hive queries that in turn
spawn MR jobs?

Shirish


On Mon, Apr 18, 2016 at 1:31 PM, Ryan Harris <ryan.har...@zionsbancorp.com>
wrote:

> My $0.02
>
>
>
> If you are running multiple concurrent queries on the data, you are
> probably doing it wrong (or at least inefficiently)although this
> somewhat depends on what type of files are backing your hive warehouse...
>
>
>
> Let's assume that your data is NOT backed by ORC/parquet files, and that
> you are NOT using Tez/Spark as your execution engine
>
>
>
> Generally with HDFS, data I/O is going to be the slowest pieceso, with
> your workflow, each hive query is going to need to read ALL of the source
> data to resolve the query.  It would be much more efficient if you could
> write a more complex query that could read the source data 1 time (instead
> of however many parallel operations you are running)Additionally, from
> an efficiency perspective running queries in parallel might only help
> improve performance if each of your queries requires fewer map tasks than
> the total capacity of your clusterotherwise it would  generally be more
> efficient to run your queries in series.
>
>
>
> If you schedule the work in series, and things get backed up, the job will
> still eventually complete.  If you attempt to do TOO much work in parallel,
> all of the jobs will start timing out and then everything will fail.
>
>
>
> There may be a valid reason for approaching the problem the way that you
> are, but I'd encourage you to look at restructuring your approach to the
> problem to save you more headaches down the road.
>
>
>
> *From:* Shirish Tatikonda [mailto:shirish.tatiko...@gmail.com]
> *Sent:* Monday, April 18, 2016 2:00 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Mappers spawning Hive queries
>
>
>
> Hi John,
>
>
>
> 2) The shell script is invoked in the mappers of a Hadoop streaming job.
>
>
>
> 1) The use case is that I have to process multiple entities in parallel.
> Each entity is associated with its own data set. The processing involves a
> few hive queries to do joins and aggregations, which is followed by some
> code in Python. My thought process is to put the hive queries and python
> invocation in a shell script, and invoke the shell script on multiple
> entities in parallel through a streaming mapreduce job.
>
>
>
> Shirish
>
>
>
>
>
> On Sat, Apr 16, 2016 at 12:10 AM, Jörn Franke <jornfra...@gmail.com>
> wrote:
>
> Just out of curiosity, what is the use case behind this?
>
> How do you call the shell script?
>
>
> > On 16 Apr 2016, at 00:24, Shirish Tatikonda <shirish.tatiko...@gmail.com>
> wrote:
> >
> > Hello,
> >
> > I am trying to run multiple hive queries in parallel by submitting them
> through a map-reduce job.
> > More specifically, I have a map-only hadoop streaming job where each
> mapper runs a shell script that does two things -- 1) parses input lines
> obtained via streaming; and 2) submits a very simple hive query (via hive
> -e ...) with parameters computed from step-1.
> >
> > Now, when I run the streaming job, the mappers seem to be stuck and I
> don't know what is going on. When I looked on resource manager web UI, I
> don't see any new MR Jobs (triggered from the hive query). I am trying to
> understand this behavior.
> >
> > This may be a bad idea to begin with, and there may be better ways to
> accomplish the same task. However, I would like to understand the behavior
> of such a MR job.
> >
> > Any thoughts?
> >
> > Thank you,
> > Shirish
> >
>
>


RE: Mappers spawning Hive queries

2016-04-18 Thread Ryan Harris
My $0.02

If you are running multiple concurrent queries on the data, you are probably
doing it wrong (or at least inefficiently)... although this somewhat depends on
what type of files are backing your hive warehouse...

Let's assume that your data is NOT backed by ORC/parquet files, and that you
are NOT using Tez/Spark as your execution engine.

Generally with HDFS, data I/O is going to be the slowest piece... so, with your
workflow, each hive query is going to need to read ALL of the source data to
resolve the query.  It would be much more efficient if you could write a more
complex query that could read the source data 1 time (instead of however many
parallel operations you are running). Additionally, from an efficiency
perspective, running queries in parallel might only help improve performance if
each of your queries requires fewer map tasks than the total capacity of your
cluster; otherwise it would generally be more efficient to run your queries
in series.
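
For example, a single multi-insert statement can scan the source once and feed
several outputs; the database, table and column names here are purely illustrative:

FROM some_db.events e
INSERT OVERWRITE TABLE agg_by_entity SELECT e.entity, count(*) GROUP BY e.entity
INSERT OVERWRITE TABLE agg_by_day SELECT e.dt, count(*) GROUP BY e.dt;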

If you schedule the work in series, and things get backed up, the job will 
still eventually complete.  If you attempt to do TOO much work in parallel, all 
of the jobs will start timing out and then everything will fail.

There may be a valid reason for approaching the problem the way that you are, 
but I'd encourage you to look at restructuring your approach to the problem to 
save you more headaches down the road.

From: Shirish Tatikonda [mailto:shirish.tatiko...@gmail.com]
Sent: Monday, April 18, 2016 2:00 PM
To: user@hive.apache.org
Subject: Re: Mappers spawning Hive queries

Hi John,

2) The shell script is invoked in the mappers of a Hadoop streaming job.

1) The use case is that I have to process multiple entities in parallel. Each 
entity is associated with its own data set. The processing involves a few hive 
queries to do joins and aggregations, which is followed by some code in Python. 
My thought process is to put the hive queries and python invocation in a shell 
script, and invoke the shell script on multiple entities in parallel through a 
streaming mapreduce job.

Shirish


On Sat, Apr 16, 2016 at 12:10 AM, Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:
Just out of curiosity, what is the use case behind this?

How do you call the shell script?

> On 16 Apr 2016, at 00:24, Shirish Tatikonda 
> <shirish.tatiko...@gmail.com<mailto:shirish.tatiko...@gmail.com>> wrote:
>
> Hello,
>
> I am trying to run multiple hive queries in parallel by submitting them 
> through a map-reduce job.
> More specifically, I have a map-only hadoop streaming job where each mapper 
> runs a shell script that does two things -- 1) parses input lines obtained 
> via streaming; and 2) submits a very simple hive query (via hive -e ...) with 
> parameters computed from step-1.
>
> Now, when I run the streaming job, the mappers seem to be stuck and I don't 
> know what is going on. When I looked on resource manager web UI, I don't see 
> any new MR Jobs (triggered from the hive query). I am trying to understand 
> this behavior.
>
> This may be a bad idea to begin with, and there may be better ways to 
> accomplish the same task. However, I would like to understand the behavior of 
> such a MR job.
>
> Any thoughts?
>
> Thank you,
> Shirish
>




Re: Mappers spawning Hive queries

2016-04-18 Thread Mich Talebzadeh
What is the version of Hive and the execution engine (MR, Tez, Spark)?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 18 April 2016 at 20:59, Shirish Tatikonda <shirish.tatiko...@gmail.com>
wrote:

> Hi John,
>
> 2) The shell script is invoked in the mappers of a Hadoop streaming job.
>
> 1) The use case is that I have to process multiple entities in parallel.
> Each entity is associated with its own data set. The processing involves a
> few hive queries to do joins and aggregations, which is followed by some
> code in Python. My thought process is to put the hive queries and python
> invocation in a shell script, and invoke the shell script on multiple
> entities in parallel through a streaming mapreduce job.
>
> Shirish
>
>
> On Sat, Apr 16, 2016 at 12:10 AM, Jörn Franke <jornfra...@gmail.com>
> wrote:
>
>> Just out of curiosity, what is the use case behind this?
>>
>> How do you call the shell script?
>>
>> > On 16 Apr 2016, at 00:24, Shirish Tatikonda <
>> shirish.tatiko...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > I am trying to run multiple hive queries in parallel by submitting them
>> through a map-reduce job.
>> > More specifically, I have a map-only hadoop streaming job where each
>> mapper runs a shell script that does two things -- 1) parses input lines
>> obtained via streaming; and 2) submits a very simple hive query (via hive
>> -e ...) with parameters computed from step-1.
>> >
>> > Now, when I run the streaming job, the mappers seem to be stuck and I
>> don't know what is going on. When I looked on resource manager web UI, I
>> don't see any new MR Jobs (triggered from the hive query). I am trying to
>> understand this behavior.
>> >
>> > This may be a bad idea to begin with, and there may be better ways to
>> accomplish the same task. However, I would like to understand the behavior
>> of such a MR job.
>> >
>> > Any thoughts?
>> >
>> > Thank you,
>> > Shirish
>> >
>>
>
>


Re: Mappers spawning Hive queries

2016-04-18 Thread Shirish Tatikonda
Hi John,

2) The shell script is invoked in the mappers of a Hadoop streaming job.

1) The use case is that I have to process multiple entities in parallel.
Each entity is associated with its own data set. The processing involves a
few hive queries to do joins and aggregations, which is followed by some
code in Python. My thought process is to put the hive queries and python
invocation in a shell script, and invoke the shell script on multiple
entities in parallel through a streaming mapreduce job.

Shirish


On Sat, Apr 16, 2016 at 12:10 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Just out of curiosity, what is the use case behind this?
>
> How do you call the shell script?
>
> > On 16 Apr 2016, at 00:24, Shirish Tatikonda <shirish.tatiko...@gmail.com>
> wrote:
> >
> > Hello,
> >
> > I am trying to run multiple hive queries in parallel by submitting them
> through a map-reduce job.
> > More specifically, I have a map-only hadoop streaming job where each
> mapper runs a shell script that does two things -- 1) parses input lines
> obtained via streaming; and 2) submits a very simple hive query (via hive
> -e ...) with parameters computed from step-1.
> >
> > Now, when I run the streaming job, the mappers seem to be stuck and I
> don't know what is going on. When I looked on resource manager web UI, I
> don't see any new MR Jobs (triggered from the hive query). I am trying to
> understand this behavior.
> >
> > This may be a bad idea to begin with, and there may be better ways to
> accomplish the same task. However, I would like to understand the behavior
> of such a MR job.
> >
> > Any thoughts?
> >
> > Thank you,
> > Shirish
> >
>


Re: Mappers spawning Hive queries

2016-04-16 Thread Jörn Franke
Just out of curiosity, what is the use case behind this?

How do you call the shell script?

> On 16 Apr 2016, at 00:24, Shirish Tatikonda <shirish.tatiko...@gmail.com> 
> wrote:
> 
> Hello,
> 
> I am trying to run multiple hive queries in parallel by submitting them 
> through a map-reduce job. 
> More specifically, I have a map-only hadoop streaming job where each mapper 
> runs a shell script that does two things -- 1) parses input lines obtained 
> via streaming; and 2) submits a very simple hive query (via hive -e ...) with 
> parameters computed from step-1. 
> 
> Now, when I run the streaming job, the mappers seem to be stuck and I don't 
> know what is going on. When I looked on resource manager web UI, I don't see 
> any new MR Jobs (triggered from the hive query). I am trying to understand 
> this behavior. 
> 
> This may be a bad idea to begin with, and there may be better ways to 
> accomplish the same task. However, I would like to understand the behavior of 
> such a MR job.
> 
> Any thoughts?
> 
> Thank you,
> Shirish
> 


Mappers spawning Hive queries

2016-04-15 Thread Shirish Tatikonda
Hello,

I am trying to run multiple hive queries in parallel by submitting them
through a map-reduce job.
More specifically, I have a map-only hadoop streaming job where each mapper
runs a shell script that does two things -- 1) parses input lines obtained
via streaming; and 2) submits a very simple hive query (via hive -e ...)
with parameters computed from step-1.
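
For illustration only, the mapper script being described is roughly of this
shape (the field position, database, table and column names are hypothetical):

#!/bin/bash
while read line; do
  # 1) parse the streaming input line
  entity=$(echo "$line" | cut -f1)
  # 2) submit a very simple per-entity hive query
  hive -e "select count(distinct id) from some_db.events where entity = '${entity}'"
done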

Now, when I run the streaming job, the mappers seem to be stuck and I don't
know what is going on. When I looked on resource manager web UI, I don't
see any new MR Jobs (triggered from the hive query). I am trying to
understand this behavior.

This may be a bad idea to begin with, and there may be better ways to
accomplish the same task. However, I would like to understand the behavior
of such a MR job.

Any thoughts?

Thank you,
Shirish


Re: Running hive queries in different queue

2016-02-28 Thread Rajit Saha
Thanks a lot Sathi.


I also found that if the Hive execution engine is MapReduce,
set mapreduce.job.queuename=; works.

If the Hive execution engine is Tez,
we need to do
set tez.queue.name=;
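
For example, with a hypothetical queue named "etl", either setting goes at the
top of the .hql file or is typed at the hive/beeline prompt before the query:

-- MapReduce engine:
set mapreduce.job.queuename=etl;
-- Tez engine:
set tez.queue.name=etl;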



Cheers

Rajit Saha

Principal DevOps Engineer | BigData
LendingClub




From: Sathi Chowdhury 
<sathi.chowdh...@lithium.com<mailto:sathi.chowdh...@lithium.com>>
Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" 
<user@hive.apache.org<mailto:user@hive.apache.org>>
Date: Friday, February 26, 2016 at 6:01 PM
To: "user@hive.apache.org<mailto:user@hive.apache.org>" 
<user@hive.apache.org<mailto:user@hive.apache.org>>
Subject: Re: Running hive queries in different queue

I think  in your hive script you can do
set mapreduce.job.queuename=;
Thanks
Sathi

From: Rajit Saha
Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>"
Date: Friday, February 26, 2016 at 5:34 PM
To: "user@hive.apache.org<mailto:user@hive.apache.org>"
Subject: Running hive queries in different queue

Hi

I want to run hive query in a queue others than "default" queue from hive 
client command line . Can anybody please suggest a way to do it.

Regards
Rajit

On Feb 26, 2016, at 07:36, Patrick Duin 
<patd...@gmail.com<mailto:patd...@gmail.com>> wrote:

Hi Prasanth.

Thanks for the quick reply!

The logs don't show much more of the stacktrace I'm afraid:
java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:809)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


The stacktrace isn't really the issue though. The NullPointer is a symptom 
caused by not being able to return any stripes, if you look at the line in the 
code it is  because the 'stripes' field is null which should never happen. 
This, we think, is caused by failing namenode network traffic. We would have 
lots of IO warning in the logs saying block's cannot be found or e.g.:
16/02/01 13:20:34 WARN hdfs.BlockReaderFactory: I/O error constructing remote 
block reader.
java.io.IOException: java.lang.InterruptedException
at org.apache.hadoop.ipc.Client.call(Client.java:1448)
at org.apache.hadoop.ipc.Client.call(Client.java:1400)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy32.getServerDefaults(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:268)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy33.getServerDefaults(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:1007)
at 
org.apache.hadoop.hdfs.DFSClient.shouldEncryptData(DFSClient.java:2062)
at 
org.apache.hadoop.hdfs.DFSClient.newDataEncryptionKey(DFSClient.java:2068)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:208)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:159)
at 
org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:90)
at 
org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3123)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:755)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:670)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:337)
at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:407)
at 
org.apache.hadoop.hive.ql.

Re: Running hive queries in different queue

2016-02-27 Thread Mich Talebzadeh
Hello.

What Hive client are you using? beeline

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 27 February 2016 at 01:34, Rajit Saha  wrote:

> Hi
>
> I want to run hive query in a queue others than "default" queue from hive
> client command line . Can anybody please suggest a way to do it.
>
> Regards
> Rajit
>
> On Feb 26, 2016, at 07:36, Patrick Duin  wrote:
>
> Hi Prasanth.
>
> Thanks for the quick reply!
>
> The logs don't show much more of the stacktrace I'm afraid:
> java.lang.NullPointerException
> at
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:809)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
> The stacktrace isn't really the issue though. The NullPointer is a symptom
> caused by not being able to return any stripes, if you look at the line in
> the code it is  because the 'stripes' field is null which should never
> happen. This, we think, is caused by failing namenode network traffic. We
> would have lots of IO warning in the logs saying block's cannot be found or
> e.g.:
> 16/02/01 13:20:34 WARN hdfs.BlockReaderFactory: I/O error constructing
> remote block reader.
> java.io.IOException: java.lang.InterruptedException
> at org.apache.hadoop.ipc.Client.call(Client.java:1448)
> at org.apache.hadoop.ipc.Client.call(Client.java:1400)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy32.getServerDefaults(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:268)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy33.getServerDefaults(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:1007)
> at
> org.apache.hadoop.hdfs.DFSClient.shouldEncryptData(DFSClient.java:2062)
> at
> org.apache.hadoop.hdfs.DFSClient.newDataEncryptionKey(DFSClient.java:2068)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:208)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:159)
> at
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:90)
> at
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3123)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:755)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:670)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:337)
> at
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
> at
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
> at
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
> at java.io.DataInputStream.readFully(DataInputStream.java:195)
> at
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:407)
> at
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.(ReaderImpl.java:311)
> at
> org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
> at
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:885)
> at
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:771)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
> at 

Re: Running hive queries in different queue

2016-02-26 Thread Sathi Chowdhury
I think  in your hive script you can do
set mapreduce.job.queuename=;
Thanks
Sathi

From: Rajit Saha
Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>"
Date: Friday, February 26, 2016 at 5:34 PM
To: "user@hive.apache.org<mailto:user@hive.apache.org>"
Subject: Running hive queries in different queue

Hi

I want to run hive query in a queue others than "default" queue from hive 
client command line . Can anybody please suggest a way to do it.

Regards
Rajit

On Feb 26, 2016, at 07:36, Patrick Duin 
<patd...@gmail.com<mailto:patd...@gmail.com>> wrote:

Hi Prasanth.

Thanks for the quick reply!

The logs don't show much more of the stacktrace I'm afraid:
java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:809)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


The stacktrace isn't really the issue though. The NullPointer is a symptom 
caused by not being able to return any stripes, if you look at the line in the 
code it is  because the 'stripes' field is null which should never happen. 
This, we think, is caused by failing namenode network traffic. We would have 
lots of IO warning in the logs saying block's cannot be found or e.g.:
16/02/01 13:20:34 WARN hdfs.BlockReaderFactory: I/O error constructing remote 
block reader.
java.io.IOException: java.lang.InterruptedException
at org.apache.hadoop.ipc.Client.call(Client.java:1448)
at org.apache.hadoop.ipc.Client.call(Client.java:1400)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy32.getServerDefaults(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:268)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy33.getServerDefaults(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:1007)
at 
org.apache.hadoop.hdfs.DFSClient.shouldEncryptData(DFSClient.java:2062)
at 
org.apache.hadoop.hdfs.DFSClient.newDataEncryptionKey(DFSClient.java:2068)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:208)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:159)
at 
org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:90)
at 
org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3123)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:755)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:670)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:337)
at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:407)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.(ReaderImpl.java:311)
at 
org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:885)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:771)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
at java.util.concurrent.FutureTask.get(FutureTask.java:187)
at 
org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1047)
at org.apache.h

Running hive queries in different queue

2016-02-26 Thread Rajit Saha
Hi

I want to run a hive query in a queue other than the "default" queue from the
hive client command line. Can anybody please suggest a way to do it?

Regards
Rajit

On Feb 26, 2016, at 07:36, Patrick Duin 
> wrote:

Hi Prasanth.

Thanks for the quick reply!

The logs don't show much more of the stacktrace I'm afraid:
java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:809)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


The stacktrace isn't really the issue though. The NullPointer is a symptom 
caused by not being able to return any stripes, if you look at the line in the 
code it is  because the 'stripes' field is null which should never happen. 
This, we think, is caused by failing namenode network traffic. We would have 
lots of IO warning in the logs saying block's cannot be found or e.g.:
16/02/01 13:20:34 WARN hdfs.BlockReaderFactory: I/O error constructing remote 
block reader.
java.io.IOException: java.lang.InterruptedException
at org.apache.hadoop.ipc.Client.call(Client.java:1448)
at org.apache.hadoop.ipc.Client.call(Client.java:1400)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy32.getServerDefaults(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:268)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy33.getServerDefaults(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:1007)
at 
org.apache.hadoop.hdfs.DFSClient.shouldEncryptData(DFSClient.java:2062)
at 
org.apache.hadoop.hdfs.DFSClient.newDataEncryptionKey(DFSClient.java:2068)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:208)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:159)
at 
org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:90)
at 
org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3123)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:755)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:670)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:337)
at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:407)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.(ReaderImpl.java:311)
at 
org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:885)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:771)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
at java.util.concurrent.FutureTask.get(FutureTask.java:187)
at 
org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1047)
at org.apache.hadoop.ipc.Client.call(Client.java:1442)
... 33 more

Our job doesn't always fail; sometimes splits do get calculated. We suspect that
when the namenode is too busy our job hits some time-outs and the whole thing
fails.

Our intuition has been the same as you suggest: bigger files are better. But we
see a degradation in performance as soon as our files get

Re: Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-16 Thread Devopam Mittra
+1 for all suggestions provided already.

I have personally used Talend Big Data Studio in conjunction with Hive +
Cron/Autosys to build and manage a small DW.
I found it easy to rapidly build and deploy. It helps with email integration etc.,
which was my custom requirement (spooling a few reports and sharing them via email
at routine intervals).

regards
Dev

On Tue, Feb 16, 2016 at 4:10 PM, Elliot West <tea...@gmail.com> wrote:

> I'd say that so long as you can achieve a similar quality of engineering
> as is possible with other software development domains, then 'yes, it is
> ok'.
>
> Specifically, our Hive projects are packaged as RPMs, built and released
> with Maven, covered by suites of unit tests developed with HiveRunner, and
> part of the same Jenkins CI process as other Java based projects.
> Decomposing large processes into sensible units is not as easy as with
> other frameworks so this may require more thought and care.
>
> More information here:
> https://cwiki.apache.org/confluence/display/Hive/Unit+testing+HQL
>
> You have many potential alternatives depending on which languages you are
> comfortable using: Pig, Flink, Cascading, Spark, Crunch, Scrunch, Scalding,
> etc.
>
> Elliot.
>
>
> On Tuesday, 16 February 2016, Ramasubramanian <
> ramasubramanian.naraya...@gmail.com> wrote:
>
>> Hi,
>>
>> Is it ok to build an entire ETL/ELT data flow using HIVE queries?
>>
>> Data is stored in HIVE. We have transactional and reference data. We need
>> to build a small warehouse.
>>
>> Need suggestion on alternatives too.
>>
>> Regards,
>> Rams
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-16 Thread Elliot West
I'd say that so long as you can achieve a similar quality of engineering as
is possible with other software development domains, then 'yes, it is ok'.

Specifically, our Hive projects are packaged as RPMs, built and released
with Maven, covered by suites of unit tests developed with HiveRunner, and
part of the same Jenkins CI process as other Java based projects.
Decomposing large processes into sensible units is not as easy as with
other frameworks so this may require more thought and care.

More information here:
https://cwiki.apache.org/confluence/display/Hive/Unit+testing+HQL

You have many potential alternatives depending on which languages you are
comfortable using: Pig, Flink, Cascading, Spark, Crunch, Scrunch, Scalding,
etc.

Elliot.

On Tuesday, 16 February 2016, Ramasubramanian <
ramasubramanian.naraya...@gmail.com> wrote:

> Hi,
>
> Is it ok to build an entire ETL/ELT data flow using HIVE queries?
>
> Data is stored in HIVE. We have transactional and reference data. We need
> to build a small warehouse.
>
> Need suggestion on alternatives too.
>
> Regards,
> Rams


Re: Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-15 Thread Mich Talebzadeh
 

A combination of both normally 

See below 

https://www.linkedin.com/pulse/etl-elt-use-case-mich-talebzadeh-ph-d-?trk=pulse_spock-articles
[1] 

HTH. 

Mich 

On 16/02/2016 06:19, Ramasubramanian wrote: 

> Hi,
> 
> Is it ok to build an entire ETL/ELT data flow using HIVE queries?
> 
> Data is stored in HIVE. We have transactional and reference data. We need to 
> build a small warehouse. 
> 
> Need suggestion on alternatives too. 
> 
> Regards,
> Rams

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


 

Links:
--
[1]
https://www.linkedin.com/pulse/etl-elt-use-case-mich-talebzadeh-ph-d-?trk=pulse_spock-articles

Re: Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-15 Thread Heng Chen
My company does its ETL data flow using HIVE + Pig, and it is working OK.



2016-02-16 14:55 GMT+08:00 Jörn Franke <jornfra...@gmail.com>:

> Why should it not be ok if you do not miss any functionality? You can use
> oozie + hive queries to have more sophisticated logging and scheduling. Do
> not forget to do proper capacity/queue management.
>
> > On 16 Feb 2016, at 07:19, Ramasubramanian <
> ramasubramanian.naraya...@gmail.com> wrote:
> >
> > Hi,
> >
> > Is it ok to build an entire ETL/ELT data flow using HIVE queries?
> >
> > Data is stored in HIVE. We have transactional and reference data. We
> need to build a small warehouse.
> >
> > Need suggestion on alternatives too.
> >
> > Regards,
> > Rams
>


Re: Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-15 Thread Jörn Franke
Why should it not be ok, as long as you are not missing any functionality? You can
use Oozie + Hive queries to get more sophisticated logging and scheduling. Do not
forget to do proper capacity/queue management.

> On 16 Feb 2016, at 07:19, Ramasubramanian 
> <ramasubramanian.naraya...@gmail.com> wrote:
> 
> Hi,
> 
> Is it ok to build an entire ETL/ELT data flow using HIVE queries?
> 
> Data is stored in HIVE. We have transactional and reference data. We need to 
> build a small warehouse. 
> 
> Need suggestion on alternatives too. 
> 
> Regards,
> Rams


Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-15 Thread Ramasubramanian
Hi,

Is it ok to build an entire ETL/ELT data flow using HIVE queries?

Data is stored in HIVE. We have transactional and reference data. We need to 
build a small warehouse. 

Need suggestion on alternatives too. 

Regards,
Rams

Re: how to set a job name for hive queries

2016-01-23 Thread Artem Ervits
Please see this; it's possible with Tez in Hive 1.2.1:
https://community.hortonworks.com/questions/9004/naming-tez-hive-session.html
On Jan 19, 2016 11:06 AM, "Frank Luo"  wrote:

> We are in a multi-tenant environment wanting to add a client’s name into
> each job name hence they can be informed/involved when job fails. We can
> easily do that with M/R jobs, but I haven’t figure out a way to do so for
> hive job.
>
>
>
> I googled and found the answer below, but I couldn’t get it to work. Also,
> I assume it is only possible with M/R engine and not possible for TEZ, is
> it right?
>
>
>
>
> http://stackoverflow.com/questions/19036371/how-do-i-control-a-hive-job-name-but-keep-the-stage-info
>
>
>
> Thanks in advance


how to set a job name for hive queries

2016-01-19 Thread Frank Luo
We are in a multi-tenant environment and want to add a client’s name into each 
job name so that they can be informed/involved when a job fails. We can easily do 
that with M/R jobs, but I haven’t figured out a way to do so for a hive job.

I googled and found the answer below, but I couldn’t get it to work. Also, I 
assume it is only possible with the M/R engine and not possible for TEZ; is that 
right?

http://stackoverflow.com/questions/19036371/how-do-i-control-a-hive-job-name-but-keep-the-stage-info
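
For the MR engine, one commonly used knob is the MapReduce job name, which can be
set per query, e.g. (the client and table names below are purely illustrative):

set mapred.job.name=acme_client_daily_load;
select count(*) from some_db.some_table;

The wrinkle, as the Stack Overflow title suggests, is that every stage of the
query then gets the same name. For Tez, the Hortonworks link in the reply above
describes naming the Tez session instead.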

Thanks in advance


RE: python libraries to execute or call hive queries

2015-08-31 Thread rakesh sharma
Hi Gopal
Have you tried pyhs2 libraryIt has many useful functions to retrieve the data
thanksrakesh

> Date: Fri, 28 Aug 2015 11:53:20 -0700
> Subject: Re: python libraries to execute or call hive queries
> From: gop...@apache.org
> To: user@hive.apache.org
> 
> 
> > Can anyone suggest any python libraries to call hive queries from python
> >scripts ?
> 
> https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Pyth
> on
> 
> 
> Though I suspect that's out of date.
> 
> https://github.com/t3rmin4t0r/amplab-benchmark/blob/master/runner/run_query
> .py#L604
> 
> 
> is roughly the way to cut-paste that into working form (for hive-13),
> though you've got to use the exact thrift version of the HiveServer2 you
> run against.
> 
> Though, recently I've noticed the SQLAlchemy wrappers to be more
> convenient 
> 
> https://github.com/dropbox/PyHive/blob/master/pyhive/sqlalchemy_hive.py
> 
> 
> Irrespective of the method of access, the only consistent way to talk to
> Hive is over the JDBC interaction layer (Thrift server).
> 
> Launching bin/hive via Subprocess will work, but I've found that reading
> the results out with a regex has more parsing issues than I'd like.
> 
> Cheers,
> Gopal
> 
> 
  

python libraries to execute or call hive queries

2015-08-28 Thread Giri P
Hi All,

Can anyone suggest any python libraries to call hive queries from python
scripts?


What is the best practice for executing queries from python: using the hive
cli, beeline, jdbc, etc.?

Thanks
Giri


Re: python libraries to execute or call hive queries

2015-08-28 Thread Gopal Vijayaraghavan

 Can anyone suggest any python libraries to call hive queries from python
scripts ?

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Pyth
on


Though I suspect that's out of date.

https://github.com/t3rmin4t0r/amplab-benchmark/blob/master/runner/run_query
.py#L604


is roughly the way to cut-paste that into working form (for hive-13),
though you've got to use the exact thrift version of the HiveServer2 you
run against.

Though, recently I've noticed the SQLAlchemy wrappers to be more
convenient 

https://github.com/dropbox/PyHive/blob/master/pyhive/sqlalchemy_hive.py
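
A minimal sketch with PyHive's DB-API layer, assuming a HiveServer2 instance on a
hypothetical host and port:

from pyhive import hive

# connect to HiveServer2 (host, port and username are illustrative)
conn = hive.Connection(host='hs2.example.com', port=10000, username='analyst')
cursor = conn.cursor()
cursor.execute('SELECT col FROM some_db.some_table LIMIT 10')
for row in cursor.fetchall():
    print(row)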


Irrespective of the method of access, the only consistent way to talk to
Hive is over the JDBC interaction layer (Thrift server).

Launching bin/hive via Subprocess will work, but I've found that reading
the results out with a regex has more parsing issues than I'd like.

Cheers,
Gopal




Re: How to match three letter month name in Hive queries

2014-12-21 Thread Furcy Pin
Hi Thimut,

I believe that the UDF unix_timestamp uses the java class SimpleDateFormat.
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

From the doc, you can see that m denotes a minute while M denotes a
month.
For your problem, yyyy-MMM-dd should do the trick.
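
For example (assuming an English locale for the month abbreviations):

select unix_timestamp('2014-Dec-20', 'yyyy-MMM-dd');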

Regards,

Furcy


2014-12-21 4:36 GMT+01:00 Thimuth Amarakoon thim...@gmail.com:

 Hi,

 How can we match a date value like *2014-Dec-20* in unix_timestamp()? The
 pattern *-MM-dd* matches 2014-12-20 format. But -mmm-dd or
 -m-dd is not doing the trick for matching the month name.

 Thanks and regards,
 Thimuth



Re: How to match three letter month name in Hive queries

2014-12-21 Thread Thimuth Amarakoon
Thanks a lot Furcy. It works.

Regards,
Thimuth

On Sun, Dec 21, 2014 at 4:34 PM, Furcy Pin furcy@flaminem.com wrote:

 Hi Thimut,

 I believe that the UDF unix_timestamp uses the java class SimpleDateFormat.
 http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

 From the doc, you can see that m denotes a minute while M denotes a
 month.
 For your problem, -MMM-dd should do the trick.

 Regards,

 Furcy


 2014-12-21 4:36 GMT+01:00 Thimuth Amarakoon thim...@gmail.com:

 Hi,

 How can we match a date value like *2014-Dec-20* in unix_timestamp()?
 The pattern *-MM-dd* matches 2014-12-20 format. But -mmm-dd or
 -m-dd is not doing the trick for matching the month name.

 Thanks and regards,
 Thimuth





How to match three letter month name in Hive queries

2014-12-20 Thread Thimuth Amarakoon
Hi,

How can we match a date value like *2014-Dec-20* in unix_timestamp()? The
pattern *yyyy-MM-dd* matches the 2014-12-20 format, but yyyy-mmm-dd or
yyyy-m-dd is not doing the trick for matching the month name.

Thanks and regards,
Thimuth


Re: Hive queries returning all NULL values.

2014-08-26 Thread Tor Ivry
Raymond - you were the closest.
The Parquet field names contained '::', e.g. bag1::user_name.

Hope this helps anyone in the future.

Thanks for all your help.

Tor



On Sun, Aug 17, 2014 at 7:50 PM, Raymond Lau raymond.lau...@gmail.com
wrote:

 Do your field names in your parquet files contain upper case letters by
 any chance, e.g. userName?  Hive will not read the data of external tables if
 the field names are not completely lower case; it doesn't convert them
 properly in the case of external tables.
 On Aug 17, 2014 8:00 AM, hadoop hive hadooph...@gmail.com wrote:

 Take a small set of data like 2-5 line and insert it...

 After that you can try insert first 10 column and then next 10 till you
 fund your problematic column
 On Aug 17, 2014 8:37 PM, Tor Ivry tork...@gmail.com wrote:

 Is there any way to debug this?

 We are talking about many fields here.
 How can I see which field has the mismatch?



 On Sun, Aug 17, 2014 at 4:30 PM, hadoop hive hadooph...@gmail.com
 wrote:

 Hi,

 You check the data type you have provided while creating external
 table, it should match with data in files.

 Thanks
 Vikas Srivastava
 On Aug 17, 2014 7:07 PM, Tor Ivry tork...@gmail.com wrote:

  Hi



 I have a hive (0.11) table with the following create syntax:



 CREATE EXTERNAL TABLE events(

 …

 )

 PARTITIONED BY(dt string)

   ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'

   STORED AS

 INPUTFORMAT parquet.hive.DeprecatedParquetInputFormat

 OUTPUTFORMAT parquet.hive.DeprecatedParquetOutputFormat

 LOCATION '/data-events/success’;



 Query runs fine.


 I add hdfs partitions (containing snappy.parquet files).



 When I run

 hive

  select count(*) from events where dt=“20140815”

 I get the correct result



 *Problem:*

 When I run

 hive

  select * from events where dt=“20140815” limit 1;

 I get

 OK

 NULL NULL NULL NULL NULL NULL NULL 20140815



 *The same query in Impala returns the correct values.*



 Any idea what could be the issue?



 Thanks

 Tor





Hive queries returning all NULL values.

2014-08-17 Thread Tor Ivry
Hi



I have a hive (0.11) table with the following create syntax:



CREATE EXTERNAL TABLE events(

…

)

PARTITIONED BY(dt string)

  ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'

  STORED AS

INPUTFORMAT parquet.hive.DeprecatedParquetInputFormat

OUTPUTFORMAT parquet.hive.DeprecatedParquetOutputFormat

LOCATION '/data-events/success’;



Query runs fine.


I add hdfs partitions (containing snappy.parquet files).



When I run

hive

 select count(*) from events where dt=“20140815”

I get the correct result



*Problem:*

When I run

hive

 select * from events where dt=“20140815” limit 1;

I get

OK

NULL NULL NULL NULL NULL NULL NULL 20140815



*The same query in Impala returns the correct values.*



Any idea what could be the issue?



Thanks

Tor



Re: Hive queries returning all NULL values.

2014-08-17 Thread hadoop hive
Hi,

Check the data type you have provided while creating the external table; it
should match the data in the files.

Thanks
Vikas Srivastava
On Aug 17, 2014 7:07 PM, Tor Ivry tork...@gmail.com wrote:

 Hi



 I have a hive (0.11) table with the following create syntax:



 CREATE EXTERNAL TABLE events(

 …

 )

 PARTITIONED BY(dt string)

   ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'

   STORED AS

 INPUTFORMAT parquet.hive.DeprecatedParquetInputFormat

 OUTPUTFORMAT parquet.hive.DeprecatedParquetOutputFormat

 LOCATION '/data-events/success’;



 Query runs fine.


 I add hdfs partitions (containing snappy.parquet files).



 When I run

 hive

  select count(*) from events where dt=“20140815”

 I get the correct result



 *Problem:*

 When I run

 hive

  select * from events where dt=“20140815” limit 1;

 I get

 OK

 NULL NULL NULL NULL NULL NULL NULL 20140815



 *The same query in Impala returns the correct values.*



 Any idea what could be the issue?



 Thanks

 Tor



Re: Hive queries returning all NULL values.

2014-08-17 Thread Tor Ivry
Is there any way to debug this?

We are talking about many fields here.
How can I see which field has the mismatch?



On Sun, Aug 17, 2014 at 4:30 PM, hadoop hive hadooph...@gmail.com wrote:

 Hi,

 Check the data type you have provided while creating the external table;
 it should match the data in the files.

 Thanks
 Vikas Srivastava
 On Aug 17, 2014 7:07 PM, Tor Ivry tork...@gmail.com wrote:

  Hi



 I have a hive (0.11) table with the following create syntax:



 CREATE EXTERNAL TABLE events(

 …

 )

 PARTITIONED BY(dt string)

   ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'

   STORED AS

 INPUTFORMAT parquet.hive.DeprecatedParquetInputFormat

 OUTPUTFORMAT parquet.hive.DeprecatedParquetOutputFormat

 LOCATION '/data-events/success’;



 Query runs fine.


 I add hdfs partitions (containing snappy.parquet files).



 When I run

 hive

  select count(*) from events where dt=“20140815”

 I get the correct result



 *Problem:*

 When I run

 hive

  select * from events where dt=“20140815” limit 1;

 I get

 OK

 NULL NULL NULL NULL NULL NULL NULL 20140815



 *The same query in Impala returns the correct values.*



 Any idea what could be the issue?



 Thanks

 Tor




Re: Hive queries returning all NULL values.

2014-08-17 Thread hadoop hive
Take a small set of data, like 2-5 lines, and insert it...

After that you can try inserting the first 10 columns and then the next 10 till you
find your problematic column
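A sketch of that narrowing-down approach (column names below are hypothetical): select the
columns in small batches and see which batch starts coming back NULL, e.g.

-- first batch of columns
select col1, col2, col3, col4, col5 from events where dt='20140815' limit 5;

-- next batch, and so on, until the NULL columns are isolated
select col6, col7, col8, col9, col10 from events where dt='20140815' limit 5;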
On Aug 17, 2014 8:37 PM, Tor Ivry tork...@gmail.com wrote:

 Is there any way to debug this?

 We are talking about many fields here.
 How can I see which field has the mismatch?



 On Sun, Aug 17, 2014 at 4:30 PM, hadoop hive hadooph...@gmail.com wrote:

 Hi,

 Check the data type you have provided while creating the external table;
 it should match the data in the files.

 Thanks
 Vikas Srivastava
 On Aug 17, 2014 7:07 PM, Tor Ivry tork...@gmail.com wrote:

  Hi



 I have a hive (0.11) table with the following create syntax:



 CREATE EXTERNAL TABLE events(

 …

 )

 PARTITIONED BY(dt string)

   ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'

   STORED AS

 INPUTFORMAT parquet.hive.DeprecatedParquetInputFormat

 OUTPUTFORMAT parquet.hive.DeprecatedParquetOutputFormat

 LOCATION '/data-events/success’;



 Query runs fine.


 I add hdfs partitions (containing snappy.parquet files).



 When I run

 hive

  select count(*) from events where dt=“20140815”

 I get the correct result



 *Problem:*

 When I run

 hive

  select * from events where dt=“20140815” limit 1;

 I get

 OK

 NULL NULL NULL NULL NULL NULL NULL 20140815



 *The same query in Impala returns the correct values.*



 Any idea what could be the issue?



 Thanks

 Tor





Re: Hive queries returning all NULL values.

2014-08-17 Thread Raymond Lau
Do your field names in your parquet files contain upper case letters by any
chance, e.g. userName?  Hive will not read the data of external tables if
the field names are not completely lower case; it doesn't convert them
properly in the case of external tables.
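As a sketch of the name-matching side of this (table and field names below are hypothetical):
the Hive column list has to line up with the Parquet field names, and keeping everything
lower case on both sides avoids the mismatch, e.g.

-- hypothetical: if the parquet files were written with a field called userName,
-- a Hive column declared as user_name (or userName, which Hive lower-cases to
-- username) will not match and comes back NULL. Writing the files with all
-- lower-case field names and declaring the same names in the DDL avoids this.
CREATE EXTERNAL TABLE events_lc (
  username STRING,
  event_ts BIGINT
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION '/data-events/success';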
On Aug 17, 2014 8:00 AM, hadoop hive hadooph...@gmail.com wrote:

 Take a small set of data, like 2-5 lines, and insert it...

 After that you can try inserting the first 10 columns and then the next 10 till you
 find your problematic column
 On Aug 17, 2014 8:37 PM, Tor Ivry tork...@gmail.com wrote:

 Is there any way to debug this?

 We are talking about many fields here.
 How can I see which field has the mismatch?



 On Sun, Aug 17, 2014 at 4:30 PM, hadoop hive hadooph...@gmail.com
 wrote:

 Hi,

 Check the data type you have provided while creating the external table;
 it should match the data in the files.

 Thanks
 Vikas Srivastava
 On Aug 17, 2014 7:07 PM, Tor Ivry tork...@gmail.com wrote:

  Hi



 I have a hive (0.11) table with the following create syntax:



 CREATE EXTERNAL TABLE events(

 …

 )

 PARTITIONED BY(dt string)

   ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'

   STORED AS

 INPUTFORMAT parquet.hive.DeprecatedParquetInputFormat

 OUTPUTFORMAT parquet.hive.DeprecatedParquetOutputFormat

 LOCATION '/data-events/success’;



 Query runs fine.


 I add hdfs partitions (containing snappy.parquet files).



 When I run

 hive

  select count(*) from events where dt=“20140815”

 I get the correct result



 *Problem:*

 When I run

 hive

  select * from events where dt=“20140815” limit 1;

 I get

 OK

 NULL NULL NULL NULL NULL NULL NULL 20140815



 *The same query in Impala returns the correct values.*



 Any idea what could be the issue?



 Thanks

 Tor





Re: Query execution time for Hive queries in Hue Web UI

2014-06-25 Thread Stéphane Verlet
In the result page, click on the M/R job on the left, then click on metadata.

Stephane


On Mon, Jun 23, 2014 at 3:42 AM, Ravi Prasad raviprasa...@gmail.com wrote:

 Hi all,

  I have created a Hive table (millions of records).
 I am using the Hue Web UI to run the Hive queries.

 I am running the same queries in both the Hive UI (Beeswax) and Cloudera
 Impala (Web UI) in Hue to compare the performance.

 In Hue, I am not able to find the query execution time.
 Can someone help with how to find the execution time of the queries in
 Hue?



 --
 Regards,
 RAVI PRASAD. T



Query execution time for Hive queries in Hue Web UI

2014-06-23 Thread Ravi Prasad
Hi all,

 I have created a Hive table (millions of records).
I am using the Hue Web UI to run the Hive queries.

I am running the same queries in both the Hive UI (Beeswax) and Cloudera
Impala (Web UI) in Hue to compare the performance.

In Hue, I am not able to find the query execution time.
Can someone help with how to find the execution time of the queries in
Hue?



--
Regards,
RAVI PRASAD. T


Re: Executing Hive Queries in Parallel

2014-04-27 Thread Swagatika Tripathy
Hi,
You can also use Oozie's fork feature, which acts as a workflow scheduler
to run jobs in parallel. You just need to define all your hql's inside the
workflow.xml to make them run in parallel.
On Apr 22, 2014 3:14 AM, Subramanian, Sanjay (HQP) 
sanjay.subraman...@roberthalf.com wrote:

   Hey

  Instead of going into HIVE CLI
  I would propose 2 ways

  *NOHUP *
  nohup hive -f path/to/query/file/*hive1.hql* > ./hive1.hql_`date
 +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
  nohup hive -f path/to/query/file/*hive2.hql* > ./hive2.hql_`date
 +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
  nohup hive -f path/to/query/file/*hive3.hql* > ./hive3.hql_`date
 +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
  nohup hive -f path/to/query/file/*hive4.hql* > ./hive4.hql_`date
 +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
  nohup hive -f path/to/query/file/*hive5.hql* > ./hive5.hql_`date
 +%Y-%m-%d-%H-%M-%S`.log 2>&1 &

  Each statement above will launch MR jobs on your cluster and, depending
 on the cluster configs, the jobs will run in parallel.
  Scheduling jobs on the MR cluster is independent of Hive.

  *SCREEN sessions*

- Create a Screen session
   - screen  -S  hive_query1
   - U r inside the screen session hive_query1
  - hive -f path/to/query/file/*hive1.hql*
   - Ctrl A D
  - U detach from a screen session
- Repeat for each hive query u want to run
   - I.e. Say 5 screen sessions, each running a hive query
- To display active screen sessions
   - screen -x
- To attach to a screen session
   - screen  -x hive_query1


  Thanks

 Warm Regards


  Sanjay


From: saurabh mpp.databa...@gmail.com
 Reply-To: user@hive.apache.org user@hive.apache.org
 Date: Monday, April 21, 2014 at 1:53 PM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Executing Hive Queries in Parallel


  Hi,
  I need some inputs to execute hive queries in parallel. I tried doing
 this using CLI (by opening multiple ssh connection) and executed 4 HQL's;
 it was observed that the queries are getting executed sequentially. All the
 FOUR queries got submitted however while the first one was in execution
 mode the other were in pending state. I was performing this activity on the
 EMR running on Batch mode hence didn't able to dig into the logs.

  The hive CLI uses native hive connection which by default uses the FIFO
 scheduler.  This might be one of the reason for the queries getting
 executed in sequence.

  I also observed that when multiple queries are executed using multiple
 HUE sessions, it provides the parallel execution functionality. Can you
 please suggest how the functionality of HUE can be replicated using CLI?

  I am aware of beeswax client however i am not sure how this can be used
 during EMR- batch mode processing.

  Thanks in advance for going through this. Kindly let me know your
 thoughts on the same.




Re: Executing Hive Queries in Parallel

2014-04-27 Thread Manish Malhotra
What Sanjay and Swagatika replied is spot on.

Plus, fundamentally, whether you run the hive query from the
CLI or through some internal API like HiveDriver, the flow will be this:

 Compile the query
 Get the info from the Hive Metastore using Thrift or JDBC, and optimize it (if
required and possible)
 Generate the Java MR code
 Push the jobs (it might need to execute more than one in sequence) to the
JobTracker
The final step is what decides whether these MR jobs run in parallel, based on
the queue and the availability of MR slots on the cluster.

So, irrespective of whether you are running the query using nohup hive -f, from multiple
machines, Oozie or your custom code,
it boils down to whether your system/code submits the queries without waiting for each
one in sequence, and whether your cluster has enough resources to run the MR jobs in parallel.

Regards,
Manish



On Sun, Apr 27, 2014 at 1:58 PM, Swagatika Tripathy swagatikat...@gmail.com
 wrote:

 Hi,
 You can also use Oozie's fork feature, which acts as a workflow scheduler
 to run jobs in parallel. You just need to define all your hql's inside the
 workflow.xml to make them run in parallel.
 On Apr 22, 2014 3:14 AM, Subramanian, Sanjay (HQP) 
 sanjay.subraman...@roberthalf.com wrote:

   Hey

  Instead of going into HIVE CLI
  I would propose 2 ways

  *NOHUP *
  nohup hive -f path/to/query/file/*hive1.hql* > ./hive1.hql_`date
 +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
  nohup hive -f path/to/query/file/*hive2.hql* > ./hive2.hql_`date
 +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
  nohup hive -f path/to/query/file/*hive3.hql* > ./hive3.hql_`date
 +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
  nohup hive -f path/to/query/file/*hive4.hql* > ./hive4.hql_`date
 +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
  nohup hive -f path/to/query/file/*hive5.hql* > ./hive5.hql_`date
 +%Y-%m-%d-%H-%M-%S`.log 2>&1 &

  Each statement above will launch MR jobs on your cluster and, depending
 on the cluster configs, the jobs will run in parallel.
  Scheduling jobs on the MR cluster is independent of Hive.

  *SCREEN sessions*

- Create a Screen session
   - screen  -S  hive_query1
   - U r inside the screen session hive_query1
  - hive -f path/to/query/file/*hive1.hql*
   - Ctrl A D
  - U detach from a screen session
- Repeat for each hive query u want to run
   - I.e. Say 5 screen sessions, each running a hive query
- To display active screen sessions
   - screen -x
- To attach to a screen session
   - screen  -x hive_query1


  Thanks

 Warm Regards


  Sanjay


From: saurabh mpp.databa...@gmail.com
 Reply-To: user@hive.apache.org user@hive.apache.org
 Date: Monday, April 21, 2014 at 1:53 PM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Executing Hive Queries in Parallel


  Hi,
  I need some inputs to execute hive queries in parallel. I tried doing
 this using CLI (by opening multiple ssh connection) and executed 4 HQL's;
 it was observed that the queries are getting executed sequentially. All the
 FOUR queries got submitted however while the first one was in execution
 mode the other were in pending state. I was performing this activity on the
 EMR running on Batch mode hence didn't able to dig into the logs.

  The hive CLI uses native hive connection which by default uses the FIFO
 scheduler.  This might be one of the reason for the queries getting
 executed in sequence.

  I also observed that when multiple queries are executed using multiple
 HUE sessions, it provides the parallel execution functionality. Can you
 please suggest how the functionality of HUE can be replicated using CLI?

  I am aware of beeswax client however i am not sure how this can be used
 during EMR- batch mode processing.

  Thanks in advance for going through this. Kindly let me know your
 thoughts on the same.




Executing Hive Queries in Parallel

2014-04-21 Thread saurabh
Hi,
I need some inputs to execute hive queries in parallel. I tried doing this
using the CLI (by opening multiple ssh connections) and executed 4 HQL's; it was
observed that the queries were getting executed sequentially. All FOUR
queries got submitted, however while the first one was in execution mode the
others were in a pending state. I was performing this activity on EMR
running in batch mode, hence I wasn't able to dig into the logs.

The hive CLI uses a native hive connection which by default uses the FIFO
scheduler.  This might be one of the reasons for the queries getting
executed in sequence.

I also observed that when multiple queries are executed using multiple HUE
sessions, it provides parallel execution. Can you please
suggest how this functionality of HUE can be replicated using the CLI?

I am aware of the beeswax client, however I am not sure how it can be used
during EMR batch mode processing.

Thanks in advance for going through this. Kindly let me know your thoughts
on the same.


Re: Executing Hive Queries in Parallel

2014-04-21 Thread Subramanian, Sanjay (HQP)
Hey

Instead of going into HIVE CLI
I would propose 2 ways

NOHUP
nohup hive -f path/to/query/file/hive1.hql > ./hive1.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
nohup hive -f path/to/query/file/hive2.hql > ./hive2.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
nohup hive -f path/to/query/file/hive3.hql > ./hive3.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
nohup hive -f path/to/query/file/hive4.hql > ./hive4.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
nohup hive -f path/to/query/file/hive5.hql > ./hive5.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &

Each statement above will launch MR jobs on your cluster and, depending on the
cluster configs, the jobs will run in parallel.
Scheduling jobs on the MR cluster is independent of Hive.

SCREEN sessions

  *   Create a Screen session
 *   screen  -S  hive_query1
 *   U r inside the screen session hive_query1
*   hive -f path/to/query/file/hive1.hql
 *   Ctrl A D
*   U detach from a screen session
  *   Repeat for each hive query u want to run
 *   I.e. Say 5 screen sessions, each running a hive query
  *   To display active screen sessions
 *   screen -x
  *   To attach to a screen session
 *   screen  -x hive_query1

Thanks
Warm Regards

Sanjay

From: saurabh mpp.databa...@gmail.commailto:mpp.databa...@gmail.com
Reply-To: user@hive.apache.orgmailto:user@hive.apache.org 
user@hive.apache.orgmailto:user@hive.apache.org
Date: Monday, April 21, 2014 at 1:53 PM
To: user@hive.apache.orgmailto:user@hive.apache.org 
user@hive.apache.orgmailto:user@hive.apache.org
Subject: Executing Hive Queries in Parallel


Hi,
I need some inputs to execute hive queries in parallel. I tried doing this 
using CLI (by opening multiple ssh connection) and executed 4 HQL's; it was 
observed that the queries are getting executed sequentially. All the FOUR 
queries got submitted however while the first one was in execution mode the 
other were in pending state. I was performing this activity on the EMR running 
on Batch mode hence didn't able to dig into the logs.

The hive CLI uses native hive connection which by default uses the FIFO 
scheduler.  This might be one of the reason for the queries getting executed in 
sequence.

I also observed that when multiple queries are executed using multiple HUE 
sessions, it provides the parallel execution functionality. Can you please 
suggest how the functionality of HUE can be replicated using CLI?

I am aware of beeswax client however i am not sure how this can be used during 
EMR- batch mode processing.

Thanks in advance for going through this. Kindly let me know your thoughts on 
the same.



Tuning Hive queries that uses underlying HBase Table

2014-02-20 Thread Manjula mohapatra
I am querying Hive table ( mapped to HBase Table ) .

What are the techniques to tune the Hive query and to avoid HBase scans.

Query uses multiple SPLIT and SUBSTR functions and WHERE  condition
something like

select  col1, col2, ...,count(*)
from hiveTable

where split(col1)[0] > timestamp1 and split(col1)[0] < timestamp2
group by 
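One direction that may help (a sketch only, and it assumes col1 is actually mapped to the
HBase row key via 'hbase.columns.mapping' = ':key,...'): keep the filter directly on the key
column, without wrapping it in SPLIT or SUBSTR, so the storage handler has a chance to turn
it into a bounded HBase scan instead of a full scan, e.g.

-- assumes the row key itself is (or starts with) the timestamp being filtered on;
-- the literal bounds are made-up placeholders
select col1, col2, count(*)
from hiveTable
where col1 >= '1392854400' and col1 < '1392940800'
group by col1, col2;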


Hive queries for disk usage analysis

2014-02-04 Thread Mungre,Surbhi
Hello All,

We are doing some analysis for which we need to determine things like size of 
the largest row or size of the largest column. By size, I am referring to disk 
space usage. Does HIVE provide any functions to run such queries?
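One rough way to approximate this in plain HiveQL (column names below are made up) is with
string lengths, though this measures characters rather than on-disk bytes after
serialization and compression:

select
  max(length(concat_ws(',', cast(id as string), cast(payload as string)))) as largest_row_chars,
  max(length(cast(payload as string))) as largest_payload_chars
from my_table;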

Thanks,
Surbhi Mungre
Software Engineer
www.cerner.comhttp://www.cerner.com/

CONFIDENTIALITY NOTICE This message and any included attachments are from 
Cerner Corporation and are intended only for the addressee. The information 
contained in this message is confidential and may constitute inside or 
non-public information under international, federal, or state securities laws. 
Unauthorized forwarding, printing, copying, distribution, or use of such 
information is strictly prohibited and may be unlawful. If you are not the 
addressee, please promptly delete this message and notify the sender of the 
delivery error by e-mail or you may call Cerner's corporate offices in Kansas 
City, Missouri, U.S.A at (+1) (816)221-1024.


Re: Formatting hive queries

2014-01-22 Thread John Meagher
I use vim and https://github.com/vim-scripts/SQLUtilities to do it.
It's not hive specific.  Any SQL formatting tool will work.

On Tue, Jan 21, 2014 at 11:23 PM, pandees waran pande...@gmail.com wrote:
 Hi,

 I would like to come up with code which automatically formats your hql
 files.
 Formatting is one of the more tedious tasks and I would like to come up
 with a utility for that.
 Please let me know whether any specific utilities already exist for
 formatting hive queries.

 --
 Thanks,
 Pandeeswaran


Formatting hive queries

2014-01-21 Thread pandees waran
Hi,

I would like to come up with code which automatically formats your hql
files.
Formatting is one of the more tedious tasks and I would like to come up
with a utility for that.
Please let me know whether any specific utilities already exist for
formatting hive queries.

-- 
Thanks,
Pandeeswaran


Re: User accounts to execute hive queries

2013-09-19 Thread Rudra Tripathy
Thanks Nitin for the help, I would try.

Thanks and Regards,
Rudra

On Wed, Sep 18, 2013 at 5:14 PM, Thejas Nair the...@hortonworks.com wrote:

 You might find my slides on this topic useful -
 http://www.slideshare.net/thejasmn/hive-authorization-models

 Also linked from last slide  -

 https://cwiki.apache.org/confluence/display/HCATALOG/Storage+Based+Authorization

 On Tue, Sep 17, 2013 at 11:46 PM, Nitin Pawar nitinpawar...@gmail.com
 wrote:
  The link I gave in the previous mail explains how you can set up user level
  authorizations in hive.
 
 
 
  On Mon, Sep 16, 2013 at 7:57 PM, shouvanik.hal...@accenture.com wrote:
 
  Hi Nitin,
 
 
 
  I want it secured.
 
 
 
  Yes, I would like to give specific access to specific users. E.g.
 “select
  * from” access to some and “add/modify/delete” options to some
 
 
 
 
 
  “What kind of security do you have on hdfs? “
 
  I could not follow this question
 
 
 
  Thanks,
 
  Shouvanik
 
  From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
  Sent: Monday, September 16, 2013 6:50 PM
  To: Haldar, Shouvanik
  Cc: user@hive.apache.org
  Subject: Re: User accounts to execute hive queries
 
 
 
  You will need to tell few more things.
 
  Do you want it secured?
 
  Do you distinguish users in different categories on what one particular
  user can do or not?
 
  What kind of security do you have on hdfs?
 
 
 
 
 
  It is definitely possible for users to run queries on their own username
  but then you have to take few measures as well.
 
  which user can do what action. Which user can access what location on
 hdfs
  etc
 
 
 
  For user management on hive side you can read at
  https://cwiki.apache.org/Hive/languagemanual-authorization.html
 
 
 
  if you do not want to go through the secure way,
 
  then add all the users to one group and then grant permissions to that
  group on your warehouse directory.
 
 
 
  other way if the table data is not shared then,
 
  create individual directory for each user on hdfs and give only that
 user
  access to that directory.
 
 
  
  This message is for the designated recipient only and may contain
  privileged, proprietary, or otherwise confidential information. If you
 have
  received it in error, please notify the sender immediately and delete
 the
  original. Any other use of the e-mail by you is prohibited.
 
  Where allowed by local law, electronic communications with Accenture and
  its affiliates, including e-mail and instant messaging (including
 content),
  may be scanned by our systems for the purposes of information security
 and
  assessment of internal compliance with Accenture policy.
 
 
 
 __
 
  www.accenture.com
 
 
 
 
  --
  Nitin Pawar

 --
 CONFIDENTIALITY NOTICE
 NOTICE: This message is intended for the use of the individual or entity to
 which it is addressed and may contain information that is confidential,
 privileged and exempt from disclosure under applicable law. If the reader
 of this message is not the intended recipient, you are hereby notified that
 any printing, copying, dissemination, distribution, disclosure or
 forwarding of this communication is strictly prohibited. If you have
 received this communication in error, please contact the sender immediately
 and delete it from your system. Thank You.



Re: User accounts to execute hive queries

2013-09-18 Thread Nitin Pawar
The link I gave in the previous mail explains how you can set up user level
authorizations in hive.



On Mon, Sep 16, 2013 at 7:57 PM, shouvanik.hal...@accenture.com wrote:

  Hi Nitin,

 ** **

 I want it secured.

 ** **

 Yes, I would like to give specific access to specific users. E.g. “select
 * from” access to some and “add/modify/delete” options to some

 ** **

 ** **

 “What kind of security do you have on hdfs? “

 I could not follow this question

 ** **

 Thanks,

 Shouvanik

 *From:* Nitin Pawar [mailto:nitinpawar...@gmail.com]
 *Sent:* Monday, September 16, 2013 6:50 PM
 *To:* Haldar, Shouvanik
 *Cc:* user@hive.apache.org
 *Subject:* Re: User accounts to execute hive queries

 ** **

 You will need to tell few more things. 

 Do you want it secured? 

 Do you distinguish users in different categories on what one particular
 user can do or not? 

 What kind of security do you have on hdfs? 

 ** **

 ** **

 It is definitely possible for users to run queries on their own username
 but then you have to take few measures as well. 

 which user can do what action. Which user can access what location on hdfs
 etc 

 ** **

 For user management on hive side you can read at
 https://cwiki.apache.org/Hive/languagemanual-authorization.html

 ** **

 if you do not want to go through the secure way, 

 then add all the users to one group and then grant permissions to that
 group on your warehouse directory. 

 ** **

 other way if the table data is not shared then, 

 create individual directory for each user on hdfs and give only that user
 access to that directory. 

 --
 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited.

 Where allowed by local law, electronic communications with Accenture and
 its affiliates, including e-mail and instant messaging (including content),
 may be scanned by our systems for the purposes of information security and
 assessment of internal compliance with Accenture policy.


 __

 www.accenture.com




-- 
Nitin Pawar


Re: User accounts to execute hive queries

2013-09-18 Thread Thejas Nair
You might find my slides on this topic useful -
http://www.slideshare.net/thejasmn/hive-authorization-models

Also linked from last slide  -
https://cwiki.apache.org/confluence/display/HCATALOG/Storage+Based+Authorization

On Tue, Sep 17, 2013 at 11:46 PM, Nitin Pawar nitinpawar...@gmail.com wrote:
 The link I gave in the previous mail explains how you can set up user level
 authorizations in hive.



 On Mon, Sep 16, 2013 at 7:57 PM, shouvanik.hal...@accenture.com wrote:

 Hi Nitin,



 I want it secured.



 Yes, I would like to give specific access to specific users. E.g. “select
 * from” access to some and “add/modify/delete” options to some





 “What kind of security do you have on hdfs? “

 I could not follow this question



 Thanks,

 Shouvanik

 From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
 Sent: Monday, September 16, 2013 6:50 PM
 To: Haldar, Shouvanik
 Cc: user@hive.apache.org
 Subject: Re: User accounts to execute hive queries



 You will need to tell few more things.

 Do you want it secured?

 Do you distinguish users in different categories on what one particular
 user can do or not?

 What kind of security do you have on hdfs?





 It is definitely possible for users to run queries on their own username
 but then you have to take few measures as well.

 which user can do what action. Which user can access what location on hdfs
 etc



 For user management on hive side you can read at
 https://cwiki.apache.org/Hive/languagemanual-authorization.html



 if you do not want to go through the secure way,

 then add all the users to one group and then grant permissions to that
 group on your warehouse directory.



 other way if the table data is not shared then,

 create individual directory for each user on hdfs and give only that user
 access to that directory.


 
 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise confidential information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the e-mail by you is prohibited.

 Where allowed by local law, electronic communications with Accenture and
 its affiliates, including e-mail and instant messaging (including content),
 may be scanned by our systems for the purposes of information security and
 assessment of internal compliance with Accenture policy.


 __

 www.accenture.com




 --
 Nitin Pawar

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.


User accounts to execute hive queries

2013-09-16 Thread shouvanik.haldar
Hi,
Can you please tell me if its possible to execute hive queries as different 
users?

Can we create read-only access for hive?
Please help.
Thanks
Shouvanik

Sent from my Windows Phone

-Original Message-
From: Nitin Pawar nitinpawar...@gmail.com
Sent: ‎16-‎09-‎2013 15:57
To: user@hive.apache.org user@hive.apache.org
Subject: Re: Issue while quering Hive

Does your .gz file contain the data in sequencefile format? Or is it a plain csv?


I think, looking at the filename, it's a plain csv file, so I would recommend that 
you create a normal table with TextInputFormat (the default), load the data into 
the new table and give it a try.
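A minimal sketch of that (column names are hypothetical; the path is the one from the error
message, and note that LOAD DATA INPATH moves the file into the new table's directory):

CREATE TABLE cpj_tbl_text (
  col1 STRING,
  col2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/hive/warehouse/cpj_tbl/cpj.csv.gz' INTO TABLE cpj_tbl_text;

Hive reads gzipped text files transparently in a TEXTFILE table, so no SequenceFile handling
is involved.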





On Mon, Sep 16, 2013 at 3:36 PM, Garg, Rinku rinku.g...@fisglobal.com wrote:

Hi Nitin,

Yes, I created the table with sequencefile.

Thanks  Regards,
Rinku Garg





From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 16 September 2013 14:19
To: user@hive.apache.org
Subject: Re: Issue while quering Hive

Look at the error message

Caused by: java.io.IOException: 
hdfs://localhost:54310/user/hive/warehouse/cpj_tbl/cpj.csv.gz not a SequenceFile

Did you create table with sequencefile ?

On Mon, Sep 16, 2013 at 1:33 PM, Garg, Rinku rinku.g...@fisglobal.com wrote:
Hi All,

I have set up Hadoop and Hive and am trying to load a gzip file into the hadoop cluster. 
Files are loaded successfully and can be viewed on the web UI. While executing a Select 
query it gives me the below mentioned error.

ERROR org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:nxtbig (auth:SIMPLE) cause:java.io.IOException: 
java.lang.reflect.InvocationTargetException
2013-09-16 09:11:18,971 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: java.lang.reflect.InvocationTargetException
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:369)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.init(HadoopShimsSecure.java:316)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:430)
at 
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:540)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:395)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1407)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:355)
... 10 more
Caused by: java.io.IOException: 
hdfs://localhost:54310/user/hive/warehouse/cpj_tbl/cpj.csv.gz not a SequenceFile
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1805)
at 
org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1714)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1728)
at 
org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43)
at 
org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
at 
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.init(CombineHiveRecordReader.java:65)
... 15 more

Can anybody help me on this.

Thanks  Regards,
Rinku Garg


_
The information contained in this message is proprietary and/or confidential. 
If you are not the intended recipient, please: (i) delete the message and all 
copies; (ii) do not disclose, distribute or use the message in any manner; and 
(iii) notify the sender immediately. In addition, please be aware that any 
message addressed to our domain is subject to archiving and review by persons 
other than the intended recipient. Thank you.




--
Nitin Pawar
_
The information contained in this message is proprietary

RE: User accounts to execute hive queries

2013-09-16 Thread shouvanik.haldar
Hi Nitin,

Users want to execute hive queries under their own user names. Is that possible, or do 
they have to do it logged in as the hive user?

Thanks,
Shouvanik

-Original Message-
From: Haldar, Shouvanik
Sent: Monday, September 16, 2013 4:06 PM
To: user@hive.apache.org
Subject: User accounts to execute hive queries

Hi,
Can you please tell me if its possible to execute hive queries as different 
users?

Can we create read-only access for hive?
Please help.
Thanks
Shouvanik

Sent from my Windows Phone


This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise confidential information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of the e-mail by you is prohibited.

Where allowed by local law, electronic communications with Accenture and its 
affiliates, including e-mail and instant messaging (including content), may be 
scanned by our systems for the purposes of information security and assessment 
of internal compliance with Accenture policy.

__

www.accenture.com


Re: User accounts to execute hive queries

2013-09-16 Thread Nitin Pawar
You will need to tell us a few more things.
Do you want it secured?
Do you distinguish users into different categories based on what one particular
user can or cannot do?
What kind of security do you have on hdfs?


It is definitely possible for users to run queries under their own usernames,
but then you have to take a few measures as well:
which user can do what action, which user can access what location on hdfs,
etc.

For user management on the hive side you can read
https://cwiki.apache.org/Hive/languagemanual-authorization.html

If you do not want to go the secure way,
then add all the users to one group and grant permissions to that
group on your warehouse directory.

The other way, if the table data is not shared, is to
create an individual directory for each user on hdfs and give only that user
access to that directory.
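As a rough sketch of what the statements described in that link look like (table and user
names below are hypothetical; this legacy authorization model is advisory and is not a
substitute for HDFS-level permissions):

GRANT SELECT ON TABLE sales TO USER analyst1;
GRANT ALL ON TABLE sales TO USER etl_user;
REVOKE SELECT ON TABLE sales FROM USER analyst1;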


Re: Optimizing hive queries

2013-03-29 Thread Jagat Singh
Hello Owen,

Thanks for your reply.

I see it provides the advantage which Avro provided, of adding and
removing fields.

Can you please write some sample code for a hive table which is partitioned
and where each partition has a different schema?

I tried searching but could not find any example.

Thanks in advance for your help.

Regards,

Jagat Singh

On Fri, Mar 29, 2013 at 4:48 PM, Owen O'Malley omal...@apache.org wrote:

 Actually, Hive already has the ability to have different schemas for
 different partitions. (Although of course it would be nice to have the
 alter table be more flexible!)

 The versioned metadata means that the ORC file's metadata is stored in
 ProtoBufs so that we can add (or remove) fields to the metadata. That means
 that for some changes to ORC file format we can provide both forward and
 backward compatibility.

 -- Owen


 On Thu, Mar 28, 2013 at 10:25 PM, Jagat Singh jagatsi...@gmail.comwrote:

 Hello Nitin,

 Thanks for sharing.

 Do we have more details on

 Versioned metadata feature of ORC ? , is it like handling varying schemas
 in Hive?

 Regards,

 Jagat Singh



 On Fri, Mar 29, 2013 at 4:16 PM, Nitin Pawar nitinpawar...@gmail.comwrote:


 Hi,

 Here is is a nice presentation from Owen from Hortonworks on Optimizing
 hive queries

 http://www.slideshare.net/oom65/optimize-hivequeriespptx



 Thanks,
 Nitin Pawar






Re: Optimizing hive queries

2013-03-29 Thread Owen O'Malley
On Thu, Mar 28, 2013 at 11:08 PM, Jagat Singh jagatsi...@gmail.com wrote:

 Hello Owen,

 Thanks for your reply.

 I am seeing its providing the advantage which Avro provided , of adding
 and removing fields.


ORC files like Avro files are self-describing. They include the type
structure of the records in the metadata of the file. It will take more
integration work with hive to make the schemas very flexible with ORC.


 Can you please write some sample code for hive table which is partitioned
 and each partitioned has different schema.


As with all tables:

create table people (first_name string, last_name string) partitioned by
(state string);
load data local inpath 'part-0' overwrite into table people partition
(state='ca');
alter table people add columns (address string);
load data local inpath 'part-1' overwrite into table people partition
(state='nv');

You'll end up with the first partition with 2 columns (and thus implicitly
the third one is null) and the second partition with 3 columns.

-- Owen




 I tried searching but could not find any example.

 Thanks in advance for your help.

 Regards,

 Jagat Singh


 On Fri, Mar 29, 2013 at 4:48 PM, Owen O'Malley omal...@apache.org wrote:

 Actually, Hive already has the ability to have different schemas for
 different partitions. (Although of course it would be nice to have the
 alter table be more flexible!)

 The versioned metadata means that the ORC file's metadata is stored in
 ProtoBufs so that we can add (or remove) fields to the metadata. That means
 that for some changes to ORC file format we can provide both forward and
 backward compatibility.

 -- Owen


 On Thu, Mar 28, 2013 at 10:25 PM, Jagat Singh jagatsi...@gmail.comwrote:

 Hello Nitin,

 Thanks for sharing.

 Do we have more details on

 Versioned metadata feature of ORC ? , is it like handling varying
 schemas in Hive?

 Regards,

 Jagat Singh



 On Fri, Mar 29, 2013 at 4:16 PM, Nitin Pawar nitinpawar...@gmail.comwrote:


 Hi,

 Here is is a nice presentation from Owen from Hortonworks on
 Optimizing hive queries

 http://www.slideshare.net/oom65/optimize-hivequeriespptx



 Thanks,
 Nitin Pawar







Optimizing hive queries

2013-03-28 Thread Nitin Pawar
Hi,

Here is a nice presentation from Owen from Hortonworks on Optimizing
hive queries

http://www.slideshare.net/oom65/optimize-hivequeriespptx



Thanks,
Nitin Pawar


Re: Optimizing hive queries

2013-03-28 Thread Jagat Singh
Hello Nitin,

Thanks for sharing.

Do we have more details on

Versioned metadata feature of ORC ? , is it like handling varying schemas
in Hive?

Regards,

Jagat Singh


On Fri, Mar 29, 2013 at 4:16 PM, Nitin Pawar nitinpawar...@gmail.comwrote:


 Hi,

 Here is is a nice presentation from Owen from Hortonworks on Optimizing
 hive queries

 http://www.slideshare.net/oom65/optimize-hivequeriespptx



 Thanks,
 Nitin Pawar



Re: Optimizing hive queries

2013-03-28 Thread Nitin Pawar
I could just find this link
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html

according to this, the metadata is handled by protobuf which allows of
adding/removing fields.


On Fri, Mar 29, 2013 at 10:55 AM, Jagat Singh jagatsi...@gmail.com wrote:

 Hello Nitin,

 Thanks for sharing.

 Do we have more details on

 Versioned metadata feature of ORC ? , is it like handling varying schemas
 in Hive?

 Regards,

 Jagat Singh



 On Fri, Mar 29, 2013 at 4:16 PM, Nitin Pawar nitinpawar...@gmail.comwrote:


 Hi,

 Here is is a nice presentation from Owen from Hortonworks on Optimizing
 hive queries

 http://www.slideshare.net/oom65/optimize-hivequeriespptx



 Thanks,
 Nitin Pawar





-- 
Nitin Pawar


Re: Optimizing hive queries

2013-03-28 Thread Owen O'Malley
Actually, Hive already has the ability to have different schemas for
different partitions. (Although of course it would be nice to have the
alter table be more flexible!)

The versioned metadata means that the ORC file's metadata is stored in
ProtoBufs so that we can add (or remove) fields to the metadata. That means
that for some changes to ORC file format we can provide both forward and
backward compatibility.

-- Owen


On Thu, Mar 28, 2013 at 10:25 PM, Jagat Singh jagatsi...@gmail.com wrote:

 Hello Nitin,

 Thanks for sharing.

 Do we have more details on

 Versioned metadata feature of ORC ? , is it like handling varying schemas
 in Hive?

 Regards,

 Jagat Singh



 On Fri, Mar 29, 2013 at 4:16 PM, Nitin Pawar nitinpawar...@gmail.comwrote:


 Hi,

 Here is is a nice presentation from Owen from Hortonworks on Optimizing
 hive queries

 http://www.slideshare.net/oom65/optimize-hivequeriespptx



 Thanks,
 Nitin Pawar





Re: Where is the location of hive queries

2013-03-06 Thread Sai Sai
After we run a query in hive shell as:
Select * from myTable;

Are these results getting saved to any file apart from the console/terminal 
display.
If so where is the location of the results.
Thanks
Sai


Re: Where is the location of hive queries

2013-03-06 Thread Nitin Pawar
The results are not stored to any file; they are available on the console only.

If you want to save the results then execute your query like: hive
-e 'query' > file


On Wed, Mar 6, 2013 at 9:32 PM, Sai Sai saigr...@yahoo.in wrote:

 After we run a query in hive shell as:
 Select * from myTable;

 Are these results getting saved to any file apart from the
 console/terminal display.
 If so where is the location of the results.
 Thanks
 Sai




-- 
Nitin Pawar


Re: Where is the location of hive queries

2013-03-06 Thread Dean Wampler
Or use a variant of the INSERT statement to write to a directory or a table.
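A minimal sketch of both variants (directory path and table names are illustrative):

-- write the query result into an HDFS directory
INSERT OVERWRITE DIRECTORY '/tmp/mytable_export'
SELECT * FROM myTable;

-- or materialize it into another table
CREATE TABLE mytable_copy AS
SELECT * FROM myTable;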

On Wed, Mar 6, 2013 at 10:05 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 The results are not stored to any file; they are available on the console
 only.

 If you want to save the results then execute your query like: hive
 -e 'query' > file


 On Wed, Mar 6, 2013 at 9:32 PM, Sai Sai saigr...@yahoo.in wrote:

 After we run a query in hive shell as:
 Select * from myTable;

 Are these results getting saved to any file apart from the
 console/terminal display.
 If so where is the location of the results.
 Thanks
 Sai




 --
 Nitin Pawar




-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Hive queries

2013-02-25 Thread Cyril Bogus
Hi everyone,

My setup is Hadoop 1.0.4, Hive 0.9.0, Sqoop 1.4.2-hadoop 1.0.0
Mahout 0.7

I have imported tables from a remote database directly into Hive using
Sqoop.

Somehow when I try to run Sqoop from Hadoop, the content

Hive is giving me trouble in bookkeeping of where the imported tables are
located. I have a Single Node setup.

Thank you for any answer and you can ask question if I was not specific
enough about my issue.

Cyril


Re: Hive queries

2013-02-25 Thread Nitin Pawar
any errors you see ?


On Mon, Feb 25, 2013 at 8:48 PM, Cyril Bogus cyrilbo...@gmail.com wrote:

 Hi everyone,

 My setup is Hadoop 1.0.4, Hive 0.9.0, Sqoop 1.4.2-hadoop 1.0.0
 Mahout 0.7

 I have imported tables from a remote database directly into Hive using
 Sqoop.

 Somehow when I try to run Sqoop from Hadoop, the content

 Hive is giving me trouble in bookkeeping of where the imported tables are
 located. I have a Single Node setup.

 Thank you for any answer and you can ask question if I was not specific
 enough about my issue.

 Cyril




-- 
Nitin Pawar


Re: Hive queries

2013-02-25 Thread bejoy_ks
Hi Cyril

I believe you are using the derby metastore, and if so this should be an issue 
with the hive configs.

Derby creates its metastore in the current dir from where you are 
starting hive. The tables imported by sqoop would only be visible in the metastore_db 
created under HIVE_HOME, and hence you are not able to see the tables when you start 
the hive CLI from other locations.

To have a universal metastore db, configure a specific dir in 
javax.jdo.option.ConnectionURL in hive-site.xml. In your conn url, configure 
the db name as databaseName=/home/hive/metastore_db

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Cyril Bogus cyrilbo...@gmail.com
Date: Mon, 25 Feb 2013 10:34:29 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Hive queries

I do not get any errors.
It is only when I run hive and try to query the tables I imported. Let's
say I want to only get numeric tuples for a given table. I cannot find the
table (show tables; is empty) unless I go in the hive home folder and run
hive again. I would expect the state of hive to be the same everywhere I
call it.
But so far it is not the case.


On Mon, Feb 25, 2013 at 10:22 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 any errors you see ?


 On Mon, Feb 25, 2013 at 8:48 PM, Cyril Bogus cyrilbo...@gmail.com wrote:

 Hi everyone,

 My setup is Hadoop 1.0.4, Hive 0.9.0, Sqoop 1.4.2-hadoop 1.0.0
 Mahout 0.7

 I have imported tables from a remote database directly into Hive using
 Sqoop.

 Somehow when I try to run Sqoop from Hadoop, the content

 Hive is giving me trouble in bookkeeping of where the imported tables are
 located. I have a Single Node setup.

 Thank you for any answer and you can ask question if I was not specific
 enough about my issue.

 Cyril




 --
 Nitin Pawar




Re: Hive queries

2013-02-25 Thread Cyril Bogus
Thank you so much Bejoy,
That was my issue.
Now that I saw the config file I see that I was the one needing a universal
database.

Thanks again,
Regards
Cyril


On Mon, Feb 25, 2013 at 10:47 AM, bejoy...@yahoo.com wrote:

 **
 Hi Cyril

 I believe you are using the derby meta store and then it should be an
 issue with the hive configs.

 Derby is trying to create a metastore at your current dir from where you
 are starting hive. The tables exported by sqoop would be inside HIVE_HOME
 and hence you are not able to see the tables from getting on to hive CLI
 from other locations.

 To have a universal metastore db configure a specific dir in
 javax.jdo.option.ConnectionURL in hive-site.xml . In your conn url
 configure the db name as databaseName=/home/hive/metastore_db
 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 *From: * Cyril Bogus cyrilbo...@gmail.com
 *Date: *Mon, 25 Feb 2013 10:34:29 -0500
 *To: *user@hive.apache.org
 *ReplyTo: * user@hive.apache.org
 *Subject: *Re: Hive queries

 I do not get any errors.
 It is only when I run hive and try to query the tables I imported. Let's
 say I want to only get numeric tuples for a given table. I cannot find the
 table (show tables; is empty) unless I go in the hive home folder and run
 hive again. I would expect the state of hive to be the same everywhere I
 call it.
 But so far it is not the case.


 On Mon, Feb 25, 2013 at 10:22 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 any errors you see ?


 On Mon, Feb 25, 2013 at 8:48 PM, Cyril Bogus cyrilbo...@gmail.comwrote:

 Hi everyone,

 My setup is Hadoop 1.0.4, Hive 0.9.0, Sqoop 1.4.2-hadoop 1.0.0
 Mahout 0.7

 I have imported tables from a remote database directly into Hive using
 Sqoop.

 Somehow when I try to run Sqoop from Hadoop, the content

 Hive is giving me trouble in bookkeeping of where the imported tables
 are located. I have a Single Node setup.

 Thank you for any answer and you can ask question if I was not specific
 enough about my issue.

 Cyril




 --
 Nitin Pawar





Re: Hive Queries

2013-02-17 Thread Edward Capriolo
Dude sorry for the off topic, but having a rocketmail account is
awesome. I wish I still had mine.

On Sat, Feb 16, 2013 at 9:16 PM, manishbh...@rocketmail.com
manishbh...@rocketmail.com wrote:

 When you want to move data from an external system to hive, this means moving
 data to HDFS first and then pointing the Hive table to the file in HDFS where
 you have exported the data.
 So, you have a couple of commands like -copyFromLocal and -put which move the
 file to hdfs. If you intend to move data in real-time fashion, try Flume. But at the end
 of the day the data movement happens in HDFS first, and then the hive table can be
 loaded using the LOAD DATA command.

 Regards,
 Manish Bhoge
 sent by HTC device. Excuse typo.

 - Reply message -
 From: Cyrille Djoko c...@agnik.com
 To: user@hive.apache.org
 Subject: Hive Queries
 Date: Sat, Feb 16, 2013 1:50 AM


 Hi Jarcec,
 I did try Sqoop. I am running sqoop 1.4.2 --hadoop1.0.0 along with hadoop
 1.0.4 But I keep running on the following exception.

 Exception in thread main java.lang.IncompatibleClassChangeError: Found
 class org.apache.hadoop.mapreduce.JobContext, but interface was expected

 So I wrote a small program but all I can do is send queries to the server.
 Hi Cyrille,
 I'm not exactly sure what exactly you mean, so I'm more or less blindly
 shooting, but maybe Apache Sqoop [1] might help you?

 Jarcec

 Links:
 1: http://sqoop.apache.org/

 On Fri, Feb 15, 2013 at 01:44:45PM -0500, Cyrille Djoko wrote:
 I am looking for a relatively efficient way of transferring data between
 a
 remote server and Hive without going through the hassle of storing the
 data first on memory before loading it to Hive.
 From what I have read so far there is no such command but it would not
 hurt to ask.
 Is it possible to insert data through an insert query in hive? (The
 equivalent to insert into table_name
 values (...) in xSQLx)

 Thank you in advance for an answer.


 Cyrille Djoko
 Data Mining Developer Intern




 Cyrille Djoko

 Agnik LLC
 Data Mining Developer Intern



Re: Hive Queries

2013-02-16 Thread manishbh...@rocketmail.com

When you want to move data from an external system to hive, this means moving data 
to HDFS first and then pointing the Hive table to the file in HDFS where you have 
exported the data.
So, you have a couple of commands like -copyFromLocal and -put which move the 
file to hdfs. If you intend to move data in real-time fashion, try Flume. But at the end of 
the day the data movement happens in HDFS first, and then the hive table can be 
loaded using the LOAD DATA command.
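A minimal sketch of that flow (table names, columns and paths below are hypothetical; Hive at
this point has no INSERT INTO ... VALUES, so data always arrives as files):

-- point an external table at data already copied into HDFS
CREATE EXTERNAL TABLE remote_import (
  id INT,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/remote_import/';

-- or move a staged HDFS file into an existing table
LOAD DATA INPATH '/data/staging/part-0' INTO TABLE remote_import_managed;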

Regards,
Manish Bhoge
sent by HTC device. Excuse typo.

- Reply message -
From: Cyrille Djoko c...@agnik.com
To: user@hive.apache.org
Subject: Hive Queries
Date: Sat, Feb 16, 2013 1:50 AM


Hi Jarcec,
I did try Sqoop. I am running sqoop 1.4.2 --hadoop1.0.0 along with hadoop
1.0.4 But I keep running on the following exception.

Exception in thread main java.lang.IncompatibleClassChangeError: Found
class org.apache.hadoop.mapreduce.JobContext, but interface was expected

So I wrote a small program but all I can do is send queries to the server.
 Hi Cyrille,
 I'm not exactly sure what exactly you mean, so I'm more or less blindly
 shooting, but maybe Apache Sqoop [1] might help you?

 Jarcec

 Links:
 1: http://sqoop.apache.org/

 On Fri, Feb 15, 2013 at 01:44:45PM -0500, Cyrille Djoko wrote:
 I am looking for a relatively efficient way of transferring data between
 a
 remote server and Hive without going through the hassle of storing the
 data first on memory before loading it to Hive.
 From what I have read so far there is no such command but it would not
 hurt to ask.
 Is it possible to insert data through an insert query in hive? (The
 equivalent to insert into table_name
 values (...) in xSQLx)

 Thank you in advance for an answer.


 Cyrille Djoko
 Data Mining Developer Intern




Cyrille Djoko

Agnik LLC
Data Mining Developer Intern



Hive Queries

2013-02-15 Thread Cyrille Djoko
I am looking for a relatively efficient way of transferring data between a
remote server and Hive without going through the hassle of storing the
data first on memory before loading it to Hive.
From what I have read so far there is no such command but it would not
hurt to ask.
Is it possible to insert data through an insert query in hive? (The
equivalent to insert into table_name
values (...) in xSQLx)

Thank you in advance for an answer.


Cyrille Djoko
Data Mining Developer Intern



Re: Hive Queries

2013-02-15 Thread Jarek Jarcec Cecho
Hi Cyrille,
I'm not exactly sure what exactly you mean, so I'm more or less blindly 
shooting, but maybe Apache Sqoop [1] might help you?

Jarcec

Links:
1: http://sqoop.apache.org/

On Fri, Feb 15, 2013 at 01:44:45PM -0500, Cyrille Djoko wrote:
 I am looking for a relatively efficient way of transferring data between a
 remote server and Hive without going through the hassle of storing the
 data first on memory before loading it to Hive.
 From what I have read so far there is no such command but it would not
 hurt to ask.
 Is it possible to insert data through an insert query in hive? (The
 equivalent to insert into table_name
 values (...) in xSQLx)
 
 Thank you in advance for an answer.
 
 
 Cyrille Djoko
 Data Mining Developer Intern
 


signature.asc
Description: Digital signature


Re: Hive Queries

2013-02-15 Thread Cyrille Djoko
Hi Jarcec,
I did try Sqoop. I am running sqoop 1.4.2 --hadoop1.0.0 along with hadoop
1.0.4 But I keep running on the following exception.

Exception in thread main java.lang.IncompatibleClassChangeError: Found
class org.apache.hadoop.mapreduce.JobContext, but interface was expected

So I wrote a small program but all I can do is send queries to the server.
 Hi Cyrille,
 I'm not exactly sure what exactly you mean, so I'm more or less blindly
 shooting, but maybe Apache Sqoop [1] might help you?

 Jarcec

 Links:
 1: http://sqoop.apache.org/

 On Fri, Feb 15, 2013 at 01:44:45PM -0500, Cyrille Djoko wrote:
 I am looking for a relatively efficient way of transferring data between
 a
 remote server and Hive without going through the hassle of storing the
 data first on memory before loading it to Hive.
 From what I have read so far there is no such command but it would not
 hurt to ask.
 Is it possible to insert data through an insert query in hive? (The
 equivalent to insert into table_name
 values (...) in xSQLx)

 Thank you in advance for an answer.


 Cyrille Djoko
 Data Mining Developer Intern




Cyrille Djoko

Agnik LLC
Data Mining Developer Intern



Re: Hive Queries

2013-02-15 Thread Jarek Jarcec Cecho
[-user@hive, +user@sqoop]

Hi Cyrille,
this seems to me more a Sqoop issue than Hive issue, so I've moved this email 
to user@sqoop mailing list. I'm keeping user@hive in Bcc so that the mailing 
list will get the memo. Please join the user@sqoop mailing list [1] to receive 
additional feedback.

 Exception in thread main java.lang.IncompatibleClassChangeError: Found
 class org.apache.hadoop.mapreduce.JobContext, but interface was expected

Exception that you're getting is typical when one is running code compiled 
against Hadoop 2.0 on Hadoop 1.0 or vice versa. You've specified that you're 
running Sqoop 1.4.2 --hadoop1.0.0.0, but that do not seem to be the case. Would 
you mind downloading it again from our mirror [2] and retrying it? 

Jarcec

Links:
1: http://sqoop.apache.org/mail-lists.html
2: http://www.apache.org/dist/sqoop/1.4.2/

On Fri, Feb 15, 2013 at 03:20:09PM -0500, Cyrille Djoko wrote:
 Hi Jarcec,
 I did try Sqoop. I am running sqoop 1.4.2 --hadoop1.0.0 along with hadoop
 1.0.4 But I keep running on the following exception.
 
 Exception in thread main java.lang.IncompatibleClassChangeError: Found
 class org.apache.hadoop.mapreduce.JobContext, but interface was expected
 
 So I wrote a small program but all I can do is send queries to the server.
  Hi Cyrille,
  I'm not exactly sure what exactly you mean, so I'm more or less blindly
  shooting, but maybe Apache Sqoop [1] might help you?
 
  Jarcec
 
  Links:
  1: http://sqoop.apache.org/
 
  On Fri, Feb 15, 2013 at 01:44:45PM -0500, Cyrille Djoko wrote:
  I am looking for a relatively efficient way of transferring data between
  a
  remote server and Hive without going through the hassle of storing the
  data first on memory before loading it to Hive.
  From what I have read so far there is no such command but it would not
  hurt to ask.
  Is it possible to insert data through an insert query in hive? (The
  equivalent to insert into table_name
  values (...) in xSQLx)
 
  Thank you in advance for an answer.
 
 
  Cyrille Djoko
  Data Mining Developer Intern
 
 
 
 
 Cyrille Djoko
 
 Agnik LLC
 Data Mining Developer Intern
 


signature.asc
Description: Digital signature


Run hive queries, and collect job information

2013-01-30 Thread Mathieu Despriee
Hi folks,

I would like to run a list of generated Hive queries. For each one, I would
like to retrieve the MR job_id (or ids, in case of multiple stages), and
then, with this job_id, collect statistics from the JobTracker (cumulative
CPU, bytes read, ...).

How can I send Hive queries from a bash or python script, and retrieve the
job_id(s)?

For the 2nd part (collecting stats for the job), we're using an MRv1 Hadoop
cluster, so I don't have the AppMaster REST API
(http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html).
I'm about to collect data from the JobTracker web UI. Any better ideas?
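
A rough sketch of pulling the counters from the MRv1 command line instead of
the web UI (this assumes `hadoop job -status` prints the job counters on the
cluster's Hadoop version -- untested):

    import subprocess

    # Placeholder id -- e.g. one extracted from the Hive history file.
    job_id = "job_201301301200_0001"

    # On MRv1, `hadoop job -status` prints the job state and its counters
    # (CPU time, HDFS bytes read, ...); parse whatever fields you need.
    status = subprocess.check_output(["hadoop", "job", "-status", job_id]).decode()
    print(status)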

Mathieu


Re: Run hive queries, and collect job information

2013-01-30 Thread Qiang Wang
Every Hive query has a history file, and you can get this information from
the Hive history file.

The following Java code can serve as an example:
https://github.com/anjuke/hwi/blob/master/src/main/java/org/apache/hadoop/hive/hwi/util/QueryUtil.java
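
If you only need the job ids from a script, the history file can also be
grepped directly. A rough Python sketch, assuming the default plain-text
history format with TASK_HADOOP_ID="..." fields (check your Hive version):

    import re
    import sys

    # Path to a Hive history file, e.g. /tmp/<user>/hive_job_log_*.txt
    # (assumes the default hive.querylog.location and file naming).
    history_file = sys.argv[1]

    job_ids = []
    with open(history_file) as f:
        for line in f:
            # TaskStart/TaskEnd records carry the MapReduce job id.
            m = re.search(r'TASK_HADOOP_ID="(job_[^"]+)"', line)
            if m and m.group(1) not in job_ids:
                job_ids.append(m.group(1))

    for job_id in job_ids:
        print(job_id)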

Regards,
Qiang


2013/1/30 Mathieu Despriee mdespr...@octo.com

 Hi folks,

 I would like to run a list of generated HIVE queries. For each, I would
 like to retrieve the MR job_id (or ids, in case of multiple stages). And
 then, with this job_id, collect statistics from job tracker (cumulative
 CPU, read bytes...)

 How can I send HIVE queries from a bash or python script, and retrieve the
 job_id(s) ?

 For the 2nd part (collecting stats for the job), we're using a MRv1 Hadoop
 cluster, so I don't have the AppMaster REST API
 (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html).
 I'm about to collect data from the jobtracker web UI. Any better idea ?

 Mathieu





Re: Run hive queries, and collect job information

2013-01-30 Thread Nitin Pawar
For all the queries you run as a given user, the Hive CLI stores its command
history in the ~/.hivehistory file (please check the limit on how many
queries it keeps).

For all the jobs the Hive CLI runs, it keeps the per-query history details
under /tmp/<user.name>/.

All of these locations are configurable in hive-site.xml.
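
One way to tie this together from a script -- a loose, untested sketch; it
assumes the hive CLI is on the PATH and that hive.querylog.location can be
overridden per invocation with --hiveconf:

    import glob
    import subprocess
    import tempfile

    query = "SELECT COUNT(*) FROM some_table"            # placeholder query
    log_dir = tempfile.mkdtemp(prefix="hive_history_")   # isolate this run's history

    # Run the query with the per-query history written into our own directory.
    subprocess.check_call([
        "hive", "--hiveconf", "hive.querylog.location=" + log_dir,
        "-e", query,
    ])

    # Whatever ended up in the directory is this session's history file(s),
    # which can then be scanned for job ids as above.
    for path in glob.glob(log_dir + "/*"):
        print(path)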


On Wed, Jan 30, 2013 at 3:55 PM, Qiang Wang wsxy...@gmail.com wrote:

 Every hive query has a history file, and you can get these info from hive
 history file

 Following java code can be an example:

 https://github.com/anjuke/hwi/blob/master/src/main/java/org/apache/hadoop/hive/hwi/util/QueryUtil.java

 Regard,
 Qiang


 2013/1/30 Mathieu Despriee mdespr...@octo.com

 Hi folks,

 I would like to run a list of generated HIVE queries. For each, I would
 like to retrieve the MR job_id (or ids, in case of multiple stages). And
 then, with this job_id, collect statistics from job tracker (cumulative
 CPU, read bytes...)

 How can I send HIVE queries from a bash or python script, and retrieve
 the job_id(s) ?

 For the 2nd part (collecting stats for the job), we're using a MRv1
 Hadoop cluster, so I don't have the AppMaster REST API
 (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html).
 I'm about to collect data from the jobtracker web UI. Any better idea ?

 Mathieu






-- 
Nitin Pawar


Re: Run hive queries, and collect job information

2013-01-30 Thread Mathieu Despriee
Fantastic.
Thanks!


2013/1/30 Qiang Wang wsxy...@gmail.com

 Every hive query has a history file, and you can get these info from hive
 history file

 Following java code can be an example:

 https://github.com/anjuke/hwi/blob/master/src/main/java/org/apache/hadoop/hive/hwi/util/QueryUtil.java

 Regard,
 Qiang


 2013/1/30 Mathieu Despriee mdespr...@octo.com

 Hi folks,

 I would like to run a list of generated HIVE queries. For each, I would
 like to retrieve the MR job_id (or ids, in case of multiple stages). And
 then, with this job_id, collect statistics from job tracker (cumulative
 CPU, read bytes...)

 How can I send HIVE queries from a bash or python script, and retrieve
 the job_id(s) ?

 For the 2nd part (collecting stats for the job), we're using a MRv1
 Hadoop cluster, so I don't have the AppMaster REST API
 (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html).
 I'm about to collect data from the jobtracker web UI. Any better idea ?

 Mathieu






Re: REST API for Hive queries?

2012-12-13 Thread Manish Malhotra
Ideally, push the aggregated data to some RDBMS like MySQL and have a REST
API (or some other API) so the UI can build reports or queries out of it.

If the use case is ad-hoc queries, then once a query is submitted and its
result is generated in batch mode, a REST API can be provided to get the
results from HDFS directly.
For this you can use WebHDFS, or build your own service that internally uses
the FileSystem API.
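
For instance, reading a result file back over WebHDFS is just an HTTP GET. A
minimal sketch, assuming WebHDFS is enabled on the NameNode; the host, path
and user below are placeholders:

    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib2 import urlopen          # Python 2

    namenode = "http://namenode.example.com:50070"
    path = "/user/hive/warehouse/report_output/000000_0"
    user = "hive"

    url = "%s/webhdfs/v1%s?op=OPEN&user.name=%s" % (namenode, path, user)

    # WebHDFS answers with a redirect to a DataNode; urlopen follows it automatically.
    print(urlopen(url).read())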

Regards,
Manish


On Wed, Dec 12, 2012 at 11:30 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 Hive takes a longer time to respond to queries as the data gets larger.

 Best way to handle this is you process the data on hive and store in some
 rdbms like mysql etc.
 On top of that then you can write your own API or use pentaho like
 interface where they can write the queries or see predefined reports.

 Alternatively, pentaho does have hive connection as well. There are other
 platforms such as talend, datameer etc. You can have a look at them


 On Thu, Dec 13, 2012 at 1:15 AM, Leena Gupta gupta.le...@gmail.com wrote:

 Hi,

 We are using Hive as our data warehouse to run various queries on large
 amounts of data. There are some users who would like to get access to the
 output of these queries and display the data on an existing UI application.
 What is the best way to give them the output of these queries? Should we
 write REST APIs that the Front end can call to get the data? How can this
 be done?
  I'd like to know what have other people done to meet this requirement ?
 Any pointers would be very helpful.
 Thanks.




 --
 Nitin Pawar



Re: REST API for Hive queries?

2012-12-13 Thread Jagat Singh
If your requirement is that queries are not going to be run on the fly, then
I would suggest the following:

1) Create a Hive script.
2) Combine it with an Oozie workflow to run at a scheduled time and push the
results to some DB, say MySQL.
3) Use some application to talk to MySQL and generate those reports (a
bare-bones sketch follows below).
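
A bare-bones example of step 3, just to make it concrete (a sketch only; it
assumes the scheduled job loads a MySQL table called daily_report and that the
mysql-connector-python package is installed):

    import mysql.connector  # assumes the mysql-connector-python package

    # Placeholder connection details -- point these at the DB the workflow loads.
    conn = mysql.connector.connect(
        host="mysql.example.com", user="report", password="secret", database="reports"
    )

    cur = conn.cursor()
    # daily_report is a placeholder table populated by the scheduled Hive job.
    cur.execute("SELECT report_date, metric, value FROM daily_report ORDER BY report_date")
    for report_date, metric, value in cur.fetchall():
        print(report_date, metric, value)

    cur.close()
    conn.close()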

Thanks,

Jagat Singh



On Thu, Dec 13, 2012 at 7:15 PM, Manish Malhotra 
manish.hadoop.w...@gmail.com wrote:


 Ideally, push the aggregated data to some RDBMS like MySQL and have REST
 API or some API to enable ui to build report or query out of it.

 If the use case is ad-hoc query then once that qry is submitted, and
 result is generated in batch mode, the REST API can be provided to get the
 results from HDFS directly.
 For this can use WebHDFS or build own which can internally using
 FileSystem API.

 Regards,
 Manish


 On Wed, Dec 12, 2012 at 11:30 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 Hive takes a longer time to respond to queries as the data gets larger.

 Best way to handle this is you process the data on hive and store in some
 rdbms like mysql etc.
 On top of that then you can write your own API or use pentaho like
 interface where they can write the queries or see predefined reports.

 Alternatively, pentaho does have hive connection as well. There are other
 platforms such as talend, datameer etc. You can have a look at them


  On Thu, Dec 13, 2012 at 1:15 AM, Leena Gupta gupta.le...@gmail.com wrote:

 Hi,

 We are using Hive as our data warehouse to run various queries on large
 amounts of data. There are some users who would like to get access to the
 output of these queries and display the data on an existing UI application.
 What is the best way to give them the output of these queries? Should we
 write REST APIs that the Front end can call to get the data? How can this
 be done?
  I'd like to know what have other people done to meet this requirement ?
 Any pointers would be very helpful.
 Thanks.




 --
 Nitin Pawar




