Re: mapreduce on hive with HCatInputFormat and skip.header.line.count=1

2017-06-07 Thread vinay gupta

Hello hive-users,

I am reading a Hive table with skip.header.line.count set to 1 in
TBLPROPERTIES. In the driver code I do this:

    val hiveMetaStoreClient = new HiveMetaStoreClient(
      new HiveConf(job.getConfiguration, HiveIngestDriver.getClass))
    val hiveTable: Table = hiveMetaStoreClient.getTable("default", "hiveTableName")
    val hiveTableProperties = new Properties()
    hiveTableProperties.putAll(hiveTable.getParameters)
    logger.info("size: {} getParameters: {}", hiveTable.getParametersSize,
      hiveTableProperties.toMap)

    val hCatInputFormat = HCatInputFormat.setInput(job.getConfiguration,
      "default", "hiveTableName", "day=2017-06-01")
    hCatInputFormat.setProperties(hiveTableProperties)

    job.setInputFormatClass(classOf[HCatInputFormat])



The log from the code above shows that skip.header.line.count is set
correctly. Even so, HCatInputFormat does not apply it, as I still see the
header row in the output.

size: 4 getParameters: {last_modified_by=myuser, last_modified_time=1468952183,
transient_lastDdlTime=1468952183, skip.header.line.count=1}

Any suggestions?

Thanks,
-Vinay


Re: Pro and Cons of using HBase table as an external table in HIVE

2017-06-07 Thread Uli Bethke

Why are you thinking of using HBase?

Just store the SCD versions in a normal Hive dimension table. If you are
worried about updates to columns such as 'valid to' and 'latest record
indicator', you can calculate these on the fly using window functions, so
there is no need to create and update them physically. You can read more
about it here:
https://sonra.io/2017/05/15/dimensional-modeling-and-kimball-data-marts-in-the-age-of-big-data-and-hadoop/
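As an illustration of the window-function approach, here is a minimal sketch; the table and column names (customer_dim, customer_id, attr, valid_from) are hypothetical:

```sql
-- Derive 'valid to' and 'latest record indicator' on the fly instead of
-- storing and updating them physically.
SELECT
  customer_id,
  attr,
  valid_from,
  -- valid_to is the start of the next version of the same customer
  LEAD(valid_from) OVER (PARTITION BY customer_id ORDER BY valid_from) AS valid_to,
  -- the most recent version per customer is flagged as the latest record
  CASE WHEN ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY valid_from DESC) = 1
       THEN 'Y' ELSE 'N'
  END AS latest_record_indicator
FROM customer_dim;
```

Because both columns are derived at query time, inserting a new version of a customer is a plain append with no in-place updates.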




On 07/06/2017 11:13, Ramasubramanian Narayanan wrote:

Hi,

Can you please let us know the pros and cons of using an HBase table as an
external table in Hive?


Will there be any performance degradation when using Hive over HBase
instead of a direct Hive table?


The tables that I am planning to keep in HBase are master tables like
account and customer, where I want to implement Slowly Changing Dimensions.
Please throw some light on that too if you have done any such
implementations.


Thanks and Regards,
Rams


--
___
Uli Bethke
CEO Sonra
p: +353 86 32 83 040
w: www.sonra.io
l: linkedin.com/in/ulibethke
t: twitter.com/ubethke
s: uli.bethke

Chair Hadoop User Group Ireland
www.hugireland.org
Associate President DAMA Ireland



Re: Pro and Cons of using HBase table as an external table in HIVE

2017-06-07 Thread Mich Talebzadeh
As far as I know, using Hive on HBase can only be done through a Hive
external table.

Example

hive> CREATE EXTERNAL TABLE MARKETDATAHBASE (key STRING, TICKER STRING,
      TIMECREATED STRING, PRICE STRING)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ("hbase.columns.mapping" =
      ":key,PRICE_INFO:TICKER,PRICE_INFO:TIMECREATED,PRICE_INFO:PRICE")
      TBLPROPERTIES ("hbase.table.name" = "MARKETDATAHBASE");


The problem here is that, like most Hive external tables, you are creating a
pointer to HBase through the Hive storage handler, and there is very little
optimization that can be done.


In all probability you would be better off using Apache Phoenix on top of
HBase with Phoenix secondary indexes. Granted, the SQL capability in Phoenix
may not be as good as Hive's, but it should do for most purposes.


In Phoenix you can do:



CREATE TABLE MARKETDATAHBASE (PK VARCHAR PRIMARY KEY, PRICE_INFO.TICKER
VARCHAR, PRICE_INFO.TIMECREATED VARCHAR, PRICE_INFO.PRICE VARCHAR);
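A secondary index could then be added on the column you filter by; a sketch, where the index name is an assumption:

```sql
-- Phoenix maintains this index in a separate HBase table automatically;
-- INCLUDE makes the index covering for queries that also read PRICE.
CREATE INDEX MARKETDATA_TICKER_IDX
ON MARKETDATAHBASE (PRICE_INFO.TICKER)
INCLUDE (PRICE_INFO.PRICE);
```

With this in place, point lookups by TICKER avoid a full scan of the data table.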



HTH,

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 7 June 2017 at 11:13, Ramasubramanian Narayanan <
ramasubramanian.naraya...@gmail.com> wrote:

> Hi,
>
> Can you please let us know the pros and cons of using an HBase table as an
> external table in Hive?
>
> Will there be any performance degradation when using Hive over HBase
> instead of a direct Hive table?
>
> The tables that I am planning to keep in HBase are master tables like
> account and customer, where I want to implement Slowly Changing Dimensions.
> Please throw some light on that too if you have done any such
> implementations.
>
> Thanks and Regards,
> Rams
>


Re: Hive 2.2

2017-06-07 Thread Barna Zsombor Klara
Hi Boris,

you can build from the branches branch-2.3, branch-2 or master. All of
these should have Spark 2.0 support, based on the git commit history.
However, these are non-released versions, so I'm not sure what you can
expect in terms of stability.

On Wed, Jun 7, 2017 at 4:05 PM, Boris Lublinsky <
boris.lublin...@lightbend.com> wrote:

> Thanks Vergil
> I do not see version 2.4. The highest branch is 2.3 RC.
> Is 2.4 master?
>
>
> Boris Lublinsky
> FDP Architect
> boris.lublin...@lightbend.com
> https://www.lightbend.com/
>
> On Jun 6, 2017, at 10:07 PM, vergil  wrote:
>
> Hi,
> You can build a distribution from the source code; the GitHub URL is
> https://github.com/apache/hive/releases, and you can follow the guide at
> https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-BuildingHivefromSource
> However, when I read the source code I found that the Hive 2.2 and 2.3
> versions do not support Spark 2.0, because their JobMetricsListener.java
> still uses JavaSparkListener, which was changed to SparkListener as of
> Spark 2.0; see the GitHub commit
> https://github.com/apache/hive/commit/ac977cc88757b49fbbd5c3bb236adcedcaae396c
> Hive 2.2 and 2.3 do not include this commit.
>
> So you can build Hive 2.4 or later from source; it supports Spark 2.0.
>
>
>


Re: Hive 2.2

2017-06-07 Thread Boris Lublinsky
Thanks Vergil,
I do not see version 2.4. The highest branch is 2.3 RC.
Is 2.4 master?


Boris Lublinsky
FDP Architect
boris.lublin...@lightbend.com
https://www.lightbend.com/

> On Jun 6, 2017, at 10:07 PM, vergil  wrote:
> 
> Hi,
> You can build a distribution from the source code; the GitHub URL is
> https://github.com/apache/hive/releases, and you can follow the guide at
> https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-BuildingHivefromSource
> However, when I read the source code I found that the Hive 2.2 and 2.3
> versions do not support Spark 2.0, because their JobMetricsListener.java
> still uses JavaSparkListener, which was changed to SparkListener as of
> Spark 2.0; see the GitHub commit
> https://github.com/apache/hive/commit/ac977cc88757b49fbbd5c3bb236adcedcaae396c
> Hive 2.2 and 2.3 do not include this commit.
> 
> So you can build Hive 2.4 or later from source; it supports Spark 2.0.



Pro and Cons of using HBase table as an external table in HIVE

2017-06-07 Thread Ramasubramanian Narayanan
Hi,

Can you please let us know the pros and cons of using an HBase table as an
external table in Hive?

Will there be any performance degradation when using Hive over HBase
instead of a direct Hive table?

The tables that I am planning to keep in HBase are master tables like
account and customer, where I want to implement Slowly Changing Dimensions.
Please throw some light on that too if you have done any such
implementations.

Thanks and Regards,
Rams


Re: meet error when building hive-2.4.x from source

2017-06-07 Thread Bing Li
Hi,
Please try building the hive-storage-api module locally first.
e.g.
cd storage-api
mvn clean install -DskipTests

And then build the whole hive project.

2017-06-05 17:20 GMT+08:00 赵伟 :

> Hi!
> First of all, thank you for reading my message.
> I ran into a problem when building the 2.4.x branch from source.
> My build command: mvn clean package -Pdist -e
> Here is the stack trace:
> [INFO] Hive ... SUCCESS [  1.955 s]
> [INFO] Hive Shims Common .. SUCCESS [  6.070 s]
> [INFO] Hive Shims 0.23  SUCCESS [  4.526 s]
> [INFO] Hive Shims Scheduler ... SUCCESS [  1.775 s]
> [INFO] Hive Shims . SUCCESS [  0.994 s]
> [INFO] Hive Common  SUCCESS [ 51.173 s]
> [INFO] Hive Service RPC ... SUCCESS [  4.936 s]
> [INFO] Hive Serde . FAILURE [  0.063 s]
> [INFO] Hive Metastore . SKIPPED
> [INFO] Hive Vector-Code-Gen Utilities . SKIPPED
> [INFO] Hive Llap Common ... SKIPPED
> [INFO] Hive Llap Client ... SKIPPED
> [INFO] Hive Llap Tez .. SKIPPED
> [INFO] Spark Remote Client  SKIPPED
> [INFO] Hive Query Language  SKIPPED
> [INFO] Hive Llap Server ... SKIPPED
> [INFO] Hive Service ... SKIPPED
> [INFO] Hive Accumulo Handler .. SKIPPED
> [INFO] Hive JDBC .. SKIPPED
> [INFO] Hive Beeline ... SKIPPED
> [INFO] Hive CLI ... SKIPPED
> [INFO] Hive Contrib ... SKIPPED
> [INFO] Hive Druid Handler . SKIPPED
> [INFO] Hive HBase Handler . SKIPPED
> [INFO] Hive JDBC Handler .. SKIPPED
> [INFO] Hive HCatalog .. SKIPPED
> [INFO] Hive HCatalog Core . SKIPPED
> [INFO] Hive HCatalog Pig Adapter .. SKIPPED
> [INFO] Hive HCatalog Server Extensions  SKIPPED
> [INFO] Hive HCatalog Webhcat Java Client .. SKIPPED
> [INFO] Hive HCatalog Webhcat .. SKIPPED
> [INFO] Hive HCatalog Streaming  SKIPPED
> [INFO] Hive HPL/SQL ... SKIPPED
> [INFO] Hive Llap External Client .. SKIPPED
> [INFO] Hive Shims Aggregator .. SKIPPED
> [INFO] Hive TestUtils . SKIPPED
> [INFO] Hive Packaging . SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 01:12 min
> [INFO] Finished at: 2017-06-05T17:11:30+08:00
> [INFO] Final Memory: 77M/783M
> [INFO] 
> 
> [ERROR] Failed to execute goal on project hive-serde: Could not resolve
> dependencies for project org.apache.hive:hive-serde:jar:2.3.0: Failure to
> find org.apache.hive:hive-storage-api:jar:2.4.0 in
> http://www.datanucleus.org/downloads/maven2 was cached in the local
> repository, resolution will not be reattempted until the update interval of
> datanucleus has elapsed or updates are forced -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal on project hive-serde: Could not resolve dependencies for project
> org.apache.hive:hive-serde:jar:2.3.0: Failure to find
> org.apache.hive:hive-storage-api:jar:2.4.0 in
> http://www.datanucleus.org/downloads/maven2 was cached in the local
> repository, resolution will not be reattempted until the update interval of
> datanucleus has elapsed or updates are forced
> at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies(LifecycleDependencyResolver.java:221)
> at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.resolveProjectDependencies(LifecycleDependencyResolver.java:127)
> at org.apache.maven.lifecycle.internal.MojoExecutor.ensureDependenciesAreResolved(MojoExecutor.java:245)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:199)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute(

deleting duplicates from large table

2017-06-07 Thread Tousif
Hi Users,

I want to know if it is possible to delete duplicates from a large
non-partitioned table.

How does ACID perform on a large table with billions of rows?
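One common approach that avoids ACID entirely is to rewrite the table in a single pass, keeping one row per key with a window function; a sketch, where the table and column names (big_table, k, v, ts) are hypothetical:

```sql
-- Keep exactly one row per key k (the newest by ts), then overwrite the
-- table with the de-duplicated result. Hive allows the source and target
-- of INSERT OVERWRITE to be the same table.
INSERT OVERWRITE TABLE big_table
SELECT k, v, ts
FROM (
  SELECT k, v, ts,
         ROW_NUMBER() OVER (PARTITION BY k ORDER BY ts DESC) AS rn
  FROM big_table
) deduped
WHERE rn = 1;
```

This costs a full table rewrite, but for billions of rows it is usually more predictable than many individual ACID deletes on a non-partitioned table.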

-- 


Regards
Tousif Khazi