Re: Business Rules Engine for Hive

2018-04-16 Thread Joel D
The business rules we have here are currently embedded in Hive code. They
range from basic standardization using CASE blocks to complex multi-column
validation.

Thanks.

On Mon, Apr 16, 2018 at 5:03 PM Jörn Franke  wrote:

> The question is what do your rules do? Do you need to maintain a factbase
> or do they just check data quality within certain tables?
>
> On 16. Apr 2018, at 22:28, Joel D  wrote:
>
> Ok.
>
> Rough ideas:
> To keep the business logic outside code, I was thinking to give a custom
> UI.
>
> Next read from UI data and build UDFs using the rules defined outside the
> UDF.
>
> 1 UDF per data object.
>
> Not sure; these are just thoughts.
>
> On Mon, Apr 16, 2018 at 1:40 PM Jörn Franke  wrote:
>
>> I would not use Drools with Spark, it does not scale to the distributed
>> setting.
>>
>> You could translate the rules to hive queries but this would not be
>> exactly the same thing.
>>
>> > On 16. Apr 2018, at 17:59, Joel D  wrote:
>> >
>> > Hi,
>> >
>> > Any suggestions on how to implement Business Rules Engine with Hive
>> ETLs?
>> >
>> > For spark based Etl jobs, I was exploring Drools but not sure about
>> Hive.
>> >
>> > Thanks.
>>
>


Re: Business Rules Engine for Hive

2018-04-16 Thread Jörn Franke
The question is what do your rules do? Do you need to maintain a factbase or do 
they just check data quality within certain tables?

> On 16. Apr 2018, at 22:28, Joel D  wrote:
> 
> Ok. 
> 
> Rough ideas:
> To keep the business logic outside code, I was thinking to give a custom UI.
> 
> Next read from UI data and build UDFs using the rules defined outside the UDF.
> 
> 1 UDF per data object.
> 
> Not sure; these are just thoughts.
> 
>> On Mon, Apr 16, 2018 at 1:40 PM Jörn Franke  wrote:
>> I would not use Drools with Spark, it does not scale to the distributed 
>> setting.
>> 
>> You could translate the rules to hive queries but this would not be exactly 
>> the same thing.
>> 
>> > On 16. Apr 2018, at 17:59, Joel D  wrote:
>> > 
>> > Hi,
>> > 
>> > Any suggestions on how to implement Business Rules Engine with Hive ETLs?
>> > 
>> > For spark based Etl jobs, I was exploring Drools but not sure about Hive.
>> > 
>> > Thanks. 


Re: Business Rules Engine for Hive

2018-04-16 Thread Joel D
Ok.

Rough ideas:
To keep the business logic outside the code, I was thinking of providing a
custom UI.

Next, read the rule data entered through the UI and build UDFs from those
rules, so that the rules are defined outside the UDF.

1 UDF per data object.

Not sure; these are just thoughts.

On Mon, Apr 16, 2018 at 1:40 PM Jörn Franke  wrote:

> I would not use Drools with Spark, it does not scale to the distributed
> setting.
>
> You could translate the rules to hive queries but this would not be
> exactly the same thing.
>
> > On 16. Apr 2018, at 17:59, Joel D  wrote:
> >
> > Hi,
> >
> > Any suggestions on how to implement Business Rules Engine with Hive ETLs?
> >
> > For spark based Etl jobs, I was exploring Drools but not sure about Hive.
> >
> > Thanks.
>


Re: Business Rules Engine for Hive

2018-04-16 Thread Joel D
Hi Pivonka, we are more inclined towards using open source products and
closer integration with Hive, since we have most of our ETL in Hive.

Thanks.

On Mon, Apr 16, 2018 at 12:51 PM Al Pivonka  wrote:

> I am not the product owner and have not implemented it yet.
> I would check out
> http://cask.co/products/rules-engine/
>
> On Mon, Apr 16, 2018 at 11:59 AM, Joel D  wrote:
>
>> Hi,
>>
>> Any suggestions on how to implement Business Rules Engine with Hive ETLs?
>>
>> For spark based Etl jobs, I was exploring Drools but not sure about Hive.
>>
>> Thanks.
>>
>
>
>
> --
> Those who say it can't be done, are usually interrupted by those doing it.
>


Re: Hive Custom UDF evaluate behavior when @UDFType is set

2018-04-16 Thread Jason Dere
I'd suggested logging the stack trace of the call. The attached logs don't 
really show where the calls occur during query compilation/execution.

Try logger.info("Inside testUdf Initialize***", new 
Exception("initialize"));
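
For anyone trying this outside the UDF, here is a minimal self-contained sketch of the same idea: pass a Throwable to the logger so that the call stack ends up in the log. It uses java.util.logging only so that it runs without extra jars; the class and method names are made up for illustration, and the logger in the actual UDF may be a different library.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.logging.Level;
import java.util.logging.Logger;

public class StackTraceLogging {
    private static final Logger LOG =
            Logger.getLogger(StackTraceLogging.class.getName());

    /** Renders the current call stack as text by creating (not throwing) an exception. */
    static String traceOf(String label) {
        StringWriter out = new StringWriter();
        new Exception(label).printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    /** Stand-in for a UDF's initialize(): logs who called it. */
    static String initialize() {
        // The Throwable argument makes the logger print the full stack trace.
        LOG.log(Level.INFO, "Inside initialize", new Exception("initialize"));
        return traceOf("initialize");
    }

    public static void main(String[] args) {
        System.out.println(initialize());
    }
}
```

The same pattern inside evaluate() would show whether a call comes from query compilation or from execution.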




From: PradeepKumar Yadav 
Sent: Monday, April 16, 2018 4:53 AM
To: user@hive.apache.org
Cc: Jason Dere
Subject: FW: Hive Custom UDF evaluate behavior when @UDFType is set

Hi,
Regarding the previous mail, I have attached the following 
observation documents:

  1.  testUdfNoAnnotation.java - contains UDF code with no @UDFType annotation.
  2.  hive-default-No-annotation-log.txt - HiveServer2 logs after executing the 
UDF created through the above class
  3.  hive-default-udf-annotation.jpg - The beeline output after creating and 
executing the UDF created through the above class
  4.  testUdf.java - contains UDF code with @UDFType( deterministic = false )
  5.  hive-deterministic-false-log.txt - JobHistory logs after executing the 
UDF created through the above class
  6.  hive-deterministic-false.jpg - The beeline output after creating and 
executing the UDF created through the above class

Thanks,
PradeepKumar Yadav
From: Jason Dere [mailto:jd...@hortonworks.com]
Sent: Wednesday, April 11, 2018 12:02 AM
To: user@hive.apache.org
Subject: Re: Hive Custom UDF evaluate behavior when @UDFType is set


It might have to do with constant propagation, because the function was listed 
as deterministic. You can try logging the stack trace during execution and 
pasting both stack traces here; that may give more clues as to what is going on.
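
To illustrate why a "deterministic" flag can lead to extra calls, here is a toy model of constant folding. It is not Hive's optimizer and does not try to reproduce the exact count of 2 reported in this thread; the annotation, interface, and class names are all hypothetical. The only point is that a planner which trusts the flag may call evaluate() at plan time, so side effects such as counters become unreliable.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.concurrent.atomic.AtomicInteger;

public class FoldingDemo {
    /** Hypothetical stand-in for a determinism annotation. */
    @Retention(RetentionPolicy.RUNTIME)
    @interface Deterministic { boolean value() default true; }

    interface Fn { int evaluate(); }

    /** Side effect shared by both functions: counts evaluate() calls. */
    static final AtomicInteger CALLS = new AtomicInteger();

    @Deterministic(true)
    static class FlaggedDeterministic implements Fn {
        public int evaluate() { return CALLS.incrementAndGet(); }
    }

    @Deterministic(false)
    static class FlaggedNonDeterministic implements Fn {
        public int evaluate() { return CALLS.incrementAndGet(); }
    }

    /**
     * Toy planner: a call with no column inputs is folded to a constant
     * if the function claims to be deterministic. A missing annotation is
     * treated as deterministic, mirroring the default described in this thread.
     * Returns how many evaluate() calls happened at plan time.
     */
    static int plan(Fn fn) {
        Deterministic d = fn.getClass().getAnnotation(Deterministic.class);
        boolean deterministic = (d == null) || d.value();
        int before = CALLS.get();
        if (deterministic) {
            fn.evaluate(); // folded: invoked before any row is processed
        }
        return CALLS.get() - before;
    }

    public static void main(String[] args) {
        System.out.println(plan(new FlaggedDeterministic()));
        System.out.println(plan(new FlaggedNonDeterministic()));
    }
}
```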




From: PradeepKumar Yadav 
Sent: Monday, April 9, 2018 11:35 PM
To: user@hive.apache.org
Subject: Hive Custom UDF evaluate behavior when @UDFType is set

Hi,
Recently, while creating a custom generic Hive UDF, I came across 
unexpected behavior of the evaluate method. The custom UDF has logic to 
increment a counter and write it to a file. When I execute it directly, 
without involving any table, it always returns an extra count, i.e. 2.
When I added some logging (sysout) inside the evaluate method, I 
observed that the logs were printed twice. On further research I 
came across the @UDFType annotation and found that if we do not provide 
this annotation in our custom UDF, the default value is deterministic = true.
When I provide this annotation in my custom UDF and set 
@UDFType( deterministic = false ), my logs are printed only 
once and my UDF returns the accurate count, i.e. 1, which implies that 
evaluate is called only once when @UDFType( deterministic = false ).
I would like to understand the connection between 
@UDFType and the evaluate method when the UDF is invoked directly without a 
table.

Note: When I invoke my UDF on a table, I get the correct 
count even with @UDFType( deterministic = true ).

Thanks in advance. :)
Regards,
PradeepKumar Yadav


Re: Business Rules Engine for Hive

2018-04-16 Thread Jörn Franke
I would not use Drools with Spark; it does not scale to a distributed setting.

You could translate the rules to Hive queries, but this would not be exactly the 
same thing.
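
As a rough illustration of what "translating rules to Hive queries" could look like for the simplest case (value standardization), here is a minimal sketch that renders an externally defined mapping as a CASE expression. The class name, the rule format, and the column and alias names are assumptions for illustration only; a real rules engine would also need multi-column conditions, ordering, and safe handling of identifiers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RuleToHiveSql {
    /** Escapes backslashes and single quotes for use in a Hive string literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("'", "\\'");
    }

    /**
     * Renders a standardization rule (raw value -> standard value) as a
     * CASE expression. The column and alias are trusted as-is in this sketch.
     */
    static String toCaseExpression(String column, Map<String, String> mapping, String alias) {
        StringBuilder sql = new StringBuilder("CASE");
        for (Map.Entry<String, String> e : mapping.entrySet()) {
            sql.append(" WHEN upper(trim(").append(column).append(")) = '")
               .append(escape(e.getKey())).append("' THEN '")
               .append(escape(e.getValue())).append("'");
        }
        // Values that match no rule pass through unchanged.
        sql.append(" ELSE ").append(column).append(" END AS ").append(alias);
        return sql.toString();
    }

    /** Example rule set, as it might be loaded from a UI-maintained table. */
    static String demo() {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("CALIF", "CA");
        mapping.put("N.Y.", "NY");
        return toCaseExpression("state", mapping, "state_std");
    }

    public static void main(String[] args) {
        System.out.println("SELECT " + demo() + " FROM customers");
    }
}
```

Generating SQL this way keeps the rules outside the ETL code, but unlike a rules engine such as Drools there is no fact base or rule chaining.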

> On 16. Apr 2018, at 17:59, Joel D  wrote:
> 
> Hi,
> 
> Any suggestions on how to implement Business Rules Engine with Hive ETLs?
> 
> For spark based Etl jobs, I was exploring Drools but not sure about Hive.
> 
> Thanks. 


Re: Business Rules Engine for Hive

2018-04-16 Thread Al Pivonka
I am not the product owner and have not implemented it yet.
I would check out
http://cask.co/products/rules-engine/

On Mon, Apr 16, 2018 at 11:59 AM, Joel D  wrote:

> Hi,
>
> Any suggestions on how to implement Business Rules Engine with Hive ETLs?
>
> For spark based Etl jobs, I was exploring Drools but not sure about Hive.
>
> Thanks.
>



-- 
Those who say it can't be done, are usually interrupted by those doing it.


Business Rules Engine for Hive

2018-04-16 Thread Joel D
Hi,

Any suggestions on how to implement a business rules engine with Hive ETLs?

For Spark-based ETL jobs, I was exploring Drools, but I am not sure about Hive.

Thanks.


Re: Ways to reduce launching time of query in Hive 2.2.1

2018-04-16 Thread Sungwoo Park
Do you use a Tez session pool along with LLAP (as Thai suggests in the
previous reply)? If a new query finds an idle AM in the Tez session pool,
there is no launch cost for the AM. If no idle AM is found, or if you specify
a queue name, a new AM has to start in order to serve the query. This is
explained in detail in the following article (see 'Understanding #4'):

https://community.hortonworks.com/articles/56636/hive-understanding-concurrent-sessions-queue-alloc.html

Hence, if not enough AMs are available in Tez session pool, new queries
will have to wait until old queries are finished. If there are not many
concurrent queries, I guess using Tez session pool will solve your issue.

In a highly concurrent setting, Hive-MR3 practically eliminates this
limitation. In Hive-MR3, HiveServer2 in shared session mode launches a
single AppMaster to be shared by all incoming queries, so there is no
launch cost. Containers are also shared by all queries and thus run like
daemons.

https://mr3.postech.ac.kr/hivemr3/features/hiveserver2/

Hive-MR3 0.1 does not support LLAP IO yet, but Hive-MR3 0.2 (which will be
released by the end of this month) will support LLAP IO.

--- Sungwoo Park




On Mon, Apr 16, 2018 at 11:33 PM, Anup Tiwari 
wrote:

> Hi All,
>
> We have a use case where we need to return output in < 10 sec. We have
> evaluated different sets of tools for execution and they work fine, but they
> do not cover all cases and are not reliable (since they are still
> evolving). But Hive works well in this context.
>
> Using Hive LLAP, we have reduced query time to 6-7sec. But query launching
> takes ~12-15 sec due to which response time becomes 18-21 sec.
>
> Is there any way we can reduce this launching time?
>
> Please note that we have tried prewarm containers but when we are
> launching query from hive client then it is not picking containers from
> already initialized containers rather it launches its own.
>
> Please let me know how can we overcome this issue since this is the only
> problem which is stopping us from using Hive. Any links/description is
> really appreciated.
>
>
> Regards,
> Anup Tiwari
>


Re: Ways to reduce launching time of query in Hive 2.2.1

2018-04-16 Thread Thai Bui
The best approach would be to use daemonized containers, such as Hive LLAP
+ Tez session pool, or Hive on Spark.

I’m not that familiar with Hive on Spark so I can’t comment on it, but Hive
on LLAP has worked really well for me when coupled with a Tez session pool.
You’ll have to specify how many Tez AMs are initialized per LLAP pool when
HiveServer2 starts, and those AMs will be used for all the queries in that
pool.
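
As a sketch of the settings involved, the snippet below assembles the HiveServer2 properties that control the default Tez session pool and prints them as --hiveconf arguments. The property names are the standard HiveServer2 ones; the queue name "llap" and the pool size of 4 are placeholder values, and in practice these settings would normally live in hive-site.xml rather than on the command line.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class TezSessionPoolConf {
    /** Builds the session-pool settings for the given queue list and pool size. */
    static Map<String, String> settings(String queues, int sessionsPerQueue) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Queues for which HiveServer2 keeps a pool of Tez sessions (AMs).
        conf.put("hive.server2.tez.default.queues", queues);
        // Number of AMs kept per queue; bounds concurrent queries per queue.
        conf.put("hive.server2.tez.sessions.per.default.queue",
                 Integer.toString(sessionsPerQueue));
        // Start the AMs when HiveServer2 starts instead of on first query.
        conf.put("hive.server2.tez.initialize.default.sessions", "true");
        return conf;
    }

    /** Renders the settings as command-line arguments. */
    static String asHiveconfArgs(Map<String, String> conf) {
        return conf.entrySet().stream()
                .map(e -> "--hiveconf " + e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(" "));
    }

    /** Placeholder values: one queue named "llap" with four pooled AMs. */
    static String demo() {
        return asHiveconfArgs(settings("llap", 4));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```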

The actual Tez containers are “replaced” by LLAP daemons that are always
running, so there’s no startup cost there either. The underlying execution
engine is still Tez, but it is executed in a special LLAP mode, and this could
potentially give you sub-second response times.

In my experience, when Hive LLAP is used, the IO cache is enabled and the file
format is ORC, I can get under 1s for small queries when the cache is hit
(equivalent to an in-memory database at that time). Parquet is slower since
the LLAP mode doesn’t support efficient IO caching and vectorized execution
for it.


On Mon, Apr 16, 2018 at 9:33 AM Anup Tiwari  wrote:

> Hi All,
>
> We have a use case where we need to return output in < 10 sec. We have
> evaluated different sets of tools for execution and they work fine, but they
> do not cover all cases and are not reliable (since they are still
> evolving). But Hive works well in this context.
>
> Using Hive LLAP, we have reduced query time to 6-7sec. But query launching
> takes ~12-15 sec due to which response time becomes 18-21 sec.
>
> Is there any way we can reduce this launching time?
>
> Please note that we have tried prewarm containers but when we are
> launching query from hive client then it is not picking containers from
> already initialized containers rather it launches its own.
>
> Please let me know how can we overcome this issue since this is the only
> problem which is stopping us from using Hive. Any links/description is
> really appreciated.
>
>
> Regards,
> Anup Tiwari
>
-- 
Thai


Ways to reduce launching time of query in Hive 2.2.1

2018-04-16 Thread Anup Tiwari
Hi All,

We have a use case where we need to return output in < 10 sec. We have
evaluated different sets of tools for execution and they work fine, but they
do not cover all cases and are not reliable (since they are still
evolving). But Hive works well in this context.

Using Hive LLAP, we have reduced query time to 6-7 sec. But query launching
takes ~12-15 sec, due to which the response time becomes 18-21 sec.

Is there any way we can reduce this launching time?

Please note that we have tried prewarming containers, but when we launch a
query from the Hive client it does not pick up the already initialized
containers; instead it launches its own.

Please let me know how we can overcome this issue, since this is the only
problem which is stopping us from using Hive. Any links/descriptions are
really appreciated.


Regards,
Anup Tiwari


Hive Server2 JDBC create connection pool and add proxy user into different session

2018-04-16 Thread ran gabriele
Hello,

I am new to Hive JDBC, and I tried copying code from the HiveServer2 JDBC 
client example and NiFi.

Now I am able to create a Hive connection with the proxy user indicated in the 
URL. However, that means I have to create a connection pool for every user if 
they have multiple sessions.
Is there any way to create a connection pool without the proxy user option 
first, and then add a proxy user to it when a user wants to open a session 
through my JDBC connector?
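
For reference, the per-user setup described above can be sketched as follows. This only illustrates the current approach (one URL, and hence one pool, per proxy user) and does not answer the question; the host, port, database, and class name are hypothetical, and the map stores plain URL strings where a real connector would hold a pool object.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ProxyUserUrls {
    // Hypothetical HiveServer2 endpoint.
    static final String BASE_URL = "jdbc:hive2://hs2.example.com:10000/default";

    /** Appends the proxy user as a session variable in the JDBC URL. */
    static String urlFor(String proxyUser) {
        return BASE_URL + ";hive.server2.proxy.user=" + proxyUser;
    }

    // One entry per proxy user; stands in for a per-user connection pool.
    static final Map<String, String> POOL_URLS = new ConcurrentHashMap<>();

    /** Returns the existing entry for the user, creating it on first use. */
    static String poolUrlFor(String user) {
        return POOL_URLS.computeIfAbsent(user, ProxyUserUrls::urlFor);
    }

    public static void main(String[] args) {
        System.out.println(poolUrlFor("alice"));
    }
}
```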

Thank you!



Sent from ranmx