Re: Apache Griffin inquiry

2019-03-09 Thread Nick Sokolov
For 1) and 2), alerting based on metrics and profiling graphs, Grafana
works pretty well in my experience.

Problem 4) can be solved in several ways:
 - "predicates": a data source can be configured with predicate logic that
is checked before the job starts. However, there are a few gotchas. A
predicate can only "skip" an execution when it returns false, so a
once-a-day job will not run at all that day if the data is not available
at the scheduled time. Another problem is that only the "file.exist"
predicate is supported out of the box, and 0.4.0 does not allow custom
ones. But there is an open PR adding the ability to write custom
predicates (for example, a predicate checking whether a Hive table has
changed since the last run).
 - trigger job execution from the data processing job itself, either using
GRIFFIN-229 (yet to be merged), or by running griffin-measure via
spark-submit explicitly.
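
To illustrate the second option, here is a minimal Python sketch of the
"wait, then trigger" pattern. It is not Griffin code: wait_until and the
injected sleep/clock parameters are hypothetical names, and what gets
started afterwards (the GRIFFIN-229 endpoint or a spark-submit of
griffin-measure) is left to the caller. Unlike a predicate that skips the
run, it keeps retrying until the data shows up or a timeout expires.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout_s: float,
               interval_s: float,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True as soon as the predicate holds, False on timeout.
    `sleep` and `clock` are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass a function that checks for the new partition (for
example a done-marker file) and start the quality check only when
wait_until returns True.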

On Thu, Mar 7, 2019 at 5:58 AM William Guo wrote:

> Hi team,
>
> We translated the original email into English so the community can
> understand it.
>
> =
> Hello, I recently deployed Griffin (version: 0.4.0) on my own computer
> and have a few questions:
> (1) The Griffin documentation says that you can customize measures and
> set alarm thresholds, but after looking through the Griffin source code
> and UI I could not find where to set an alarm threshold, any service that
> sends alarm emails, or any parameters for configuring a mail server.
> Maybe I missed a detail. Could you tell me how to configure the alarm
> threshold for data quality monitoring, and how to change the address of
> the mail server? Thanks.
>
>
> (2) I use Profiling to monitor a field in a table. The resulting metrics
> are shown as a table, not a chart. Can they be configured as a chart?
>
>
> (3) I read the source code, and I think Griffin's logic is: first create
> a Measure and a Job; creating the Job generates a scheduled task. This
> scheduled task periodically sends a scheduling request to Spark (through
> Livy) according to the configured schedule. After Spark finishes, the
> data quality results are stored in HDFS and ES. In the Griffin web UI,
> each time the user queries Metrics, the corresponding data quality
> results are pulled from ES. Is my understanding correct?
>
>
> (4) Do you have a good solution to the problem of task dependencies? For
> example: I have a Hive table that is updated once a day, and each update
> generates a new time partition (dt partition field = yesterday's date,
> for example: dt = 2019-03-03). Once the processing task for this table
> has finished updating the data, the corresponding Griffin data quality
> monitoring task should start and verify that the data in the new
> partition meets the requirements. If the Griffin task does not depend on
> the processing task of the Hive table, the Griffin task just keeps
> running on its own schedule. If the Hive table processing task has not
> started yet, yesterday's partition stays empty and alarm emails keep
> arriving, which leads to a lot of unnecessary "false alarms". Could you
> share how you solved this problem?
>
>
> When you have time, please help answer the questions above. Thank you.
>
>
> One of the Apache open source enthusiasts
> 2019.03.04
>
>
>
> On Tue, Mar 5, 2019 at 9:57 AM 大鹏 <18210146...@163.com> wrote:
>
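> > Here is what I know about some of the questions (by question number):
> > (1) Alerting currently has to be implemented together with ES; ES has
> > alerting plugins for this;
> > (2) Custom charts are not supported at the moment; you have to develop
> > the charts you need yourself;
> > (3) Your understanding is correct.
> >
> >
> > Hope this helps.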
> >
> >
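> > On 2019-03-05 07:53, 李立威 <412947...@qq.com> wrote:
> > Hello, I recently deployed Griffin (version: 0.4.0) on my own computer
> > and have a few questions:
> >
> >
> > (1) The Griffin documentation says that you can customize measures and
> > set alarm thresholds, but after looking through the Griffin source code
> > and UI I could not find where to set an alarm threshold, any service
> > that sends alarm emails, or any parameters for configuring a mail
> > server. Maybe I missed a detail. Could you tell me how to configure the
> > alarm threshold for data quality monitoring, and how to change the
> > address of the mail server? Thanks.
> >
> >
> > (2) I use Profiling to monitor a field in a table. The resulting
> > metrics are shown as a table, not a chart. Can they be configured as a
> > chart?
> >
> >
> > (3) I read the source code, and I think Griffin's logic is: first
> > create a Measure and a Job; creating the Job generates a scheduled
> > task. This scheduled task periodically sends a scheduling request to
> > Spark (through Livy) according to the configured schedule. After Spark
> > finishes, the data quality results are stored in HDFS and ES. In the
> > Griffin web UI, each time the user queries Metrics, the corresponding
> > data quality results are pulled from ES. Is my understanding correct?
> >
> >
> > (4) Do you have a good solution to the problem of task dependencies?
> > For example: I have a Hive table that is updated once a day, and each
> > update generates a new time partition (dt partition field = yesterday's
> > date, for example: dt=2019-03-03). Once the processing task for this
> > table has finished updating the data, the corresponding Griffin data
> > quality monitoring task should start and verify that the data in the
> > new partition meets the requirements. If the Griffin task does not
> > depend on the processing task of the Hive table, the Griffin task just
> > keeps running on its own schedule. If the Hive table processing task
> > has not started yet, yesterday's partition stays empty and alarm emails
> > keep arriving, which leads to a lot of unnecessary "false alarms".
> > Could you share how you solved this problem?
> >
> >
> > When you have time, please help answer the questions above. Thank you.
> >
> >
> > One of the Apache open source enthusiasts
> > 2019.03.04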
>


Re: Question

2019-03-09 Thread Nick Sokolov
I think that if there is a clear use case, it makes sense to document it
in Jira.

For 1), it sounds like it requires some "state" to be preserved between job
runs, which is not available directly. In theory this can be done with an
Elasticsearch input reading the latest previous result of the job, and then
Spark SQL computing the difference between the old and new values. This
requires either a custom Elasticsearch input or an implementation of
GRIFFIN-214 upstream.
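
To make the comparison step concrete, here is a minimal Python sketch. It
assumes the previous metric value has already been read back from wherever
results are stored (for example Elasticsearch). The names relative_change
and exceeds_threshold, and the handling of a zero baseline, are
hypothetical and not part of Griffin.

```python
from typing import Optional


def relative_change(previous: float, current: float) -> Optional[float]:
    """Return (current - previous) / |previous|, or None if previous is 0."""
    if previous == 0:
        return None  # growth rate is undefined against a zero baseline
    return (current - previous) / abs(previous)


def exceeds_threshold(previous: float, current: float,
                      max_abs_change: float) -> bool:
    """True when the period-over-period change is outside the allowed band."""
    change = relative_change(previous, current)
    if change is None:
        # Assumption: any move away from a zero baseline counts as a breach.
        return current != 0
    return abs(change) > max_abs_change
```

For a month-over-month check, previous would be last month's value of the
metric and current this month's; the same function covers year-over-year
by changing which stored result is read back.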

For 2), are you talking about profiling publishing MIN/MAX, or something
more sophisticated?

On Thu, Mar 7, 2019 at 6:11 AM William Guo wrote:

> Hi all,
>
> I translated the original email into English as follows.
>
> =
> Will the following requirements be considered in later versions of
> Griffin:
>
>
>    1. monitoring year-over-year/month-over-month growth for some columns
>    2. monitoring column value change range
>
>
> Thank you
> =
>
>
>
> On Wed, Mar 6, 2019 at 3:41 PM 736723...@qq.com <736723...@qq.com> wrote:
>
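> > Will the following requirements be considered in later Griffin
> > releases:
> >
> > Year-over-year/period-over-period monitoring of the values of one or
> > several columns of the same table.
> > Monitoring of column value ranges.
> >
> > Thanks~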
> >
> >
> >
> > 736723...@qq.com
> >
>


[jira] [Work logged] (GRIFFIN-237) [Jobs] Implement service method getting JobInstanceBean by JobInstanceBean id

2019-03-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-237?focusedWorklogId=210621&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-210621
 ]

ASF GitHub Bot logged work on GRIFFIN-237:
--

Author: ASF GitHub Bot
Created on: 10/Mar/19 03:31
Start Date: 10/Mar/19 03:31
Worklog Time Spent: 10m 
  Work Description: chemikadze commented on pull request #486: GRIFFIN-237 
Implement service method get JobInstanceBean by id
URL: https://github.com/apache/griffin/pull/486#discussion_r264022201
 
 

 ##
 File path: 
service/src/main/java/org/apache/griffin/core/job/repo/JobInstanceRepo.java
 ##
 @@ -34,6 +34,9 @@ Licensed to the Apache Software Foundation (ASF) under one
 
 JobInstanceBean findByPredicateName(String name);
 
+@Query("select s from JobInstanceBean s where s.id = ?1")
 
 Review comment:
   is there an index by `id` field?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 210621)
Time Spent: 20m  (was: 10m)

> [Jobs] Implement service method getting JobInstanceBean by JobInstanceBean id
> -
>
>                 Key: GRIFFIN-237
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-237
>             Project: Griffin (Incubating)
>          Issue Type: New Feature
>            Reporter: Dmitry Ershov
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Getting a particular entity by id is a very natural part of any API.
> Together with GRIFFIN-229 it would allow waiting for job completion,
> unlocking lots of potential use cases.
>  
> h2. Server
> Provide a RESTful GET method to  _/api/v1//jobs/instances/\{jobInstanceId}_ 
> returning the JobInstanceBean for the specified jobInstanceBean id.
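
As a rough illustration of the "wait for job completion" use case, a client
could poll the proposed endpoint until the instance reaches a final state.
The Python sketch below is hedged: the path follows the issue description
(written with a single slash), while the "state" field name, the set of
terminal state names and the helper names are assumptions, not a confirmed
Griffin API.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, FrozenSet

# Assumption: these terminal state names are illustrative only.
TERMINAL_STATES: FrozenSet[str] = frozenset({"SUCCESS", "DEAD", "KILLED"})


def http_get_json(url: str) -> Dict:
    """Fetch a URL and decode the JSON body (standard library only)."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_instance(base_url: str,
                      instance_id: int,
                      fetch: Callable[[str], Dict] = http_get_json,
                      interval_s: float = 10.0,
                      max_polls: int = 60,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll the job instance until its state is terminal; return that state.

    Raises TimeoutError if no terminal state is seen within `max_polls`.
    """
    url = f"{base_url}/api/v1/jobs/instances/{instance_id}"
    for attempt in range(max_polls):
        # Assumption: the response JSON carries a "state" field.
        state = fetch(url).get("state", "")
        if state in TERMINAL_STATES:
            return state
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"instance {instance_id} still not finished")
```

The fetch and sleep parameters are injectable so the loop can be exercised
without a running Griffin service.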



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GRIFFIN-229) [Jobs] trigger the job right now

2019-03-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-229?focusedWorklogId=210623&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-210623
 ]

ASF GitHub Bot logged work on GRIFFIN-229:
--

Author: ASF GitHub Bot
Created on: 10/Mar/19 03:38
Start Date: 10/Mar/19 03:38
Worklog Time Spent: 10m 
  Work Description: chemikadze commented on issue #485: GRIFFIN-229 trigger 
the job right now with fixing comments
URL: https://github.com/apache/griffin/pull/485#issuecomment-471244024
 
 
   LGTM
 



Issue Time Tracking
---

Worklog Id: (was: 210623)
Time Spent: 1.5h  (was: 1h 20m)

> [Jobs] trigger the job right now
> 
>
>                 Key: GRIFFIN-229
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-229
>             Project: Griffin (Incubating)
>          Issue Type: Improvement
>    Affects Versions: 0.5.0
>            Reporter: Zhen Li
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> h1. Add "trigger now" button on Jobs page
> h2. UI
> Add a "trigger now" button just after the "stop" button. When the user
> presses the "trigger now" button, it sends an HTTP GET request to the
> server, which triggers the job right away.
> h2. Server
> Provide a new RESTful API _/api/v1//jobs/trigger/\{id}_ that creates a
> start-now trigger to schedule the job.





[jira] [Work logged] (GRIFFIN-237) [Jobs] Implement service method getting JobInstanceBean by JobInstanceBean id

2019-03-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GRIFFIN-237?focusedWorklogId=210616&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-210616
 ]

ASF GitHub Bot logged work on GRIFFIN-237:
--

Author: ASF GitHub Bot
Created on: 10/Mar/19 00:29
Start Date: 10/Mar/19 00:29
Worklog Time Spent: 10m 
  Work Description: jhsb25 commented on pull request #486: GRIFFIN-237 
Implement service method get JobInstanceBean by id
URL: https://github.com/apache/griffin/pull/486
 
 
   Implement service method getting JobInstanceBean by JobInstanceBean id
 



Issue Time Tracking
---

Worklog Id: (was: 210616)
Time Spent: 10m
Remaining Estimate: 0h

> [Jobs] Implement service method getting JobInstanceBean by JobInstanceBean id
> -
>
>                 Key: GRIFFIN-237
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-237
>             Project: Griffin (Incubating)
>          Issue Type: New Feature
>            Reporter: Dmitry Ershov
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Getting a particular entity by id is a very natural part of any API.
> Together with GRIFFIN-229 it would allow waiting for job completion,
> unlocking lots of potential use cases.
>  
> h2. Server
> Provide a RESTful GET method to  _/api/v1//jobs/instances/\{jobInstanceId}_ 
> returning the JobInstanceBean for the specified jobInstanceBean id.


