DataSourceV2 sync notes - 10 July 2019

2019-07-19 Thread Ryan Blue
Here are my notes from the last sync. If you’d like to be added to the
invite or have topics, please let me know.

*Attendees*:

Ryan Blue
Matt Cheah
Yifei Huang
Jose Torres
Burak Yavuz
Gengliang Wang
Michael Artz
Russel Spitzer

*Topics*:

- Existing PRs
  - V2 session catalog: https://github.com/apache/spark/pull/24768
  - REPLACE and RTAS: https://github.com/apache/spark/pull/24798
  - DESCRIBE TABLE: https://github.com/apache/spark/pull/25040
  - ALTER TABLE: https://github.com/apache/spark/pull/24937
  - INSERT INTO: https://github.com/apache/spark/pull/24832
- Stats integration
- CTAS and DataFrameWriter behavior

*Discussion*:

- ALTER TABLE PR is ready to commit (and was committed after the sync)
- REPLACE and RTAS PR: waiting on more reviews
- INSERT INTO PR: Ryan will review
- DESCRIBE TABLE has test failures; Matt will fix
- V2 session catalog:
  - How will the v2 catalog be configured?
  - Ryan: This is up for discussion because it currently uses a table
    property. I think it needs to be configurable.
  - Burak: Agree that it should be configurable.
  - Ryan: Does this need to be determined now, or can we solve this
    after getting the functionality in?
  - Jose: let's get it in and fix it later.
- Stats integration (see the sketch after these notes):
  - Matt: has anyone looked at stats integration? What needs to be done?
  - Ryan: stats are part of the Scan API. Configure a scan with a
    ScanBuilder and then get stats from it. The problem is that this
    happens when converting to the physical plan, after the optimizer,
    but the optimizer determines what gets broadcast. A work-around
    Netflix uses is to run push-down in the stats code. This runs
    push-down twice and was rejected from Spark, but is important for
    performance. We should add a property to enable this.
  - Ryan: The larger problem is that stats are used in the optimizer,
    but push-down happens when converting to the physical plan. This is
    also related to our earlier discussions about when join types are
    chosen. Fixing this is a big project.
- CTAS and DataFrameWriter behavior
  - Burak: DataFrameWriter uses CTAS where it shouldn't. It is
    difficult to predict v1 behavior.
  - Ryan: Agree, the v1 DataFrameWriter does not have clear behavior. We
    suggest a replacement with a clear verb for each SQL action:
    append/insert, overwrite, overwriteDynamic, create (table), and
    replace (table). (A usage sketch follows below.)
  - Ryan: Prototype available here:
    https://gist.github.com/rdblue/6bc140a575fdf266beb2710ad9dbed8f
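
For reference, a minimal Scala sketch of the stats flow Ryan describes.
This is illustrative only: the helper is hypothetical, and the interface
names follow what eventually landed in Spark 3.x under
org.apache.spark.sql.connector.read (the package layout on master at the
time of this sync differed).

    import java.util.OptionalLong
    import org.apache.spark.sql.connector.read.{ScanBuilder, SupportsReportStatistics}

    // Hypothetical helper: build the scan and ask it for statistics.
    // estimateStatistics() is only meaningful once filters and projections
    // have been pushed into the builder, which is why calling this from
    // optimizer-side stats code means running push-down a second time.
    def scanSizeInBytes(builder: ScanBuilder): OptionalLong =
      builder.build() match {
        case s: SupportsReportStatistics => s.estimateStatistics().sizeInBytes()
        case _                           => OptionalLong.empty()
      }

The broadcast problem above is exactly this ordering: the optimizer needs
sizeInBytes to pick broadcast joins, but the scan is normally built only
when converting to the physical plan.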
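
And a sketch of the verb-per-action writer being proposed. Method names
are shown as they later shipped in Spark 3.0's DataFrameWriterV2 (the
proposal's overwriteDynamic landed as overwritePartitions); the catalog
and table names are placeholders. See Ryan's gist above for the actual
prototype.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("writeTo-sketch").getOrCreate()
    val df = spark.range(100).withColumn("day", col("id") % 7)

    // One explicit verb per SQL action, so behavior is predictable:
    df.writeTo("catalog.db.events").append()                     // INSERT INTO
    df.writeTo("catalog.db.events").overwrite(col("day") === 1)  // overwrite rows matching a filter
    df.writeTo("catalog.db.events").overwritePartitions()        // dynamic partition overwrite
    df.writeTo("catalog.db.events").create()                     // CTAS
    df.writeTo("catalog.db.events").replace()                    // RTAS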

-- 
Ryan Blue
Software Engineer
Netflix


Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2019-07-19 Thread Hyukjin Kwon
That's a great explanation. Thanks, I didn't know that.

Josh, do you know who I should ping on this?

On Fri, 19 Jul 2019 at 16:52, Dongjoon Hyun wrote:

> Hi, Hyukjin.
>
> In short, there are two bots, and the current situation happens when only
> the `dev/github_jira_sync.py` bot is working.
>
> `dev/github_jira_sync.py` is irrelevant to the JIRA status change
> because it only uses the `add_remote_link` and `add_comment` APIs.
> This is the only bot I know of in the Apache Spark repository.
>
> AFAIK, `dev/github_jira_sync.py`'s activity is done under the JIRA ID
> `githubbot` (Name: `ASF GitHub Bot`), and the other bot's activity is
> done under the JIRA ID `apachespark` (Name: `Apache Spark`).
> The other bot is the one Josh mentioned before (in the
> `databricks/spark-pr-dashboard` repo).
>
> The root cause is likely the same: the API key used by the bot is
> rejected by Apache JIRA and the request is redirected to a CAPTCHA.
>
> Bests,
> Dongjoon.
>
> On Thu, Jul 18, 2019 at 8:24 PM, Hyukjin Kwon wrote:
>
>> Hi all,
>>
>> This issue seems to be happening again: the PR link is properly
>> created in the corresponding JIRA issue, but the status doesn't change
>> from OPEN to IN-PROGRESS.
>>
>> See, for instance,
>>
>> https://issues.apache.org/jira/browse/SPARK-28443
>> https://issues.apache.org/jira/browse/SPARK-28440
>> https://issues.apache.org/jira/browse/SPARK-28436
>> https://issues.apache.org/jira/browse/SPARK-28434
>> https://issues.apache.org/jira/browse/SPARK-28433
>> https://issues.apache.org/jira/browse/SPARK-28431
>>
>> Josh and Dongjoon, do you guys maybe have any idea?
>>
>> On Thu, Apr 25, 2019 at 3:09 PM, Hyukjin Kwon wrote:
>>
>>> Thank you so much Josh .. !!
>>>
>>> On Thu, Apr 25, 2019 at 3:04 PM, Josh Rosen wrote:
>>>
 The code for this runs in http://spark-prs.appspot.com (see
 https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L137
 )

 I checked the AppEngine logs and it looks like we're getting error
 responses, possibly due to a credentials issue:

> Exception when starting progress on JIRA issue SPARK-27355
> (/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py:142)
> Traceback (most recent call last):
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py", line 138, in update_pr
>     start_issue_progress("%s-%s" % (app.config['JIRA_PROJECT'], issue_number))
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/jira_api.py", line 27, in start_issue_progress
>     jira_client = get_jira_client()
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/jira_api.py", line 18, in get_jira_client
>     app.config['JIRA_PASSWORD']))
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py", line 472, in __init__
>     si = self.server_info()
>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py", line 2133, in server_info
>     j = self._get_json('serverInfo')
>   File

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2019-07-19 Thread Dongjoon Hyun
Hi, Hyukjin.

In short, there are two bots, and the current situation happens when only
the `dev/github_jira_sync.py` bot is working.

`dev/github_jira_sync.py` is irrelevant to the JIRA status change because
it only uses the `add_remote_link` and `add_comment` APIs.
This is the only bot I know of in the Apache Spark repository.

AFAIK, `dev/github_jira_sync.py`'s activity is done under the JIRA ID
`githubbot` (Name: `ASF GitHub Bot`), and the other bot's activity is done
under the JIRA ID `apachespark` (Name: `Apache Spark`).
The other bot is the one Josh mentioned before (in the
`databricks/spark-pr-dashboard` repo).

The root cause is likely the same: the API key used by the bot is rejected
by Apache JIRA and the request is redirected to a CAPTCHA.

Bests,
Dongjoon.

On Thu, Jul 18, 2019 at 8:24 PM, Hyukjin Kwon wrote:

> Hi all,
>
> This issue seems to be happening again: the PR link is properly created
> in the corresponding JIRA issue, but the status doesn't change from OPEN
> to IN-PROGRESS.
>
> See, for instance,
>
> https://issues.apache.org/jira/browse/SPARK-28443
> https://issues.apache.org/jira/browse/SPARK-28440
> https://issues.apache.org/jira/browse/SPARK-28436
> https://issues.apache.org/jira/browse/SPARK-28434
> https://issues.apache.org/jira/browse/SPARK-28433
> https://issues.apache.org/jira/browse/SPARK-28431
>
> Josh and Dongjoon, do you guys maybe have any idea?
>
> On Thu, Apr 25, 2019 at 3:09 PM, Hyukjin Kwon wrote:
>
>> Thank you so much Josh .. !!
>>
>> On Thu, Apr 25, 2019 at 3:04 PM, Josh Rosen wrote:
>>
>>> The code for this runs in http://spark-prs.appspot.com (see
>>> https://github.com/databricks/spark-pr-dashboard/blob/1e799c9e510fa8cdc9a6c084a777436bebeabe10/sparkprs/controllers/tasks.py#L137
>>> )
>>>
>>> I checked the AppEngine logs and it looks like we're getting error
>>> responses, possibly due to a credentials issue:
>>>
>>>> Exception when starting progress on JIRA issue SPARK-27355
>>>> (/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py:142)
>>>> Traceback (most recent call last):
>>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/controllers/tasks.py", line 138, in update_pr
>>>>     start_issue_progress("%s-%s" % (app.config['JIRA_PROJECT'], issue_number))
>>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/jira_api.py", line 27, in start_issue_progress
>>>>     jira_client = get_jira_client()
>>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/sparkprs/jira_api.py", line 18, in get_jira_client
>>>>     app.config['JIRA_PASSWORD']))
>>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py", line 472, in __init__
>>>>     si = self.server_info()
>>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py", line 2133, in server_info
>>>>     j = self._get_json('serverInfo')
>>>>   File "/base/data/home/apps/s~spark-prs/live.412416057856832734/lib/jira/client.py", line 2549

Spark and Oozie

2019-07-19 Thread Dennis Suhari


Dear experts,

I am using Spark to process data from HDFS (Hadoop). These Spark
applications are data pipelines, data wrangling, and machine learning
applications, and Spark submits its jobs via YARN. This works well.
For scheduling I am now trying to use Apache Oozie, but I am seeing a
performance impact: a Spark job that takes 44 seconds when submitted via
the CLI now takes nearly 3 minutes.

Have you had similar experiences using Oozie to schedule Spark
application jobs? What alternative workflow tools are you using for
scheduling Spark jobs on Hadoop?


Br,

Dennis

Sent from my iPhone

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org