Re: Ask for ARM CI for spark

2019-08-02 Thread bo zhaobo
Hi Team,

Any updates about the CI details? ;-)

Also, I will need your kind help with the Spark QA tests. Could anyone
tell us how those tests are triggered? When, and how? So far, I haven't
figured out how they work.

Thanks

Best Regards,

ZhaoBo




bo zhaobo wrote on Wed, Jul 31, 2019 at 11:56 AM:

> Hi, team.
> I want to run the same tests on ARM that the existing (x86) CI does. Since
> building and testing the whole Spark project takes a long time, I plan to
> split the work into multiple jobs to reduce the overall runtime. But I
> can't see what the existing CI [1] actually does (it calls many private
> scripts), so could any CI maintainer tell us how the jobs are split and
> what each of them does? For example, PR titles containing [SQL], [INFRA],
> [ML], [DOC], [CORE], [PYTHON], [k8s], [DSTREAMS], [MLlib], [SCHEDULER],
> [SS], [YARN], [BUILD], etc. each seem to trigger a different CI job.
>
> @shane knapp,
> Sorry for the disturbance. I noticed your email address is from
> 'berkeley.edu'; are you the right person to ask for help with this? ;-)
> If so, could you give us some help or advice? Thank you.
>
> Thank you very much,
>
> Best Regards,
>
> ZhaoBo
>
> [1] https://amplab.cs.berkeley.edu/jenkins
>
>
>
>
>
> Tianhua huang wrote on Mon, Jul 29, 2019 at 9:38 AM:
>
>> @Sean Owen Thank you very much. I saw your reply comment in
>> https://issues.apache.org/jira/browse/SPARK-28519; I will test with the
>> modification to see whether other similar tests fail, and will address
>> them together in one pull request.
>>
>> On Sat, Jul 27, 2019 at 9:04 PM Sean Owen  wrote:
>>
>>> Great thanks - we can take this to JIRAs now.
>>> I think it's worth changing the implementation of atanh if the test
>>> value just reflects what Spark does, and there's evidence it's a little
>>> bit inaccurate.
>>> There's an equivalent formula which seems to have better accuracy.
>>>
>>> On Fri, Jul 26, 2019 at 10:02 PM Takeshi Yamamuro 
>>> wrote:
>>>
 Hi, all,

 FYI:
 >> @Yuming Wang the results in float8.sql are from PostgreSQL directly?
 >> Interesting if it also returns the same less accurate result, which
 >> might suggest it's more to do with underlying OS math libraries. You
 >> noted that these tests sometimes gave platform-dependent differences
 >> in the last digit, so wondering if the test value directly reflects
 >> PostgreSQL or just what we happen to return now.

 The results in float8.sql.out were recomputed in Spark/JVM.
 The expected output of the PostgreSQL test is here:
 https://github.com/postgres/postgres/blob/master/src/test/regress/expected/float8.out#L493

 As you can see in that file (float8.out), results other than atanh also
 differ between Spark/JVM and PostgreSQL.
 For example, the values of acosh are:
 -- PostgreSQL

 https://github.com/postgres/postgres/blob/master/src/test/regress/expected/float8.out#L487
 1.31695789692482

 -- Spark/JVM

 https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/pgSQL/float8.sql.out#L523
 1.3169578969248166

 btw, the PostgreSQL implementation for atanh just calls atanh in
 math.h:

 https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/float.c#L2606
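
 Just to illustrate the kind of last-digit difference being discussed, here is
 a small Scala sketch comparing two algebraically equivalent forms of atanh
 (plain JVM math; not the actual Spark or PostgreSQL implementation):

    // Two algebraically equivalent ways to compute atanh(x) in plain JVM
    // double arithmetic (illustrative only; not Spark's or PostgreSQL's code).
    def atanhViaLog(x: Double): Double   = 0.5 * math.log((1.0 + x) / (1.0 - x))
    def atanhViaLog1p(x: Double): Double = 0.5 * (math.log1p(x) - math.log1p(-x))

    // The two forms can differ in the trailing digits; the log1p form is
    // generally the more accurate one, especially for x close to zero.
    Seq(0.5, 1e-8, 0.99).foreach { x =>
      println(f"x=$x  log=${atanhViaLog(x)}%.17g  log1p=${atanhViaLog1p(x)}%.17g")
    }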

 Bests,
 Takeshi




Re: Recognizing non-code contributions

2019-08-02 Thread Sean Owen
Yes, there's an interesting idea that came up on members@: should
there be a status in Spark that doesn't include the commit bit or
additional 'rights', but is formally recognized by the PMC? An MVP,
VIP, Knight of the Apache Foo project. I don't think any other project
does this, but I don't think it's _prohibited_ either. This could recognize
long-time contributors of any kind (code, docs, external) for whom it
isn't yet right to make them a committer. The point would be recognition,
because that's really all the project can formally offer. I personally am
not sure it adds enough to justify the process, and it may wade too deeply
into controversies about whether this is just extra gatekeeping or
something genuinely helpful.

On Thu, Aug 1, 2019 at 11:09 PM Sean Owen  wrote:
>
> (Let's move this thread to dev@ now as it is a general and important
> community question. This was requested on members@)
>
> On Thu, Aug 1, 2019 at 10:20 PM Matei Zaharia  wrote:
> >
> > Our text on becoming a committer already says that we want committers who 
> > focus on our docs: https://spark.apache.org/committers.html. Just working 
> > on docs / books outside the project doesn’t make as much sense IMO.
> >
> > Matei
> >
> > On Aug 1, 2019, at 8:10 PM, Reynold Xin  wrote:
> >
> > Not sure if ASF allows this but we can start some sort of “MVP” program 
> > recognizing non-code contributors...




Re: Ask for ARM CI for spark

2019-08-02 Thread shane knapp
i'm out of town, but will answer some of your questions next week.

On Fri, Aug 2, 2019 at 2:39 AM bo zhaobo 
wrote:

>
> Hi Team,
>
> Any updates about the CI details? ;-)
>
> Also, I will need your kind help with the Spark QA tests. Could anyone
> tell us how those tests are triggered? When, and how? So far, I haven't
> figured out how they work.
>
> Thanks
>
> Best Regards,
>
> ZhaoBo
>
>
>
>
> bo zhaobo wrote on Wed, Jul 31, 2019 at 11:56 AM:
>
>> Hi, team.
>> I want to run the same tests on ARM that the existing (x86) CI does. Since
>> building and testing the whole Spark project takes a long time, I plan to
>> split the work into multiple jobs to reduce the overall runtime. But I
>> can't see what the existing CI [1] actually does (it calls many private
>> scripts), so could any CI maintainer tell us how the jobs are split and
>> what each of them does? For example, PR titles containing [SQL], [INFRA],
>> [ML], [DOC], [CORE], [PYTHON], [k8s], [DSTREAMS], [MLlib], [SCHEDULER],
>> [SS], [YARN], [BUILD], etc. each seem to trigger a different CI job.
>>
>> @shane knapp,
>> Sorry for the disturbance. I noticed your email address is from
>> 'berkeley.edu'; are you the right person to ask for help with this? ;-)
>> If so, could you give us some help or advice? Thank you.
>>
>> Thank you very much,
>>
>> Best Regards,
>>
>> ZhaoBo
>>
>> [1] https://amplab.cs.berkeley.edu/jenkins
>>
>>
>>
>>
>>
>> Tianhua huang wrote on Mon, Jul 29, 2019 at 9:38 AM:
>>
>>> @Sean Owen Thank you very much. I saw your reply comment in
>>> https://issues.apache.org/jira/browse/SPARK-28519; I will test with the
>>> modification to see whether other similar tests fail, and will address
>>> them together in one pull request.
>>>
>>> On Sat, Jul 27, 2019 at 9:04 PM Sean Owen  wrote:
>>>
 Great thanks - we can take this to JIRAs now.
 I think it's worth changing the implementation of atanh if the test
 value just reflects what Spark does, and there's evidence it's a little
 bit inaccurate.
 There's an equivalent formula which seems to have better accuracy.

 On Fri, Jul 26, 2019 at 10:02 PM Takeshi Yamamuro <
 linguin@gmail.com> wrote:

> Hi, all,
>
> FYI:
> >> @Yuming Wang the results in float8.sql are from PostgreSQL directly?
> >> Interesting if it also returns the same less accurate result, which
> >> might suggest it's more to do with underlying OS math libraries. You
> >> noted that these tests sometimes gave platform-dependent differences
> >> in the last digit, so wondering if the test value directly reflects
> >> PostgreSQL or just what we happen to return now.
>
> The results in float8.sql.out were recomputed in Spark/JVM.
> The expected output of the PostgreSQL test is here:
> https://github.com/postgres/postgres/blob/master/src/test/regress/expected/float8.out#L493
>
> As you can see in that file (float8.out), results other than atanh also
> differ between Spark/JVM and PostgreSQL.
> For example, the values of acosh are:
> -- PostgreSQL
>
> https://github.com/postgres/postgres/blob/master/src/test/regress/expected/float8.out#L487
> 1.31695789692482
>
> -- Spark/JVM
>
> https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/pgSQL/float8.sql.out#L523
> 1.3169578969248166
>
> btw, the PostgreSQL implementation for atanh just calls atanh in
> math.h:
>
> https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/float.c#L2606
>
> Bests,
> Takeshi
>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


DataSourceV2 : Transactional Write support

2019-08-02 Thread Shiv Prashant Sood
 All,

I understand that DataSourceV2 supports transactional writes, and I wanted to
implement that in the JDBC DataSource V2 connector (PR#25211).

I don't see how this is feasible for a JDBC-based connector. The framework
suggests that each EXECUTOR send a commit message to the DRIVER, and that the
actual commit only be done by the DRIVER after receiving all commit
confirmations. This will not work for JDBC, as commits have to happen on the
JDBC Connection, which is maintained by the EXECUTORS, and a JDBC Connection
is not serializable, so it cannot be sent to the DRIVER.
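
To make the flow concrete, here is a paraphrased sketch of the commit protocol
I am describing (names simplified for illustration; they are not the exact
Spark V2 interfaces):

    // Paraphrased shape of the V2 write/commit flow (illustrative names only).
    trait CommitMessage extends Serializable

    // Runs on each EXECUTOR: writes one partition, then sends a small,
    // serializable message back to the driver describing what was written.
    trait PartitionWriter[T] {
      def write(record: T): Unit
      def commit(): CommitMessage   // task-level "I'm done" message
      def abort(): Unit
    }

    // Runs on the DRIVER: the job-level commit happens only after the
    // driver has collected a CommitMessage from every task.
    trait JobCommitter {
      def commit(messages: Array[CommitMessage]): Unit
      def abort(messages: Array[CommitMessage]): Unit
    }

The catch for JDBC, as described above, is that the only thing that can
actually commit the data is the JDBC Connection living inside each executor,
so there is nothing transactional left for the driver-side commit to do.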

Am I right in thinking that this cannot be supported for JDBC? My goal is
to either fully write or fully roll back the DataFrame write operation.

Thanks in advance for your help.

Regards,
Shiv


Python API for mapGroupsWithState

2019-08-02 Thread Nicholas Chammas
Can someone succinctly describe the challenge in adding the
`mapGroupsWithState()` API to PySpark?

I was hoping for some suboptimal but nonetheless working solution to be
available in Python, as there is with Python UDFs for example, but that
doesn't seem to be the case. The JIRA ticket for arbitrary stateful
operations in Structured Streaming doesn't give any indication that a
Python version of the API is coming.
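
For context, the Scala API in question looks roughly like the toy sketch below:
a per-key running count kept in arbitrary state (the rate source and schema are
placeholders, not a real pipeline):

    import org.apache.spark.sql.{Dataset, SparkSession}
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

    case class Event(user: String, action: String)
    case class UserCount(user: String, count: Long)

    val spark = SparkSession.builder.appName("mgws-sketch").getOrCreate()
    import spark.implicits._

    // Placeholder streaming source; in practice this would be Kafka, files, etc.
    val events: Dataset[Event] = spark.readStream
      .format("rate").load()
      .selectExpr("CAST(value AS STRING) AS user", "'click' AS action")
      .as[Event]

    // Arbitrary per-key state: keep a running count per user across triggers.
    val counts: Dataset[UserCount] = events
      .groupByKey(_.user)
      .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
        (user: String, rows: Iterator[Event], state: GroupState[Long]) =>
          val newCount = state.getOption.getOrElse(0L) + rows.size
          state.update(newCount) // persisted by Spark between micro-batches
          UserCount(user, newCount)
      }

(In a streaming query this result is then written with the Update output mode.)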

Is this something that will likely be added in the near future, or is it a
major undertaking? Can someone briefly describe the problem?

Nick


Re: [Discuss] Follow ANSI SQL on table insertion

2019-08-02 Thread Matt Cheah
I agree that having both modes and letting users choose the one they want is
the best option (I don't see big arguments against this, honestly). Once we
have that, I don't see big differences in what the default is. What I think we
still have to work on is going ahead with the "strict mode" work and providing
a more convenient way for users to switch between the two options. I mean:
currently we have one flag for throwing an exception on overflow for
operations on decimals, one for doing the same for operations on other data
types, and going forward we will probably have more. I think in the end we
will need to collect them all under an "umbrella" flag which lets users simply
switch between strict and non-strict mode. I also think we will need to
document this very well and give it particular attention in our docs, maybe
with a dedicated section, in order to give end users enough visibility into it.

 

I’m +1 on adding a strict mode flag this way, but I’m undecided on whether or 
not we want a separate flag for each of the arithmetic overflow situations that 
could produce invalid results. My intuition is yes, because different users 
have different levels of tolerance for different kinds of errors. I’d expect 
these sorts of configurations to be set up at an infrastructure level, e.g. to 
maintain consistent standards throughout a whole organization.

 

From: Gengliang Wang 
Date: Thursday, August 1, 2019 at 3:07 AM
To: Marco Gaido 
Cc: Wenchen Fan , Hyukjin Kwon , 
Russell Spitzer , Ryan Blue , 
Reynold Xin , Matt Cheah , Takeshi 
Yamamuro , Spark dev list 
Subject: Re: [Discuss] Follow ANSI SQL on table insertion

 

Hi all,

 

Let me explain the proposal a little bit.

By default, we follow the store assignment rules in table insertion. On an
invalid cast, the result is null. This is better than the behavior in Spark
2.x while keeping backward compatibility.

If users can't tolerate the silent corruption, they can enable the new mode,
which throws runtime exceptions.

The proposal itself is quite complete. It satisfies different users to some 
degree.

 

It is hard to avoid null in data processing anyway. For example, 

> select 2147483647 + 1

2147483647 is the max value of Int, and the result data type of adding two
integers is supposed to be Integer type. Since the value of (2147483647 + 1)
can't fit into an Int, Spark has to either return null or throw a runtime
exception in such a case. (Someone could argue that we can always convert the
result to a wider type, but that's another topic about performance and DBMS
behaviors.)
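
For reference, this is how the same overflow looks at the JVM level in plain
Scala (illustrative only, not Spark internals): primitive arithmetic wraps
silently, Math.addExact raises, and the proposal is about whether Spark
surfaces such cases as null or as a runtime exception.

    // Plain JVM Int arithmetic wraps around silently on overflow:
    val wrapped: Int = Int.MaxValue + 1      // -2147483648 (Int.MinValue)

    // java.lang.Math.addExact throws instead of wrapping:
    val checked: Option[Int] =
      try Some(Math.addExact(Int.MaxValue, 1))
      catch { case _: ArithmeticException => None }  // roughly the "null" option

    println(s"wrapped=$wrapped checked=$checked")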

 

So, given a table t with an Int column, checking the data type with Up-Cast
can't avoid possible null values in the following SQL, as the result data type
of (int_column_a + int_column_b) is Int:

>  insert into t select int_column_a + int_column_b from tbl_a, tbl_b;

 

Furthermore, if Spark uses Up-Cast and a user's existing ETL job fails because
of that, what should they do? I think they will first try adding casts to
their queries. Maybe a project to unify the data schemas across all data
sources has to be done later on, if they have enough resources. The upgrade
can be painful because of the strict rules of Up-Cast, while the user's
scenario might well tolerate converting Double to Decimal, or Timestamp to Date.

 

 

Gengliang

 

On Thu, Aug 1, 2019 at 4:55 PM Marco Gaido  wrote:

Hi all, 

 

I agree that having both modes and letting users choose the one they want is
the best option (I don't see big arguments against this, honestly). Once we
have that, I don't see big differences in what the default is. What I think we
still have to work on is going ahead with the "strict mode" work and providing
a more convenient way for users to switch between the two options. I mean:
currently we have one flag for throwing an exception on overflow for
operations on decimals, one for doing the same for operations on other data
types, and going forward we will probably have more. I think in the end we
will need to collect them all under an "umbrella" flag which lets users simply
switch between strict and non-strict mode. I also think we will need to
document this very well and give it particular attention in our docs, maybe
with a dedicated section, in order to give end users enough visibility into it.
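
To make that concrete, something along these lines in spark-shell, where all of
the config names below are purely illustrative (they are not existing Spark
flags):

    // Hypothetical flag names, for illustration only; not real Spark configs.
    // Today-ish: one switch per overflow case...
    spark.conf.set("spark.sql.decimalOperations.throwOnOverflow", "true")
    spark.conf.set("spark.sql.integerOperations.throwOnOverflow", "true")

    // ...versus one umbrella switch between strict and non-strict behavior:
    spark.conf.set("spark.sql.strictMode.enabled", "true")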

 

Thanks,

Marco

 

On Thu, Aug 1, 2019 at 09:42, Wenchen Fan wrote:

Hi Hyukjin, I think no one here is against the SQL standard behavior, which is
no corrupted data plus a runtime exception. IIUC the main argument here is:
shall we still keep the existing "return null for invalid operations" behavior
as the default?

 

A traditional RDBMS is usually used as the final destination of CLEAN data.
It's understandable that such systems need high data quality and try their
best to avoid corrupted data at any cost.

 

However, Spark is different. AFAIK Spark is usually used as an ETL tool, which
needs to deal with DIRTY data.

Re: DataSourceV2 : Transactional Write support

2019-08-02 Thread Matt Cheah
Can we check that the latest staging APIs work for the JDBC use case in a 
single transactional write? See 
https://github.com/apache/spark/pull/24798/files#diff-c9d2f9c9d20452939b7c28ebdae0503dR53
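
For anyone following along, the staging idea in that PR is roughly shaped like
this (paraphrased and heavily simplified; the linked diff has the real
interfaces and signatures):

    import org.apache.spark.sql.types.StructType

    // Paraphrased sketch of the staged-write idea (not the exact interfaces).
    trait StagedTable {
      // Data is written somewhere not yet visible to readers...
      def commitStagedChanges(): Unit   // ...and becomes visible atomically here
      def abortStagedChanges(): Unit
    }

    trait StagingTableCatalog {
      // The table only "exists" for readers once the staged changes commit.
      def stageCreate(name: String, schema: StructType): StagedTable
    }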

 

But I'll also acknowledge that transactions in the more traditional RDBMS sense
tend to have pretty specific semantics that we don't support in the V2 API. For
example, one cannot commit multiple write operations in a single transaction
right now. That would require changes to the DDL and a pretty substantial
change to the design of Spark SQL more broadly.

 

-Matt Cheah

 

From: Shiv Prashant Sood 
Date: Friday, August 2, 2019 at 12:56 PM
To: Spark Dev List 
Subject: DataSourceV2 : Transactional Write support

 

All,

 

I understand that DataSourceV2 supports transactional writes, and I wanted to
implement that in the JDBC DataSource V2 connector (PR#25211 [github.com]).

 

I don't see how this is feasible for a JDBC-based connector. The framework
suggests that each EXECUTOR send a commit message to the DRIVER, and that the
actual commit only be done by the DRIVER after receiving all commit
confirmations. This will not work for JDBC, as commits have to happen on the
JDBC Connection, which is maintained by the EXECUTORS, and a JDBC Connection
is not serializable, so it cannot be sent to the DRIVER.

 

Am I right in thinking that this cannot be supported for JDBC? My goal is to
either fully write or fully roll back the DataFrame write operation.

 

Thanks in advance for your help.

 

Regards, 

Shiv





Re: DataSourceV2 : Transactional Write support

2019-08-02 Thread Jungtaek Lim
I asked a similar question for end-to-end exactly-once with Kafka, and you're
correct: distributed transactions are not supported. Introducing a distributed
transaction protocol like two-phase commit would require huge changes to the
Spark codebase, and the feedback was not positive.

What you could try instead is an intermediate output: insert into a temporary
(staging) table from the executors, then move the inserted records into the
final table from the driver (that step must be atomic). A rough sketch of the
idea follows.
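
This is illustrative only: the table names, URL, and connection properties are
placeholders, DDL transactionality varies by database, and cleanup/retry
handling is omitted.

    import java.sql.DriverManager
    import java.util.Properties

    // Assumed to be defined by the caller: df (the DataFrame to write),
    // jdbcUrl, and connProps (a Properties with user/password etc.).

    // 1) Executors write the DataFrame into a staging table over JDBC.
    df.write
      .mode("overwrite")
      .jdbc(jdbcUrl, "my_table_staging", connProps)

    // 2) The driver moves the staged rows into the final table in a single
    //    JDBC transaction. DELETE is used instead of DROP because DDL is not
    //    transactional on every database.
    val conn = DriverManager.getConnection(jdbcUrl, connProps)
    try {
      conn.setAutoCommit(false)
      val stmt = conn.createStatement()
      stmt.executeUpdate("INSERT INTO my_table SELECT * FROM my_table_staging")
      stmt.executeUpdate("DELETE FROM my_table_staging")
      conn.commit()
    } catch {
      case e: Exception =>
        conn.rollback()
        throw e
    } finally {
      conn.close()
    }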

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sat, Aug 3, 2019 at 4:56 AM Shiv Prashant Sood 
wrote:

> All,
>
> I understand that DataSourceV2 supports transactional writes, and I wanted
> to implement that in the JDBC DataSource V2 connector (PR#25211).
>
> I don't see how this is feasible for a JDBC-based connector. The framework
> suggests that each EXECUTOR send a commit message to the DRIVER, and that
> the actual commit only be done by the DRIVER after receiving all commit
> confirmations. This will not work for JDBC, as commits have to happen on the
> JDBC Connection, which is maintained by the EXECUTORS, and a JDBC Connection
> is not serializable, so it cannot be sent to the DRIVER.
>
> Am I right in thinking that this cannot be supported for JDBC? My goal is
> to either fully write or fully roll back the DataFrame write operation.
>
> Thanks in advance for your help.
>
> Regards,
> Shiv
>


-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior