Re: Moving forward with the timestamp proposal

2019-02-20 Thread Wenchen Fan
I think this is the right direction to go, but I'm wondering how Spark can
support these new types if the underlying data sources (like Parquet files)
do not support them yet.

I took a quick look at the new doc for file formats, but I'm not sure what
the proposal is. Are we going to implement these new types in Parquet/ORC
first? Or are we going to use low-level physical types directly and add
Spark-specific metadata to Parquet/ORC files?

On Wed, Feb 20, 2019 at 10:57 PM Zoltan Ivanfi 
wrote:

> Hi,
>
> Last December we shared a timestamp harmonization proposal
> with the Hive, Spark and Impala communities. This
> was followed by an extensive discussion in January that led to various
> updates and improvements to the proposal, as well as the creation of a new
> document for file format components. February has been quiet regarding this
> topic, and the latest revision of the proposal has been stable in recent
> weeks.
>
> In short, the following is being proposed (please see the document for
> details):
>
>- The TIMESTAMP WITHOUT TIME ZONE type should have LocalDateTime
>semantics.
>- The TIMESTAMP WITH LOCAL TIME ZONE type should have Instant
>semantics.
>- The TIMESTAMP WITH TIME ZONE type should have OffsetDateTime
>semantics.
>
> This proposal is in accordance with the SQL standard and many major DB
> engines.
>
> Based on the feedback we got I believe that the latest revision of the
> proposal addresses the needs of all affected components, therefore I would
> like to move forward and create JIRA-s and/or roadmap documentation pages
> for the desired semantics of the different SQL types according to the
> proposal.
>
> Please let me know if you have any remaining concerns about the proposal or
> about the course of action outlined above.
>
> Thanks,
>
> Zoltan
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-20 Thread Felix Cheung
Could you hold for a bit - I have one more fix to get in.



From: d_t...@apple.com on behalf of DB Tsai 
Sent: Wednesday, February 20, 2019 12:25 PM
To: Spark dev list
Cc: Cesar Delgado
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.

DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc

> On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin  
> wrote:
>
> Just wanted to point out that
> https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
> and is marked as a correctness bug. (The fix is in the 2.4 branch,
> just not in rc2.)
>
> On Wed, Feb 20, 2019 at 12:07 PM DB Tsai  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 2.4.1.
>>
>> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes are 
>> cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.1-rc2 (commit 
>> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
>> https://github.com/apache/spark/tree/v2.4.1-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1299/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
>>
>> The list of bug fixes going into 2.4.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env, install
>> the current RC, and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.1?
>> ===
>>
>> The current list of open tickets targeted at 2.4.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 2.4.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Unsubscribe

2019-02-20 Thread William Shen
Please send an email to dev-unsubscr...@spark.apache.org to unsubscribe.
You should receive an email with instructions to confirm the unsubscribe.

On Wed, Feb 20, 2019 at 3:58 PM Reena Agrawal  wrote:

> Unsubscribe pls.
>


Unsubscribe

2019-02-20 Thread Reena Agrawal
Unsubscribe pls.


Re: Thoughts on dataframe cogroup?

2019-02-20 Thread Li Jin
Alessandro,

Thanks for the reply. I assume by "equi-join" you mean "equality full
outer join".

Two issues I see with an equi outer join are:
(1) an equi outer join will give n * m rows for each key (n and m being the
corresponding numbers of rows in df1 and df2 for that key), and
(2) the user needs to do extra processing to transform the n * m rows back
into the desired shape (two sub-dataframes with n and m rows).

I think a full outer join is an inefficient way to implement cogroup. If the
end goal is to have two separate dataframes for each key, why join them
first only to un-join them afterwards?
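
To make the contrast concrete, here is a minimal, self-contained PySpark
sketch (the data and column names are made up for illustration, and the
desired DataFrame cogroup API above is still hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("cogroup-vs-join").getOrCreate()

df1 = spark.createDataFrame([("k1", 1), ("k1", 2), ("k2", 3)], ["key", "v1"])
df2 = spark.createDataFrame([("k1", 10), ("k1", 20), ("k1", 30)], ["key", "v2"])

# Full outer join: a key present in both frames blows up to n * m rows that
# later have to be re-split into the two original groups.
df1.join(df2, on="key", how="full_outer").show()  # "k1" appears 2 * 3 = 6 times

# RDD cogroup: each key yields exactly one record holding the two groups side by side.
pairs = df1.rdd.map(lambda r: (r["key"], r["v1"])).cogroup(
    df2.rdd.map(lambda r: (r["key"], r["v2"])))
for key, (left, right) in pairs.collect():
    print(key, list(left), list(right))  # e.g. k1 [1, 2] [10, 20, 30]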



On Wed, Feb 20, 2019 at 5:52 AM Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Hello,
> I fail to see how an equi-join on the key columns is different from the
> cogroup you propose.
>
> I think the accepted answer can shed some light:
>
> https://stackoverflow.com/questions/43960583/whats-the-difference-between-join-and-cogroup-in-apache-spark
>
> Now you apply a udf on each iterable, one per key value (obtained with
> cogroup).
>
> You can achieve the same by:
> 1) join df1 and df2 on the key you want,
> 2) apply "groupby" on that key,
> 3) finally apply a udaf (you can have a look here if you are not familiar
> with them
> https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), that
> will process each group "in isolation".
>
> HTH,
> Alessandro
>
> On Tue, 19 Feb 2019 at 23:30, Li Jin  wrote:
>
>> Hi,
>>
>> We have been using PySpark's groupby().apply() quite a bit and it has
>> been very helpful in integrating Spark with our existing pandas-heavy
>> libraries.
>>
>> Recently, we have found more and more cases where groupby().apply() is
>> not sufficient: in some cases, we want to group two dataframes by the same
>> key and apply a function which takes two pd.DataFrames (and also returns a
>> pd.DataFrame) for each key. This feels very much like the "cogroup"
>> operation in the RDD API.
>>
>> It would be great to be able to do something like this (not actual API, just
>> to explain the use case):
>>
>> @pandas_udf(return_schema, ...)
>> def my_udf(pdf1, pdf2):
>>  # pdf1 and pdf2 are the subsets of the original dataframes that are
>>  # associated with a particular key
>>  result = ... # some code that uses pdf1 and pdf2
>>  return result
>>
>> df3  = cogroup(df1, df2, key='some_key').apply(my_udf)
>>
>> I have searched around this problem and some people have suggested joining
>> the tables first. However, it's often not the same pattern, and it is hard
>> to get it to work using joins.
>>
>> I wonder what people's thoughts are on this?
>>
>> Li
>>
>>


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-20 Thread DB Tsai
Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

> On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin  
> wrote:
> 
> Just wanted to point out that
> https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
> and is marked as a correctness bug. (The fix is in the 2.4 branch,
> just not in rc2.)
> 
> On Wed, Feb 20, 2019 at 12:07 PM DB Tsai  wrote:
>> 
>> Please vote on releasing the following candidate as Apache Spark version 
>> 2.4.1.
>> 
>> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes are 
>> cast, with
>> a minimum of 3 +1 votes.
>> 
>> [ ] +1 Release this package as Apache Spark 2.4.1
>> [ ] -1 Do not release this package because ...
>> 
>> To learn more about Apache Spark, please see http://spark.apache.org/
>> 
>> The tag to be voted on is v2.4.1-rc2 (commit 
>> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
>> https://github.com/apache/spark/tree/v2.4.1-rc2
>> 
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
>> 
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> 
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1299/
>> 
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
>> 
>> The list of bug fixes going into 2.4.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>> 
>> FAQ
>> 
>> =
>> How can I help test this release?
>> =
>> 
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env, install
>> the current RC, and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>> 
>> ===
>> What should happen to JIRA tickets still targeting 2.4.1?
>> ===
>> 
>> The current list of open tickets targeted at 2.4.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 2.4.1
>> 
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>> 
>> ==
>> But my bug isn't fixed?
>> ==
>> 
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>> 
>> 
>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
>> Inc
>> 
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
> 
> 
> -- 
> Marcelo
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Unsubscribe

2019-02-20 Thread northbright
Unsubscribe pls


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-20 Thread Marcelo Vanzin
Just wanted to point out that
https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
and is marked as a correctness bug. (The fix is in the 2.4 branch,
just not in rc2.)

On Wed, Feb 20, 2019 at 12:07 PM DB Tsai  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.1.
>
> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes are 
> cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.1-rc2 (commit 
> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
> https://github.com/apache/spark/tree/v2.4.1-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1299/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
>
> The list of bug fixes going into 2.4.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env, install
> the current RC, and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.1?
> ===
>
> The current list of open tickets targeted at 2.4.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
> Inc
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-20 Thread DB Tsai
Please vote on releasing the following candidate as Apache Spark version 2.4.1.

The vote is open until Feb 24 PST and passes if a majority +1 PMC votes are 
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.1-rc2 (commit 
229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
https://github.com/apache/spark/tree/v2.4.1-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1299/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/

The list of bug fixes going into 2.4.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/2.4.1

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
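
For example, a tiny smoke test along these lines (an illustrative sketch only,
not an official test procedure) can be run inside a virtual env that has the
RC installed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("rc-smoke-test").getOrCreate()
print(spark.version)  # expect 2.4.1 when testing this RC

df = spark.range(1000).selectExpr("id", "id % 7 AS bucket")
counts = df.groupBy("bucket").count().collect()
assert sum(r["count"] for r in counts) == 1000  # no rows lost in the aggregation
spark.stop()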

===
What should happen to JIRA tickets still targeting 2.4.1?
===

The current list of open tickets targeted at 2.4.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.4.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Moving forward with the timestamp proposal

2019-02-20 Thread Zoltan Ivanfi
Hi,

Last December we shared a timestamp harmonization proposal
with the Hive, Spark and Impala communities. This
was followed by an extensive discussion in January that led to various
updates and improvements to the proposal, as well as the creation of a new
document for file format components. February has been quiet regarding this
topic, and the latest revision of the proposal has been stable in recent
weeks.

In short, the following is being proposed (please see the document for
details):

   - The TIMESTAMP WITHOUT TIME ZONE type should have LocalDateTime
   semantics.
   - The TIMESTAMP WITH LOCAL TIME ZONE type should have Instant semantics.
   - The TIMESTAMP WITH TIME ZONE type should have OffsetDateTime semantics.

This proposal is in accordance with the SQL standard and many major DB
engines.
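
As a rough illustration of the difference between the three semantics (a plain
Python datetime analogy, not any Spark/Hive/Impala API; LocalDateTime, Instant
and OffsetDateTime refer to the java.time types mentioned above):

from datetime import datetime, timezone, timedelta

# TIMESTAMP WITHOUT TIME ZONE ~ LocalDateTime: a wall-clock reading, no zone attached.
local = datetime(2019, 2, 20, 12, 0, 0)

# TIMESTAMP WITH LOCAL TIME ZONE ~ Instant: a fixed point on the timeline,
# rendered in the session-local time zone on display.
instant = datetime(2019, 2, 20, 12, 0, 0, tzinfo=timezone.utc)

# TIMESTAMP WITH TIME ZONE ~ OffsetDateTime: a point on the timeline that also
# remembers the offset it was written with.
with_offset = datetime(2019, 2, 20, 12, 0, 0, tzinfo=timezone(timedelta(hours=-8)))

print(local.isoformat())        # 2019-02-20T12:00:00       (no zone information)
print(instant.isoformat())      # 2019-02-20T12:00:00+00:00 (fixed instant)
print(with_offset.isoformat())  # 2019-02-20T12:00:00-08:00 (instant plus original offset)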

Based on the feedback we got I believe that the latest revision of the
proposal addresses the needs of all affected components, therefore I would
like to move forward and create JIRA-s and/or roadmap documentation pages
for the desired semantics of the different SQL types according to the
proposal.

Please let me know if you have any remaining concerns about the proposal or
about the course of action outlined above.

Thanks,

Zoltan


Re: Thoughts on dataframe cogroup?

2019-02-20 Thread Alessandro Solimando
Hello,
I fail to see how an equi-join on the key columns is different from the
cogroup you propose.

I think the accepted answer can shed some light:
https://stackoverflow.com/questions/43960583/whats-the-difference-between-join-and-cogroup-in-apache-spark

Now you apply a udf on each iterable, one per key value (obtained with
cogroup).

You can achieve the same by:
1) join df1 and df2 on the key you want,
2) apply "groupby" on that key,
3) finally apply a udaf (you can have a look here if you are not familiar
with them: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html),
which will process each group "in isolation". A rough sketch of this approach
follows below.
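
Here is a minimal, self-contained PySpark illustration of these three steps
(the column names are made up, and a grouped-map pandas UDF stands in for the
Scala UDAF from the link above):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.master("local[1]").appName("join-groupby-udf").getOrCreate()
df1 = spark.createDataFrame([("k1", 1.0), ("k1", 2.0)], ["key", "v1"])
df2 = spark.createDataFrame([("k1", 10.0), ("k2", 20.0)], ["key", "v2"])

@pandas_udf("key string, result double", PandasUDFType.GROUPED_MAP)
def per_key(pdf):
    # pdf holds the joined rows for one key (n * m of them when the key appears
    # in both frames); compute whatever per-key result is needed from them.
    return pd.DataFrame({"key": [pdf["key"].iloc[0]],
                         "result": [pdf["v1"].sum() + pdf["v2"].sum()]})

result = (df1.join(df2, on="key", how="full_outer")  # 1) join on the key
             .groupby("key")                         # 2) group by that key
             .apply(per_key))                        # 3) per-group function
result.show()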

HTH,
Alessandro

On Tue, 19 Feb 2019 at 23:30, Li Jin  wrote:

> Hi,
>
> We have been using PySpark's groupby().apply() quite a bit and it has been
> very helpful in integrating Spark with our existing pandas-heavy libraries.
>
> Recently, we have found more and more cases where groupby().apply() is not
> sufficient: in some cases, we want to group two dataframes by the same
> key and apply a function which takes two pd.DataFrames (and also returns a
> pd.DataFrame) for each key. This feels very much like the "cogroup"
> operation in the RDD API.
>
> It would be great to be able to do something like this (not actual API, just to
> explain the use case):
>
> @pandas_udf(return_schema, ...)
> def my_udf(pdf1, pdf2):
>  # pdf1 and pdf2 are the subsets of the original dataframes that are
>  # associated with a particular key
>  result = ... # some code that uses pdf1 and pdf2
>  return result
>
> df3  = cogroup(df1, df2, key='some_key').apply(my_udf)
>
> I have searched around this problem and some people have suggested joining
> the tables first. However, it's often not the same pattern, and it is hard to
> get it to work using joins.
>
> I wonder what people's thoughts are on this?
>
> Li
>
>


Re: Missing SparkR in CRAN

2019-02-20 Thread Takeshi Yamamuro
Thanks!

On Wed, Feb 20, 2019 at 12:10 PM Felix Cheung 
wrote:

> We are waiting for update from CRAN. Please hold on.
>
>
> --
> *From:* Takeshi Yamamuro 
> *Sent:* Tuesday, February 19, 2019 2:53 PM
> *To:* dev
> *Subject:* Re: Missing SparkR in CRAN
>
> Hi, guys
>
> It seems SparkR is still not available on CRAN; was there any problem
> when resubmitting it?
>
>
> On Fri, Jan 25, 2019 at 1:41 AM Felix Cheung 
> wrote:
>
>> Yes, it was discussed on dev@. We are waiting for the 2.3.3 release before
>> resubmitting.
>>
>>
>> On Thu, Jan 24, 2019 at 5:33 AM Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>> I happened to find that SparkR is missing from CRAN. See
>>> https://cran.r-project.org/web/packages/SparkR/index.html
>>>
>>> I remember seeing some threads about this on the spark-dev mailing list
>>> quite a while ago, IIRC. Is a fix for this in progress somewhere, or is it
>>> something I misunderstood?
>>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


-- 
---
Takeshi Yamamuro


Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-20 Thread Takeshi Yamamuro
+1

On Wed, Feb 20, 2019 at 4:59 PM JackyLee  wrote:

> +1
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro