Re: Please take a look at the draft of the Spark 3.1.1 release notes

2021-03-01 Thread Kazuaki Ishizaki
Hi Hyukjin,
Thanks for your effort.

One question: do you plan to update the URLs to the Spark documentation in 
the "Change of behavior" section automatically? Currently they refer to 
https://spark.apache.org/docs/3.0.0/..., but I think they should refer to 
https://spark.apache.org/docs/3.1.1/...

Regards,
Kazuaki Ishizaki, 



From:   Hyukjin Kwon 
To: dev 
Cc: Dongjoon Hyun , Jungtaek Lim 
, Tom Graves 
Date:   2021/03/02 11:20
Subject:Re: Please take a look at the draft of the Spark 3.1.1 
release notes



Thanks guys for suggestions and fixes. Now I feel pretty confident about 
the release notes :-).
I will start uploading and preparing to announce Spark 3.1.1.

On Tue, Mar 2, 2021 at 7:29 AM, Tom Graves wrote:
Thanks Hyukjin, overall they look good to me.

Tom
On Saturday, February 27, 2021, 05:00:42 PM CST, Jungtaek Lim <
kabhwan.opensou...@gmail.com> wrote: 


Thanks Hyukjin! I've only looked into the SS part, and added a comment. 
Otherwise it looks great! 

On Sat, Feb 27, 2021 at 7:12 PM Dongjoon Hyun  
wrote:
Thank you for sharing, Hyukjin!

Dongjoon.

On Sat, Feb 27, 2021 at 12:36 AM Hyukjin Kwon  wrote:
Hi all,

I am preparing to publish and announce Spark 3.1.1.
This is the draft of the release note, and I plan to edit a bit more and 
use it as the final release note.
Please take a look and let me know if I missed any major changes or 
something.

https://docs.google.com/document/d/1x6zzgRsZ4u1DgUh1XpGzX914CZbsHeRYpbqZ-PV6wdQ/edit?usp=sharing

Thanks.




Re: Spark on JDK 14

2020-10-28 Thread Kazuaki Ishizaki
Java 16 will also include the Vector API (incubator), which is part of 
Project Panama, as shown in 
https://mail.openjdk.java.net/pipermail/panama-dev/2020-October/011149.html

When the next LTS becomes available, we could exploit it in Spark.

Kazuaki Ishizaki



From:   Dongjoon Hyun 
To: Sean Owen 
Cc: dev 
Date:   2020/10/29 11:34
Subject:Re: Spark on JDK 14


Thank you for sharing, Sean.

Although Java 14 is already EOL (Sep. 2020), this is important information 
because we are tracking upstream Java.

Bests,
Dongjoon.

On Wed, Oct 28, 2020 at 1:44 PM Sean Owen  wrote:
For kicks, I tried Spark on JDK 14. 11 -> 14 doesn't change much, not as 
much as 8 -> 9 (-> 11), and indeed, virtually all tests pass. For the 
interested, these two seem to fail:

- ZooKeeperPersistenceEngine *** FAILED ***
  org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for /spark/master_status

- parsing hour with various patterns *** FAILED ***
  java.time.format.DateTimeParseException: Text '2009-12-12 12 am' could 
not be parsed: Invalid value for HourOfAmPm (valid values 0 - 11): 12
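
As a side note, that second failure reflects java.time's distinction between
the clock-hour-of-am-pm pattern letter 'h' (valid range 1-12) and the
hour-of-am-pm letter 'K' (valid range 0-11, the range named in the error).
A minimal Scala sketch of the field-range distinction only (it does not
reproduce the JDK-version difference itself):

  import java.time.format.DateTimeFormatter
  import java.util.Locale

  // 'h' is clock-hour-of-am-pm (1-12), so "12 AM" parses fine.
  val clockHour = DateTimeFormatter.ofPattern("h a", Locale.US)
  clockHour.parse("12 AM")

  // 'K' is hour-of-am-pm (0-11), so the same text is rejected with
  // "Invalid value for HourOfAmPm (valid values 0 - 11): 12".
  val hourOfAmPm = DateTimeFormatter.ofPattern("K a", Locale.US)
  hourOfAmPm.parse("12 AM")  // throws java.time.format.DateTimeParseException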

I'd expect that most applications would just work now on Spark 3 + Java 
14. I'd guess the same is true for Java 16 even, but, we're probably 
focused on the LTS releases.

Kris Mok pointed out that Project Panama (in Java 17 maybe?) might have 
implications as it changes off-heap memory access.




RE: [DISCUSS] naming policy of Spark configs

2020-02-12 Thread Kazuaki Ishizaki
+1 if we add them to Alternative config.

Kazuaki Ishizaki



From:   Takeshi Yamamuro 
To: Wenchen Fan 
Cc: Spark dev list 
Date:   2020/02/13 16:02
Subject:[EXTERNAL] Re: [DISCUSS] naming policy of Spark configs



+1; the idea sounds reasonable.

Bests,
Takeshi

On Thu, Feb 13, 2020 at 12:39 PM Wenchen Fan  wrote:
Hi Dongjoon,

It's too much work to revisit all the configs that were added in 3.0, but 
I'll revisit the recent commits that update config names and see if they 
follow the new policy.


Hi Reynold,

There are a few interval configs:
spark.sql.streaming.fileSink.log.compactInterval
spark.sql.streaming.continuous.executorPollIntervalMs

I think it's better to put the interval unit in the config name, like 
`executorPollIntervalMs`. Also the config should be created with 
`.timeConf`, so that users can set values like "1 second", "2 minutes", 
etc.
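
As a minimal sketch (the exact declaration in SQLConf may differ, and the
doc string below is made up), an interval config declared with Spark's
internal ConfigBuilder and `.timeConf` would look like:

  import java.util.concurrent.TimeUnit
  import org.apache.spark.internal.config.ConfigBuilder

  // The unit is part of the name, and `.timeConf` lets users write "100ms",
  // "2 minutes", ... while code reads the value back in milliseconds.
  val EXECUTOR_POLL_INTERVAL_MS =
    ConfigBuilder("spark.sql.streaming.continuous.executorPollIntervalMs")
      .doc("Interval at which continuous execution readers poll for new epochs.")
      .timeConf(TimeUnit.MILLISECONDS)
      .createWithDefaultString("100ms")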

There is no config that uses date/timestamp as value AFAIK.


Thanks,
Wenchen

On Thu, Feb 13, 2020 at 11:29 AM Jungtaek Lim <
kabhwan.opensou...@gmail.com> wrote:
+1 Thanks for the proposal. Looks very reasonable to me.

On Thu, Feb 13, 2020 at 10:53 AM Hyukjin Kwon  wrote:
+1.

On Thu, Feb 13, 2020 at 9:30 AM, Gengliang Wang <gengliang.w...@databricks.com> wrote:
+1, this is really helpful. We should make the SQL configurations 
consistent and more readable.

On Wed, Feb 12, 2020 at 3:33 PM Rubén Berenguel  
wrote:
I love it, it will make configs easier to read and write. Thanks Wenchen. 

R

On 13 Feb 2020, at 00:15, Dongjoon Hyun  wrote:


Thank you, Wenchen.

The new policy looks clear to me. +1 for the explicit policy.

So, are we going to revise the existing conf names before the 3.0.0 release?

Or does it apply only to new, upcoming configurations from now on?

Bests,
Dongjoon.

On Wed, Feb 12, 2020 at 7:43 AM Wenchen Fan  wrote:
Hi all,

I'd like to discuss the naming policy of Spark configs, as for now it 
depends on personal preference, which leads to inconsistent naming.

In general, the config name should be a noun that describes its meaning 
clearly.
Good examples:
spark.sql.session.timeZone
spark.sql.streaming.continuous.executorQueueSize
spark.sql.statistics.histogram.numBins
Bad examples:
spark.sql.defaultSizeInBytes (default size for what?)

Also note that a config name has many parts, joined by dots. Each part is a 
namespace. Don't create namespaces unnecessarily.
Good example:
spark.sql.execution.rangeExchange.sampleSizePerPartition
spark.sql.execution.arrow.maxRecordsPerBatch
Bad examples:
spark.sql.windowExec.buffer.in.memory.threshold ("in" is not a useful 
namespace, better to be .buffer.inMemoryThreshold)

For a big feature, usually we need to create an umbrella config to turn it 
on/off, and other configs for fine-grained controls. These configs should 
share the same namespace, and the umbrella config should be named like 
featureName.enabled. For example:
spark.sql.cbo.enabled
spark.sql.cbo.starSchemaDetection
spark.sql.cbo.starJoinFTRatio
spark.sql.cbo.joinReorder.enabled
spark.sql.cbo.joinReorder.dp.threshold (BTW "dp" is not a good namespace)
spark.sql.cbo.joinReorder.card.weight (BTW "card" is not a good namespace)

For boolean configs, in general it should end with a verb, e.g. 
spark.sql.join.preferSortMergeJoin. If the config is for a feature and you 
can't find a good verb for the feature, featureName.enabled is also good.

I'll update https://spark.apache.org/contributing.html after we reach a 
consensus here. Any comments are welcome!

Thanks,
Wenchen




-- 
---
Takeshi Yamamuro




[ANNOUNCE] Announcing Apache Spark 2.3.4

2019-09-09 Thread Kazuaki Ishizaki
We are happy to announce the availability of Spark 2.3.4!

Spark 2.3.4 is a maintenance release containing stability fixes. This
release is based on the branch-2.3 maintenance branch of Spark. We 
strongly
recommend all 2.3.x users to upgrade to this stable release.

To download Spark 2.3.4, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-3-4.html

We would like to acknowledge all community members for contributing to 
this
release. This release would not have been possible without you.

Kazuaki Ishizaki



Re: Welcoming some new committers and PMC members

2019-09-09 Thread Kazuaki Ishizaki
Congrats! Well deserved.

Kazuaki Ishizaki,



From:   Matei Zaharia 
To: dev 
Date:   2019/09/10 09:32
Subject:[EXTERNAL] Welcoming some new committers and PMC members



Hi all,

The Spark PMC recently voted to add several new committers and one PMC 
member. Join me in welcoming them to their new roles!

New PMC member: Dongjoon Hyun

New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang, 
Weichen Xu, Ruifeng Zheng

The new committers cover lots of important areas including ML, SQL, and 
data sources, so it’s great to have them here. All the best,

Matei and the Spark PMC


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org






[VOTE][RESULT] Spark 2.3.4 (RC1)

2019-08-29 Thread Kazuaki Ishizaki
Hi, All.

The vote passes. Thanks to all who helped with this release 2.3.4 (the 
final 2.3.x)! 
I'll follow up later with a release announcement once everything is 
published.


+1 (* = binding):
Sean Owen*
Dongjoon Hyun
DB Tsai*
Wenchen Fan*
Marco Gaido
Shrinidhi kanchi
John Zhuge
Marcelo Vanzin*


+0: None


-1: None


Regards,
Kazuaki Ishizaki




RE: [VOTE] Release Apache Spark 2.4.4 (RC3)

2019-08-28 Thread Kazuaki Ishizaki
+1
Built and tested with `mvn -Pyarn -Phadoop-2.7 -Pkubernetes -Pkinesis-asl 
-Phive -Phive-thriftserver test` on OpenJDK 1.8.0_211 on Ubuntu 16.04 
x86_64

Regards,
Kazuaki Ishizaki



From:   Dongjoon Hyun 
To: dev 
Date:   2019/08/28 12:14
Subject:[EXTERNAL] Re: [VOTE] Release Apache Spark 2.4.4 (RC3)



+1.

- Checked checksums and signatures of artifacts.
- Checked to have all binaries and maven repo.
- Checked document generation (including a new change after RC2)
- Build with `-Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver 
-Phadoop-2.6` on AdoptOpenJDK8_202.
- Tested with both Scala-2.11/Scala-2.12 and both Python2/3.
   Python 2.7.15 with numpy 1.16.4, scipy 1.2.2, pandas 0.19.2, pyarrow 
0.8.0
   Python 3.6.4 with numpy 1.16.4, scipy 1.2.2, pandas 0.23.2, pyarrow 
0.11.0
- Tested JDBC IT.

Bests,
Dongjoon.


On Tue, Aug 27, 2019 at 4:05 PM Dongjoon Hyun  
wrote:
Please vote on releasing the following candidate as Apache Spark version 
2.4.4.

The vote is open until August 30th 5PM PST and passes if a majority +1 PMC 
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.4
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.4-rc3 (commit 
7955b3962ac46b89564e0613db7bea98a1478bf2):
https://github.com/apache/spark/tree/v2.4.4-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1332/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-docs/

The list of bug fixes going into 2.4.4 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12345466

This release is using the release script of the tag v2.4.4-rc3.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
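
For the Java/Scala path, a minimal sbt sketch that pulls the RC from the
staging repository listed above (assuming the staged artifacts use the
version string 2.4.4):

  // build.sbt: test existing code against the 2.4.4 RC3 staging artifacts.
  resolvers += "Apache Spark 2.4.4 RC3 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1332/"

  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4" % Provided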

===
What should happen to JIRA tickets still targeting 2.4.4?
===

The current list of open tickets targeted at 2.4.4 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target 
Version/s" = 2.4.4

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.




Re: [VOTE] Release Apache Spark 2.3.4 (RC1)

2019-08-26 Thread Kazuaki Ishizaki
Thank you for pointing out the problem.
The displayed text and the hyperlink point to different URLs.

Could you please access 
https://repository.apache.org/content/repositories/orgapachespark-1331/ as 
shown in the text?

Sorry for the inconvenience.
Kazuaki Ishizaki,



From:   Takeshi Yamamuro 
To: Kazuaki Ishizaki 
Cc: Apache Spark Dev 
Date:   2019/08/27 08:49
Subject:Re: [VOTE] Release Apache Spark 2.3.4 (RC1)



Hi, 

Thanks for managing the release!
It seems the staging repository has not been exposed yet?
https://repository.apache.org/content/repositories/orgapachespark-1328/

On Tue, Aug 27, 2019 at 5:28 AM Kazuaki Ishizaki  
wrote:
Please vote on releasing the following candidate as Apache Spark version 
2.3.4.

The vote is open until August 29th 2PM PST and passes if a majority +1 PMC 
votes are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.3.4
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.4-rc1 (commit 
8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210):
https://github.com/apache/spark/tree/v2.3.4-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1331/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-docs/

The list of bug fixes going into 2.3.4 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12344844

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.4?
===

The current list of open tickets targeted at 2.3.4 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target 
Version/s" = 2.3.4

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.



-- 
---
Takeshi Yamamuro




[VOTE] Release Apache Spark 2.3.4 (RC1)

2019-08-26 Thread Kazuaki Ishizaki
Please vote on releasing the following candidate as Apache Spark version 
2.3.4.

The vote is open until August 29th 2PM PST and passes if a majority +1 PMC 
votes are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.3.4
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.4-rc1 (commit 
8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210):
https://github.com/apache/spark/tree/v2.3.4-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1331/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-docs/

The list of bug fixes going into 2.3.4 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12344844

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.4?
===

The current list of open tickets targeted at 2.3.4 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target 
Version/s" = 2.3.4

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.




RE: Release Spark 2.3.4

2019-08-22 Thread Kazuaki Ishizaki
The following PRs regarding SPARK-28699 have been merged into branch-2.3.
https://github.com/apache/spark/pull/25491
https://github.com/apache/spark/pull/25498
-> https://github.com/apache/spark/pull/25508 (backport to 2.3)

I will cut the `2.3.4-rc1` tag during the weekend and start the 2.3.4 RC1 
vote next Monday.

Regards,
Kazuaki Ishizaki



From:   "Kazuaki Ishizaki" 
To: "Kazuaki Ishizaki" 
Cc: Dilip Biswal , dev , 
Hyukjin Kwon , jzh...@apache.org, Takeshi Yamamuro 
, Xiao Li 
Date:   2019/08/20 13:12
Subject:[EXTERNAL] RE: Release Spark 2.3.4



Due to the recent correctness issue at SPARK-28699, I will delay the 
release for Spark 2.3.4 RC1 for a while.
https://issues.apache.org/jira/browse/SPARK-28699?focusedCommentId=16910859&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16910859


Regards,
Kazuaki Ishizaki



From:"Kazuaki Ishizaki" 
To:Hyukjin Kwon 
Cc:Dilip Biswal , dev , 
jzh...@apache.org, Takeshi Yamamuro , Xiao Li 

Date:2019/08/19 11:17
Subject:[EXTERNAL] RE: Release Spark 2.3.4



Hi all,
Thank you. I will prepare RC for 2.3.4 this week in parallel. It will be 
in parallel with RC for 2.4.4 managed by Dongjoon.

Regards,
Kazuaki Ishizaki



From:Hyukjin Kwon 
To:Dilip Biswal 
Cc:    jzh...@apache.org, dev , Kazuaki Ishizaki 
, Takeshi Yamamuro , Xiao Li 

Date:2019/08/17 16:37
Subject:[EXTERNAL] Re: Release Spark 2.3.4



+1 too

On Sat, Aug 17, 2019 at 3:06 PM, Dilip Biswal wrote:
+1

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com


- Original message -
From: John Zhuge 
To: Xiao Li 
Cc: Takeshi Yamamuro , Spark dev list <
dev@spark.apache.org>, Kazuaki Ishizaki 
Subject: [EXTERNAL] Re: Release Spark 2.3.4
Date: Fri, Aug 16, 2019 4:33 PM
 
+1
 
On Fri, Aug 16, 2019 at 4:25 PM Xiao Li  wrote:
+1
 
On Fri, Aug 16, 2019 at 4:11 PM Takeshi Yamamuro  
wrote:
+1, too 

Bests,
Takeshi
 
On Sat, Aug 17, 2019 at 7:25 AM Dongjoon Hyun  
wrote:
+1 for 2.3.4 release as the last release for `branch-2.3` EOL. 

Also, +1 for next week release.

Bests,
Dongjoon.

 
On Fri, Aug 16, 2019 at 8:19 AM Sean Owen  wrote:
I think it's fine to do these in parallel, yes. Go ahead if you are 
willing.

On Fri, Aug 16, 2019 at 9:48 AM Kazuaki Ishizaki  
wrote:
>
> Hi, All.
>
> Spark 2.3.3 was released six months ago (15th February, 2019) at 
http://spark.apache.org/news/spark-2-3-3-released.html. And about 18 
months have passed since Spark 2.3.0 was released (28th February, 2018).
> As of today (16th August), there are 103 commits (69 JIRAs) in 
`branch-2.3` since 2.3.3.
>
> It would be great if we can have Spark 2.3.4.
> If it is ok, shall we start `2.3.4 RC1` concurrently with 2.4.4 or after 
2.4.4 is released?
>
> An issue list in JIRA: 
https://issues.apache.org/jira/projects/SPARK/versions/12344844
> A commit list on GitHub since the last release: 
https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3

> The 8 correctness issues resolved in branch-2.3:
> 
https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC

>
> Best Regards,
> Kazuaki Ishizaki

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

 

-- 
---
Takeshi Yamamuro
 

-- 

 

-- 
John Zhuge


- To 
unsubscribe e-mail: dev-unsubscr...@spark.apache.org






RE: Release Spark 2.3.4

2019-08-19 Thread Kazuaki Ishizaki
Due to the recent correctness issue at SPARK-28699, I will delay the 
release for Spark 2.3.4 RC1 for a while.
https://issues.apache.org/jira/browse/SPARK-28699?focusedCommentId=16910859&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16910859


Regards,
Kazuaki Ishizaki



From:   "Kazuaki Ishizaki" 
To: Hyukjin Kwon 
Cc: Dilip Biswal , dev , 
jzh...@apache.org, Takeshi Yamamuro , Xiao Li 

Date:   2019/08/19 11:17
Subject:[EXTERNAL] RE: Release Spark 2.3.4



Hi all,
Thank you. I will prepare RC for 2.3.4 this week in parallel. It will be 
in parallel with RC for 2.4.4 managed by Dongjoon.

Regards,
Kazuaki Ishizaki



From:Hyukjin Kwon 
To:Dilip Biswal 
Cc:jzh...@apache.org, dev , Kazuaki Ishizaki 
, Takeshi Yamamuro , Xiao Li 

Date:2019/08/17 16:37
Subject:[EXTERNAL] Re: Release Spark 2.3.4



+1 too

On Sat, Aug 17, 2019 at 3:06 PM, Dilip Biswal wrote:
+1
 
Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com
 
 
- Original message -
From: John Zhuge 
To: Xiao Li 
Cc: Takeshi Yamamuro , Spark dev list <
dev@spark.apache.org>, Kazuaki Ishizaki 
Subject: [EXTERNAL] Re: Release Spark 2.3.4
Date: Fri, Aug 16, 2019 4:33 PM
 
+1
 
On Fri, Aug 16, 2019 at 4:25 PM Xiao Li  wrote:
+1
 
On Fri, Aug 16, 2019 at 4:11 PM Takeshi Yamamuro  
wrote:
+1, too 
 
Bests,
Takeshi
 
On Sat, Aug 17, 2019 at 7:25 AM Dongjoon Hyun  
wrote:
+1 for 2.3.4 release as the last release for `branch-2.3` EOL. 
 
Also, +1 for next week release.
 
Bests,
Dongjoon.
 
 
On Fri, Aug 16, 2019 at 8:19 AM Sean Owen  wrote:
I think it's fine to do these in parallel, yes. Go ahead if you are 
willing.

On Fri, Aug 16, 2019 at 9:48 AM Kazuaki Ishizaki  
wrote:
>
> Hi, All.
>
> Spark 2.3.3 was released six months ago (15th February, 2019) at 
http://spark.apache.org/news/spark-2-3-3-released.html. And about 18 
months have passed since Spark 2.3.0 was released (28th February, 2018).
> As of today (16th August), there are 103 commits (69 JIRAs) in 
`branch-2.3` since 2.3.3.
>
> It would be great if we can have Spark 2.3.4.
> If it is ok, shall we start `2.3.4 RC1` concurrently with 2.4.4 or after 
2.4.4 is released?
>
> An issue list in JIRA: 
https://issues.apache.org/jira/projects/SPARK/versions/12344844
> A commit list on GitHub since the last release: 
https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3

> The 8 correctness issues resolved in branch-2.3:
> 
https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC

>
> Best Regards,
> Kazuaki Ishizaki

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 
 
 
-- 
---
Takeshi Yamamuro
 
 
-- 
 
 
 
-- 
John Zhuge
 

- To 
unsubscribe e-mail: dev-unsubscr...@spark.apache.org





RE: Release Spark 2.3.4

2019-08-18 Thread Kazuaki Ishizaki
Hi all,
Thank you. I will prepare the RC for 2.3.4 this week, in parallel with the 
RC for 2.4.4 managed by Dongjoon.

Regards,
Kazuaki Ishizaki



From:   Hyukjin Kwon 
To: Dilip Biswal 
Cc: jzh...@apache.org, dev , Kazuaki Ishizaki 
, Takeshi Yamamuro , Xiao Li 

Date:   2019/08/17 16:37
Subject:[EXTERNAL] Re: Release Spark 2.3.4



+1 too

On Sat, Aug 17, 2019 at 3:06 PM, Dilip Biswal wrote:
+1
 
Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com
 
 
- Original message -
From: John Zhuge 
To: Xiao Li 
Cc: Takeshi Yamamuro , Spark dev list <
dev@spark.apache.org>, Kazuaki Ishizaki 
Subject: [EXTERNAL] Re: Release Spark 2.3.4
Date: Fri, Aug 16, 2019 4:33 PM
  
+1
  
On Fri, Aug 16, 2019 at 4:25 PM Xiao Li  wrote:
+1
  
On Fri, Aug 16, 2019 at 4:11 PM Takeshi Yamamuro  
wrote:
+1, too 
 
Bests,
Takeshi
  
On Sat, Aug 17, 2019 at 7:25 AM Dongjoon Hyun  
wrote:
+1 for 2.3.4 release as the last release for `branch-2.3` EOL. 
 
Also, +1 for next week release.
 
Bests,
Dongjoon.
 
  
On Fri, Aug 16, 2019 at 8:19 AM Sean Owen  wrote:
I think it's fine to do these in parallel, yes. Go ahead if you are 
willing.

On Fri, Aug 16, 2019 at 9:48 AM Kazuaki Ishizaki  
wrote:
>
> Hi, All.
>
> Spark 2.3.3 was released six months ago (15th February, 2019) at 
http://spark.apache.org/news/spark-2-3-3-released.html. And about 18 
months have passed since Spark 2.3.0 was released (28th February, 2018).
> As of today (16th August), there are 103 commits (69 JIRAs) in 
`branch-2.3` since 2.3.3.
>
> It would be great if we can have Spark 2.3.4.
> If it is ok, shall we start `2.3.4 RC1` concurrently with 2.4.4 or after 
2.4.4 is released?
>
> An issue list in JIRA: 
https://issues.apache.org/jira/projects/SPARK/versions/12344844
> A commit list on GitHub since the last release: 
https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3

> The 8 correctness issues resolved in branch-2.3:
> 
https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC

>
> Best Regards,
> Kazuaki Ishizaki

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 
  
 
-- 
---
Takeshi Yamamuro
  
 
-- 
 
  
 
-- 
John Zhuge
 

- To 
unsubscribe e-mail: dev-unsubscr...@spark.apache.org 




Release Spark 2.3.4

2019-08-16 Thread Kazuaki Ishizaki
Hi, All.

Spark 2.3.3 was released six months ago (15th February, 2019) at 
http://spark.apache.org/news/spark-2-3-3-released.html. And about 18 
months have passed since Spark 2.3.0 was released (28th February, 2018).
As of today (16th August), there are 103 commits (69 JIRAs) in `branch-2.3` 
since 2.3.3.

It would be great if we can have Spark 2.3.4.
If it is ok, shall we start `2.3.4 RC1` concurrently with 2.4.4 or after 
2.4.4 is released?

An issue list in JIRA: 
https://issues.apache.org/jira/projects/SPARK/versions/12344844
A commit list on GitHub since the last release: 
https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3
The 8 correctness issues resolved in branch-2.3:
https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC

Best Regards,
Kazuaki Ishizaki



Re: Release Apache Spark 2.4.4

2019-08-16 Thread Kazuaki Ishizaki
Sure, I will launch a separate e-mail thread for discussing 2.3.4 later.

Regards,
Kazuaki Ishizaki, Ph.D.



From:   Dongjoon Hyun 
To: Sean Owen , Kazuaki Ishizaki 
Cc: dev 
Date:   2019/08/16 05:10
Subject:[EXTERNAL] Re: Release Apache Spark 2.4.4



+1 for that.

Kazuaki volunteered for 2.3.4 release last month. AFAIK, he has been 
preparing that.

- 
https://lists.apache.org/thread.html/6fafeefb7715e8764ccfe5d30c90d7444378b5f4f383ec95e2f1d7de@%3Cdev.spark.apache.org%3E

I believe we can handle them after 2.4.4 RC1 (or concurrently.)

Hi, Kazuaki.
Could you start a separate email thread for 2.3.4 release?

Bests,
Dongjoon.


On Thu, Aug 15, 2019 at 8:43 AM Sean Owen  wrote:
While we're on the topic:

In theory, branch 2.3 is meant to be unsupported as of right about now.

There are 69 fixes in branch 2.3 since 2.3.3 was released in February:
https://issues.apache.org/jira/projects/SPARK/versions/12344844

Some look moderately important.

Should we also, or first, cut 2.3.4 to end the 2.3.x line?

On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun  
wrote:
>
> Hi, All.
>
> Spark 2.4.3 was released three months ago (8th May).
> As of today (13th August), there are 112 commits (75 JIRAs) in 
`branch-2.4` since 2.4.3.
>
> It would be great if we can have Spark 2.4.4.
> Shall we start `2.4.4 RC1` next Monday (19th August)?
>
> Last time, there was a request for K8s issue and now I'm waiting for 
SPARK-27900.
> Please let me know if there is another issue.
>
> Thanks,
> Dongjoon.




RE: Release Apache Spark 2.4.4

2019-08-13 Thread Kazuaki Ishizaki
Thanks, Dongjoon!
+1

Kazuaki Ishizaki,



From:   Hyukjin Kwon 
To: Takeshi Yamamuro 
Cc: Dongjoon Hyun , dev 
, User 
Date:   2019/08/14 09:21
Subject:[EXTERNAL] Re: Release Apache Spark 2.4.4



+1

On Wed, Aug 14, 2019 at 9:13 AM, Takeshi Yamamuro wrote:
Hi,

Thanks for your notification, Dongjoon!
I put some links for the other committers/PMCs to access the info easily:

A commit list on GitHub since the last release: 
https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...branch-2.4
An issue list in JIRA: 
https://issues.apache.org/jira/projects/SPARK/versions/12345466#release-report-tab-body
The 5 correctness issues resolved in branch-2.4:
https://issues.apache.org/jira/browse/SPARK-27798?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012345466%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC

Anyway, +1

Best,
Takeshi

On Wed, Aug 14, 2019 at 8:25 AM DB Tsai  wrote:
+1

On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun  
wrote:
>
> Hi, All.
>
> Spark 2.4.3 was released three months ago (8th May).
> As of today (13th August), there are 112 commits (75 JIRAs) in 
`branch-2.4` since 2.4.3.
>
> It would be great if we can have Spark 2.4.4.
> Shall we start `2.4.4 RC1` next Monday (19th August)?
>
> Last time, there was a request for K8s issue and now I'm waiting for 
SPARK-27900.
> Please let me know if there is another issue.
>
> Thanks,
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



-- 
---
Takeshi Yamamuro




Re: Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-16 Thread Kazuaki Ishizaki
Thank you Dongjoon for being a release manager.

If the assumed dates are ok, I would like to volunteer as the 2.3.4 
release manager.

Best Regards,
Kazuaki Ishizaki,



From:   Dongjoon Hyun 
To: dev , "user @spark" , 
Apache Spark PMC 
Date:   2019/07/13 07:18
Subject:[EXTERNAL] Re: Release Apache Spark 2.4.4 before 3.0.0



Thank you, Jacek.

BTW, I added `@private` since we need PMC's help to make an Apache Spark 
release.

Can I get more feedbacks from the other PMC members?

Please let me know if you have any concerns (e.g. release date or release 
manager).

As one of the community members, I assumed the following (if we are on 
schedule):

- 2.4.4 at the end of July
- 2.3.4 at the end of August (since 2.3.0 was released at the end of 
February 2018)
- 3.0.0 (possibily September?)
- 3.1.0 (January 2020?)

Bests,
Dongjoon.


On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski  wrote:
Hi,

Thanks Dongjoon Hyun for stepping up as a release manager! 
Much appreciated. 

If there's a volunteer to cut a release, I'm always happy to support it.

In addition, the more frequent the releases, the better for end users: they 
get the choice to upgrade and pick up all the latest fixes, or to wait. It's 
their call, not ours (otherwise we'd keep them waiting).

My big 2 yes'es for the release!

Jacek


On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun,  wrote:
Hi, All.

Spark 2.4.3 was released two months ago (8th May).

As of today (9th July), there exist 45 fixes in `branch-2.4` including the 
following correctness or blocker issues.

- SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger does not work for 
decimals not fitting in long
- SPARK-26045 Error in the spark 2.4 release package with the 
spark-avro_2.11 dependency
- SPARK-27798 from_avro can modify variables in other rows in local 
mode
- SPARK-27907 HiveUDAF should return NULL in case of 0 rows
- SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries
- SPARK-28308 CalendarInterval sub-second part should be padded before 
parsing

It would be great if we can have Spark 2.4.4 before we are going to get 
busier for 3.0.0.
If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll 
it next Monday (15th July).
What do you think?

Bests,
Dongjoon.




Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-09 Thread Kazuaki Ishizaki
+1 (non-binding)

Kazuaki Ishizaki



From:   Bryan Cutler 
To: Bobby Evans 
Cc: Thomas graves , Spark dev list 

Date:   2019/05/09 03:20
Subject:Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended 
Columnar Processing Support



+1 (non-binding)

On Tue, May 7, 2019 at 12:04 PM Bobby Evans  wrote:
I am +1

On Tue, May 7, 2019 at 1:37 PM Thomas graves  wrote:
Hi everyone,

I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
for extended Columnar Processing Support.  The proposal is to extend
the support to allow for more columnar processing.  We had previous
vote and discussion threads and have updated the SPIP based on the
comments to clarify a few things and reduce the scope.

You can find the updated proposal in the jira at:
https://issues.apache.org/jira/browse/SPARK-27396.

Please vote as early as you can, I will leave the vote open until next
Monday (May 13th), 2pm CST to give people plenty of time.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!
Tom Graves

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Kazuaki Ishizaki
This looks like an interesting discussion.
Let me describe the current structure and the remaining issues. This is 
orthogonal to the cost-benefit trade-off discussion.

The code generation basically consists of three parts.
1. Loading
2. Selection (map, filter, ...)
3. Projection

1. Columnar storage (e.g. Parquet, ORC, Arrow, and the table cache) is well 
abstracted by the ColumnVector class (
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java
). Combined with ColumnarBatchScan, whole-stage code generation generates 
code to get values directly from the columnar storage if there is no 
row-based operation (see the sketch after this list).
Note: the current master does not support Arrow as a data source. However, 
I think it is not technically hard to support Arrow.

2. The current whole-stage codegen generates code for element-wise 
selection (excluding sort and join). The SIMDization or GPU-ization 
capability depends on a compiler that translates the code generated by 
whole-stage codegen into native code.

3. The current Projection assumes row-oriented data storage; I think that 
is the part that Wenchen pointed out.
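
A rough sketch of the abstraction in 1. (illustrative only; nulls and data
types are ignored for brevity): the same ColumnarBatch can be consumed
column-wise through ColumnVector getters or row-wise through its
InternalRow iterator.

  import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}

  // Columnar access: read the first (int) column directly from the vector.
  def sumFirstIntColumn(batch: ColumnarBatch): Long = {
    val col: ColumnVector = batch.column(0)
    (0 until batch.numRows()).foldLeft(0L)((acc, i) => acc + col.getInt(i))
  }

  // Row-oriented view over the same data, as generated code sees it when a
  // row-based operator is present.
  def sumViaRows(batch: ColumnarBatch): Long = {
    val it = batch.rowIterator()
    var sum = 0L
    while (it.hasNext) sum += it.next().getInt(0)
    sum
  }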

My slides 
https://www.slideshare.net/ishizaki/making-hardware-accelerator-easier-to-use/41
may help clarify the above issues and a possible implementation.



FYI, NVIDIA will present an approach to exploit GPUs with Arrow through 
Python at SAIS 2019 
https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=110
. I think it uses the Python UDF support with Arrow in Spark.

P.S. I will give a presentation about in-memory data storage for Spark at 
SAIS 2019 
https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=40 
:)

Kazuaki Ishizaki



From:   Wenchen Fan 
To: Bobby Evans 
Cc: Spark dev list 
Date:   2019/03/26 13:53
Subject:Re: [DISCUSS] Spark Columnar Processing



Do you have some initial perf numbers? It seems fine to me to remain 
row-based inside Spark with whole-stage-codegen, and convert rows to 
columnar batches when communicating with external systems.

On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans  wrote:
This thread is to discuss adding in support for data frame processing 
using an in-memory columnar format compatible with Apache Arrow.  My main 
goal in this is to lay the groundwork so we can add in support for GPU 
accelerated processing of data frames, but this feature has a number of 
other benefits.  Spark currently supports Apache Arrow formatted data as 
an option to exchange data with python for pandas UDF processing. There 
has also been discussion around extending this to allow for exchanging 
data with other tools like pytorch, tensorflow, xgboost,... If Spark 
supports processing on Arrow compatible data it could eliminate the 
serialization/deserialization overhead when going between these systems.  
It also would allow for doing optimizations on a CPU with SIMD 
instructions similar to what Hive currently supports. Accelerated 
processing using a GPU is something that we will start a separate 
discussion thread on, but I wanted to set the context a bit.
Jason Lowe, Tom Graves, and I created a prototype over the past few months 
to try and understand how to make this work.  What we are proposing is 
based off of lessons learned when building this prototype, but we really 
wanted to get feedback early on from the community. We will file a SPIP 
once we can get agreement that this is a good direction to go in.

The current support for columnar processing lets a Parquet or Orc file 
format return a ColumnarBatch inside an RDD[InternalRow] using Scala’s 
type erasure. The code generation is aware that the RDD actually holds 
ColumnarBatchs and generates code to loop through the data in each batch 
as InternalRows.

Instead, we propose a new set of APIs to work on an 
RDD[InternalColumnarBatch] instead of abusing type erasure. With this we 
propose adding in a Rule similar to how WholeStageCodeGen currently works. 
Each part of the physical SparkPlan would expose columnar support through 
a combination of traits and method calls. The rule would then decide when 
columnar processing would start and when it would end. Switching between 
columnar and row based processing is not free, so the rule would make a 
decision based off of an estimate of the cost to do the transformation and 
the estimated speedup in processing time. 
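
A purely illustrative sketch (not code from the prototype; the names here
are hypothetical) of how a physical operator could advertise columnar
support to such a rule:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.vectorized.ColumnarBatch

  // Hypothetical trait: the planning rule would check supportsColumnar and,
  // when the estimated transition cost is worth it, call executeColumnar()
  // instead of the row-based execute().
  trait ColumnarSupport {
    def supportsColumnar: Boolean = false
    def executeColumnar(): RDD[ColumnarBatch] =
      throw new UnsupportedOperationException("columnar execution not implemented")
  }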

This should allow us to disable columnar support by simply disabling the 
rule that modifies the physical SparkPlan.  It should be minimal risk to 
the existing row-based code path, as that code should not be touched, and 
in many cases could be reused to implement the columnar version.  This 
also allows for small easily manageable patches. No huge patches that no 
one wants to review.

As far as the memory layout is concerned OnHeapColumnVector and 
OffHeapColumnVector are already really close to being Apache Arrow 
compatible so shifting them over would be a relatively simple

Re: Welcome Jose Torres as a Spark committer

2019-02-05 Thread Kazuaki Ishizaki
Congratulations, Jose!

Kazuaki Ishizaki



From:   Gengliang Wang 
To: dev 
Date:   2019/01/31 18:32
Subject:Re: Welcome Jose Torres as a Spark committer



Congrats Jose!


On Jan 31, 2019, at 6:51 AM, Bryan Cutler wrote:

Congrats Jose!

On Tue, Jan 29, 2019, 10:48 AM Shixiong Zhu 

Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-11-13 Thread Kazuaki Ishizaki
Hi all,
I spent some time considering these great points. Sorry for my delay.
I put comments in green into 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit

Here is a summary of the comments:
1) For simplicity and expressiveness, introduce nodes to represent a 
structure (e.g. for, while)
2) For simplicity, measure some statistics (e.g. node / Java bytecode, 
memory consumption)
3) For ease of understanding, use simple APIs like the original statements 
(op2, for, while, ...); see the sketch after this list
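
A purely illustrative sketch of what such node-based APIs could look like
(all names here are hypothetical; the actual design is in the Google Doc):

  // Hypothetical IR nodes; `code` renders a node back to Java source text.
  sealed trait JavaNode { def code: String }
  case class Literal(v: String) extends JavaNode { def code: String = v }
  case class Op2(op: String, l: JavaNode, r: JavaNode) extends JavaNode {
    def code: String = s"(${l.code} $op ${r.code})"
  }
  case class For(init: String, cond: String, update: String, body: JavaNode)
      extends JavaNode {
    def code: String = s"for ($init; $cond; $update) { ${body.code} }"
  }

  // Node-based equivalent of today's string concatenation
  // "a" + " * " + "b" + " / " + "c":
  val expr = Op2("/", Op2("*", Literal("a"), Literal("b")), Literal("c"))
  // expr.code == "((a * b) / c)"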

We would appreciate any comments/suggestions on the Google Doc or the dev 
mailing list going forward.

Kazuaki Ishizaki, 



From:   "Kazuaki Ishizaki" 
To: Reynold Xin 
Cc: dev , Takeshi Yamamuro 
, Xiao Li 
Date:   2018/10/31 00:56
Subject:Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



Hi Reynold,
Thank you for your comments. They are great points.

1) Yes, it is not easy to design an IR that is both expressive enough and 
simple enough. We can learn concepts from good examples like HyPer, Weld, 
and others; they are expressive and not complicated. The details cannot be 
captured yet.
2) Introducing another layer takes some time to learn. This SPIP tries to 
reduce that learning time by preparing clean APIs for constructing 
generated code. I will try to add some examples of APIs that are 
equivalent to the current string concatenations (e.g. "a" + " * " + "b" + 
" / " + "c").

It is more important for us to learn from failures than from successes. 
We would appreciate it if you could list the failures that you have seen.
Best Regards,
Kazuaki Ishizaki



From:Reynold Xin 
To:Kazuaki Ishizaki 
Cc:Xiao Li , dev , 
Takeshi Yamamuro 
Date:2018/10/26 03:46
Subject:Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



I have some pretty serious concerns over this proposal. I agree that there 
are many things that can be improved, but at the same time I also think 
the cost of introducing a new IR in the middle is extremely high. Having 
participated in designing some of the IRs in other systems, I've seen more 
failures than successes. The failures typically come from two sources: (1) 
in general it is extremely difficult to design IRs that are both 
expressive enough and are simple enough; (2) typically another layer of 
indirection increases the complexity a lot more, beyond the level of 
understanding and expertise that most contributors can obtain without 
spending years in the code base and learning about all the gotchas.

In either case, I'm not saying "no please don't do this". This is one of 
those cases in which the devils are in the details that cannot be captured 
by a high level document, and I want to explicitly express my concern 
here.




On Thu, Oct 25, 2018 at 12:10 AM Kazuaki Ishizaki  
wrote:
Hi Xiao,
Thank you very much for becoming a shepherd.
If you feel the discussion settles, we would appreciate it if you would 
start a voting.

Regards,
Kazuaki Ishizaki



From:Xiao Li 
To:Kazuaki Ishizaki 
Cc:dev , Takeshi Yamamuro <
linguin@gmail.com>
Date:2018/10/22 16:31
Subject:Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



Hi, Kazuaki, 

Thanks for your great SPIP! I am willing to be the shepherd of this SPIP. 

Cheers,

Xiao


On Mon, Oct 22, 2018 at 12:05 AM Kazuaki Ishizaki  
wrote:
Hi Yamamuro-san,
Thank you for your comments. This SPIP gets several valuable comments and 
feedback on Google Doc: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing
.
I hope that this SPIP could go forward based on these feedback.

Based on this SPIP procedure 
http://spark.apache.org/improvement-proposals.html, can I ask one or more 
PMCs to become a shepherd of this SPIP?
I would appreciate your kindness and cooperation. 

Best Regards,
Kazuaki Ishizaki



From:Takeshi Yamamuro 
To:Spark dev list 
Cc:ishiz...@jp.ibm.com
Date:2018/10/15 12:12
Subject:Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



Hi, ishizaki-san,

Cool activity, I left some comments on the doc.

best,
takeshi


On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki  
wrote:
Hello community,

I am writing this e-mail in order to start a discussion about adding 
structure intermediate representation for generating Java code from a 
program using DataFrame or Dataset API, in addition to the current 
String-based representation.
This addition is based on the discussions in a thread at 
https://github.com/apache/spark/pull/21537#issuecomment-413268196

Please feel free to comment on the JIRA ticket or Google Doc.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728
Google Doc

Re: Test and support only LTS JDK release?

2018-11-07 Thread Kazuaki Ishizaki
This entry includes a good figure of the support lifecycle.
https://www.azul.com/products/zulu-and-zulu-enterprise/zulu-enterprise-java-support-options/

Kazuaki Ishizaki,



From:   Marcelo Vanzin 
To: Felix Cheung 
Cc: Ryan Blue , sn...@snazy.de, dev 
, Cesar Delgado 
Date:   2018/11/07 08:29
Subject:Re: Test and support only LTS JDK release?



https://www.oracle.com/technetwork/java/javase/eol-135779.html

On Tue, Nov 6, 2018 at 2:56 PM Felix Cheung  
wrote:
>
> Is there a list of LTS release that I can reference?
>
>
> 
> From: Ryan Blue 
> Sent: Tuesday, November 6, 2018 1:28 PM
> To: sn...@snazy.de
> Cc: Spark Dev List; cdelg...@apple.com
> Subject: Re: Test and support only LTS JDK release?
>
> +1 for supporting LTS releases.
>
> On Tue, Nov 6, 2018 at 11:48 AM Robert Stupp  wrote:
>>
>> +1 on supporting LTS releases.
>>
>> VM distributors (RedHat, Azul - to name two) want to provide patches to 
LTS versions (i.e. into 
http://hg.openjdk.java.net/jdk-updates/jdk11u/
). How that will play out in reality ... I don't know. Whether Oracle will 
contribute to that repo for 8 after it's EOL and 11 after the 6 month 
cycle ... we will see. Most Linux distributions promised(?) long-term 
support for Java 11 in their LTS releases (e.g. Ubuntu 18.04). I am not 
sure what that exactly means ... whether they will actively provide 
patches to OpenJDK or whether they just build from source.
>>
>> But considering that, I think it's definitely worth to at least keep an 
eye on Java 12 and 13 - even if those are just EA. Java 12 for example 
does already forbid some "dirty tricks" that are still possible in Java 
11.
>>
>>
>> On 11/6/18 8:32 PM, DB Tsai wrote:
>>
>> OpenJDK will follow Oracle's release cycle, 
https://openjdk.java.net/projects/jdk/
, a strict six months model. I'm not familiar with other non-Oracle VMs 
and Redhat support.
>>
>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   
Apple, Inc
>>
>> On Nov 6, 2018, at 11:26 AM, Reynold Xin  wrote:
>>
>> What does OpenJDK do and other non-Oracle VMs? I know there was a lot 
of discussions from Redhat etc to support.
>>
>>
>> On Tue, Nov 6, 2018 at 11:24 AM DB Tsai  wrote:
>>>
>>> Given Oracle's new 6-month release model, I feel the only realistic 
option is to only test and support JDK such as JDK 11 LTS and future LTS 
release. I would like to have a discussion on this in Spark community.
>>>
>>> Thanks,
>>>
>>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   
Apple, Inc
>>>
>>
>> --
>> Robert Stupp
>> @snazy
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org







Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-30 Thread Kazuaki Ishizaki
Hi Reynold,
Thank you for your comments. They are great points.

1) Yes, it is not easy to design an IR that is both expressive enough and 
simple enough. We can learn concepts from good examples like HyPer, Weld, 
and others; they are expressive and not complicated. The details cannot be 
captured yet.
2) Introducing another layer takes some time to learn. This SPIP tries to 
reduce that learning time by preparing clean APIs for constructing 
generated code. I will try to add some examples of APIs that are 
equivalent to the current string concatenations (e.g. "a" + " * " + "b" + 
" / " + "c").

It is more important for us to learn from failures than from successes. 
We would appreciate it if you could list the failures that you have seen.

Best Regards,
Kazuaki Ishizaki



From:   Reynold Xin 
To: Kazuaki Ishizaki 
Cc: Xiao Li , dev , 
Takeshi Yamamuro 
Date:   2018/10/26 03:46
Subject:Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



I have some pretty serious concerns over this proposal. I agree that there 
are many things that can be improved, but at the same time I also think 
the cost of introducing a new IR in the middle is extremely high. Having 
participated in designing some of the IRs in other systems, I've seen more 
failures than successes. The failures typically come from two sources: (1) 
in general it is extremely difficult to design IRs that are both 
expressive enough and are simple enough; (2) typically another layer of 
indirection increases the complexity a lot more, beyond the level of 
understanding and expertise that most contributors can obtain without 
spending years in the code base and learning about all the gotchas.

In either case, I'm not saying "no please don't do this". This is one of 
those cases in which the devils are in the details that cannot be captured 
by a high level document, and I want to explicitly express my concern 
here.




On Thu, Oct 25, 2018 at 12:10 AM Kazuaki Ishizaki  
wrote:
Hi Xiao,
Thank you very much for becoming a shepherd.
If you feel the discussion settles, we would appreciate it if you would 
start a voting.

Regards,
Kazuaki Ishizaki



From:Xiao Li 
To:Kazuaki Ishizaki 
Cc:dev , Takeshi Yamamuro <
linguin@gmail.com>
Date:2018/10/22 16:31
Subject:Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



Hi, Kazuaki, 

Thanks for your great SPIP! I am willing to be the shepherd of this SPIP. 

Cheers,

Xiao


On Mon, Oct 22, 2018 at 12:05 AM Kazuaki Ishizaki  
wrote:
Hi Yamamuro-san,
Thank you for your comments. This SPIP gets several valuable comments and 
feedback on Google Doc: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing
.
I hope that this SPIP could go forward based on these feedback.

Based on this SPIP procedure 
http://spark.apache.org/improvement-proposals.html, can I ask one or more 
PMCs to become a shepherd of this SPIP?
I would appreciate your kindness and cooperation. 

Best Regards,
Kazuaki Ishizaki



From:Takeshi Yamamuro 
To:Spark dev list 
Cc:ishiz...@jp.ibm.com
Date:2018/10/15 12:12
Subject:Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



Hi, ishizaki-san,

Cool activity, I left some comments on the doc.

best,
takeshi


On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki  
wrote:
Hello community,

I am writing this e-mail in order to start a discussion about adding 
structure intermediate representation for generating Java code from a 
program using DataFrame or Dataset API, in addition to the current 
String-based representation.
This addition is based on the discussions in a thread at 
https://github.com/apache/spark/pull/21537#issuecomment-413268196

Please feel free to comment on the JIRA ticket or Google Doc.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728
Google Doc: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing


Looking forward to hear your feedback

Best Regards,
Kazuaki Ishizaki


-- 
---
Takeshi Yamamuro



-- 






Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-25 Thread Kazuaki Ishizaki
Hi Xiao,
Thank you very much for becoming a shepherd.
If you feel the discussion settles, we would appreciate it if you would 
start a voting.

Regards,
Kazuaki Ishizaki



From:   Xiao Li 
To: Kazuaki Ishizaki 
Cc: dev , Takeshi Yamamuro 

Date:   2018/10/22 16:31
Subject:Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



Hi, Kazuaki, 

Thanks for your great SPIP! I am willing to be the shepherd of this SPIP. 

Cheers,

Xiao


On Mon, Oct 22, 2018 at 12:05 AM Kazuaki Ishizaki  
wrote:
Hi Yamamuro-san,
Thank you for your comments. This SPIP gets several valuable comments and 
feedback on Google Doc: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing
.
I hope that this SPIP could go forward based on these feedback.

Based on this SPIP procedure 
http://spark.apache.org/improvement-proposals.html, can I ask one or more 
PMCs to become a shepherd of this SPIP?
I would appreciate your kindness and cooperation. 

Best Regards,
Kazuaki Ishizaki



From:Takeshi Yamamuro 
To:Spark dev list 
Cc:ishiz...@jp.ibm.com
Date:2018/10/15 12:12
Subject:Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



Hi, ishizaki-san,

Cool activity, I left some comments on the doc.

best,
takeshi


On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki  
wrote:
Hello community,

I am writing this e-mail in order to start a discussion about adding 
structure intermediate representation for generating Java code from a 
program using DataFrame or Dataset API, in addition to the current 
String-based representation.
This addition is based on the discussions in a thread at 
https://github.com/apache/spark/pull/21537#issuecomment-413268196

Please feel free to comment on the JIRA ticket or Google Doc.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728
Google Doc: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing


Looking forward to hear your feedback

Best Regards,
Kazuaki Ishizaki


-- 
---
Takeshi Yamamuro



-- 





Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-22 Thread Kazuaki Ishizaki
Hi Yamamuro-san,
Thank you for your comments. This SPIP has received several valuable 
comments and feedback on the Google Doc: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing
I hope that this SPIP can go forward based on this feedback.

Based on this SPIP procedure 
http://spark.apache.org/improvement-proposals.html, can I ask one or more 
PMCs to become a shepherd of this SPIP?
I would appreciate your kindness and cooperation. 

Best Regards,
Kazuaki Ishizaki



From:   Takeshi Yamamuro 
To: Spark dev list 
Cc: ishiz...@jp.ibm.com
Date:   2018/10/15 12:12
Subject:Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



Hi, ishizaki-san,

Cool activity, I left some comments on the doc.

best,
takeshi


On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki  
wrote:
Hello community,

I am writing this e-mail in order to start a discussion about adding 
structure intermediate representation for generating Java code from a 
program using DataFrame or Dataset API, in addition to the current 
String-based representation.
This addition is based on the discussions in a thread at 
https://github.com/apache/spark/pull/21537#issuecomment-413268196

Please feel free to comment on the JIRA ticket or Google Doc.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728
Google Doc: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing


Looking forward to hear your feedback

Best Regards,
Kazuaki Ishizaki


-- 
---
Takeshi Yamamuro




SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki
Hello community,

I am writing this e-mail in order to start a discussion about adding 
structure intermediate representation for generating Java code from a 
program using DataFrame or Dataset API, in addition to the current 
String-based representation.
This addition is based on the discussions in a thread at 
https://github.com/apache/spark/pull/21537#issuecomment-413268196

Please feel free to comment on the JIRA ticket or Google Doc.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728
Google Doc: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing

Looking forward to hearing your feedback.

Best Regards,
Kazuaki Ishizaki



Re: Spark JIRA tags clarification and management

2018-09-04 Thread Kazuaki Ishizaki
Of course, we would like to eliminate all of the following tags

"flanky" or "flankytest"

Kazuaki Ishizaki



From:   Hyukjin Kwon 
To: dev 
Cc: Xiao Li , Wenchen Fan 
Date:   2018/09/04 14:20
Subject:Re: Spark JIRA tags clarification and management



Thanks, Reynold.

+Adding Xiao and Wenchen who I saw often used tags.

Would you have some tags you think we should document more?

On Tue, Sep 4, 2018 at 9:27 AM, Reynold Xin wrote:
The most common ones we do are:

releasenotes

correctness



On Mon, Sep 3, 2018 at 6:23 PM Hyukjin Kwon  wrote:
Thanks, Felix and Reynold. Would you guys mind if I ask this of anyone who 
uses the tags frequently? Frankly, I don't use the tags often.

On Tue, Sep 4, 2018 at 2:04 AM, Felix Cheung wrote:
+1 good idea.
There are a few for organizing but some also are critical to the release 
process, like rel note. Would be good to clarify.


From: Reynold Xin 
Sent: Sunday, September 2, 2018 11:50 PM
To: Hyukjin Kwon
Cc: dev
Subject: Re: Spark JIRA tags clarification and management 
 
It would be great to document the common ones.

On Sun, Sep 2, 2018 at 11:49 PM Hyukjin Kwon  wrote:
Hi all, 

I lately noticed that tags are often used to classify JIRAs. I was thinking we 
had better explicitly document which tags are used and explain what each tag 
means. For instance, we documented "Contributing to JIRA Maintenance" at 
https://spark.apache.org/contributing.html before (thanks, Sean Owen) - 
this helps me a lot in managing JIRAs, and it gives me, at least, a good 
standard for taking action.

It doesn't necessarily mean we should clarify everything, but it might be 
good to document the tags that are used often.

We can leave this for committer's scope as well, if that's preferred - I 
don't have a strong opinion on this. My point is, can we clarify this in 
the contributing guide so that we can reduce the maintenance cost?





Re: [SPARK ML] Minhash integer overflow

2018-07-07 Thread Kazuaki Ishizaki
Of course, the hash value can just be negative. I thought that it should be 
the result of a computation without overflow, though.

When I checked another implementation, it performs the computation with int.
https://github.com/ALShum/MinHashLSH/blob/master/LSH.java#L89

To Jiayuan (on cc): did you compare the hash values generated by Spark 
with those generated by other implementations?

Regards,
Kazuaki Ishizaki



From:   Sean Owen 
To: jiayuanm 
Cc: dev@spark.apache.org
Date:   2018/07/07 15:46
Subject:Re: [SPARK ML] Minhash integer overflow



I think it probably still does its job; the hash value can just be 
negative. It is likely to be very slightly biased, though. Because the 
intent doesn't seem to be to allow the overflow, it's worth changing the 
calculation to use longs. 

On Fri, Jul 6, 2018, 8:36 PM jiayuanm  wrote:
Hi everyone,

I was playing around with the LSH/MinHash module in Spark ML. I noticed
that the hash computation is done with Int (see
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69
).
Since "a" and "b" are from a uniform distribution of [1,
MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue,
it's likely for the multiplication to cause Int overflow with a large 
sparse
input vector.

I wonder if this is a bug or intended. If it's a bug, one way to fix it is
to compute the hashes with Long and insert a couple of mod
MinHashLSH.HASH_PRIME reductions. Because MinHashLSH.HASH_PRIME is chosen to be
smaller than sqrt(2^63 - 1), this won't overflow a 64-bit integer. Another option is
to use BigInteger.
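
For illustration, here is a minimal Scala sketch of the Long-plus-mod fix 
described above. It is not Spark's actual implementation: the object and 
method names are made up, the hash shape (a * (1 + elem) + b) mod prime only 
mirrors the computation discussed in this thread, and Int.MaxValue is used as 
an assumed stand-in for MinHashLSH.HASH_PRIME.

```
// Minimal sketch (not Spark's code) of computing the MinHash value in Long
// with mod reductions so that no intermediate product can overflow.
object MinHashLongSketch {
  // Int.MaxValue (2^31 - 1) happens to be prime and is smaller than
  // sqrt(2^63 - 1), so it serves here as a stand-in for MinHashLSH.HASH_PRIME.
  val HashPrime: Long = Int.MaxValue.toLong

  // One hash of the shape (a * (1 + elem) + b) mod HashPrime, evaluated
  // entirely in Long; each factor is reduced mod the prime first, so the
  // product stays below HashPrime^2 < 2^63 - 1.
  def hash(a: Long, b: Long, elem: Long): Long =
    ((a % HashPrime) * ((1 + elem) % HashPrime) % HashPrime + b % HashPrime) % HashPrime

  def main(args: Array[String]): Unit = {
    // These values would overflow a 32-bit Int multiplication,
    // but the Long computation above stays well defined and non-negative.
    val a = 2147480009L
    val b = 98765L
    val elem = 1999999999L
    println(hash(a, b, elem))
  }
}
```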

Let me know what you think.

Thanks,
Jiayuan





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Re: [SPARK ML] Minhash integer overflow

2018-07-06 Thread Kazuaki Ishizaki
Thank you for reporting this issue. I think this is a bug regarding 
integer overflow. IMHO, it would be good to compute the hashes with Long.

Would it be possible to create a JIRA entry?  Do you want to submit a pull 
request, too?

Regards,
Kazuaki Ishizaki



From:   jiayuanm 
To: dev@spark.apache.org
Date:   2018/07/07 10:36
Subject:[SPARK ML] Minhash integer overflow



Hi everyone,

I was playing around with the LSH/MinHash module in Spark ML. I noticed
that the hash computation is done with Int (see
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69
).
Since "a" and "b" are from a uniform distribution of [1,
MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue,
it's likely for the multiplication to cause Int overflow with a large 
sparse
input vector.

I wonder if this is a bug or intended. If it's a bug, one way to fix it is
to compute the hashes with Long and insert a couple of mod
MinHashLSH.HASH_PRIME reductions. Because MinHashLSH.HASH_PRIME is chosen to be
smaller than sqrt(2^63 - 1), this won't overflow a 64-bit integer. Another option is
to use BigInteger.

Let me know what you think.

Thanks,
Jiayuan





--
Sent from: 
http://apache-spark-developers-list.1001551.n3.nabble.com/


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org







Re: SparkR test failures in PR builder

2018-05-02 Thread Kazuaki Ishizaki
I am not familiar with SparkR or CRAN. However, I remember that we had a 
similar situation before.

Here is a great piece of work from that time. Having just revisited this PR, I 
think that we have a similar situation (i.e., a format error) again.
https://github.com/apache/spark/pull/20005

Any other comments are appreciated.

Regards,
Kazuaki Ishizaki



From:   Joseph Bradley 
To: dev 
Cc: Hossein Falaki 
Date:   2018/05/03 07:31
Subject:SparkR test failures in PR builder



Hi all,

Does anyone know why the PR builder keeps failing on SparkR's CRAN 
checks?  I've seen this in a lot of unrelated PRs.  E.g.: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90065/console

Hossein spotted this line:
```
* checking CRAN incoming feasibility ...Error in 
.check_package_CRAN_incoming(pkgdir) : 
  dims [product 24] do not match the length of object [0]
```
and suggested that it could be CRAN flakiness.  I'm not familiar with 
CRAN, but do others have thoughts about how to fix this?

Thanks!
Joseph

-- 
Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.





Re: Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread Kazuaki Ishizaki
Congratulations to Zhenhua!

Kazuaki Ishizaki



From:   sujith chacko 
To: Denny Lee 
Cc: Spark dev list , Wenchen Fan 
, "叶先进" 
Date:   2018/04/02 14:37
Subject:Re: Welcome Zhenhua Wang as a Spark committer



Congratulations zhenhua for this great achievement.

On Mon, 2 Apr 2018 at 11:05 AM, Denny Lee  wrote:
Awesome - congrats Zhenhua! 

On Sun, Apr 1, 2018 at 10:33 PM 叶先进  wrote:
Big congs.

> On Apr 2, 2018, at 1:28 PM, Wenchen Fan  wrote:
>
> Hi all,
>
> The Spark PMC recently added Zhenhua Wang as a committer on the project. 
Zhenhua is the major contributor of the CBO project, and has been 
contributing across several areas of Spark for a while, focusing 
especially on analyzer, optimizer in Spark SQL. Please join me in 
welcoming Zhenhua!
>
> Wenchen


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Re: Welcoming some new committers

2018-03-02 Thread Kazuaki Ishizaki
Congratulations to everyone!

Kazuaki Ishizaki



From:   Takeshi Yamamuro 
To: Spark dev list 
Date:   2018/03/03 10:45
Subject:Re: Welcoming some new committers



Congrats, all!

On Sat, Mar 3, 2018 at 10:34 AM, Takuya UESHIN  
wrote:
Congratulations and welcome!

On Sat, Mar 3, 2018 at 10:21 AM, Xingbo Jiang  
wrote:
Congratulations to everyone!

2018-03-03 8:51 GMT+08:00 Ilan Filonenko :
Congrats to everyone! :) 

On Fri, Mar 2, 2018 at 7:34 PM Felix Cheung  
wrote:
Congrats and welcome!


From: Dongjoon Hyun 
Sent: Friday, March 2, 2018 4:27:10 PM
To: Spark dev list
Subject: Re: Welcoming some new committers 
 
Congrats to all!

Bests,
Dongjoon.

On Fri, Mar 2, 2018 at 4:13 PM, Wenchen Fan  wrote:
Congratulations to everyone and welcome!

On Sat, Mar 3, 2018 at 7:26 AM, Cody Koeninger  wrote:
Congrats to the new committers, and I appreciate the vote of confidence.

On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia  
wrote:
> Hi everyone,
>
> The Spark PMC has recently voted to add several new committers to the 
project, based on their contributions to Spark 2.3 and other past work:
>
> - Anirudh Ramanathan (contributor to Kubernetes support)
> - Bryan Cutler (contributor to PySpark and Arrow support)
> - Cody Koeninger (contributor to streaming and Kafka support)
> - Erik Erlandson (contributor to Kubernetes support)
> - Matt Cheah (contributor to Kubernetes support and other parts of 
Spark)
> - Seth Hendrickson (contributor to MLlib and PySpark)
>
> Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth as 
committers!
>
> Matei
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org







-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin



-- 
---
Takeshi Yamamuro




Re: Whole-stage codegen and SparkPlan.newPredicate

2018-01-01 Thread Kazuaki Ishizaki
Thank you for your correction :)

I also made a mistake in my report. What I reported at first never occurs 
with the correct Java bean class.
Finally, I can reproduce the problem that Jacek reported, even using the 
master. In my environment, this problem occurs with or without whole-stage 
codegen. I updated the JIRA ticket.

I am still working on this.

Kazuaki Ishizaki



From:   Herman van Hövell tot Westerflier 
To: Kazuaki Ishizaki 
Cc: Jacek Laskowski , dev 
Date:   2018/01/02 04:12
Subject:Re: Whole-stage codegen and SparkPlan.newPredicate



Wrong ticket: https://issues.apache.org/jira/browse/SPARK-22935

Thanks for working on this :)

On Mon, Jan 1, 2018 at 2:22 PM, Kazuaki Ishizaki  
wrote:
I ran the program from the Stack Overflow URL with Spark 2.2.1 and master. I 
cannot see the exception even when I disable whole-stage codegen. Am I 
wrong?
We would appreciate it if you could create a JIRA entry with a simple 
standalone repro.

In addition to this report, I realized that this program produces 
incorrect results. I created a JIRA entry 
https://issues.apache.org/jira/browse/SPARK-22934.

Best Regards,
Kazuaki Ishizaki



From:Herman van Hövell tot Westerflier 
To:Jacek Laskowski 
Cc:dev 
Date:2017/12/31 21:44
Subject:Re: Whole-stage codegen and SparkPlan.newPredicate



Hi Jacek,

In this case whole stage code generation is turned off. However we still 
use code generation for a lot of other things: projections, predicates, 
orderings & encoders. You are currently seeing a compile time failure 
while generating a predicate. There is currently no easy way to turn code 
generation off entirely.
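
As a small illustration of that point, the sketch below (not taken from the 
thread; it assumes a local SparkSession and uses made-up names) disables 
whole-stage codegen via spark.sql.codegen.wholeStage, yet evaluating the 
filter still exercises a separately generated predicate:

```
import org.apache.spark.sql.SparkSession

object WholeStageOffStillCodegens {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("wholestage-off-demo")
      .config("spark.sql.codegen.wholeStage", "false") // turns off only whole-stage fusion
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("id").filter($"id" > 1)

    // The physical plan no longer contains WholeStageCodegen nodes ...
    df.explain()

    // ... but executing the filter still goes through a generated predicate
    // (code generation is also used for projections, orderings and encoders).
    df.show()

    spark.stop()
  }
}
```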

The error itself is not great, but it still captures the problem in a 
relatively timely fashion. We should have caught this during analysis 
though. Can you file a ticket?

- Herman

On Sat, Dec 30, 2017 at 9:16 AM, Jacek Laskowski  wrote:
Hi,

While working on an issue with Whole-stage codegen as reported @ 
https://stackoverflow.com/q/48026060/1305344 I found out 
that spark.sql.codegen.wholeStage=false does *not* turn whole-stage 
codegen off completely.


It looks like SparkPlan.newPredicate [1] gets called regardless of the 
value of spark.sql.codegen.wholeStage property.

$ ./bin/spark-shell --conf spark.sql.codegen.wholeStage=false
...
scala> spark.sessionState.conf.wholeStageEnabled
res7: Boolean = false

That leads to an issue in the SO question with whole-stage codegen 
regardless of the value:

...
  at 
org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:385)
  at 
org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(basicPhysicalOperators.scala:214)
  at 
org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(basicPhysicalOperators.scala:213)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:816)
 
...

Is this a bug or does it work as intended? Why?

[1] 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala?utf8=%E2%9C%93#L386


Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski









Re: Whole-stage codegen and SparkPlan.newPredicate

2018-01-01 Thread Kazuaki Ishizaki
I ran the program from the Stack Overflow URL with Spark 2.2.1 and master. I 
cannot see the exception even when I disable whole-stage codegen. Am I 
wrong?
We would appreciate it if you could create a JIRA entry with a simple 
standalone repro.

In addition to this report, I realized that this program produces 
incorrect results. I created a JIRA entry 
https://issues.apache.org/jira/browse/SPARK-22934.

Best Regards,
Kazuaki Ishizaki



From:   Herman van Hövell tot Westerflier 
To: Jacek Laskowski 
Cc: dev 
Date:   2017/12/31 21:44
Subject:Re: Whole-stage codegen and SparkPlan.newPredicate



Hi Jacek,

In this case whole stage code generation is turned off. However we still 
use code generation for a lot of other things: projections, predicates, 
orderings & encoders. You are currently seeing a compile time failure 
while generating a predicate. There is currently no easy way to turn code 
generation off entirely.

The error itself is not great, but it still captures the problem in a 
relatively timely fashion. We should have caught this during analysis 
though. Can you file a ticket?

- Herman

On Sat, Dec 30, 2017 at 9:16 AM, Jacek Laskowski  wrote:
Hi,

While working on an issue with Whole-stage codegen as reported @ 
https://stackoverflow.com/q/48026060/1305344 I found out 
that spark.sql.codegen.wholeStage=false does *not* turn whole-stage 
codegen off completely.

It looks like SparkPlan.newPredicate [1] gets called regardless of the 
value of spark.sql.codegen.wholeStage property.

$ ./bin/spark-shell --conf spark.sql.codegen.wholeStage=false
...
scala> spark.sessionState.conf.wholeStageEnabled
res7: Boolean = false

That leads to an issue in the SO question with whole-stage codegen 
regardless of the value:

...
  at 
org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:385)
  at 
org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(basicPhysicalOperators.scala:214)
  at 
org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(basicPhysicalOperators.scala:213)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:816)
 
...

Is this a bug or does it work as intended? Why?

[1] 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala?utf8=%E2%9C%93#L386

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski






Re: Timeline for Spark 2.3

2017-12-21 Thread Kazuaki Ishizaki
+1 for cutting a branch earlier.
In some Asian countries, January 1st, 2nd, and 3rd are holidays. 
https://www.timeanddate.com/holidays/
How about 4th or 5th?

Regards,
Kazuaki Ishizaki



From:   Felix Cheung 
To: Michael Armbrust , Holden Karau 

Cc: Sameer Agarwal , Erik Erlandson 
, dev 
Date:   2017/12/21 04:48
Subject:Re: Timeline for Spark 2.3



+1
I think the earlier we cut a branch the better.


From: Michael Armbrust 
Sent: Tuesday, December 19, 2017 4:41:44 PM
To: Holden Karau
Cc: Sameer Agarwal; Erik Erlandson; dev
Subject: Re: Timeline for Spark 2.3 
 
Do people really need to be around for the branch cut (modulo the person 
cutting the branch)? 

1st or 2nd doesn't really matter to me, but I am +1 kicking this off as 
soon as we enter the new year :)

Michael

On Tue, Dec 19, 2017 at 4:39 PM, Holden Karau  
wrote:
Sounds reasonable, although I'd choose the 2nd perhaps just since lots of 
folks are off on the 1st?

On Tue, Dec 19, 2017 at 4:36 PM, Sameer Agarwal  
wrote:
Let's aim for the 2.3 branch cut on 1st Jan and RC1 a week after that 
(i.e., week of 8th Jan)? 


On Fri, Dec 15, 2017 at 12:54 AM, Holden Karau  
wrote:
So personally I’d be in favour of pushing to early January; doing a 
release over the holidays is a little rough with herding all of the people to 
vote. 

On Thu, Dec 14, 2017 at 11:49 PM Erik Erlandson  
wrote:
I wanted to check in on the state of the 2.3 freeze schedule.  Original 
proposal was "late Dec", which is a bit open to interpretation.

We are working to get some refactoring done on the integration testing for 
the Kubernetes back-end in preparation for testing upcoming release 
candidates, however holiday vacation time is about to begin taking its 
toll both on upstream reviewing and on the "downstream" spark-on-kube 
fork.

If the freeze were pushed into January, that would take some of the pressure 
off the kube back-end upstreaming. Regardless, I was wondering if 
the dates could be clarified.
Cheers,
Erik


On Mon, Nov 13, 2017 at 5:13 PM, dji...@dataxu.com  
wrote:
Hi,

What is the process to request an issue/fix to be included in the next
release? Is there a place to vote for features?
I am interested in https://issues.apache.org/jira/browse/SPARK-13127, to 
see
if we can get Spark upgrade parquet to 1.9.0, which addresses the
https://issues.apache.org/jira/browse/PARQUET-686.
Can we include the fix in Spark 2.3 release?

Thanks,

Dong



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


-- 
Twitter: https://twitter.com/holdenkarau



-- 
Sameer Agarwal
Software Engineer | Databricks Inc.
http://cs.berkeley.edu/~sameerag



-- 
Twitter: https://twitter.com/holdenkarau





Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-29 Thread Kazuaki Ishizaki
+1 (non-binding)

I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for 
core/sql-core/sql-catalyst/mllib/mllib-local have passed.

$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 
1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

% build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 -T 
24 clean package install
% build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core 
-pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local
...
Run completed in 13 minutes, 54 seconds.
Total number of tests run: 1118
Suites: completed 170, aborted 0
Tests: succeeded 1118, failed 0, canceled 0, ignored 6, pending 0
All tests passed.
[INFO] 

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Core . SUCCESS [17:13 
min]
[INFO] Spark Project ML Local Library . SUCCESS [ 
6.065 s]
[INFO] Spark Project Catalyst . SUCCESS [11:51 
min]
[INFO] Spark Project SQL .. SUCCESS [17:55 
min]
[INFO] Spark Project ML Library ... SUCCESS [17:05 
min]
[INFO] 

[INFO] BUILD SUCCESS
[INFO] 

[INFO] Total time: 01:04 h
[INFO] Finished at: 2017-11-30T01:48:15+09:00
[INFO] Final Memory: 128M/329M
[INFO] 

[WARNING] The requested profile "hive" could not be activated because it 
does not exist.

Kazuaki Ishizaki



From:   Dongjoon Hyun 
To: Hyukjin Kwon 
Cc: Spark dev list , Felix Cheung 
, Sean Owen 
Date:   2017/11/29 12:56
Subject:Re: [VOTE] Spark 2.2.1 (RC2)



+1 (non-binding)

RC2 is tested on CentOS, too.

Bests,
Dongjoon.

On Tue, Nov 28, 2017 at 4:35 PM, Hyukjin Kwon  wrote:
+1

2017-11-29 8:18 GMT+09:00 Henry Robinson :
(My vote is non-binding, of course). 

On 28 November 2017 at 14:53, Henry Robinson  wrote:
+1, tests all pass for me on Ubuntu 16.04. 

On 28 November 2017 at 10:36, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:
+1

On Tue, Nov 28, 2017 at 7:35 PM, Felix Cheung  
wrote:
+1

Thanks Sean. Please vote!

Tested various scenarios with R package. Ubuntu, Debian, Windows r-devel 
and release and on r-hub. Verified CRAN checks are clean (only 1 
NOTE!) and no leaked files (.cache removed, /tmp clean)


On Sun, Nov 26, 2017 at 11:55 AM Sean Owen  wrote:
Yes it downloads recent releases. The test worked for me on a second try, 
so I suspect a bad mirror. If this comes up frequently we can just add 
retry logic, as the closer.lua script will return different mirrors each 
time.
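
A rough sketch of that retry idea is below. It is a hypothetical helper 
(object and method names are made up, and it assumes wget is available), not 
the suite's actual code: each attempt asks closer.lua for a preferred mirror, 
since repeated calls may hand back different mirrors.

```
import scala.sys.process._
import scala.util.Try

// Hypothetical helper illustrating the retry idea; not part of
// HiveExternalCatalogVersionsSuite.
object MirrorDownload {
  def downloadSparkWithRetry(version: String, dest: String, attempts: Int = 3): Boolean =
    (1 to attempts).exists { _ =>
      Try {
        // Each call to closer.lua may return a different preferred mirror.
        val mirror = Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true",
          "-q", "-O", "-").!!.trim
        val url = s"$mirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
        Seq("wget", url, "-q", "-O", dest).! == 0 // exit code 0 means the download succeeded
      }.getOrElse(false)
    }
}
```

For example, MirrorDownload.downloadSparkWithRetry("2.0.2", 
"/tmp/test-spark/spark-2.0.2-bin-hadoop2.7.tgz") would retry up to three 
mirrors before giving up.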

The tests all pass for me on the latest Debian, so +1 for this release.

(I committed the change to set -Xss4m for tests consistently, but this 
shouldn't block a release.)


On Sat, Nov 25, 2017 at 12:47 PM Felix Cheung  
wrote:
Ah, sorry - digging through the history, it looks like this was changed 
relatively recently and should only download previous releases.

Perhaps we are intermittently hitting a mirror that doesn’t have the 
files? 


https://github.com/apache/spark/commit/daa838b8886496e64700b55d1301d348f1d5c9ae


On Sat, Nov 25, 2017 at 10:36 AM Felix Cheung  
wrote:
Thanks Sean.

For the second one, it looks like the HiveExternalCatalogVersionsSuite is 
trying to download the release tgz from the official Apache mirror, which 
won’t work unless the release is actually released?



val preferredMirror =
  Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true", "-q",
    "-O", "-").!!.trim

val url =
  s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"



It’s probably getting an error page instead.


On Sat, Nov 25, 2017 at 10:28 AM Sean Owen  wrote:
I hit the same StackOverflowError as in the previous RC test, but, pretty 
sure this is just because the increased thread stack size JVM flag isn't 
applied consistently. This seems to resolve it:

https://github.com/apache/spark/pull/19820

This wouldn't block release IMHO.


I am currently investigating this failure though -- seems like the 
mechanism that downloads Spark tarballs needs fixing, or updating, in the 
2.2 branch?

HiveExternalCatalogVersionsSuite:
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
*** RUN ABORTED ***
  java.io.IOException: Cannot run program "./bin/spark-submit" (in 
directory "/tmp/test-spark/spark-2.0.2"): error=2, No such file or 
directory

On Sat, Nov 25, 2017 at 12:34 AM Felix Cheung  
wrote:
Please vote on releasing the following candidate as Apach

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Kazuaki Ishizaki
+1 (non-binding)

I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for 
core/sql-core/sql-catalyst/mllib/mllib-local have passed.

$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 
1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

% build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 -T 
24 clean package install
% build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core 
-pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local
...
Run completed in 12 minutes, 19 seconds.
Total number of tests run: 1035
Suites: completed 166, aborted 0
Tests: succeeded 1035, failed 0, canceled 0, ignored 5, pending 0
All tests passed.
[INFO] 

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Core . SUCCESS [17:13 
min]
[INFO] Spark Project ML Local Library . SUCCESS [ 
5.759 s]
[INFO] Spark Project Catalyst . SUCCESS [09:48 
min]
[INFO] Spark Project SQL .. SUCCESS [12:01 
min]
[INFO] Spark Project ML Library ... SUCCESS [15:16 
min]
[INFO] 

[INFO] BUILD SUCCESS
[INFO] 

[INFO] Total time: 54:28 min
[INFO] Finished at: 2017-10-03T23:53:33+09:00
[INFO] Final Memory: 112M/322M
[INFO] 

[WARNING] The requested profile "hive" could not be activated because it 
does not exist.

Kazuaki Ishizaki




From:   Dongjoon Hyun 
To: Spark dev list 
Date:   2017/10/03 23:23
Subject:Re: [VOTE] Spark 2.1.2 (RC4)



+1 (non-binding)

Dongjoon.

On Tue, Oct 3, 2017 at 5:13 AM, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:
+1

On Tue, Oct 3, 2017 at 1:32 PM, Sean Owen  wrote:
+1 same as last RC. Tests pass, sigs and hashes are OK.

On Tue, Oct 3, 2017 at 7:24 AM Holden Karau  wrote:
Please vote on releasing the following candidate as Apache Spark 
version 2.1.2. The vote is open until Saturday October 7th at 9:00 PST and 
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.2
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.1.2-rc4 (
2abaea9e40fce81cd4626498e0f5c28a70917499)

List of JIRA tickets resolved in this release can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~holden/spark-2.1.2-rc4-bin/

Release artifacts are signed with a key from:
https://people.apache.org/~holden/holdens_keys.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1252

The documentation corresponding to this release can be found at:
https://people.apache.org/~holden/spark-2.1.2-rc4-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks; in Java/Scala you can 
add the staging repository to your project's resolvers and test with 
the RC (make sure to clean up the artifact cache before/after so you don't 
end up building with an out-of-date RC going forward).
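
For example (a sketch only, not an official instruction), a downstream sbt 
build could resolve the RC artifacts from the staging repository listed above 
while testing:

```
// build.sbt fragment (illustrative): point resolvers at the RC4 staging repo
resolvers += "Apache Spark 2.1.2 RC4 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1252"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.2" % "provided"
```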

What should happen to JIRA tickets still targeting 2.1.2?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.1.3.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release 
unless the bug in question is a regression from 2.1.1. That being said if 
there is something which is a regression from 2.1.1 that has not been 
correctly targeted please ping a committer to help target the issue (you 
can see the open issues listed as impacting Spark 2.1.1 & 2.1.2)

What are the unresolved issues targeted for 2.1.2?

At this time there are no open unresolved issues.

Is there anything different about this release?

This is the first release in a while not built on the AMPLAB Jenkins. This 
is good because it means future releases can more easily be built and 
signed securely (and I've been updating the documentation in 
https://github.com/apache/spark-website/pull/66 as I progress), however 
the chances of a mistake are higher with any change like this. If there 
something you normal

Re: Welcoming Tejas Patil as a Spark committer

2017-09-30 Thread Kazuaki Ishizaki
Congratulations, Tejas!

Kazuaki Ishizaki



From:   Matei Zaharia 
To: "dev@spark.apache.org" 
Date:   2017/09/30 04:58
Subject:Welcoming Tejas Patil as a Spark committer



Hi all,

The Spark PMC recently added Tejas Patil as a committer on the
project. Tejas has been contributing across several areas of Spark for
a while, focusing especially on scalability issues and SQL. Please
join me in welcoming Tejas!

Matei

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org






Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-27 Thread Kazuaki Ishizaki
+1 (non-binding)

I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for 
core/sql-core/sql-catalyst/mllib/mllib-local have passed.

$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 
1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

% build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 -T 
24 clean package install
% build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core 
-pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local
...
Run completed in 12 minutes, 42 seconds.
Total number of tests run: 1035
Suites: completed 166, aborted 0
Tests: succeeded 1035, failed 0, canceled 0, ignored 5, pending 0
All tests passed.
[INFO] 

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Core . SUCCESS [17:14 
min]
[INFO] Spark Project ML Local Library . SUCCESS [ 
4.067 s]
[INFO] Spark Project Catalyst . SUCCESS [08:23 
min]
[INFO] Spark Project SQL .. SUCCESS [10:50 
min]
[INFO] Spark Project ML Library ... SUCCESS [15:45 
min]
[INFO] 

[INFO] BUILD SUCCESS
[INFO] 

[INFO] Total time: 52:20 min
[INFO] Finished at: 2017-09-28T12:16:46+09:00
[INFO] Final Memory: 103M/309M
[INFO] 

[WARNING] The requested profile "hive" could not be activated because it 
does not exist.

Kazuaki Ishizaki



From:   Dongjoon Hyun 
To: Denny Lee 
Cc: Sean Owen , Holden Karau 
, "dev@spark.apache.org" 
Date:   2017/09/28 07:57
Subject:Re: [VOTE] Spark 2.1.2 (RC2)



+1 (non-binding)

Bests,
Dongjoon.


On Wed, Sep 27, 2017 at 7:54 AM, Denny Lee  wrote:
+1 (non-binding)


On Wed, Sep 27, 2017 at 6:54 AM Sean Owen  wrote:
+1

I tested the source release.
Hashes and signature (your signature) check out, project builds and tests 
pass with -Phadoop-2.7 -Pyarn -Phive -Pmesos on Debian 9.
List of issues look good and there are no open issues at all for 2.1.2.

Great work on improving the build process and docs.


On Wed, Sep 27, 2017 at 5:47 AM Holden Karau  wrote:
Please vote on releasing the following candidate as Apache Spark 
version 2.1.2. The vote is open until Wednesday October 4th at 23:59 
PST and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.2
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.1.2-rc2 (
fabbb7f59e47590114366d14e15fbbff8c88593c)

List of JIRA tickets resolved in this release can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~holden/spark-2.1.2-rc2-bin/

Release artifacts are signed with a key from:
https://people.apache.org/~holden/holdens_keys.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1251

The documentation corresponding to this release can be found at:
https://people.apache.org/~holden/spark-2.1.2-rc2-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks; in Java/Scala you can 
add the staging repository to your project's resolvers and test with 
the RC (make sure to clean up the artifact cache before/after so you don't 
end up building with an out-of-date RC going forward).

What should happen to JIRA tickets still targeting 2.1.2?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.1.3.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release 
unless the bug in question is a regression from 2.1.1. That being said if 
there is something which is a regression from 2.1.1 that has not been 
correctly targeted please ping a committer to help target the issue (you 
can see the open issues listed as impacting Spark 2.1.1 & 2.1.2)

What are the unresolved issues targeted for 2.1.2?

At this time there are no open unresolved issues.

Is there anything different about this release?

This is the first release in a while not built on the AMPLAB Jenkins. This 
is good because it means future 

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Kazuaki Ishizaki
Congratulations, Jerry!

Kazuaki Ishizaki



From:   Hyukjin Kwon 
To: dev 
Date:   2017/08/29 12:24
Subject:Re: Welcoming Saisai (Jerry) Shao as a committer



Congratulations! Very well deserved.

2017-08-29 11:41 GMT+09:00 Liwei Lin :
Congratulations, Jerry!

Cheers,
Liwei

On Tue, Aug 29, 2017 at 10:15 AM, 蒋星博  wrote:
congs!

Takeshi Yamamuro wrote on Mon, Aug 28, 2017 at 7:11 PM:
Congrats!

On Tue, Aug 29, 2017 at 11:04 AM, zhichao  wrote:
Congratulations, Jerry!

On Tue, Aug 29, 2017 at 9:57 AM, Weiqing Yang  
wrote:
Congratulations, Jerry!

On Mon, Aug 28, 2017 at 6:44 PM, Yanbo Liang  wrote:
Congratulations, Jerry.

On Tue, Aug 29, 2017 at 9:42 AM, John Deng  wrote:

Congratulations, Jerry !

On 8/29/2017 09:28,Matei Zaharia wrote: 
Hi everyone, 

The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai has 
been contributing to many areas of the project for a long time, so it
’s great to see him join. Join me in thanking and congratulating him! 

Matei 
- 
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 







-- 
---
Takeshi Yamamuro






Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-07 Thread Kazuaki Ishizaki
Congratulations, Hyukjin and Sameer, well deserved!!

Kazuaki Ishizaki



From:   Matei Zaharia 
To: dev 
Date:   2017/08/08 00:53
Subject:Welcoming Hyukjin Kwon and Sameer Agarwal as committers



Hi everyone,

The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal as 
committers. Join me in congratulating both of them and thanking them for 
their contributions to the project!

Matei
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org






Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-06-30 Thread Kazuaki Ishizaki
+1 (non-binding)

I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for 
core/sql-core/sql-catalyst/mllib/mllib-local have passed.

$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 
1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)

% build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 -T 
24 clean package install
% build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core 
-pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local
...
Run completed in 15 minutes, 3 seconds.
Total number of tests run: 1113
Suites: completed 170, aborted 0
Tests: succeeded 1113, failed 0, canceled 0, ignored 6, pending 0
All tests passed.
[INFO] 

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Core . SUCCESS [17:24 
min]
[INFO] Spark Project ML Local Library . SUCCESS [ 
7.161 s]
[INFO] Spark Project Catalyst . SUCCESS [11:55 
min]
[INFO] Spark Project SQL .. SUCCESS [18:38 
min]
[INFO] Spark Project ML Library ... SUCCESS [18:17 
min]
[INFO] 

[INFO] BUILD SUCCESS
[INFO] 

[INFO] Total time: 01:06 h
[INFO] Finished at: 2017-07-01T15:20:04+09:00
[INFO] Final Memory: 56M/591M
[INFO] 

[WARNING] The requested profile "hive" could not be activated because it 
does not exist.

Kazuaki Ishizaki




From:   Michael Armbrust 
To: "dev@spark.apache.org" 
Date:   2017/07/01 10:45
Subject:[VOTE] Apache Spark 2.2.0 (RC6)



Please vote on releasing the following candidate as Apache Spark version 
2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and 
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.2.0-rc6 (
a2c7b2133cfee7fa9abfaa2bfbfb637155466783)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1245/

The documentation corresponding to this release can be found at:
https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

What should happen to JIRA tickets still targeting 2.2.0?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.3.0 or 2.2.1.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release 
unless the bug in question is a regression from 2.1.1.




Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-05 Thread Kazuaki Ishizaki
+1 (non-binding)

I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for 
core have passed.

$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 
1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
$ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 
package install
$ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
...
Run completed in 15 minutes, 30 seconds.
Total number of tests run: 1959
Suites: completed 206, aborted 0
Tests: succeeded 1959, failed 0, canceled 4, ignored 8, pending 0
All tests passed.
[INFO] 

[INFO] BUILD SUCCESS
[INFO] 

[INFO] Total time: 17:16 min
[INFO] Finished at: 2017-06-06T13:44:48+09:00
[INFO] Final Memory: 53M/510M
[INFO] 

[WARNING] The requested profile "hive" could not be activated because it 
does not exist.

Kazuaki Ishizaki



From:   Michael Armbrust 
To: "dev@spark.apache.org" 
Date:   2017/06/06 04:15
Subject:[VOTE] Apache Spark 2.2.0 (RC4)



Please vote on releasing the following candidate as Apache Spark version 
2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00 PST and 
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc4 (
377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1241/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

What should happen to JIRA tickets still targeting 2.2.0?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.3.0 or 2.2.1.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release 
unless the bug in question is a regression from 2.1.1.




Re: [build system] jenkins got itself wedged...

2017-05-21 Thread Kazuaki Ishizaki
It had been looking good these days. However, it seems to be going down slowly again...

When I tried to see a console log (e.g. 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull
), the server returned a "proxy error."

Regards,
Kazuaki Ishizaki



From:   shane knapp 
To: Sean Owen 
Cc: dev 
Date:   2017/05/20 09:43
Subject:Re: [build system] jenkins got itself wedged...



last update of the week:

things are looking great...  we're GCing happily and staying well
within our memory limits.

i'm going to do one more restart after the two pull request builds
finish to re-enable backups, and call it a weekend.  :)

shane

On Fri, May 19, 2017 at 8:29 AM, shane knapp  wrote:
> this is hopefully my final email on the subject...   :)
>
> things have seemed to settled down after my GC tuning, and system
> load/cpu usage/memory has been nice and flat all night.  i'll continue
> to keep an eye on things but it looks like we've weathered the worst
> part of the storm.
>
> On Thu, May 18, 2017 at 6:40 PM, shane knapp  
wrote:
>> after needing another restart this afternoon, i did some homework and
>> aggressively twiddled some GC settings[1].  since then, things have
>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>
>> i've attached a screenshot of slightly happier looking graphs.
>>
>> still keeping an eye on things, and hoping that i can go back to being
>> a lurker...  ;)
>>
>> shane
>>
>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>
>> On Thu, May 18, 2017 at 11:20 AM, shane knapp  
wrote:
>>> ok, more updates:
>>>
>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>> balancing ftw.
>>>
>>> 2) the jenkins master is now running on java8, which has moar bettar
>>> GC management under the hood.
>>>
>>> i'll be keeping an eye on this today, and if we start seeing GC
>>> overhead failures, i'll start doing more GC performance tuning.
>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> shane
>>>
>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp  
wrote:
>>>> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY 
i'm
>>>> getting some error messages in the logs...   looks like jenkins is
>>>> thrashing on GC.
>>>>
>>>> now that i know what's up, i should be able to get this sorted today.
>>>>
>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen  
wrote:
>>>>> I'm not sure if it's related, but I still can't get Jenkins to test 
PRs. For
>>>>> example, triggering it through the spark-prs.appspot.com UI gives 
me...
>>>>>
>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>
>>>>> Internal Server Error
>>>>>
>>>>> That might be from the appspot app though?
>>>>>
>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work, 
and I
>>>>> can't reach Jenkins:
>>>>> 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

>>>>>
>>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp  
wrote:
>>>>>>
>>>>>> after another couple of restarts due to high load and system
>>>>>> unresponsiveness, i finally found what is the most likely culprit:
>>>>>>
>>>>>> a typo in the jenkins config where the java heap size was 
configured.
>>>>>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain 
the
>>>>>> random and non-deterministic system hangs we've had over the past
>>>>>> couple of years.
>>>>>>
>>>>>> anyways, it's been corrected and the master seems to be humming 
along,
>>>>>> for real this time, w/o issue.  i'll continue to keep an eye on 
this
>>>>>> for the rest of the week, but things are looking MUCH better now.
>>>>>>
>>>>>> sorry again for the interruptions in service.
>>>>>>
>>>>>> shane
>>>>>>
>>>>

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-09 Thread Kazuaki Ishizaki
+1 (non-binding)

I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for 
core have passed.

$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 
1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
$ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 
package install
$ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
...
Run completed in 15 minutes, 12 seconds.
Total number of tests run: 1940
Suites: completed 206, aborted 0
Tests: succeeded 1940, failed 0, canceled 4, ignored 8, pending 0
All tests passed.
[INFO] 

[INFO] BUILD SUCCESS
[INFO] 

[INFO] Total time: 16:51 min
[INFO] Finished at: 2017-05-09T17:51:04+09:00
[INFO] Final Memory: 53M/514M
[INFO] 

[WARNING] The requested profile "hive" could not be activated because it 
does not exist.


Kazuaki Ishizaki,



From:   Michael Armbrust 
To: "dev@spark.apache.org" 
Date:   2017/05/05 02:08
Subject:[VOTE] Apache Spark 2.2.0 (RC2)



Please vote on releasing the following candidate as Apache Spark version 
2.2.0. The vote is open until Tues, May 9th, 2017 at 12:00 PST and passes 
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc2 (
1d4017b44d5e6ad156abeaae6371747f111dd1f9)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1236/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

What should happen to JIRA tickets still targeting 2.2.0?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.3.0 or 2.2.1.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release 
unless the bug in question is a regression from 2.1.1.




Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-28 Thread Kazuaki Ishizaki
+1 (non-binding)

I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for 
core have passed.

$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 
1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
$ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 
package install
$ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
...
Run completed in 15 minutes, 45 seconds.
Total number of tests run: 1937
Suites: completed 205, aborted 0
Tests: succeeded 1937, failed 0, canceled 4, ignored 8, pending 0
All tests passed.
[INFO] 

[INFO] BUILD SUCCESS
[INFO] 

[INFO] Total time: 17:26 min
[INFO] Finished at: 2017-04-29T02:23:08+09:00
[INFO] Final Memory: 53M/491M
[INFO] 
----

Kazuaki Ishizaki,



From:   Michael Armbrust 
To: "dev@spark.apache.org" 
Date:   2017/04/28 03:32
Subject:[VOTE] Apache Spark 2.2.0 (RC1)



Please vote on releasing the following candidate as Apache Spark version 
2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes 
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc1 (
8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1235/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

What should happen to JIRA tickets still targeting 2.2.0?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.3.0 or 2.2.1.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release 
unless the bug in question is a regression from 2.1.1.




Re: [VOTE] Apache Spark 2.1.1 (RC4)

2017-04-28 Thread Kazuaki Ishizaki
+1 (non-binding)

I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for 
core have passed.

$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 
1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
$ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 
package install
$ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
...
Total number of tests run: 1788
Suites: completed 198, aborted 0
Tests: succeeded 1788, failed 0, canceled 4, ignored 8, pending 0
All tests passed.
[INFO] 

[INFO] BUILD SUCCESS
[INFO] 

[INFO] Total time: 16:30 min
[INFO] Finished at: 2017-04-29T01:02:29+09:00
[INFO] Final Memory: 54M/576M
[INFO] 


Regards,
Kazuaki Ishizaki, 



From:   Michael Armbrust 
To: "dev@spark.apache.org" 
Date:   2017/04/27 09:30
Subject:[VOTE] Apache Spark 2.1.1 (RC4)



Please vote on releasing the following candidate as Apache Spark version 
2.1.1. The vote is open until Sat, April 29th, 2017 at 18:00 PST and 
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.1
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.1-rc4 (
267aca5bd5042303a718d10635bc0d1a1596853f)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1232/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

What should happen to JIRA tickets still targeting 2.1.1?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.1.2 or 2.2.0.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release 
unless the bug in question is a regression from 2.1.0.

What happened to RC1?

There were issues with the release packaging and as a result it was skipped.




Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-19 Thread Kazuaki Ishizaki
+1 (non-binding)

I tested it on Ubuntu 16.04 and openjdk8 on ppc64le. All of the tests for 
core have passed.

$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 
1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
$ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 
package install
$ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
...
Total number of tests run: 1788
Suites: completed 198, aborted 0
Tests: succeeded 1788, failed 0, canceled 4, ignored 8, pending 0
All tests passed.
[INFO] 

[INFO] BUILD SUCCESS
[INFO] 

[INFO] Total time: 16:38 min
[INFO] Finished at: 2017-04-19T18:17:43+09:00
[INFO] Final Memory: 56M/672M
[INFO] 


Regards,
Kazuaki Ishizaki,



From:   Michael Armbrust 
To: "dev@spark.apache.org" 
Date:   2017/04/19 04:00
Subject:[VOTE] Apache Spark 2.1.1 (RC3)



Please vote on releasing the following candidate as Apache Spark version 
2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and 
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.1
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.1-rc3 (
2ed19cff2f6ab79a718526e5d16633412d8c4dd4)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1230/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

What should happen to JIRA tickets still targeting 2.1.1?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.1.2 or 2.2.0.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release 
unless the bug in question is a regression from 2.1.0.

What happened to RC1?

There were issues with the release packaging and as a result it was skipped.




Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-02 Thread Kazuaki Ishizaki
Thank you. Yes, it is not a regression. 2.1.0 would have this failure, 
too.

Regards,
Kazuaki Ishizaki



From:   Sean Owen 
To: Kazuaki Ishizaki/Japan/IBM@IBMJP, Michael Armbrust 

Cc: "dev@spark.apache.org" 
Date:   2017/04/02 18:18
Subject:Re: [VOTE] Apache Spark 2.1.1 (RC2)



That backport is fine, for another RC even in my opinion, but it's not a 
regression. It's a JDK bug really. 2.1.0 would have failed too.

On Sun, Apr 2, 2017 at 8:20 AM Kazuaki Ishizaki  
wrote:
-1 (non-binding)

I tested it on Ubuntu 16.04 and openjdk8 on ppc64le. I got several errors.
I expect that this backport (https://github.com/apache/spark/pull/17509) 
will be integrated into Spark 2.1.1.





Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-02 Thread Kazuaki Ishizaki
-1 (non-binding)

I tested it on Ubuntu 16.04 and openjdk8 on ppc64le. I got several errors.
I expect that this backport (https://github.com/apache/spark/pull/17509) 
will be integrated into Spark 2.1.1.


$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 
1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
$ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 
package install
$ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
...
---
 T E S T S
---
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; 
support was removed in 8.0
Running org.apache.spark.memory.TaskMemoryManagerSuite
Tests run: 6, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.445 sec 
<<< FAILURE! - in org.apache.spark.memory.TaskMemoryManagerSuite
encodePageNumberAndOffsetOffHeap(org.apache.spark.memory.TaskMemoryManagerSuite)
 
 Time elapsed: 0.007 sec  <<< ERROR!
java.lang.IllegalArgumentException: requirement failed: No support for 
unaligned Unsafe. Set spark.memory.offHeap.enabled to false.
at 
org.apache.spark.memory.TaskMemoryManagerSuite.encodePageNumberAndOffsetOffHeap(TaskMemoryManagerSuite.java:48)

offHeapConfigurationBackwardsCompatibility(org.apache.spark.memory.TaskMemoryManagerSuite)
 
 Time elapsed: 0.013 sec  <<< ERROR!
java.lang.IllegalArgumentException: requirement failed: No support for 
unaligned Unsafe. Set spark.memory.offHeap.enabled to false.
at 
org.apache.spark.memory.TaskMemoryManagerSuite.offHeapConfigurationBackwardsCompatibility(TaskMemoryManagerSuite.java:138)

Running org.apache.spark.io.NioBufferedFileInputStreamSuite
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.029 sec 
- in org.apache.spark.io.NioBufferedFileInputStreamSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Tests run: 13, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 4.708 sec 
<<< FAILURE! - in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
testPeakMemoryUsed(org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite) 
 Time elapsed: 0.006 sec  <<< FAILURE!
java.lang.AssertionError: expected:<16648> but was:<16912>

Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Tests run: 13, Failures: 0, Errors: 13, Skipped: 0, Time elapsed: 0.043 
sec <<< FAILURE! - in 
org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
failureToGrow(org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite) 
Time elapsed: 0.002 sec  <<< ERROR!
java.lang.IllegalArgumentException: requirement failed: No support for 
unaligned Unsafe. Set spark.memory.offHeap.enabled to false.
...
Tests run: 207, Failures: 7, Errors: 16, Skipped: 0

Kazuaki Ishizaki



From:   Michael Armbrust 
To: "dev@spark.apache.org" 
Date:   2017/03/31 08:10
Subject:[VOTE] Apache Spark 2.1.1 (RC2)



Please vote on releasing the following candidate as Apache Spark version 
2.1.0. The vote is open until Sun, April 2nd, 2018 at 16:30 PST and passes 
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.1
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.1-rc2 (
02b165dcc2ee5245d1293a375a31660c9d4e1fa6)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1227/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

What should happen to JIRA tickets still targeting 2.1.1?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.1.2 or 2.2.0.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release 
unless the bug in question is a regression from 2.1.0.

What happened to RC1?

There were issues with the release packaging and as a result was skipped.




Re: Why are DataFrames always read with nullable=True?

2017-03-20 Thread Kazuaki Ishizaki
Hi,
Regarding the reading part for nullable, adding a data cleaning step seems to 
be under consideration, as Xiao said at 
https://www.mail-archive.com/user@spark.apache.org/msg39233.html.

Here is a PR https://github.com/apache/spark/pull/17293 to add the data 
cleaning step that throws an exception if a null exists in a non-nullable column.
Any comments are appreciated.
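
(For illustration only, a rough sketch of the idea of such a cleaning step, 
not the actual code in the PR: validate a loaded DataFrame against its 
non-nullable schema before trusting it. The helper name is made up.)

// Rough sketch: fail fast if a column declared non-nullable contains nulls.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def checkNonNullable(df: DataFrame): DataFrame = {
  df.schema.fields.filterNot(_.nullable).foreach { f =>
    val nullCount = df.filter(col(f.name).isNull).count()
    require(nullCount == 0,
      s"Column ${f.name} is declared non-nullable but has $nullCount nulls")
  }
  df
}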

Kazuaki Ishizaki



From:   Jason White 
To: dev@spark.apache.org
Date:   2017/03/21 06:31
Subject:Why are DataFrames always read with nullable=True?



If I create a dataframe in Spark with non-nullable columns, and then save
that to disk as a Parquet file, the columns are properly marked as
non-nullable. I confirmed this using parquet-tools. Then, when loading it
back, Spark forces the nullable back to True.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L378


If I remove the `.asNullable` part, Spark performs exactly as I'd like by
default, picking up the data using the schema either in the Parquet file 
or
provided by me.

This particular LoC goes back a year now, and I've seen a variety of
discussions about this issue. In particular with Michael here:
https://www.mail-archive.com/user@spark.apache.org/msg39230.html. Those
seemed to be discussing writing, not reading, though, and writing is 
already
supported now.

Is this functionality still desirable? Is it potentially not applicable 
for
all file formats and situations (e.g. HDFS/Parquet)? Would it be suitable 
to
pass an option to the DataFrameReader to disable this functionality?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Why-are-DataFrames-always-read-with-nullable-True-tp21207.html

Sent from the Apache Spark Developers List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org






Re: A DataFrame cache bug

2017-02-21 Thread Kazuaki Ishizaki
Hi,
Thank you for pointing out the JIRA.
I think that this JIRA suggests you to insert 
"spark.catalog.refreshByPath(dir)".

val dir = "/tmp/test"
spark.range(100).write.mode("overwrite").parquet(dir)
val df = spark.read.parquet(dir)
df.count // output 100 which is correct
f(df).count // output 89 which is correct

spark.range(1000).write.mode("overwrite").parquet(dir)
spark.catalog.refreshByPath(dir)  // insert a NEW statement
val df1 = spark.read.parquet(dir)
df1.count // output 1000 which is correct; in fact, other operations except 
df1.filter("id>10") return correct results.
f(df1).count // output 89 which is incorrect

Regards,
Kazuaki Ishizaki



From:   gen tang 
To: dev@spark.apache.org
Date:   2017/02/22 15:02
Subject:Re: A DataFrame cache bug



Hi All,

I might find a related issue on jira:

https://issues.apache.org/jira/browse/SPARK-15678

This issue is closed, may be we should reopen it.

Thanks 

Cheers
Gen


On Wed, Feb 22, 2017 at 1:57 PM, gen tang  wrote:
Hi All,

I found a strange bug which is related with reading data from a updated 
path and cache operation.
Please consider the following code:

import org.apache.spark.sql.DataFrame

def f(data: DataFrame): DataFrame = {
  val df = data.filter("id>10")
  df.cache
  df.count
  df
}

f(spark.range(100).asInstanceOf[DataFrame]).count // output 89 which is 
correct
f(spark.range(1000).asInstanceOf[DataFrame]).count // output 989 which is 
correct

val dir = "/tmp/test"
spark.range(100).write.mode("overwrite").parquet(dir)
val df = spark.read.parquet(dir)
df.count // output 100 which is correct
f(df).count // output 89 which is correct

spark.range(1000).write.mode("overwrite").parquet(dir)
val df1 = spark.read.parquet(dir)
df1.count // output 1000 which is correct; in fact, other operations except 
df1.filter("id>10") return correct results.
f(df1).count // output 89 which is incorrect

In fact when we use df1.filter("id>10"), spark will however use old cached 
dataFrame

Any idea? Thanks a lot

Cheers
Gen





Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Kazuaki Ishizaki
Congrats!

Kazuaki Ishizaki



From:   Reynold Xin 
To: "dev@spark.apache.org" 
Date:   2017/02/14 04:18
Subject:welcoming Takuya Ueshin as a new Apache Spark committer



Hi all,

Takuya-san has recently been elected an Apache Spark committer. He's been 
active in the SQL area and writes very small, surgical patches that are 
high quality. Please join me in congratulating Takuya-san!






Re: Spark performance tests

2017-01-10 Thread Kazuaki Ishizaki
Hi,
You may find several micro-benchmarks under 
https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark
.

Regards,
Kazuaki Ishizaki



From:   Prasun Ratn 
To: Apache Spark Dev 
Date:   2017/01/10 12:52
Subject:Spark performance tests



Hi

Are there performance tests or microbenchmarks for Spark - especially
directed towards the CPU specific parts? I looked at spark-perf but
that doesn't seem to have been updated recently.

Thanks
Prasun

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org






Re: Quick request: prolific PR openers, review your open PRs

2017-01-08 Thread Kazuaki Ishizaki
Sure, I updated status of some PRs.

Regards,
Kazuaki Ishizaki



From:   Sean Owen 
To: dev 
Date:   2017/01/04 21:37
Subject:Quick request: prolific PR openers, review your open PRs



Just saw that there are many people with >= 8 open PRs. Some are 
legitimately in flight but many are probably stale. To set a good example, 
would (everyone) mind flicking through what they've got open and see if 
some PRs are stale and should be closed?

https://spark-prs.appspot.com/users

Username
Open PRs ▴
viirya
13
hhbyyh
12
zhengruifeng
12
HyukjinKwon
12
maropu
10
kiszk
10
yanboliang
10
cloud-fan
8
jerryshao
8














Sharing data in columnar storage between two applications

2016-12-25 Thread Kazuaki Ishizaki
Here is an interesting discussion to share data in columnar storage 
between two applications.
https://github.com/apache/spark/pull/15219#issuecomment-265835049

One of the ideas is to prepare separate interfaces (or traits) for read and 
write. Each application can then implement only the class for what it wants 
to do (e.g. read or write). For example, FiloDB wants to provide a columnar 
storage that can be read from Spark. In that case, it is easy to implement 
only the read APIs for Spark. The two classes can be prepared separately.
However, this may lead to an incompatibility in ColumnarBatch. ColumnarBatch 
keeps a set of ColumnVector instances that can be read or written, so the 
ColumnVector class currently has both read and write APIs. How can we plug in 
a new ColumnVector with only read APIs? Here is an example that causes the 
incompatibility: 
https://gist.github.com/kiszk/00ab7d0c69f0e598e383cdc8e72bcc4d

Another possible idea is that both applications support the Apache Arrow 
APIs. There could be other approaches, too.

What approach would be good for all applications?
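
(To make the first idea concrete, a minimal sketch, not the actual Spark 
classes: a read-only trait with a writable sub-trait, so that a provider such 
as FiloDB could implement only the read side.)

// Minimal sketch (not the real Spark API): split column access into a
// read-only trait and a writable sub-trait.
trait ReadOnlyColumnVector {
  def isNullAt(rowId: Int): Boolean
  def getInt(rowId: Int): Int
}

trait WritableColumnVector extends ReadOnlyColumnVector {
  def putNull(rowId: Int): Unit
  def putInt(rowId: Int, value: Int): Unit
}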

Regards,
Kazuaki Ishizaki



Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-08 Thread Kazuaki Ishizaki
The line that I pointed out would work correctly. This is because the type 
of this division is double, and the d2i conversion correctly handles overflow cases.
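
(A small self-contained illustration, not the Spark code itself: the 
Double-to-Int conversion, the d2i bytecode, saturates at Int.MaxValue instead 
of wrapping around.)

// Sketch: d2i (Double -> Int) saturates instead of wrapping.
val huge: Long = Long.MaxValue
// The division is done in double arithmetic, so there is no integer overflow;
// the final Double -> Int cast clamps to Int.MaxValue.
val clamped: Int = (huge / 1.5).toInt
assert(clamped == Int.MaxValue)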

Kazuaki Ishizaki



From:   Nicholas Chammas 
To: Kazuaki Ishizaki/Japan/IBM@IBMJP, Reynold Xin 

Cc: Spark dev list 
Date:   2016/12/08 10:56
Subject:Re: Reduce memory usage of UnsafeInMemorySorter



Unfortunately, I don't have a repro, and I'm only seeing this at scale. 
But I was able to get around the issue by fiddling with the distribution 
of my data before asking GraphFrames to process it. (I think that's where 
the error was being thrown from.)

On Wed, Dec 7, 2016 at 7:32 AM Kazuaki Ishizaki  
wrote:
I do not have a repro, too.
But, when I took a quick look at the file 'UnsafeInMemorySorter.java', I 
am afraid there may be a cast issue similar to 
https://issues.apache.org/jira/browse/SPARK-18458 at the following line.
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L156


Regards,
Kazuaki Ishizaki



From:Reynold Xin 
To:Nicholas Chammas 
Cc:Spark dev list 
Date:2016/12/07 14:27
Subject:Re: Reduce memory usage of UnsafeInMemorySorter



This is not supposed to happen. Do you have a repro?


On Tue, Dec 6, 2016 at 6:11 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:
[Re-titling thread.]
OK, I see that the exception from my original email is being triggered 
from this part of UnsafeInMemorySorter:
https://github.com/apache/spark/blob/v2.0.2/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L209-L212

So I can ask a more refined question now: How can I ensure that 
UnsafeInMemorySorter has room to insert new records? In other words, how 
can I ensure that hasSpaceForAnotherRecord() returns a true value?
Do I need:
More, smaller partitions?
More memory per executor?
Some Java or Spark option enabled?
etc.
I’m running Spark 2.0.2 on Java 7 and YARN. Would Java 8 help here? 
(Unfortunately, I cannot upgrade at this time, but it would be good to 
know regardless.)
This is morphing into a user-list question, so accept my apologies. Since 
I can’t find any information anywhere else about this, and the question 
is about internals like UnsafeInMemorySorter, I hope this is OK here.
Nick
On Mon, Dec 5, 2016 at 9:11 AM Nicholas Chammas nicholas.cham...@gmail.com
wrote:
I was testing out a new project at scale on Spark 2.0.2 running on YARN, 
and my job failed with an interesting error message:
TaskSetManager: Lost task 37.3 in stage 31.0 (TID 10684, server.host.name
): java.lang.IllegalStateException: There is no space for new record
05:27:09.573 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:211)
05:27:09.574 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:127)
05:27:09.574 at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:244)
05:27:09.575 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 
Source)
05:27:09.575 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 
Source)
05:27:09.576 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
05:27:09.576 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$8$anon$1.hasNext(WholeStageCodegenExec.scala:370)
05:27:09.577 at 
scala.collection.Iterator$anon$11.hasNext(Iterator.scala:408)
05:27:09.577 at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
05:27:09.577 at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
05:27:09.578 at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
05:27:09.578 at org.apache.spark.scheduler.Task.run(Task.scala:86)
05:27:09.578 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
05:27:09.579 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
05:27:09.579 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
05:27:09.579 at java.lang.Thread.run(Thread.java:745)

I’ve never seen this before, and searching on Google/DDG/JIRA doesn’t 
yield any results. There are no other errors coming from that executor, 
whether related to memory, storage space, or otherwise.
Could this be a bug? If so, how would I narrow down the source? Otherwise, 
how might I work around the issue?
Nick






Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-07 Thread Kazuaki Ishizaki
I do not have a repro, too.
But, when I took a quick look at the file 'UnsafeInMemorySorter.java', I 
am afraid there may be a cast issue similar to 
https://issues.apache.org/jira/browse/SPARK-18458 at the following line.
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L156

Regards,
Kazuaki Ishizaki



From:   Reynold Xin 
To: Nicholas Chammas 
Cc: Spark dev list 
Date:   2016/12/07 14:27
Subject:Re: Reduce memory usage of UnsafeInMemorySorter



This is not supposed to happen. Do you have a repro?


On Tue, Dec 6, 2016 at 6:11 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:
[Re-titling thread.]
OK, I see that the exception from my original email is being triggered 
from this part of UnsafeInMemorySorter:
https://github.com/apache/spark/blob/v2.0.2/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L209-L212
So I can ask a more refined question now: How can I ensure that 
UnsafeInMemorySorter has room to insert new records? In other words, how 
can I ensure that hasSpaceForAnotherRecord() returns a true value?
Do I need:
More, smaller partitions?
More memory per executor?
Some Java or Spark option enabled?
etc.
I’m running Spark 2.0.2 on Java 7 and YARN. Would Java 8 help here? 
(Unfortunately, I cannot upgrade at this time, but it would be good to 
know regardless.)
This is morphing into a user-list question, so accept my apologies. Since 
I can’t find any information anywhere else about this, and the question 
is about internals like UnsafeInMemorySorter, I hope this is OK here.
Nick
On Mon, Dec 5, 2016 at 9:11 AM Nicholas Chammas nicholas.cham...@gmail.com 
wrote:
I was testing out a new project at scale on Spark 2.0.2 running on YARN, 
and my job failed with an interesting error message:
TaskSetManager: Lost task 37.3 in stage 31.0 (TID 10684, server.host.name
): java.lang.IllegalStateException: There is no space for new record
05:27:09.573 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:211)
05:27:09.574 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:127)
05:27:09.574 at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:244)
05:27:09.575 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 
Source)
05:27:09.575 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 
Source)
05:27:09.576 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
05:27:09.576 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$8$anon$1.hasNext(WholeStageCodegenExec.scala:370)
05:27:09.577 at 
scala.collection.Iterator$anon$11.hasNext(Iterator.scala:408)
05:27:09.577 at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
05:27:09.577 at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
05:27:09.578 at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
05:27:09.578 at org.apache.spark.scheduler.Task.run(Task.scala:86)
05:27:09.578 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
05:27:09.579 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
05:27:09.579 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
05:27:09.579 at java.lang.Thread.run(Thread.java:745)

I’ve never seen this before, and searching on Google/DDG/JIRA doesn’t 
yield any results. There are no other errors coming from that executor, 
whether related to memory, storage space, or otherwise.
Could this be a bug? If so, how would I narrow down the source? Otherwise, 
how might I work around the issue?
Nick





Re: Cache'ing performance

2016-08-27 Thread Kazuaki Ishizaki
Hi,
Good point. I have just measured performance with 
"spark.sql.inMemoryColumnarStorage.compressed=false."
It improved the performance than with default. However, it is still slower 
RDD version on my environment.

It seems to be consistent with the PR 
https://github.com/apache/spark/pull/11956. This PR shows room to 
performance improvement for float/double values that are not compressed.

Kazuaki Ishizaki



From:   linguin@gmail.com
To: Maciej Bryński 
Cc: Spark dev list 
Date:   2016/08/28 11:30
Subject:Re: Cache'ing performance



Hi,

How does the performance difference change when turning off compression?
It is enabled by default.

// maropu

Sent by iPhone

On 2016/08/28 10:13, Kazuaki Ishizaki  wrote:

Hi
I think that it is a performance issue in both DataFrame and Dataset 
cache. It is not due to only Encoders. The DataFrame version 
"spark.range(Int.MaxValue).toDF.cache().count()" is also slow.

While a cache for DataFrame and Dataset is stored as a columnar format 
with some compressed data representation, we have revealed there is room 
to improve performance. We have already created pull requests to address 
them. These pull requests are under review. 
https://github.com/apache/spark/pull/11956
https://github.com/apache/spark/pull/14091

We would appreciate your feedback to these pull requests.

Best Regards,
Kazuaki Ishizaki



From:Maciej Bryński 
To:Spark dev list 
Date:2016/08/28 05:40
Subject:Cache'ing performance



Hi,
I did some benchmark of cache function today.

RDD
sc.parallelize(0 until Int.MaxValue).cache().count()

Datasets
spark.range(Int.MaxValue).cache().count()

For me Datasets was 2 times slower.

Results (3 nodes, 20 cores and 48GB RAM each)
RDD - 6s
Datasets - 13,5 s

Is that expected behavior for Datasets and Encoders ?

Regards,
-- 
Maciek Bryński





Re: Cache'ing performance

2016-08-27 Thread Kazuaki Ishizaki
Hi
I think that it is a performance issue in both DataFrame and Dataset 
cache. It is not due to only Encoders. The DataFrame version 
"spark.range(Int.MaxValue).toDF.cache().count()" is also slow.

While a cache for DataFrame and Dataset is stored as a columnar format 
with some compressed data representation, we have revealed there is room 
to improve performance. We have already created pull requests to address 
them. These pull requests are under review. 
https://github.com/apache/spark/pull/11956
https://github.com/apache/spark/pull/14091

We would appreciate your feedback to these pull requests.

Best Regards,
Kazuaki Ishizaki



From:   Maciej Bryński 
To: Spark dev list 
Date:   2016/08/28 05:40
Subject:Cache'ing performance



Hi,
I did some benchmark of cache function today.

RDD
sc.parallelize(0 until Int.MaxValue).cache().count()

Datasets
spark.range(Int.MaxValue).cache().count()

For me Datasets was 2 times slower.

Results (3 nodes, 20 cores and 48GB RAM each)
RDD - 6s
Datasets - 13,5 s

Is that expected behavior for Datasets and Encoders ?

Regards,
-- 
Maciek Bryński




Question about equality of o.a.s.sql.Row

2016-06-17 Thread Kazuaki Ishizaki
Dear all,

I have three questions about equality of org.apache.spark.sql.Row.

(1) If a Row has a complex type (e.g. Array), is the following behavior 
expected?
If two Rows have the same array instance, Row.equals returns true in the 
second assert. If two Rows have different array instances (a1 and a2) that 
have the same array elements, Row.equals returns false in the third 
assert.

val a1 = Array(3, 4)
val a2 = Array(3, 4)
val r1 = Row(a1)
val r2 = Row(a2)
assert(a1.sameElements(a2)) // SUCCESS
assert(Row(a1).equals(Row(a1)))  // SUCCESS
assert(Row(a1).equals(Row(a2)))  // FAILURE

This is because two objects are compared by "o1 != o2" instead of 
"o1.equals(o2)" at 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L408

(2) If (1) is expected, where is this behavior described or defined? I 
cannot find the description in the API document.
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Row.html
https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/api/scala/index.html#org.apache.spark.sql.Row

(3) If (1) is expected, is there any recommendation on how to write code for 
equality of two Rows that have an Array or complex types (e.g. Map)?
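
(For illustration only, a minimal sketch of one possible workaround, not an 
official recommendation: convert Array values to Seq before comparing so that 
element-wise equality is used.)

// Minimal sketch: compare two Rows treating Array values element-wise.
import org.apache.spark.sql.Row

def rowsDeepEqual(r1: Row, r2: Row): Boolean =
  r1.length == r2.length && (0 until r1.length).forall { i =>
    (r1.get(i), r2.get(i)) match {
      case (a: Array[_], b: Array[_]) => a.toSeq == b.toSeq // element-wise
      case (a, b)                     => a == b
    }
  }

assert(rowsDeepEqual(Row(Array(3, 4)), Row(Array(3, 4)))) // now succeeds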

Best Regards,
Kazuaki Ishizaki, @kiszk



Re: How to access the off-heap representation of cached data in Spark 2.0

2016-05-28 Thread Kazuaki Ishizaki
Hi,
According to my understanding, the contents of df.cache() are currently kept 
on the Java heap as a set of byte arrays in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala#L58
. Data is accessed by using sun.misc.Unsafe APIs, and may sometimes be 
compressed.
CachedBatch is private, and this representation may be changed in the 
future.

In general, it is not easy to access this data by using a C/C++ API.

Regards,
Kazuaki Ishizaki



From:   Jacek Laskowski 
To: "jpivar...@gmail.com" 
Cc: dev 
Date:   2016/05/29 08:18
Subject:Re: How to access the off-heap representation of cached 
data in Spark 2.0



Hi Jim,

There's no C++ API in Spark to access the off-heap data. Moreover, I
also think "off-heap" has an overloaded meaning in Spark - for
tungsten and to persist your data off-heap (it's all about memory but
for different purposes and with client- and internal API).

That's my limited understanding of the things (and I'm not even sure
how trustworthy it is). Use with extreme caution.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, May 28, 2016 at 5:29 PM, jpivar...@gmail.com
 wrote:
> Is this not the place to ask such questions? Where can I get a hint as 
to how
> to access the new off-heap cache, or C++ API, if it exists? I'm willing 
to
> do my own research, but I have to have a place to start. (In fact, this 
is
> the first step in that research.)
>
> Thanks,
> -- Jim
>
>
>
>
> --
> View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-tp17701p17717.html

> Sent from the Apache Spark Developers List mailing list archive at 
Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org






Recent Jenkins always fails in specific two tests

2016-04-17 Thread Kazuaki Ishizaki
I realized that recent Jenkins builds for different pull requests always fail 
in the following two tests:
"SPARK-8020: set sql conf in spark conf"
"SPARK-9757 Persist Parquet relation with decimal column"

Here are examples.
https://github.com/apache/spark/pull/11956 (consoleFull: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56058/consoleFull
)
https://github.com/apache/spark/pull/12259 (consoleFull: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56056/consoleFull
)
https://github.com/apache/spark/pull/12450 (consoleFull: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56051/consoleFull
)
https://github.com/apache/spark/pull/12453 (consoleFull: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56050/consoleFull
)
https://github.com/apache/spark/pull/12257 (consoleFull: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56061/consoleFull
)
https://github.com/apache/spark/pull/12451 (consoleFull: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56045/consoleFull
)

I have just realized that the latest master also causes the same two 
failures at amplab Jenkins. 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/625/

Since they seem to be related to the failures in recent pull 
requests, I created two JIRA entries:
https://issues.apache.org/jira/browse/SPARK-14689
https://issues.apache.org/jira/browse/SPARK-14690

Best regards,
Kazuaki Ishizaki



RE: Using CUDA within Spark / boosting linear algebra

2016-01-22 Thread Kazuaki Ishizaki
Hi Alexander,
The goal of our columnar storage is to effectively drive GPUs in Spark. One of 
the important items is to effectively and easily enable highly-tuned GPU 
libraries such as BIDMach.

We will enable BIDMach with our columnar storage. On the other hand, it is 
not an easy task to scale BIDMach with the current Spark. I expect that this 
talk would help us:
http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47565

We appreciate your great feedback.

Best Regards,
Kazuaki Ishizaki, Ph.D., Senior research staff member, IBM Research - 
Tokyo



From:   "Ulanov, Alexander" 
To:     Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" 
, Joseph Bradley 
Cc: John Canny , "Evan R. Sparks" 
, Xiangrui Meng , Sam Halliday 

Date:   2016/01/22 04:20
Subject:RE: Using CUDA within Spark / boosting linear algebra



Hi Kazuaki,
 
Indeed, moving data to/from GPU is costly and this benchmark summarizes 
the costs for moving different data sizes with regards to matrices 
multiplication. These costs are paid for the convenience of using the 
standard BLAS API that Nvidia NVBLAS provides. The thing is that there are 
no code changes required (in Spark), one just needs to reference BLAS 
implementation with the system variable. Naturally, hardware-specific 
implementation will always be faster than default. The benchmark results 
show that fact by comparing jCuda (by means of BIDMat) and NVBLAS. 
However, it also shows that it worth using NVBLAS for large matrices 
because it can take advantage of several GPUs and it will be faster 
despite the copying overhead. That is also a known thing advertised by 
Nvidia.
 
By the way, I don’t think that the column/row friendly format is an 
issue, because one can use transposed matrices to fit the required format. 
I believe that is just a software preference.
 
My suggestion with regards to your prototype would be to make comparisons 
with Spark’s implementation of logistic regression (that does not take 
advantage of GPU) and also with BIDMach’s (that takes advantage of GPUs). 
It will give the users a better understanding of your’s implementation 
performance. Currently you compare it with Spark’s example logistic 
regression implementation that is supposed to be a reference for learning 
Spark rather than benchmarking its performance.
 
Best regards, Alexander
 
From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] 
Sent: Thursday, January 21, 2016 3:34 AM
To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley
Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday
Subject: RE: Using CUDA within Spark / boosting linear algebra
 
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as 
another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to 
efficiently exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and 
GPU-friendly column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues 
by supporting data partition caching in GPU device memory and by providing 
binary column storage for data partition. We really appreciate it if you 
would give us comments, suggestions, or feedback.
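
(As a rough illustration of issue (2) only, and not the prototype's actual 
storage format, converting a partition from the row-oriented Iterator[T] form 
into per-column arrays could look like this:)

// Rough sketch: row format (Iterator of case-class rows) vs. a simple
// column-based layout, which can be copied to a GPU as contiguous buffers.
case class Point(x: Double, y: Double, label: Double)

def toColumns(rows: Iterator[Point]): (Array[Double], Array[Double], Array[Double]) = {
  val buf = rows.toArray                          // materialize the partition
  (buf.map(_.x), buf.map(_.y), buf.map(_.label))  // one contiguous array per column
}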

Best Regards
Kazuaki Ishizaki



From:"Ulanov, Alexander" 
To:Sam Halliday , John Canny <
ca...@berkeley.edu>
Cc:Xiangrui Meng , "dev@spark.apache.org" <
dev@spark.apache.org>, Joseph Bradley , "Evan R. 
Sparks" 
Date:2016/01/21 11:07
Subject:RE: Using CUDA within Spark / boosting linear algebra




Hi Everyone,
 
I’ve updated the benchmark and done experiments with new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPU Intel 
E5-2650 v3 @ 2.30GHz.
 
This time I computed average and median of 10 runs for each of experiment 
and approximated FLOPS.
 
Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Benchmark code:
https://github.com/avulanov/scala-blas
 
Best regards, Alexander
 
 
From: Sam Halliday [mailto:sam.halli...@gmail.com] 

RE: Using CUDA within Spark / boosting linear algebra

2016-01-22 Thread Kazuaki Ishizaki
Hi Allen,
Thank you for your feedback.
An API to launch GPU kernels with JCuda is the our first step. A purpose 
to release our prototype is to get feedback. In the future, we may use 
other wrappers instead of JCuda.

We would very much appreciate it if you would suggest or propose APIs, such 
as BIDMat, to effectively exploit GPUs in Spark.
If we ran BIDMat with our columnar storage, the performance boost 
would be as good as others have reported.

Best Regards,
Kazuaki Ishizaki,



From:   "Allen Zhang" 
To:     Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: "dev@spark.apache.org" , "Ulanov, Alexander" 
, "Joseph Bradley" , 
"John Canny" , "Evan R. Sparks" 
, "Xiangrui Meng" , "Sam 
Halliday" 
Date:   2016/01/21 21:05
Subject:RE: Using CUDA within Spark / boosting linear algebra




Hi Kazuaki,

JCuda is actually a wrapper around **pure** CUDA, and your wiki page shows 
that the 3.15x performance boost of logistic regression seems slower than 
BIDMat-cublas or pure CUDA.
Could you elaborate on why you chose JCuda rather than JNI to call CUDA 
directly?

Regards,
Allen Zhang






At 2016-01-21 19:34:14, "Kazuaki Ishizaki"  wrote:
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as 
another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to 
efficiently exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and 
GPU-friendly column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues 
by supporting data partition caching in GPU device memory and by providing 
binary column storage for data partition. We really appreciate it if you 
would give us comments, suggestions, or feedback.

Best Regards
Kazuaki Ishizaki



From:"Ulanov, Alexander" 
To:Sam Halliday , John Canny <
ca...@berkeley.edu>
Cc:Xiangrui Meng , "dev@spark.apache.org" <
dev@spark.apache.org>, Joseph Bradley , "Evan R. 
Sparks" 
Date:2016/01/21 11:07
Subject:RE: Using CUDA within Spark / boosting linear algebra



Hi Everyone,
 
I’ve updated the benchmark and done experiments with new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPU Intel 
E5-2650 v3 @ 2.30GHz.
 
This time I computed average and median of 10 runs for each of experiment 
and approximated FLOPS.
 
Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Benchmark code:
https://github.com/avulanov/scala-blas
 
Best regards, Alexander
 
 
From: Sam Halliday [mailto:sam.halli...@gmail.com] 
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra
 
John, I have to disagree with you there. Dense matrices come up a lot in 
industry,  although your personal experience may be different. 
On 26 Mar 2015 16:20, "John Canny"  wrote:
I mentioned this earlier in the thread, but I'll put it out again. Dense 
BLAS are not very important for most machine learning workloads: at least 
for non-image workloads in industry (and for image processing you would 
probably want a deep learning/SGD solution with convolution kernels). e.g. 
it was only relevant for 1/7 of our recent benchmarks, which should be a 
reasonable sample. What really matters is sparse BLAS performance. BIDMat 
is still an order of magnitude faster there. Those kernels are only in 
BIDMat, since NVIDIAs sparse BLAS dont perform well on power-law data. 

Its also the case that the overall performance of an algorithm is 
determined by the slowest kernel, not the fastest. If the goal is to get 
closer to BIDMach's performance on typical problems, you need to make sure 
that every kernel goes at comparable speed. So the real question is how 
much faster MLLib routines do on a complete problem with/without GPU 
acceleration. For BIDMach, its close to a factor of 10. But that required 
running entirely on the GPU, and making sure every kernel is close to its 
limit.

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Kazuaki Ishizaki
Dear all,

>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting.  Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance.  If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice).  Having Spark be aware of the GPU and using it as 
another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph

As Joseph pointed out before, there are two potential issues to 
efficiently exploit GPUs in Spark.
(1) the cost of data movement between CPU and GPU
(2) the cost of encoding/decoding between current row-format and 
GPU-friendly column format

Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues 
by supporting data partition caching in GPU device memory and by providing 
binary column storage for data partition. We really appreciate it if you 
would give us comments, suggestions, or feedback.

Best Regards
Kazuaki Ishizaki



From:   "Ulanov, Alexander" 
To: Sam Halliday , John Canny 

Cc: Xiangrui Meng , "dev@spark.apache.org" 
, Joseph Bradley , "Evan R. 
Sparks" 
Date:   2016/01/21 11:07
Subject:RE: Using CUDA within Spark / boosting linear algebra



Hi Everyone,
 
I’ve updated the benchmark and done experiments with new hardware with 2x 
Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPU Intel 
E5-2650 v3 @ 2.30GHz.
 
This time I computed average and median of 10 runs for each of experiment 
and approximated FLOPS.
 
Results are available at google docs (old experiments are in the other 2 
sheets):
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Benchmark code:
https://github.com/avulanov/scala-blas
 
Best regards, Alexander
 
 
From: Sam Halliday [mailto:sam.halli...@gmail.com] 
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; 
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra
 
John, I have to disagree with you there. Dense matrices come up a lot in 
industry,  although your personal experience may be different. 
On 26 Mar 2015 16:20, "John Canny"  wrote:
I mentioned this earlier in the thread, but I'll put it out again. Dense 
BLAS are not very important for most machine learning workloads: at least 
for non-image workloads in industry (and for image processing you would 
probably want a deep learning/SGD solution with convolution kernels). e.g. 
it was only relevant for 1/7 of our recent benchmarks, which should be a 
reasonable sample. What really matters is sparse BLAS performance. BIDMat 
is still an order of magnitude faster there. Those kernels are only in 
BIDMat, since NVIDIAs sparse BLAS dont perform well on power-law data. 

Its also the case that the overall performance of an algorithm is 
determined by the slowest kernel, not the fastest. If the goal is to get 
closer to BIDMach's performance on typical problems, you need to make sure 
that every kernel goes at comparable speed. So the real question is how 
much faster MLLib routines do on a complete problem with/without GPU 
acceleration. For BIDMach, its close to a factor of 10. But that required 
running entirely on the GPU, and making sure every kernel is close to its 
limit.

-John

If you think nvblas would be helpful, you should try it in some end-to-end 
benchmarks. 
On 3/25/15, 6:23 PM, Evan R. Sparks wrote:
Yeah, much more reasonable - nice to know that we can get full GPU 
performance from breeze/netlib-java - meaning there's no compelling 
performance reason to switch out our current linear algebra library (at 
least as far as this benchmark is concerned). 
 
Instead, it looks like a user guide for configuring Spark/MLlib to use the 
right BLAS library will get us most of the way there. Or, would it make 
sense to finally ship openblas compiled for some common platforms (64-bit 
linux, windows, mac) directly with Spark - hopefully eliminating the jblas 
warnings once and for all for most users? (Licensing is BSD) Or am I 
missing something?
 
On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <
alexander.ula...@hp.com> wrote:
As everyone suggested, the results were too good to be true, so I 
double-checked them. It turns out that nvblas did not do multiplication due to 
parameter NVBLAS_TILE_DIM from "nvblas.conf" and returned zero matrix. My 
previously posted results with nvblas are matrices copying only. The 
default NVBLAS_TILE_DIM==2048

RE: Support off-loading computations to a GPU

2016-01-05 Thread Kazuaki Ishizaki
Hi Alexander,
Thank you for having an interest.

We used an LR derived from a Spark sample program 
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkLR.scala
 
(not from mllib or ml). Here are scala source files for GPU and non-GPU 
versions.
GPU: 
https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkGPULR.scala
non-GPU: 
https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkLR.scala

Best Regards,
Kazuaki Ishizaki



From:   "Ulanov, Alexander" 
To:     Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" 

Date:   2016/01/05 06:13
Subject:RE: Support off-loading computations to a GPU



Hi Kazuaki,
 
Sounds very interesting! Could you elaborate on your benchmark with 
regards to logistic regression (LR)? Did you compare your implementation 
with the current implementation of LR in Spark?
 
Best regards, Alexander
 
From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] 
Sent: Sunday, January 03, 2016 7:52 PM
To: dev@spark.apache.org
Subject: Support off-loading computations to a GPU
 
Dear all,

We reopened the existing JIRA entry 
https://issues.apache.org/jira/browse/SPARK-3785 to support off-loading 
computations to a GPU by adding a description for our prototype. We are 
working to effectively and easily exploit GPUs on Spark at 
http://github.com/kiszk/spark-gpu. Please also visit our project page 
http://kiszk.github.io/spark-gpu/.

For now, we added a new format for a partition in an RDD, which is a 
column-based structure in an array format, in addition to the current 
Iterator[T] format with Seq[T]. This reduces data 
serialization/deserialization and copy overhead between CPU and GPU.

Our prototype achieved more than 3x performance improvement for a simple 
logistic regression program using a NVIDIA K40 card.

This JIRA entry (SPARK-3785) includes a link to a design document. We are 
very glad to hear valuable feedback/suggestions/comments and to have great 
discussions to exploit GPUs in Spark.

Best Regards,
Kazuaki Ishizaki




Re:Support off-loading computations to a GPU

2016-01-05 Thread Kazuaki Ishizaki
Hi Allen,
Thank you for having an interest.

For quick start, I prepared a new page "Quick Start" at 
https://github.com/kiszk/spark-gpu/wiki/Quick-Start. You can install the 
package with two lines and run a sample program with one line.

We mean that "off-loading" is to exploit GPU for a task execution of 
Spark. For this, it is necessary to map a task into GPU kernels (While the 
current version requires a programmer to write CUDA code, future versions 
will prepare GPU code from a Spark program automatically). To execute GPU 
kernels requires data copy between CPU and GPU. To reduce data copy 
overhead, our prototype keeps data as a binary representation in RDD using 
a column format.

The current version cannot specify the number of CUDA cores for a job via 
a command line option. There are two ways to specify GPU resources:
1) to specify the number of GPU cards by setting CUDA_VISIBLE_DEVICES in 
conf/spark-env.sh (refer to 
http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
)
2) to specify the number of CUDA threads for processing a partition in a 
program as 
https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkGPULR.scala#L89
 
(Sorry for no documentation now).

We are glad to support requested features and are looking forward to getting 
pull requests.
 
Best Regard,
Kazuaki Ishizaki



From:   "Allen Zhang" 
To: Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: dev@spark.apache.org
Date:   2016/01/04 13:29
Subject:Re:Support off-loading computations to a GPU



Hi Kazuaki,

I am looking at http://kiszk.github.io/spark-gpu/ , can you point me where 
is the kick-start scripts that I can give it a go?

To be more specific, what does *"off-loading"* mean? Does it aim to reduce 
the copy overhead between CPU and GPU?
I am a newbie with GPUs; how can I specify how many GPU cores I want to use 
(like --executor-cores)?





At 2016-01-04 11:52:01, "Kazuaki Ishizaki"  wrote:
Dear all,

We reopened the existing JIRA entry 
https://issues.apache.org/jira/browse/SPARK-3785 to support off-loading 
computations to a GPU by adding a description for our prototype. We are 
working to effectively and easily exploit GPUs on Spark at 
http://github.com/kiszk/spark-gpu. Please also visit our project page 
http://kiszk.github.io/spark-gpu/.

For now, we added a new format for a partition in an RDD, which is a 
column-based structure in an array format, in addition to the current 
Iterator[T] format with Seq[T]. This reduces data 
serialization/deserialization and copy overhead between CPU and GPU.

Our prototype achieved more than 3x performance improvement for a simple 
logistic regression program using a NVIDIA K40 card.

This JIRA entry (SPARK-3785) includes a link to a design document. We are 
very glad to hear valuable feedback/suggestions/comments and to have great 
discussions to exploit GPUs in Spark.

Best Regards,
Kazuaki Ishizaki


 




Re: Support off-loading computations to a GPU

2016-01-04 Thread Kazuaki Ishizaki
I created a new JIRA entry 
https://issues.apache.org/jira/browse/SPARK-12620 for this instead of 
reopening the existing JIRA based on the suggestion.

Best Regards,
Kazuaki Ishizaki



From:   Kazuaki Ishizaki/Japan/IBM@IBMJP
To: dev@spark.apache.org
Date:   2016/01/04 12:54
Subject:Support off-loading computations to a GPU



Dear all,

We reopened the existing JIRA entry 
https://issues.apache.org/jira/browse/SPARK-3785 to support off-loading 
computations to a GPU by adding a description for our prototype. We are 
working to effectively and easily exploit GPUs on Spark at 
http://github.com/kiszk/spark-gpu. Please also visit our project page 
http://kiszk.github.io/spark-gpu/.

For now, we added a new format for a partition in an RDD, which is a 
column-based structure in an array format, in addition to the current 
Iterator[T] format with Seq[T]. This reduces data 
serialization/deserialization and copy overhead between CPU and GPU.

Our prototype achieved more than 3x performance improvement for a simple 
logistic regression program using a NVIDIA K40 card.

This JIRA entry (SPARK-3785) includes a link to a design document. We are 
very glad to hear valuable feedback/suggestions/comments and to have great 
discussions to exploit GPUs in Spark.

Best Regards,
Kazuaki Ishizaki




Support off-loading computations to a GPU

2016-01-03 Thread Kazuaki Ishizaki
Dear all,

We reopened the existing JIRA entry 
https://issues.apache.org/jira/browse/SPARK-3785 to support off-loading 
computations to a GPU by adding a description for our prototype. We are 
working to effectively and easily exploit GPUs on Spark at 
http://github.com/kiszk/spark-gpu. Please also visit our project page 
http://kiszk.github.io/spark-gpu/.

For now, we added a new format for a partition in an RDD, which is a 
column-based structure in an array format, in addition to the current 
Iterator[T] format with Seq[T]. This reduces data 
serialization/deserialization and copy overhead between CPU and GPU.

Our prototype achieved more than 3x performance improvement for a simple 
logistic regression program using a NVIDIA K40 card.

This JIRA entry (SPARK-3785) includes a link to a design document. We are 
very glad to hear valuable feedback/suggestions/comments and to have great 
discussions to exploit GPUs in Spark.

Best Regards,
Kazuaki Ishizaki



Re: latest Spark build error

2015-12-24 Thread Kazuaki Ishizaki
This is because building Spark requires Maven 3.3.3 or later.
http://spark.apache.org/docs/latest/building-spark.html

Regards,
Kazuaki Ishizaki



From:   salexln 
To: dev@spark.apache.org
Date:   2015/12/25 15:52
Subject:latest Spark build error



 Hi all,

I'm getting a build error when trying to build a clean version of the latest
Spark. I did the following:

1) git clone https://github.com/apache/spark.git
2) build/mvn -DskipTests clean package

But I get the following error:

Spark Project Parent POM .. FAILURE [2.338s]
...
BUILD FAILURE
...
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce
(enforce-versions) on project spark-parent_2.10: Some Enforcer rules have
failed. Look above for specific messages explaining why the rule failed. 
->
[Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the 
-e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, 
please
read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException



I'm running Lubuntu 14.04 with the following:

java version "1.7.0_91"
OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
Apache Maven 3.0.5 



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/latest-Spark-build-error-tp15782.html

Sent from the Apache Spark Developers List mailing list archive at 
Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org






Re: Shared memory between C++ process and Spark

2015-12-07 Thread Kazuaki Ishizaki
Is this JIRA entry related to what you want?
https://issues.apache.org/jira/browse/SPARK-10399

Regards,
Kazuaki Ishizaki



From:   Jia 
To: Dewful 
Cc: "user @spark" , dev@spark.apache.org, Robin 
East 
Date:   2015/12/08 03:17
Subject:Re: Shared memory between C++ process and Spark



Thanks, Dewful!

My impression is that Tachyon is a very nice in-memory file system that 
can connect to multiple storages.
However, because our data is also held in memory, I suspect that 
connecting to Spark directly may be more efficient in terms of performance.
But definitely I need to look at Tachyon more carefully, in case it has a 
very efficient C++ binding mechanism.

Best Regards,
Jia

On Dec 7, 2015, at 11:46 AM, Dewful  wrote:

Maybe looking into something like Tachyon would help, I see some sample 
c++ bindings, not sure how much of the current functionality they 
support...
Hi, Robin, 
Thanks for your reply and thanks for copying my question to user mailing 
list.
Yes, we have a distributed C++ application that will store data on each 
node in the cluster, and we hope to leverage Spark to do more fancy 
analytics on those data. But we need high performance, that’s why we want 
shared memory.
Suggestions will be highly appreciated!

Best Regards,
Jia

On Dec 7, 2015, at 10:54 AM, Robin East  wrote:

-dev, +user (this is not a question about development of Spark itself so 
you’ll get more answers in the user mailing list)

First up let me say that I don’t really know how this could be done - I’
m sure it would be possible with enough tinkering but it’s not clear what 
you are trying to achieve. Spark is a distributed processing system, it 
has multiple JVMs running on different machines that each run a small part 
of the overall processing. Unless you have some sort of plan to have 
multiple C++ processes collocated with the distributed JVMs, using named 
memory mapped files doesn't make architectural sense. 
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia  wrote:

Dears, for one project, I need to implement something so Spark can read 
data from a C++ process. 
To provide high performance, I really hope to implement this through 
shared memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do 
this, but I wonder whether there are any existing efforts or a more efficient 
approach to do this?
Thank you very much!
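
(Purely as an illustration of the named memory-mapped file idea, not an 
existing Spark facility: the JVM side could map the same file that a C++ 
process writes, for example via java.nio inside a mapPartitions call. The 
path and record layout below are made up.)

// Minimal sketch: map a file that a C++ process has memory-mapped, read-only.
import java.io.RandomAccessFile
import java.nio.channels.FileChannel

val file = new RandomAccessFile("/dev/shm/cpp_shared_buffer", "r") // hypothetical path
val channel = file.getChannel
// Map the whole file read-only; the C++ process maps the same file for writing.
val buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
val firstValue = buffer.getDouble(0) // read without going through sockets or serialization
channel.close()
file.close()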

Best Regards,
Jia


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org