Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-18 Thread Hyukjin Kwon
Similar issues are going on in spark-website as well. I also filed a ticket
at https://issues.apache.org/jira/browse/INFRA-17469.

On Wed, Dec 12, 2018 at 9:02 AM, Reynold Xin wrote:

> I filed a ticket: https://issues.apache.org/jira/browse/INFRA-17403
>
> Please add your support there.
>
>
> On Tue, Dec 11, 2018 at 4:58 PM, Sean Owen  wrote:
>
>> I asked on the original ticket at
>> https://issues.apache.org/jira/browse/INFRA-17385 but no follow-up. Go
>> ahead and open a new INFRA ticket.
>>
>> On Tue, Dec 11, 2018 at 6:20 PM Reynold Xin  wrote:
>>
>>> Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so
>>> I want to put some pressure myself there too.
>>>
>>>
>>> On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen  wrote:
>>>
 Agree, I'll ask on the INFRA ticket and follow up. That's a lot of
 extra noise.

 On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin 
 wrote:

 Hmm, it also seems that github comments are being sync'ed to jira.
 That's gonna get old very quickly, we should probably ask infra to disable
 that (if we can't do it ourselves).
 On Mon, Dec 10, 2018 at 9:13 AM Sean Owen  wrote:

 Update for committers: now that my user ID is synced, I can
 successfully push to remote https://github.com/apache/spark directly.
 Use that as the 'apache' remote (if you like; gitbox also works). I
 confirmed the sync works both ways.

 As a bonus you can directly close pull requests when needed instead of
 using "Close Stale PRs" pull requests.

 On Mon, Dec 10, 2018 at 10:30 AM Sean Owen  wrote:

 Per the thread last week, the Apache Spark repos have migrated from
 https://git-wip-us.apache.org/repos/asf to
 https://gitbox.apache.org/repos/asf

 Non-committers:

 This just means repointing any references to the old repository to the
 new one. It won't affect you if you were already referencing
 https://github.com/apache/spark .

 Committers:

 Follow the steps at https://reference.apache.org/committer/github to
 fully sync your ASF and Github accounts, and then wait up to an hour for it
 to finish.

 Then repoint your git-wip-us remotes to gitbox in your git checkouts.
 For our standard setup that works with the merge script, that should be
 your 'apache' remote. For example here are my current remotes:

 $ git remote -v
 apache https://gitbox.apache.org/repos/asf/spark.git (fetch)
 apache https://gitbox.apache.org/repos/asf/spark.git (push)
 apache-github git://github.com/apache/spark (fetch)
 apache-github git://github.com/apache/spark (push)
 origin https://github.com/srowen/spark (fetch)
 origin https://github.com/srowen/spark (push)
 upstream https://github.com/apache/spark (fetch)
 upstream https://github.com/apache/spark (push)

 In theory we also have read/write access to github.com now, but as of this
 writing it hasn't yet worked for me. It may need to sync. This note just makes
 sure everyone knows how to keep pushing commits right now to the new ASF repo.

 Report any problems here!

 Sean


 --
 Marcelo


>>>
>


Re: DataSourceV2 sync notes (#4)

2018-12-18 Thread Srabasti Banerjee
Thanks for sending out the meeting notes from last week's discussion, Ryan!
For unknown technical reasons, I could not unmute myself and be heard when I was 
trying to pitch in during the discussion of default value handling in traditional 
databases, so I posted my response in the chat.

My 2 cents regarding how traditional databases handle default values: from my 
industry experience, Oracle has a constraint clause "ENABLE NOVALIDATE" that 
allows rows added going forward to get the default value, while existing 
rows/data are not required to be updated with it. One can still choose to do a 
data fix at any point.

Happy Holidays All in advance :-)

Warm Regards,
Srabasti Banerjee
On Tuesday, 18 December, 2018, 4:15:06 PM GMT-8, Ryan Blue 
 wrote:  
 
 
Hi everyone, sorry these notes are late. I didn’t have the time to write this 
up last week.

For anyone interested in the next sync, we decided to skip next week and resume 
in early January. I’ve already sent the invite. As usual, if you have topics 
you’d like to discuss or would like to be added to the invite list, just let me 
know. Everyone is welcome.

rb

Attendees:
Ryan Blue
Xiao Li
Bruce Robbins
John Zhuge
Anton Okolnychyi
Jackey Lee
Jamison Bennett
Srabasti Banerjee
Thomas D’Silva
Wenchen Fan
Matt Cheah
Maryann Xue
(possibly others that entered after the start)

Agenda:
   
   - Current discussions from the v2 batch write PR: WriteBuilder and SaveMode
   - Continue sql-api discussion after looking at API dependencies
   - Capabilities API
   - Overview of TableCatalog proposal to sync understanding (if time)

Notes:
   
   - WriteBuilder:  
  - Wenchen summarized the options (factory methods vs builder) and some 
trade-offs
  - What we need to accomplish now can be done with factory methods, which 
are simpler
  - A builder matches the structure of the read side
  - Ryan’s opinion is to use the builder for consistency and evolution. 
Builder makes it easier to change or remove parts without copying all of the 
args of a method.
  - Matt’s opinion is that evolution and maintenance are easier with a builder 
and that it is good to match the read side
  - Consensus was to use WriteBuilder instead of factory methods

   - SaveMode:  
  - Context: v1 passes SaveMode from the DataFrameWriter API to sources. 
The action taken for some mode and existing table state depends on the source 
implementation, which is something the community wants to fix in v2. But, v2 
initially passed SaveMode to sources. The question is how and when to remove 
SaveMode.
  - Wenchen: the current API uses SaveMode and we don’t want to drop 
features
  - Ryan: The main requirement is removing this before the next release. We 
should not have a substantial API change without removing it because we would 
still require an API change.
  - Xiao: suggested creating a release-blocking issue.
  - Consensus was to remove SaveMode before the next release, blocking if 
needed.
  - Someone also stated that keeping SaveMode would make porting file 
sources to v2 easier
  - Ryan disagrees that using SaveMode makes porting file sources faster or 
easier.

   - Capabilities API (this is a quick overview of a long conversation)  
  - Context: there are several situations where a source needs to change 
how Spark behaves or Spark needs to check whether a source supports some 
feature. For example, Spark checks whether a source supports batch writes, 
write-only sources that do not need validation need to tell Spark not to run 
validation rules, and sources that can read files with missing columns (e.g., 
Iceberg) need Spark to allow writes that are missing columns if those columns 
are optional or have default values.
  - Xiao suggested handling this case by case and the conversation moved to 
discussing the motivating case for Netflix: allowing writes that do not include 
optional columns.
  - Wenchen and Maryann added that Spark should handle all default values 
so that this doesn’t differ across sources. Ryan agreed that would be good, but 
pointed out challenges.
  - There was a long discussion about how Spark could handle default 
values. The problem is that adding a column with a default creates a problem of 
reading older data. Maryann and Dilip pointed out that traditional databases 
handle default values at write time so the correct default is the default value 
at write time (instead of read time), but it is unclear how existing data is 
handled.
  - Matt and Ryan asked whether databases update existing rows when a 
default is added. But even if a database can update all existing rows, that 
would not be reasonable for Spark, which in the worst case would need to update 
millions of immutable files. This is also not a reasonable requirement to put 
on sources, so Spark would need to have read-side defaults.
  - Xiao noted that it may be easier to treat internal and 

[DISCUSS] Default values and data sources

2018-12-18 Thread Ryan Blue
Hi everyone,

This thread is a follow-up to a discussion that we started in the DSv2
community sync last week.

The problem I’m trying to solve is that the format I’m using DSv2 to
integrate supports schema evolution. Specifically, it supports adding a new
optional column so that rows without that column get a default value (null for
Iceberg). The current validation rule for an append in DSv2 fails a write
if it is missing a column, so adding a column to an existing table will
cause currently-scheduled jobs that insert data to start failing. Clearly,
schema evolution shouldn't break existing jobs that produce valid data.

To fix this problem, I suggested option 1: adding a way for Spark to check
whether to fail when an optional column is missing. Other contributors in
the sync thought that Spark should go with option 2: Spark’s schema should
have defaults and Spark should handle filling in defaults the same way
across all sources, like other databases.
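
To make option 1 concrete, here is a rough sketch. The trait and method names
below are hypothetical, purely for illustration; they are not part of any
existing or proposed Spark API.

import org.apache.spark.sql.types.StructField

// Hypothetical sketch of option 1: a source tells Spark whether it can fill in
// missing optional columns, and the append validation rule consults that flag
// instead of always failing on a missing column.
trait AcceptsMissingColumns {
  // True if the source supplies a value (e.g. null) for optional columns
  // that are absent from the written data frame.
  def acceptsMissingOptionalColumns: Boolean
}

// Sketch of the check inside the validation rule: allow the append if every
// missing column is nullable and the table opts in.
def validateAppend(missing: Seq[StructField], table: AcceptsMissingColumns): Boolean =
  missing.isEmpty || (missing.forall(_.nullable) && table.acceptsMissingOptionalColumns)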

I think we agree that option 2 would be ideal. The problem is that it is
very hard to implement.

A source might manage data stored in millions of immutable Parquet files,
so adding a default value isn’t possible. Spark would need to fill in
defaults for files written before the column was added at read time (it
could fill in defaults in new files at write time). Filling in defaults at
read time would require Spark to fill in defaults for only some of the
files in a scan, so Spark would need different handling for each task
depending on the schema of that task. Tasks would also be required to
produce a consistent schema, so a file without the new column couldn’t be
combined into a task with a file that has the new column. This adds quite a
bit of complexity.
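
As a small illustration of that per-file handling (the helper below is
hypothetical, not a proposed API): the set of columns that must be synthesized
at read time depends on each file's schema, so it cannot be decided once for
the whole scan.

import org.apache.spark.sql.types.StructType

// Columns present in the table schema but missing from a given file are the
// ones Spark (or the source) would have to fill with defaults when reading
// that file; files written before and after the schema change differ here.
def columnsToFill(fileSchema: StructType, tableSchema: StructType): Seq[String] =
  tableSchema.fieldNames.filterNot(fileSchema.fieldNames.contains).toSeq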

Other sources may not need Spark to fill in the default at all. A JDBC
source would be capable of filling in the default values itself, so Spark
would need some way to communicate the default to that source. If the
source had a different policy for default values (write time instead of
read time, for example) then behavior could still be inconsistent.

I think that this complexity probably isn’t worth consistency in default
values across sources, if that is even achievable.

In the sync we thought it was a good idea to send this out to the larger
group to discuss. Please reply with comments!

rb
-- 
Ryan Blue
Software Engineer
Netflix


DataSourceV2 sync notes (#4)

2018-12-18 Thread Ryan Blue
Hi everyone, sorry these notes are late. I didn’t have the time to write
this up last week.

For anyone interested in the next sync, we decided to skip next week and
resume in early January. I’ve already sent the invite. As usual, if you
have topics you’d like to discuss or would like to be added to the invite
list, just let me know. Everyone is welcome.

rb

*Attendees*:
Ryan Blue
Xiao Li
Bruce Robbins
John Zhuge
Anton Okolnychyi
Jackey Lee
Jamison Bennett
Srabasti Banerjee
Thomas D’Silva
Wenchen Fan
Matt Cheah
Maryann Xue
(possibly others that entered after the start)

*Agenda*:

   - Current discussions from the v2 batch write PR: WriteBuilder and
   SaveMode
   - Continue sql-api discussion after looking at API dependencies
   - Capabilities API
   - Overview of TableCatalog proposal to sync understanding (if time)

*Notes*:

   - WriteBuilder:
  - Wenchen summarized the options (factory methods vs builder) and
  some trade-offs
  - What we need to accomplish now can be done with factory methods,
  which are simpler
  - A builder matches the structure of the read side
  - Ryan’s opinion is to use the builder for consistency and evolution.
  Builder makes it easier to change or remove parts without copying all of
  the args of a method.
   - Matt’s opinion is that evolution and maintenance are easier with a builder
   and that it is good to match the read side
  - *Consensus was to use WriteBuilder instead of factory methods*
   - SaveMode:
  - Context: v1 passes SaveMode from the DataFrameWriter API to
  sources. The action taken for some mode and existing table state
depends on
  the source implementation, which is something the community
wants to fix in
  v2. But, v2 initially passed SaveMode to sources. The question is how and
  when to remove SaveMode.
  - Wenchen: the current API uses SaveMode and we don’t want to drop
  features
  - Ryan: The main requirement is removing this before the next
  release. We should not have a substantial API change without removing it
  because we would still require an API change.
  - Xiao: suggested creating a release-blocking issue.
  - *Consensus was to remove SaveMode before the next release, blocking
  if needed.*
  - Someone also stated that keeping SaveMode would make porting file
  sources to v2 easier
  - Ryan disagrees that using SaveMode makes porting file sources
  faster or easier.
   - Capabilities API (this is a quick overview of a long conversation)
  - Context: there are several situations where a source needs to
  change how Spark behaves or Spark needs to check whether a
source supports
  some feature. For example, Spark checks whether a source supports batch
  writes, write-only sources that do not need validation need to tell Spark
  not to run validation rules, and sources that can read files with missing
  columns (e.g., Iceberg) need Spark to allow writes that are
missing columns
  if those columns are optional or have default values.
  - Xiao suggested handling this case by case and the conversation
  moved to discussing the motivating case for Netflix: allowing writes that
  do not include optional columns.
  - Wenchen and Maryann added that Spark should handle all default
  values so that this doesn’t differ across sources. Ryan agreed that would
  be good, but pointed out challenges.
  - There was a long discussion about how Spark could handle default
  values. The problem is that adding a column with a default creates a
  problem of reading older data. Maryann and Dilip pointed out that
  traditional databases handle default values at write time so the correct
  default is the default value at write time (instead of read time), but it
  is unclear how existing data is handled.
  - Matt and Ryan asked whether databases update existing rows when a
  default is added. But even if a database can update all existing
rows, that
  would not be reasonable for Spark, which in the worst case would need to
  update millions of immutable files. This is also not a reasonable
  requirement to put on sources, so Spark would need to have read-side
  defaults.
   - Xiao noted that it may be easier to treat internal and external
   sources differently so that internal sources handle defaults. Ryan pointed
  out that this is the motivation for adding a capability API.
  - *Consensus was to start a discuss thread on the dev list about
  default values.*
  - Discussion shifted to a different example: the need to disable
  validation for write-only tables. Consensus was that this use
case is valid.
  - Wenchen: capabilities would work to disable write validation, but
  should not be string based.
   - *Consensus was to use a capabilities API, but use an enum instead
   of strings.* (A rough sketch follows these notes.)
  - Open question: what other options should use a 
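
As referenced in the notes above, here is a rough sketch of what an enum-based
capabilities API could look like. The capability names and types below are
illustrative only; nothing beyond "an enum rather than strings" was agreed.

// Illustrative only: capabilities a source can advertise, and a mix-in that
// Spark could query before applying behavior like write validation.
object TableCapability extends Enumeration {
  val BatchWrite, AcceptMissingColumns, SkipWriteValidation = Value
}

trait SupportsCapabilities {
  def capabilities: Set[TableCapability.Value]
}

// Sketch of a Spark-side check, e.g. skipping validation for write-only sources:
def shouldValidate(table: SupportsCapabilities): Boolean =
  !table.capabilities.contains(TableCapability.SkipWriteValidation)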

Re: [DISCUSS] Function plugins

2018-12-18 Thread Ryan Blue
I agree that it probably isn’t feasible to support codegen.

My goal is to be able to have users code like they can in Scala, but change
registration so that they don’t need a SparkSession. This is easy with a
SparkSession:

In [2]: def plus(a: Int, b: Int): Int = a + b
plus: (a: Int, b: Int)Int

In [3]: spark.udf.register("plus", plus _)
Out[3]: UserDefinedFunction(,IntegerType,Some(List(IntegerType,
IntegerType)))

In [4]: %%sql
  : select plus(3,4)

Out[4]:
+----------------+
| UDF:plus(3, 4) |
+----------------+
| 7              |
+----------------+
  available as df0

I want to build a UDFCatalog that can handle indirect registration: a user
registers plus with some class that I control, and that class uses the
UDFCatalog interface to pass those UDFs to Spark. It would also handle the
translation to Spark’s UserDefinedFunction, just like when you use
spark.udf.register.
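
A minimal sketch of that indirect registration, assuming the proposed (not yet
existing) UDFCatalog interface quoted further down in this thread; the class
name and lookup scheme here are illustrative only:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Illustrative only: a platform-provided class exposing plain Scala functions
// as UDFs through the proposed UDFCatalog interface.
class MyUdfCatalog /* extends UDFCatalog */ {
  // UDFs are built from ordinary Scala functions; no SparkSession is needed here.
  private val udfs: Map[String, UserDefinedFunction] = Map(
    "plus" -> udf((a: Int, b: Int) => a + b)
  )

  def loadUDF(name: String): UserDefinedFunction =
    udfs.getOrElse(name, throw new NoSuchElementException(s"Unknown UDF: $name"))
}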

On Fri, Dec 14, 2018 at 7:02 PM Reynold Xin  wrote:

> I don’t think it is realistic to support codegen for UDFs. It’s hooked
> deep into internals.
>
> On Fri, Dec 14, 2018 at 6:52 PM Matt Cheah  wrote:
>
>> How would this work with:
>>
>>1. Codegen – how does one generate code given a user’s UDF? Would the
>>user be able to specify the code that is generated that represents their
>>function? In practice that’s pretty hard to get right.
>>2. Row serialization and representation – Will the UDF receive
>>catalyst rows with optimized internal representations, or will Spark have
>>to convert to something more easily consumed by a UDF?
>>
>>
>>
>> Otherwise +1 for trying to get this to work without Hive. I think even
>> having something without codegen and optimized row formats is worthwhile if
>> only because it’s easier to use than Hive UDFs.
>>
>>
>>
>> -Matt Cheah
>>
>>
>>
>> *From: *Reynold Xin 
>> *Date: *Friday, December 14, 2018 at 1:49 PM
>> *To: *"rb...@netflix.com" 
>> *Cc: *Spark Dev List 
>> *Subject: *Re: [DISCUSS] Function plugins
>>
>>
>>
>>
>> Having a way to register UDFs that are not using Hive APIs would be great!
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue 
>> wrote:
>>
>> Hi everyone,
>> I’ve been looking into improving how users of our Spark platform register
>> and use UDFs and I’d like to discuss a few ideas for making this easier.
>>
>> The motivation for this is the use case of defining a UDF from SparkSQL
>> or PySpark. We want to make it easy to write JVM UDFs and use them from
>> both SQL and Python. Python UDFs work great in most cases, but we
>> occasionally don’t want to pay the cost of shipping data to python and
>> processing it there so we want to make it easy to register UDFs that will
>> run in the JVM.
>>
>> There is already syntax to create a function from a JVM class
>> [docs.databricks.com]
>> 
>> in SQL that would work, but this option requires using the Hive UDF API
>> instead of Spark’s simpler Scala API. It also requires argument translation
>> and doesn’t support codegen. Beyond the problem of the API and performance,
>> it is annoying to require registering every function individually with a 
>> CREATE
>> FUNCTION statement.
>>
>> The alternative that I’d like to propose is to add a way to register a
>> named group of functions using the proposed catalog plugin API.
>>
>> For anyone unfamiliar with the proposed catalog plugins, the basic idea
>> is to load and configure plugins using a simple property-based scheme.
>> Those plugins expose functionality through mix-in interfaces, like
>> TableCatalog to create/drop/load/alter tables. Another interface could
>> be UDFCatalog that can load UDFs.
>>
>> interface UDFCatalog extends CatalogPlugin {
>>
>>   UserDefinedFunction loadUDF(String name)
>>
>> }
>>
>> To use this, I would create a UDFCatalog class that returns my Scala
>> functions as UDFs. To look up functions, we would use both the catalog name
>> and the function name.
>>
>> This would allow my users to write Scala UDF instances, package them
>> using a UDFCatalog class (provided by me), and easily use them in Spark
>> with a few configuration options to add the catalog in their environment.
>>
>> This would also allow me to expose UDF libraries easily in my
>> configuration, like brickhouse [community.cloudera.com]
>> ,
>> 

[Standalone Spark] Master Configuration Push-Down

2018-12-18 Thread Sean Po
I am running Spark in standalone mode, and I am finding that when I configure ports 
(e.g. spark.blockManager.port) in both the Spark Master's spark-defaults.conf 
and the Spark Workers', the Master's port setting is the one used by all the workers. 
Judging by the code, this seems to be by design. If executors are small (so that many 
run on the same host), the 16 ports attempted will be exhausted and executors will 
fail to start. This is further exacerbated by the fact that, in my particular 
circumstance, multiple Spark Workers can exist on the same machine.

What are the community's thoughts on changing this behavior such that

  1.  The port push-down will only happen if the Spark Worker's port 
configuration is not set. This won't solve the problem entirely, but it will 
mitigate it and seems to make sense from a user experience point of view.

Similarly, I'd like to prevent environment variable push-down as well. Perhaps, 
instead of 1, we can have a configurable switch to turn off push-down of port 
configuration and a different one to turn off environment variable push-down; 
that would work too.

Please share some of your thoughts.

Regards,
Sean


Re: Decimals with negative scale

2018-12-18 Thread Reynold Xin
So why can't we just add validation that fails for sources that don't support 
negative scales? This way, we don't need to break backward compatibility in any 
way, and it becomes a strict improvement.

On Tue, Dec 18, 2018 at 8:43 AM, Marco Gaido < marcogaid...@gmail.com > wrote:

> 
> This is at analysis time.
> 
> On Tue, 18 Dec 2018, 17:32 Reynold Xin < r...@databricks.com > wrote:
> 
> 
>> Is this an analysis time thing or a runtime thing?
>> 
>> On Tue, Dec 18, 2018 at 7:45 AM Marco Gaido < marcogaid...@gmail.com > wrote:
>> 
>> 
>>> Hi all,
>>> 
>>> 
>>> as you may remember, there was a design doc to support operations
>>> involving decimals with negative scales. After the discussion in the
>>> design doc, now the related PR is blocked because for 3.0 we have another
>>> option which we can explore, ie. forbidding negative scales. This is
>>> probably a cleaner solution, as most likely we didn't want negative
>>> scales, but it is a breaking change: so we wanted to check the opinion of
>>> the community.
>>> 
>>> 
>>> Getting to the topic, here are the 2 options:
>>> * - Forbidding negative scales*
>>>   Pros: many sources do not support negative scales (so they can create
>>> issues); they were something which was not considered as possible in the
>>> initial implementation, so we get to a more stable situation.
>>>   Cons: some operations which were supported earlier, won't be working
>>> anymore. Eg. since our max precision is 38, if the scale cannot be
>>> negative 1e36 * 1e36 would cause an overflow, while now works fine
>>> (producing a decimal with negative scale); basically impossible to create
>>> a config which controls the behavior.
>>> 
>>> 
>>> 
>>>  *- Handling negative scales in operations*
>>>   Pros: no regressions; we support all the operations we supported on 2.x.
>>> 
>>>   Cons: negative scales can cause issues in other moments, eg. when saving
>>> to a data source which doesn't support them.
>>> 
>>> 
>>> 
>>> Looking forward to hearing your thoughts,
>>> Thanks.
>>> Marco
>>> 
>> 
>> 
> 
>

Re: Decimals with negative scale

2018-12-18 Thread Marco Gaido
This is at analysis time.

On Tue, 18 Dec 2018, 17:32 Reynold Xin wrote:
> Is this an analysis time thing or a runtime thing?
>
> On Tue, Dec 18, 2018 at 7:45 AM Marco Gaido 
> wrote:
>
>> Hi all,
>>
>> as you may remember, there was a design doc to support operations
>> involving decimals with negative scales. After the discussion in the design
>> doc, now the related PR is blocked because for 3.0 we have another option
>> which we can explore, ie. forbidding negative scales. This is probably a
>> cleaner solution, as most likely we didn't want negative scales, but it is
>> a breaking change: so we wanted to check the opinion of the community.
>>
>> Getting to the topic, here are the 2 options:
>> * - Forbidding negative scales*
>>   Pros: many sources do not support negative scales (so they can create
>> issues); they were something which was not considered as possible in the
>> initial implementation, so we get to a more stable situation.
>>   Cons: some operations which were supported earlier, won't be working
>> anymore. Eg. since our max precision is 38, if the scale cannot be negative
>> 1e36 * 1e36 would cause an overflow, while now works fine (producing a
>> decimal with negative scale); basically impossible to create a config which
>> controls the behavior.
>>
>>  *- Handling negative scales in operations*
>>   Pros: no regressions; we support all the operations we supported on 2.x.
>>   Cons: negative scales can cause issues in other moments, eg. when
>> saving to a data source which doesn't support them.
>>
>> Looking forward to hearing your thoughts,
>> Thanks.
>> Marco
>>
>>
>>


Decimals with negative scale

2018-12-18 Thread Marco Gaido
Hi all,

as you may remember, there was a design doc to support operations involving
decimals with negative scales. After the discussion in the design doc, now
the related PR is blocked because for 3.0 we have another option which we
can explore, ie. forbidding negative scales. This is probably a cleaner
solution, as most likely we didn't want negative scales, but it is a
breaking change: so we wanted to check the opinion of the community.

Getting to the topic, here are the 2 options:
* - Forbidding negative scales*
  Pros: many sources do not support negative scales (so they can create
issues); they were something which was not considered as possible in the
initial implementation, so we get to a more stable situation.
  Cons: some operations which were supported earlier won't work anymore.
E.g., since our max precision is 38, if the scale cannot be negative then
1e36 * 1e36 would cause an overflow, while now it works fine (producing a
decimal with negative scale); it is basically impossible to create a config
which controls the behavior. (A small illustration follows the two options.)

 *- Handling negative scales in operations*
  Pros: no regressions; we support all the operations we supported on 2.x.
  Cons: negative scales can cause issues in other moments, eg. when saving
to a data source which doesn't support them.
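
For a concrete view of the 1e36 * 1e36 case mentioned above, here is a small
illustration with plain java.math.BigDecimal (outside Spark), which is where
the negative scale comes from:

// 1e36 is stored compactly as unscaledValue = 1 with scale = -36
// (value = unscaledValue * 10^(-scale)), so its precision is just 1.
val a = new java.math.BigDecimal(java.math.BigInteger.ONE, -36)  // 1e36
val b = a.multiply(a)  // 1e72: unscaledValue = 1, scale = -72, precision still 1
// If negative scales are forbidden, representing 1e72 exactly with scale >= 0
// requires 73 digits of precision, which exceeds the maximum of 38 and overflows.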

Looking forward to hearing your thoughts,
Thanks.
Marco


[GitHub] HyukjinKwon closed pull request #162: Add a note about Spark build requirement at PySpark testing guide in Developer Tools

2018-12-18 Thread GitBox
HyukjinKwon closed pull request #162: Add a note about Spark build requirement 
at PySpark testing guide in Developer Tools
URL: https://github.com/apache/spark-website/pull/162
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/developer-tools.md b/developer-tools.md
index ebe6905fa..43ad445d6 100644
--- a/developer-tools.md
+++ b/developer-tools.md
@@ -131,6 +131,8 @@ build/mvn test -DwildcardSuites=none 
-Dtest=org.apache.spark.streaming.JavaAPISu
 Testing PySpark
 
 To run individual PySpark tests, you can use `run-tests` script under `python` 
directory. Test cases are located at `tests` package under each PySpark 
packages.
+Note that, if you add some changes into Scala or Python side in Apache Spark, 
you need to manually build Apache Spark again before running PySpark tests in 
order to apply the changes.
+Running PySpark testing script does not automatically build it.
 
 To run test cases in a specific module:
 
diff --git a/site/developer-tools.html b/site/developer-tools.html
index 82dab671a..710f6f53e 100644
--- a/site/developer-tools.html
+++ b/site/developer-tools.html
@@ -313,7 +313,9 @@ Testing with Maven
 
 Testing PySpark
 
-To run individual PySpark tests, you can use run-tests script 
under python directory. Test cases are located at 
tests package under each PySpark packages.
+To run individual PySpark tests, you can use run-tests script 
under python directory. Test cases are located at 
tests package under each PySpark packages.
+Note that, if you add some changes into Scala or Python side in Apache Spark, 
you need to manually build Apache Spark again before running PySpark tests in 
order to apply the changes.
+Running PySpark testing script does not automatically build it.
 
 To run test cases in a specific module:
 


 





[GitHub] HyukjinKwon commented on issue #162: Add a note about Spark build requirement at PySpark testing guide in Developer Tools

2018-12-18 Thread GitBox
HyukjinKwon commented on issue #162: Add a note about Spark build requirement 
at PySpark testing guide in Developer Tools
URL: https://github.com/apache/spark-website/pull/162#issuecomment-448164740
 
 
   Thanks guys!

