Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-18 Thread Wenchen Fan
Thanks Marcelo for pointing out my gpg key issue! I've re-generated it and
uploaded it to the ASF Spark repo. Let's see if it works in the next RC.

Thanks Saisai for pointing out the Python doc issue; I'll fix it in the next RC.

This RC fails because:
1. it doesn't include a Scala 2.12 build
2. the gpg key issue
3. the Python doc issue
4. some other potential blocker issues.

I'll start RC2 once these blocker issues are either resolved or we decide
to mark them as non-blocker.

Thanks,
Wenchen

On Tue, Sep 18, 2018 at 9:48 PM Marco Gaido  wrote:

> Sorry but I am -1 because of what was reported here:
> https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104
> .
> It is a regression, unfortunately. Although the impact is not huge and there
> are workarounds, I think we should include the fix in 2.4.0. I created
> SPARK-25454 and submitted a PR for it.
> Sorry for the trouble.
>
> On Tue, Sep 18, 2018 at 05:23 Holden Karau <
> hol...@pigscanfly.ca> wrote:
>
>> Deprecating Py 2 in the 2.4 release probably doesn't belong in the RC
>> vote thread. Personally I think we might be a little too late in the game
>> to deprecate it in 2.4, but I think calling it out as "soon to be
>> deprecated" in the release docs would be sensible to give folks extra time
>> to prepare.
>>
>> On Mon, Sep 17, 2018 at 2:04 PM Erik Erlandson 
>> wrote:
>>
>>>
>>> I have no binding vote but I second Stavros’ recommendation for
>>> spark-23200
>>>
>>> Per parallel threads on Py2 support I would also like to propose
>>> deprecating Py2 starting with this 2.4 release
>>>
>>> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>>>  wrote:
>>>
>>>> You can log in to https://repository.apache.org and see what's wrong.
>>>> Just find that staging repo and look at the messages. In your case it
>>>> seems related to your signature.
>>>>
>>>> failureMessageNo public key: Key with id: () was not able to be
>>>> located on http://gpg-keyserver.de/. Upload your public key and try
>>>> the operation again.
>>>> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
>>>> wrote:
>>>> >
>>>> > I confirmed that
>>>> https://repository.apache.org/content/repositories/orgapachespark-1285
>>>> is not accessible. I did it via ./dev/create-release/do-release-docker.sh
>>>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see any
>>>> error message during it.
>>>> >
>>>> > Any insights are appreciated! So that I can fix it in the next RC.
>>>> Thanks!
>>>> >
>>>> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>>>> >>
>>>> >> I think one build is enough, but haven't thought it through. The
>>>> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>>>> >> best advertised as a 'beta'. So maybe publish a no-hadoop build of
>>>> it?
>>>> >> Really, whatever's the easy thing to do.
>>>> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>>>> wrote:
>>>> >> >
>>>> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
>>>> Scala 2.12 build this time? Currently for Scala 2.11 we have 3 builds: with
>>>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
>>>> Scala 2.12?
>>>> >> >
>>>> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
>>>> wrote:
>>>> >> >>
>>>> >> >> A few preliminary notes:
>>>> >> >>
>>>> >> >> Wenchen for some weird reason when I hit your key in gpg
>>>> --import, it
>>>> >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
>>>> verify
>>>> >> >> the signature. No issue there really.
>>>> >> >>
>>>> >> >> The staging repo gives a 404:
>>>> >> >>
>>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>>> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>>>> >> >> [id=orgapachespark-1285] exists but is not exposed.
>>>> >> >>
>>>> >> >> The (revamped) licenses are

Re: Pushdown in DataSourceV2 question

2018-12-09 Thread Wenchen Fan
Evaluating expressions/functions can be expensive, and I do think Spark should
trust the data source and not re-apply pushed filters. If the data source lies,
many things can go wrong...
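
For reference, a minimal sketch of that contract, assuming roughly the Spark
2.4 DataSourceV2 reader API (interface names are from that release and have
since evolved). The source claims only the filter shapes it can honestly
evaluate; whatever it returns from pushFilters() is re-evaluated by Spark on
top of the scan, and whatever it reports via pushedFilters() is trusted and
not re-applied:

  import java.util.Collections

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.sources.{Filter, GreaterThan, IsNotNull}
  import org.apache.spark.sql.sources.v2.reader.{
    DataSourceReader, InputPartition, SupportsPushDownFilters}
  import org.apache.spark.sql.types.{LongType, StringType, StructType}

  class SketchReader extends DataSourceReader with SupportsPushDownFilters {
    private var pushed: Array[Filter] = Array.empty

    override def readSchema(): StructType =
      new StructType().add("age", LongType).add("name", StringType)

    // A real source returns its partitions here; left empty in this sketch.
    override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] =
      Collections.emptyList()

    override def pushFilters(filters: Array[Filter]): Array[Filter] = {
      // Keep only the filters this source can really evaluate,
      // e.g. IsNotNull(age) and GreaterThan(age, 20).
      val (supported, unsupported) = filters.partition {
        case _: IsNotNull | _: GreaterThan => true
        case _                             => false
      }
      pushed = supported
      unsupported // Spark evaluates these after scanning
    }

    // Spark trusts this list and does not re-apply these filters.
    override def pushedFilters(): Array[Filter] = pushed
  }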

On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke  wrote:

> Well even if it has to apply it again, if pushdown is activated then it
> will be much less cost for spark to see if the filter has been applied or
> not. Applying the filter is negligible, what it really avoids if the file
> format implements it is IO cost (for reading) as well as cost for
> converting from the file format internal datatype to the one of Spark.
> Those two things are very expensive, but not the filter check. In the end,
> it could be also data source internal reasons not to apply a filter (there
> can be many depending on your scenario, the format etc). Instead of
> “discussing” between Spark and the data source it is much less costly that
> Spark checks that the filters are consistently applied.
>
> On Dec 9, 2018 at 12:39, Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
> Hello,
> that's an interesting question, but after Frank's reply I am a bit puzzled.
>
> If there is no control over the pushdown status how can Spark guarantee
> the correctness of the final query?
>
> Consider a filter pushed down to the data source, either Spark has to know
> if it has been applied or not, or it has to re-apply the filter anyway (and
> pay the price for that).
>
> Is there any other option I am not considering?
>
> Best regards,
> Alessandro
>
> On Sat, Dec 8, 2018, 12:32 Jörn Franke  wrote:
>
>> BTW. Even for json a pushdown can make sense to avoid that data is
>> unnecessary ending in Spark ( because it would cause unnecessary overhead).
>> In the datasource v2 api you need to implement a SupportsPushDownFilter
>>
>> > On Dec 8, 2018 at 10:50, Noritaka Sekiyama > > wrote:
>> >
>> > Hi,
>> >
>> > I'm a support engineer, interested in DataSourceV2.
>> >
>> > Recently I had some pain to troubleshoot to check if pushdown is
>> actually applied or not.
>> > I noticed that DataFrame's explain() method shows pushdown even for
>> JSON.
>> > It totally depends on DataSource side, I believe. However, I would like
>> Spark to have some way to confirm whether specific pushdown is actually
>> applied in DataSource or not.
>> >
>> > # Example
>> > val df = spark.read.json("s3://sample_bucket/people.json")
>> > df.printSchema()
>> > df.filter($"age" > 20).explain()
>> >
>> > root
>> >  |-- age: long (nullable = true)
>> >  |-- name: string (nullable = true)
>> >
>> > == Physical Plan ==
>> > *Project [age#47L, name#48]
>> > +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>> >+- *FileScan json [age#47L,name#48] Batched: false, Format: JSON,
>> Location: InMemoryFileIndex[s3://sample_bucket/people.json],
>> PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)],
>> ReadSchema: struct
>> >
>> > # Comments
>> > As you can see, PushedFilter is shown even if input data is JSON.
>> > Actually this pushdown is not used.
>> >
>> > I'm wondering if it has been already discussed or not.
>> > If not, this is a chance to have such feature in DataSourceV2 because
>> it would require some API level changes.
>> >
>> >
>> > Warm regards,
>> >
>> > Noritaka Sekiyama
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
I agree that we should not rewrite existing Parquet files when a new column
is added, but we should also try our best to make the behavior the same as
the RDBMS/SQL standard.

1. It should be the user who decides the default value of a column, by
CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
2. When adding a new column, the default value should be effective for all
the existing data and newly written data.
3. When altering an existing column to change its default value, it should
be effective for newly written data only.

A possible implementation:
1. a column has 2 default values: the initial one and the latest one.
2. when adding a column with a default value, set both the initial one and
the latest one to this value. But do not update existing data.
3. when reading data, fill the missing column with the initial default value
4. when writing data, fill the missing column with the latest default value
5. when altering a column to change its default value, only update the
latest default value.

This works because:
1. new files will be written with the latest default value, nothing we need
to worry about at read time.
2. old files will be read with the initial default value, which returns the
expected result.
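
A minimal sketch of that bookkeeping (the names here are purely illustrative,
not a Spark API):

  // Each column carries two defaults: the one fixed when it was added (used
  // on the read path for old files) and the current one (used on the write
  // path for rows that omit the column).
  case class ColumnDefault(initial: Any, latest: Any)
  case class ColumnMeta(name: String, default: Option[ColumnDefault])

  object DefaultFilling {
    // ADD COLUMN c DEFAULT v: both defaults start as v; existing files are untouched.
    def addColumn(name: String, value: Any): ColumnMeta =
      ColumnMeta(name, Some(ColumnDefault(initial = value, latest = value)))

    // ALTER COLUMN c SET DEFAULT v: only the latest default changes.
    def alterDefault(col: ColumnMeta, value: Any): ColumnMeta =
      col.copy(default = col.default.map(_.copy(latest = value)))

    // Read path: a file written before the column existed gets the initial default.
    def fillOnRead(row: Map[String, Any], col: ColumnMeta): Map[String, Any] =
      if (row.contains(col.name)) row
      else row + (col.name -> col.default.map(_.initial).orNull)

    // Write path: a row that omits the column is written with the latest default.
    def fillOnWrite(row: Map[String, Any], col: ColumnMeta): Map[String, Any] =
      if (row.contains(col.name)) row
      else row + (col.name -> col.default.map(_.latest).orNull)
  }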

On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue  wrote:

> Hi everyone,
>
> This thread is a follow-up to a discussion that we started in the DSv2
> community sync last week.
>
> The problem I’m trying to solve is that the format I’m using DSv2 to
> integrate supports schema evolution. Specifically, adding a new optional
> column so that rows without that column get a default value (null for
> Iceberg). The current validation rule for an append in DSv2 fails a write
> if it is missing a column, so adding a column to an existing table will
> cause currently-scheduled jobs that insert data to start failing. Clearly,
> schema evolution shouldn't break existing jobs that produce valid data.
>
> To fix this problem, I suggested option 1: adding a way for Spark to check
> whether to fail when an optional column is missing. Other contributors in
> the sync thought that Spark should go with option 2: Spark’s schema should
> have defaults and Spark should handle filling in defaults the same way
> across all sources, like other databases.
>
> I think we agree that option 2 would be ideal. The problem is that it is
> very hard to implement.
>
> A source might manage data stored in millions of immutable Parquet files,
> so adding a default value isn’t possible. Spark would need to fill in
> defaults for files written before the column was added at read time (it
> could fill in defaults in new files at write time). Filling in defaults at
> read time would require Spark to fill in defaults for only some of the
> files in a scan, so Spark would need different handling for each task
> depending on the schema of that task. Tasks would also be required to
> produce a consistent schema, so a file without the new column couldn’t be
> combined into a task with a file that has the new column. This adds quite a
> bit of complexity.
>
> Other sources may not need Spark to fill in the default at all. A JDBC
> source would be capable of filling in the default values itself, so Spark
> would need some way to communicate the default to that source. If the
> source had a different policy for default values (write time instead of
> read time, for example) then behavior could still be inconsistent.
>
> I think that this complexity probably isn’t worth consistency in default
> values across sources, if that is even achievable.
>
> In the sync we thought it was a good idea to send this out to the larger
> group to discuss. Please reply with comments!
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
Note that the design we make here will affect both data source developers
and end-users. It's better to provide reliable behavior to end-users,
instead of asking them to read the spec of the data source to know which
value will be used for missing columns when they write data.

If we do want to go with the "data source decides default value" approach,
we should create a new SQL syntax for ADD COLUMN, as its behavior is
different from the SQL standard ADD COLUMN command.

On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer 
wrote:

> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2 is
> for a unified way of dealing with missing columns? I feel like that
> probably should be left up to the underlying datasource implementation. For
> example if you have missing columns with a database the Datasource can
> choose a value based on the Database's metadata if such a thing exists, I
> don't think Spark should really have this level of detail, but I've also
> missed out on all of these meetings (sorry it's family dinner time :) ) so
> I may be missing something.
>
> So my tldr is, Let a datasource report whether or not missing columns are
> OK and let the Datasource deal with the missing data based on its
> underlying storage.
>
> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan  wrote:
>
>> I agree that we should not rewrite existing parquet files when a new
>> column is added, but we should also try out best to make the behavior same
>> as RDBMS/SQL standard.
>>
>> 1. it should be the user who decides the default value of a column, by
>> CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>> 2. When adding a new column, the default value should be effective for
>> all the existing data, and newly written data.
>> 3. When altering an existing column and change the default value, it
>> should be effective for newly written data only.
>>
>> A possible implementation:
>> 1. a columnn has 2 default values: the initial one and the latest one.
>> 2. when adding a column with a default value, set both the initial one
>> and the latest one to this value. But do not update existing data.
>> 3. when reading data, fill the missing column with the initial default
>> value
>> 4. when writing data, fill the missing column with the latest default
>> value
>> 5. when altering a column to change its default value, only update the
>> latest default value.
>>
>> This works because:
>> 1. new files will be written with the latest default value, nothing we
>> need to worry about at read time.
>> 2. old files will be read with the initial default value, which returns
>> expected result.
>>
>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> This thread is a follow-up to a discussion that we started in the DSv2
>>> community sync last week.
>>>
>>> The problem I’m trying to solve is that the format I’m using DSv2 to
>>> integrate supports schema evolution. Specifically, adding a new optional
>>> column so that rows without that column get a default value (null for
>>> Iceberg). The current validation rule for an append in DSv2 fails a write
>>> if it is missing a column, so adding a column to an existing table will
>>> cause currently-scheduled jobs that insert data to start failing. Clearly,
>>> schema evolution shouldn't break existing jobs that produce valid data.
>>>
>>> To fix this problem, I suggested option 1: adding a way for Spark to
>>> check whether to fail when an optional column is missing. Other
>>> contributors in the sync thought that Spark should go with option 2:
>>> Spark’s schema should have defaults and Spark should handle filling in
>>> defaults the same way across all sources, like other databases.
>>>
>>> I think we agree that option 2 would be ideal. The problem is that it is
>>> very hard to implement.
>>>
>>> A source might manage data stored in millions of immutable Parquet
>>> files, so adding a default value isn’t possible. Spark would need to fill
>>> in defaults for files written before the column was added at read time (it
>>> could fill in defaults in new files at write time). Filling in defaults at
>>> read time would require Spark to fill in defaults for only some of the
>>> files in a scan, so Spark would need different handling for each task
>>> depending on the schema of that task. Tasks would also be required to
>>> produce a consistent schema, so a file without the new column couldn’t be
>>> combined into 

Re: Noisy spark-website notifications

2018-12-19 Thread Wenchen Fan
+1, at least it should only send one email when a PR is merged.

On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Can we somehow disable these new email alerts coming through for the Spark
> website repo?
>
> On Wed, Dec 19, 2018 at 8:25 PM GitBox  wrote:
>
>> ueshin commented on a change in pull request #163: Announce the schedule
>> of 2019 Spark+AI summit at SF
>> URL:
>> https://github.com/apache/spark-website/pull/163#discussion_r243130975
>>
>>
>>
>>  ##
>>  File path: site/sitemap.xml
>>  ##
>>  @@ -139,657 +139,661 @@
>>  
>>  
>>  
>> -  https://spark.apache.org/releases/spark-release-2-4-0.html
>> +  
>> http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
>> 
>>
>>  Review comment:
>>Still remaining `localhost:4000` in this file.
>>
>> 
>> This is an automated message from the Apache Git Service.
>> To respond to the message, please log on GitHub and use the
>> URL above to go to the specific comment.
>>
>> For queries about this service, please contact Infrastructure at:
>> us...@infra.apache.org
>>
>>
>> With regards,
>> Apache Git Services
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
So you agree with my proposal that we should follow the RDBMS/SQL standard
regarding the behavior?

> pass the default through to the underlying data source

This is one way to implement the behavior.

On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue  wrote:

> I don't think we have to change the syntax. Isn't the right thing (for
> option 1) to pass the default through to the underlying data source?
> Sources that don't support defaults would throw an exception.
>
> On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan  wrote:
>
>> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN
>> column_name datatype [DEFAULT value];
>>
>> If the DEFAULT statement is not specified, then the default value is
>> null. If we are going to change the behavior and say the default value is
>> decided by the underlying data source, we should use a new SQL syntax (I
>> don't have a proposal in mind), instead of reusing the existing syntax, to
>> be SQL compatible.
>>
>> Personally I don't like re-inventing the wheel. It's better to just implement
>> the SQL standard ADD COLUMN command, which means the default value is
>> decided by the end-users.
>>
>> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue  wrote:
>>
>>> Wenchen, can you give more detail about the different ADD COLUMN syntax?
>>> That sounds confusing to end users to me.
>>>
>>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan  wrote:
>>>
>>>> Note that the design we make here will affect both data source
>>>> developers and end-users. It's better to provide reliable behaviors to
>>>> end-users, instead of asking them to read the spec of the data source and
>>>> know which value will be used for missing columns, when they write data.
>>>>
>>>> If we do want to go with the "data source decides default value"
>>>> approach, we should create a new SQL syntax for ADD COLUMN, as its behavior
>>>> is different from the SQL standard ADD COLUMN command.
>>>>
>>>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want
>>>>> 2 is for a unified way of dealing with missing columns? I feel like that
>>>>> probably should be left up to the underlying datasource implementation. 
>>>>> For
>>>>> example if you have missing columns with a database the Datasource can
>>>>> choose a value based on the Database's metadata if such a thing exists, I
>>>>> don't think Spark should really have a this level of detail but I've also
>>>>> missed out on all of these meetings (sorry it's family dinner time :) ) so
>>>>> I may be missing something.
>>>>>
>>>>> So my tldr is, Let a datasource report whether or not missing columns
>>>>> are OK and let the Datasource deal with the missing data based on it's
>>>>> underlying storage.
>>>>>
>>>>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan 
>>>>> wrote:
>>>>>
>>>>>> I agree that we should not rewrite existing parquet files when a new
>>>>>> column is added, but we should also try out best to make the behavior 
>>>>>> same
>>>>>> as RDBMS/SQL standard.
>>>>>>
>>>>>> 1. it should be the user who decides the default value of a column,
>>>>>> by CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>>>>>> 2. When adding a new column, the default value should be effective
>>>>>> for all the existing data, and newly written data.
>>>>>> 3. When altering an existing column and change the default value, it
>>>>>> should be effective for newly written data only.
>>>>>>
>>>>>> A possible implementation:
>>>>>> 1. a columnn has 2 default values: the initial one and the latest one.
>>>>>> 2. when adding a column with a default value, set both the initial
>>>>>> one and the latest one to this value. But do not update existing data.
>>>>>> 3. when reading data, fill the missing column with the initial
>>>>>> default value
>>>>>> 4. when writing data, fill the missing column with the latest default
>>>>>> value
>>>>>> 5. when altering a column to change its default value, only update
>>>>>> the l

Re: [DISCUSS] Spark Columnar Processing

2019-03-25 Thread Wenchen Fan
Do you have some initial perf numbers? It seems fine to me to remain
row-based inside Spark with whole-stage-codegen, and convert rows to
columnar batches when communicating with external systems.
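
As a point of reference, this is roughly what the existing columnar path looks
like, using Spark 2.4's vectorized classes (internal, evolving APIs, so subject
to change): build a batch from on-heap vectors and walk it row by row, the way
generated code consumes a ColumnarBatch today.

  import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
  import org.apache.spark.sql.types.{DoubleType, IntegerType, StructType}
  import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

  val schema = new StructType().add("id", IntegerType).add("score", DoubleType)
  val numRows = 3

  // Columnar storage: one writable vector per column.
  val vectors = OnHeapColumnVector.allocateColumns(numRows, schema)
  for (i <- 0 until numRows) {
    vectors(0).putInt(i, i)
    vectors(1).putDouble(i, i * 1.5)
  }

  val batch = new ColumnarBatch(vectors.toArray[ColumnVector])
  batch.setNumRows(numRows)

  // Row-by-row view over the columnar data, as whole-stage codegen sees it.
  val it = batch.rowIterator()
  while (it.hasNext) {
    val row = it.next()
    println((row.getInt(0), row.getDouble(1)))
  }
  batch.close()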

On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans  wrote:

> This thread is to discuss adding in support for data frame processing
> using an in-memory columnar format compatible with Apache Arrow.  My main
> goal in this is to lay the groundwork so we can add in support for GPU
> accelerated processing of data frames, but this feature has a number of
> other benefits.  Spark currently supports Apache Arrow formatted data as an
> option to exchange data with python for pandas UDF processing. There has
> also been discussion around extending this to allow for exchanging data
> with other tools like pytorch, tensorflow, xgboost,... If Spark supports
> processing on Arrow compatible data it could eliminate the
> serialization/deserialization overhead when going between these systems.
> It also would allow for doing optimizations on a CPU with SIMD instructions
> similar to what Hive currently supports. Accelerated processing using a GPU
> is something that we will start a separate discussion thread on, but I
> wanted to set the context a bit.
>
> Jason Lowe, Tom Graves, and I created a prototype over the past few months
> to try and understand how to make this work.  What we are proposing is
> based off of lessons learned when building this prototype, but we really
> wanted to get feedback early on from the community. We will file a SPIP
> once we can get agreement that this is a good direction to go in.
>
> The current support for columnar processing lets a Parquet or Orc file
> format return a ColumnarBatch inside an RDD[InternalRow] using Scala’s type
> erasure. The code generation is aware that the RDD actually holds
> ColumnarBatchs and generates code to loop through the data in each batch as
> InternalRows.
>
> Instead, we propose a new set of APIs to work on an
> RDD[InternalColumnarBatch] instead of abusing type erasure. With this we
> propose adding in a Rule similar to how WholeStageCodeGen currently works.
> Each part of the physical SparkPlan would expose columnar support through a
> combination of traits and method calls. The rule would then decide when
> columnar processing would start and when it would end. Switching between
> columnar and row based processing is not free, so the rule would make a
> decision based off of an estimate of the cost to do the transformation and
> the estimated speedup in processing time.
>
> This should allow us to disable columnar support by simply disabling the
> rule that modifies the physical SparkPlan.  It should be minimal risk to
> the existing row-based code path, as that code should not be touched, and
> in many cases could be reused to implement the columnar version.  This also
> allows for small easily manageable patches. No huge patches that no one
> wants to review.
>
> As far as the memory layout is concerned OnHeapColumnVector and
> OffHeapColumnVector are already really close to being Apache Arrow
> compatible so shifting them over would be a relatively simple change.
> Alternatively we could add in a new implementation that is Arrow compatible
> if there are reasons to keep the old ones.
>
> Again this is just to get the discussion started, any feedback is welcome,
> and we will file a SPIP on it once we feel like the major changes we are
> proposing are acceptable.
>
> Thanks,
>
> Bobby Evans
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-27 Thread Wenchen Fan
+1, all the known blockers are resolved. Thanks for driving this!

On Wed, Mar 27, 2019 at 11:31 AM DB Tsai  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.1.
>
> The vote is open until March 30 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.1-rc9 (commit
> 58301018003931454e93d8a309c7149cf84c279e):
> https://github.com/apache/spark/tree/v2.4.1-rc9
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1319/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc9-docs/
>
> The list of bug fixes going into 2.4.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.1?
> ===
>
> The current list of open tickets targeted at 2.4.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
> Apple, Inc
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Cross Join

2019-03-22 Thread Wenchen Fan
Spark 2.0 is EOL. Can you try 2.3 or 2.4?
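
For reference, a quick check to run on 2.3/2.4 (assumes a local SparkSession);
a 50 x 5 cross join should produce 250 rows:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local[*]").appName("cross-join-check").getOrCreate()
  import spark.implicits._

  val a = (1 to 50).toDF("a")
  val b = (1 to 5).toDF("b")
  val c = a.crossJoin(b)

  assert(c.count() == 250)  // 50 * 5: every pairing of rows from A and B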

On Thu, Mar 21, 2019 at 10:23 AM asma zgolli  wrote:

> Hello ,
>
> I need to cross my data and I'm executing a cross join on two dataframes.
>
> C = A.crossJoin(B)
> A has 50 records
> B has 5 records
>
> the result I'm getting with Spark 2.0 is a dataframe C having 50 records.
>
> only the first row from B was added to C.
>
> Is that a bug in Spark?
>
> Asma ZGOLLI
>
> PhD student in data engineering - computer science
>
>


Re: DataSourceV2 exceptions

2019-04-08 Thread Wenchen Fan
Like `RDD.map`, you can throw whatever exceptions and they will be
propagated to the driver side and fail the Spark job.
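
A small sketch of that behavior with a plain RDD (the same propagation applies
to exceptions thrown from a DataSourceV2 reader):

  import org.apache.spark.SparkException
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local[*]").appName("exception-demo").getOrCreate()

  val rdd = spark.sparkContext.parallelize(1 to 10).map { i =>
    if (i == 7) throw new java.io.IOException(s"cannot read record $i")
    i
  }

  try {
    rdd.collect()
  } catch {
    // The task failure surfaces on the driver wrapped in a SparkException,
    // with the original IOException reported in the failure message.
    case e: SparkException => println(s"job failed: ${e.getMessage}")
  }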

On Mon, Apr 8, 2019 at 3:10 PM Andrew Melo  wrote:

> Hello,
>
> I'm developing a (java) DataSourceV2 to read a columnar fileformat
> popular in a number of physical sciences (https://root.cern.ch/). (I
> also understand that the API isn't fixed and subject to change).
>
> My question is -- what is the expected way to transmit exceptions from
> the DataSource up to Spark? The DSV2 interface (unless I'm misreading
> it) doesn't specify any caught exceptions that can be thrown in the
> DS, so should I instead catch/rethrow any exceptions as uncaught
> exceptions? If so, is there a recommended hierarchy to throw from?
>
> thanks!
> Andrew
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread Wenchen Fan
+1

On Tue, Feb 19, 2019 at 10:50 AM Ryan Blue 
wrote:

> Hi everyone,
>
> It looks like there is consensus on the proposal, so I'd like to start a
> vote thread on the SPIP for identifiers in multi-catalog Spark.
>
> The doc is available here:
> https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing
>
> Please vote in the next 3 days.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
>
> Thanks!
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread Wenchen Fan
I think this is the right direction to go. Shall we move forward with a
vote and detailed designs?

On Mon, Feb 4, 2019 at 9:57 AM Ryan Blue  wrote:

> Hi everyone,
>
> This is a follow-up to the "Identifiers with multi-catalog support"
> discussion thread. I've taken the proposal I posted to that thread and
> written it up as an official SPIP for how to identify tables and other
> catalog objects when working with multiple catalogs.
>
> The doc is available here:
> https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing
>
> Everyone should be able to comment on the doc or discuss on this thread.
> I'd like to call a vote for this SPIP sometime in the next week, unless
> there is still significant ongoing discussion. From the feedback in the
> DSv2 sync and on the previous thread, I think it should go quickly.
>
> Thanks for taking a look at the proposal,
>
> rb
>
>
> --
> Ryan Blue
>


Re: [VOTE] SPIP: Spark API for Table Metadata

2019-03-01 Thread Wenchen Fan
+1, thanks for making it clear that this SPIP focuses on
high-level direction!

On Sat, Mar 2, 2019 at 9:35 AM Reynold Xin  wrote:

> Thanks Ryan. +1.
>
>
>
>
> On Fri, Mar 01, 2019 at 5:33 PM, Ryan Blue  wrote:
>
>> Actually, I went ahead and removed the confusing section. There is no
>> public API in the doc now, so that it is clear that it isn't a relevant
>> part of this vote.
>>
>> On Fri, Mar 1, 2019 at 4:58 PM Ryan Blue  wrote:
>>
>> I moved the public API to the "Implementation Sketch" section. That API
>> is not an important part of this, as that section notes.
>>
>> I completely agree that SPIPs should be high-level and that the
>> specifics, like method names, are not hard requirements. The proposal was
>> more of a sketch, but I was asked by Xiao in the DSv2 sync to make sure the
>> list of methods was complete. I think as long as we have agreement that the
>> intent is not to make the exact names binding, we should be okay.
>>
>> I can remove the user-facing API sketch, but I'd prefer to leave it in
>> the sketch section so we have it documented somewhere.
>>
>> On Fri, Mar 1, 2019 at 4:51 PM Reynold Xin  wrote:
>>
>> Ryan - can you take the public user facing API part out of that SPIP?
>>
>> In general it'd be better to have the SPIPs be higher level, and put the
>> detailed APIs in a separate doc. Alternatively, put them in the SPIP but
>> explicitly vote on the high level stuff and not the detailed APIs.
>>
>> I don't want to get to a situation in which two months later the
>> identical APIs were committed with the justification that they were voted
>> on a while ago. In this case, it's even more serious because while I think
>> we all have consensus on the higher level internal API, not much discussion
>> has happened with the user-facing API and we should just leave that out
>> explicitly.
>>
>>
>>
>>
>>
>>
>> On Fri, Mar 01, 2019 at 1:00 PM, Anthony Young-Garner <
>> anthony.young-gar...@cloudera.com.invalid> wrote:
>>
>> +1 (non-binding)
>>
>> On Thu, Feb 28, 2019 at 5:54 PM John Zhuge  wrote:
>>
>> +1 (non-binding)
>>
>> On Thu, Feb 28, 2019 at 9:11 AM Matt Cheah  wrote:
>>
>> +1 (non-binding)
>>
>>
>>
>> *From: *Jamison Bennett 
>> *Date: *Thursday, February 28, 2019 at 8:28 AM
>> *To: *Ryan Blue , Spark Dev List > >
>> *Subject: *Re: [VOTE] SPIP: Spark API for Table Metadata
>>
>>
>>
>> +1 (non-binding)
>>
>>
>> *Jamison Bennett*
>>
>> Cloudera Software Engineer
>>
>> jamison.benn...@cloudera.com
>>
>> 515 Congress Ave, Suite 1212   |   Austin, TX   |   78701
>>
>>
>>
>>
>>
>> On Thu, Feb 28, 2019 at 10:20 AM Ryan Blue 
>> wrote:
>>
>> +1 (non-binding)
>>
>>
>>
>> On Wed, Feb 27, 2019 at 8:34 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>> +1 (non-binding)
>>
>> On Wed, Feb 27, 2019, 6:28 PM Ryan Blue 
>> wrote:
>>
>> Hi everyone,
>>
>>
>>
>> In the last DSv2 sync, the consensus was that the table metadata SPIP was
>> ready to bring up for a vote. Now that the multi-catalog identifier SPIP
>> vote has passed, I'd like to start one for the table metadata API,
>> TableCatalog.
>>
>>
>>
>> The proposal is for adding a TableCatalog interface that will be used by
>> v2 plans. That interface has methods to load, create, drop, alter, refresh,
>> rename, and check existence for tables. It also specifies the set of
>> metadata used to configure tables: schema, partitioning, and key-value
>> properties. For more information, please read the SPIP proposal doc
>> [docs.google.com]
>> 
>> .
>>
>>
>>
>> Please vote in the next 3 days.
>>
>>
>>
>> [ ] +1: Accept the proposal as an official SPIP
>>
>> [ ] +0
>>
>> [ ] -1: I don't think this is a good idea because ...
>>
>>
>>
>>
>>
>> Thanks!
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>> --
>> John Zhuge
>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Wenchen Fan
+1

On Sat, Mar 2, 2019 at 6:11 AM Yinan Li  wrote:

> +1
>
> On Fri, Mar 1, 2019 at 12:37 PM Tom Graves 
> wrote:
>
>> +1 for the SPIP.
>>
>> Tom
>>
>> On Friday, March 1, 2019, 8:14:43 AM CST, Xingbo Jiang <
>> jiangxb1...@gmail.com> wrote:
>>
>>
>> Hi all,
>>
>> I want to call for a vote of SPARK-24615. It improves Spark by making it
>> aware of GPUs exposed by cluster managers, and hence Spark can match GPU
>> resources with user task requests properly. The proposal and production doc
>> were made available on dev@ to collect input. You can also find a design
>> sketch at SPARK-27005.
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thank you!
>>
>> Xingbo
>>
>


Re: Moving forward with the timestamp proposal

2019-02-20 Thread Wenchen Fan
I think this is the right direction to go, but I'm wondering how Spark can
support these new types if the underlying data sources (like Parquet files)
do not support them yet.

I took a quick look at the new doc for file formats, but I'm not sure what
the proposal is. Are we going to implement these new types in Parquet/ORC
first? Or are we going to use low-level physical types directly and add
Spark-specific metadata to Parquet/ORC files?
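
For reference, the three semantics in the proposal map onto java.time types
roughly as follows (an illustration only, not a Spark API):

  import java.time.{Instant, LocalDateTime, OffsetDateTime, ZoneOffset}

  // TIMESTAMP WITHOUT TIME ZONE: a wall-clock reading with no zone attached.
  val local: LocalDateTime = LocalDateTime.of(2019, 2, 20, 12, 0, 0)

  // TIMESTAMP WITH LOCAL TIME ZONE: an absolute instant, rendered in the
  // session's local zone on display.
  val instant: Instant = local.toInstant(ZoneOffset.UTC)

  // TIMESTAMP WITH TIME ZONE: the wall-clock reading together with its offset.
  val withOffset: OffsetDateTime = OffsetDateTime.of(local, ZoneOffset.ofHours(1))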

On Wed, Feb 20, 2019 at 10:57 PM Zoltan Ivanfi 
wrote:

> Hi,
>
> Last december we shared a timestamp harmonization proposal
>  with the Hive, Spark and Impala communities. This
> was followed by an extensive discussion in January that lead to various
> updates and improvements to the proposal, as well as the creation of a new
> document for file format components. February has been quiet regarding this
> topic and the latest revision of the proposal has been steady in the recent
> weeks.
>
> In short, the following is being proposed (please see the document for
> details):
>
>- The TIMESTAMP WITHOUT TIME ZONE type should have LocalDateTime
>semantics.
>- The TIMESTAMP WITH LOCAL TIME ZONE type should have Instant
>semantics.
>- The TIMESTAMP WITH TIME ZONE type should have OffsetDateTime
>semantics.
>
> This proposal is in accordance with the SQL standard and many major DB
> engines.
>
> Based on the feedback we got I believe that the latest revision of the
> proposal addresses the needs of all affected components, therefore I would
> like to move forward and create JIRA-s and/or roadmap documentation pages
> for the desired semantics of the different SQL types according to the
> proposal.
>
> Please let me know if you have any remaning concerns about the proposal or
> about the course of action outlined above.
>
> Thanks,
>
> Zoltan
>


Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-27 Thread Wenchen Fan
I'm good with the list from Ryan, thanks!

On Thu, Feb 28, 2019 at 1:00 AM Ryan Blue  wrote:

> I think that's a good plan. Let's get the functionality done, but mark it
> experimental pending a new row API.
>
> So is there agreement on this set of work, then?
>
> On Tue, Feb 26, 2019 at 6:30 PM Matei Zaharia 
> wrote:
>
>> To add to this, we can add a stable interface anytime if the original one
>> was marked as unstable; we wouldn’t have to wait until 4.0. We had a lot of
>> APIs that were experimental in 2.0 and then got stabilized in later 2.x
>> releases for example.
>>
>> Matei
>>
>> > On Feb 26, 2019, at 5:12 PM, Reynold Xin  wrote:
>> >
>> > We will have to fix that before we declare dev2 is stable, because
>> InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.
>> >
>> > On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah  wrote:
>> > Will that then require an API break down the line? Do we save that for
>> Spark 4?
>> >
>> >
>> >
>> >
>> > -Matt Cheah
>> >
>> >
>> >
>> > From: Ryan Blue 
>> > Reply-To: "rb...@netflix.com" 
>> > Date: Tuesday, February 26, 2019 at 4:53 PM
>> > To: Matt Cheah 
>> > Cc: Sean Owen , Wenchen Fan ,
>> Xiao Li , Matei Zaharia ,
>> Spark Dev List 
>> > Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
>> >
>> >
>> >
>> > That's a good question.
>> >
>> >
>> >
>> > While I'd love to have a solution for that, I don't think it is a good
>> idea to delay DSv2 until we have one. That is going to require a lot of
>> internal changes and I don't see how we could make the release date if we
>> are including an InternalRow replacement.
>> >
>> >
>> >
>> > On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah  wrote:
>> >
>> > Reynold made a note earlier about a proper Row API that isn’t
>> InternalRow – is that still on the table?
>> >
>> >
>> >
>> > -Matt Cheah
>> >
>> >
>> >
>> > From: Ryan Blue 
>> > Reply-To: "rb...@netflix.com" 
>> > Date: Tuesday, February 26, 2019 at 4:40 PM
>> > To: Matt Cheah 
>> > Cc: Sean Owen , Wenchen Fan ,
>> Xiao Li , Matei Zaharia ,
>> Spark Dev List 
>> > Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2
>> >
>> >
>> >
>> > Thanks for bumping this, Matt. I think we can have the discussion here
>> to clarify exactly what we’re committing to and then have a vote thread
>> once we’re agreed.
>> > Getting back to the DSv2 discussion, I think we have a good handle on
>> what would be added:
>> > · Plugin system for catalogs
>> >
>> > · TableCatalog interface (I’ll start a vote thread for this
>> SPIP shortly)
>> >
>> > · TableCatalog implementation backed by SessionCatalog that can
>> load v2 tables
>> >
>> > · Resolution rule to load v2 tables using the new catalog
>> >
>> > · CTAS logical and physical plan nodes
>> >
>> > · Conversions from SQL parsed logical plans to v2 logical plans
>> >
>> > Initially, this will always use the v2 catalog backed by SessionCatalog
>> to avoid dependence on the multi-catalog work. All of those are already
>> implemented and working, so I think it is reasonable that we can get them
>> in.
>> > Then we can consider a few stretch goals:
>> > · Get in as much DDL as we can. I think create and drop table
>> should be easy.
>> >
>> > · Multi-catalog identifier parsing and multi-catalog support
>> >
>> > If we get those last two in, it would be great. We can make the call
>> closer to release time. Does anyone want to change this set of work?
>> >
>> >
>> > On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah  wrote:
>> >
>> > What would then be the next steps we'd take to collectively decide on
>> plans and timelines moving forward? Might I suggest scheduling a conference
>> call with appropriate PMCs to put our ideas together? Maybe such a
>> discussion can take place at next week's meeting? Or do we need to have a
>> separate formalized voting thread which is guided by a PMC?
>> >
>> > My suggestion is to try to make concrete steps forward and to avoid
>> letting this slip through the cracks.
>> >
>> > I also thin

Re: [build system] jenkins wedged again, rebooting master node

2019-03-15 Thread Wenchen Fan
cool, thanks!

On Sat, Mar 16, 2019 at 1:08 AM shane knapp  wrote:

> well, that box rebooted in record time!  we're back up and building.
>
> and as always, i'll keep a close eye on things today...  jenkins usually
> works great, until it doesn't.  :\
>
> On Fri, Mar 15, 2019 at 9:52 AM shane knapp  wrote:
>
>> as some of you may have noticed, jenkins got itself in a bad state
>> multiple times over the past couple of weeks.  usually restarting the
>> service is sufficient, but it appears that i need to hit it w/the reboot
>> hammer.
>>
>> jenkins will be down for the next 20-30 minutes as the node reboots and
>> jenkins spins back up.  i'll reply here w/any updates.
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Wenchen Fan
Which version of Parquet has this bug? Maybe we can downgrade it.

On Mon, Mar 11, 2019 at 10:34 AM Mark Hamstra 
wrote:

> It worked in 2.3. We broke it with 2.4.0 and were informed of that
> regression late in the 2.4.0 release process. Since we didn't fix it before
> the 2.4.0 release, it should have been noted as a known issue. To now claim
> that there is no regression from 2.4.0 is a circular argument denying the
> existence of a known regression from 2.3.
>
> On Sun, Mar 10, 2019 at 6:53 PM Sean Owen  wrote:
>
>> From https://issues.apache.org/jira/browse/SPARK-25588, I'm reading that:
>>
>> - this is a Parquet-Avro version conflict thing
>> - a downstream app wants different versions of Parquet and Avro than
>> Spark uses, which triggers it
>> - it doesn't work in 2.4.0
>>
>> It's not a regression from 2.4.0, which is the immediate question.
>> There isn't even a Parquet fix available.
>> But I'm not even seeing why this is excuse-making?
>>
>> On Sun, Mar 10, 2019 at 8:44 PM Mark Hamstra 
>> wrote:
>> >
>> > Now wait... we created a regression in 2.4.0. Arguably, we should have
>> blocked that release until we had a fix; but the issue came up late in the
>> release process and it looks to me like there wasn't an adequate fix
>> immediately available, so we did something bad and released 2.4.0 with a
>> known regression. Saying that there is now no regression from 2.4 is
>> tautological and no excuse for not taking in a fix -- and it looks like
>> that fix has been waiting for months.
>>
>


Re: spark sql occer error

2019-03-22 Thread Wenchen Fan
Did you include the whole error message?
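
Independent of the missing stack trace: aggregate conditions cannot go in a
WHERE clause (WHERE is evaluated before aggregation). If the intent is to keep
the groups where succ is 0 and fail is positive, the usual form is HAVING; a
sketch, grouping by the keys shown in the plan below and assuming an existing
SparkSession named spark:

  val result = spark.sql("""
    SELECT imei, tag, product_id,
           SUM(CASE WHEN succ1 >= 1 THEN 1 ELSE 0 END) AS succ,
           SUM(CASE WHEN fail1 >= 1 AND succ1 = 0 THEN 1 ELSE 0 END) AS fail,
           COUNT(*) AS cnt
    FROM t_tbl
    GROUP BY imei, tag, product_id
    HAVING SUM(CASE WHEN succ1 >= 1 THEN 1 ELSE 0 END) = 0
       AND SUM(CASE WHEN fail1 >= 1 AND succ1 = 0 THEN 1 ELSE 0 END) > 0
  """)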

On Fri, Mar 22, 2019 at 12:45 AM 563280193 <563280...@qq.com> wrote:

> Hi ,
> I ran a spark sql like this:
>
> *select imei,tag, product_id,*
> *   sum(case when succ1>=1 then 1 else 0 end) as succ,*
> *   sum(case when fail1>=1 and succ1=0 then 1 else 0 end) as fail,*
> *   count(*) as cnt*
> *from t_tbl*
> *where sum(case when succ1>=1 then 1 else 0 end)=0 and sum(case when
> fail1>=1 and succ1=0 then 1 else 0 end)>0*
> *group by **tag, product_id, app_version*
>
> It occur a problem below:
>
> * execute, tree:*
> *Exchange hashpartitioning(imei#0, tag#1, product_id#2, 100)*
> *+- *(1) HashAggregate(keys=[imei#0, tag#1, product_id#2],
> functions=[partial_sum(cast(CASE WHEN (succ1#3L >= 1) THEN 1 ELSE 0 END as
> bigint)), partial_sum(cast(CASE WHEN ((fail1#4L >= 1) && (succ1#3L = 0))
> THEN 1 ELSE 0 END as bigint)), partial_count(1)], output=[imei#0, tag#1,
> product_id#2, sum#49L, sum#50L, count#51L])*
> *   +- *(1) Filter ((sum(cast(CASE WHEN (succ1#3L >= 1) THEN 1 ELSE 0 END
> as bigint)) = 0) && (sum(cast(CASE WHEN ((fail1#4L >= 1) && (succ1#3L = 0))
> THEN 1 ELSE 0 END as bigint)) > 0))*
> *  +- *(1) FileScan json [imei#0,tag#1,product_id#2,succ1#3L,fail1#4L]
> Batched: false, Format: JSON, Location: InMemoryFileIndex[hdfs://xx],
> PartitionFilters: [], PushedFilters: [], ReadSchema:
> struct*
>
>
> Could anyone help me to solve this problem?
> my spark version is 2.3.1
> thank you.
>


Re: Manually reading parquet files.

2019-03-22 Thread Wenchen Fan
Try `val enconder = RowEncoder(df.schema).resolveAndBind()` ?
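
A slightly fuller sketch of that suggestion, reusing the df, readFile and pFile
definitions from the code quoted below (Spark 2.x catalyst APIs, which are
internal and subject to change):

  import scala.collection.JavaConverters._

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.encoders.RowEncoder
  import org.apache.spark.sql.vectorized.ColumnarBatch

  // resolveAndBind() resolves the encoder's expressions against the schema so
  // that fromRow() can be evaluated on an InternalRow.
  val encoder = RowEncoder(df.schema).resolveAndBind()

  val rows: Seq[Row] = readFile(pFile).flatMap {
    case r: InternalRow   => Seq(r)
    case b: ColumnarBatch => b.rowIterator().asScala
  }.map(encoder.fromRow).toSeq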

On Thu, Mar 21, 2019 at 5:39 PM Long, Andrew 
wrote:

> Thanks a ton for the help!
>
>
>
> Is there a standardized way of converting the internal row to rows?
>
>
>
> I’ve tried this but I'm getting an exception
>
>
>
> *val *enconder = *RowEncoder*(df.schema)
> *val *rows = readFile(pFile).flatMap(_ *match *{
>   *case *r: InternalRow => *Seq*(r)
>   *case *b: ColumnarBatch => b.rowIterator().asScala
> })
>   .map(enconder.fromRow(_))
>   .toList
>
>
>
> java.lang.RuntimeException: Error while decoding:
> java.lang.UnsupportedOperationException: Cannot evaluate expression:
> getcolumnbyordinal(0, IntegerType)
>
> createexternalrow(getcolumnbyordinal(0, IntegerType),
> getcolumnbyordinal(1, IntegerType), getcolumnbyordinal(2,
> StringType).toString, StructField(pk,IntegerType,false),
> StructField(ordering,IntegerType,false), StructField(col_a,StringType,true))
>
>
>
> at
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:305)
>
> at
> com.amazon.horizon.azulene.ParquetReadTests$$anonfun$2.apply(ParquetReadTests.scala:100)
>
> at
> com.amazon.horizon.azulene.ParquetReadTests$$anonfun$2.apply(ParquetReadTests.scala:100)
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"rb...@netflix.com" 
> *Date: *Thursday, March 21, 2019 at 3:32 PM
> *To: *"Long, Andrew" 
> *Cc: *"dev@spark.apache.org" , "
> u...@spark.apache.org" , "horizon-...@amazon.com" <
> horizon-...@amazon.com>
> *Subject: *Re: Manually reading parquet files.
>
>
>
> You're getting InternalRow instances. They probably have the data you
> want, but the toString representation doesn't match the data for
> InternalRow.
>
>
>
> On Thu, Mar 21, 2019 at 3:28 PM Long, Andrew 
> wrote:
>
> Hello Friends,
>
>
>
> I’m working on a performance improvement that reads additional parquet
> files in the middle of a lambda and I’m running into some issues.  This is
>> > what I'd like to do
>
>
>
> ds.mapPartitions(x=>{
>   //read parquet file in and perform an operation with x
> })
>
>
>
>
>
> Here’s my current POC code but I’m getting nonsense back from the row
> reader.
>
>
>
> *import *com.amazon.horizon.azulene.util.SparkFileUtils._
>
> *spark*.*conf*.set("spark.sql.parquet.enableVectorizedReader","false")
>
> *val *data = *List*(
>   *TestRow*(1,1,"asdf"),
>   *TestRow*(2,1,"asdf"),
>   *TestRow*(3,1,"asdf"),
>   *TestRow*(4,1,"asdf")
> )
>
> *val *df = *spark*.createDataFrame(data)
>
> *val *folder = Files.*createTempDirectory*("azulene-test")
>
> *val *folderPath = folder.toAbsolutePath.toString + "/"
> df.write.mode("overwrite").parquet(folderPath)
>
> *val *files = *spark*.fs.listStatus(folder.toUri)
>
> *val *file = files(1)//skip _success file
>
> *val *partitionSchema = *StructType*(*Seq*.empty)
> *val *dataSchema = df.schema
> *val *fileFormat = *new *ParquetFileFormat()
>
> *val *path = file.getPath
>
> *val *status = *spark*.fs.getFileStatus(path)
>
> *val *pFile = *new *PartitionedFile(
>   partitionValues = InternalRow.*empty*,//This should be empty for non
> partitioned values
>   filePath = path.toString,
>   start = 0,
>   length = status.getLen
> )
>
> *val *readFile: (PartitionedFile) => Iterator[Any] =
> //Iterator[InternalRow]
>   fileFormat.buildReaderWithPartitionValues(
> sparkSession = *spark*,
> dataSchema = dataSchema,
> partitionSchema = partitionSchema,//this should be empty for non
> partitioned feilds
> requiredSchema = dataSchema,
> filters = *Seq*.empty,
> options = *Map*.*empty*,
> hadoopConf = *spark*.sparkContext.hadoopConfiguration
> //relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options))
>   )
>
> *import *scala.collection.JavaConverters._
>
> *val *rows = readFile(pFile).flatMap(_ *match *{
>   *case *r: InternalRow => *Seq*(r)
>
>   // This doesn't work. vector mode is doing something screwy
>   *case *b: ColumnarBatch => b.rowIterator().asScala
> }).toList
>
> *println*(rows)
> //List([0,1,5b,24,66647361])
> //??this is wrong I think
>
>
>
> Has anyone attempted something similar?
>
>
>
> Cheers Andrew
>
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>


Re: Compatibility on build-in DateTime functions with Hive/Presto

2019-02-17 Thread Wenchen Fan
In the Spark SQL example, `year("1912")` means: first cast "1912" to the date
type, and then call the "year" function.

In the Postgres example, `date_part('year', TIMESTAMP '2017')` means: construct
a timestamp literal, and then call the "date_part" function.

Can you try a date literal in Postgres?
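
A quick illustration of that difference (assumes an existing SparkSession named
spark); the outputs are the ones reported in the thread below:

  // Spark first casts the string to a date, and the cast currently accepts a
  // bare year, yielding 1912-01-01.
  spark.sql("SELECT CAST('1912' AS DATE)").show()

  // Hence year/month/hour return 1912, 1 and 0, where Hive and Presto return NULL.
  spark.sql("SELECT year('1912'), month('1912'), hour('1912')").show()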

On Mon, Feb 18, 2019 at 11:21 AM Xiao Li  wrote:

> It is hard to say this is a bug. In the existing Spark applications, the
> current behavior might be already considered as a feature instead of a bug.
> I am wondering whether we should introduce a strict mode to throw an exception
> for this kind of type casting, like Postgres does.
>
> Darcy Shen wrote on Sun, Feb 17, 2019 at 6:22 PM:
>
>> For PostgreSQL:
>>
>> postgres=# SELECT date_part('year',TIMESTAMP '2017-01-01');
>> date_part
>> ---
>>   2017
>> (1 row)
>>
>> postgres=# SELECT date_part('year',TIMESTAMP '2017');
>> ERROR:  invalid input syntax for type timestamp: "2017"
>> LINE 1: SELECT date_part('year',TIMESTAMP '2017');
>>   ^
>> postgres=# SELECT date_part('month',TIMESTAMP '2017-01-01');
>> date_part
>> ---
>>  1
>> (1 row)
>>
>> postgres=# SELECT date_part('year',TIMESTAMP '2017-1-1');
>> date_part
>> ---
>>   2017
>> (1 row)
>>
>>
>> We'd better follow the Hive semantics. And removing support for  and
>> -d[d] will simplify the routine.
>>
>> I'll create a Pull Request later.
>>
>>
>>  On Sat, 16 Feb 2019 00:51:43 +0800 *Xiao Li > >* wrote 
>>
>> We normally do not follow MySQL. Check the commercial database [like
>> Oracle]? or the open source PostgreSQL?
>>
>> Sean Owen  于2019年2月15日周五 上午5:34写道:
>>
>> year("1912") == 1912 makes sense; month("1912") == 1 is odd but not
>>> wrong. On the one hand, some answer might be better than none. But
>>> then, we are trying to match Hive semantics where the SQL standard is
>>> silent. Is this actually defined behavior in a SQL standard, or, what
>>> does MySQL do?
>>>
>>> On Fri, Feb 15, 2019 at 2:07 AM Darcy Shen 
>>> wrote:
>>> >
>>> > See https://issues.apache.org/jira/browse/SPARK-26885 and
>>> https://github.com/apache/spark/blob/71170e74df5c7ec657f61154212d1dc2ba7d0613/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
>>> >
>>> >
>>> >
>>> >
>>> > stringToTimestamp, stringToDate support , as a result:
>>> >
>>> > select year("1912") => 1912
>>> >
>>> > select month("1912") => 1
>>> >
>>> > select hour("1912") => 0
>>> >
>>> >
>>> >
>>> > In Presto or Hive,
>>> >
>>> > select year("1912") => null
>>> >
>>> > select month("1912") => null
>>> >
>>> > select hour("1912") => null
>>> >
>>> >
>>> >
>>> > It is not a good idea to support  for a Date/DateTime. As well as
>>> -[d]d.
>>> >
>>> >
>>> > What's your opinion?
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>>


Re: [ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-18 Thread Wenchen Fan
great job!

On Mon, Feb 18, 2019 at 4:24 PM Hyukjin Kwon  wrote:

> Yay! Good job Takeshi!
>
> On Mon, 18 Feb 2019, 14:47 Takeshi Yamamuro 
>> We are happy to announce the availability of Spark 2.3.3!
>>
>> Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3
>> maintenance branch of Spark. We strongly recommend all 2.3.x users to
>> upgrade to this stable release.
>>
>> To download Spark 2.3.3, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2-3-3.html
>>
>> We would like to acknowledge all community members for contributing to
>> this release. This release would not have been possible without you.
>>
>> Best,
>> Takeshi
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: Time to cut an Apache 2.4.1 release?

2019-02-12 Thread Wenchen Fan
+1 for 2.4.1

On Tue, Feb 12, 2019 at 7:55 PM Hyukjin Kwon  wrote:

> +1 for 2.4.1
>
> On Tue, Feb 12, 2019 at 4:56 PM, Dongjin Lee wrote:
>
>> > SPARK-23539 is a non-trivial improvement, so probably would not be
>> back-ported to 2.4.x.
>>
>> Got it. It seems reasonable.
>>
>> Committers:
>>
>> Please don't omit SPARK-23539 from 2.5.0. Kafka community needs this
>> feature.
>>
>> Thanks,
>> Dongjin
>>
>> On Tue, Feb 12, 2019 at 1:50 PM Takeshi Yamamuro 
>> wrote:
>>
>>> +1, too.
>>> branch-2.4 accumulates too many commits..:
>>>
>>> https://github.com/apache/spark/compare/0a4c03f7d084f1d2aa48673b99f3b9496893ce8d...af3c7111efd22907976fc8bbd7810fe3cfd92092
>>>
>>> On Tue, Feb 12, 2019 at 12:36 PM Dongjoon Hyun 
>>> wrote:
>>>
 Thank you, DB.

 +1, Yes. It's time for preparing 2.4.1 release.

 Bests,
 Dongjoon.

 On 2019/02/12 03:16:05, Sean Owen  wrote:
 > I support a 2.4.1 release now, yes.
 >
 > SPARK-23539 is a non-trivial improvement, so probably would not be
 > back-ported to 2.4.x.SPARK-26154 does look like a bug whose fix could
 > be back-ported, but that's a big change. I wouldn't hold up 2.4.1 for
 > it, but it could go in if otherwise ready.
 >
 >
 > On Mon, Feb 11, 2019 at 5:20 PM Dongjin Lee 
 wrote:
 > >
 > > Hi DB,
 > >
 > > Could you add SPARK-23539[^1] into 2.4.1? I opened the PR[^2] a
 little bit ago, but it has not included in 2.3.0 nor get enough review.
 > >
 > > Thanks,
 > > Dongjin
 > >
 > > [^1]: https://issues.apache.org/jira/browse/SPARK-23539
 > > [^2]: https://github.com/apache/spark/pull/22282
 > >
 > > On Tue, Feb 12, 2019 at 6:28 AM Jungtaek Lim 
 wrote:
 > >>
 > >> Given SPARK-26154 [1] is a correctness issue and PR [2] is
 submitted, I hope it can be reviewed and included within Spark 2.4.1 -
 otherwise it will be a long-live correctness issue.
 > >>
 > >> Thanks,
 > >> Jungtaek Lim (HeartSaVioR)
 > >>
 > >> 1. https://issues.apache.org/jira/browse/SPARK-26154
 > >> 2. https://github.com/apache/spark/pull/23634
 > >>
 > >>
 > >> On Tue, Feb 12, 2019 at 6:17 AM, DB Tsai wrote:
 > >>>
 > >>> Hello all,
 > >>>
 > >>> I am preparing to cut a new Apache 2.4.1 release as there are
 many bugs and correctness issues fixed in branch-2.4.
 > >>>
 > >>> The list of addressed issues are
 https://issues.apache.org/jira/browse/SPARK-26583?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.4.1%20order%20by%20updated%20DESC
 > >>>
 > >>> Let me know if you have any concern or any PR you would like to
 get in.
 > >>>
 > >>> Thanks!
 > >>>
 > >>>
 -
 > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 > >>>
 > >
 > >
 > > --
 > > Dongjin Lee
 > >
 > > A hitchhiker in the mathematical world.
 > >
 > > github: github.com/dongjinleekr
 > > linkedin: kr.linkedin.com/in/dongjinleekr
 > > speakerdeck: speakerdeck.com/dongjin
 >
 > -
 > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>> --
>> *Dongjin Lee*
>>
>> *A hitchhiker in the mathematical world.*
>> *github:  github.com/dongjinleekr
>> linkedin: kr.linkedin.com/in/dongjinleekr
>> speakerdeck: speakerdeck.com/dongjin
>> *
>>
>


Re: Time to cut an Apache 2.4.1 release?

2019-02-14 Thread Wenchen Fan
Do you know which bug ORC 1.5.2 introduced? Or is it because Hive uses a
legacy version of ORC which has a bug?

On Thu, Feb 14, 2019 at 2:35 PM Darcy Shen  wrote:

>
> We found that an ORC table created by Spark 2.4 failed to be read by Hive
> 2.1.1.
>
>
> spark-sql -e 'CREATE TABLE tmp.orcTable2 USING orc AS SELECT * FROM
> tmp.orcTable1 limit 10;'
>
> hive -e 'select * from tmp.orcTable2'
>
> The ERROR messages by Hive:
>
> Failed with exception java.io.IOException:java.lang.RuntimeException: ORC
> split generation failed with exception:
> java.lang.ArrayIndexOutOfBoundsException: 6
>
> And Spark 2.3.2 (or below) works fine.
>
> I think we should git revert [SPARK-24576][BUILD] Upgrade Apache ORC to
> 1.5.2 by Dongjoon Hyun
>
>
>  On Tue, 12 Feb 2019 16:56:09 +0800, Dongjin Lee wrote:
>
> > SPARK-23539 is a non-trivial improvement, so probably would not be
> back-ported to 2.4.x.
>
> Got it. It seems reasonable.
>
> Committers:
>
> Please don't omit SPARK-23539 from 2.5.0. Kafka community needs this
> feature.
>
> Thanks,
> Dongjin
>
> On Tue, Feb 12, 2019 at 1:50 PM Takeshi Yamamuro 
> wrote:
>
>
>
> --
>
> *Dongjin Lee*
>
>
> *A hitchhiker in the mathematical world.*
>
>
>
>
> *github:  github.com/dongjinleekr
> linkedin: kr.linkedin.com/in/dongjinleekr
> speakerdeck: speakerdeck.com/dongjin
> *
>
>> +1, too.
>> branch-2.4 accumulates too many commits..:
>>
>> https://github.com/apache/spark/compare/0a4c03f7d084f1d2aa48673b99f3b9496893ce8d...af3c7111efd22907976fc8bbd7810fe3cfd92092
>>
>> On Tue, Feb 12, 2019 at 12:36 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you, DB.
>>>
>>> +1, Yes. It's time for preparing 2.4.1 release.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On 2019/02/12 03:16:05, Sean Owen  wrote:
>>> > I support a 2.4.1 release now, yes.
>>> >
>>> > SPARK-23539 is a non-trivial improvement, so probably would not be
>>> > back-ported to 2.4.x. SPARK-26154 does look like a bug whose fix could
>>> > be back-ported, but that's a big change. I wouldn't hold up 2.4.1 for
>>> > it, but it could go in if otherwise ready.
>>> >
>>> >
>>> > On Mon, Feb 11, 2019 at 5:20 PM Dongjin Lee 
>>> wrote:
>>> > >
>>> > > Hi DB,
>>> > >
>>> > > Could you add SPARK-23539[^1] into 2.4.1? I opened the PR[^2] a
>>> little while ago, but it has not been included in 2.3.0 nor received enough review.
>>> > >
>>> > > Thanks,
>>> > > Dongjin
>>> > >
>>> > > [^1]: https://issues.apache.org/jira/browse/SPARK-23539
>>> > > [^2]: https://github.com/apache/spark/pull/22282
>>> > >
>>> > > On Tue, Feb 12, 2019 at 6:28 AM Jungtaek Lim 
>>> wrote:
>>> > >>
>>> > >> Given SPARK-26154 [1] is a correctness issue and PR [2] is
>>> submitted, I hope it can be reviewed and included within Spark 2.4.1 -
>>> otherwise it will be a long-lived correctness issue.
>>> > >>
>>> > >> Thanks,
>>> > >> Jungtaek Lim (HeartSaVioR)
>>> > >>
>>> > >> 1. https://issues.apache.org/jira/browse/SPARK-26154
>>> > >> 2. https://github.com/apache/spark/pull/23634
>>> > >>
>>> > >>
>>> > >> On Tue, Feb 12, 2019 at 6:17 AM DB Tsai wrote:
>>> > >>>
>>> > >>> Hello all,
>>> > >>>
>>> > >>> I am preparing to cut a new Apache 2.4.1 release as there are many
>>> bugs and correctness issues fixed in branch-2.4.
>>> > >>>
>>> > >>> The list of addressed issues are
>>> https://issues.apache.org/jira/browse/SPARK-26583?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.4.1%20order%20by%20updated%20DESC
>>> > >>>
>>> > >>> Let me know if you have any concern or any PR you would like to
>>> get in.
>>> > >>>
>>> > >>> Thanks!
>>> > >>>
>>> > >>>
>>> -
>>> > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >>>
>>> > >
>>> > >
>>> > > --
>>> > > Dongjin Lee
>>> > >
>>> > > A hitchhiker in the mathematical world.
>>> > >
>>> > > github: github.com/dongjinleekr
>>> > > linkedin: kr.linkedin.com/in/dongjinleekr
>>> > > speakerdeck: speakerdeck.com/dongjin
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
>


Re: Tungsten Memory Consumer

2019-02-11 Thread Wenchen Fan
what do you mean by "Tungsten Consumer"?

On Fri, Feb 8, 2019 at 6:11 PM Jack Kolokasis 
wrote:

> Hello all,
>  I am studying the Tungsten Project and I am wondering when Spark
> creates a Tungsten consumer. While I am running some applications, I see
> that Spark creates a Tungsten Consumer, while in other applications it does not
> (using the same configuration). When does this happen?
>
> I am looking forward to your reply.
>
> --Jack Kolokasis
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [build system] Jenkins stopped working

2019-02-19 Thread Wenchen Fan
Thanks Shane!

On Wed, Feb 20, 2019 at 6:48 AM shane knapp  wrote:

> alright, i increased the httpd and proxy timeouts and kicked apache.  i'll
> keep an eye on things, but as of right now we're happily building.
>
> On Tue, Feb 19, 2019 at 2:25 PM shane knapp  wrote:
>
>> aand i had to issue another restart.  it's the ever-annoying (and
>> never quite clear why it's happening) proxy/502 error.
>>
>> currently investigating.
>>
>> On Tue, Feb 19, 2019 at 9:21 AM shane knapp  wrote:
>>
>>> forgot to hit send before i went in to the office:  we're back up and
>>> building!
>>>
>>> On Tue, Feb 19, 2019 at 8:06 AM shane knapp  wrote:
>>>
 yep, it got wedged.  issued a restart and it should be back up in a few
 minutes.

 On Tue, Feb 19, 2019 at 7:32 AM Parth Gandhi 
 wrote:

> Yes, it seems to be down. The unit tests are not getting kicked off.
>
> Regards,
> Parth Kamlesh Gandhi
>
>
> On Tue, Feb 19, 2019 at 8:29 AM Hyukjin Kwon 
> wrote:
>
>> Hi all,
>>
>> Looks like Jenkins stopped working. Did I maybe miss a thread, or has nobody
>> reported this yet?
>>
>> Thanks!
>>
>>
>>

 --
 Shane Knapp
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Spark 2.4.2

2019-04-17 Thread Wenchen Fan
I volunteer to be the release manager for 2.4.2, as I was also going to
propose 2.4.2 because of the reverting of SPARK-25250. Is there any other
ongoing bug fixes we want to include in 2.4.2? If no I'd like to start the
release process today (CST).

Thanks,
Wenchen

On Thu, Apr 18, 2019 at 3:44 AM Sean Owen  wrote:

> I think the 'only backport bug fixes to branches' principle remains sound.
> But what's a bug fix? Something that changes behavior to match what is
> explicitly supposed to happen, or implicitly supposed to happen -- implied
> by what other similar things do, by reasonable user expectations, or simply
> how it worked previously.
>
> Is this a bug fix? I guess the criteria that matches is that behavior
> doesn't match reasonable user expectations? I don't know enough to have a
> strong opinion. I also don't think there is currently an objection to
> backporting it, whatever it's called.
>
>
> Is the question whether this needs a new release? There's no harm in
> another point release, other than needing a volunteer release manager. One
> could say, wait a bit longer to see what more info comes in about 2.4.1.
> But given that 2.4.1 took like 2 months, it's reasonable to move towards a
> release cycle again. I don't see objection to that either (?)
>
>
> The meta question remains: is a 'bug fix' definition even agreed, and
> being consistently applied? There aren't correct answers, only best guesses
> from each person's own experience, judgment and priorities. These can
> differ even when applied in good faith.
>
> Sometimes the variance of opinion comes because people have different info
> that needs to be surfaced. Here, maybe it's best to share what about that
> offline conversation was convincing, for example.
>
> I'd say it's also important to separate what one would prefer from what
> one can't live with(out). Assuming one trusts the intent and experience of
> the handful of others with an opinion, I'd defer to someone who wants X and
> will own it, even if I'm moderately against it. Otherwise we'd get little
> done.
>
> In that light, it seems like both of the PRs at issue here are not _wrong_
> to backport. This is a good pair that highlights why, when there isn't a
> clear reason to do / not do something (e.g. obvious errors, breaking public
> APIs) we give benefit-of-the-doubt in order to get it later.
>
>
> On Wed, Apr 17, 2019 at 12:09 PM Ryan Blue 
> wrote:
>
>> Sorry, I should be more clear about what I'm trying to say here.
>>
>> In the past, Xiao has taken the opposite stance. A good example is PR
>> #21060
>>  that
>> was a very similar situation: behavior didn't match what was expected and
>> there was low risk. There was a long argument and the patch didn't make it
>> into 2.3 (to my knowledge).
>>
>> What we call these low-risk behavior fixes doesn't matter. I called it a
>> bug on #21060 but I'm applying Xiao's previous definition here to make a
>> point. Whatever term we use, we clearly have times when we want to allow a
>> patch because it is low risk and helps someone. Let's just be clear that
>> that's perfectly fine.
>>
>> On Wed, Apr 17, 2019 at 9:34 AM Ryan Blue  wrote:
>>
>>> How is this a bug fix?
>>>
>>> On Wed, Apr 17, 2019 at 9:30 AM Xiao Li  wrote:
>>>
 Michael and I had an offline discussion about this PR
 https://github.com/apache/spark/pull/24365. He convinced me that this
 is a bug fix. The code changes of this bug fix are very tiny and the risk
 is very low. To avoid any behavior change in the patch releases, this PR
 also added a legacy flag whose default value is off.




Re: Access to live data of cached dataFrame

2019-05-21 Thread Wenchen Fan
When you cache a dataframe, you actually cache a logical plan. That's why
re-creating the dataframe doesn't work: Spark finds out the logical plan is
cached and picks the cached data.

You need to uncache the dataframe, or go back to the SQL way:
df.createTempView("abc")
spark.table("abc").cache()
df.show // returns latest data.
spark.table("abc").show // returns cached data.


On Mon, May 20, 2019 at 3:33 AM Tomas Bartalos 
wrote:

> I'm trying to re-read however I'm getting cached data (which is a bit
> confusing). For re-read I'm issuing:
> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count
>
> The cache seems to be global, also influencing new dataframes.
>
> So the question is how I should re-read without losing the cached data
> (without using unpersist)?
>
> As I mentioned, with SQL it's possible - I can create a cached view, so when
> I access the original table I get live data, and when I access the view I get
> cached data.
>
> BR,
> Tomas
>
> On Fri, 17 May 2019, 8:57 pm Sean Owen,  wrote:
>
>> A cached DataFrame isn't supposed to change, by definition.
>> You can re-read each time or consider setting up a streaming source on
>> the table which provides a result that updates as new data comes in.
>>
>> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos 
>> wrote:
>> >
>> > Hello,
>> >
>> > I have a cached dataframe:
>> >
>> >
>> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
>> >
>> > I would like to access the "live" data for this data frame without
>> deleting the cache (using unpersist()). Whatever I do I always get the
>> cached data on subsequent queries. Even adding new column to the query
>> doesn't help:
>> >
>> >
>> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy",
>> lit("dummy"))
>> >
>> >
>> > I'm able to workaround this using cached sql view, but I couldn't find
>> a pure dataFrame solution.
>> >
>> > Thank you,
>> > Tomas
>>
>


Re: RDD object Out of scope.

2019-05-21 Thread Wenchen Fan
RDD is kind of a pointer to the actual data. Unless it's cached, we don't
need to clean up the RDD.
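
To illustrate the point in code, here is a small sketch (assuming a spark-shell session with a SparkContext named sc; the variable names are made up):

// A plain RDD is only a lineage description plus a pointer to its data source;
// when the object becomes unreachable it is reclaimed by normal JVM GC and
// Spark has nothing extra to clean up.
val raw = sc.parallelize(1 to 1000)

// A persisted RDD materializes blocks on the executors, so persist()/cache()
// registers it with the ContextCleaner; the blocks go away either via an
// explicit unpersist() or asynchronously after the RDD object is GC'd.
val cached = raw.map(_ * 2).cache()
cached.count()      // materializes the cached blocks
cached.unpersist()  // explicit cleanup of the cached blocks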

On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris
 wrote:

> HI Spark developers,
>
>
>
> Can someone point out the code where RDD objects go out of scope? I
> found the ContextCleaner
> code, in which only persisted RDDs are cleaned up at regular intervals if
> the RDD is registered for cleanup. I have not found where the destructor for
> the RDD object is invoked. I am trying to understand when RDD cleanup happens
> when the RDD is not persisted.
>
>
>
> Thanks in advance, appreciate your help.
>
> Nasrulla
>
>
>


Re: [VOTE] Release Apache Spark 2.4.2

2019-04-29 Thread Wenchen Fan
>  it could just be fixed in master rather than back-port and re-roll the RC

I don't think the release script is part of the released product. That
said, we can just fix the release script in branch 2.4 without creating a
new RC. We can even create a new repo for the release script, like
spark-website, to make it clearer.

On Tue, Apr 30, 2019 at 7:22 AM Sean Owen  wrote:

> I think this is a reasonable idea; I know @vanzin had suggested it was
> simpler to use the latest in case a bug was found in the release script and
> then it could just be fixed in master rather than back-port and re-roll the
> RC. That said I think we did / had to already drop the ability to build <=
> 2.3 from the master release script already.
>
> On Sun, Apr 28, 2019 at 9:25 PM Wenchen Fan  wrote:
>
>> >  ... by using the release script of Spark 2.4 branch
>>
>> Shall we keep it as a policy? Previously we used the release script from
>> the master branch to do the release work for all Spark versions, now I feel
>> it's simpler and less error-prone to let the release script only handle one
>> branch. We don't keep many branches as active at the same time, so the
>> maintenance overhead for the release script should be OK.
>>
>>>
>>>


[VOTE] Release Apache Spark 2.4.2

2019-04-18 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
2.4.2.

The vote is open until April 23 PST and passes if a majority +1 PMC votes
are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.4.2-rc1 (commit
a44880ba74caab7a987128cb09c4bee41617770a):
https://github.com/apache/spark/tree/v2.4.2-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1322/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-docs/

The list of bug fixes going into 2.4.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12344996

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.2?
===

The current list of open tickets targeted at 2.4.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.2

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: Spark 2.4.2

2019-04-18 Thread Wenchen Fan
I've cut RC1. If people think we must upgrade Jackson in 2.4, I can cut RC2
shortly.

Thanks,
Wenchen

On Fri, Apr 19, 2019 at 3:32 AM Felix Cheung 
wrote:

> Re shading - same argument I’ve made earlier today in a PR...
>
> (Context- in many cases Spark has light or indirect dependencies but
> bringing them into the process breaks users code easily)
>
>
> --
> *From:* Michael Heuer 
> *Sent:* Thursday, April 18, 2019 6:41 AM
> *To:* Reynold Xin
> *Cc:* Sean Owen; Michael Armbrust; Ryan Blue; Spark Dev List; Wenchen
> Fan; Xiao Li
> *Subject:* Re: Spark 2.4.2
>
> +100
>
>
> On Apr 18, 2019, at 1:48 AM, Reynold Xin  wrote:
>
> We should have shaded all Spark’s dependencies :(
>
> On Wed, Apr 17, 2019 at 11:47 PM Sean Owen  wrote:
>
>> For users that would inherit Jackson and use it directly, or whose
>> dependencies do. Spark itself (with modifications) should be OK with
>> the change.
>> It's risky and normally wouldn't backport, except that I've heard a
>> few times about concerns about CVEs affecting Databind, so wondering
>> who else out there might have an opinion. I'm not pushing for it
>> necessarily.
>>
>> On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin  wrote:
>> >
>> > For Jackson - are you worrying about JSON parsing for users or internal
>> Spark functionality breaking?
>> >
>> > On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:
>> >>
>> >> There's only one other item on my radar, which is considering updating
>> >> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
>> >> a few times now that there are a number of CVEs open for 2.6.7. Cons:
>> >> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
>> >> behavior non-trivially. That said back-porting the update PR to 2.4
>> >> worked out OK locally. Any strong opinions on this one?
>> >>
>> >> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan 
>> wrote:
>> >> >
>> >> > I volunteer to be the release manager for 2.4.2, as I was also going
>> to propose 2.4.2 because of the reverting of SPARK-25250. Is there any
>> other ongoing bug fixes we want to include in 2.4.2? If no I'd like to
>> start the release process today (CST).
>> >> >
>> >> > Thanks,
>> >> > Wenchen
>> >> >
>> >> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen  wrote:
>> >> >>
>> >> >> I think the 'only backport bug fixes to branches' principle remains
>> sound. But what's a bug fix? Something that changes behavior to match what
>> is explicitly supposed to happen, or implicitly supposed to happen --
>> implied by what other similar things do, by reasonable user expectations,
>> or simply how it worked previously.
>> >> >>
>> >> >> Is this a bug fix? I guess the criteria that matches is that
>> behavior doesn't match reasonable user expectations? I don't know enough to
>> have a strong opinion. I also don't think there is currently an objection
>> to backporting it, whatever it's called.
>> >> >>
>> >> >>
>> >> >> Is the question whether this needs a new release? There's no harm
>> in another point release, other than needing a volunteer release manager.
>> One could say, wait a bit longer to see what more info comes in about
>> 2.4.1. But given that 2.4.1 took like 2 months, it's reasonable to move
>> towards a release cycle again. I don't see objection to that either (?)
>> >> >>
>> >> >>
>> >> >> The meta question remains: is a 'bug fix' definition even agreed,
>> and being consistently applied? There aren't correct answers, only best
>> guesses from each person's own experience, judgment and priorities. These
>> can differ even when applied in good faith.
>> >> >>
>> >> >> Sometimes the variance of opinion comes because people have
>> different info that needs to be surfaced. Here, maybe it's best to share
>> what about that offline conversation was convincing, for example.
>> >> >>
>> >> >> I'd say it's also important to separate what one would prefer from
>> what one can't live with(out). Assuming one trusts the intent and
>> experience of the handful of others with an opinion, I'd defer to someone
>> who wants X and will own it, even if I'm moderately against it. Otherwise
>> we'd get little done.
>> >> >>
>> >> >> In that li

Re: [VOTE] Release Apache Spark 2.4.3

2019-05-06 Thread Wenchen Fan
+1.

The Scala version problem has been resolved, which is the main motivation
of 2.4.3.

On Mon, May 6, 2019 at 12:38 AM Felix Cheung 
wrote:

> I ran basic tests on R, r-hub etc. LGTM.
>
> +1 (limited - I didn’t get to run other usual tests)
>
> --
> *From:* Sean Owen 
> *Sent:* Wednesday, May 1, 2019 2:21 PM
> *To:* Xiao Li
> *Cc:* dev@spark.apache.org
> *Subject:* Re: [VOTE] Release Apache Spark 2.4.3
>
> +1 from me. There is little change from 2.4.2 anyway, except for the
> important change to the build script that should build pyspark with
> Scala 2.11 jars. I verified that the package contains the _2.11 Spark
> jars, but have a look!
>
> I'm still getting this weird error from the Kafka module when testing,
> but it's a long-standing weird known issue:
>
> [error]
> /home/ubuntu/spark-2.4.3/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala:85:
> Symbol 'term org.eclipse' is missing from the classpath.
> [error] This symbol is required by 'method
> org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
> [error] Make sure that term eclipse is in your classpath and check for
> conflicting dependencies with `-Ylog-classpath`.
> [error] A full rebuild may help if 'MetricsSystem.class' was compiled
> against an incompatible version of org.
> [error] testUtils.sendMessages(topic, data.toArray)
>
> Killing zinc and rebuilding didn't help.
> But this isn't happening in Jenkins for example, so it should be
> env-specific.
>
> On Wed, May 1, 2019 at 9:39 AM Xiao Li  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.3.
> >
> > The vote is open until May 5th PST and passes if a majority +1 PMC votes
> are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.3
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.3-rc1 (commit
> c3e32bf06c35ba2580d46150923abfa795b4446a):
> > https://github.com/apache/spark/tree/v2.4.3-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1324/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-docs/
> >
> > The list of bug fixes going into 2.4.2 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12345410
> >
> > The release is using the release script of the branch 2.4.3-rc1 with the
> following commit
> https://github.com/apache/spark/commit/e417168ed012190db66a21e626b2b8d2332d6c01
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.3?
> > ===
> >
> > The current list of open tickets targeted at 2.4.3 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.3
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.4.2

2019-04-21 Thread Wenchen Fan
Yea these should be mentioned in the 2.4.1 release notes.

It seems we only have one ticket that is labeled as "release-notes" for
2.4.2: https://issues.apache.org/jira/browse/SPARK-27419 . I'll mention it
when I write release notes.

On Mon, Apr 22, 2019 at 5:46 AM Sean Owen  wrote:

> One minor comment: for 2.4.1 we had a couple JIRAs marked 'release-notes':
>
> https://issues.apache.org/jira/browse/SPARK-27198?jql=project%20%3D%20SPARK%20and%20fixVersion%20%20in%20(2.4.1%2C%202.4.2)%20and%20labels%20%3D%20%27release-notes%27
>
> They should be mentioned in
> https://spark.apache.org/releases/spark-release-2-4-1.html possibly
> like "Changes of behavior" in
> https://spark.apache.org/releases/spark-release-2-4-0.html
>
> I can retroactively update that page; is this part of the notes for
> the release process though? I missed this one for sure as it's easy to
> overlook with all the pages being updated per release.
>
> On Thu, Apr 18, 2019 at 9:51 PM Wenchen Fan  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.2.
> >
> > The vote is open until April 23 PST and passes if a majority +1 PMC
> votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.2
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.2-rc1 (commit
> a44880ba74caab7a987128cb09c4bee41617770a):
> > https://github.com/apache/spark/tree/v2.4.2-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1322/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-docs/
> >
> > The list of bug fixes going into 2.4.1 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12344996
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.2?
> > ===
> >
> > The current list of open tickets targeted at 2.4.2 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.2
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>


Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
Did you re-create your df when you update the timezone conf?

On Wed, Apr 24, 2019 at 9:18 PM Shubham Chaurasia 
wrote:

> Writing:
> scala> df.write.orc("")
>
> For looking into contents, I used orc-tools-X.Y.Z-uber.jar (
> https://orc.apache.org/docs/java-tools.html)
>
> On Wed, Apr 24, 2019 at 6:24 PM Wenchen Fan  wrote:
>
>> How did you read/write the timestamp value from/to ORC file?
>>
>> On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia <
>> shubh.chaura...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Consider the following(spark v2.4.0):
>>>
>>> Basically I change values of `spark.sql.session.timeZone` and perform an
>>> orc write. Here are 3 samples:-
>>>
>>> 1)
>>> scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
>>>
>>> scala> val df = sc.parallelize(Seq("2019-04-23
>>> 09:15:04.0")).toDF("ts").withColumn("ts", col("ts").cast("timestamp"))
>>> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
>>>
>>> df.show() Output  ORC File Contents
>>> -
>>> 2019-04-23 09:15:04   {"ts":"2019-04-23 09:15:04.0"}
>>>
>>> 2)
>>> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
>>>
>>> df.show() Output  ORC File Contents
>>> -
>>> 2019-04-23 03:45:04   {"ts":"2019-04-23 09:15:04.0"}
>>>
>>> 3)
>>> scala> spark.conf.set("spark.sql.session.timeZone",
>>> "America/Los_Angeles")
>>>
>>> df.show() Output  ORC File Contents
>>> -
>>> 2019-04-22 20:45:04   {"ts":"2019-04-23 09:15:04.0"}
>>>
>>> It can be seen that in all the three cases it stores {"ts":"2019-04-23
>>> 09:15:04.0"} in orc file. I understand that orc file also contains writer
>>> timezone with respect to which spark is able to convert back to actual time
>>> when it reads orc.(and that is equal to df.show())
>>>
>>> But it's problematic in the sense that it is not adjusting(plus/minus)
>>> timezone (spark.sql.session.timeZone) offsets for {"ts":"2019-04-23
>>> 09:15:04.0"} in ORC file. I mean loading data to any system other than
>>> spark would be a problem.
>>>
>>> Any ideas/suggestions on that?
>>>
>>> PS: For csv files, it stores exactly what we see as the output of
>>> df.show()
>>>
>>> Thanks,
>>> Shubham
>>>
>>>


Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
How did you read/write the timestamp value from/to ORC file?

On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia 
wrote:

> Hi All,
>
> Consider the following(spark v2.4.0):
>
> Basically I change values of `spark.sql.session.timeZone` and perform an
> orc write. Here are 3 samples:-
>
> 1)
> scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
>
> scala> val df = sc.parallelize(Seq("2019-04-23
> 09:15:04.0")).toDF("ts").withColumn("ts", col("ts").cast("timestamp"))
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
>
> df.show() Output  ORC File Contents
> -
> 2019-04-23 09:15:04   {"ts":"2019-04-23 09:15:04.0"}
>
> 2)
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
>
> df.show() Output  ORC File Contents
> -
> 2019-04-23 03:45:04   {"ts":"2019-04-23 09:15:04.0"}
>
> 3)
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>
> df.show() Output  ORC File Contents
> -
> 2019-04-22 20:45:04   {"ts":"2019-04-23 09:15:04.0"}
>
> It can be seen that in all the three cases it stores {"ts":"2019-04-23
> 09:15:04.0"} in orc file. I understand that orc file also contains writer
> timezone with respect to which spark is able to convert back to actual time
> when it reads orc.(and that is equal to df.show())
>
> But it's problematic in the sense that it is not adjusting(plus/minus)
> timezone (spark.sql.session.timeZone) offsets for {"ts":"2019-04-23
> 09:15:04.0"} in ORC file. I mean loading data to any system other than
> spark would be a problem.
>
> Any ideas/suggestions on that?
>
> PS: For csv files, it stores exactly what we see as the output of df.show()
>
> Thanks,
> Shubham
>
>


Re: Disabling `Merge Commits` from GitHub Merge Button

2019-07-02 Thread Wenchen Fan
+1 as well

On Tue, Jul 2, 2019 at 12:13 PM Dongjoon Hyun 
wrote:

> Thank you so much for the replies, Reynold, Sean, Takeshi, Hyukjin!
>
> Bests,
> Dongjoon.
>
> On Mon, Jul 1, 2019 at 6:00 PM Hyukjin Kwon  wrote:
>
>> +1
>>
>> On Tue, Jul 2, 2019 at 9:39 AM Takeshi Yamamuro wrote:
>>
>>> I'm also using the script in both cases, anyway +1.
>>>
>>> On Tue, Jul 2, 2019 at 5:58 AM Sean Owen  wrote:
>>>
 I'm using the merge script in both repos. I think that was the best
 practice?
 So, sure, I'm fine with disabling it.

 On Mon, Jul 1, 2019 at 3:53 PM Dongjoon Hyun 
 wrote:
 >
 > Hi, Apache Spark PMC members and committers.
 >
 > We are using GitHub `Merge Button` in `spark-website` repository
 > because it's very convenient.
 >
 > 1. https://github.com/apache/spark-website/commits/asf-site
 > 2. https://github.com/apache/spark/commits/master
 >
 > In order to be consistent with our previous behavior,
 > can we disable `Allow Merge Commits` from GitHub `Merge Button`
 setting explicitly?
 >
 > I hope we can enforce it in both `spark-website` and `spark`
 repository consistently.
 >
 > Bests,
 > Dongjoon.

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>


Re: API for SparkContext ?

2019-06-30 Thread Wenchen Fan
You can call `SparkContext#addSparkListener` with a listener that
implements `onApplicationEnd`.
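
A minimal sketch of that suggestion (assuming an existing SparkContext named sc; the callback body is just a placeholder):

import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

sc.addSparkListener(new SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    // custom code to run as the application shuts down,
    // e.g. flushing metrics or closing external clients
    println(s"Spark application ended at ${applicationEnd.time}")
  }
})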

On Tue, May 14, 2019 at 1:51 AM Nasrulla Khan Haris
 wrote:

> HI All,
>
>
>
> Is there a API for sparkContext where we can add our custom code before
> stopping sparkcontext ?
>
> Appreciate your help.
>
> Thanks,
>
> Nasrulla
>
>
>


Re: displaying "Test build" in PR

2019-08-13 Thread Wenchen Fan
"Can one of the admins verify this patch?" is a corrected message, as
Jenkins won't test your PR until an admin approves it.

BTW I think "5 minutes" is a reasonable delay for PR testing. It usually
takes days to review and merge a PR, so I don't think seeing test progress
right after PR creation really matters.

On Tue, Aug 13, 2019 at 8:58 PM Younggyu Chun 
wrote:

> Thank you for your email.
>
> I think a newb like me might want to see what's going on with the PR and see
> something useful. For example, "Request builder polls every 5 minutes and
> you will see the progress here in a few minutes".  I guess we can add a
> more useful message to AmplabJenkins'
> comment instead of a simple message like "Can one of the admins verify
> this patch?"
>
> Younggyu
>
> On Mon, Aug 12, 2019 at 3:55 PM Shane Knapp  wrote:
>
>> when you create a PR, the jenkins pull request builder job polls every ~5
>> or so minutes and will trigger jobs based on creation/approval to test/code
>> updates/etc.
>>
>> On Mon, Aug 12, 2019 at 11:25 AM Younggyu Chun 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have a quick question about PRs. Once I create a PR I'm not able to see
>>> if "Test build" is being processed. But I can see this a few minutes
>>> or hours later. Is it possible to see whether "Test Build" is being processed
>>> right away after the PR is created?
>>>
>>> Thank you,
>>> Younggyu Chun
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: [SPARK-23207] Repro

2019-08-12 Thread Wenchen Fan
Hi Tyson,

Thanks for reporting it! I quickly checked the related scheduler code but
can't find an obvious place that can go wrong with cached RDD.

Sean said that he can't reproduce it, but the second job fails. This is
actually expected. We need a lot more changes to completely fix this
problem, so currently the fix is to fail the job if the scheduler needs to
retry an indeterminate shuffle map stage.

It would be great to know if we can reproduce this bug with the master
branch.

Thanks,
Wenchen

On Sun, Aug 11, 2019 at 7:22 AM Xiao Li  wrote:

> Hi, Tyson,
>
> Could you open a new JIRA with correctness label? SPARK-23207 might not
> cover all the scenarios, especially when you using cache.
>
> Cheers,
>
> Xiao
>
> On Fri, Aug 9, 2019 at 9:26 AM  wrote:
>
>> Hi Sean,
>>
>> To finish the job, I did need to set spark.stage.maxConsecutiveAttempts
>> to a large number e.g., 100; a suggestion from Jiang Xingbo.
>>
>> I haven't seen any recent movement/PRs on this issue, but I'll see if we
>> can repro with a more recent version of Spark.
>>
>> Best regards,
>> Tyson
>>
>> -Original Message-
>> From: Sean Owen 
>> Sent: Friday, August 9, 2019 7:49 AM
>> To: tcon...@gmail.com
>> Cc: dev 
>> Subject: Re: [SPARK-23207] Repro
>>
>> Interesting but I'd put this on the JIRA, and also test vs master first.
>> It's entirely possible this is something else that was subsequently fixed,
>> and maybe even backported for 2.4.4.
>> (I can't quite reproduce it - just makes the second job fail, which is
>> also puzzling)
>>
>> On Fri, Aug 9, 2019 at 8:11 AM  wrote:
>> >
>> > Hi,
>> >
>> >
>> >
>> > We are able to reproduce this bug in Spark 2.4 using the following
>> program:
>> >
>> >
>> >
>> > import scala.sys.process._
>> >
>> > import org.apache.spark.TaskContext
>> >
>> >
>> >
>> > val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000,
>> > x)}.repartition(20)
>> >
>> > res.distinct.count
>> >
>> >
>> >
>> > // kill an executor in the stage that performs repartition(239)
>> >
>> > val df = res.repartition(113).cache.repartition(239).map { x =>
>> >
>> >   if (TaskContext.get.attemptNumber == 0 &&
>> > TaskContext.get.partitionId < 1) {
>> >
>> > throw new Exception("pkill -f java".!!)
>> >
>> >   }
>> >
>> >   x
>> >
>> > }
>> >
>> > df.distinct.count()
>> >
>> >
>> >
>> > The first df.distinct.count correctly produces 1
>> >
>> > The second df.distinct.count incorrectly produces 9769
>> >
>> >
>> >
>> > If the cache step is removed then the bug does not reproduce.
>> >
>> >
>> >
>> > Best regards,
>> >
>> > Tyson
>> >
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> [image: Databricks Summit - Watch the talks]
> 
>


Re: Release Apache Spark 2.4.4

2019-08-13 Thread Wenchen Fan
+1

On Wed, Aug 14, 2019 at 12:52 PM Holden Karau  wrote:

> +1
> Does anyone have any critical fixes they’d like to see in 2.4.4?
>
> On Tue, Aug 13, 2019 at 5:22 PM Sean Owen  wrote:
>
>> Seems fine to me if there are enough valuable fixes to justify another
>> release. If there are any other important fixes imminent, it's fine to
>> wait for those.
>>
>>
>> On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > Spark 2.4.3 was released three months ago (8th May).
>> > As of today (13th August), there are 112 commits (75 JIRAs) in
>> `branch-24` since 2.4.3.
>> >
>> > It would be great if we can have Spark 2.4.4.
>> > Shall we start `2.4.4 RC1` next Monday (19th August)?
>> >
>> > Last time, there was a request for K8s issue and now I'm waiting for
>> SPARK-27900.
>> > Please let me know if there is another issue.
>> >
>> > Thanks,
>> > Dongjoon.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-30 Thread Wenchen Fan
We can add a config for a certain behavior if it makes sense, but the most
important thing we want to reach agreement on here is: what should be the
default behavior?

Let's explore the solution space of table insertion behavior first:
At compile time,
1. always add cast
2. add cast following the ANSI SQL store assignment rule (e.g. string to
int is forbidden but long to int is allowed)
3. only add cast if it's 100% safe
At runtime,
1. return null for invalid operations
2. throw exceptions at runtime for invalid operations

The standards to evaluate a solution:
1. How robust the query execution is. For example, users usually don't want
to see the query fail midway.
2. How tolerant it is of user queries. For example, a user may want to write
long values to an int column because they know all the long values won't exceed
the int range.
3. How clean the result is. For example, users usually don't want to see
silently corrupted data (null values).

The current Spark behavior for Data Source V1 tables: always add cast and
return null for invalid operations. This maximizes standard 1 and 2, but
the result is least clean and users are very likely to see silently
corrupted data (null values).

The current Spark behavior for Data Source V2 tables (new in Spark 3.0):
only add cast if it's 100% safe. This maximizes standard 1 and 3, but many
queries may fail to compile, even if these queries can run on other SQL
systems. Note that people can still see silently corrupted data because
cast is not the only operation that can return corrupted data. Simple operations
like ADD can also return corrupted data if overflow happens, e.g. INSERT
INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2
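
To make these failure modes concrete, here is a small spark-shell illustration of the legacy behavior under discussion (a sketch of behavior commonly observed in Spark 2.x without ANSI mode; an illustration, not a specification):

spark.sql("SELECT CAST('abc' AS INT) AS v").show()
// shows a single row with v = null: the invalid cast silently becomes null

spark.sql("SELECT 2147483647 + 1 AS v").show()
// shows v = -2147483648: the ADD silently overflows and wraps around, which is
// why compile-time cast rules alone cannot make the result 100% clean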

The proposal here: add cast following ANSI SQL store assignment rule, and
return null for invalid operations. This maximizes standard 1, and also
fits standard 2 well: if a query can't compile in Spark, it usually can't
compile in other mainstream databases as well. I think that's tolerant
enough. For standard 3, this proposal doesn't maximize it but can avoid
many invalid operations already.

Technically we can't make the result 100% clean at compile-time, we have to
handle things like overflow at runtime. I think the new proposal makes more
sense as the default behavior.


On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer 
wrote:

> I understand spark is making the decisions; i'm saying the actual final
> effect of the null decision would be different depending on the insertion
> target if the target has different behaviors for null.
>
> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan  wrote:
>
>> > I'm a big -1 on null values for invalid casts.
>>
>> This is why we want to introduce the ANSI mode, so that invalid cast
>> fails at runtime. But we have to keep the null behavior for a while, to
>> keep backward compatibility. Spark returns null for invalid cast since the
>> first day of Spark SQL, we can't just change it without a way to restore to
>> the old behavior.
>>
>> I'm OK with adding a strict mode for the upcast behavior in table
>> insertion, but I don't agree with making it the default. The default
>> behavior should be either the ANSI SQL behavior or the legacy Spark
>> behavior.
>>
>> > other modes should be allowed only with strict warning the behavior
>> will be determined by the underlying sink.
>>
>> Seems there is some misunderstanding. The table insertion behavior is
>> fully controlled by Spark. Spark decides when to add casts and Spark decides
>> whether an invalid cast should return null or fail. The sink is only
>> responsible for writing data, not the type coercion/cast stuff.
>>
>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I'm a big -1 on null values for invalid casts. This can lead to a lot of
>>> even more unexpected errors and runtime behavior since null is
>>>
>>> 1. Not allowed in all schemas (Leading to a runtime error anyway)
>>> 2. Is the same as delete in some systems (leading to data loss)
>>>
>>> And this would be dependent on the sink being used. Spark won't just be
>>> interacting with ANSI compliant sinks so I think it makes much more sense
>>> to be strict. I think Upcast mode is a sensible default and other modes
>>> should be allowed only with strict warning the behavior will be determined
>>> by the underlying sink.
>>>
>>> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro 
>>> wrote:
>>>
>>>> Hi, all
>>>>
>>>> +1 for implementing this new store cast mode.
>>>> From a viewpoint of DBMS users, this cast is pretty common for INSERTs
>>>> and I think this functionality could
>>>> promote migrati

Re: [build system] colo maintenance & outage tomorrow, 10am-2pm PDT

2019-08-15 Thread Wenchen Fan
Thanks for tracking it Shane!

On Fri, Aug 16, 2019 at 7:41 AM Shane Knapp  wrote:

> just got an update:
>
> there was a problem w/the replacement part, and they're trying to fix it.
> if that's successful, the expect to have power restored within the hour.
>
> if that doesn't work, a new (new) replacement part is scheduled to arrive
> at 8am tomorrow.
>
> shane
>
> On Thu, Aug 15, 2019 at 2:07 PM Shane Knapp  wrote:
>
>> quick update:
>>
>> it's been 4 hours, the colo is still down, and i haven't gotten any news
>> yet as to when they're planning on getting power restored.
>>
>> once i hear something i will let everyone know what's up.
>>
>> On Wed, Aug 14, 2019 at 10:22 AM Shane Knapp  wrote:
>>
>>> the berkeley colo had a major power distribution breaker fail, and
>>> they've scheduled an emergency repair for tomorrow (thursday) @ 10am.
>>>
>>> they expect this to take ~4 hours.
>>>
>>> i will be shutting down the machines (again) ~9am, and bringing them all
>>> back up once i get the all-clear.
>>>
>>> sorry for the inconvenience...
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [VOTE] Release Apache Spark 2.4.4 (RC1)

2019-08-19 Thread Wenchen Fan
Unfortunately, I need to -1.

Recently we found that the repartition correctness bug can still be
reproduced. The root cause has been identified and there are 2 PRs to fix 2
related issues:
https://github.com/apache/spark/pull/25491
https://github.com/apache/spark/pull/25498

I think we should have this fix in 2.3 and 2.4.

Thanks,
Wenchen

On Tue, Aug 20, 2019 at 7:32 AM Dongjoon Hyun 
wrote:

> Thank you for testing, Sean and Herman.
>
> There are three reporting until now.
>
> 1. SPARK-28775 is for JDK 8u221+ testing at Apache Spark 3.0/2.4/2.3.
> 2. SPARK-28749 is for Scala 2.12 Python testing at Apache Spark 2.4 only.
> 3. SPARK-28699 is for disabling radix sort for ShuffleExchangeExec at
> Apache Spark 3.0/2.4/2.3.
>
> Both (1) and (2) are nice-to-have and test-only fixes. (3) could be a
> correctness issue, but it seems that there are some other approaches.
> I'm monitoring all reports. Let's see. For now, I'd like to continue 2.4.4
> RC1 voting for more testing.
>
> Bests,
> Dongjoon.
>
>
> On Mon, Aug 19, 2019 at 2:09 PM Herman van Hovell 
> wrote:
>
>> The error you are seeing is caused by
>> https://issues.apache.org/jira/browse/SPARK-28775.
>>
>>
>> On Mon, Aug 19, 2019 at 10:40 PM Sean Owen  wrote:
>>
>>> Things are looking pretty good so far, but a few notes:
>>>
>>> I thought we might need this PR to make the 2.12 build of 2.4.x not
>>> try to build Kafka 0.8 support, but, I'm not seeing that 2.4.x + 2.12
>>> builds or tests it?
>>> https://github.com/apache/spark/pull/25482
>>> I can merge this to 2.4 shortly anyway, but not clear it affects the RC.
>>>
>>>
>>> I'm getting one weird failure in tests:
>>>
>>> - daysToMillis and millisToDays *** FAILED ***
>>>   8634 did not equal 8633 Round trip of 8633 did not work in tz
>>>
>>> sun.util.calendar.ZoneInfo[id="Kwajalein",offset=4320,dstSavings=0,useDaylight=false,transitions=8,lastRule=null]
>>> (DateTimeUtilsSuite.scala:683)
>>>
>>> See
>>> https://github.com/apache/spark/pull/19234#pullrequestreview-64463435
>>> for some context and
>>>
>>> https://github.com/apache/spark/commit/c5b8d54c61780af6e9e157e6c855718df972efad
>>> for a fix for a similar type of issue.
>>>
>>> This may be quite specific to a particular version of Java 8, but I'm
>>> testing on the latest (1.8.0_222). We can 'patch' it by allowing for
>>> multiple correct answers here.
>>> It may not hold up the RC unless others see the failure, but I can
>>> work on that anyway.
>>>
>>> On Mon, Aug 19, 2019 at 11:55 AM Dongjoon Hyun 
>>> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.4.4.
>>> >
>>> > The vote is open until August 22nd 10AM PST and passes if a majority
>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 2.4.4
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v2.4.4-rc1 (commit
>>> 13f2465c6c8328e988f7215ee5f5d2c5e69e8d21):
>>> > https://github.com/apache/spark/tree/v2.4.4-rc1
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc1-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1326/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc1-docs/
>>> >
>>> > The list of bug fixes going into 2.4.4 can be found at the following
>>> URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12345466
>>> >
>>> > This release is using the release script of the tag v2.4.4-rc1.
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by taking
>>> > an existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and install
>>> > the current RC and see if anything important breaks, in the Java/Scala
>>> > you can add the staging repository to your projects resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with an out of date RC going forward).
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 2.4.4?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 2.4.4 can be found at:
>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 2.4.4
>>> >
>>> > Committers should look 

Re: Release Spark 2.3.4

2019-08-18 Thread Wenchen Fan
+1

On Sat, Aug 17, 2019 at 3:37 PM Hyukjin Kwon  wrote:

> +1 too
>
> On Sat, Aug 17, 2019 at 3:06 PM Dilip Biswal wrote:
>
>> +1
>>
>> Regards,
>> Dilip Biswal
>> Tel: 408-463-4980
>> dbis...@us.ibm.com
>>
>>
>>
>> - Original message -
>> From: John Zhuge 
>> To: Xiao Li 
>> Cc: Takeshi Yamamuro , Spark dev list <
>> dev@spark.apache.org>, Kazuaki Ishizaki 
>> Subject: [EXTERNAL] Re: Release Spark 2.3.4
>> Date: Fri, Aug 16, 2019 4:33 PM
>>
>> +1
>>
>> On Fri, Aug 16, 2019 at 4:25 PM Xiao Li  wrote:
>>
>> +1
>>
>> On Fri, Aug 16, 2019 at 4:11 PM Takeshi Yamamuro 
>> wrote:
>>
>> +1, too
>>
>> Bests,
>> Takeshi
>>
>> On Sat, Aug 17, 2019 at 7:25 AM Dongjoon Hyun 
>> wrote:
>>
>> +1 for 2.3.4 release as the last release for `branch-2.3` EOL.
>>
>> Also, +1 for next week release.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Aug 16, 2019 at 8:19 AM Sean Owen  wrote:
>>
>> I think it's fine to do these in parallel, yes. Go ahead if you are
>> willing.
>>
>> On Fri, Aug 16, 2019 at 9:48 AM Kazuaki Ishizaki 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > Spark 2.3.3 was released six months ago (15th February, 2019) at
>> http://spark.apache.org/news/spark-2-3-3-released.html. And, about 18
>> months have been passed after Spark 2.3.0 has been released (28th February,
>> 2018).
>> > As of today (16th August), there are 103 commits (69 JIRAs) in
>> `branch-23` since 2.3.3.
>> >
>> > It would be great if we can have Spark 2.3.4.
>> > If it is ok, shall we start `2.3.4 RC1` concurrent with 2.4.4 or after
>> 2.4.4 will be released?
>> >
>> > An issue list in JIRA:
>> https://issues.apache.org/jira/projects/SPARK/versions/12344844
>> > A commit list in github from the last release:
>> https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3
>> > The 8 correctness issues resolved in branch-2.3:
>> >
>> https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
>> >
>> > Best Regards,
>> > Kazuaki Ishizaki
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> 
>>
>>
>>
>> --
>> John Zhuge
>>
>>
>>
>> - To
>> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Apache Spark git repo moved to gitbox.apache.org

2019-08-26 Thread Wenchen Fan
yea I think we should, but no need to worry too much about it because
gitbox still works in the release scripts.

On Tue, Aug 27, 2019 at 3:23 AM Shane Knapp  wrote:

> revisiting this old thread...
>
> i noticed from the committers' page on the spark site that the 'apache'
> remote should be 'github.com', and not 'gitbox' as instructed here.
>
> so, i did a quick check of the spark repo and found we're still
> referencing gitbox in a few places:
> ➜  spark git:(fix-run-tests) grep -r gitbox *
> dev/create-release/release-tag.sh:ASF_SPARK_REPO="gitbox.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO="https://gitbox.apache.org/repos/asf/spark.git"
> dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://gitbox.apache.org/repos/asf?p=spark.git"
> pom.xml:scm:git:https://gitbox.apache.org/repos/asf/spark.git
>
> should we update these entries to point to github vs gitbox?
>
> On Mon, Dec 10, 2018 at 8:30 AM Sean Owen  wrote:
>
>> Per the thread last week, the Apache Spark repos have migrated from
>> https://git-wip-us.apache.org/repos/asf to
>> https://gitbox.apache.org/repos/asf
>>
>>
>> Non-committers:
>>
>> This just means repointing any references to the old repository to the
>> new one. It won't affect you if you were already referencing
>> https://github.com/apache/spark .
>>
>>
>> Committers:
>>
>> Follow the steps at https://reference.apache.org/committer/github to
>> fully sync your ASF and Github accounts, and then wait up to an hour
>> for it to finish.
>>
>> Then repoint your git-wip-us remotes to gitbox in your git checkouts.
>> For our standard setup that works with the merge script, that should
>> be your 'apache' remote. For example here are my current remotes:
>>
>> $ git remote -v
>> apache https://gitbox.apache.org/repos/asf/spark.git (fetch)
>> apache https://gitbox.apache.org/repos/asf/spark.git (push)
>> apache-github git://github.com/apache/spark (fetch)
>> apache-github git://github.com/apache/spark (push)
>> origin https://github.com/srowen/spark (fetch)
>> origin https://github.com/srowen/spark (push)
>> upstream https://github.com/apache/spark (fetch)
>> upstream https://github.com/apache/spark (push)
>>
>> In theory we also have read/write access to github.com now too, but
>> right now it hadn't yet worked for me. It may need to sync. This note
>> just makes sure anyone knows how to keep pushing commits right now to
>> the new ASF repo.
>>
>> Report any problems here!
>>
>> Sean
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [VOTE] Release Apache Spark 2.3.4 (RC1)

2019-08-27 Thread Wenchen Fan
+1

On Wed, Aug 28, 2019 at 2:43 AM DB Tsai  wrote:

> +1
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
> On Tue, Aug 27, 2019 at 11:31 AM Dongjoon Hyun 
> wrote:
> >
> > +1.
> >
> > I also verified SHA/GPG and tested UTs on AdoptOpenJDK 8u222/CentOS 6.9
> with profile
> > "-Pyarn -Phadoop-2.7 -Pkubernetes -Pkinesis-asl -Phive
> -Phive-thriftserver"
> >
> > Additionally, the JDBC integration tests were also run.
> >
> > Thank you, Kazuaki!
> >
> > Bests,
> > Dongjoon.
> >
> >
> > On Tue, Aug 27, 2019 at 11:20 AM Sean Owen  wrote:
> >>
> >> +1 - license and signature looks OK, the docs look OK, the artifacts
> >> seem to be in order. Tests passed for me when building from source
> >> with most common profiles set.
> >>
> >> On Mon, Aug 26, 2019 at 3:28 PM Kazuaki Ishizaki 
> wrote:
> >> >
> >> > Please vote on releasing the following candidate as Apache Spark
> version 2.3.4.
> >> >
> >> > The vote is open until August 29th 2PM PST and passes if a majority of
> +1 PMC votes are cast, with
> >> > a minimum of 3 +1 votes.
> >> >
> >> > [ ] +1 Release this package as Apache Spark 2.3.4
> >> > [ ] -1 Do not release this package because ...
> >> >
> >> > To learn more about Apache Spark, please see
> https://spark.apache.org/
> >> >
> >> > The tag to be voted on is v2.3.4-rc1 (commit
> 8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210):
> >> > https://github.com/apache/spark/tree/v2.3.4-rc1
> >> >
> >> > The release files, including signatures, digests, etc. can be found
> at:
> >> > https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-bin/
> >> >
> >> > Signatures used for Spark RCs can be found in this file:
> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >
> >> > The staging repository for this release can be found at:
> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1331/
> >> >
> >> > The documentation corresponding to this release can be found at:
> >> > https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-docs/
> >> >
> >> > The list of bug fixes going into 2.3.4 can be found at the following
> URL:
> >> > https://issues.apache.org/jira/projects/SPARK/versions/12344844
> >> >
> >> > FAQ
> >> >
> >> > =
> >> > How can I help test this release?
> >> > =
> >> >
> >> > If you are a Spark user, you can help us test this release by taking
> >> > an existing Spark workload and running on this release candidate, then
> >> > reporting any regressions.
> >> >
> >> > If you're working in PySpark you can set up a virtual env and install
> >> > the current RC and see if anything important breaks, in the Java/Scala
> >> > you can add the staging repository to your projects resolvers and test
> >> > with the RC (make sure to clean up the artifact cache before/after so
> >> > you don't end up building with an out-of-date RC going forward).
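> >> >
> >> > For example, a minimal build.sbt fragment (a sketch; the staged artifacts
> >> > are assumed to use the plain release version string) could look like:
> >> >
> >> > ```
> >> > resolvers += "Apache Spark 2.3.4 RC1 staging" at
> >> >   "https://repository.apache.org/content/repositories/orgapachespark-1331/"
> >> > libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.4"
> >> > ```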
> >> >
> >> > ===
> >> > What should happen to JIRA tickets still targeting 2.3.4?
> >> > ===
> >> >
> >> > The current list of open tickets targeted at 2.3.4 can be found at:
> >> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.3.4
> >> >
> >> > Committers should look at those and triage. Extremely important bug
> >> > fixes, documentation, and API tweaks that impact compatibility should
> >> > be worked on immediately. Everything else please retarget to an
> >> > appropriate release.
> >> >
> >> > ==
> >> > But my bug isn't fixed?
> >> > ==
> >> >
> >> > In order to make timely releases, we will typically not hold the
> >> > release unless the bug in question is a regression from the previous
> >> > release. That being said, if there is something which is a regression
> >> > that has not been correctly targeted please ping me or a committer to
> >> > help target the issue.
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.4.4 (RC3)

2019-08-28 Thread Wenchen Fan
+1, no more blocking issues that I'm aware of.

On Wed, Aug 28, 2019 at 8:33 PM Sean Owen  wrote:

> +1 from me again.
>
> On Tue, Aug 27, 2019 at 6:06 PM Dongjoon Hyun 
> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.4.
> >
> > The vote is open until August 30th 5PM PST and passes if a majority of +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.4
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.4-rc3 (commit
> 7955b3962ac46b89564e0613db7bea98a1478bf2):
> > https://github.com/apache/spark/tree/v2.4.4-rc3
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1332/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-docs/
> >
> > The list of bug fixes going into 2.4.4 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12345466
> >
> > This release is using the release script of the tag v2.4.4-rc3.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.4?
> > ===
> >
> > The current list of open tickets targeted at 2.4.4 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.4
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: JDK11 Support in Apache Spark

2019-08-25 Thread Wenchen Fan
Great work!

On Sun, Aug 25, 2019 at 6:03 AM Xiao Li  wrote:

> Thank you for your contributions! This is a great feature for Spark
> 3.0! We finally achieve it!
>
> Xiao
>
> On Sat, Aug 24, 2019 at 12:18 PM Felix Cheung 
> wrote:
>
>> That’s great!
>>
>> --
>> *From:* ☼ R Nair 
>> *Sent:* Saturday, August 24, 2019 10:57:31 AM
>> *To:* Dongjoon Hyun 
>> *Cc:* dev@spark.apache.org ; user @spark/'user
>> @spark'/spark users/user@spark 
>> *Subject:* Re: JDK11 Support in Apache Spark
>>
>> Finally!!! Congrats
>>
>> On Sat, Aug 24, 2019, 11:11 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Thanks to your many many contributions,
>>> Apache Spark master branch starts to pass on JDK11 as of today.
>>> (with `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6)
>>>
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/
>>> (JDK11 is used for building and testing.)
>>>
>>> We already verified all UTs (including PySpark/SparkR) before.
>>>
>>> Please feel free to use JDK11 in order to build/test/run `master` branch
>>> and
>>> share your experience including any issues. It will help Apache Spark
>>> 3.0.0 release.
>>>
>>> For the follow-ups, please follow
>>> https://issues.apache.org/jira/browse/SPARK-24417 .
>>> The next step is `how to support JDK8/JDK11 together in a single
>>> artifact`.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>
>
>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-01 Thread Wenchen Fan
Great! Thanks!

On Mon, Sep 2, 2019 at 5:55 AM Dongjoon Hyun 
wrote:

> We are happy to announce the availability of Spark 2.4.4!
>
> Spark 2.4.4 is a maintenance release containing stability fixes. This
> release is based on the branch-2.4 maintenance branch of Spark. We strongly
> recommend all 2.4 users to upgrade to this stable release.
>
> To download Spark 2.4.4, head over to the download page:
> http://spark.apache.org/downloads.html
>
> Note that you might need to clear your browser cache or
> use `Private`/`Incognito` mode, depending on your browser.
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-4-4.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Dongjoon Hyun
>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-12 Thread Wenchen Fan
I think it's too risky to enable the "runtime exception" mode by default in
the next release. We don't even have a spec to describe when Spark would
throw runtime exceptions. Currently the "runtime exception" mode works for
overflow, but I believe there are more places that need to be considered (e.g.
divide by zero).

However, Ryan has a good point that if we use the ANSI store assignment
policy, we should make sure the table insertion behavior completely follows
the SQL spec. After reading the related section in the SQL spec, the rule
is to throw a runtime exception for values out of range, which is the overflow
check we already have in Spark. I think we should enable the overflow
check during table insertion when the ANSI policy is picked. This should be
done no matter which policy becomes the default eventually.
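
For illustration, a rough sketch of the behavior being discussed (not current
Spark semantics; the table name and literal are made up):

```
// Under the ANSI store assignment policy the analyzer still inserts a cast
// from BIGINT to INT here; the runtime overflow check then decides what
// happens to the out-of-range value.
spark.sql("CREATE TABLE target (i INT) USING parquet")
spark.sql("INSERT INTO target SELECT 2147483648L")
// without the overflow check: the cast silently produces a wrapped value
// with the overflow check: the write fails with a "value out of range" error
```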

On Mon, Sep 9, 2019 at 8:00 AM Felix Cheung 
wrote:

> I’d prefer strict mode and fail fast (analysis check)
>
> Also I like what Alastair suggested about standard clarification.
>
> I think we can re-visit this proposal and restart the vote
>
> --
> *From:* Ryan Blue 
> *Sent:* Friday, September 6, 2019 5:28 PM
> *To:* Alastair Green
> *Cc:* Reynold Xin; Wenchen Fan; Spark dev list; Gengliang Wang
> *Subject:* Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in
> table insertion by default
>
>
> We discussed this thread quite a bit in the DSv2 sync up and Russell
> brought up a really good point about this.
>
> The ANSI rule used here specifies how to store a specific value, V, so
> this is a runtime rule — an earlier case covers when V is NULL, so it is
> definitely referring to a specific value. The rule requires that if the
> type doesn’t match or if the value cannot be truncated, an exception is
> thrown for “numeric value out of range”.
>
> That runtime error guarantees that even though the cast is introduced at
> analysis time, unexpected NULL values aren’t inserted into a table in place
> of data values that are out of range. Unexpected NULL values are the
> problem that was concerning to many of us in the discussion thread, but it
> turns out that real ANSI behavior doesn’t have the problem. (In the sync,
> we validated this by checking Postgres and MySQL behavior, too.)
>
> In Spark, the runtime check is a separate configuration property from this
> one, but in order to actually implement ANSI semantics, both need to be
> set. So I think it makes sense to *change both defaults to be ANSI*. The
> analysis check alone does not implement the ANSI standard.
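>
> As a sketch, with the policy option named in the vote (the runtime flag was
> still being consolidated at this time, so it is only described in a comment):
>
> ```
> // analysis-time check: the store assignment policy being voted on
> spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
> // the runtime overflow/invalid-value check is a separate flag and must be
> // enabled as well to get full ANSI semantics
> ```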
>
> In the sync, we also agreed that it makes sense to be able to turn off the
> runtime check in order to avoid job failures. Another, safer way to avoid
> job failures is to require an explicit cast, i.e., strict mode.
>
> I think that we should amend this proposal to change the default for both
> the runtime check and the analysis check to ANSI.
>
> As this stands now, I vote -1. But I would support this if the vote were
> to set both runtime and analysis checks to ANSI mode.
>
> rb
>
> On Fri, Sep 6, 2019 at 3:12 AM Alastair Green
>  wrote:
>
>> Makes sense.
>>
>> While the ISO SQL standard automatically becomes an American national
>>  (ANSI) standard, changes are only made to the International (ISO/IEC)
>> Standard, which is the authoritative specification.
>>
>> These rules are specified in SQL/Foundation (ISO/IEC SQL Part 2), section
>> 9.2.
>>
>> Could we rename the proposed default to “ISO/IEC (ANSI)”?
>>
>> — Alastair
>>
>> On Thu, Sep 5, 2019 at 17:17, Reynold Xin  wrote:
>>
>> Having three modes is a lot. Why not just use ansi mode as default, and
>> legacy for backward compatibility? Then over time there's only the ANSI
>> mode, which is standard compliant and easy to understand. We also don't
>> need to invent a standard just for Spark.
>>
>>
>> On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan 
>> wrote:
>>
>>> +1
>>>
>>> To be honest I don't like the legacy policy. It's too loose and easy for
>>> users to make mistakes, especially when Spark returns null if a function
>>> hits errors like overflow.
>>>
>>> The strict policy is not good either. It's too strict and stops valid
>>> use cases like writing timestamp values to a date type column. Users do
>>> expect truncation to happen without adding cast manually in this case. It's
>>> also weird to use a spark specific policy that no other database is using.
>>>
>>> The ANSI policy is better. It stops invalid use cases like writing
>>> string values to an int type column, while keeping valid use cases like
>>>

Re: DSv2 sync - 4 September 2019

2019-09-09 Thread Wenchen Fan
Hi Nicholas,

You are talking about a different thing. The PERMISSIVE mode is the failure
mode for reading text-based data sources (JSON, CSV, etc.). It's not the
general failure mode for Spark table insertion.

I agree with you that the PERMISSIVE mode is hard to use. Feel free to open
a JIRA ticket if you have some better ideas.

Thanks,
Wenchen

On Mon, Sep 9, 2019 at 12:46 AM Nicholas Chammas 
wrote:

> A quick question about failure modes, as a casual observer of the DSv2
> effort:
>
> I was considering filing a JIRA ticket about enhancing the DataFrameReader
> to include the failure *reason* in addition to the corrupt record when
> the mode is PERMISSIVE. So if you are loading a CSV, for example, and a
> value cannot be automatically cast to the type you specify in the schema,
> you'll get the corrupt record in the column configured by
> columnNameOfCorruptRecord, but you'll also get some detail about what
> exactly made the record corrupt, perhaps in a new column specified by
> something like columnNameOfCorruptReason.
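>
> For reference, a minimal sketch of what is available today (the proposed
> columnNameOfCorruptReason column does not exist yet; the path and schema are
> made up):
>
> ```
> import org.apache.spark.sql.types._
>
> val schema = new StructType()
>   .add("id", IntegerType)
>   .add("price", DoubleType)
>   .add("_corrupt_record", StringType)  // receives the raw malformed line
>
> val df = spark.read
>   .schema(schema)
>   .option("mode", "PERMISSIVE")
>   .option("columnNameOfCorruptRecord", "_corrupt_record")
>   .csv("/path/to/data.csv")
> ```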
>
> Is this an enhancement that would be possible in DSv2?
>
> On Fri, Sep 6, 2019 at 6:28 PM Ryan Blue 
> wrote:
>
>> Here are my notes from the latest sync. Feel free to reply with
>> clarifications if I’ve missed anything.
>>
>> *Attendees*:
>>
>> Ryan Blue
>> John Zhuge
>> Russell Spitzer
>> Matt Cheah
>> Gengliang Wang
>> Priyanka Gomatam
>> Holden Karau
>>
>> *Topics*:
>>
>>- DataFrameWriterV2 insert vs append (recap)
>>- ANSI and strict modes for inserting casts
>>- Separating identifier resolution from table lookup
>>- Open PRs
>>   - SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
>>   - DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
>>   - TableProvider API update -
>>   https://github.com/apache/spark/pull/25651
>>   - UPDATE - https://github.com/apache/spark/pull/25626
>>
>> *Discussion*:
>>
>>- DataFrameWriterV2 insert vs append discussion recapped the
>>agreement from last sync
>>- ANSI and strict modes for inserting casts:
>>   - Russell: Failure modes are important. ANSI behavior is to fail
>>   at runtime, not analysis time. If a cast is allowed, but doesn’t throw 
>> an
>>   exception at runtime then this can’t be considered ANSI behavior.
>>   - Gengliang: ANSI adds the cast
>>   - Matt: Sounds like there are two conflicting views of the world.
>>   Is the default ANSI behavior to insert a cast that may produce NULL or 
>> to
>>   fail at runtime?
>>   - Ryan: So analysis and runtime behaviors can’t be separate?
>>   - Matt: Analysis behavior is influenced by behavior at runtime.
>>   Maybe the vote should cover both?
>>   - Russell: (linked to the standard) There are 3 steps: if numeric
>>   and same type, use the data value. If the value can be rounded or
>>   truncated, round or truncate. Otherwise, throw an exception that a 
>> value
>>   can’t be cast. These are runtime requirements.
>>   - Ryan: Another consideration is that we can make Spark more
>>   permissive, but can’t make Spark more strict in future releases.
>>   - Matt: v1 silently corrupts data
>>   - Russell: ANSI is fine, as long as the runtime matches (is ANSI).
>>   Don’t tell people it’s ANSI and not do ANSI completely.
>>   - Gengliang: people are concerned about long-running jobs failing
>>   at the end
>>   - Ryan: That’s okay because they can change the defaults: use
>>   strict analysis-time validation, or allow casts to produce NULL values.
>>   - Matt: As long as this is well documented, it should be fine
>>   - Ryan: Can we run tests to find out what exactly the behavior is?
>>   - Gengliang: sqlfiddle.com
>>   - Russell ran tests in MySQL and Postgres. Both threw runtime
>>   failures.
>>   - Matt: Let’s move on, but add the runtime behavior to the VOTE
>>- Identifier resolution and table lookup
>>   - Ryan: recent changes merged identifier resolution and table
>>   lookup together because identifiers owned by the session catalog need 
>> to be
>>   loaded to find out whether to use v1 or v2 plans. I think this should 
>> be
>>   separated so that identifier resolution happens independently to ensure
>>   that the two separate tasks don’t end up getting done at the same time 
>> and
>>   over-complicating the analyzer.
>>- SHOW NAMESPACES - Ready for final review
>>- DataFrameWriterV2:
>>   - Ryan: Tests failed after passing on the PR. Anyone know why that
>>   would happen?
>>   - Gengliang: tests failed in maven
>>   - Holden: PR validation runs SBT tests
>>- TableProvider API update: skipped because Wenchen didn’t make it
>>- UPDATE support PR
>>   - Ryan: There is a PR to add a SQL UPDATE command, but it
>>   delegates entirely to the data source, which seems strange.
>>   - Matt: What is Spark’s purpose here? Why would 

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Wenchen Fan
Congratulations!

On Tue, Sep 10, 2019 at 10:19 AM Yuanjian Li  wrote:

> Congratulations!
>
> sujith chacko  wrote on Tue, Sep 10, 2019 at 10:15 AM:
>
>> Congratulations all.
>>
>> On Tue, 10 Sep 2019 at 7:27 AM, Haibo  wrote:
>>
>>> congratulations~
>>>
>>>
>>>
>>> On Sep 10, 2019 at 09:30, Joseph Torres
>>>  wrote:
>>>
>>> congratulations!
>>>
>>> On Mon, Sep 9, 2019 at 6:27 PM 王 斐  wrote:
>>>
 congratulations!

 Get Outlook for iOS 

 --
 *From:* Ye Xianjin 
 *Sent:* Tuesday, September 10, 2019 09:26
 *To:* Jeff Zhang
 *Cc:* Saisai Shao; dev
 *Subject:* Re: Welcoming some new committers and PMC members

 Congratulations!

 Sent from my iPhone

 On Sep 10, 2019, at 9:19 AM, Jeff Zhang  wrote:

 Congratulations!

 Saisai Shao  wrote on Tue, Sep 10, 2019 at 9:16 AM:

> Congratulations!
>
> Jungtaek Lim  wrote on Mon, Sep 9, 2019 at 6:11 PM:
>
>> Congratulations! Well deserved!
>>
>> On Tue, Sep 10, 2019 at 9:51 AM John Zhuge  wrote:
>>
>>> Congratulations!
>>>
>>> On Mon, Sep 9, 2019 at 5:45 PM Shane Knapp 
>>> wrote:
>>>
 congrats everyone!  :)

 On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia <
 matei.zaha...@gmail.com> wrote:
 >
 > Hi all,
 >
 > The Spark PMC recently voted to add several new committers and
 one PMC member. Join me in welcoming them to their new roles!
 >
 > New PMC member: Dongjoon Hyun
 >
 > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang,
 Yuming Wang, Weichen Xu, Ruifeng Zheng
 >
 > The new committers cover lots of important areas including ML,
 SQL, and data sources, so it’s great to have them here. All the best,
 >
 > Matei and the Spark PMC
 >
 >
 >
 -
 > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >


 --
 Shane Knapp
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu


 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>> --
>>> John Zhuge
>>>
>>
>>
>> --
>> Name : Jungtaek Lim
>> Blog : http://medium.com/@heartsavior
>> Twitter : http://twitter.com/heartsavior
>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>
>

 --
 Best Regards

 Jeff Zhang




Re: [DISCUSS][SPIP][SPARK-29031] Materialized columns

2019-09-15 Thread Wenchen Fan
> 1. It is a waste of IO. The whole column (in Map format) has to be read
and Spark extracts the required keys from the map, even though the query
requires only one or a few keys in the map

This sounds like a similar use case to nested column pruning. We should
push down the map key extracting to the data source.

> 2. Vectorized reads cannot be exploited. Currently, vectorized reads can be
enabled only when all required columns are of atomic types. When a query
reads a subfield of a complex-type column, vectorized reads cannot be exploited.

I think this is just a current limitation. Technically we can read
complex types with the vectorized reader.

> 3. Filter pushdown cannot be utilized. Only when all required fields are
of atomic types can filter pushdown be enabled

This is not true anymore with the new nested column pruning feature. This
only works for parquet right now but we plan to support it in general data
sources.
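
For example, a minimal sketch of the struct case that already works for Parquet
(the config name is the one in the 2.4/3.0 line; the table layout is made up and
uses a struct column rather than a map, since the map-key case is the part that
still needs pushdown support):

```
// event here is a STRUCT column, e.g. struct<label: string, city: string>.
// With nested schema pruning enabled, only event.label is read from Parquet.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

val df = spark.read.parquet("/tables/x_struct")
df.select("event.label").where("event.label = 'newuser'").show()
```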

> 4. CPU is wasted because of duplicated computation. When JSON is
selected to store all keys, JSON parsing happens each time we query a subfield of
it. However, JSON parsing is a CPU-intensive operation, especially when the
JSON string is very long.

Similar to 1, this problem goes away if we can push down map key
extraction/array element extraction to the data source.


I agree with the problems you pointed out here, but I don't think
materialized columns are the right solution. I think we should improve the
data source API to allow data sources to fix these problems themselves.

Thanks,
Wenchen



On Tue, Sep 10, 2019 at 5:47 PM Jason Guo  wrote:

> Hi,
>
> I'd like to propose a feature named materialized columns. This feature will
> boost queries on complex-type columns.
>
> 
>
> https://docs.google.com/document/d/186bzUv4CRwoYY_KliNWTexkNCysQo3VUTLQVrVijyl4/edit?usp=sharing
>
> *Background*
> In data warehouse domain, there is a common requirement to add new fields
> to existing tables. In practice, data engineers usually use a complex type,
> such as Map (or they may use JSON), and put all subfields into it.
> However, this may impact query performance dramatically because
>
>    1. It is a waste of IO. The whole column (in Map format) has to be
>    read and Spark extracts the required keys from the map, even though the
>    query requires only one or a few keys in the map
>    2. Vectorized reads cannot be exploited. Currently, vectorized reads can
>    be enabled only when all required columns are of atomic types. When a query
>    reads a subfield of a complex-type column, vectorized reads cannot be exploited
>    3. Filter pushdown cannot be utilized. Only when all required fields
>    are of atomic types can filter pushdown be enabled
>    4. CPU is wasted because of duplicated computation. When JSON is
>    selected to store all keys, JSON parsing happens each time we query a subfield of
>    it. However, JSON parsing is a CPU-intensive operation, especially when the
>    JSON string is very long
>
>
> *Goal*
>
>- Add a new SQL grammar of Materialized column
>    - Implicitly rewrite SQL queries on complex-type columns if
>    there is a materialized column for them
>    - If the data type of the materialized column is atomic, even
>    though the original column is of a complex type, enable vectorized reads
>    and filter pushdown to improve performance
>
>
> *Usage*
> *#1 Add materialized columns to an existing table*
> Step 1: Create a normal table
>
>> CREATE TABLE x (
>> name STRING,
>> age INT,
>> params STRING,
>> event MAP
>> ) USING parquet;
>
>
> Step 2: Add materialized columns to an existing table
>
>> ALTER TABLE x ADD COLUMNS (
>> new_age INT *MATERIALIZED* age + 1,
>> city STRING *MATERIALIZED* get_json_object(params, '$.city'),
>> label STRING *MATERIALIZED* event['label']
>> );
>
>
> *#2 Create a new table with materialized columns*
>
>> CREATE TABLE x (
>> name STRING,
>> age INT,
>> params STRING,
>> event MAP,
>> new_age INT MATERIALIZED age + 1,
>> city STRING MATERIALIZED get_json_object(params, '$.city'),
>> label STRING MATERIALIZED event['label']
>> ) USING parquet;
>
>
>
> When issuing a query on a complex-type column as below
> SELECT name, age+1, get_json_object(params, '$.city'), event['label']
> FROM x
> WHERE event['label']='newuser';
>
> It is equivalent to
> SELECT name, new_age, city, label
> FROM x
> WHERE label = 'newuser'
>
> The query performance improves dramatically because
>
>    1. The new query (after rewriting) will read the new column city (of
>    string type) instead of reading the whole params map (a map of strings).
>    Much less data needs to be read
>    2. Vectorized reads can be utilized in the new query but cannot be
>    used in the old one, because vectorized reads can only be enabled when all
>    required columns are of atomic types
>    3. Filters can be pushed down. Only filters on atomic columns can be
>    pushed down. The original filter  event['label'] = 'newuser' is 

Re: Thoughts on Spark 3 release, or a preview release

2019-09-15 Thread Wenchen Fan
I don't expect to see a large DS V2 API change from now on. But we may
update the API a little bit if we find problems during the preview.

On Sat, Sep 14, 2019 at 10:16 PM Sean Owen  wrote:

> I don't think this suggests anything is finalized, including APIs. I
> would not guess there will be major changes from here though.
>
> On Fri, Sep 13, 2019 at 4:27 PM Andrew Melo  wrote:
> >
> > Hi Spark Aficionados-
> >
> > On Fri, Sep 13, 2019 at 15:08 Ryan Blue 
> wrote:
> >>
> >> +1 for a preview release.
> >>
> >> DSv2 is quite close to being ready. I can only think of a couple issues
> that we need to merge, like getting a fix for stats estimation done. I'll
> have a better idea once I've caught up from being away for ApacheCon and
> I'll add this to the agenda for our next DSv2 sync on Wednesday.
> >
> >
> > What does 3.0 mean for the DSv2 API? Does the API freeze at that point,
> or would it still be allowed to change? I'm writing a DSv2 plug-in
> (GitHub.com/spark-root/laurelin) and there's a couple little API things I
> think could be useful, I've just not had time to write here/open a JIRA
> about.
> >
> > Thanks
> > Andrew
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-05 Thread Wenchen Fan
+1

To be honest I don't like the legacy policy. It's too loose and easy for
users to make mistakes, especially when Spark returns null if a function
hits errors like overflow.

The strict policy is not good either. It's too strict and stops valid use
cases like writing timestamp values to a date type column. Users do expect
truncation to happen without adding a cast manually in this case. It's also
weird to use a Spark-specific policy that no other database is using.

The ANSI policy is better. It stops invalid use cases like writing string
values to an int type column, while keeping valid use cases like timestamp
-> date.
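
To make the difference concrete, a small sketch using the option named in the
vote email below (exact timing/wording of the errors depends on the final
implementation):

```
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
spark.sql("CREATE TABLE t (i INT, d DATE) USING parquet")

// rejected at analysis time under both ANSI and strict: string -> int
// spark.sql("INSERT INTO t SELECT 'abc', DATE '2019-09-05'")

// allowed under ANSI (timestamp truncates to date), rejected under strict
spark.sql("INSERT INTO t SELECT 1, TIMESTAMP '2019-09-05 10:00:00'")
```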

I think it's no doubt that we should use ANSI policy instead of legacy
policy for v1 tables. Except for backward compatibility, ANSI policy is
literally better than the legacy policy.

The v2 table case is arguable here. Although the ANSI policy is better than
the strict policy to me, this is just the store assignment policy, which only
partially controls the table insertion behavior. With Spark's "return null
on error" behavior, table insertion is more likely to insert invalid
null values with the ANSI policy compared to the strict policy.

I think we should use ANSI policy by default for both v1 and v2 tables,
because
1. End-users don't care how the table is implemented. Spark should provide
consistent table insertion behavior between v1 and v2 tables.
2. Data Source V2 is unstable in Spark 2.x so there is no backward
compatibility issue. That said, the baseline to judge which policy is
better should be the table insertion behavior in Spark 2.x, which is the
legacy policy + "return null on error". ANSI policy is better than the
baseline.
3. We expect more and more users to migrate their data sources to the V2
API. The strict policy can be a stopper as it's too big a breaking change,
which may break many existing queries.

Thanks,
Wenchen


On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang 
wrote:

> Hi everyone,
>
> I'd like to call for a vote on SPARK-28885 
>  "Follow ANSI store 
> assignment rules in table insertion by default".
> When inserting a value into a column with a different data type, Spark 
> performs type coercion. Currently, we support 3 policies for the type 
> coercion rules: ANSI, legacy and strict, which can be set via the option 
> "spark.sql.storeAssignmentPolicy":
> 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the 
> behavior is mostly the same as PostgreSQL. It disallows certain unreasonable 
> type conversions such as converting `string` to `int` and `double` to 
> `boolean`.
> 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, 
> which is very loose. E.g., converting either `string` to `int` or `double` to 
> `boolean` is allowed. It is the current behavior in Spark 2.x for 
> compatibility with Hive.
> 3. Strict: Spark doesn't allow any possible precision loss or data truncation 
> in type coercion, e.g., converting either `double` to `int` or `decimal` to 
> `double` is not allowed. The rules were originally for the Dataset encoder. As far as 
> I know, no mainstream DBMS is using this policy by default.
>
> Currently, the V1 data source uses "Legacy" policy by default, while V2 uses 
> "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 
> in Spark 3.0.
>
> There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in the 
> dev mailing list.
>
> This vote is open until next Thurs (Sept. 12th).
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thank you!
>
> Gengliang
>
>


Re: [Discuss] Follow ANSI SQL on table insertion

2019-08-05 Thread Wenchen Fan
I think we need to clarify one thing before further discussion: the
proposal is for the next release but not a long term solution.

IMO the long-term solution should be: completely follow the SQL standard (store
assignment rule + runtime exception), and provide a variant of functions
that can return null instead of throwing a runtime exception. For example, the TRY_CAST
in SQL Server
<https://docs.microsoft.com/en-us/sql/t-sql/functions/try-cast-transact-sql?view=sql-server-2017>,
or a more general SAFE prefix of functions in Big Query
<https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators>
.
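
As a concrete illustration of the current behavior that the safe variants would
preserve (while the strict/ANSI runtime mode would turn it into an error):

```
// Today an invalid cast silently yields NULL rather than failing:
spark.sql("SELECT CAST('abc' AS INT)").show()   // prints a single NULL value
// Under the runtime exception mode the same cast would throw, and a
// TRY_CAST/SAFE-style function would be the opt-in way to get NULL back.
```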

This proposal is the first step to move forward to the long term solution:
follow the SQL standard store assignment rule. It can help us prevent some
table insertion queries that are very likely to fail, at compile time.

The following steps in my mind are:
* finish the "strict mode" (ANSI SQL mode). It now applies to arithmetic
functions, but we need to apply it to more places like cast.
* introduce the safe version of functions. The safe version always returns
null for invalid input, no matter whether the strict mode is on or not. We need
some additional work to educate users to use the safe version of the
functions if they rely on the return-null behavior.
* turn on the strict mode by default.

Hopefully we can finish it soon, in Spark 3.x.

Thanks,
Wenchen

On Sat, Aug 3, 2019 at 7:07 AM Matt Cheah  wrote:

> *I agree that having both modes and let the user choose the one he/she
> wants is the best option (I don't see big arguments on this honestly). Once
> we have this, I don't see big differences on what is the default. What - I
> think - we still have to work on, is to go ahead with the "strict mode"
> work and provide a more convenient way for users to switch among the 2
> options. I mean: currently we have one flag for throwing exception on
> overflow for operations on decimals, one for doing the same for operations
> on other data types and probably going ahead we will have more. I think in
> the end we will need to collect them all under an "umbrella" flag which
> lets the user simply switch between strict and non-strict mode. I also
> think that we will need to document this very well and give it particular
> attention in our docs, maybe with a dedicated section, in order to provide
> enough visibility on it to end users.*
>
>
>
> I’m +1 on adding a strict mode flag this way, but I’m undecided on whether
> or not we want a separate flag for each of the arithmetic overflow
> situations that could produce invalid results. My intuition is yes, because
> different users have different levels of tolerance for different kinds of
> errors. I’d expect these sorts of configurations to be set up at an
> infrastructure level, e.g. to maintain consistent standards throughout a
> whole organization.
>
>
>
> *From: *Gengliang Wang 
> *Date: *Thursday, August 1, 2019 at 3:07 AM
> *To: *Marco Gaido 
> *Cc: *Wenchen Fan , Hyukjin Kwon ,
> Russell Spitzer , Ryan Blue ,
> Reynold Xin , Matt Cheah ,
> Takeshi Yamamuro , Spark dev list <
> dev@spark.apache.org>
> *Subject: *Re: [Discuss] Follow ANSI SQL on table insertion
>
>
>
> Hi all,
>
>
>
> Let me explain a little bit on the proposal.
>
> By default, we follow the store assignment rules in table insertion. On
> invalid casting, the result is null. It's better than the behavior in Spark
> 2.x while keeping backward compatibility.
>
> If users can't tolerate the silent corruption, they can enable the new
> mode which throws runtime exceptions.
>
> The proposal itself is quite complete. It satisfies different users to
> some degree.
>
>
>
> It is hard to avoid null in data processing anyway. For example,
>
> > select 2147483647 + 1
>
> 2147483647 is the max value of Int. And the result data type of adding
> two integers is supposed to be Integer type. Since the value of
> (2147483647 + 1) can't fit into an Int, I think Spark has to return null or throw
> runtime exceptions in such cases. (Someone can argue that we can always
> convert the result as wider types, but that's another topic about
> performance and DBMS behaviors)
>
>
>
> So, given a table t with an Int column, *checking data type with Up-Cast
> can't avoid possible null values in the following SQL*, as the result
> data type of (int_column_a + int_column_b) is int type.
>
> >  insert into t select int_column_a + int_column_b from tbl_a, tbl_b;
>
>
>
> Furthermore, if Spark uses Up-Cast and a user's existing ETL job failed
> because of that, what should he/she do then? I think he/she will try adding
> "cast" to queries first. Maybe a project for unifying data schema over all
> data sources has to be done

Re: Re: How to force sorted merge join to broadcast join

2019-07-29 Thread Wenchen Fan
You can try an EXPLAIN COST query and see if it works for you.
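
For example (table names are placeholders; run ANALYZE TABLE first so the
statistics are populated):

```
// EXPLAIN COST prints the optimized logical plan annotated with Statistics
// (sizeInBytes, and rowCount when available), which shows whether the small
// side is below spark.sql.autoBroadcastJoinThreshold.
spark.sql("ANALYZE TABLE small_table COMPUTE STATISTICS")
spark.sql("EXPLAIN COST SELECT * FROM big_table b JOIN small_table s ON b.k = s.k")
  .show(truncate = false)
```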

On Mon, Jul 29, 2019 at 5:34 PM Rubén Berenguel 
wrote:

> I think there is no way of doing that (at least don't remember one right
> now). The closest I remember now is that you can run the SQL "ANALYZE TABLE
> table_name COMPUTE STATISTICS" to compute them regardless of having a query
> (it also hints the cost-based optimiser if I remember correctly), but as far
> as displaying them goes, it escapes me right now whether it can be done.
>
> R
>
> --
> Rubén Berenguel
>
> On 29 July 2019 at 11:03:13, zhangliyun (kelly...@126.com) wrote:
>
> Thanks! After using the syntax provided in the link, select /*+ BROADCAST
> (A) */ ..., I got what I want.
> But I want to ask: besides using queryExecution.stringWithStats (DataFrame
> API) to show the table statistics, is there any way to show the table
> statistics with EXPLAIN in the Spark SQL command line?
>
> Best Regards
> Kelly
>
>
>
> 在 2019-07-29 14:29:50,"Rubén Berenguel"  写道:
>
> Hi, I hope this answers your question.
>
> You can hint the broadcast in SQL as detailed here:
> https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins-broadcast.html
>  (thanks
> Jacek :) )
> I'd recommend creating a temporary table with the trimming you use in the
> join (for clarity). Also keep in mind using the methods is more
> powerful/readable than
> using Spark SQL directly (as happens with the broadcast case, although it
> depends on personal preference).
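>
> A minimal sketch of the DataFrame-API variant, using the table/column names
> from the query above (which side is actually the small one is for Kelly to
> confirm):
>
> ```
> import org.apache.spark.sql.functions.{broadcast, trim}
>
> val small = spark.table("cc_rank_agg")      // assumed to be the ~8-row side
> val big   = spark.table("cc_base_part1")
> val joined = big.join(
>   broadcast(small),
>   trim(big("country")) === trim(small("cntry_code")),
>   "left")
> ```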
>
> Regards,
>
> Ruben
>
> --
> Rubén Berenguel
>
> On 29 July 2019 at 07:12:30, zhangliyun (kelly...@126.com) wrote:
>
> Hi all:
>    I want to ask a question about broadcast join in Spark SQL.
>
>
> ```
>select A.*,B.nsf_cards_ratio * 1.00 / A.nsf_on_entry as nsf_ratio_to_pop
> from B
> left join A
> on trim(A.country) = trim(B.cntry_code);
> ```
> Here A is a small table with only 8 rows, but somehow the statistics of table A
> have a problem.
>
> A join B is a sort-merge join, while the join key (trim(A.country) =
> trim(B.cntry_code)) only
> has several values (nearly 21 countries). Is there any way I can force Spark
> SQL to use a
> broadcast join? (I cannot simply enlarge
> spark.sql.autoBroadcastJoinThreshold, as I do not know the exact table size
> that Spark SQL computes for it.)
>
> I tried to print the physical plan, but it did not show the table size,
> and I did not know
> how to enlarge the value of spark.sql.autoBroadcastJoinThreshold to turn
> the sort-merge join into a
> broadcast join.
>
>
> ```
> == Parsed Logical Plan ==
> 'Project [ArrayBuffer(cc_base_part1).*, (('cc_base_part1.nsf_cards_ratio *
> 1.00) / 'cc_rank_agg.nsf_on_entry) AS nsf_ratio_to_pop#369]
> +- 'Join LeftOuter, ('trim('cc_base_part1.country) =
> 'trim('cc_rank_agg.cntry_code))
>:- 'UnresolvedRelation `cc_base_part1`
>+- 'UnresolvedRelation `cc_rank_agg`
>
> == Analyzed Logical Plan ==
> cust_id: string, country: string, cc_id: decimal(38,0), bin_hmac: string,
> credit_card_created_date: string, card_usage: smallint, cc_category:
> string, cc_size: string, nsf_risk: string, nsf_cards_ratio: decimal(18,2),
> dt: string, nsf_ratio_to_pop: decimal(38,6)
> Project [cust_id#372, country#373, cc_id#374, bin_hmac#375,
> credit_card_created_date#376, card_usage#377, cc_category#378, cc_size#379,
> nsf_risk#380, nsf_cards_ratio#381, dt#382,
> CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(nsf_cards_ratio#381
> as decimal(18,2))) * promote_precision(cast(1.00 as decimal(18,2,
> DecimalType(22,4)) as decimal(38,16))) /
> promote_precision(cast(nsf_on_entry#386 as decimal(38,16,
> DecimalType(38,6)) AS nsf_ratio_to_pop#369]
> +- Join LeftOuter, (trim(country#373, None) = trim(cntry_code#383, None))
>:- SubqueryAlias cc_base_part1
>:  +- HiveTableRelation `fpv365h`.`cc_base_part1`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cust_id#372,
> country#373, cc_id#374, bin_hmac#375, credit_card_created_date#376,
> card_usage#377, cc_category#378, cc_size#379, nsf_risk#380,
> nsf_cards_ratio#381], [dt#382]
>+- SubqueryAlias cc_rank_agg
>   +- HiveTableRelation `fpv365h`.`cc_rank_agg`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cntry_code#383,
> num_tot_cards#384L, num_nsf_cards#385L, nsf_on_entry#386], [dt#387]
>
>
>
> ```
>
> Does Spark have any command to show the table size when printing the
> physical plan? I would appreciate any help with my question.
>
>
> Best regards
>
> Kelly Zhang
>
>
>
>
>
>
>
>
>


Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-29 Thread Wenchen Fan
> I'm a big -1 on null values for invalid casts.

This is why we want to introduce the ANSI mode, so that an invalid cast fails
at runtime. But we have to keep the null behavior for a while, to keep
backward compatibility. Spark has returned null for invalid casts since the first
day of Spark SQL; we can't just change it without a way to restore the
old behavior.

I'm OK with adding a strict mode for the upcast behavior in table
insertion, but I don't agree with making it the default. The default
behavior should be either the ANSI SQL behavior or the legacy Spark
behavior.

> other modes should be allowed only with a strict warning that the behavior will
be determined by the underlying sink.

Seems there is some misunderstanding. The table insertion behavior is fully
controlled by Spark. Spark decides when to add a cast, and Spark decides
whether an invalid cast should return null or fail. The sink is only
responsible for writing data, not the type coercion/cast stuff.

On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer 
wrote:

> I'm a big -1 on null values for invalid casts. This can lead to a lot of
> even more unexpected errors and runtime behavior since null is
>
> 1. Not allowed in all schemas (Leading to a runtime error anyway)
> 2. Is the same as delete in some systems (leading to data loss)
>
> And this would be dependent on the sink being used. Spark won't just be
> interacting with ANSI compliant sinks so I think it makes much more sense
> to be strict. I think Upcast mode is a sensible default and other modes
> should be allowed only with a strict warning that the behavior will be determined
> by the underlying sink.
>
> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro 
> wrote:
>
>> Hi, all
>>
>> +1 for implementing this new store cast mode.
>> From a viewpoint of DBMS users, this cast is pretty common for INSERTs
>> and I think this functionality could
>> promote migrations from existing DBMSs to Spark.
>>
>> The most important thing for DBMS users is that they could optionally
>> choose this mode when inserting data.
>> Therefore, I think it might be okay that the two modes (the current
>> upcast mode and the proposed store cast mode)
>> co-exist for INSERTs. (There is a room to discuss which mode  is enabled
>> by default though...)
>>
>> IMHO we'll provide three behaviours below for INSERTs;
>>  - upcast mode
>>  - ANSI store cast mode and runtime exceptions thrown for invalid values
>>  - ANSI store cast mode and null filled for invalid values
>>
>>
>> On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <
>> gengliang.w...@databricks.com> wrote:
>>
>>> Hi Ryan,
>>>
>>> Thanks for the suggestions on the proposal and doc.
>>> Currently, there is no data type validation in table insertion of V1. We
>>> are on the same page that we should improve it. But using UpCast is from
>>> one extreme to another. It is possible that many queries are broken after
>>> upgrading to Spark 3.0.
>>> The rules of UpCast are too strict. E.g. it doesn't allow assigning
>>> Timestamp type to Date Type, as there will be "precision loss". To me, the
>>> type coercion is reasonable and the "precision loss" is expected.
>>> This is very common in other SQL engines.
>>> As long as Spark is following the ANSI SQL store assignment rules, it is
>>> users' responsibility to take good care of the type coercion in data
>>> writing. I think it's the right decision.
>>>
>>> > But the new behavior is only applied in DataSourceV2, so it won’t
>>> affect existing jobs until sources move to v2 and break other behavior
>>> anyway.
>>> Eventually, most sources are supposed to be migrated to Data Source V2.
>>> I think we can discuss and make a decision now.
>>>
>>> > Fixing the silent corruption by adding a runtime exception is not a
>>> good option, either.
>>> The new optional mode proposed in
>>> https://issues.apache.org/jira/browse/SPARK-28512 is disabled by
>>> default. This should be fine.
>>>
>>>
>>>
>>> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan 
>>> wrote:
>>>
>>>> I don't agree with handling literal values specially. Although Postgres
>>>> does it, I can't find anything about it in the SQL standard. And it
>>>> introduces inconsistent behaviors which may be strange to users:
>>>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>>>> * The same insert with a decimal column as input will fail even when a
>>>> decimal literal would succeed
>

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-26 Thread Wenchen Fan
I don't agree with handling literal values specially. Although Postgres
does it, I can't find anything about it in the SQL standard. And it
introduces inconsistent behaviors which may be strange to users:
* What about something like "INSERT INTO t SELECT float_col + 1.1"?
* The same insert with a decimal column as input will fail even when a
decimal literal would succeed
* Similar insert queries with "literal" inputs can be constructed through
layers of indirection via views, inline views, CTEs, unions, etc. Would
those decimals be treated as columns and fail or would we attempt to make
them succeed as well? Would users find this behavior surprising?

Silently corrupting data is bad, but this is the decision we made at the
beginning when designing Spark's behavior. Whenever an error occurs, Spark
attempts to return null instead of throwing a runtime exception. Recently we provided
configs to make Spark fail at runtime for overflow, but that's another
story. Silently corrupting data is bad, runtime exceptions are bad, and
forbidding all the table insertions that may fail (even with very little
possibility) is also bad. We have to make trade-offs. The trade-offs we
made in this proposal are:
* forbid table insertions that are very likely to fail, at compile time
(things like writing string values to an int column).
* allow table insertions that are not that likely to fail. If the data is
wrong, don't fail; insert null.
* provide a config to fail the insertion at runtime if the data is wrong.

>  But the new behavior is only applied in DataSourceV2, so it won’t affect
existing jobs until sources move to v2 and break other behavior anyway.
When users write SQL queries, they don't care if a table is backed by Data
Source V1 or V2. We should make sure the table insertion behavior is
consistent and reasonable. Furthermore, users may even not care if the SQL
queries are run in Spark or other RDBMS, it's better to follow SQL standard
instead of introducing a Spark-specific behavior.

We are not talking about a small use case like allowing a decimal
literal to be written to a float column; we are talking about a big goal to make Spark
compliant with the SQL standard, w.r.t.
https://issues.apache.org/jira/browse/SPARK-26217 . This proposal is a
sub-task of it, to make the table insertion behavior follow SQL standard.

On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue  wrote:

> I don’t think this is a good idea. Following the ANSI standard is usually
> fine, but here it would *silently corrupt data*.
>
> From your proposal doc, ANSI allows implicitly casting from long to int
> (any numeric type to any other numeric type) and inserts NULL when a value
> overflows. That would drop data values and is not safe.
>
> Fixing the silent corruption by adding a runtime exception is not a good
> option, either. That puts off the problem until much of the job has
> completed, instead of catching the error at analysis time. It is better to
> catch this earlier during analysis than to run most of a job and then fail.
>
> In addition, part of the justification for using the ANSI standard is to
> avoid breaking existing jobs. But the new behavior is only applied in
> DataSourceV2, so it won’t affect existing jobs until sources move to v2 and
> break other behavior anyway.
>
> I think that the correct solution is to go with the existing validation
> rules that require explicit casts to truncate values.
>
> That still leaves the use case that motivated this proposal, which is that
> floating point literals are parsed as decimals and fail simple insert
> statements. We already came up with two alternatives to fix that problem in
> the DSv2 sync and I think it is a better idea to go with one of those
> instead of “fixing” Spark in a way that will corrupt data or cause runtime
> failures.
>
> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan  wrote:
>
>> I have heard many complaints about the old table insertion
>> behavior. Blindly casting everything will leak the user mistake to a late
>> stage of the data pipeline, and make it very hard to debug. When a user
>> writes string values to an int column, it's probably a mistake and the
>> columns are misordered in the INSERT statement. We should fail the query
>> earlier and ask users to fix the mistake.
>>
>> In the meanwhile, I agree that the new table insertion behavior we
>> introduced for Data Source V2 is too strict. It may fail valid queries
>> unexpectedly.
>>
>> In general, I support the direction of following the ANSI SQL standard.
>> But I'd like to do it with 2 steps:
>> 1. only add cast when the assignment rule is satisfied. This should be
>> the default behavior and we should provide a legacy config to restore to
>> the old behavior.
>> 2. fail the cast operation at runtime if overflow happens. AFAIK Marco
>> Gaid

Re: [Discuss] Follow ANSI SQL on table insertion

2019-08-05 Thread Wenchen Fan
Ryan, I agree that it needs a VOTE to decide the most reasonable default
setting. But what shall we do for Spark 3.0 if there is no more progress on
this project (just assume there isn't)?

In the master branch, we added SQL support for Data Source V2. This exposes
a serious problem: the table insertion behavior is different between tables
from custom catalogs and from the built-in hive catalog. I don't think this
is acceptable for Spark 3.0.

This leaves us with 2 options (assume no more progress on ANSI mode, strict
mode, etc.):
1. change the table insertion behavior of DS v1 to upcast mode, so that
it's the same with DS v2.
2. change the table insertion behavior of DS v2 to always-cast mode, so
that it's the same with DS v1.

I tried option 1 before, it broke a lot of tests and I believe it will
break many user queries as well. Option 2 is not ideal but it's safe: it's
no worse than the last release.

That's why I disagree with "Finish ANSI SQL mode - but do not make it the
default because it is not safe without an option to enable strict mode.".
Yes it's not safe compared to the ideal solution. But it's safe compared to
the last release. We must consider the risk of not being able to finish the
"ideal solution" before Spark 3.0.

The runtime exception mode is still under development. The try_cast or safe
methods are not even planned. The upcast mode has serious backward
compatibility problems. To prepare for the worst, I think we can do this
first: create a flag for the upcast mode, and turn it off by default, for
both data source v1 and v2. Then we can work on the following tasks in
parallel and decide the default behavior later according to the progress:
1. implement the store assignment rule
2. finish the runtime exception mode
3. add try_cast or safe methods

Another option is to ask the PMC to vote for blocking Spark 3.0 if the
"return null behavior" is not fixed. But I don't think it's likely to
happen.

On Tue, Aug 6, 2019 at 12:34 AM Ryan Blue  wrote:

> Wenchen, I don’t think we agree on what “strict mode” would mean. Marco is
> talking about strict mode as an extension of the flag for throwing
> exceptions on overflow for decimal operations. That is not ANSI SQL mode.
>
> Also, we need more than ANSI SQL and runtime failure modes. For the
> motivating problem of validating a write, we need a way to preserve the
> analysis-time failure if types don’t match. That, combined with a runtime
> strict mode, is the option that fails fast and guarantees data isn’t lost.
>
> I agree that adding try_cast or safe methods is a good idea.
>
> So here’s a revised set of steps:
>
>1. Finish ANSI SQL mode - but do not make it the default because it is
>not safe without an option to enable strict mode.
>2. Add strict mode for runtime calculations and turn it on by default
>3. Add a flag to control analysis time vs runtime failures (using
>strict mode or ANSI SQL mode) for v2 writes
>
> The choice of whether runtime or analysis time failures should be the
> default for v2 writes is worth a VOTE on this list. Once we agree on what
> modes and options should be available, we can call a vote to build
> consensus around a reasonable set of defaults, given that there are a lot
> of varying opinions on this thread.
>
> On Mon, Aug 5, 2019 at 12:49 AM Wenchen Fan  wrote:
>
>> I think we need to clarify one thing before further discussion: the
>> proposal is for the next release but not a long term solution.
>>
>> IMO the long term solution should be: completely follow SQL standard
>> (store assignment rule + runtime exception), and provide a variant of
>> functions that can return null instead of runtime exception. For example,
>> the TRY_CAST in SQL server
>> <https://docs.microsoft.com/en-us/sql/t-sql/functions/try-cast-transact-sql?view=sql-server-2017>,
>> or a more general SAFE prefix of functions in Big Query
>> <https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators>
>> .
>>
>> This proposal is the first step to move forward to the long term
>> solution: follow the SQL standard store assignment rule. It can help us
>> prevent some table insertion queries that are very likely to fail, at
>> compile time.
>>
>> The following steps in my mind are:
>> * finish the "strict mode"(ANSI SQL mode). It now applies to arithmetic
>> functions, but we need to apply it to more places like cast.
>> * introduce the safe version of functions. The safe version always
>> returns null for invalid input, no matter the strict mode is on or not. We
>> need some additional work to educate users to use the safe version of the
>> functions if they rely on the return null behavior.
>> * turn on the strict m

Re: DataSourceV2 : Transactional Write support

2019-08-05 Thread Wenchen Fan
I agree with the temp table approach. One idea is: maybe we only need one
temp table, and each task writes to this temp table. At the end we read the
data from the temp table and write it to the target table. AFAIK JDBC can
handle concurrent table writing very well, and it's better than creating
thousands of temp tables for one write job (assuming the input RDD has
thousands of partitions).
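
A rough sketch of that driver-side step (table names and the JDBC URL are
placeholders; the exact SQL and transactional behavior of DDL depend on the
target database):

```
import java.sql.DriverManager

// After all tasks have appended their rows to a single staging table over
// JDBC, the driver moves them into the final table in one transaction.
val jdbcUrl = "jdbc:..."  // placeholder
val conn = DriverManager.getConnection(jdbcUrl)
try {
  conn.setAutoCommit(false)
  val stmt = conn.createStatement()
  stmt.executeUpdate("INSERT INTO final_table SELECT * FROM staging_table")
  stmt.executeUpdate("DROP TABLE staging_table")
  conn.commit()
} catch {
  case e: Exception =>
    conn.rollback()
    throw e
} finally {
  conn.close()
}
```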

On Tue, Aug 6, 2019 at 7:57 AM Shiv Prashant Sood 
wrote:

> Thanks all for the clarification.
>
> Regards,
> Shiv
>
> On Sat, Aug 3, 2019 at 12:49 PM Ryan Blue 
> wrote:
>
>> > What you could try instead is intermediate output: inserting into a
>> temporary table in executors, and moving inserted records to the final table
>> in the driver (must be atomic)
>>
>> I think that this is the approach that other systems (maybe sqoop?) have
>> taken. Insert into independent temporary tables, which can be done quickly.
>> Then for the final commit operation, union and insert into the final table.
>> In a lot of cases, JDBC databases can do that quickly as well because the
>> data is already on disk and just needs to be added to the final table.
>>
>> On Fri, Aug 2, 2019 at 7:25 PM Jungtaek Lim  wrote:
>>
>>> I asked a similar question for end-to-end exactly-once with Kafka, and
>>> you're correct that distributed transactions are not supported. Introducing
>>> a distributed transaction protocol like "two-phase commit" requires a huge change to the
>>> Spark codebase and the feedback was not positive.
>>>
>>> What you could try instead is intermediate output: inserting into a
>>> temporary table in executors, and moving inserted records to the final table
>>> in the driver (must be atomic).
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sat, Aug 3, 2019 at 4:56 AM Shiv Prashant Sood <
>>> shivprash...@gmail.com> wrote:
>>>
 All,

 I understood that DataSourceV2 supports Transactional write and wanted
 to  implement that in JDBC DataSource V2 connector ( PR#25211
  ).

 I don't see how this is feasible for a JDBC-based connector. The framework
 suggests that the EXECUTOR sends a commit message to the DRIVER, and the actual
 commit should only be done by the DRIVER after receiving all commit
 confirmations. This will not work for JDBC, as commits have to happen on
 the JDBC Connection, which is maintained by the EXECUTORS, and a
 JDBCConnection is not serializable, so it cannot be sent to the DRIVER.

 Am I right in thinking that this cannot be supported for JDBC? My goal
 is to either fully write or roll back the dataframe write operation.

 Thanks in advance for your help.

 Regards,
 Shiv

>>>
>>>
>>> --
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


Re: DataSourceV2 sync notes - 10 July 2019

2019-07-23 Thread Wenchen Fan
Hi Ryan,

Thanks for summarizing and sending out the meeting notes! Unfortunately, I
missed the last sync, but the topics are really interesting, especially the
stats integration.

The ideal solution I can think of is to refactor the optimizer/planner and
move all the stats-based optimization to the physical plan phase (or do it
during the planning). This needs a lot of design work and I'm not sure if
we can finish it in the near future.

Alternatively, we can do the operator pushdown at logical plan phase via
the optimizer. This is not ideal but I think is a better workaround than
doing pushdown twice. The parquet nested column pruning is also done at the
logical plan phase, so I think there are no serious problems if we do
operator pushdown at the logical plan phase.

This is only about the internal implementation so we can fix it at any
time. But this may hurt data source v2 performance a lot and we'd better
fix it sooner rather than later.


On Sat, Jul 20, 2019 at 8:20 AM Ryan Blue  wrote:

> Here are my notes from the last sync. If you’d like to be added to the
> invite or have topics, please let me know.
>
> *Attendees*:
>
> Ryan Blue
> Matt Cheah
> Yifei Huang
> Jose Torres
> Burak Yavuz
> Gengliang Wang
> Michael Artz
> Russel Spitzer
>
> *Topics*:
>
>- Existing PRs
>   - V2 session catalog: https://github.com/apache/spark/pull/24768
>   - REPLACE and RTAS: https://github.com/apache/spark/pull/24798
>   - DESCRIBE TABLE: https://github.com/apache/spark/pull/25040
>   - ALTER TABLE: https://github.com/apache/spark/pull/24937
>   - INSERT INTO: https://github.com/apache/spark/pull/24832
>- Stats integration
>- CTAS and DataFrameWriter behavior
>
> *Discussion*:
>
>- ALTER TABLE PR is ready to commit (and was after the sync)
>- REPLACE and RTAS PR: waiting on more reviews
>- INSERT INTO PR: Ryan will review
>- DESCRIBE TABLE has test failures, Matt will fix
>- V2 session catalog:
>   - How will v2 catalog be configured?
>   - Ryan: This is up for discussion because it currently uses a table
>   property. I think it needs to be configurable
>   - Burak: Agree that it should be configurable
>   - Ryan: Does this need to be determined now, or can we solve this
>   after getting the functionality in?
>   - Jose: let’s get it in and fix it later
>- Stats integration:
>   - Matt: has anyone looked at stats integration? What needs to be
>   done?
>   - Ryan: stats are part of the Scan API. Configure a scan with
>   ScanBuilder and then get stats from it. The problem is that this happens
>   when converting to physical plan, after the optimizer. But the optimizer
>   determines what gets broadcasted. A work-around Netflix uses is to run 
> push
>   down in the stats code. This runs push-down twice and was rejected from
>   Spark, but is important for performance. We should add a property to 
> enable
>   this.
>   - Ryan: The larger problem is that stats are used in the optimizer,
>   but push-down happens when converting to physical plan. This is also
>   related to our earlier discussions about when join types are chosen. 
> Fixing
>   this is a big project
>- CTAS and DataFrameWriter behavior
>   - Burak: DataFrameWriter uses CTAS where it shouldn’t. It is
>   difficult to predict v1 behavior
>   - Ryan: Agree, v1 DataFrameWriter does not have clear behavior. We
>   suggest a replacement with clear verbs for each SQL action: 
> append/insert,
>   overwrite, overwriteDynamic, create (table), replace (table)
>   - Ryan: Prototype available here:
>   https://gist.github.com/rdblue/6bc140a575fdf266beb2710ad9dbed8f
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-25 Thread Wenchen Fan
I have heard many complaints about the old table insertion behavior.
Blindly casting everything will leak the user's mistake to a late stage of
the data pipeline, and make it very hard to debug. When a user writes
string values to an int column, it's probably a mistake, e.g. the columns are
misordered in the INSERT statement. We should fail the query earlier and
ask users to fix the mistake.

In the meanwhile, I agree that the new table insertion behavior we
introduced for Data Source V2 is too strict. It may fail valid queries
unexpectedly.

In general, I support the direction of following the ANSI SQL standard, but
I'd like to do it in 2 steps:
1. only add the cast when the assignment rule is satisfied. This should be the
default behavior, and we should provide a legacy config to restore the
old behavior.
2. fail the cast operation at runtime if overflow happens. AFAIK Marco
Gaido is working on it already. This will have a config as well, and by
default we still return null.

After doing this, the default behavior will be slightly different from the
SQL standard (cast can return null), and users can turn on the ANSI mode to
fully follow the SQL standard. This is much better than before and should
prevent a lot of user mistakes. It's also a reasonable choice to me to not
throw exceptions at runtime by default, as it's usually bad for
long-running jobs.
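
A small illustration of the two proposed steps (hypothetical session; the table name t is made up, and the behaviors shown are the ones proposed in this email, not necessarily what any released Spark version does):

spark.sql("CREATE TABLE t (i INT) USING parquet")

// Step 1: the ANSI store assignment rule rejects string -> int at analysis time,
// instead of silently inserting a cast (a legacy config would restore the old behavior).
spark.sql("INSERT INTO t VALUES ('abc')")

// long -> int satisfies the assignment rule, so a cast is added. Under step 2, an
// overflowing value still returns null by default, or fails at runtime once ANSI mode is on.
spark.sql("INSERT INTO t SELECT 3000000000L")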

Thanks,
Wenchen

On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
gengliang.w...@databricks.com> wrote:

> Hi everyone,
>
> I would like to discuss the table insertion behavior of Spark. In the
> current data source V2, only UpCast is allowed for table insertion. I think
> following ANSI SQL is a better idea.
> For more information, please read the Discuss: Follow ANSI SQL on table
> insertion
> 
> Please let me know if you have any thoughts on this.
>
> Regards,
> Gengliang
>


Re: Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Wenchen Fan
> New pushdown API for DataSourceV2

One correction: I want to revisit the pushdown API to make sure it works
for dynamic partition pruning and can be extended to support
limit/aggregate/... pushdown in the future. It should be a small API update
instead of a new API.

On Fri, Sep 20, 2019 at 3:46 PM Xingbo Jiang  wrote:

> Hi all,
>
> Let's start a new thread to discuss the on-going features for Spark 3.0
> preview release.
>
> Below is the feature list for the Spark 3.0 preview release. The list is
> collected from the previous discussions in the dev list.
>
>- Followup of the shuffle+repartition correctness issue: support roll
>back shuffle stages (https://issues.apache.org/jira/browse/SPARK-25341)
>- Upgrade the built-in Hive to 2.3.5 for hadoop-3.2 (
>https://issues.apache.org/jira/browse/SPARK-23710)
>- JDK 11 support (https://issues.apache.org/jira/browse/SPARK-28684)
>- Scala 2.13 support (https://issues.apache.org/jira/browse/SPARK-25075
>)
>- DataSourceV2 features
>   - Enable file source v2 writers (
>   https://issues.apache.org/jira/browse/SPARK-27589)
>   - CREATE TABLE USING with DataSourceV2
>   - New pushdown API for DataSourceV2
>   - Support DELETE/UPDATE/MERGE Operations in DataSourceV2 (
>   https://issues.apache.org/jira/browse/SPARK-28303)
>- Correctness issue: Stream-stream joins - left outer join gives
>inconsistent output (https://issues.apache.org/jira/browse/SPARK-26154)
>- Revisiting Python / pandas UDF (
>https://issues.apache.org/jira/browse/SPARK-28264)
>- Spark Graph (https://issues.apache.org/jira/browse/SPARK-25994)
>
> Features that are nice to have:
>
>- Use remote storage for persisting shuffle data (
>https://issues.apache.org/jira/browse/SPARK-25299)
>- Spark + Hadoop + Parquet + Avro compatibility problems (
>https://issues.apache.org/jira/browse/SPARK-25588)
>- Introduce new option to Kafka source - specify timestamp to start
>and end offset (https://issues.apache.org/jira/browse/SPARK-26848)
>- Delete files after processing in structured streaming (
>https://issues.apache.org/jira/browse/SPARK-20568)
>
> Here, I am proposing to cut the branch on October 15th. If the features
> are targeting to 3.0 preview release, please prioritize the work and finish
> it before the date. Note, Oct. 15th is not the code freeze of Spark 3.0.
> That means, the community will still work on the features for the upcoming
> Spark 3.0 release, even if they are not included in the preview release.
> The goal of preview release is to collect more feedback from the community
> regarding the new 3.0 features/behavior changes.
>
> Thanks!
>


Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Wenchen Fan
dynamic partition pruning rule generates "hidden" filters that will be
converted to real predicates at runtime, so it doesn't matter where we run
the rule.

For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's better
to run it before join reorder.

On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue  wrote:

> Hi everyone,
>
> I have been working on a PR that moves filter and projection pushdown into
> the optimizer for DSv2, instead of when converting to physical plan. This
> will make DSv2 work with optimizer rules that depend on stats, like join
> reordering.
>
> While adding the optimizer rule, I found that some rules appear to be out
> of order. For example, PruneFileSourcePartitions that handles filter
> pushdown for v1 scans is in SparkOptimizer (spark-sql) in a batch that
> will run after all of the batches in Optimizer (spark-catalyst) including
> CostBasedJoinReorder.
>
> SparkOptimizer also adds the new “dynamic partition pruning” rules *after*
> both the cost-based join reordering and the v1 partition pruning rule. I’m
> not sure why this should run after join reordering and partition pruning,
> since it seems to me like additional filters would be good to have before
> those rules run.
>
> It looks like this might just be that the rules were written in the
> spark-sql module instead of in catalyst. That makes some sense for the v1
> pushdown, which is altering physical plan details (FileIndex) that have
> leaked into the logical plan. I’m not sure why the dynamic partition
> pruning rules aren’t in catalyst or why they run after the v1 predicate
> pushdown.
>
> Can someone more familiar with these rules clarify why they appear to be
> out of order?
>
> Assuming that this is an accident, I think it’s something that should be
> fixed before 3.0. My PR fixes early pushdown, but the “dynamic” pruning may
> still need to be addressed.
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
Yea codegen can be a good improvement, PRs are welcome!

On Sun, Nov 10, 2019 at 6:28 PM Wang, Gang  wrote:

> That’s right. By default, Spark prefers sort merge join.
>
> In our production environment, there are many huge bucketed tables. We
> can leverage the bucketing to avoid a shuffle when joining with other small
> tables (the small tables are not small enough to leverage broadcast join).
> The problem is that, although the shuffle can be avoided, a sort is still
> necessary to leverage sort merge join (we cannot pre-sort since there are
> different join patterns). For a huge table, the sort may take tens of seconds.
>
> That’s why I’m trying to enable shuffle hash join, and for such cases,
> there was almost a 10% ~ 20% improvement when applying shuffle hash join
> instead of sort merge join. I wonder if there is still some room to
> improve shuffle hash join? Like code generation for ShuffledHashJoinExec or
> something….
>
>
>
> *From: *Wenchen Fan 
> *Date: *Sunday, November 10, 2019 at 5:57 PM
> *To: *"Wang, Gang" 
> *Cc: *"dev@spark.apache.org" 
> *Subject: *Re: Why not implement CodegenSupport in class
> ShuffledHashJoinExec?
>
>
>
> By default sort merge join is preferred over shuffle hash join, which is why
> we haven't spent resources to implement codegen for it.
>
>
>
> On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang 
> wrote:
>
> There are some cases where shuffle hash join performs even better than sort
> merge join.
>
> While, I noticed that ShuffledHashJoinExec does not implement
> CodegenSupport, is there any concern? And if there is any chance to improve
> the performance of ShuffledHashJoinExec?
>
>
>
>
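
A minimal sketch of how one might steer the planner toward ShuffledHashJoinExec for the bucketed-table scenario described above; bigBucketed and smallerTable are assumed DataFrames, the config name is as of Spark 2.4, and the planner still requires the build side to be much smaller than the other side and below autoBroadcastJoinThreshold times the shuffle partition count:

// Sort merge join is preferred by default; turning this off lets the planner
// consider shuffle hash join when its size conditions are met.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

val joined = bigBucketed.join(smallerTable, Seq("key"))
joined.explain()   // look for ShuffledHashJoin instead of SortMergeJoin in the physical plan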


Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-06 Thread Wenchen Fan
Sounds reasonable to me. We should make the behavior consistent within
Spark.

On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler  wrote:

> Currently, when a PySpark Row is created with keyword arguments, the
> fields are sorted alphabetically. This has created a lot of confusion with
> users because it is not obvious (although it is stated in the pydocs) that
> they will be sorted alphabetically. Then later when applying a schema and
> the field order does not match, an error will occur. Here is a list of some
> of the JIRAs that I have been tracking all related to this issue:
> SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion
> of the issue [1].
>
> The original reason for sorting fields is that kwargs in Python < 3.6
> are not guaranteed to be in the same order that they were entered [2].
> Sorting alphabetically ensures a consistent order. Matters are further
> complicated with the __from_dict__ flag that allows the Row fields to
> be referenced by name when made by kwargs, but this flag is not serialized
> with the Row and leads to inconsistent behavior. For instance:
>
> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
> Row(B='2', A='1')
> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
> Row(B='1', A='2')
>
> I think the best way to fix this is to remove the sorting of fields when
> constructing a Row. For users with Python 3.6+, nothing would change
> because these versions of Python ensure that the kwargs stay in the
> order entered. For users with Python < 3.6, using kwargs would check a
> conf to either raise an error or fall back to a LegacyRow that sorts the
> fields as before. With Python < 3.6 being deprecated now, this LegacyRow
> can also be removed at the same time. There are also other ways to create
> Rows that will not be affected. I have opened a JIRA [3] to capture this,
> but I am wondering what others think about fixing this for Spark 3.0?
>
> [1] https://github.com/apache/spark/pull/20280
> [2] https://www.python.org/dev/peps/pep-0468/
> [3] https://issues.apache.org/jira/browse/SPARK-29748
>
>


Re: [DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Wenchen Fan
We really need some documentation to define what non-deterministic means.
AFAIK, non-deterministic expressions may produce a different result for the
same input row, depending on which input rows have already been processed.

The optimizer tries its best not to change the input sequence
of non-deterministic expressions. For example, `df.select(...,
nonDeterministicExpr).filter...` can't do filter pushdown. An exception is
ANDed filter conditions. For `df.filter(nonDeterministic && cond)`, Spark still
pushes down `cond` even if it may change the input sequence of the first
condition. This is to respect the SQL semantic that filter conditions ANDed
together are order-insensitive. Users should write
`df.filter(nonDeterministic).filter(cond)` to guarantee the order.

For this particular problem, I think it's not only about UDF, but a general
problem with how Spark collapses Projects.
For example, `df.select('a * 5 as 'b).select('b + 2, 'b + 3)`,  Spark
optimizes it to `df.select('a * 5 + 2, 'a * 5 + 3)`, and executes 'a * 5
twice.

I think we should revisit this optimization and think about when we can
collapse.

On Thu, Nov 7, 2019 at 6:20 PM Rubén Berenguel  wrote:

> That was very interesting, thanks Enrico.
>
> Sean, IIRC it also prevents push down of the UDF in Catalyst in some cases.
>
> Regards,
>
> Ruben
>
> > On 7 Nov 2019, at 11:09, Sean Owen  wrote:
> >
> > Interesting, what does non-deterministic do except have this effect?
> > aside from the naming, it could be a fine use of this flag if that's
> > all it effectively does. I'm not sure I'd introduce another flag with
> > the same semantics just over naming. If anything 'expensive' also
> > isn't the right word, more like 'try not to evaluate multiple times'.
> >
> > Why isn't caching the answer? I realize it's big, but you can cache to
> > disk. This may be faster than whatever plan reordering has to happen
> > to evaluate once.
> >
> > Usually I'd say, can you redesign your UDF and code to be more
> > efficient too? or use a big a cluster if that's really what you need.
> >
> > At first look, no I don't think this Spark-side workaround for naming
> > for your use case is worthwhile. There are existing better solutions.
> >
> > On Thu, Nov 7, 2019 at 2:45 AM Enrico Minack 
> wrote:
> >>
> >> Hi all,
> >>
> >> Running expensive deterministic UDFs that return complex types,
> followed by multiple references to those results, causes Spark to evaluate
> the UDF multiple times per row. This has been reported and discussed
> before: SPARK-18748 SPARK-17728
> >>
> >>val f: Int => Array[Int]
> >>val udfF = udf(f)
> >>df
> >>  .select($"id", udfF($"id").as("array"))
> >>  .select($"array"(0).as("array0"), $"array"(1).as("array1"))
> >>
> >> A common approach to make Spark evaluate the UDF only once is to cache
> the intermediate result right after projecting the UDF:
> >>
> >>df
> >>  .select($"id", udfF($"id").as("array"))
> >>  .cache()
> >>  .select($"array"(0).as("array0"), $"array"(1).as("array1"))
> >>
> >> There are scenarios where this intermediate result is too big for the
> cluster to cache. Also this is bad design.
> >>
> >> The best approach is to mark the UDF as non-deterministic. Then Spark
> optimizes the query in a way that the UDF gets called only once per row,
> exactly what you want.
> >>
> >>val udfF = udf(f).asNondeterministic()
> >>
> >> However, stating a UDF is non-deterministic though it clearly is
> deterministic is counter-intuitive and makes your code harder to read.
> >>
> >> Spark should provide a better way to flag the UDF. Calling it expensive
> would be a better naming here.
> >>
> >>val udfF = udf(f).asExpensive()
> >>
> >> I understand that deterministic is a notion that Expression provides,
> and there is no equivalent to expensive that is understood by the
> optimizer. However, that asExpensive() could just set the
> ScalaUDF.udfDeterministic = deterministic && !expensive, which implements
> the best available approach behind a better naming.
> >>
> >> What are your thoughts on asExpensive()?
> >>
> >> Regards,
> >> Enrico
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 3.0.0-preview (RC2)

2019-11-01 Thread Wenchen Fan
The PR builder uses Hadoop 2.7 profile, which makes me think that 2.7 is
more stable and we should make releases using 2.7 by default.

+1

On Fri, Nov 1, 2019 at 7:16 AM Xiao Li  wrote:

> Spark 3.0 will still use the Hadoop 2.7 profile by default, I think.
> Hadoop 2.7 profile is much more stable than Hadoop 3.2 profile.
>
> On Thu, Oct 31, 2019 at 3:54 PM Sean Owen  wrote:
>
>> This isn't a big thing, but I see that the pyspark build includes
>> Hadoop 2.7 rather than 3.2. Maybe later we change the build to put in
>> 3.2 by default.
>>
>> Otherwise, the tests all seem to pass with JDK 8 / 11 with all
>> profiles enabled, so I'm +1 on it.
>>
>>
>> On Thu, Oct 31, 2019 at 1:00 AM Xingbo Jiang 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 3.0.0-preview.
>> >
>> > The vote is open until November 3 PST and passes if a majority +1 PMC
>> votes are cast, with
>> > a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.0.0-preview
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v3.0.0-preview-rc2 (commit
>> 007c873ae34f58651481ccba30e8e2ba38a692c4):
>> > https://github.com/apache/spark/tree/v3.0.0-preview-rc2
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1336/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-docs/
>> >
>> > The list of bug fixes going into 3.0.0 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.0.0?
>> > ===
>> >
>> > The current list of open tickets targeted at 3.0.0 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.0.0
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> [image: Databricks Summit - Watch the talks]
> 
>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Wenchen Fan
Do we have a limitation on the number of pre-built distributions? Seems
this time we need
1. hadoop 2.7 + hive 1.2
2. hadoop 2.7 + hive 2.3
3. hadoop 3 + hive 2.3

AFAIK we always build with JDK 8 (but make it JDK 11 compatible), so we don't
need to add the JDK version to the combination.

On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun 
wrote:

> Thank you for suggestion.
>
> Having `hive-2.3` profile sounds good to me because it's orthogonal to
> Hadoop 3.
> IIRC, originally, it was proposed in that way, but we put it under
> `hadoop-3.2` to avoid adding new profiles at that time.
>
> And, I'm wondering if you are considering additional pre-built
> distribution and Jenkins jobs.
>
> Bests,
> Dongjoon.
>
>
>
> On Fri, Nov 15, 2019 at 1:38 PM Cheng Lian  wrote:
>
>> Cc Yuming, Steve, and Dongjoon
>>
>> On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian 
>> wrote:
>>
>>> Similar to Xiao, my major concern about making Hadoop 3.2 the default
>>> Hadoop version is quality control. The current hadoop-3.2 profile
>>> covers too many major component upgrades, i.e.:
>>>
>>>- Hadoop 3.2
>>>- Hive 2.3
>>>- JDK 11
>>>
>>> We have already found and fixed some feature and performance regressions
>>> related to these upgrades. Empirically, I’m not surprised at all if more
>>> regressions are lurking somewhere. On the other hand, we do want help from
>>> the community to help us to evaluate and stabilize these new changes.
>>> Following that, I’d like to propose:
>>>
>>>1.
>>>
>>>Introduce a new profile hive-2.3 to enable (hopefully) less risky
>>>Hadoop/Hive/JDK version combinations.
>>>
>>>This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>>>profile, so that users may try out some less risky Hadoop/Hive/JDK
>>>combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
>>>face potential regressions introduced by the Hadoop 3.2 upgrade.
>>>
>>>Yuming Wang has already sent out PR #26533
>>> to exercise the Hadoop
>>>2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the
>>>hive-2.3 profile yet), and the result looks promising: the Kafka
>>>streaming and Arrow related test failures should be irrelevant to the 
>>> topic
>>>discussed here.
>>>
>>>After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a
>>>lot of difference between having Hadoop 2.7 or Hadoop 3.2 as the default
>>>Hadoop version. For users who are still using Hadoop 2.x in production,
>>>they will have to use a hadoop-provided prebuilt package or build
>>>Spark 3.0 against their own 2.x version anyway. It does make a difference
>>>for cloud users who don’t use Hadoop at all, though. And this probably 
>>> also
>>>helps to stabilize the Hadoop 3.2 code path faster since our PR builder
>>>will exercise it regularly.
>>>2.
>>>
>>>Defer Hadoop 2.x upgrade to Spark 3.1+
>>>
>>>I personally do want to bump our Hadoop 2.x version to 2.9 or even
>>>2.10. Steve has already stated the benefits very well. My worry here is
>>>still quality control: Spark 3.0 has already had tons of changes and 
>>> major
>>>component version upgrades that are subject to all kinds of known and
>>>hidden regressions. Having Hadoop 2.7 there provides us a safety net, 
>>> since
>>>it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 
>>> 2.7
>>>to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in 
>>> the
>>>next 1 or 2 Spark 3.x releases.
>>>
>>> Cheng
>>>
>>> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers  wrote:
>>>
 i get that cdh and hdp backport a lot and in that way left 2.7 behind.
 but they kept the public apis stable at the 2.7 level, because thats kind
 of the point. arent those the hadoop apis spark uses?

 On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
  wrote:

>
>
> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>  wrote:
>>
>>> It would be really good if the spark distributions shipped with
>>> later versions of the hadoop artifacts.
>>>
>>
>> I second this. If we need to keep a Hadoop 2.x profile around, why
>> not make it Hadoop 2.8 or something newer?
>>
>
> go for 2.9
>
>>
>> Koert Kuipers  wrote:
>>
>>> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2
>>> profile to latest would probably be an issue for us.
>>
>>
>> When was the last time HDP 2.x bumped their minor version of Hadoop?
>> Do we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>
>
> The internal builds of CDH and HDP are not those of ASF 2.7.x. A
> really large proportion of the later branch-2 patches are backported. 2,7
> was left behind a long time ago
>
>
>
>


Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
By default sort merge join is preferred over shuffle hash join, which is why
we haven't spent resources to implement codegen for it.

On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang  wrote:

> There are some cases where shuffle hash join performs even better than sort
> merge join.
>
> While, I noticed that ShuffledHashJoinExec does not implement
> CodegenSupport, is there any concern? And if there is any chance to improve
> the performance of ShuffledHashJoinExec?
>
>
>


Re: Fw:Re:Re: A question about radd bytes size

2019-12-02 Thread Wenchen Fan
You can only know the actual data size of your RDD in memory if you
serialize your data objects to binary. Otherwise you can only estimate
the size of the data objects in the JVM.
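
A small sketch of the difference, assuming a DataFrame df small enough to collect to the driver; SizeEstimator measures JVM object overhead, while rendering the rows to bytes gives a number closer to what ends up in a (pre-compression) file:

import org.apache.spark.util.SizeEstimator

// Estimated size of the deserialized Row objects in the JVM (object headers, pointers, padding).
val inMemoryEstimate = SizeEstimator.estimate(df.collect())

// Size of the rows rendered to bytes, before any columnar encoding or compression.
val serializedBytes = df.rdd
  .map(row => row.toString.getBytes("UTF-8").length.toLong)
  .fold(0L)(_ + _)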

On Tue, Dec 3, 2019 at 5:58 AM zhangliyun  wrote:

>
>
>
>
>
>
>  Forwarded message 
> From: "zhangliyun" 
> Date: 2019-12-03 05:56:55
> To: "Wenchen Fan" 
> Subject: Re:Re: A question about radd bytes size
>
> Hi Fan:
> thanks for the reply. I agree that how the data is stored decides the
> total bytes of the table file.
> In my experiment, I found that
> a sequence file with gzip compression is about 0.5x of the total byte size
> calculated in memory, and
> a parquet file with lzo compression is about 0.2x of the total byte size
> calculated in memory.
>
> Is the reason why the actual hive table size is less than the total size
> calculated in memory decided by the format (sequence, orc, parquet, etc.),
> by the compression algorithm, or by both?
>
>
> Meanwhile, can I directly use org.apache.spark.util.SizeEstimator.estimate(RDD)
> to estimate the total size of an RDD? I guess there is some difference
> between the actual size and the estimated size, so in which cases can we use
> it and in which cases can we not?
>
> Best Regards
> Kelly Zhang
>
>
>
>
> On 2019-12-02 15:54:19, "Wenchen Fan"  wrote:
>
> When we talk about bytes size, we need to specify how the data is stored.
> For example, if we cache the dataframe, then the bytes size is the number
> of bytes of the binary format of the table cache. If we write to hive
> tables, then the bytes size is the total size of the data files of the
> table.
>
> On Mon, Dec 2, 2019 at 1:06 PM zhangliyun  wrote:
>
>> Hi:
>>
>>  I want to get the total bytes of a DataFrame with the following function, but
>> when I insert the DataFrame into hive, I found the value of the function
>> is different from spark.sql.statistics.totalSize. The
>> spark.sql.statistics.totalSize is less than the result of the following
>> function getRDDBytes.
>>
>>def getRDDBytes(df:DataFrame):Long={
>>
>>
>>   df.rdd.getNumPartitions match {
>> case 0 =>
>>   0
>> case numPartitions =>
>>   val rddOfDataframe = 
>> df.rdd.map(_.toString().getBytes("UTF-8").length.toLong)
>>   val size = if (rddOfDataframe.isEmpty()) {
>> 0
>>   } else {
>> rddOfDataframe.reduce(_ + _)
>>   }
>>
>>   size
>>   }
>> }
>> Appreciate if you can provide your suggestion.
>>
>> Best Regards
>> Kelly Zhang
>>
>>
>>
>>
>>
>
>
>
>
>
>
>


Re: Slower than usual on PRs

2019-12-02 Thread Wenchen Fan
Sorry to hear that. Hope you get better soon!

On Tue, Dec 3, 2019 at 1:28 AM Holden Karau  wrote:

> Hi Spark dev folks,
>
> Just an FYI I'm out dealing with recovering from a motorcycle accident so
> my lack of (or slow) responses on PRs/docs is health related and please
> don't block on any of my reviews. I'll do my best to find some OSS cycles
> once I get back home.
>
> Cheers,
>
> Holden
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [DISCUSS] Consistent relation resolution behavior in SparkSQL

2019-12-04 Thread Wenchen Fan
+1, I think it's good for both end-users and Spark developers:
* for end-users, when they look up a table, they don't need to care which
command triggers the lookup, as the behavior is consistent in all places.
* for Spark developers, we may simplify the code quite a bit. For now we
have two code paths to look up tables: one for SELECT/INSERT and one for
other commands.

Thanks,
Wenchen
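
A small illustration of the inconsistency being discussed (hypothetical session; the name t and the choice of ALTER TABLE as a "behavior 2" command are assumptions for illustration, based on the two lookup behaviors described in the proposal):

spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("CREATE TEMPORARY VIEW t AS SELECT 1 AS id")

// Behavior 1 (temp view first): SELECT resolves the temporary view named t.
spark.sql("SELECT * FROM t").show()

// Behavior 2 (table/persistent view only): a command like ALTER TABLE goes straight to the
// catalog table t, so which object a statement touches currently depends on the command.
spark.sql("ALTER TABLE t SET TBLPROPERTIES ('note' = 'example')")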

On Mon, Dec 2, 2019 at 9:12 AM Terry Kim  wrote:

> Hi all,
>
> As discussed in SPARK-29900, Spark currently has two different relation
> resolution behaviors:
>
>1. Look up temp view first, then table/persistent view
>2. Look up table/persistent view
>
> The first behavior is used in SELECT, INSERT and a few commands that
> support temp views such as DESCRIBE TABLE, etc. The second behavior is used
> in most commands. Thus, it is hard to predict which relation resolution
> rule is being applied for a given command.
>
> I want to propose a consistent relation resolution behavior in which temp
> views are always looked up first before table/persistent view, as
> described more in detail in this doc: consistent relation resolution
> proposal
> <https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing>
> .
>
> Note that this proposal is a breaking change, but the impact should be
> minimal since this applies only when there are temp views and tables with
> the same name.
>
> Any feedback will be appreciated.
>
> I also want to thank Wenchen Fan, Ryan Blue, Burak Yavuz, and Dongjoon
> Hyun for guidance and suggestion.
>
> Regards,
> Terry
>
>
> <https://issues.apache.org/jira/browse/SPARK-29900>
>


Re: [DISCUSS] Add close() on DataWriter interface

2019-12-10 Thread Wenchen Fan
PartitionReader extends Closeable, so it seems reasonable to me to do the same
for DataWriter.
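
A sketch of what the proposal amounts to, not the actual Spark interface; the WriterCommitMessage stand-in mirrors the existing DSv2 name, and the try/catch/finally caller is illustrative:

// Stand-in for the existing DSv2 commit message type.
trait WriterCommitMessage extends Serializable

// Sketch: DataWriter extends Closeable, so close() becomes the single place for resource
// cleanup and is called exactly once, whether commit() or abort() ran.
trait DataWriter[T] extends java.io.Closeable {
  def write(record: T): Unit
  def commit(): WriterCommitMessage
  def abort(): Unit
}

// The caller can then guarantee cleanup regardless of which path was taken:
def runTask[T](writer: DataWriter[T], records: Iterator[T]): WriterCommitMessage =
  try {
    records.foreach(writer.write)
    writer.commit()
  } catch {
    case e: Throwable =>
      writer.abort()
      throw e
  } finally {
    writer.close()
  }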

On Wed, Dec 11, 2019 at 1:35 PM Jungtaek Lim 
wrote:

> Hi devs,
>
> I'd like to propose to add close() on DataWriter explicitly, which is the
> place for resource cleanup.
>
> The rationale for the proposal is the lifecycle of DataWriter.
> If the scaladoc of DataWriter is correct, the lifecycle of a DataWriter
> instance ends at either commit() or abort(). That makes datasource
> implementors feel they can place resource cleanup in both places, but
> abort() can be called when commit() fails, so they have to ensure they
> don't do double-cleanup if cleanup is not idempotent.
>
> I've checked some callers to see whether they can apply
> "try-catch-finally" to ensure close() is called at the end of the DataWriter
> lifecycle, and it looks like they can, but I might be missing something.
>
> What do you think? It would bring a backward-incompatible change, but given
> that the interface is marked as Evolving and we're making backward-incompatible
> changes in Spark 3.0, I feel it may not matter.
>
> Would love to hear your thoughts.
>
> Thanks in advance,
> Jungtaek Lim (HeartSaVioR)
>
>
>


Re: Release Apache Spark 2.4.5 and 2.4.6

2019-12-10 Thread Wenchen Fan
Sounds good. Thanks for bringing this up!

On Wed, Dec 11, 2019 at 3:18 PM Takeshi Yamamuro 
wrote:

> That looks nice, thanks!
> I checked the previous v2.4.4 release; it has around 130 commits (from
> 2.4.3 to 2.4.4), so
> I think branch-2.4 already has enough commits for the next release.
>
> A commit list from 2.4.3 to 2.4.4;
>
> https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...7955b3962ac46b89564e0613db7bea98a1478bf2
>
> Bests,
> Takeshi
>
> On Tue, Dec 10, 2019 at 3:32 AM Sean Owen  wrote:
>
>> Sure, seems fine. The release cadence slows down in a branch over time
>> as there is probably less to fix, so Jan-Feb 2020 for 2.4.5 and
>> something like middle or Q3 2020 for 2.4.6 is a reasonable
>> expectation. It might plausibly be the last 2.4.x release but who
>> knows.
>>
>> On Mon, Dec 9, 2019 at 12:29 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > Along with the discussion on 3.0.0, I'd like to discuss about the next
>> releases on `branch-2.4`.
>> >
>> > As we know, `branch-2.4` is our LTS branch and also there exists some
>> questions on the release plans. More releases are important not only for
>> the latest K8s version support, but also for delivering important bug fixes
>> regularly (at least until 3.x becomes dominant.)
>> >
>> > In short, I'd like to propose the followings.
>> >
>> > 1. Apache Spark 2.4.5 release (2020 January)
>> > 2. Apache Spark 2.4.6 release (2020 July)
>> >
>> > Of course, we can adjust the schedule.
>> > This aims to have a pre-defined cadence in order to give release
>> managers to prepare.
>> >
>> > Bests,
>> > Dongjoon.
>> >
>> > PS. As of now, `branch-2.4` has 135 additional patches after `2.4.4`.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Wenchen Fan
Can we make the JDBCDialect a public API that users can plug in? It looks
like an endless job to make sure the Spark JDBC source supports all databases.
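
For reference, the existing (developer-facing) hook already allows something close to a plugin; a minimal sketch, where the Vertica URL prefix and the VARCHAR length are chosen only for illustration:

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// A user-side dialect: map Spark types to names the target database understands.
object VerticaDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:vertica")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(65000)", Types.VARCHAR))  // length is illustrative
    case _          => None                                             // fall back to defaults
  }
}

// Registered once, the dialect is picked up for any matching JDBC URL.
JdbcDialects.registerDialect(VerticaDialect)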

On Wed, Dec 11, 2019 at 11:41 PM Xiao Li  wrote:

> You can follow how we test the other JDBC dialects. All JDBC dialects
> require the docker integration tests.
> https://github.com/apache/spark/tree/master/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc
>
>
> On Wed, Dec 11, 2019 at 7:33 AM Bryan Herger 
> wrote:
>
>> Hi, to answer both questions raised:
>>
>>
>>
>> Though Vertica is derived from Postgres, Vertica does not recognize type
>> names TEXT, NVARCHAR, BYTEA, ARRAY, and also handles DATETIME differently
>> enough to cause issues.  The major changes are to use type names and date
>> format supported by Vertica.
>>
>>
>>
>> For testing, I have a SQL script plus Scala and PySpark scripts, but
>> these require a Vertica database to connect, so automated testing on a
>> build server wouldn’t work.  It’s possible to include my test scripts and
>> directions to run manually, but not sure where in the repo that would go.
>> If automated testing is required, I can ask our engineers whether there
>> exists something like a mockito that could be included.
>>
>>
>>
>> Thanks, Bryan H
>>
>>
>>
>> *From:* Xiao Li [mailto:lix...@databricks.com]
>> *Sent:* Wednesday, December 11, 2019 10:13 AM
>> *To:* Sean Owen 
>> *Cc:* Bryan Herger ; dev@spark.apache.org
>> *Subject:* Re: I would like to add JDBCDialect to support Vertica
>> database
>>
>>
>>
>> How can the dev community test it?
>>
>>
>>
>> Xiao
>>
>>
>>
>> On Wed, Dec 11, 2019 at 6:52 AM Sean Owen  wrote:
>>
>> It's probably OK, IMHO. The overhead of another dialect is small. Are
>> there differences that require a new dialect? I assume so and might
>> just be useful to summarize them if you open a PR.
>>
>> On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
>>  wrote:
>> >
>> > Hi, I am a Vertica support engineer, and we have open support requests
>> around NULL values and SQL type conversion with DataFrame read/write over
>> JDBC when connecting to a Vertica database.  The stack traces point to
>> issues with the generic JDBCDialect in Spark-SQL.
>> >
>> > I saw that other vendors (Teradata, DB2...) have contributed a
>> JDBCDialect class to address JDBC compatibility, so I wrote up a dialect
>> for Vertica.
>> >
>> > The changeset is on my fork of apache/spark at
>> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
>> >
>> > I have tested this against Vertica 9.3 and found that this changeset
>> addresses both issues reported to us (issue with NULL values - setNull() -
>> for valid java.sql.Types, and String to VARCHAR conversion)
>> >
>> > Is this an acceptable change?  If so, how should I go about submitting a
>> pull request?
>> >
>> > Thanks, Bryan Herger
>> > Vertica Solution Engineer
>> >
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
>>
>> [image: Databricks Summit - Watch the talks]
>> 
>>
>
>
> --
> [image: Databricks Summit - Watch the talks]
> 
>


Re: DataSourceWriter V2 Api questions

2019-12-05 Thread Wenchen Fan
I also share the concerns about "writing twice", which hurts performance a
lot. What's worse, the final write may not be scalable, like writing the
staging table to the final table.

If the sink itself doesn't support global transactions, but only local
transactions (e.g. Kafka), using staging tables seems the only way to make
the write atomic. But if we accept "eventually consistent", we can use 2PC
to avoid "writing twice", i.e. wait for all the tasks to finish
writing data, then ask them to commit at the same time.

There are 2 open questions we need to answer:
1. How to make sure all tasks are launched at the same time to implement
2PC? Barrier execution?
2. To reach "eventually consistent", we must retry the job until it succeeds.
How shall we guarantee the job retry?
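
A very rough sketch of question 1 using the barrier execution mode that already exists for RDDs (Spark 2.4+); prepareLocalTransaction and commitLocalTransaction are hypothetical sink-specific hooks (no-ops here), and the retry question is ignored entirely:

import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.Row

// Hypothetical hooks: write the data without committing, then commit the local transaction.
def prepareLocalTransaction(rows: Iterator[Row]): Unit = rows.foreach(_ => ())
def commitLocalTransaction(): Unit = ()

val committed = df.rdd.barrier().mapPartitions { rows =>
  val ctx = BarrierTaskContext.get()
  prepareLocalTransaction(rows)   // phase 1: every task writes its partition
  ctx.barrier()                   // wait until every task has finished writing
  commitLocalTransaction()        // phase 2: all tasks commit "at the same time"
  Iterator.single(true)
}.collect()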

On Fri, Oct 19, 2018 at 12:25 PM Jungtaek Lim  wrote:

> Sorry to resurrect this old and long thread: we have been struggling with
> Kafka end-to-end exactly-once support, and couldn't find any approach which
> can get both things, transactional and scalable.
>
> If we give up scalability, we can let writers write to a staging topic
> within individual transactions, and once all writers succeed in writing,
> let the coordinator re-read the topic and write to the final topic in its
> own transaction. (The coordinator should be able to get the offsets to read
> and skip offsets left over from failed batch attempts, so it shouldn't read
> in read-committed mode.)
>
> This will result in writing the data twice, as well as all rows being
> published from a single thread in the final step, which in most cases we
> can't tolerate. Moreover, repartitioning to 1 and writing with transactions
> enabled might do the same thing better: staging topic vs. shuffle.
>
> If we relax the transactional guarantee (I mean the "global transaction") and
> allow "eventually consistent" - allowing part(s) of the output to be visible
> to the client side at a specific point (2PC has the same limitation) - there
> could be some approaches which might be tricky but work.
>
> Even if external storages support transactions, that normally doesn't mean
> they support global or grouped transactions. The boundary of a transaction
> is mostly limited to one client (connection), and once we want to write from
> multiple tasks, we encounter the same issue on these external storages,
> except for those (like HDFS) which can "move" data from staging to the final
> destination within the storage.
>
> So could we consider loosening the contract on the DataSource V2 writer, or
> having a new representation of the guarantee for such cases, so it is not
> "fully transactional" but another kind of "exactly-once" rather than
> "at-least-once"?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Fri, Sep 14, 2018 at 12:08 AM, Thakrar, Jayesh wrote:
>
>> Agree on the “constraints” when working with Cassandra.
>>
>> But remember, this is a weak attempt to make two non-transactional
>> systems appear to the outside world as a transactional system.
>>
>> Scaffolding/plumbing/abstractions will have to be created in the form of
>> say, a custom data access layer.
>>
>>
>>
>> Anyway, Ross is trying to get some practices used by other adopters of
>> the V2 API while trying to implement a driver/connector for MongoDB.
>>
>>
>>
>> Probably views can be used similar to partitions in mongoDB?
>>
>> Essentially each batch load goes into a separate mongoDB table and will
>> result in view redefinition after a successful load.
>>
>> And finally to avoid too many tables in a view, you may have to come up
>> with a separate process to merge the underlying tables on a periodic basis.
>>
>> It gets messy and probably moves you towards write-once-only tables,
>> etc.
>>
>>
>>
>> Finally using views in a generic mongoDB connector may not be good and
>> flexible enough.
>>
>>
>>
>>
>>
>> *From: *Russell Spitzer 
>> *Date: *Tuesday, September 11, 2018 at 9:58 AM
>> *To: *"Thakrar, Jayesh" 
>> *Cc: *Arun Mahadevan , Jungtaek Lim ,
>> Wenchen Fan , Reynold Xin ,
>> Ross Lawley , Ryan Blue , dev <
>> dev@spark.apache.org>, "dbis...@us.ibm.com" 
>>
>>
>> *Subject: *Re: DataSourceWriter V2 Api questions
>>
>>
>>
>> That only works assuming that Spark is the only client of the table. It
>> will be impossible to force an outside user to respect the special metadata
>> table when reading so they will still see all of the data in transit.
>> Additionally this would force the incoming data to only be written into new
>>

Re: branch-3.0 vs branch-3.0-preview (?)

2019-10-16 Thread Wenchen Fan
Does anybody remember what we did for 2.0 preview? Personally I'd like to
avoid cutting branch-3.0 right now, otherwise we need to merge PRs into two
branches in the following several months.

Thanks,
Wenchen

On Wed, Oct 16, 2019 at 3:01 PM Xingbo Jiang  wrote:

> Hi Dongjoon,
>
> I'm not sure about the best practice for maintaining a preview release
> branch. Since new features might still go into Spark 3.0 after the preview
> release, I guess it might make more sense to have separate branches for
> 3.0.0 and 3.0-preview.
>
> However, I'm open to both solutions; if we really want to reuse the branch
> to also release Spark 3.0.0, then I would be happy to create a new one.
>
> Thanks!
>
> Xingbo
>
> On Wed, Oct 16, 2019 at 6:26 AM, Dongjoon Hyun  wrote:
>
>> Hi,
>>
>> It seems that we have `branch-3.0-preview` branch.
>>
>> https://github.com/apache/spark/commits/branch-3.0-preview
>>
>> Can we have `branch-3.0` instead of `branch-3.0-preview`?
>>
>> We can tag `v3.0.0-preview` on `branch-3.0` and continue to use for
>> `v3.0.0` later.
>>
>> Bests,
>> Dongjoon.
>>
>


Re: Re: A question about broadcast nest loop join

2019-10-23 Thread Wenchen Fan
Ah sorry, I made a mistake: "Spark can only pick BroadcastNestedLoopJoin to
implement left/right join" should be "left/right non-equi join".
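
For the query in the original question, the practical rewrite Herman suggests (quoted below) looks like this; NOT IN compiles to a null-aware anti join that Spark, as of this thread, can only execute with BroadcastNestedLoopJoin, while NOT EXISTS with an equality predicate becomes a plain left anti join that can use non-broadcast strategies, at the cost of slightly different NULL handling:

// Null-aware anti join: planned as BroadcastNestedLoopJoin regardless of autoBroadcastJoinThreshold.
spark.sql(
  "SELECT a.key1 FROM testdata1 a WHERE a.key1 NOT IN (SELECT key3 FROM testdata3)").explain()

// Left anti join on an equi-condition: can be planned as a sort merge (or shuffled hash) join.
spark.sql(
  """SELECT a.key1 FROM testdata1 a
    |WHERE NOT EXISTS (SELECT 1 FROM testdata3 b WHERE a.key1 = b.key3)""".stripMargin).explain()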

On Thu, Oct 24, 2019 at 6:32 AM zhangliyun  wrote:

>
> Hi Herman:
>I guess what you mentioned before
> ```
> if you are OK with slightly different NULL semantics then you could use NOT
> EXISTS(subquery). The latter should perform a lot better.
>
> ```
> means that a NULL key1 in the left table will be retained if a NULL key2 is
> not found in the right table (join condition: left.key1 = right.key2) under
> EXISTS semantics, while this will not happen under
> IN semantics. If my understanding is wrong, please tell me.
>
>
>
> Best Regards.
>
> Kelly Zhang
>
>
>
>
>
>
>
>
> On 2019-10-23 19:16:34, "Herman van Hovell"  wrote:
>
> In some cases BroadcastNestedLoopJoin is the only viable join method. In
> your example for instance you are using a non-equi join condition and BNLJ
> is the only method that works in that case. This is also the reason why you
> can't disable it using the spark.sql.autoBroadcastJoinThreshold
> configuration.
>
> Such a plan is generally generated by using a NOT IN (subquery). If you
> are OK with slightly different NULL semantics then you could use NOT
> EXISTS(subquery). The latter should perform a lot better.
>
> On Wed, Oct 23, 2019 at 12:02 PM zhangliyun  wrote:
>
>> Hi all:
>> i want to ask a question about broadcast nestloop join? from google i
>> know, that
>>  left outer/semi join and right outer/semi join will use broadcast
>> nestloop.
>>   and in some cases, when the input data is very small, it is suitable to
>> use. so here
>>   how to define the input data very small? what parameter decides the
>> threshold?  I just want to disable it ( i found that   set
>> spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1
>>  from testdata1 as a where a.key1 not in (select key3 from testdata3) )
>>
>>
>> ```
>>
>> explain cost select a.key1  from testdata1 as a where a.key1 not in
>> (select key3 from testdata3);
>>
>> == Physical Plan ==
>> *(1) Project [key1#90]
>> +- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) ||
>> isnull((key1#90 = key3#92)))
>>:- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`,
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
>>+- BroadcastExchange IdentityBroadcastMode
>>   +- HiveTableScan [key3#92], HiveTableRelation
>> `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> [key3#92, value3#93]
>>
>> ```
>>
>>   my questions are:
>>   1. why, in a NOT IN subquery, BroadcastNestedLoopJoin is still used even
>> if I set spark.sql.autoBroadcastJoinThreshold= -1
>>   2. which Spark parameter decides whether to enable/disable
>> BroadcastNestedLoopJoin.
>>
>>
>>
>> Appreciate if you have suggestion
>>
>>
>> Best Regards
>>
>> Kelly Zhang
>>
>>
>>
>>
>
>
>
>


Re: A question about broadcast nest loop join

2019-10-23 Thread Wenchen Fan
I haven't looked into your query yet, just want to let you know that: Spark
can only pick BroadcastNestedLoopJoin to implement left/right join. If the
table is very big, then OOM happens.

Maybe there is an algorithm to implement left/right join in a distributed
environment without broadcast, but currently Spark is only able to deal
with it using broadcast.

On Wed, Oct 23, 2019 at 6:02 PM zhangliyun  wrote:

> Hi all:
> I want to ask a question about broadcast nested loop join. From Google I
> know that
>  left outer/semi join and right outer/semi join will use broadcast
> nested loop join,
>   and in some cases, when the input data is very small, it is suitable to
> use. So here,
>   how is "the input data is very small" defined? What parameter decides the
> threshold?  I just want to disable it (I found that setting
> spark.sql.autoBroadcastJoinThreshold= -1 does not work for the SQL: select a.key1
>  from testdata1 as a where a.key1 not in (select key3 from testdata3))
>
>
> ```
>
> explain cost select a.key1  from testdata1 as a where a.key1 not in
> (select key3 from testdata3);
>
> == Physical Plan ==
> *(1) Project [key1#90]
> +- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) ||
> isnull((key1#90 = key3#92)))
>:- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
>+- BroadcastExchange IdentityBroadcastMode
>   +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]
>
> ```
>
>   my question is
>   1. why in not in subquery , BroadcastNestedLoopJoin is still used even i
> set spark.sql.autoBroadcastJoinThreshold= -1
>   2. which spark parameter  decides enable/disable BroadcastNestedLoopJoin.
>
>
>
> Appreciate if you have suggestion
>
>
> Best Regards
>
> Kelly Zhang
>
>
>
>


Re: Apache Spark 3.0 timeline

2019-10-16 Thread Wenchen Fan
>  I figure we are probably moving to code freeze late in the year, release
early next year?

Sounds good!

On Thu, Oct 17, 2019 at 7:51 AM Dongjoon Hyun 
wrote:

> Thanks! That sounds reasonable. I'm +1. :)
>
> Historically, 2.0-preview was in May 2016 and 2.0 was in July 2016. 3.0
> seems to be different.
>
> Bests,
> Dongjoon
>
> On Wed, Oct 16, 2019 at 16:38 Sean Owen  wrote:
>
>> I think the branch question is orthogonal but yeah we can probably make
>> an updated statement about 3.0 release. Clearly a preview is imminent. I
>> figure we are probably moving to code freeze late in the year, release
>> early next year? Any better ideas about estimates to publish? They aren't
>> binding.
>>
>> On Wed, Oct 16, 2019, 4:01 PM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> I saw the following comment from Wenchen in the previous email thread.
>>>
>>> > Personally I'd like to avoid cutting branch-3.0 right now, otherwise
>>> we need to merge PRs into two branches in the following several months.
>>>
>>> Since 3.0.0-preview seems to be already here for RC, can we update our
>>> timeline in the official web page accordingly.
>>>
>>> http://spark.apache.org/versioning-policy.html
>>>
>>> -
>>> Spark 2.4 Release Window
>>> Date Event
>>> Mid Aug 2018   Code freeze. Release branch cut.
>>> Late Aug 2018   QA period. Focus on bug fixes, tests, stability and
>>> docs. Generally, no new features merged.
>>> Early Sep 2018   Release candidates (RC), voting, etc. until final
>>> release passes
>>>
>>


Re: DataSourceV2 sync notes - 2 October 2019

2019-10-18 Thread Wenchen Fan
Hi Ryan,

Thanks for summarizing and sending out the notes! I've created the JIRA
ticket to add v2 statements for all the commands that need to resolve a
table: https://issues.apache.org/jira/browse/SPARK-29481

Contributions to it are appreciated!

Thanks,
Wenchen

On Fri, Oct 11, 2019 at 7:05 AM Ryan Blue  wrote:

> Here are my notes from last week's DSv2 sync.
>
> *Attendees*:
>
> Ryan Blue
> Terry Kim
> Wenchen Fan
>
> *Topics*:
>
>- SchemaPruning only supports Parquet and ORC?
>- Out of order optimizer rules
>- 3.0 work
>   - Rename session catalog to spark_catalog
>   - Finish TableProvider update to avoid another API change: pass all
>   table config from metastore
>   - Catalog behavior fix:
>   https://issues.apache.org/jira/browse/SPARK-29014
>   - Stats push-down optimization:
>   https://github.com/apache/spark/pull/25955
>   - DataFrameWriter v1/v2 compatibility progress
>- Open PRs
>   - Update identifier resolution and table resolution:
>   https://github.com/apache/spark/pull/25747
>   - Expose SerializableConfiguration:
>   https://github.com/apache/spark/pull/26005
>   - Early DSv2 pushdown: https://github.com/apache/spark/pull/25955
>
> *Discussion*:
>
>- Update identifier and table resolution
>   - Wenchen: Will not handle SPARK-29014, it is a pure refactor
>   - Ryan: I think this should separate the v2 rules from the v1
>   fallback, to keep table and identifier resolution separate. The only 
> time
>   that table resolution needs to be done at the same time is for v1 
> fallback.
>   - This was merged last week
>- Update to use spark_catalog
>   - Wenchen: this will be a separate PR.
>   - Now open: https://github.com/apache/spark/pull/26071
>- Early DSv2 pushdown
>   - Ryan: this depends on fixing a few more tests. To validate there
>   are no calls to computeStats with the DSv2 relation, I’ve temporarily
>   removed the method. Other than a few remaining test failures where the 
> old
>   relation was expected, it looks like there are no uses of computeStats
>   before early pushdown in the optimizer.
>   - Wenchen: agreed that the batch was in the correct place in the
>   optimizer
>   - Ryan: once tests are passing, will add the computeStats
>   implementation back with Utils.isTesting to fail during testing when 
> called
>   before early pushdown, but will not fail at runtime
>- Wenchen: when using v2, there is no way to configure custom options
>for a JDBC table. For v1, the table was created and stored in the session
>catalog, at which point Spark-specific properties like parallelism could be
>stored. In v2, the catalog is the source of truth, so tables don’t get
>created in the same way. Options are only passed in a create statement.
>   - Ryan: this could be fixed by allowing users to pass options as
>   table properties. We mix the two today, but if we used a prefix for 
> table
>   properties, “options.”, then you could use SET TBLPROPERTIES to get 
> around
>   this. That’s also better for compatibility. I’ll open a PR for this.
>   - Ryan: this could also be solved by adding an OPTIONS clause or
>   hint to SELECT
>- Wenchen: There are commands without v2 statements. We should add v2
>statements to reject non-v1 uses.
>   - Ryan: Doesn’t the parser only parse up to 2 identifiers for
>   these? That would handle the majority of cases
>   - Wenchen: Yes, but there is still a problem for identifiers with 1
>   part in v2 catalogs, like catalog.table. Commands that don’t support v2
>   will use catalog.table in the v1 catalog.
>   - Ryan: Sounds like a good plan to update the parser and add
>   statements for these. Do we have a list of commands to update?
>   - Wenchen: REFRESH TABLE, ANALYZE TABLE, ALTER TABLE PARTITION,
>   etc. Will open an umbrella JIRA with a list.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


[DISCUSS] PostgreSQL dialect

2019-11-26 Thread Wenchen Fan
Hi all,

Recently we started an effort to achieve feature parity between Spark and
PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764

This is going very well. We've added many missing features (parser rules,
built-in functions, etc.) to Spark, and also corrected several
inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
Many thanks to all the people that contribute to it!

There are several cases when adding a PostgreSQL feature:
1. Spark doesn't have this feature: just add it.
2. Spark has this feature, but the behavior is different:
2.1 Spark's behavior doesn't make sense: change it to follow SQL
standard and PostgreSQL, with a legacy config to restore the behavior.
2.2 Spark's behavior makes sense but violates SQL standard: change the
behavior to follow SQL standard and PostgreSQL, when the ansi mode is
enabled (default false).
2.3 Spark's behavior makes sense and doesn't violate the SQL standard: add
the PostgreSQL behavior under the PostgreSQL dialect (the default is the
Spark native dialect).

The PostgreSQL dialect itself is a good idea. It can help users to migrate
PostgreSQL workloads to Spark. Other databases have this strategy too. For
example, DB2 provides an Oracle dialect.

However, there are so many differences between Spark and PostgreSQL,
including SQL parsing, type coercion, function/operator behavior, data
types, etc. I'm afraid that we may spend a lot of effort on it, and make
the Spark codebase pretty complicated, but still not be able to provide a
usable PostgreSQL dialect.

Furthermore, it's not clear to me how many users have the requirement of
migrating PostgreSQL workloads. I think it's much more important to make
Spark ANSI-compliant first, which doesn't need that much work.

Recently I've seen multiple PRs adding PostgreSQL cast functions, while our
own cast function is not ANSI-compliant yet. This makes me think that we
should do something to properly prioritize ANSI mode over other dialects.

Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it from
the codebase before it's too late. Currently we only have 3 features under
the PostgreSQL dialect:
1. When casting string to boolean, `t`, `tr`, `tru`, `yes`, etc. are also
allowed as true strings.
2. `date - date` returns an interval in Spark (the SQL standard behavior), but
returns an int in PostgreSQL.
3. `int / int` returns a double in Spark, but returns an int in PostgreSQL
(there is no standard here).
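
For reference, here is a minimal Scala sketch of the three behaviors listed above; the results restated in the comments come from the list itself, and the commented-out dialect config line is an assumption about the current preview (it is exactly the switch this proposal suggests removing).

    // Minimal sketch of the three dialect-only differences listed above.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // Assumed config name for the experimental dialect in the current preview;
    // this is the switch the proposal above suggests removing.
    // spark.conf.set("spark.sql.dialect", "PostgreSQL")

    // 1. string -> boolean cast: PostgreSQL also accepts prefixes like 't', 'tru'.
    spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()

    // 2. date - date: an interval in Spark (SQL standard), an int in PostgreSQL.
    spark.sql("SELECT DATE '2019-11-26' - DATE '2019-11-20'").show()

    // 3. int / int: a double in Spark, an int in PostgreSQL.
    spark.sql("SELECT 7 / 2").show()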

We should still add PostgreSQL features that Spark doesn't have, or where Spark's
behavior violates the SQL standard. But for the others, let's just update the
answer files of the PostgreSQL tests.

Any comments are welcome!

Thanks,
Wenchen


Re: A question about radd bytes size

2019-12-01 Thread Wenchen Fan
When we talk about bytes size, we need to specify how the data is stored.
For example, if we cache the DataFrame, then the bytes size is the number
of bytes of the binary format of the table cache. If we write to Hive
tables, then the bytes size is the total size of the data files of the
table.
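
As a rough illustration of that distinction, the sketch below shows two other size measurements for the same DataFrame; the table name is made up, Hive support is assumed, and none of these numbers are expected to match the RDD-string measurement in the quoted question.

    // Rough sketch: two different "bytes size" notions for the same DataFrame.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val df = spark.range(0, 1000000L).toDF("id")

    // The optimizer's size estimate for the plan (not a physical on-disk size).
    println(df.queryExecution.optimizedPlan.stats.sizeInBytes)

    // On-disk size of the data files after writing to a table; this is what
    // spark.sql.statistics.totalSize reflects ("kelly_demo" is just an example name).
    df.write.mode("overwrite").saveAsTable("kelly_demo")
    spark.sql("DESCRIBE TABLE EXTENDED kelly_demo").show(100, truncate = false)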

On Mon, Dec 2, 2019 at 1:06 PM zhangliyun  wrote:

> Hi:
>
>  I want to get the total bytes of a DataFrame using the following function, but
> when I insert the DataFrame into Hive, I found the value of the function
> is different from spark.sql.statistics.totalSize. The
> spark.sql.statistics.totalSize is less than the result of the following
> function getRDDBytes.
>
>    def getRDDBytes(df: DataFrame): Long = {
>      df.rdd.getNumPartitions match {
>        case 0 =>
>          0
>        case numPartitions =>
>          val rddOfDataframe =
>            df.rdd.map(_.toString().getBytes("UTF-8").length.toLong)
>          val size = if (rddOfDataframe.isEmpty()) {
>            0
>          } else {
>            rddOfDataframe.reduce(_ + _)
>          }
>          size
>      }
>    }
> Appreciate if you can provide your suggestion.
>
> Best Regards
> Kelly Zhang
>
>
>
>
>


Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-07 Thread Wenchen Fan
AFAIK there is no public streaming data source API before DS v2. The Source
and Sink API is private and is only for builtin streaming sources. Advanced
users can still implement custom stream sources with private Spark APIs
(you can put your classes under the org.apache.spark.sql package to access
the private methods).

That said, DS v2 is the first public streaming data source API. It's really
hard to design a stable, efficient and flexible data source API that is
unified between batch and streaming. DS v2 has evolved a lot in the master
branch and hopefully there will be no big breaking changes anymore.
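
To make the package workaround above concrete, here is a rough sketch (not an endorsement) that combines it with SQLContext.internalCreateDataFrame, which is referenced later in this thread; the helper name is made up and the private method's signature may differ across 2.4.x releases.

    // Rough sketch of the "put your class under org.apache.spark.sql" workaround.
    // internalCreateDataFrame is private[sql], so this helper must live in that package.
    package org.apache.spark.sql

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.types.StructType

    object StreamingDataFrameHelper {
      // Builds a DataFrame with isStreaming = true, which is what
      // MicroBatchExecution's assert(batch.isStreaming, ...) checks for.
      def create(sqlContext: SQLContext, rows: RDD[InternalRow], schema: StructType): DataFrame = {
        sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
      }
    }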


On Sat, Oct 5, 2019 at 12:24 PM Jungtaek Lim 
wrote:

> I remembered the actual case from developer who implements custom data
> source.
>
>
> https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E
>
> Quoting here:
> We started implementing DSv2 in the 2.4 branch, but quickly discovered
> that the DSv2 in 3.0 was a complete breaking change (to the point where it
> could have been named DSv3 and it wouldn’t have come as a surprise). Since
> the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided
> to fall back into DSv1 in order to ease the future transition to Spark 3.
>
> Given that DSv2 for Spark 2.x and 3.x have diverged a lot, a realistic solution
> for dealing with the DSv2 breaking change is having DSv1 as a temporary solution,
> even after DSv2 for 3.x becomes available. They need some time to make the transition.
>
> I would file an issue to support streaming data source on DSv1 and submit
> a patch unless someone objects.
>
>
> On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski  wrote:
>
>> Hi Jungtaek,
>>
>> Thanks a lot for your very prompt response!
>>
>> > Looks like it's missing, or intended to force custom streaming source
>> implemented as DSv2.
>>
>> That's exactly my understanding = no more DSv1 data sources. That however
>> is not consistent with the official message, is it? Spark 2.4.4 does not
>> actually say "we're abandoning DSv1", and people could not really want to
>> jump on DSv2 since it's not recommended (unless I missed that).
>>
>> I love surprises (as that's where people pay more for consulting :)), but
>> not necessarily before public talks (with one at SparkAISummit in two
>> weeks!) Gonna be challenging! Hope I won't spread a wrong word.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>> The Internals of Spark Structured Streaming
>> https://bit.ly/spark-structured-streaming
>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>>
>> On Wed, Oct 2, 2019 at 6:16 AM Jungtaek Lim 
>> wrote:
>>
>>> Looks like it's missing, or intended to force custom streaming source
>>> implemented as DSv2.
>>>
>>> I'm not sure the Spark community wants to expand the DSv1 API; I could propose
>>> the change if we get some support here.
>>>
>>> To the Spark community: given we are bringing major changes to DSv2, someone may
>>> want to rely on DSv1 while the transition from the old DSv2 to the new DSv2 happens
>>> and the new DSv2 gets stabilized. Would we like to provide the necessary changes to
>>> DSv1?
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski  wrote:
>>>
 Hi,

 I think I've got stuck and without your help I won't move any further.
 Please help.

 I'm on Spark 2.4.4 and am developing a streaming Source (DSv1,
 MicroBatch), and in the getBatch phase, when asked for a DataFrame, there is
 this assert [1] that I can't seem to get past with any DataFrame I've managed
 to create, as it's not streaming.

   assert(batch.isStreaming,
     s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
       s"${batch.queryExecution.logical}")

 [1]
 https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441

 All I could find is private[sql],
 e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3]

 [2]
 https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
 [3]
 https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81

 Pozdrawiam,
 Jacek Laskowski
 
 https://about.me/JacekLaskowski
 The Internals of Spark SQL https://bit.ly/spark-sql-internals
 The Internals of Spark Structured Streaming
 https://bit.ly/spark-structured-streaming
 The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
 Follow me at https://twitter.com/jaceklaskowski




Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-07 Thread Wenchen Fan
+1

I think this is the most reasonable default behavior among the three.
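
To make the three policies (quoted below) concrete, here is a minimal sketch of switching between them; the table definition is made up and the behaviors noted in the comments simply restate the quoted proposal, so treat the details as illustrative.

    // Minimal sketch of the store assignment policies described below.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    spark.sql("CREATE TABLE target(b BYTE) USING parquet")

    // ANSI (proposed default): disallows unreasonable casts such as string -> int,
    // and throws a runtime exception on overflow.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")

    // Legacy (current 2.x behavior): any valid Cast is accepted; inserting 257
    // into a BYTE column keeps only the low-order bits, i.e. 1.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
    spark.sql("INSERT INTO target VALUES (257)")

    // Strict: rejects any potentially lossy conversion, e.g. double -> int.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "STRICT")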

On Mon, Oct 7, 2019 at 6:06 PM Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> +1 (non-binding)
>
> I have been following this standardization effort and I think it is sound
> and it provides the needed flexibility via the option.
>
> Best regards,
> Alessandro
>
> On Mon, 7 Oct 2019 at 10:24, Gengliang Wang 
> wrote:
>
>> Hi everyone,
>>
>> I'd like to call for a new vote on SPARK-28885 "Follow ANSI store
>> assignment rules in table insertion by default" after revising the ANSI
>> store assignment policy (SPARK-29326).
>> When inserting a value into a column with the different data type, Spark
>> performs type coercion. Currently, we support 3 policies for the store
>> assignment rules: ANSI, legacy and strict, which can be set via the option
>> "spark.sql.storeAssignmentPolicy":
>> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
>> practice, the behavior is mostly the same as PostgreSQL. It disallows
>> certain unreasonable type conversions such as converting `string` to `int`
>> and `double` to `boolean`. It will throw a runtime exception if the value
>> is out-of-range (overflow).
>> 2. Legacy: Spark allows the store assignment as long as it is a valid
>> `Cast`, which is very loose. E.g., converting either `string` to `int` or
>> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
>> for compatibility with Hive. When inserting an out-of-range value to an
>> integral field, the low-order bits of the value are inserted (the same as
>> Java/Scala numeric type casting). For example, if 257 is inserted into a
>> field of Byte type, the result is 1.
>> 3. Strict: Spark doesn't allow any possible precision loss or data
>> truncation in store assignment, e.g., converting either `double` to `int`
>> or `decimal` to `double` is not allowed. The rules were originally for the Dataset
>> encoder. As far as I know, no mainstream DBMS is using this policy by
>> default.
>>
>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>> and V2 in Spark 3.0.
>>
>> This vote is open until Friday (Oct. 11).
>>
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thank you!
>>
>> Gengliang
>>
>


Re: [build system] IMPORTANT! northern california fire danger, potential power outage(s)

2019-10-09 Thread Wenchen Fan
Thanks for the updates!

On Thu, Oct 10, 2019 at 5:34 AM Shane Knapp  wrote:

> quick update:
>
> campus is losing power @ 8pm.  this is after we were told 4am, 8am,
> noon, and 2-4pm.  :)
>
> PG&E expects to start bringing alameda county back online at noon
> tomorrow, but i believe that target to be fluid and take longer than
> expected.
>
> this means that the earliest that we can bring the build system back
> up is friday, but there's a much greater than non-zero chance of this
> not happening until monday morning.  i will be leaving town for the
> weekend friday afternoon, which means i won't be physically present to
> turn on all of our servers in the colo (about ~80 servers including
> jenkins) until monday.
>
> more updates as they come.  thanks for your patience!
>
> On Tue, Oct 8, 2019 at 7:32 PM Shane Knapp  wrote:
> >
> > jenkins is going down now.
> >
> > On Tue, Oct 8, 2019 at 4:21 PM Shane Knapp  wrote:
> > >
> > > quick update:
> > >
> > > we are definitely going to have our power shut off starting early
> > > tomorrow morning (by 4am PDT oct 9th), and expect at least 48 hours
> > > before it is restored.
> > >
> > > i will be shutting jenkins down some time this evening, and will
> > > update everyone here when i get more information.
> > >
> > > full service will be restored (i HOPE) by friday morning.
> > >
> > > shane (who doesn't ever want to check this list's archives and count
> > > how many times we've had power issues)
> > >
> > > On Tue, Oct 8, 2019 at 12:50 PM Shane Knapp 
> wrote:
> > > >
> > > > here in the lovely bay area, we are currently experiencing some
> > > > absolutely lovely weather:  temps around 20C, light winds, and not a
> > > > drop of moisture anywhere.
> > > >
> > > > this means that wildfire season is here, and our utilities company
> > > > (PG&E) is very concerned about fires like last year's Camp Fire
> > > > (https://en.wikipedia.org/wiki/Camp_Fire_(2018)), the 2018 fires
> > > > (https://en.wikipedia.org/wiki/2018_California_wildfires) and 2017
> > > > fires (https://en.wikipedia.org/wiki/2017_California_wildfires).
> > > >
> > > > because conditions are absolutely perfect for wildfires, we may lose
> > > > power here in berkeley tomorrow and thursday.
> > > >
> > > > there will be little to no notice of when this might happen, and if
> it
> > > > does that means that jenkins will most definitely go down.
> > > >
> > > > i will continue to keep a close eye on this and give updates as they
> > > > happen.  sadly, the PG&E website is down because they apparently
> > > > didn't think that they needed load balancers.  :\
> > > >
> > > > shane
> > > > --
> > > > Shane Knapp
> > > > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > > > https://rise.cs.berkeley.edu
> > >
> > >
> > >
> > > --
> > > Shane Knapp
> > > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > > https://rise.cs.berkeley.edu
> >
> >
> >
> > --
> > Shane Knapp
> > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > https://rise.cs.berkeley.edu
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Wenchen Fan
Regarding DS v2, I'd like to remove
SPARK-26785  data source
v2 API refactor: streaming write
SPARK-26956  remove
streaming output mode from data source v2 APIs

and put the umbrella ticket instead
SPARK-25390  data source
V2 API refactoring

Thanks,
Wenchen

On Wed, Oct 9, 2019 at 1:19 PM Dongjoon Hyun 
wrote:

> Thank you for the preparation of 3.0-preview, Xingbo!
>
> Bests,
> Dongjoon.
>
> On Tue, Oct 8, 2019 at 2:32 PM Xingbo Jiang  wrote:
>
>>  What's the process to propose a feature to be included in the final
>>> Spark 3.0 release?
>>>
>>
>> I don't know whether there is any specific process here; normally you
>> just merge the feature into Spark master before release code freeze, and
>> then the feature would probably be included in the release. The code freeze
>> date for Spark 3.0 has not been decided yet, though.
>>
>> Li Jin  于2019年10月8日周二 下午2:14写道:
>>
>>> Thanks for summary!
>>>
>>> I have a question that is semi-related - What's the process to propose a
>>> feature to be included in the final Spark 3.0 release?
>>>
>>> In particular, I am interested in
>>> https://issues.apache.org/jira/browse/SPARK-28006.  I am happy to do
>>> the work so want to make sure I don't miss the "cut" date.
>>>
>>> On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang 
>>> wrote:
>>>
 Hi all,

 Thanks for all the feedbacks, here is the updated feature list:

 SPARK-11215 
 Multiple columns support added to various Transformers: StringIndexer

 SPARK-11150 
 Implement Dynamic Partition Pruning

 SPARK-13677 
 Support Tree-Based Feature Transformation

 SPARK-16692  Add
 MultilabelClassificationEvaluator

 SPARK-19591  Add
 sample weights to decision trees

 SPARK-19712 
 Pushing Left Semi and Left Anti joins through Project, Aggregate, Window,
 Union etc.

 SPARK-19827  R API
 for Power Iteration Clustering

 SPARK-20286 
 Improve logic for timing out executors in dynamic allocation

 SPARK-20636 
 Eliminate unnecessary shuffle with adjacent Window expressions

 SPARK-22148 
 Acquire new executors to avoid hang because of blacklisting

 SPARK-22796 
 Multiple columns support added to various Transformers: PySpark
 QuantileDiscretizer

 SPARK-23128  A new
 approach to do adaptive execution in Spark SQL

 SPARK-23155  Apply
 custom log URL pattern for executor log URLs in SHS

 SPARK-23539  Add
 support for Kafka headers

 SPARK-23674  Add
 Spark ML Listener for Tracking ML Pipeline Status

 SPARK-23710 
 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2

 SPARK-24333  Add
 fit with validation set to Gradient Boosted Trees: Python API

 SPARK-24417  Build
 and Run Spark on JDK11

 SPARK-24615 
 Accelerator-aware task scheduling for Spark

 SPARK-24920  Allow
 sharing Netty's memory pool allocators

 SPARK-25250  Fix
 race condition with tasks running when new attempt for same stage is
 created leads to other task in the next attempt running on the same
 partition id retry multiple times

 SPARK-25341 
 Support rolling back a shuffle map stage and re-generate the shuffle files

 SPARK-25348  Data
 source for binary files

 SPARK-25501  Add
 kafka delegation token support

 SPARK-25603 
 Generalize Nested Column Pruning

 SPARK-26132  Remove

Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-08 Thread Wenchen Fan
> Would you mind if I ask what the condition is for being a public API?

The module doesn't matter, but the package matters. We have many public
APIs in the catalyst module as well. (e.g. DataType)

There are 3 packages in Spark SQL that are meant to be private:
1. org.apache.spark.sql.catalyst
2. org.apache.spark.sql.execution
3. org.apache.spark.sql.internal

You can check out the full list of private packages of Spark in
project/SparkBuild.scala#Unidoc#ignoreUndocumentedPackages

Basically, classes/interfaces that don't appear in the official Spark API
doc are private.

Source/Sink traits are in org.apache.spark.sql.execution and thus they are
private.

On Tue, Oct 8, 2019 at 6:19 AM Jungtaek Lim 
wrote:

> Would you mind if I ask what the condition is for being a public API? The Source/Sink
> traits are not marked as @DeveloperApi but they're defined as public, and
> located in sql-core (so not even semantically private, like catalyst), which easily
> gives a signal that they're public APIs.
>
> Also, if I'm not missing something here, creating a streaming DataFrame via RDD[Row]
> is not available even as a private API. There are some other approaches using
> private APIs: 1) SQLContext.internalCreateDataFrame - as it requires
> RDD[InternalRow], they should also depend on catalyst and have to deal with
> InternalRow, which the Spark community seems to want to change
> eventually; 2) Dataset.ofRows - it requires a LogicalPlan, which is also in
> catalyst. So they not only need to apply the "package hack" but also need to
> depend on catalyst.
>
>
> On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan  wrote:
>
>> AFAIK there is no public streaming data source API before DS v2. The
>> Source and Sink API is private and is only for builtin streaming sources.
>> Advanced users can still implement custom stream sources with private Spark
>> APIs (you can put your classes under the org.apache.spark.sql package to
>> access the private methods).
>>
>> That said, DS v2 is the first public streaming data source API. It's
>> really hard to design a stable, efficient and flexible data source API that
>> is unified between batch and streaming. DS v2 has evolved a lot in the
>> master branch and hopefully there will be no big breaking changes anymore.
>>
>>
>> On Sat, Oct 5, 2019 at 12:24 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> I remembered the actual case from developer who implements custom data
>>> source.
>>>
>>>
>>> https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E
>>>
>>> Quoting here:
>>> We started implementing DSv2 in the 2.4 branch, but quickly discovered
>>> that the DSv2 in 3.0 was a complete breaking change (to the point where it
>>> could have been named DSv3 and it wouldn’t have come as a surprise). Since
>>> the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided
>>> to fall back into DSv1 in order to ease the future transition to Spark 3.
>>>
>>> Given that DSv2 for Spark 2.x and 3.x have diverged a lot, a realistic solution
>>> for dealing with the DSv2 breaking change is having DSv1 as a temporary solution,
>>> even after DSv2 for 3.x becomes available. They need some time to make the transition.
>>>
>>> I would file an issue to support streaming data source on DSv1 and
>>> submit a patch unless someone objects.
>>>
>>>
>>> On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski  wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> Thanks a lot for your very prompt response!
>>>>
>>>> > Looks like it's missing, or intended to force custom streaming source
>>>> implemented as DSv2.
>>>>
>>>> That's exactly my understanding = no more DSv1 data sources. That
>>>> however is not consistent with the official message, is it? Spark 2.4.4
>>>> does not actually say "we're abandoning DSv1", and people would not really
>>>> want to jump on DSv2 since it's not recommended (unless I missed that).
>>>>
>>>> I love surprises (as that's where people pay more for consulting :)),
>>>> but not necessarily before public talks (with one at SparkAISummit in two
>>>> weeks!) Gonna be challenging! Hope I won't spread a wrong word.
>>>>
>>>> Pozdrawiam,
>>>> Jacek Laskowski
>>>> 
>>>> https://about.me/JacekLaskowski
>>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>>> The Internals of Spark Structured Streaming
>>>> https://bit.ly/spark-structured-s

Re: [DISCUSS] ViewCatalog interface for DSv2

2019-10-14 Thread Wenchen Fan
I'm fine with the view definition proposed here, but my major concern is
how to make sure tables and views share the same namespace. According to the SQL
spec, if there is a view named "a", we can't create a table named "a"
anymore.

We can add documentation and ask the implementations to guarantee it, but it's
better if this can be guaranteed by the API.
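
For readers skimming the thread, a rough Scala sketch of the interface shape summarized at the bottom of this message; the signatures and parameter types here are assumptions layered on top of the listed names, not the proposed API itself.

    // Rough sketch only -- member names taken from the summary quoted below;
    // the signatures are assumptions, not the final API.
    case class ViewColumn(name: String, dataType: String)

    trait View {
      def name: String
      def originalSql: String
      def defaultCatalog: String
      def defaultNamespace: Array[String]
      def viewColumns: Seq[ViewColumn]
      def owner: String
      def createTime: Long
      def softwareVersion: String
      def options: Map[String, String]
    }

    trait ViewCatalog {
      def loadView(ident: Seq[String]): View
      def listViews(namespace: Seq[String]): Seq[Seq[String]]
      def createView(ident: Seq[String], definition: View): View
      def deleteView(ident: Seq[String]): Boolean
    }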

On Wed, Aug 14, 2019 at 1:46 AM John Zhuge  wrote:

> Thanks for the feedback, Ryan! I can share the WIP copy of the SPIP if
> that makes sense.
>
> I can't find out a lot about view resolution and validation in SQL Spec
> Part1. Anybody with full SQL knowledge may chime in.
>
> Here is my understanding based on online manuals, docs, and other
> resources:
>
>- A view has a name in the database schema so that other queries can
>use it like a table.
>- A view's schema is frozen at the time the view is created;
>subsequent changes to underlying tables (e.g. adding a column) will not be
>reflected in the view's schema. If an underlying table is dropped or
>changed in an incompatible fashion, subsequent attempts to query the
>invalid view will fail.
>
> In Presto, view columns are used for validation only (see
> StatementAnalyzer.Visitor#isViewStale):
>
>- view column names must match the visible fields of analyzed view sql
>- the visible fields can be coerced to view column types
>
> In Spark 2.2+, view columns are also used for validation (see
> CheckAnalysis#checkAnalysis case View):
>
>- view column names must match the output fields of the view sql
>- view column types must be able to UpCast to output field types
>
> Rule EliminateView adds a Project to viewQueryColumnNames if it exists.
>
> As for `softwareVersion`, the purpose is to track which software version
> was used to create the view, in preparation for different versions of the
> same software or even different software, such as Presto vs. Spark.
>
>
> On Tue, Aug 13, 2019 at 9:47 AM Ryan Blue  wrote:
>
>> Thanks for working on this, John!
>>
>> I'd like to see a more complete write-up of what you're proposing.
>> Without that, I don't think we can have a productive discussion about this.
>>
>> For example, I think you're proposing to keep the view columns to ensure
>> that the same columns are produced by the view every time, based on
>> requirements from the SQL spec. Let's start by stating what those behavior
>> requirements are, so that everyone has the context to understand why your
>> proposal includes the view columns. Similarly, I'd like to know why you're
>> proposing `softwareVersion` in the view definition.
>>
>> On Tue, Aug 13, 2019 at 8:56 AM John Zhuge  wrote:
>>
>>> Catalog support has been added to DSv2 along with a table catalog
>>> interface. Here I'd like to propose a view catalog interface, for the
>>> following benefit:
>>>
>>>- Abstraction for view management thus allowing different view
>>>backends
>>>- Disassociation of view definition storage from Hive Metastore
>>>
>>> A catalog plugin can be both TableCatalog and ViewCatalog. Resolve an
>>> identifier as view first then table.
>>>
>>> More details in SPIP and PR if we decide to proceed. Here is a quick
>>> glance at the API:
>>>
>>> ViewCatalog interface:
>>>
>>>- loadView
>>>- listViews
>>>- createView
>>>- deleteView
>>>
>>> View interface:
>>>
>>>- name
>>>- originalSql
>>>- defaultCatalog
>>>- defaultNamespace
>>>- viewColumns
>>>- owner
>>>- createTime
>>>- softwareVersion
>>>- options (map)
>>>
>>> ViewColumn interface:
>>>
>>>- name
>>>- type
>>>
>>>
>>> Thanks,
>>> John Zhuge
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> John Zhuge
>


Re: [VOTE] SPARK 3.0.0-preview2 (RC2)

2019-12-18 Thread Wenchen Fan
+1, all tests pass

On Thu, Dec 19, 2019 at 7:18 AM Takeshi Yamamuro 
wrote:

> Thanks, Yuming!
>
> I checked the links and the prepared binaries.
> Also, I run tests with  -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver
> -Pmesos -Pkubernetes -Psparkr
> on java version "1.8.0_181.
> All the things above look fine.
>
> Bests,
> Takeshi
>
> On Thu, Dec 19, 2019 at 6:31 AM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> I also check the signatures and docs. And, built and tested with JDK
>> 11.0.5, Hadoop 3.2, Hive 2.3.
>> In addition, the newly added
>> `spark-3.0.0-preview2-bin-hadoop2.7-hive1.2.tgz` distribution looks correct.
>>
>> Thank you Yuming and all.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, Dec 17, 2019 at 4:11 PM Sean Owen  wrote:
>>
>>> Same result as last time. It all looks good and tests pass for me on
>>> Ubuntu with all profiles enables (Hadoop 3.2 + Hive 2.3), building
>>> from source.
>>> 'pyspark-3.0.0.dev2.tar.gz' appears to be the desired python artifact
>>> name, yes.
>>> +1
>>>
>>> On Tue, Dec 17, 2019 at 12:36 AM Yuming Wang  wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 3.0.0-preview2.
>>> >
>>> > The vote is open until December 20 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> > a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 3.0.0-preview2
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v3.0.0-preview2-rc2 (commit
>>> bcadd5c3096109878fe26fb0d57a9b7d6fdaa257):
>>> > https://github.com/apache/spark/tree/v3.0.0-preview2-rc2
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview2-rc2-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1338/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview2-rc2-docs/
>>> >
>>> > The list of bug fixes going into 3.0.0 can be found at the following
>>> URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by taking
>>> > an existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and install
>>> > the current RC and see if anything important breaks, in the Java/Scala
>>> > you can add the staging repository to your projects resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with an out of date RC going forward).
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 3.0.0?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 3.0.0 can be found at:
>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.0.0
>>> >
>>> > Committers should look at those and triage. Extremely important bug
>>> > fixes, documentation, and API tweaks that impact compatibility should
>>> > be worked on immediately.
>>> >
>>> > ==
>>> > But my bug isn't fixed?
>>> > ==
>>> >
>>> > In order to make timely releases, we will typically not hold the
>>> > release unless the bug in question is a regression from the previous
>>> > release. That being said, if there is something which is a regression
>>> > that has not been correctly targeted please ping me or a committer to
>>> > help target the issue.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: how to get partition column info in Data Source V2 writer

2019-12-18 Thread Wenchen Fan
Hi Aakash,

You can try the latest DS v2 with the 3.0 preview, and the API is in a
quite stable shape now. With the latest API, a Writer is created from a
Table, and the Table has the partitioning information.

Thanks,
Wenchen
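
A minimal sketch of where the partitioning shows up in the 3.0 connector API follows; the class, column, and capability choices are made up, and the signatures follow the 3.0 preview so they may still change.

    // Rough sketch only: a v2 Table owns its partitioning, so the scan/write
    // builders it creates can read it directly instead of via DataFrameWriter.
    import java.util
    import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
    import org.apache.spark.sql.connector.expressions.{Expressions, Transform}
    import org.apache.spark.sql.types.StructType

    class MyPartitionedTable(tableSchema: StructType) extends Table {
      override def name(): String = "my_table"
      override def schema(): StructType = tableSchema
      // A writable table would also extend SupportsWrite and advertise BATCH_WRITE.
      override def capabilities(): util.Set[TableCapability] =
        util.Collections.emptySet[TableCapability]()
      // The partitioning information lives here, on the Table itself.
      override def partitioning(): Array[Transform] = Array(Expressions.identity("dt"))
    }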

On Wed, Dec 18, 2019 at 3:22 AM aakash aakash 
wrote:

> Thanks Andrew!
>
> It seems there is a drastic change in 3.0, going through it.
>
> -Aakash
>
> On Tue, Dec 17, 2019 at 11:01 AM Andrew Melo 
> wrote:
>
>> Hi Aakash
>>
>> On Tue, Dec 17, 2019 at 12:42 PM aakash aakash 
>> wrote:
>>
>>> Hi Spark dev folks,
>>>
>>> First of all, kudos on this new Data Source v2; the API looks simple and it
>>> makes it easy to develop a new data source and use it.
>>>
>>> With my current work, I am trying to implement a new data source V2
>>> writer with Spark 2.3 and I was wondering how I will get the info about the
>>> partition-by columns. I see that it is passed to Data Source V1 from
>>> DataFrameWriter but not for V2.
>>>
>>
>> Not directly related to your Q, but just so you're aware, the DSv2 API
>> evolved from 2.3->2.4 and then again for 2.4->3.0.
>>
>> Cheers
>> Andrew
>>
>>
>>>
>>>
>>> Thanks,
>>> Aakash
>>>
>>


Re: Adaptive Query Execution performance results in 3TB TPC-DS

2020-02-13 Thread Wenchen Fan
Thanks for providing the benchmark numbers! The result is very promising
and I'm looking forward to seeing more feedback from real-world workloads.
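
For anyone who wants to try this on their own workload, a minimal way to turn it on is sketched below; the two fine-grained configs are examples of AQE knobs in 3.0, and their names and defaults may still evolve.

    // Minimal sketch: enabling AQE on a Spark 3.0 build.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("aqe-check")
      // Umbrella switch for adaptive query execution.
      .config("spark.sql.adaptive.enabled", "true")
      // Fine-grained AQE features (names per 3.0, may evolve).
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      .getOrCreate()

    // Run the workload and compare plans/runtimes with the umbrella switch off.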

On Wed, Feb 12, 2020 at 3:43 PM Jia, Ke A  wrote:

> Hi all,
>
> We have completed the Spark 3.0 Adaptive Query Execution (AQE) performance
> tests on 3TB TPC-DS on a 5-node Cascade Lake cluster. 2 queries gain more
> than 1.5x performance and 37 queries gain more than 1.1x performance with
> AQE. No query has a significant performance degradation. The detailed
> performance results and key configurations are shown here.
> Based on the performance results, we recommend users turn on AQE in Spark
> 3.0. If you encounter any bug or improvement opportunity when enabling AQE,
> please help to file related JIRAs. Thanks.
>
>
>
> Regards,
>
> Jia Ke
>
>
>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-10 Thread Wenchen Fan
Great Job, Dongjoon!

On Mon, Feb 10, 2020 at 4:18 PM Hyukjin Kwon  wrote:

> Thanks Dongjoon!
>
> 2020년 2월 9일 (일) 오전 10:49, Takeshi Yamamuro 님이 작성:
>
>> Happy to hear the release news!
>>
>> Bests,
>> Takeshi
>>
>> On Sun, Feb 9, 2020 at 10:28 AM Dongjoon Hyun 
>> wrote:
>>
>>> There was a typo in one URL. The correct release note URL is here.
>>>
>>> https://spark.apache.org/releases/spark-release-2-4-5.html
>>>
>>>
>>>
>>> On Sat, Feb 8, 2020 at 5:22 PM Dongjoon Hyun 
>>> wrote:
>>>
 We are happy to announce the availability of Spark 2.4.5!

 Spark 2.4.5 is a maintenance release containing stability fixes. This
 release is based on the branch-2.4 maintenance branch of Spark. We
 strongly
 recommend all 2.4 users to upgrade to this stable release.

 To download Spark 2.4.5, head over to the download page:
 http://spark.apache.org/downloads.html

 Note that you might need to clear your browser cache or
 to use `Private`/`Incognito` mode according to your browsers.

 To view the release notes:
 https://spark.apache.org/releases/spark-release-2.4.5.html

 We would like to acknowledge all community members for contributing to
 this
 release. This release would not have been possible without you.

 Dongjoon Hyun

>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: Request to document the direct relationship between other configurations

2020-02-12 Thread Wenchen Fan
In general I think it's better to have more detailed documents, but we
don't have to force everyone to do it if the config name is structured. I
would +1 to document the relationship if we can't tell it from the config
names, e.g. spark.shuffle.service.enabled
and spark.dynamicAllocation.enabled.
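
For instance, a sketch of what that cross-reference could look like at the definition site; this uses Spark's internal ConfigBuilder, so it would have to live inside the Spark source tree, and the doc wording is only an example.

    package org.apache.spark.internal.config

    // Sketch only: spelling out the dependency in the config's own doc string
    // (ConfigBuilder is private[spark], hence the package declaration above).
    object DynamicAllocationDocExample {
      val DYN_ALLOCATION_ENABLED = ConfigBuilder("spark.dynamicAllocation.enabled")
        .doc("Whether to use dynamic resource allocation. " +
          "This requires spark.shuffle.service.enabled to be set.")
        .booleanConf
        .createWithDefault(false)
    }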

On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon  wrote:

> Also, I would like to hear other people' thoughts on here. Could I ask
> what you guys think about this in general?
>
> 2020년 2월 12일 (수) 오후 12:02, Hyukjin Kwon 님이 작성:
>
>> To do that, we should explicitly document such structured configuration
>> and implicit effect, which is currently missing.
>> I would be more than happy if we documented such implied relationships,
>> *and* if we are very sure all configurations are structured correctly and
>> coherently.
>> Until that point, I think it might be more practical to simply document
>> it for now.
>>
>> > Btw, maybe off-topic, `spark.dynamicAllocation` has another issue
>> in practice - whether to duplicate descriptions between the configuration code
>> and the doc. I have been asked to add descriptions in the configuration code
>> regardless, while the existing codebase doesn't have them. This configuration is
>> a widely-used one.
>> This is actually something we should fix too. In the SQL configuration, we
>> no longer have such duplication as of
>> https://github.com/apache/spark/pull/27459, since the doc is generated from
>> the configuration code. We should do the same for the other configurations.
>>
>>
>> 2020년 2월 12일 (수) 오전 11:47, Jungtaek Lim 님이
>> 작성:
>>
>>> I'm looking into the case of `spark.dynamicAllocation` and this seems to
>>> be a case that supports my point.
>>>
>>>
>>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>>
>>> I don't disagree with adding "This requires
>>> spark.shuffle.service.enabled to be set." in the description of
>>> `spark.dynamicAllocation.enabled`. This cannot be inferred implicitly,
>>> hence it would be better to have it.
>>>
>>> Here is why I'm in favor of structured configuration & implicit effects over
>>> describing everything explicitly:
>>>
>>> 1. There're 10 configurations (if the doc doesn't miss any other
>>> configuration) except `spark.dynamicAllocation.enabled`, and only 4
>>> configurations are referred to in the description of
>>> `spark.dynamicAllocation.enabled` - the majority of config keys are missing.
>>> 2. I think it's intentional, but the table starts
>>> with `spark.dynamicAllocation.enabled`, which says implicitly but
>>> intuitively that if you disable this then everything on dynamic allocation
>>> won't work. Missing the majority of references to config keys doesn't make it
>>> hard to understand.
>>> 3. Even `spark.dynamicAllocation` has a bad case - see
>>> `spark.dynamicAllocation.shuffleTracking.enabled` and
>>> `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect the
>>> structure of the configuration. I think this is worse than not explicitly
>>> mentioning the relationship in the description. Let's assume the name had
>>> been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive
>>> that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false`
>>> would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?
>>>
>>> Btw, maybe off-topic, `spark.dynamicAllocation` has another issue
>>> in practice - whether to duplicate descriptions between the configuration code
>>> and the doc. I have been asked to add descriptions in the configuration code
>>> regardless, while the existing codebase doesn't have them. This configuration is
>>> a widely-used one.
>>>
>>>
>>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon 
>>> wrote:
>>>
 Sure, adding "[DISCUSS]" is a good practice to label it. I had to do it
 although it might be "redundant" :-) since anyone can give feedback to any
 thread in Spark dev mailing list, and discuss.

 This pattern is actually quite prevalent, given my rough reading of the
 configuration files. I would like to treat this missing relationship as a bad
 pattern that started from personal preference.

 > Personally I'd rather not think someone won't understand setting
 `.enabled` to `false` means the functionality is disabled and effectively
 it disables all sub-configurations.
 > E.g. when `spark.sql.adaptive.enabled` is `false`, all the
 configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this
 is pretty intuitive and the one of major
 > benefits of the structured configurations.

 I don't think it is a good idea to assume that users know such
 contexts. One might think
 `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
 partially enable the feature. It is better to document this explicitly,
 since some of the configurations are even difficult for users to confirm
 whether they are working or not.
 For instance, one might think setting
 'spark.eventLog.rolling.maxFileSize' automatically enables rolling. Then,
 they realise the log is not rolling later after the file

[DISCUSS] naming policy of Spark configs

2020-02-12 Thread Wenchen Fan
Hi all,

I'd like to discuss the naming policy of Spark configs, as for now it
depends on personal preference which leads to inconsistent namings.

In general, the config name should be a noun that describes its meaning
clearly.
Good examples:
spark.sql.session.timeZone
spark.sql.streaming.continuous.executorQueueSize
spark.sql.statistics.histogram.numBins
Bad examples:
spark.sql.defaultSizeInBytes (default size for what?)

Also note that a config name has many parts, joined by dots. Each part is a
namespace. Don't create namespaces unnecessarily.
Good example:
spark.sql.execution.rangeExchange.sampleSizePerPartition
spark.sql.execution.arrow.maxRecordsPerBatch
Bad examples:
spark.sql.windowExec.buffer.in.memory.threshold ("in" is not a useful
namespace, better to be .buffer.inMemoryThreshold)

For a big feature, usually we need to create an umbrella config to turn it
on/off, and other configs for fine-grained controls. These configs should
share the same namespace, and the umbrella config should be named like
featureName.enabled. For example:
spark.sql.cbo.enabled
spark.sql.cbo.starSchemaDetection
spark.sql.cbo.starJoinFTRatio
spark.sql.cbo.joinReorder.enabled
spark.sql.cbo.joinReorder.dp.threshold (BTW "dp" is not a good namespace)
spark.sql.cbo.joinReorder.card.weight (BTW "card" is not a good namespace)

For boolean configs, in general the name should end with a verb, e.g.
spark.sql.join.preferSortMergeJoin. If the config is for a feature and you
can't find a good verb for the feature, featureName.enabled is also good.
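
Applied to a made-up feature, the policy might look like this at the definition site (a sketch of definitions inside SQLConf, using its internal buildConf helper; the config names are hypothetical):

    // Sketch only, inside SQLConf: an umbrella switch plus a fine-grained knob
    // sharing the same namespace (config names are made up for illustration).
    val MY_FEATURE_ENABLED = buildConf("spark.sql.myFeature.enabled")
      .doc("Umbrella switch; when false, all spark.sql.myFeature.* configs are no-ops.")
      .booleanConf
      .createWithDefault(false)

    // Noun-style name for a non-boolean config, with no unnecessary extra namespace.
    val MY_FEATURE_MAX_RECORDS_PER_BATCH = buildConf("spark.sql.myFeature.maxRecordsPerBatch")
      .doc("Max number of records per batch, only effective when " +
        "spark.sql.myFeature.enabled is true.")
      .intConf
      .createWithDefault(10000)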

I'll update https://spark.apache.org/contributing.html after we reach a
consensus here. Any comments are welcome!

Thanks,
Wenchen


Re: Datasource V2 support in Spark 3.x

2020-03-05 Thread Wenchen Fan
Data Source V2 has evolved into the Connector API, which supports both data (the
data source API) and metadata (the catalog API). The new APIs are under the
package org.apache.spark.sql.connector.

You can keep using Data Source V1 as there is no plan to deprecate it in
the near future. But if you'd like to try something new (like integrate
with your metadata), please take a look at the new Connector API.

Note that it's still evolving and API changes may happen in the next
release. We hope to stabilize it soon, but are still working on some
designs like a stable API to represent data (currently we are using
InternalRow).
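
As a starting point, the sketch below shows where the classes now live and how a catalog plugin gets wired in; the catalog name and implementation class are made up, and MyCatalog would need to implement the TableCatalog interface.

    // Minimal sketch: the new Connector API lives under org.apache.spark.sql.connector.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // "my_cat" and com.example.MyCatalog are placeholder names; the class must
      // implement org.apache.spark.sql.connector.catalog.TableCatalog.
      .config("spark.sql.catalog.my_cat", "com.example.MyCatalog")
      .getOrCreate()

    // Tables in the custom catalog are then addressed with a catalog prefix:
    spark.sql("SELECT * FROM my_cat.db.events").show()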

On Sat, Feb 29, 2020 at 8:39 AM Mihir Sahu 
wrote:

> Hi Team,
>
> I wanted to know before developing a new datasource for Spark 3.x: shall
> it be done using Datasource V2 or Datasource V1 (via Relation), or is there
> any other plan?
>
> When I tried to build a datasource using V2 for Spark 3.0, I could not
> find the associated classes and they seem to have been moved out; however I am
> able to use Datasource V1 to build the new datasources.
>
>  I wanted to know the path ahead for datasources, so that I can build and
> contribute accordingly.
>
> Thanks
> Regards
> Mihir Sahu
>


Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-23 Thread Wenchen Fan
us to where we are with
>>> this today. Hopefully that will provide some clarity.
>>>
>>> The purpose of partition transforms is to allow source implementations
>>> to internally handle partitioning. Right now, users are responsible for
>>> this. For example, users will transform timestamps into date strings when
>>> writing and other people will provide a filter on those date strings when
>>> scanning. This is error-prone: users commonly forget to add partition
>>> filters in addition to data filters, if anyone uses the wrong format or
>>> transformation queries will silently return incorrect results, etc. But
>>> sources can (and should) automatically handle storing and retrieving data
>>> internally because it is much easier for users.
>>>
>>> When we first proposed transforms, I wanted to use Expression. But
>>> Reynold rightly pointed out that Expression is an internal API that should
>>> not be exposed. So we decided to compromise by building a public
>>> expressions API like the public Filter API for the initial purpose of
>>> passing transform expressions to sources. The idea was that Spark needs a
>>> public expression API anyway for other uses, like requesting a distribution
>>> and ordering for a writer. To keep things simple, we chose to build a
>>> minimal public expression API and expand it incrementally as we need more
>>> features.
>>>
>>> We also considered whether to parse all expressions and convert only
>>> transformations to the public API, or to parse just transformations. We
>>> went with just parsing transformations because it was easier and we can
>>> expand it to improve error messages later.
>>>
>>> I don't think there is reason to revert this simply because of some of
>>> the early choices, like deciding to start a public expression API. If you'd
>>> like to extend this to "fix" areas where you find it confusing, then please
>>> do. We know that by parsing more expressions we could improve error
>>> messages. But that's not to say that we need to revert it.
>>>
>>> None of this has been confusing or misleading for our users, who caught
>>> on quickly.
>>>
>>> On Thu, Jan 16, 2020 at 5:14 AM Hyukjin Kwon 
>>> wrote:
>>>
>>>> I think the problem here is whether there is an explicit plan or not.
>>>> The PR was merged one year ago and not many changes have been made to
>>>> this API to address the main concerns mentioned.
>>>> Also, the follow-up JIRA requested seems to still be open:
>>>> https://issues.apache.org/jira/browse/SPARK-27386
>>>> I heard this was already discussed but I cannot find the summary of the
>>>> meeting or any documentation.
>>>>
>>>> I would like to understand how we plan to extend this. I had a couple of
>>>> questions, such as:
>>>>   - Why can't we use something like the UDF interface, for example?
>>>>   - Is this expression only for partitioning, or do we plan to expose this
>>>> to replace other existing expressions?
>>>>
>>>> > About extensibility, it's similar to DS V1 Filter again. We don't
>>>> cover all the expressions at the beginning, but we can add more in future
>>>> versions when needed. The data source implementations should be defensive
>>>> and fail when seeing unrecognized Filter/Transform.
>>>>
>>>> I think there are differences in that:
>>>> - A DSv1 filter works whether the filters are pushed or not. However, this
>>>> case does not work.
>>>> - There are too many expressions, whereas the number of predicates is
>>>> relatively limited. If we plan to push all expressions eventually, I doubt
>>>> that this is a good idea.
>>>>
>>>>
>>>> 2020년 1월 16일 (목) 오후 9:22, Wenchen Fan 님이 작성:
>>>>
>>>>> The DS v2 project is still evolving so half-baked features are inevitable
>>>>> sometimes. This feature is definitely in the right direction to allow more
>>>>> flexible partition implementations, but there are a few problems we can
>>>>> discuss.
>>>>>
>>>>> About expression duplication. This is an existing design choice. We
>>>>> don't want to expose the Expression class directly but we do need to 
>>>>> expose
>>>>> some Expression-like stuff in the developer APIs. So we pick some basic
>>>>> expressions, make a copy and create a public version of them
