Re: Apache Spark 2.2.3 ?

2019-01-08 Thread Xiao Li
Thank you, Takeshi!

On Tue, Jan 8, 2019 at 10:13 PM Dongjoon Hyun wrote:

> Great! Thank you, Takeshi! :D
>
> Bests,
> Dongjoon.
>
> On Tue, Jan 8, 2019 at 8:47 PM Takeshi Yamamuro 
> wrote:
>
>> If there is no other volunteer for the release of 2.3.3, I'd like to.
>>
>> best,
>> takeshi
>>
>> On Fri, Jan 4, 2019 at 11:49 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you, Sean!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Thu, Jan 3, 2019 at 2:50 PM Sean Owen  wrote:
>>>
 Yes, that one's not going to be back-ported to 2.3. I think it's fine
 to proceed with a 2.2 release with what's there now and call it done.
 Note that Spark 2.3 would be EOL around September of this year.

 On Thu, Jan 3, 2019 at 2:31 PM Dongjoon Hyun 
 wrote:

> Thank you for additional support for 2.2.3, Felix and Takeshi!
>
>
> The following is the update for Apache Spark 2.2.3 release.
>
> For correctness issues, two more patches landed on `branch-2.2`.
>
>   SPARK-22951 fix aggregation after dropDuplicates on empty
> dataframes
>   SPARK-25591 Avoid overwriting deserialized accumulator
>
> Currently, if we use the following JIRA search query, there exists one
> JIRA issue: SPARK-25206.
>
>   Query: project = SPARK AND fixVersion in (2.3.0, 2.3.1, 2.3.2,
> 2.3.3, 2.4.0, 2.4.1, 3.0.0) AND fixVersion not in (2.2.0, 2.2.1, 2.2.2,
> 2.2.3) AND affectedVersion in (2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1,
> 2.2.2, 2.2.3) AND labels in (Correctness, correctness)
>
> SPARK-25206 ( https://issues.apache.org/jira/browse/SPARK-25206 ) has
>
>   Affected Version: 2.2.2, 2.3.1
>   Target Versions: 2.3.2, 2.4.0
>   Fixed Version: 2.4.0
>
> Although SPARK-25206 is labeled as a correctness issue, 2.3.2 already
> shipped without it due to technical difficulties and risks. Instead, it's
> marked as a known issue. As we can see, it's not targeted for 2.3.3 either.
>
> I know the correctness issue policy for new releases. However, for me,
> Spark 2.2.3 is a somewhat exceptional release since it's a farewell
> release and branch-2.2 is already EOL and far from the active master
> branch.
>
> So, I'd like to put SPARK-25206 out of the scope of this farewell
> release and recommend that users move to a more recent release instead;
> for example, Spark 2.4.0 for SPARK-25206.
>
> What do you think about that?
>
> Bests,
> Dongjoon.
>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: Apache Spark 2.2.3 ?

2019-01-08 Thread Dongjoon Hyun
Great! Thank you, Takeshi! :D

Bests,
Dongjoon.

On Tue, Jan 8, 2019 at 8:47 PM Takeshi Yamamuro 
wrote:

> If there is no other volunteer for the release of 2.3.3, I'd like to.
>
> best,
> takeshi
>
> On Fri, Jan 4, 2019 at 11:49 AM Dongjoon Hyun 
> wrote:
>
>> Thank you, Sean!
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Jan 3, 2019 at 2:50 PM Sean Owen  wrote:
>>
>>> Yes, that one's not going to be back-ported to 2.3. I think it's fine to
>>> proceed with a 2.2 release with what's there now and call it done.
>>> Note that Spark 2.3 would be EOL around September of this year.
>>>
>>> On Thu, Jan 3, 2019 at 2:31 PM Dongjoon Hyun 
>>> wrote:
>>>
 Thank you for additional support for 2.2.3, Felix and Takeshi!


 The following is the update for Apache Spark 2.2.3 release.

 For correctness issues, two more patches landed on `branch-2.2`.

   SPARK-22951 fix aggregation after dropDuplicates on empty
 dataframes
   SPARK-25591 Avoid overwriting deserialized accumulator

Currently, if we use the following JIRA search query, there exists one
 JIRA issue: SPARK-25206.

   Query: project = SPARK AND fixVersion in (2.3.0, 2.3.1, 2.3.2,
 2.3.3, 2.4.0, 2.4.1, 3.0.0) AND fixVersion not in (2.2.0, 2.2.1, 2.2.2,
 2.2.3) AND affectedVersion in (2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1,
 2.2.2, 2.2.3) AND labels in (Correctness, correctness)

 SPARK-25206 ( https://issues.apache.org/jira/browse/SPARK-25206 ) has

   Affected Version: 2.2.2, 2.3.1
   Target Versions: 2.3.2, 2.4.0
   Fixed Version: 2.4.0

Although SPARK-25206 is labeled as a correctness issue, 2.3.2 already
 shipped without it due to technical difficulties and risks. Instead, it's marked
 as a known issue. As we can see, it's not targeted for 2.3.3 either.

I know the correctness issue policy for new releases. However, for me,
 Spark 2.2.3 is a somewhat exceptional release since it's a farewell
 release and branch-2.2 is already EOL and far from the active master
 branch.

 So, I'd like to put SPARK-25206 out of the scope of this farewell
 release and recommend that users move to a more recent release instead;
 for example, Spark 2.4.0 for SPARK-25206.

What do you think about that?

 Bests,
 Dongjoon.

>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Apache Spark 2.2.3 ?

2019-01-08 Thread Takeshi Yamamuro
If there is no other volunteer for the release of 2.3.3, I'd like to.

best,
takeshi

On Fri, Jan 4, 2019 at 11:49 AM Dongjoon Hyun 
wrote:

> Thank you, Sean!
>
> Bests,
> Dongjoon.
>
>
> On Thu, Jan 3, 2019 at 2:50 PM Sean Owen  wrote:
>
>> Yes, that one's not going to be back-ported to 2.3. I think it's fine to
>> proceed with a 2.2 release with what's there now and call it done.
>> Note that Spark 2.3 would be EOL around September of this year.
>>
>> On Thu, Jan 3, 2019 at 2:31 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for additional support for 2.2.3, Felix and Takeshi!
>>>
>>>
>>> The following is the update for Apache Spark 2.2.3 release.
>>>
>>> For correctness issues, two more patches landed on `branch-2.2`.
>>>
>>>   SPARK-22951 fix aggregation after dropDuplicates on empty
>>> dataframes
>>>   SPARK-25591 Avoid overwriting deserialized accumulator
>>>
>>> Currently, if we use the following JIRA search query, there exists one
>>> JIRA issue: SPARK-25206.
>>>
>>>   Query: project = SPARK AND fixVersion in (2.3.0, 2.3.1, 2.3.2,
>>> 2.3.3, 2.4.0, 2.4.1, 3.0.0) AND fixVersion not in (2.2.0, 2.2.1, 2.2.2,
>>> 2.2.3) AND affectedVersion in (2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1,
>>> 2.2.2, 2.2.3) AND labels in (Correctness, correctness)
>>>
>>> SPARK-25206 ( https://issues.apache.org/jira/browse/SPARK-25206 ) has
>>>
>>>   Affected Version: 2.2.2, 2.3.1
>>>   Target Versions: 2.3.2, 2.4.0
>>>   Fixed Version: 2.4.0
>>>
>>> Although SPARK-25206 is labeled as a correctness issue, 2.3.2 already
>>> shipped without it due to technical difficulties and risks. Instead, it's marked
>>> as a known issue. As we can see, it's not targeted for 2.3.3 either.
>>>
>>> I know the correctness issue policy for new releases. However, for me,
>>> Spark 2.2.3 is a somewhat exceptional release since it's a farewell
>>> release and branch-2.2 is already EOL and far from the active master
>>> branch.
>>>
>>> So, I'd like to put SPARK-25206 out of the scope of this farewell release
>>> and recommend that users move to a more recent release instead; for example,
>>> Spark 2.4.0 for SPARK-25206.
>>>
>>> What do you think about that?
>>>
>>> Bests,
>>> Dongjoon.
>>>



-- 
---
Takeshi Yamamuro


Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-08 Thread Wenchen Fan
Some more thoughts. If we support unlimited negative scale, why can't we
support unlimited positive scale? e.g. 0.0001 could be decimal(1, 4) instead
of (4, 4). I think we need more references here: how do other databases deal
with the decimal type and parse decimal literals?
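For reference, here is a plain-JVM sketch (java.math.BigDecimal only, not Spark's
literal parser) of how precision and scale, including a negative scale, fall out of
the two literal shapes above:

import java.math.BigDecimal

// A plain-JVM sketch: how java.math.BigDecimal assigns precision and scale.
object DecimalScaleDemo extends App {
  val small = new BigDecimal("0.0001")
  // unscaled value = 1, so precision = 1 and scale = 4 -- the "(1, 4)" reading
  println(s"0.0001 -> precision=${small.precision}, scale=${small.scale}")

  val large = new BigDecimal("1E+4")
  // unscaled value = 1 with scale = -4: a negative scale
  println(s"1E+4   -> precision=${large.precision}, scale=${large.scale}")
}

The open question is then whether to keep such a type as-is or widen the precision
to cover the scale, which is the (4, 4) behavior mentioned above for 0.0001.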

On Mon, Jan 7, 2019 at 10:36 PM Wenchen Fan  wrote:

> I'm OK with it, i.e. fail the write if there are negative-scale decimals
> (we need to document it though). We can improve it later in data source v2.
>
> On Mon, Jan 7, 2019 at 10:09 PM Marco Gaido 
> wrote:
>
>> In general we can say that some data sources allow them and others fail. At
>> the moment, we are doing no casting before writing (so we can state so in
>> the doc). But since there is ongoing discussion for DSv2, we can maybe add
>> a flag/interface there for "negative scale intolerant" data sources and try to
>> cast before writing to them. What do you think about this?
>>
>> On Mon, Jan 7, 2019 at 3:03 PM Wenchen Fan
>> wrote:
>>
>>> AFAIK parquet spec says decimal scale can't be negative. If we want to
>>> officially support negative-scale decimal, we should clearly define the
>>> behavior when writing negative-scale decimals to parquet and other data
>>> sources. The most straightforward way is to fail for this case, but maybe
>>> we can do something better, like casting decimal(1, -20) to decimal(20, 0)
>>> before writing.
>>>
>>> On Mon, Jan 7, 2019 at 9:32 PM Marco Gaido 
>>> wrote:
>>>
 Hi Wenchen,

thanks for your email. I agree with adding documentation for the decimal type,
 but I am not sure what you mean by the behavior when writing: we are not
 performing any automatic casting before writing; if we want to do that, I
 think we need a design for it.

I am not sure if it makes sense to set a minimum for it. That would break
 backward compatibility (for very weird use cases), so I wouldn't do that.

 Thanks,
 Marco

On Mon, Jan 7, 2019 at 5:53 AM Wenchen Fan <
 cloud0...@gmail.com> wrote:

> I think we need to do this for backward compatibility, and according
> to the discussion in the doc, SQL standard allows negative scale.
>
> To do this, I think the PR should also include a doc for the decimal
> type, like the definition of precision and scale (this one looks pretty
> good), and the result type of decimal operations, and the
> behavior when writing out decimals (e.g. we can cast decimal(1, -20) to
> decimal(20, 0) before writing).
>
> Another question is, shall we set a min scale? e.g. shall we allow
> decimal(1, -1000)?
>
> On Thu, Oct 25, 2018 at 9:49 PM Marco Gaido 
> wrote:
>
>> Hi all,
>>
>> a bit more than one month ago, I sent a proposal for properly handling
>> decimals with negative scales in our operations. This is a long-standing
>> problem in our codebase, as we derived our rules from Hive and
>> SQL Server, where negative scales are forbidden, while in Spark they are
>> not.
>>
>> The discussion has been stale for a while now. No more comments on
>> the design doc:
>> https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit#heading=h.x7062zmkubwm
>> .
>>
>> So I am writing this e-mail in order to check whether there are more
>> comments on it or we can go ahead with the PR.
>>
>> Thanks,
>> Marco
>>
>


Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-08 Thread Wenchen Fan
+1

On Wed, Jan 9, 2019 at 3:37 AM DB Tsai  wrote:

> +1
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0x5CED8B896A6BDFA0
>
> On Tue, Jan 8, 2019 at 11:14 AM Dongjoon Hyun 
> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.2.3.
> >
> > The vote is open until January 11, 11:30 AM (PST) and passes if a majority of
> > +1 PMC votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.2.3
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.2.3-rc1 (commit
> 4acb6ba37b94b90aac445e6546426145a5f9eba2):
> > https://github.com/apache/spark/tree/v2.2.3-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.2.3-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1295
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.2.3-rc1-docs/
> >
> > The list of bug fixes going into 2.2.3 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12343560
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks; in Java/Scala
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.2.3?
> > ===
> >
> > The current list of open tickets targeted at 2.2.3 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.2.3
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-08 Thread DB Tsai
+1

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0

On Tue, Jan 8, 2019 at 11:14 AM Dongjoon Hyun  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.2.3.
>
> The vote is open until January 11, 11:30 AM (PST) and passes if a majority of
> +1 PMC votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.2.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.3-rc1 (commit 
> 4acb6ba37b94b90aac445e6546426145a5f9eba2):
> https://github.com/apache/spark/tree/v2.2.3-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.2.3-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1295
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.2.3-rc1-docs/
>
> The list of bug fixes going into 2.2.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12343560
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.2.3?
> ===
>
> The current list of open tickets targeted at 2.2.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.2.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] SPARK 2.2.3 (RC1)

2019-01-08 Thread Dongjoon Hyun
Please vote on releasing the following candidate as Apache Spark version
2.2.3.

The vote is open until January 11, 11:30 AM (PST) and passes if a majority of
+1 PMC votes are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.2.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.3-rc1 (commit
4acb6ba37b94b90aac445e6546426145a5f9eba2):
https://github.com/apache/spark/tree/v2.2.3-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.2.3-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1295

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.2.3-rc1-docs/

The list of bug fixes going into 2.2.3 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12343560

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
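For the Java/Scala path, here is a minimal sbt sketch (the module and Scala version
are just examples; adjust them for your project) that points a build at the staging
repository listed above:

// build.sbt -- a sketch for compiling and testing against the staged 2.2.3 RC1 artifacts.
scalaVersion := "2.11.12"

resolvers += "Apache Spark 2.2.3 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1295/"

// spark-sql is only an example; use whichever Spark modules your project depends on.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.3"

Remember to clean the local artifact cache (e.g. ~/.ivy2 for sbt) before and after,
as noted above.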

===
What should happen to JIRA tickets still targeting 2.2.3?
===

The current list of open tickets targeted at 2.2.3 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.2.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: Spark Scheduler - Task and job levels - How it Works?

2019-01-08 Thread Imran Rashid
Hi Miguel,

On Sun, Jan 6, 2019 at 11:35 AM Miguel F. S. Vasconcelos <
miguel.vasconce...@usp.br> wrote:

> When an action is performed on an RDD, Spark sends it as a job to the
> DAGScheduler;
> The DAGScheduler computes the execution DAG based on the RDD's lineage, and
> splits the job into stages (using wide dependencies);
> The resulting stages are transformed into a set of tasks that are sent to
> the TaskScheduler;
> The TaskScheduler sends the set of tasks to the executors, where they will
> run.
>
> Is this flow correct?
>

yes, more or less, though that's really just the beginning.  Then there is
an endless back-and-forth as executors complete tasks and send info back to
the driver; the driver updates some state, perhaps launches more tasks in the
existing tasksets, creates new ones, or finishes jobs, etc.

> And are the jobs discovered during application execution and sent
> sequentially to the DAGScheduler?
>

yes, there are very specific APIs to create a job -- in the user guide,
these are called "actions".  Most of these are blocking, e.g. when the user
calls rdd.count(), a job is created, potentially with a very long lineage
and many stages, and only when the entire job is completed does rdd.count()
complete.  There are a few async versions (e.g. countAsync()), but from what
I've seen, the more common way to submit multiple concurrent jobs is to use
the regular APIs from multiple threads.
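As a small sketch of that last pattern (assuming an existing SparkContext named sc),
two blocking actions launched from separate threads become two concurrent jobs:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Each rdd.count() is a blocking action that creates one job.
val rddA = sc.parallelize(1 to 1000000).map(_ * 2)
val rddB = sc.parallelize(1 to 1000000).filter(_ % 3 == 0)

// Running the blocking actions on separate threads makes the two jobs concurrent,
// so the scheduler has to decide how their tasks share the executors.
val countA = Future { rddA.count() }
val countB = Future { rddB.count() }

println(Await.result(countA, 10.minutes))
println(Await.result(countB, 10.minutes))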


> Regarding this part, "finds a minimal schedule to run the job", I have not
> found this algorithm for getting the minimal schedule. Can you help me?
>

I think it's just using narrow dependencies and re-using existing cached
data & shuffle data whenever possible.  That's implicit in the DAGScheduler
& RDD code.
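A tiny illustration of that reuse (again assuming an existing SparkContext named sc):
once the shuffle has run and the result is cached, a second job over the same lineage
skips the already-materialized stages.

val pairs   = sc.parallelize(1 to 100000).map(i => (i % 100, i))
val reduced = pairs.reduceByKey(_ + _).cache()  // wide dependency -> shuffle stage

reduced.count()   // job 1: runs the shuffle map stage and the result stage
reduced.collect() // job 2: reuses the cached/shuffle data; the map stage shows up as skipped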


> I'm in doubt whether the scheduling is at the task level, the job level, or
> both. Are these scheduling modes, FIFO and FAIR, for tasks or for jobs?
>

FIFO and FAIR only matter if you've got multiple concurrent jobs.  But then
they control how tasks *from* those jobs are scheduled relative to each
other.  When a job is submitted, the DAGScheduler will still compute the DAG
and the next taskset to run for that job, even if the entire cluster is busy.
But then as resources free up, it needs to decide which job to submit tasks
from.  Note you might have 2 active jobs, with 5 stages ready to submit
tasks, and another 20 stages still waiting for their dependencies to be
computed, further down the pipeline for those jobs.
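For completeness, a small sketch of enabling FAIR mode and tagging one thread's jobs
with a pool (the config keys are the standard ones; the session setup is only an example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("fair-scheduling-sketch")
  .config("spark.scheduler.mode", "FAIR")   // default is FIFO
  .getOrCreate()

// Jobs submitted from this thread are assigned to the "reporting" pool; other
// threads can use other pools, and FAIR mode shares executors between them.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "reporting")
spark.range(1000000L).count()

// Clear the assignment so later jobs from this thread use the default pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)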


> Also, as the TaskScheduler is an interface, it is possible to "plug" different
> scheduling algorithms into it, correct?
>

yes, though Spark itself only has one implementation (I believe Facebook
has their own, not sure of others?).  There is an ExternalClusterManager API
to let users plug in their own.


> But what about the DAGScheduler: is there any interface that allows
> plugging different scheduling algorithms into it?
>

there is no interface currently.

> In the video "Introduction to AmpLab Spark Internals" it's said that
> pluggable inter-job scheduling is a possible future extension. Does anyone
> know if this has already been addressed?
>

don't think so.


> I'm starting a master's degree and I'd really like to contribute to Spark.
> Are there suggestions of issues in the Spark scheduler that could be
> addressed?
>

there is lots to do, but this is tough to answer.  Depends on your
interests in particular.  Also, to be honest, the work that needs to be
done often doesn't align well with the kind of work you need to do for a
research project.  For example, adding existing features into cluster
managers (e.g. Kubernetes), or adding tests & chasing down concurrency bugs
might not interest a student.  OTOH, if you create an entirely different
implementation of the DAGScheduler and do some tests on its properties
under various loads -- that would be really interesting, but it's also
unlikely to get accepted given the quality of code that normally comes from
research projects, and without finding folks in the community who
understand it well and are ready to maintain it.

You can search for jiras with component "Scheduler":
https://issues.apache.org/jira/issues/?jql=component%3Dscheduler%20and%20project%3DSPARK

there was some high-level discussion a while back about a set of changes we
might consider making to the scheduler, particularly for dealing with
failures on large clusters, but this never really picked up steam.  Might be
interesting for a research project:
https://issues.apache.org/jira/browse/SPARK-20178


Re: proposal for expanded & consistent timestamp types

2019-01-08 Thread Zoltan Ivanfi
Hi,

> ORC has long had a timestamp format. If extra attributes are needed on a 
> timestamp, as long as the default "no metadata" value isn't changed, then at 
> the file level things should be OK.
>
> more problematic is: what would happen to an existing app reading in 
> timestamps and ignoring any extra attributes. That way lies trouble

Maybe it would be best if the freshly introduced, more explicit types
were not forwards-compatible. To be more precise, it would be enough
if only the "new" semantics were not forwards-compatible; it is fine
if older readers can read the "already existing" semantics, since that
is what they expect. Of course, this more fine-grained control is only
possible if there is a single "already existing" semantics.
Whether that's the case or not depends on the file format as well.

> Talk to the format groups sooner rather than later

Thanks for the suggestion; I will write a small summary from that
perspective soon and contact the file format groups. I have Avro,
Parquet and ORC in mind. Is there any other file format group I should
contact? I plan to reach out to Arrow and Kudu as well. (Although
strictly speaking these are not file formats, they have their own type
systems as well.)

> What does Arrow do in this world, incidentally?

Arrow has a bit more options than just UTC-normalized or
timezone-agnostic. It supports arbitrary timezones as well:

/// The time zone is a string indicating the name of a time zone [...]
///
/// * If the time zone is null or equal to an empty string, the data is "time
/// zone naive" and shall be displayed *as is* to the user, not localized
/// to the locale of the user. [...]
///
/// * If the time zone is set to a valid value, values can be displayed as
/// "localized" to that time zone, even though the underlying 64-bit
/// integers are identical to the same data stored in UTC. [...]

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L162
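To make the distinction concrete in plain JVM terms (nothing Arrow- or Spark-specific
here), java.time draws the same line between the two semantics:

import java.time.{Instant, LocalDateTime, ZoneId}

// Instant semantics: one fixed point on the time line, rendered differently
// depending on the zone chosen for display.
val instant = Instant.parse("2019-01-08T10:00:00Z")
println(instant.atZone(ZoneId.of("UTC")))                 // 2019-01-08T10:00Z[UTC]
println(instant.atZone(ZoneId.of("America/Los_Angeles"))) // 2019-01-08T02:00-08:00[America/Los_Angeles]

// LocalDateTime semantics: a zone-less wall-clock reading, displayed "as is"
// to every reader, like the "time zone naive" case in the Arrow comment above.
val local = LocalDateTime.parse("2019-01-08T10:00:00")
println(local)                                            // 2019-01-08T10:00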

Br,

Zoltan



On Wed, Jan 2, 2019 at 5:36 PM Steve Loughran  wrote:
>
> OK, I've seen the document now. Probably the best summary of timestamps out 
> there I've ever seen.
>
> Irrespective of what historical stuff has done, the goal should be "make
> everything consistent enough that cut-and-paste SQL queries over the same
> data works" and "you shouldn't have to care about the persistence format *or
> which app created the data*".
>
> What does Arrow do in this world, incidentally?
>
>
> On 2 Jan 2019, at 11:48, Steve Loughran  wrote:
>
>
>
> On 17 Dec 2018, at 17:44, Zoltan Ivanfi  wrote:
>
> Hi,
>
> On Sun, Dec 16, 2018 at 4:43 AM Wenchen Fan  wrote:
>
> Shall we include Parquet and ORC? If they don't support it, it's hard for 
> general query engines like Spark to support it.
>
>
> For each of the more explicit timestamp types we propose a single
> semantics regardless of the file format. Query engines and other
> applications must explicitly support the new semantics, but it is not
> strictly necessary to extend or modify the file formats themselves,
> since users can declare the desired semantics directly in the end-user
> applications:
>
> - In SQL they would do so by using the more explicit timestamp types
> as detailed in the proposal. And since the SQL engines in question
> share the same metastore, users only have to define/update the SQL
> schema once to achieve interoperability in SQL.
>
> - Other applications will have to add support for the different
> semantics, but due to the large number of such applications, we can
> not coordinate all of that effort. Hopefully though, if we add support
> in the three major Hadoop SQL engines, other applications will follow
> suit.
>
> - Spark, specifically, falls into both of the categories mentioned
> above. It supports SQL queries, where it gets the benefit of the SQL
> schemas shared via the metastore. It also supports reading data files
> directly, where the correct timestamp semantics to use would have to
> be declared programmatically by the user/consumer of the API.
>
> That being said, although not strictly necessary, it is beneficial to
> store the semantics in some file-level metadata as well. This allows
> writers to record the intended semantics of timestamps and readers to
> recognize it, so no input is needed from the user when data is
> ingested from or exported to other tools. It will still require
> explicit support from the applications though. Parquet does have such
> metadata about the timestamp semantics: the isAdjustedToUTC field is
> part of the new parametric timestamp logical type. True means Instant
> semantics, while false means LocalDateTime semantics.
>
>
> I support the idea of adding similar metadata to other file formats as
> well, but I consider that to be a second step.
>
>
> ORC has long had a timestamp format. If extra attributes are needed on a 
> timestamp, as long as the default "no metadata" value isn't changed, then at 
> the file level things should be OK.
>
> more problematic is: what would