Re: proposal for expanded & consistent timestamp types

2018-12-17 Thread Zoltan Ivanfi
Hi,

On Sun, Dec 16, 2018 at 4:43 AM Wenchen Fan  wrote:

> Shall we include Parquet and ORC? If they don't support it, it's hard for 
> general query engines like Spark to support it.

For each of the more explicit timestamp types we propose a single
semantics regardless of the file format. Query engines and other
applications must explicitly support the new semantics, but it is not
strictly necessary to extend or modify the file formats themselves,
since users can declare the desired semantics directly in the end-user
applications:

- In SQL they would do so by using the more explicit timestamp types
as detailed in the proposal. And since the SQL engines in question
share the same metastore, users only have to define/update the SQL
schema once to achieve interoperability in SQL.

- Other applications will have to add support for the different
semantics, but due to the large number of such applications, we cannot
coordinate all of that effort. Hopefully though, if we add support
in the three major Hadoop SQL engines, other applications will follow
suit.

- Spark, specifically, falls into both of the categories mentioned
above. It supports SQL queries, where it gets the benefit of the SQL
schemas shared via the metastore. It also supports reading data files
directly, where the correct timestamp semantics to use would have to
be declared programmatically by the user/consumer of the API (a
hypothetical sketch follows below).
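
Purely as an illustration, such a programmatic declaration for a direct
file read could look roughly like the sketch below. The option name is
hypothetical and does not exist in Spark today; it only shows the idea:

    # Hypothetical sketch only: "timestampSemantics" is NOT an existing
    # Spark reader option. It merely illustrates how a consumer of the
    # DataFrame API might declare the intended semantics when reading
    # data files directly. The path is a placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ts-semantics-sketch").getOrCreate()

    df = (spark.read
          .option("timestampSemantics", "LocalDateTime")  # hypothetical
          .parquet("/data/events"))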

That being said, although not strictly necessary, it is beneficial to
store the semantics in some file-level metadata as well. This allows
writers to record the intended semantics of timestamps and readers to
recognize it, so no input is needed from the user when data is
ingested from or exported to other tools. It will still require
explicit support from the applications though. Parquet does have such
metadata about the timestamp semantics: the isAdjustedToUTC field is
part of the new parametric timestamp logical type. True means Instant
semantics, while false means LocalDateTime semantics.
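
To make this concrete, here is a rough sketch using pyarrow (assuming a
reasonably recent version; the exact schema output may differ between
versions): a timezone-aware Arrow timestamp column is written with
isAdjustedToUTC=true (Instant semantics), while a naive one is written
with isAdjustedToUTC=false (LocalDateTime semantics).

    # Rough sketch, assuming pyarrow is available; the file path is a
    # placeholder.
    import datetime
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        # Timezone-aware type -> isAdjustedToUTC=true (Instant semantics).
        "event_instant": pa.array(
            [datetime.datetime(2018, 12, 17, 12, 0,
                               tzinfo=datetime.timezone.utc)],
            type=pa.timestamp("us", tz="UTC")),
        # Naive type -> isAdjustedToUTC=false (LocalDateTime semantics).
        "event_local": pa.array(
            [datetime.datetime(2018, 12, 17, 12, 0)],
            type=pa.timestamp("us")),
    })
    pq.write_table(table, "/tmp/timestamp_semantics.parquet")

    # Printing the Parquet schema shows the timestamp logical type of each
    # column, including whether it was adjusted to UTC.
    print(pq.ParquetFile("/tmp/timestamp_semantics.parquet").schema)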

I support the idea of adding similar metadata to other file formats as
well, but I consider that to be a second step. First I would like to
reach an agreement on how different SQL timestamp types should behave.
(Until we follow this up with that second step, file formats with a
single non-parametric timestamp type can store arbitrary semantics too;
users just have to be aware of which timestamp semantics were used when
they create a SQL table over the data or read it in non-SQL
applications. Alternatively, we may limit the new types to file
formats with timestamp semantics metadata and postpone support for
other file formats until such metadata is added to them.)

Br,

Zoltan

>
> On Wed, Dec 12, 2018 at 3:36 AM Li Jin  wrote:
>>
>> Of course. I added some comments in the doc.
>>
>> On Tue, Dec 11, 2018 at 12:01 PM Imran Rashid  wrote:
>>>
>>> Hi Li,
>>>
>>> thanks for the comments!  I admit I had not thought very much about python 
>>> support, it's a good point.  But I'd actually like to clarify one thing 
>>> about the doc -- though it discusses java types, the point is actually 
>>> about having support for these logical types at the SQL level.  The doc 
>>> uses java names instead of SQL names just because there is so much 
>>> confusion around the SQL names, as they haven't been implemented 
>>> consistently.  Once there is support for the additional logical types, then 
>>> we'd absolutely want to get the same support in python.
>>>
>>> It's great to hear there are existing python types we can map each behavior 
>>> to.  Could you add a comment in the doc for each of the types, mentioning 
>>> the equivalent in python?
>>>
>>> thanks,
>>> Imran
>>>
>>> On Fri, Dec 7, 2018 at 1:33 PM Li Jin  wrote:

 Imran,

 Thanks for sharing this. When working on interop between Spark and 
 Pandas/Arrow in the past, we also faced some issues due to the different 
 definitions of timestamp in Spark and Pandas/Arrow, because Spark 
 timestamp has Instant semantics and Pandas/Arrow timestamp has either 
 LocalDateTime or OffsetDateTime semantics. (Detailed discussion is in the 
 PR: https://github.com/apache/spark/pull/18664#issuecomment-316554156.)

 For one, I am excited to see this effort going forward, but I would also 
 love to see Python interop included/considered in the picture. I don't 
 think it adds much to what has already been proposed, because Python 
 timestamps are basically LocalDateTime or OffsetDateTime.
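
 As a rough standard-library illustration of that mapping (the variable
 names below are just examples): a naive datetime behaves like
 LocalDateTime, while an aware datetime with a fixed offset behaves like
 OffsetDateTime.

     # Rough illustration using only the standard library.
     from datetime import datetime, timedelta, timezone

     # Naive datetime: no zone attached, interpretation depends on context
     # (LocalDateTime-like).
     local_like = datetime(2018, 12, 17, 12, 0)

     # Aware datetime with a fixed offset: an unambiguous point in time
     # (OffsetDateTime-like).
     offset_like = datetime(2018, 12, 17, 12, 0,
                            tzinfo=timezone(timedelta(hours=-6)))

     print(local_like.tzinfo)        # None
     print(offset_like.isoformat())  # 2018-12-17T12:00:00-06:00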

 Li



 On Thu, Dec 6, 2018 at 11:03 AM Imran Rashid  wrote:
>
> Hi,
>
> I'd like to discuss the future of timestamp support in Spark, in 
> particular with respect to handling timezones in different SQL types.
> In a nutshell:
>
> * There are at least 3 different ways of handling the timestamp type 
> across timezone changes
> * We'd like Spark to clearly distinguish the 3 types (it currently 

How can I help?

2018-12-17 Thread Raghunadh Madamanchi
Hi,

I am Raghu; I live in Dallas, TX.
I have 15+ years of experience in software development and design using
Java-related technologies, Hadoop, Hive, etc.

I wanted to get involved with this group by contributing my knowledge.
Please let me know if you have something I can start working on.

Regards,
Raghu


[GitHub] HyukjinKwon commented on issue #162: Add a note about Spark build requirement at PySpark testing guide in Developer Tools

2018-12-17 Thread GitBox
HyukjinKwon commented on issue #162: Add a note about Spark build requirement 
at PySpark testing guide in Developer Tools
URL: https://github.com/apache/spark-website/pull/162#issuecomment-448075198
 
 
   adding @cloud-fan and @srowen.





[GitHub] HyukjinKwon opened a new pull request #162: Add a note about Spark build requirement at PySpark testing guide in Developer Tools

2018-12-17 Thread GitBox
HyukjinKwon opened a new pull request #162: Add a note about Spark build 
requirement at PySpark testing guide in Developer Tools
URL: https://github.com/apache/spark-website/pull/162
 
 
   I received some feedback via private emails about running the PySpark 
tests. Unlike SBT or Maven testing, the PySpark testing script requires 
building Apache Spark manually first.
   I also realised that this might be confusing when we think about SBT, 
Maven and PySpark testing together.
   





[GitHub] HyukjinKwon commented on issue #162: Add a note about Spark build requirement at PySpark testing guide in Developer Tools

2018-12-17 Thread GitBox
HyukjinKwon commented on issue #162: Add a note about Spark build requirement 
at PySpark testing guide in Developer Tools
URL: https://github.com/apache/spark-website/pull/162#issuecomment-448075651
 
 
   adding @squito as well FYI





Re: How can I help?

2018-12-17 Thread Hyukjin Kwon
Please take a look at https://spark.apache.org/contributing.html . It
contains virtually all the information needed for contributions.

On Tue, Dec 18, 2018 at 3:54 AM, Raghunadh Madamanchi wrote:

> Hi,
>
> I am Raghu, I live in Dallas,TX.
> Having 15+  years of Experience in Software Development and Design using
> Java related technologies,Hadoop, Hive..etc.
>
> I wanted to get involved with this group by contributing my knowledge.
> Please let me know, if you have something, which i can start working on.
>
> Regards,
> Raghu
>
>


Re: How can I help?

2018-12-17 Thread Raghunadh Madamanchi
Thank you, I will check it out.

On Mon, Dec 17, 2018 at 9:00 PM Hyukjin Kwon  wrote:

> Please take a look for https://spark.apache.org/contributing.html . It
> contains virtually all information it needs for contributions.
>
> On Tue, Dec 18, 2018 at 3:54 AM, Raghunadh Madamanchi <
> mailto.raghun...@gmail.com> wrote:
>
>> Hi,
>>
>> I am Raghu, I live in Dallas,TX.
>> Having 15+  years of Experience in Software Development and Design using
>> Java related technologies,Hadoop, Hive..etc.
>>
>> I wanted to get involved with this group by contributing my knowledge.
>> Please let me know, if you have something, which i can start working on.
>>
>> Regards,
>> Raghu
>>
>>


Why use EMPTY_DATA_SCHEMA when creating a datasource table

2018-12-17 Thread JackyLee
Hi, everyone

I have some questions about creating a datasource table.
In HiveExternalCatalog.createDataSourceTable,
newSparkSQLSpecificMetastoreTable will replace the table schema with
EMPTY_DATA_SCHEMA and table.partitionSchema.
So, why do we use EMPTY_DATA_SCHEMA? Why not declare the schema in another way?
There are a lot of datasource tables that don't have a partitionSchema, so
will their schemas be replaced with EMPTY_DATA_SCHEMA?
Even if Spark itself can parse the table, what if the user views the table
information from the Hive side?

Can anyone help me?
Thanks.



