Spark Issues on ORC

2017-05-26 Thread Dong Joon Hyun
Hi, All.

Today, while looking over the JIRA issues targeted for Spark 2.2.0, I noticed
that there are many unresolved community requests and related efforts around
`Feature parity for ORC with Parquet`.
Some examples I found are listed below; a minimal usage sketch follows the list.
I created SPARK-20901 to organize these, although I may not be in the best
position to do so. Please let me know if this is not the proper way to do this
in the Apache Spark community.
I think we can leverage, or transfer to ORC, the improvements already made for
Parquet in Spark.

SPARK-11412   Support merge schema for ORC
SPARK-12417   Orc bloom filter options are not propagated during file write in spark
SPARK-14286   Empty ORC table join throws exception
SPARK-14387   Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
SPARK-15347   Problem select empty ORC table
SPARK-15474   ORC data source fails to write and read back empty dataframe
SPARK-15682   Hive ORC partition write looks for root hdfs folder for existence
SPARK-15731   orc writer directory permissions
SPARK-15757   Error occurs when using Spark sql "select" statement on orc file …
SPARK-16060   Vectorized Orc reader
SPARK-16628   OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if …
SPARK-17047   Spark 2 cannot create ORC table when CLUSTERED
SPARK-18355   Spark SQL fails to read data from a ORC hive table that has a new column added to it
SPARK-18540   Wholestage code-gen for ORC Hive tables
SPARK-19109   ORC metadata section can sometimes exceed protobuf message size limit
SPARK-19122   Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order
SPARK-19430   Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1
SPARK-19809   NullPointerException on empty ORC file
SPARK-20515   Issue with reading Hive ORC tables having char/varchar columns in Spark SQL
SPARK-20682   Implement new ORC data source based on Apache ORC
SPARK-20728   Make ORCFileFormat configurable between sql/hive and sql/core
SPARK-20799   Unable to infer schema for ORC on reading ORC from S3
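
As a reference point, here is a minimal, illustrative sketch of the ORC
read/write path and the spark.sql.hive.convertMetastoreOrc flag that several of
these issues touch. The path, column name, and the bloom-filter option are
assumptions for illustration only; whether that option is actually propagated
to the ORC writer is exactly what SPARK-12417 tracks.

import org.apache.spark.sql.SparkSession

object OrcParitySketch {
  def main(args: Array[String]): Unit = {
    // Assumes a Spark 2.x build with Hive support on the classpath.
    val spark = SparkSession.builder()
      .appName("OrcParitySketch")
      .enableHiveSupport()
      .getOrCreate()

    // SPARK-14387 / SPARK-16628: controls whether metastore ORC tables are
    // read through Spark's data source path instead of the Hive SerDe path.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

    // SPARK-12417: "orc.bloom.filter.columns" is an ORC writer property; the
    // issue is about whether Spark passes it through on write.
    val df = spark.range(0, 1000).toDF("id")
    df.write
      .option("orc.bloom.filter.columns", "id")
      .mode("overwrite")
      .orc("/tmp/orc_parity_sketch")

    // The read side is where issues such as SPARK-15474 / SPARK-19809 show up.
    spark.read.orc("/tmp/orc_parity_sketch").show(5)

    spark.stop()
  }
}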

Bests,
Dongjoon.


Re: Uploading PySpark 2.1.1 to PyPi

2017-05-26 Thread Xiao Li
Hi, Holden,

That sounds good to me!

Thanks,

Xiao

2017-05-23 16:32 GMT-07:00 Holden Karau :

> An account already exists, the PMC has the info for it. I think we will
> need to wait for the 2.2 artifacts to do the actual PyPI upload because of
> the local version string in 2.1.1, but rest assured this isn't something
> I've lost track of.
>
> On Wed, May 24, 2017 at 12:11 AM Xiao Li  wrote:
>
>> Hi, Holden,
>>
>> Based on the PR, https://github.com/pypa/packaging-problems/issues/90 ,
>> the limit has been increased to 250MB.
>>
>> Just wondering if we can publish PySpark to PyPI now? Have you created
>> the account?
>>
>> Thanks,
>>
>> Xiao Li
>>
>>
>>
>> 2017-05-12 11:35 GMT-07:00 Sameer Agarwal :
>>
>>> Holden,
>>>
>>> Thanks again for pushing this forward! Out of curiosity, did we get an
>>> approval from the PyPi folks?
>>>
>>> Regards,
>>> Sameer
>>>
>>> On Mon, May 8, 2017 at 11:44 PM, Holden Karau 
>>> wrote:
>>>
 So I have a PR to add this to the release process documentation - I'm
 waiting on the necessary approvals from PyPi folks before I merge that
 in case anything changes as a result of the discussion (like uploading to
 the legacy host or something). As for conda-forge, it's not something we
 need to do, but I'll add a note about pinging them when we make a new
 release so their users can keep up to date easily. The parent JIRA for PyPi
 related tasks is SPARK-18267 :)


 On Mon, May 8, 2017 at 6:22 PM cloud0fan  wrote:

> Hi Holden,
>
> Thanks for working on it! Do we have a JIRA ticket to track this? We should
> make it part of the release process for all following Spark releases, and it
> would be great to have a JIRA ticket recording the detailed steps of doing
> this, and even to automate it.
>
> Thanks,
> Wenchen
>
>
>
>
>
>>>
>>>
>>> --
>>> Sameer Agarwal
>>> Software Engineer | Databricks Inc.
>>> http://cs.berkeley.edu/~sameerag
>>>
>>
>> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: New metrics for WindowExec with number of partitions and frames?

2017-05-26 Thread Reynold Xin
That would be useful (number of partitions).

On Fri, May 26, 2017 at 3:24 PM Jacek Laskowski  wrote:

> Hi,
>
> Currently WindowExec gives no metrics in the web UI's Details for Query
> page.
>
> What do you think about adding the number of partitions and frames?
> That could certainly be super useful, but I am unsure whether that's the kind
> of metric Spark SQL shows in the details.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
>


New metrics for WindowExec with number of partitions and frames?

2017-05-26 Thread Jacek Laskowski
Hi,

Currently WindowExec gives no metrics in the web UI's Details for Query page.

What do you think about adding the number of partitions and frames?
That could certainly be super useful, but I am unsure whether that's the kind
of metric Spark SQL shows in the details.
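
For concreteness, here is a small self-contained example (the dataset and names
are made up) of a query whose physical plan contains a WindowExec node; today
that node shows no SQL metrics in the UI, which is what the proposed
partition/frame counters would address.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

object WindowExecMetricsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WindowExecMetricsExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up dataset with two window partitions ("a" and "b").
    val sales = Seq(("a", 1), ("a", 3), ("b", 2)).toDF("key", "value")
    val w = Window.partitionBy("key").orderBy("value")

    // rank() over a window produces a WindowExec node in the physical plan;
    // explain() shows the node, and its SQL-tab entry currently has no metrics.
    sales.withColumn("rank", rank().over(w)).explain()
    sales.withColumn("rank", rank().over(w)).show()

    spark.stop()
  }
}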

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski




Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-26 Thread Reynold Xin
That's just my point 4, isn't it?


On Fri, May 26, 2017 at 1:07 AM, Ofir Manor  wrote:

> Reynold,
> my point is that Spark should aim to follow the SQL standard instead of
> rolling its own type system.
> If I understand correctly, the existing implementation is similar to
> TIMESTAMP WITH LOCAL TIMEZONE data type in Oracle..
> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIMEZONE
> data types which are missing from Spark.
> So, it is better (for me) if instead of extending the existing types,
> Spark would just implement the additional well-defined types properly.
> Just trying to copy-paste CREATE TABLE between SQL engines should not be
> an exercise of flags and incompatibilities.
>
> Regarding the current behaviour, if I remember correctly I had to force
> our Spark O/S user into UTC so that Spark wouldn't change my timestamps.
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin  wrote:
>
>> Zoltan,
>>
>> Thanks for raising this again, although I'm a bit confused, since I've
>> communicated with you a few times on JIRA and in private emails to explain
>> that you have some misunderstanding of the timestamp type in Spark and some
>> of your statements are wrong (e.g. the "except text file" part). Not sure why
>> you didn't get any of those.
>>
>>
>> Here's another try:
>>
>>
>> 1. I think you guys misunderstood the semantics of timestamps in Spark
>> before the session local timezone change. IIUC, Spark has always assumed
>> timestamps to be with timezone, since it parses timestamps with timezone
>> and does all the datetime conversions with timezone in mind (it doesn't
>> ignore the timezone if a timestamp string has one specified); see the sketch
>> after point 4 below. The session local timezone change pushes Spark further
>> in that direction, but the semantics were with timezone before that change.
>> Just run Spark on machines in different timezones and you will know what I'm
>> talking about.
>>
>> 2. CSV/Text is no different. The data type has always been "with
>> timezone". If you put a timezone in the timestamp string, it parses the
>> timezone.
>>
>> 3. We can't change semantics now, because it'd break all existing Spark
>> apps.
>>
>> 4. We can however introduce a new timestamp without timezone type, and
>> have a config flag to specify which one (with tz or without tz) is the
>> default behavior.
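
To make the parse/render behavior in point 1 concrete, here is a minimal
sketch. It is an illustration only: the config name is the session-local
timezone added by SPARK-18350 (spark.sql.session.timeZone in Spark 2.2), and
the values in the comments are what I would expect under those assumptions,
not verified output.

import org.apache.spark.sql.SparkSession

object TimestampSemanticsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TimestampSemanticsSketch")
      .master("local[*]")
      .getOrCreate()

    // A timestamp string with an explicit offset is parsed as an instant (the
    // offset is not ignored), then rendered in the session-local timezone
    // when cast back to a string.
    val q = "SELECT cast(cast('2017-05-26 00:00:00+02:00' as timestamp) as string) AS ts"

    spark.conf.set("spark.sql.session.timeZone", "UTC")
    spark.sql(q).show(false)   // expected: 2017-05-25 22:00:00

    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    spark.sql(q).show(false)   // expected: 2017-05-25 15:00:00 (same instant)

    spark.stop()
  }
}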
>>
>>
>>
>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi  wrote:
>>
>>> Hi,
>>>
>>> Sorry if you receive this mail twice, it seems that my first attempt did
>>> not make it to the list for some reason.
>>>
>>> I would like to start a discussion about SPARK-18350 before it gets
>>> released, because it seems to be going in a different direction than what
>>> other SQL engines of the Hadoop stack do.
>>>
>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT
>>> TIME ZONE) to have timezone-agnostic semantics - basically a type that
>>> expresses readings from calendars and clocks and is unaffected by time
>>> zone. In the Hadoop stack, Impala has always worked like this, and recently
>>> Presto also took steps to become standards compliant. (Presto's design doc
>>> also contains a great summary of the different semantics.) Hive has a
>>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
>>> source of incompatibility that is already being addressed). A TIMESTAMP in
>>> SparkSQL, however, has UTC-normalized local time semantics (except for
>>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
>>> type.
>>>
>>> Given that timezone-agnostic TIMESTAMP semantics provide standards
>>> compliance and consistency with most SQL engines, I was wondering whether
>>> SparkSQL should also consider it in order to become ANSI SQL compliant and
>>> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
>>> adopt these semantics in the future, SPARK-18350 may turn out to be
>>> a source of problems. Please correct me if I'm wrong, but this change seems
>>> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
>>> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
>>> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
>>> better off becoming timezone-agnostic instead of gaining further timezone-aware
>>> capabilities. (Of course, becoming timezone-agnostic would be a behavior
>>> change, so it must be optional and configurable by the user, as in Presto.)
>>>
>>> I would like to hear your opinions about this concern and about
>>> TIMESTAMP