Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-30 Thread Michael Armbrust
Last call, anything else important in-flight for 2.2?

On Thu, May 25, 2017 at 10:56 AM, Michael Allman 
wrote:

> PR is here: https://github.com/apache/spark/pull/18112
>
>
> On May 25, 2017, at 10:28 AM, Michael Allman  wrote:
>
> Michael,
>
> If you haven't started cutting the new RC, I'm working on a documentation
> PR right now that I'm hoping we can get into Spark 2.2 as a migration note, even
> if it's just a mention: https://issues.apache.org/jira/browse/SPARK-20888.
>
> Michael
>
>
> On May 22, 2017, at 11:39 AM, Michael Armbrust 
> wrote:
>
> I'm waiting for SPARK-20814 at Marcelo's request and I'd also like to
> include SPARK-20844. I think we should be able to cut another RC midweek.
>
> On Fri, May 19, 2017 at 11:53 AM, Nick Pentreath  wrote:
>
>> All the outstanding ML QA doc and user guide items are done for 2.2 so
>> from that side we should be good to cut another RC :)
>>
>>
>> On Thu, 18 May 2017 at 00:18 Russell Spitzer 
>> wrote:
>>
>>> Seeing an issue with DataSourceScanExec and some of our integration tests
>>> for the SCC. Running DataFrame reads and writes from the shell seems fine,
>>> but the redaction code seems to get a "None" when doing
>>> SparkSession.getActiveSession.get in our integration tests. I'm not
>>> sure why, but I'll dig into this later if I get a chance.
>>>
>>> Example failed test:
>>> https://github.com/datastax/spark-cassandra-connector/blob/v2.0.1/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/sql/CassandraSQLSpec.scala#L311
>>>
>>> ```
>>> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.util.NoSuchElementException: None.get
>>> [info] java.util.NoSuchElementException: None.get
>>> [info] at scala.None$.get(Option.scala:347)
>>> [info] at scala.None$.get(Option.scala:345)
>>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
>>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
>>> [info] at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
>>> ```
>>>
>>> Again, this only seems to repro in our IT suite, so I'm not sure whether this
>>> is a real issue.
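
For reference, SparkSession.getActiveSession returns an Option, so any call path in which no active session has been set for the current context will fail with exactly this None.get when .get is called on it, as the redaction code does in the stack trace above. Below is a minimal sketch of that failure mode (an illustration, not code from the connector's test suite; the object name is made up):

```
import org.apache.spark.sql.SparkSession

object ActiveSessionSketch {
  def main(args: Array[String]): Unit = {
    // Before any session exists, the active session is None, so .get would throw
    // java.util.NoSuchElementException: None.get -- the error seen in the test.
    assert(SparkSession.getActiveSession.isEmpty)

    val spark = SparkSession.builder().master("local[*]").appName("active-session").getOrCreate()
    SparkSession.setActiveSession(spark) // explicitly mark the session active for this thread
    assert(SparkSession.getActiveSession.contains(spark))

    // Clearing (or never setting) the active session reproduces the failure mode.
    SparkSession.clearActiveSession()
    assert(SparkSession.getActiveSession.isEmpty)

    spark.stop()
  }
}
```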
>>>
>>>
>>> On Tue, May 16, 2017 at 1:40 PM Joseph Bradley 
>>> wrote:
>>>
 All of the ML/Graph/SparkR QA blocker JIRAs have been resolved.  Thanks
 everyone who helped out on those!

 We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they are
 essentially all for documentation.

 Joseph

 On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin 
 wrote:

> Since you'll be creating a new RC, I'd wait until SPARK-20666 is
> fixed, since the change that caused it is in branch-2.2. Probably a
> good idea to raise it to blocker at this point.
>
> On Thu, May 11, 2017 at 2:59 PM, Michael Armbrust
>  wrote:
> > I'm going to -1 given the outstanding issues and lack of +1s. I'll create
> > another RC once ML has had time to take care of the more critical problems.
> > In the meantime please keep testing this release!
> >
> > On Tue, May 9, 2017 at 2:00 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
> >>
> >> +1 (non-binding)
> >>
> >> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for
> >> core have passed.
> >>
> >> $ java -version
> >> openjdk version "1.8.0_111"
> >> OpenJDK Runtime Environment (build
> >> 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
> >> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
> >> $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 package install
> >> $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
> >> ...
> >> Run completed in 15 minutes, 12 seconds.
> >> Total number of tests run: 1940
> >> Suites: completed 206, aborted 0
> >> Tests: succeeded 1940, failed 0, canceled 4, ignored 8, pending 0
> >> All tests passed.
> >> [INFO] ------------------------------------------------------------------------
> >> [INFO] BUILD SUCCESS
> >> [INFO] ------------------------------------------------------------------------
> >> [INFO] Total time: 16:51 min
> >> [INFO] Finished at: 2017-05-09T17:51:04+09:00
> >> [INFO] Final Memory: 53M/514M
> >> [INFO] ------------------------------------------------------------------------
> >> [WARNING] 

dev-unsubscr...@spark.apache.org

2017-05-30 Thread williamtellme123

From: Georg Heiler [mailto:georg.kf.hei...@gmail.com]
Sent: Monday, May 29, 2017 2:23 PM
To: Spark Dev List 
Subject: Generic datasets implicit encoder missing

Hi,

Does anyone know what is wrong with using a generic type (https://stackoverflow.com/q/44247874/2587904) to construct a Dataset? Even though the implicits are imported, they appear to be missing.

Regards, Georg
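
For context, here is a minimal sketch of the usual pattern for the generic case (an illustration, not code from the thread; the object and method names are made up). import spark.implicits._ supplies encoders only for concrete types known at the call site, so a method that is generic in T typically needs an Encoder[T] context bound (or implicit parameter) rather than relying on the import inside the method body:

```
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

object GenericDatasetSketch {
  // Generic helper: the caller must supply (implicitly) an Encoder[T].
  def toDS[T: Encoder](spark: SparkSession, data: Seq[T]): Dataset[T] =
    spark.createDataset(data)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("generic-ds").getOrCreate()
    import spark.implicits._ // provides encoders for primitives, case classes, etc. at the call site

    val ds = toDS(spark, Seq(1, 2, 3)) // Encoder[Int] is resolved here via the import
    ds.show()

    spark.stop()
  }
}
```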



Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-30 Thread Zoltan Ivanfi
Hi,

If I remember correctly, the TIMESTAMP type had UTC-normalized local time
semantics even before Spark 2, so I can understand that Spark considers it
to be the "established" behavior that must not be broken. Unfortunately,
this behavior does not provide interoperability with other SQL engines of
the Hadoop stack.

Let me summarize the findings of this e-mail thread so far:

   - Timezone-agnostic TIMESTAMP semantics would be beneficial for
   interoperability and SQL compliance.
   - Spark cannot make a breaking change. For backward compatibility with
   existing data, timestamp semantics should be user-configurable on a
   per-table level.

Before going into the specifics of a possible solution, do we all agree on
these points?

Thanks,

Zoltan

On Sat, May 27, 2017 at 8:57 PM Imran Rashid  wrote:

> I had asked Zoltan to bring this discussion to the dev list because I
> think it's a question that extends beyond a single JIRA (we can't figure
> out the semantics of timestamp in Parquet if we don't know the overall goal
> of the timestamp type), and since it's a design question the entire community
> should be involved.
>
> I think that a lot of the confusion comes because we're talking about
> different ways time zones affect behavior: (1) parsing and (2) behavior when
> changing time zones for processing data.
>
> It seems we agree that Spark should eventually provide a timestamp type
> which does conform to the standard. The question is, how do we get
> there? Has Spark already broken compliance so much that it's impossible to
> go back without breaking user behavior? Or perhaps Spark already has
> inconsistent behavior / broken compatibility within the 2.x line, so it's
> not unthinkable to have another breaking change?
>
> (Another part of the confusion is on me -- I believed the behavior change
> was in 2.2, but actually it looks like it's in 2.0.1. That changes how we
> think about this in the context of what goes into a 2.2 release. SPARK-18350
> isn't the origin of the difference in behavior.)
>
> First: consider processing data that is already stored in tables, and then
> accessing it from machines in different time zones. The standard is clear
> that "timestamp" should be just like "timestamp without time zone": it does
> not represent one instant in time; rather, it's always displayed the same,
> regardless of time zone. This was the behavior in Spark 2.0.0 (and 1.6)
> for Hive tables stored as text files, and for Spark's JSON format.
>
> Spark 2.0.1 changed the behavior of the JSON format (I believe
> with SPARK-16216), so that it behaves more like timestamp *with* time
> zone. It also makes CSV behave the same (timestamp in CSV was basically
> broken in 2.0.0). However, it did *not* change the behavior of a Hive
> text file; it still behaves like "timestamp with*out* time zone". Here are
> some experiments I tried -- there are a bunch of files there for
> completeness, but mostly focus on the difference between
> query_output_2_0_0.txt vs. query_output_2_0_1.txt:
>
> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
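
To make the two semantics concrete, here is a minimal sketch (an illustration, not code from the thread or the gist; the output path and app name are made up) using the spark.sql.session.timeZone setting available in Spark 2.2. Under the current instant-based ("UTC-normalized") semantics the rendered wall-clock value shifts with the session time zone, whereas "timestamp without time zone" semantics would always render the value as it was written:

```
import org.apache.spark.sql.SparkSession

object TimestampSemanticsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ts-semantics").getOrCreate()

    // Write a timestamp parsed in a UTC session.
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    spark.sql("SELECT CAST('2017-05-30 12:00:00' AS TIMESTAMP) AS ts")
      .write.mode("overwrite").parquet("/tmp/ts_semantics_demo")

    // Read it back with the session set to a different time zone.
    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    spark.read.parquet("/tmp/ts_semantics_demo")
      .selectExpr("CAST(ts AS STRING) AS rendered")
      .show(false)
    // Instant semantics: the stored instant is rendered in Pacific time (2017-05-30 05:00:00).
    // "Timestamp without time zone" semantics would always render 2017-05-30 12:00:00.

    spark.stop()
  }
}
```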
>
> Given that Spark has changed this behavior post 2.0.0, is it still out of
> the question to change this behavior to bring it back in line with the SQL
> standard for timestamp (without time zone) in the 2.x line? Or, as Reynold
> proposes, is the only option at this point to add an off-by-default feature
> flag to get "timestamp without time zone" semantics?
>
>
> Second, there is the question of parsing strings into the timestamp type.
> I'm far less knowledgeable about this, so I mostly just have questions:
>
> * Does the standard dictate what the parsing behavior should be for
> timestamp (without time zone) when a time zone is present?
>
> * If it does, and Spark violates this standard, is it worth trying to retain
> the *other* semantics of timestamp without time zone, even if we violate
> the parsing part?
>
> I did look at what Postgres does for comparison:
>
> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>
> Spark's timestamp certainly does not match Postgres's timestamp for
> parsing; it seems closer to Postgres's "timestamp with time zone" -- though
> I don't know if that is standard behavior at all.
>
> thanks,
> Imran
>
> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin  wrote:
>
>> That's just my point 4, isn't it?
>>
>>
>> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor 
>> wrote:
>>
>>> Reynold,
>>> my point is that Spark should aim to follow the SQL standard instead of
>>> rolling its own type system.
>>> If I understand correctly, the existing implementation is similar to the
>>> TIMESTAMP WITH LOCAL TIME ZONE data type in Oracle.
>>> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH
>>> TIME ZONE data types, which are missing from Spark.
>>> So, it would be better (for me) if, instead of extending the existing types,
>>> Spark would just implement the additional 

Re: RDD MLLib Deprecation Question

2017-05-30 Thread Nick Pentreath
The short answer is that those distributed linalg parts will not go away.

In the medium term, it's much less likely that the distributed matrix
classes will be ported over to DataFrames (though the ideal would be to
have DataFrame-backed distributed matrix classes), given the time and
effort it's taken just to port the various ML models and feature
transformers over to ML.

The current distributed matrices use the old mllib linear algebra
primitives for their backing data structures and ops, so those will have to
be ported at some point to the ml package vectors and matrices, though I
would expect overall functionality to remain the same initially.

There is https://issues.apache.org/jira/browse/SPARK-15882 that discusses
some of the ideas. The decision would still need to be made on the
higher-level API (whether it remains the same as it is currently, changes to
be DF-based, and/or is changed in other ways, etc.).
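
For readers less familiar with the classes in question, here is a minimal sketch (an illustration, not code from the thread; the object name is made up) of the RDD-based distributed linear algebra being discussed. BlockMatrix lives in org.apache.spark.mllib.linalg.distributed, is built on the old mllib vectors and matrices, and is what currently provides distributed multiplies:

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, IndexedRow, IndexedRowMatrix}
import org.apache.spark.sql.SparkSession

object BlockMatrixSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("block-matrix-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Build a small distributed matrix from an RDD of indexed rows (old mllib vectors).
    val rows = sc.parallelize(Seq(
      IndexedRow(0L, Vectors.dense(1.0, 2.0)),
      IndexedRow(1L, Vectors.dense(3.0, 4.0))
    ))
    val a: BlockMatrix = new IndexedRowMatrix(rows).toBlockMatrix().cache()

    // Distributed multiply: A * A^T -- the functionality the question asks about.
    val product = a.multiply(a.transpose)
    println(product.toLocalMatrix())

    spark.stop()
  }
}
```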

On Tue, 30 May 2017 at 15:33 John Compitello 
wrote:

> Hey all,
>
> I see on the MLlib website that there are plans to deprecate the RDD-based
> API for MLlib once the new ML API reaches feature parity with the RDD-based
> one. Are there currently plans to reimplement all of the distributed linear
> algebra / matrix operations as part of this new API, or are these things
> just going away? For example, will there still be a BlockMatrix class for
> distributed multiplies?
>
> Best,
>
> John
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

