TimSort bug

2018-08-31 Thread Reynold Xin
“As a byproduct of our study, we uncover a bug in the Java implementation
that can cause the sorting method to fail during the execution.”

http://drops.dagstuhl.de/opus/volltexte/2018/9467/

This might impact Spark since we took the Java-based TimSort
implementation. I have seen TimSort fail in the wild in the past; maybe
this is the cause.


Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-31 Thread Reynold Xin
I actually had a similar use case a while ago, but not entirely the same.
In my use case, Spark is already up, but I want to make sure all existing
(and new) executors run some specific code. Can we update the API to
support that? I think that's doable if we split the design into two: one is
the ability to do what I just mentioned, and the second is the ability to
register, via a config class, code to run when Spark starts.
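
For illustration only, a minimal sketch of what such an executor hook could
look like (names, signatures, and the config key are hypothetical, not the
SPIP's final API):

// Hypothetical plugin interface, instantiated on each executor at startup
// (and, for the use case above, also on executors that are already running).
trait ExecutorPlugin {
  def init(): Unit        // run user-supplied setup code on the executor
  def shutdown(): Unit    // clean up when the executor exits
}

// Registration could be config-driven, e.g. a comma-separated list of plugin
// class names that each executor loads via reflection (illustrative only):
// spark.executor.plugins=com.example.MyDebugPlugin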


On Thu, Aug 30, 2018 at 11:01 PM Felix Cheung 
wrote:

> +1
> --
> *From:* Mridul Muralidharan 
> *Sent:* Wednesday, August 29, 2018 1:27:27 PM
> *To:* dev@spark.apache.org
> *Subject:* Re: SPIP: Executor Plugin (SPARK-24918)
>
> +1
> I left a couple of comments in NiharS's PR, but this is very useful to
> have in spark !
>
> Regards,
> Mridul
> On Fri, Aug 3, 2018 at 10:00 AM Imran Rashid
>  wrote:
> >
> > I'd like to propose adding a plugin API for Executors, primarily for
> instrumentation and debugging (
> https://issues.apache.org/jira/browse/SPARK-24918).  The changes are
> small, but as it's adding a new API, it might be SPIP-worthy.  I mentioned
> it as well in a recent email I sent about memory monitoring.
> >
> > The spip proposal is here (and attached to the jira as well):
> https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing
> >
> > There are already some comments on the jira and pr, and I hope to get
> more thoughts and opinions on it.
> >
> > thanks,
> > Imran
>
>
>


data source api v2 refactoring

2018-08-31 Thread Reynold Xin
I spent some time last week looking at the current data source v2 APIs, and
I thought we should be a bit more buttoned up in terms of the abstractions
and the guarantees Spark provides. In particular, I feel we need the
following levels of "abstractions" to fit the use cases in Spark, from
batch to streaming.

Please don't focus on the naming at this stage. When possible, I draw
parallels to what similar levels are named in the currently committed API:

0. Format: This represents a specific format, e.g. Parquet, ORC. There is
currently no explicit class at this level.

1. Table: This should represent a logical dataset (with schema). This could
be just a directory on the file system, or a table in the catalog.
Operations on tables can include batch reads (Scan), streams, writes, and
potentially other operations such as deletes. The closest to the table
level abstraction in the current code base is the "Provider" class,
although Provider isn't quite a Table. This is similar to Ryan's proposed
design.

2. Stream: Specific to streaming. A stream is created out of a Table. This
logically represents an instance of a StreamingQuery. Pushdowns and
options are handled at this layer, i.e. Spark guarantees to the data source
implementation that pushdowns and options don't change within a Stream. Each
Stream consists of a sequence of scans. There is no equivalent concept in
the current committed code.

3. Scan: A physical scan -- either as part of a streaming query, or a batch
query. This should contain sufficient information and methods so we can run
a Spark job over a defined subset of the table. It's functionally
equivalent to an RDD, except there's no dependency on RDD so it is a
smaller surface. In the current code, the equivalent class would be the
ScanConfig, which represents the information needed, but in order to
execute a job, ReadSupport is needed (various methods in ReadSupport take
a ScanConfig).
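
As a rough sketch only (illustrative naming, not the committed API),
interfaces at these four levels might look like this in Scala:

// Placeholder types, just to keep the sketch self-contained.
trait ScanConfig; trait Offset; trait InputPartition

trait Format {                                   // level 0: e.g. Parquet, ORC
  def createTable(options: Map[String, String]): Table
}

trait Table {                                    // level 1: logical dataset with a schema
  def createScan(config: ScanConfig): Scan       // batch read
  def createStream(config: ScanConfig): Stream   // streaming read
}

trait Stream {                                   // level 2: one instance of a StreamingQuery
  def createScan(startOffset: Offset): Scan      // one scan per micro-batch
}

trait Scan {                                     // level 3: a physical scan over a defined subset
  def planPartitions(): Seq[InputPartition]      // enough information to run a Spark job
}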


To illustrate with pseudocode what the different levels mean, a batch query
would look like the following:

val provider = reflection[Format]("parquet")
val table = provider.createTable(options)
val scan = table.createScan(scanConfig) // scanConfig includes pushdown and
options
// run tasks on executors

A streaming micro-batch scan would look like the following:

val provider = reflection[Format]("parquet")
val table = provider.createTable(options)
val stream = table.createStream(scanConfig)

while(true) {
  val scan = stream.createScan(startOffset)
  // run tasks on executors
}


Compared with the current API, the above:

1. Creates an explicit Table abstraction, and an explicit Scan abstraction.

2. Has an explicit Stream level and makes it clear that pushdowns and options
are handled there, rather than at the individual scan (ReadSupport) level.
Data source implementations don't need to worry about pushdowns or options
changing mid-stream. For batch, those happen when the scan object is
created.



This email is just a high level sketch. I've asked Wenchen to prototype
this, to see if it is actually feasible and the degree of hacks it removes,
or creates.


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Reynold Xin
Let's see how they go. At some point we do need to cut the release. That
argument can be made on every feature, and different people place different
value / importance on different features, so we could just end up never
making a release.



On Thu, Aug 30, 2018 at 1:56 PM antonkulaga  wrote:

> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
> coming
> up and we don't need to block Spark 2.4 on this.
>
> I think it may be better to wait a bit for Scala 2.12 support in 2.4 than
> to suffer for many months until Spark 2.5 with 2.12 support is released.
> Scala 2.12 is not only about Spark but also about a lot of Scala libraries
> that have stopped supporting Scala 2.11. If Spark 2.4 does not support
> Scala 2.12, then people will not be able to use them in their Zeppelin,
> Jupyter, and other notebooks together with Spark.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>


Re: python tests: any reason for a huge tests.py?

2018-08-24 Thread Reynold Xin
We should break it.

On Fri, Aug 24, 2018 at 9:53 AM Imran Rashid 
wrote:

> Hi,
>
> another question from looking more at python recently.  Is there any
> reason we've got a ton of tests in one humongous tests.py file, rather than
> breaking it out into smaller files?
>
> Having one huge file doesn't seem great for code organization, and it also
> makes the test parallelization in run-tests.py not work as well.  On my
> laptop, tests.py takes 150s, and the next longest test file takes only 20s.
>
> can we at least try to put new tests into smaller files?
>
> thanks,
> Imran
>


Re: Porting or explicitly linking project style in Apache Spark based on https://github.com/databricks/scala-style-guide

2018-08-23 Thread Reynold Xin
I wrote both the Spark one and later the Databricks one. The latter had a
lot more work put into it and is consistent with the Spark style. I'd just
use the second one and link to it, if possible.



On Thu, Aug 23, 2018 at 6:38 PM Hyukjin Kwon  wrote:

> If you meant "Code Style Guide", many of them are missing and it refers
> https://docs.scala-lang.org/style/ not
> https://github.com/databricks/scala-style-guide (please correct me if I
> misunderstood).
> For instance, I lately guided 2 indents for line continuation but I found
> it's actually not in the official guide (which is rather usual in Spark's
> code base as far as I can tell FWIW).
>
> Can we just simply leave a link there instead?
>
>
> 2018년 8월 24일 (금) 오전 9:35, Matei Zaharia 님이 작성:
>
>> There’s already a code style guide listed on
>> http://spark.apache.org/contributing.html. Maybe it’s the same? We
>> should decide which one we actually want and update this page if it’s wrong.
>>
>> Matei
>>
>> > On Aug 23, 2018, at 6:33 PM, Sean Owen  wrote:
>> >
>> > Seems OK to me. The style is pretty standard Scala style anyway. My
>> guidance is always to follow the code around the code you're changing.
>> >
>> > On Thu, Aug 23, 2018 at 8:14 PM Hyukjin Kwon 
>> wrote:
>> > Hi all,
>> >
>> > I usually follow https://github.com/databricks/scala-style-guide for
>> Apache Spark's style, which is generally the same as Spark's
>> code base in practice.
>> > Thing is, we don't explicitly mention this within Apache Spark as far
>> as I can tell.
>> >
>> > Can we explicitly mention this or port this style guide? It doesn't
>> necessarily mean hard requirements for PRs or code changes but we could at
>> least encourage people to read it.
>> >
>>
>>


Re: Spark DataFrame UNPIVOT feature

2018-08-21 Thread Reynold Xin
Probably just because it is not used that often and nobody has submitted a
patch for it. I've used pivot probably on average once a week (primarily in
spreadsheets), but I've never used unpivot ...


On Tue, Aug 21, 2018 at 3:06 PM Ivan Gozali  wrote:

> Hi there,
>
> I was looking into why the UNPIVOT feature isn't implemented, given that
> Spark already has PIVOT implemented natively in the DataFrame/Dataset API.
>
> Came across this JIRA  which
> talks about implementing PIVOT in Spark 1.6, but no mention whatsoever
> regarding UNPIVOT, even though the JIRA curiously references a blog post
> that talks about both PIVOT and UNPIVOT :)
>
> Is this because UNPIVOT is just simply generating multiple slim tables by
> selecting each column, and making a union out of all of them?
>
> Thank you!
>
> --
> Regards,
>
>
> Ivan Gozali
> Lecida
> Email: i...@lecida.com
>
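
For reference, a minimal sketch of how unpivot can be emulated today
(assuming a DataFrame df with columns id, q1, q2 and an active SparkSession
with spark.implicits._ imported; all names are made up for illustration):

import org.apache.spark.sql.functions.lit

// Option 1: the built-in stack() SQL generator
df.selectExpr("id", "stack(2, 'q1', q1, 'q2', q2) as (quarter, value)")

// Option 2: select each column into a slim table and union them
df.select($"id", lit("q1").as("quarter"), $"q1".as("value"))
  .union(df.select($"id", lit("q2").as("quarter"), $"q2".as("value")))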


Re: [DISCUSS] SparkR support on k8s back-end for Spark 2.4

2018-08-15 Thread Reynold Xin
Personally I'd love for R support to be in 2.4, but I don't consider
something "Done" unless tests are running ... Is the proposal: the release
manager manually run the R tests when preparing the release, and switch
over to fully integrated Jenkins after 2.4.0 is released?

On Wed, Aug 15, 2018 at 2:45 PM Reynold Xin  wrote:

> What's the reason we don't want to do the OS updates right now? Is it due
> to the unpredictability of potential issues that might happen and end up
> delaying 2.4 release?
>
>
> On Wed, Aug 15, 2018 at 2:33 PM Erik Erlandson 
> wrote:
>
>> The SparkR support PR is finished, along with integration testing,
>> however Shane has requested that the integration testing not be enabled
>> until after the 2.4 release because it requires the OS updates he wants to
>> test *after* the release.
>>
>> The integration testing can be run locally, and so the question at hand
>> is: would the PMC be willing to consider inclusion of the SparkR for 2.4,
>> based on local verification of the testing? The PySpark PR was merged under
>> similar circumstances: the testing was verified locally and the PR was
>> merged before the testing was enabled for jenkins.
>>
>> Cheers,
>> Erik
>>
>


Re: [DISCUSS] SparkR support on k8s back-end for Spark 2.4

2018-08-15 Thread Reynold Xin
What's the reason we don't want to do the OS updates right now? Is it due
to the unpredictability of potential issues that might happen and end up
delaying 2.4 release?


On Wed, Aug 15, 2018 at 2:33 PM Erik Erlandson  wrote:

> The SparkR support PR is finished, along with integration testing, however
> Shane has requested that the integration testing not be enabled until after
> the 2.4 release because it requires the OS updates he wants to test *after*
> the release.
>
> The integration testing can be run locally, and so the question at hand
> is: would the PMC be willing to consider inclusion of the SparkR for 2.4,
> based on local verification of the testing? The PySpark PR was merged under
> similar circumstances: the testing was verified locally and the PR was
> merged before the testing was enabled for jenkins.
>
> Cheers,
> Erik
>


Re: Naming policy for packages

2018-08-15 Thread Reynold Xin
craps? :(

On Wed, Aug 15, 2018 at 11:47 AM Koert Kuipers  wrote:

> ok it doesn't sound so bad if the maven identifier can have spark in it. no
> big deal!
>
> otherwise i was going to suggest "kraps". like kraps-xml
>
> scala> "spark".reverse
> res0: String = kraps
>
>
> On Wed, Aug 15, 2018 at 2:43 PM, Sean Owen  wrote:
>
>> I'd refer you again to the trademark policy. In the first link I see
>> projects whose software ID is like "spark-foo" but title/subtitle is like
>> "Foo for Apache Spark". This is OK. 'sparklyr' is in a gray area we've
>> talked about before; see https://www.apache.org/foundation/marks/ as
>> well. I think it's in a gray area, myself.
>>
>> My best advice to anyone is to avoid this entirely by just not naming
>> your project anything like 'spark'.
>>
>> On Wed, Aug 15, 2018 at 10:39 AM <0xf0f...@protonmail.com> wrote:
>>
>>> Does it mean that the majority of Spark-related projects, including top
>>> Databricks (
>>> https://github.com/databricks?utf8=%E2%9C%93=spark==)
>>> or RStudio (sparklyr) contributions, violate the trademark?
>>>
>>>
>>> Sent with ProtonMail  Secure Email.
>>>
>>> ‐‐‐ Original Message ‐‐‐
>>> On August 15, 2018 5:51 PM, Sean Owen  wrote:
>>>
>>> You might be interested in the full policy:
>>> https://spark.apache.org/trademarks.html
>>>
>>> What it is trying to prevent is confusion. Is spark-xml from the Spark
>>> project? Sounds like it, but who knows? What if a vendor releases ASFSpark
>>> 3.0? Are people going to think this is an official real project release?
>>>
>>> You can release 'Foo for Apache Spark'. You can use shorthand like
>>> foo-spark in software identifiers like Maven coordinates.
>>>
>>> Keeping trademark rights is essential in OSS and part of it is making an
>>> effort to assert that right.
>>>
>>>
>>>
>


Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-08-15 Thread Reynold Xin
Sorry I completely disagree with using Expression in critical public APIs
that we expect a lot of developers to use. There's a huge difference
between exposing InternalRow vs Expression. InternalRow is a relatively
small surface (still quite large) that I can see ourselves within a version
getting to a point to make it stable, while Expression is everything in
Spark SQL, including all the internal implementations, referencing logical
plans and physical plans (due to subqueries). They weren't designed as
public APIs, and it is simply not feasible to make them public APIs without
breaking things all the time. I can however see ourselves creating a
smaller scope, parallel public expressions API, similar to what we did for
dsv1.

If we are depending on Expressions on the more common APIs in dsv2 already,
we should revisit that.




On Mon, Aug 13, 2018 at 1:59 PM Ryan Blue  wrote:

> Reynold, did you get a chance to look at my response about using
> `Expression`? I think that it's okay since it is already exposed in the v2
> data source API. Plus, I wouldn't want to block this on building a public
> expression API that is more stable.
>
> I think that's the only objection to this SPIP. Anyone else want to raise
> an issue with the proposal, or is it about time to bring up a vote thread?
>
> rb
>
> On Thu, Jul 26, 2018 at 5:00 PM Ryan Blue  wrote:
>
>> I don’t think that we want to block this work until we have a public and
>> stable Expression. Like our decision to expose InternalRow, I think that
>> while this option isn’t great, it at least allows us to move forward. We
>> can hopefully replace it later.
>>
>> Also note that the use of Expression is in the plug-in API, not in the
>> public API. I think that it is easier to expect data source implementations
>> to handle some instability here. We already use Expression as an option
>> for push-down in DSv2 so there’s precedent for it. Plus, we need to be able
>> to pass more complex expressions between the sources and Spark for sorting
>> and clustering data when it’s written to DSv2 (SPARK-23889
>> <https://issues.apache.org/jira/browse/SPARK-23889>).
>>
>> Simple expressions for bucketing and column-based partitions would almost
>> certainly be stable. We can probably find a trade-off solution to not use
>> Expression in the TableCatalog API, but we definitely need expressions for
>> SPARK-23889.
>>
>> SortOrder would be easier to replace with a more strict class based on
>> only column data rather than expressions. For #21306
>> <https://github.com/apache/spark/pull/21306>, I just left it out
>> entirely. What if I just removed it from the proposal and we can add it
>> later?
>> ​
>>
>> On Thu, Jul 26, 2018 at 4:32 PM Reynold Xin  wrote:
>>
>>> Seems reasonable at high level. I don't think we can use Expression's
>>> and SortOrder's in public APIs though. Those are not meant to be public and
>>> can break easily across versions.
>>>
>>>
>>> On Tue, Jul 24, 2018 at 9:26 AM Ryan Blue 
>>> wrote:
>>>
>>>> The recently adopted SPIP to standardize logical plans requires a way
>>>> to plug in providers for table metadata operations, so that the new
>>>> plans can create and drop tables. I proposed an API to do this in a
>>>> follow-up SPIP on APIs for Table Metadata Operations
>>>> <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#>.
>>>> This thread is to discuss that proposal.
>>>>
>>>> There are two main parts:
>>>>
>>>>- A public facing API for creating, altering, and dropping tables
>>>>- An API for catalog implementations to provide the underlying
>>>>table operations
>>>>
>>>> The main need is for the plug-in API, but I included the public one
>>>> because there isn’t currently a friendly public API to create tables and I
>>>> think it helps to see how both would work together.
>>>>
>>>> Here’s a sample of the proposed public API:
>>>>
>>>> catalog.createTable("db.table")
>>>> .addColumn("id", LongType)
>>>> .addColumn("data", StringType, nullable=true)
>>>> .addColumn("ts", TimestampType)
>>>> .partitionBy(day($"ts"))
>>>> .config("prop", "val")
>>>> .commit()
>>>>
>>>> And here’s a sample of the catalog plug-in API:
>>>>
>>>> Table createTable(
>>>> Tab

Re: Naming policy for packages

2018-08-15 Thread Reynold Xin
Unfortunately that’s an Apache Foundation policy and the Spark community
has no power to change it. My understanding: the reason Spark can’t be in
the name is that, if it is used frequently enough, the foundation would
lose the Spark trademark. Cheers.

On Wed, Aug 15, 2018 at 7:19 AM Simon Dirmeier 
wrote:

> Hey,
> thanks for clearning that up.
> Imho this is somewhat unfortunate, because package names that contain
> "spark", somewhat promote and advertise Apache Spark, right?
>
> Best,
> Simon
>
> Am 15.08.18 um 14:00 schrieb Sean Owen:
>
> You raise a great point, and we were just discussing this. The page is old
> and contains many projects that were listed before the trademarks were
> being enforced. Some have renamed themselves. We will update the page and
> remove stale or noncompliant projects and ask those that need to change to
> do so.
>
> You are correct that the guidance you quote is current and should be
> followed.
>
> Note there is an exception for software identifiers.
>
> On Wed, Aug 15, 2018, 6:13 AM Simon Dirmeier 
> wrote:
>
>> Dear all,
>>
>> I am currently developing two OSS extension packages for spark; one
>> related to machine learning; one related to biological applications.
>> According to the trademark guidelines (
>> https://spark.apache.org/trademarks.html) I am not allowed to use
>> *Names derived from “Spark”, such as “sparkly”. *
>> My question is if that is really the case or how stringent these
>> guidelines are, given that so many spark packages (
>> https://spark.apache.org/third-party-projects.html) contain Spark as
>> name already. I already contacted the official email for questions like
>> these, but didn't hear back until now.
>>
>> Can anyone please shed light on this?
>> Thanks in advance!
>>
>> Best,
>> Simon
>>
>
>


Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread Reynold Xin
SGTM

On Fri, Aug 10, 2018 at 1:39 PM shane knapp  wrote:

> https://issues.apache.org/jira/browse/SPARK-25089
>
> basically since these branches are old, and there will be a greater than
> zero amount of work to get lint-r to pass (on the new ubuntu workers), sean
> and i are proposing to remove the lint-r checks for the builds.
>
> this is super not important for the 2.4 cut/code freeze, but i wanted to
> get this done before it gets pushed down my queue and before we revisit the
> ubuntu port.
>
> thanks in advance,
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Reynold Xin
I actually totally agree that we should make sure it has no impact
on existing code if the feature is not used.


On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson  wrote:

> I don't have a comprehensive knowledge of the project hydrogen PRs,
> however I've perused them, and they make substantial modifications to
> Spark's core DAG scheduler code.
>
> What I'm wondering is: how high is the confidence level that the
> "traditional" code paths are still stable. Put another way, is it even
> possible to "turn off" or "opt out" of this experimental feature? This
> analogy isn't perfect, but for example the k8s back-end is a major body of
> code, but it has a very small impact on any *core* code paths, and so if
> you opt out of it, it is well understood that you aren't running any
> experimental code.
>
> Looking at the project hydrogen code, I'm less sure the same is true.
> However, maybe there is a clear way to show how it is true.
>
>
> On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra 
> wrote:
>
>> No reasonable amount of time is likely going to be sufficient to fully
>> vet the code as a PR. I'm not entirely happy with the design and code as
>> they currently are (and I'm still trying to find the time to more publicly
>> express my thoughts and concerns), but I'm fine with them going into 2.4
>> much as they are as long as they go in with proper stability annotations
>> and are understood not to be cast-in-stone final implementations, but
>> rather as a way to get people using them and generating the feedback that
>> is necessary to get us to something more like a final design and
>> implementation.
>>
>> On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson 
>> wrote:
>>
>>>
>>> Barrier mode seems like a high impact feature on Spark's core code: is
>>> one additional week enough time to properly vet this feature?
>>>
>>> On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
>>> joseph.tor...@databricks.com> wrote:
>>>
 Full continuous processing aggregation support ran into unanticipated
 scalability and scheduling problems. We’re planning to overcome those by
 using some of the barrier execution machinery, but since barrier execution
 itself is still in progress the full support isn’t going to make it into
 2.4.

 Jose

 On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <
 tomasz.gaw...@outlook.com> wrote:

> Hi,
>
> what is the status of Continuous Processing + Aggregations? As far as
> I
> remember, Jose Torres said it should  be easy to perform aggregations
> if
> coalesce(1) work. IIRC it's already merged to master.
>
> Is this work in progress? If yes, it would be great to have full
> aggregation/join support in Spark 2.4 in CP.
>
> Pozdrawiam / Best regards,
>
> Tomek
>
>
> On 2018-07-31 10:43, Petar Zečević wrote:
> > This one is important to us:
> https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
> inner range optimization) but I think it could be useful to others too.
> >
> > It is finished and is ready to be merged (was ready a month ago at
> least).
> >
> > Do you think you could consider including it in 2.4?
> >
> > Petar
> >
> >
> > Wenchen Fan @ 1970-01-01 01:00 CET:
> >
> >> I went through the open JIRA tickets and here is a list that we
> should consider for Spark 2.4:
> >>
> >> High Priority:
> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> >> This one is critical to the Spark ecosystem for deep learning. It
> only has a few remaining works and I think we should have it in Spark 2.4.
> >>
> >> Middle Priority:
> >> SPARK-23899: Built-in SQL Function Improvement
> >> We've already added a lot of built-in functions in this release,
> but there are a few useful higher-order functions in progress, like
> `array_except`, `transform`, etc. It would be great if we can get them in
> Spark 2.4.
> >>
> >> SPARK-14220: Build and test Spark against Scala 2.12
> >> Very close to finishing, great to have it in Spark 2.4.
> >>
> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> >> This one is there for years (thanks for your patience Michael!),
> and is also close to finishing. Great to have it in 2.4.
> >>
> >> SPARK-24882: data source v2 API improvement
> >> This is to improve the data source v2 API based on what we learned
> during this release. From the migration of existing sources and design of
> new features, we found some problems in the API and want to address them. 
> I
> believe this should be
> >> the last significant API change to data source v2, so great to have
> in Spark 2.4. I'll send a discuss email about it later.
> >>
> >> SPARK-24252: Add catalog support in Data Source V2
> >> This is a very important feature for data source v2, and is

Re: Review notification bot

2018-07-30 Thread Reynold Xin
I like the idea of this bot, but I'm somewhat annoyed by it. I have touched
a lot of files and wrote a lot of the original code, so every day I wake up
to a lot of emails from this bot.

Also if we are going to use this, can we rename the bot to something like
spark-bot, rather than holden's personal bot?

On Mon, Jul 30, 2018 at 10:18 PM Hyukjin Kwon  wrote:

> > That being said the folks being pinged are not just committers.
>
> I doubt it because the only pinged ones I see are all committers, and that's
> why I assumed the pinging is based on who committed the PR (which implies
> committers only).
> Do you maybe have some examples where non-committers were pinged? It looks
> like at least (almost?) all of them are committers, and something needs to
> be fixed even if so.
>
> I recently argued about pinging before - it sounds like it matters if it
> annoys people. Since pinging is completely optional, and cc'ing someone else
> might need context beyond the blame and who committed the PR, I am actually
> not super happy with that pinging for now. I was slightly supportive of this
> idea, but I have now become slightly negative on it after observing how it
> goes in practice.
>
> I wonder how other people think on this.
>
>
>
> 2018년 7월 31일 (화) 오후 12:33, Holden Karau 님이 작성:
>
>> So CODEOWNERS is limited to committers by GitHub. We can definitely
>> modify the config file though and I'm happy to write some custom logic if
>> it helps support our needs. We can also just turn it off if it's too noisey
>> for folks in general.
>>
>> That being said the folks being pinged are not just committers. The hope
>> is to get more code authors who aren't committers involved in the reviews
>> and then eventually become committers.
>>
>> On Mon, Jul 30, 2018, 9:09 PM Hyukjin Kwon  wrote:
>>
>>> *reviewers: I mean people who committed the PR given my observation.
>>>
>>> 2018년 7월 31일 (화) 오전 11:50, Hyukjin Kwon 님이 작성:
>>>
 I was wondering if we can leave the configuration open and accept some
 custom configurations, IMHO, because I saw some people less related or less
 active are consistently pinged. Just started to get worried if they get
 annoyed by this.
 Also, some people could be interested in few specific areas. They
 should get pinged too.
 Also, assuming from people pinged, seems they are reviewers (which
 basically means committers I guess). Was wondering if there's a big
 difference between codeowners and bots.



 2018년 7월 31일 (화) 오전 11:38, Holden Karau 님이 작성:

> Th configuration file is optional, is there something you want to try
> and change?
>
> On Mon, Jul 30, 2018 at 7:30 PM Hyukjin Kwon 
> wrote:
>
>> I see. Thanks. I was wondering if I can see the configuration file
>> since that looks needed (
>> https://github.com/holdenk/mention-bot#configuration) but I couldn't
>> find (sorry if it's just something I simply missed).
>>
>> 2018년 7월 31일 (화) 오전 1:48, Holden Karau 님이 작성:
>>
>>> So the one that is running is the the form in my own repo (set up
>>> for K8s deployment) - http://github.com/holdenk/mention-bot
>>>
>>> On Mon, Jul 30, 2018 at 3:15 AM Hyukjin Kwon 
>>> wrote:
>>>
 Holden, so, is it a fork in
 https://github.com/facebookarchive/mention-bot? Would you mind if
 I ask where I can see the configurations for it?


 2018년 7월 23일 (월) 오전 10:16, Holden Karau 님이
 작성:

> Yeah so the issue with codeowners is it will only assign to
> committers on the repo (the Beam project found this out the practical
> application way).
>
> I have a fork of mention bot running and it seems we can add it
> (need an infra ticket), but one of the things the Beam folks asked 
> was to
> not ping code authors who haven’t committed in the past year which I 
> need
> to do a bit of poking on to make happen.
>
> On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> On this topic, I just stumbled on a GitHub feature called
>> CODEOWNERS .
>> It lets you specify owners of specific areas of the repository using 
>> the
>> same syntax that .gitignore uses. Here is CPython's CODEOWNERS
>> file
>> 
>> for reference.
>>
>> Dunno if that would complement mention-bot (which Facebook is
>> apparently no longer maintaining
>> ), or if
>> we can even use it given the ASF setup on GitHub. But I thought it 
>> would be
>> worth mentioning nonetheless.
>>
>> On Sat, Jul 14, 2018 at 11:17 AM 

Re: Why percentile and distinct are not done in one job?

2018-07-30 Thread Reynold Xin
Which API are you talking about?

On Mon, Jul 30, 2018 at 7:03 AM 吴晓菊  wrote:

> I noticed that in column analysis, 2 jobs will run separately to
> calculate percentiles and then distinct counts. Why not combine them into one
> job, since HyperLogLog also supports merge?
>
> Chrysan Wu
> Phone:+86 17717640807
>
>


Re: [Spark SQL] Future of CalendarInterval

2018-07-27 Thread Reynold Xin
CalendarInterval is definitely externally visible.

E.g. sql("select interval 1 day").dtypes would return "Array[(String,
String)] = Array((interval 1 days,CalendarIntervalType))"

However, I'm not sure what it means to support casting. What are the
semantics for casting from any other data type to calendar interval? I can
see string casting and casting from itself, but not any other data types.




On Fri, Jul 27, 2018 at 10:34 AM Daniel Mateus Pires 
wrote:

> Hi Sparkers! (maybe Sparkles ?)
>
> I just wanted to bring up the apparently "controversial" Calendar Interval
> topic.
>
> I worked on: https://issues.apache.org/jira/browse/SPARK-24702,
> https://github.com/apache/spark/pull/21706
>
> The user was reporting an unexpected behaviour where he/she wasn’t able to
> cast to a Calendar Interval type.
>
> In the current version of Spark the following code works:
>
> scala> spark.sql("SELECT 'interval 1 hour' as 
> a").select(col("a").cast("calendarinterval")).show()++|   
> a|++|interval 1 hours|++
>
>
> While the following doesn’t:
> spark.sql("SELECT CALENDARINTERVAL('interval 1 hour') as a").show()
>
>
> Since the DataFrame API equivalent of the SQL worked, I thought adding it
> would be an easy decision to make (to make it consistent)
>
> However, I got push-back on the PR on the basis that “*we do not plan to
> expose Calendar Interval as a public type*”
> Should there be a consensus on either removing CalendarIntervalType from the
> public DataFrame API OR making it consistent with the SQL?
>
> --
> Best regards,
> Daniel Mateus Pires
> Data Engineer @ Hudson's Bay Company
>


Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-07-26 Thread Reynold Xin
Seems reasonable at high level. I don't think we can use Expression's and
SortOrder's in public APIs though. Those are not meant to be public and can
break easily across versions.


On Tue, Jul 24, 2018 at 9:26 AM Ryan Blue  wrote:

> The recently adopted SPIP to standardize logical plans requires a way
> to plug in providers for table metadata operations, so that the new plans
> can create and drop tables. I proposed an API to do this in a follow-up SPIP
> on APIs for Table Metadata Operations
> .
> This thread is to discuss that proposal.
>
> There are two main parts:
>
>- A public facing API for creating, altering, and dropping tables
>- An API for catalog implementations to provide the underlying table
>operations
>
> The main need is for the plug-in API, but I included the public one
> because there isn’t currently a friendly public API to create tables and I
> think it helps to see how both would work together.
>
> Here’s a sample of the proposed public API:
>
> catalog.createTable("db.table")
> .addColumn("id", LongType)
> .addColumn("data", StringType, nullable=true)
> .addColumn("ts", TimestampType)
> .partitionBy(day($"ts"))
> .config("prop", "val")
> .commit()
>
> And here’s a sample of the catalog plug-in API:
>
> Table createTable(
> TableIdentifier ident,
> StructType schema,
> List<Expression> partitions,
> Optional<List<SortOrder>> sortOrder,
> Map<String, String> properties)
>
> Note that this API passes both bucketing and column-based partitioning as
> Expressions. This is a generalization that makes it possible for the table
> to use the relationship between columns and partitions. In the example
> above, data is partitioned by the day of the timestamp field. Because the
> expression is passed to the table, the table can use predicates on the
> timestamp to filter out partitions without an explicit partition predicate.
> There’s more detail in the proposal on this.
>
> The SPIP is for the APIs and does not cover how multiple catalogs would be
> exposed. I started a separate discussion thread on how to access multiple
> catalogs and maintain compatibility with Spark’s current behavior (how to
> get the catalog instance in the above example).
>
> Please use this thread to discuss the proposed APIs. Thanks, everyone!
>
> rb
> ​
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS][SQL] Control the number of output files

2018-07-26 Thread Reynold Xin
John,

You want to create a ticket and submit a patch for this? If there is a
coalesce hint, inject a coalesce logical node. Pretty simple.
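
A minimal sketch of what this could look like (the hint syntax is the
proposal from this thread, not something already committed; df, src, and out
are made-up names):

// What SQL users would write with the proposed hint:
spark.sql("INSERT OVERWRITE TABLE out SELECT /*+ COALESCE(10, false) */ * FROM src")

// The DataFrame equivalents that exist today:
df.coalesce(10).write.insertInto("out")        // reduce output files without a shuffle
df.repartition(200).write.insertInto("out")    // change parallelism with a shuffle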


On Wed, Jul 25, 2018 at 2:48 PM John Zhuge  wrote:

> Thanks for the comment, Forest. What I am asking is to make whatever DF
> repartition/coalesce functionality there is available to SQL users.
>
> Agree with you on that reducing the final number of output files by file
> size is very nice to have. Lukas indicated this is planned.
>
> On Wed, Jul 25, 2018 at 2:31 PM Forest Fang 
> wrote:
>
>> Sorry I see https://issues.apache.org/jira/browse/SPARK-6221 was
>> referenced in John's email. Can you elaborate on how your requirement is
>> different? In my experience, it usually is driven by the need to decrease
>> the final output parallelism without compromising compute parallelism (i.e.
>> to prevent too many small files to be persisted on HDFS.) The requirement
>> in my experience is often pretty ballpark and does not require precise
>> number of partitions. Therefore setting the desired output size to say
>> 32-64mb usually gives a good enough result. I'm curious why 6221 was marked
>> as won't fix.
>>
>> On Wed, Jul 25, 2018 at 2:26 PM Forest Fang 
>> wrote:
>>
>>> Has there been any discussion to simply support Hive's merge small files
>>> configuration? It simply adds one additional stage to inspect size of each
>>> output file, recompute the desired parallelism to reach a target size, and
>>> runs a map-only coalesce before committing the final files. Since AFAIK
>>> SparkSQL already stages the final output commit, it seems feasible to
>>> respect this Hive config.
>>>
>>>
>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>
>>>
>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra 
>>> wrote:
>>>
>>>> See some of the related discussion under
>>>> https://github.com/apache/spark/pull/21589
>>>>
>>>> If feels to me like we need some kind of user code mechanism to signal
>>>> policy preferences to Spark. This could also include ways to signal
>>>> scheduling policy, which could include things like scheduling pool and/or
>>>> barrier scheduling. Some of those scheduling policies operate at inherently
>>>> different levels currently -- e.g. scheduling pools at the Job level
>>>> (really, the thread local level in the current implementation) and barrier
>>>> scheduling at the Stage level -- so it is not completely obvious how to
>>>> unify all of these policy options/preferences/mechanism, or whether it is
>>>> possible, but I think it is worth considering such things at a fairly high
>>>> level of abstraction and try to unify and simplify before making things
>>>> more complex with multiple policy mechanisms.
>>>>
>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin 
>>>> wrote:
>>>>
>>>>> Seems like a good idea in general. Do other systems have similar
>>>>> concepts? In general it'd be easier if we can follow existing convention 
>>>>> if
>>>>> there is any.
>>>>>
>>>>>
>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge  wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>> number of output files in Spark SQL. There are use cases to either reduce
>>>>>> or increase the number. The users prefer not to use function
>>>>>> *repartition*(n) or *coalesce*(n, shuffle) that require them to
>>>>>> write and deploy Scala/Java/Python code.
>>>>>>
>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>> Broadcast Join Hints)?
>>>>>>
>>>>>> /*+ *COALESCE*(n, shuffle) */
>>>>>>
>>>>>> In general, is a query hint the best way to bring DF functionality
>>>>>> to SQL without extending SQL syntax? Any suggestion is highly 
>>>>>> appreciated.
>>>>>>
>>>>>> This requirement is not the same as SPARK-6221 that asked for
>>>>>> auto-merging output files.
>>>>>>
>>>>>> Thanks,
>>>>>> John Zhuge
>>>>>>
>>>>>
>
> --
> John Zhuge
>


Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Reynold Xin
Seems like a good idea in general. Do other systems have similar concepts?
In general it'd be easier if we can follow existing convention if there is
any.


On Wed, Jul 25, 2018 at 11:50 AM John Zhuge  wrote:

> Hi all,
>
> Many Spark users in my company are asking for a way to control the number
> of output files in Spark SQL. There are use cases to either reduce or
> increase the number. The users prefer not to use function *repartition*(n)
> or *coalesce*(n, shuffle) that require them to write and deploy
> Scala/Java/Python code.
>
> Could we introduce a query hint for this purpose (similar to Broadcast
> Join Hints)?
>
> /*+ *COALESCE*(n, shuffle) */
>
> In general, is a query hint the best way to bring DF functionality to SQL
> without extending SQL syntax? Any suggestion is highly appreciated.
>
> This requirement is not the same as SPARK-6221 that asked for auto-merging
> output files.
>
> Thanks,
> John Zhuge
>


Re: [SPARK-24865] Remove AnalysisBarrier

2018-07-19 Thread Reynold Xin
A work-in-progress PR: https://github.com/apache/spark/pull/21822

The PR also adds the infrastructure to throw exceptions in test mode when
the various transform methods are used as part of analysis. Unfortunately
there are a couple of edge cases that do need that, and as a result there is
this ugly bypassTransformAnalyzerCheck method.




On Thu, Jul 19, 2018 at 2:52 PM Reynold Xin  wrote:

> We have had multiple bugs introduced by AnalysisBarrier. In hindsight I
> think the original design before analysis barrier was much simpler and
> required less developer knowledge of the infrastructure.
>
> As long as analysis barrier is there, developers writing various code in
> analyzer will have to be aware of this special node and we are bound to
> have more bugs in the future due to people not considering it.
>
>
> Filed this JIRA ticket: https://issues.apache.org/jira/browse/SPARK-24865
>
>
>
> AnalysisBarrier was introduced in SPARK-20392
> <https://issues.apache.org/jira/browse/SPARK-20392> to improve analysis
> speed (don't re-analyze nodes that have already been analyzed).
>
> Before AnalysisBarrier, we already had some infrastructure in place, with
> analysis specific functions (resolveOperators and resolveExpressions).
> These functions do not recursively traverse down subplans that are already
> analyzed (with a mutable boolean flag _analyzed). The issue with the old
> system was that developers started using transformDown, which does a
> top-down traversal of the plan tree, because there was no top-down
> resolution function, and as a result analyzer performance became pretty bad.
>
> In order to fix the issue in SPARK-20392
> <https://issues.apache.org/jira/browse/SPARK-20392>, AnalysisBarrier was
> introduced as a special node and for this special node,
> transform/transformUp/transformDown don't traverse down. However, the
> introduction of this special node caused a lot more troubles than it
> solves. This implicit node breaks assumptions and code in a few places, and
> it's hard to know when analysis barrier would exist, and when it wouldn't.
> Just a simple search of AnalysisBarrier in PR discussions demonstrates it
> is a source of bugs and additional complexity.
>
> Instead, I think a much simpler fix to the original issue is to introduce
> resolveOperatorsDown, and change all places that call transformDown in the
> analyzer to use that. We can also ban accidental uses of the various
> transform* methods by using a linter (which can only lint specific
> packages), or in test mode inspect the stack trace and fail explicitly if
> transform* are called in the analyzer.
>
>
>
>
>
> On Thu, Jul 19, 2018 at 11:41 AM Xiao Li  wrote:
>
>> dfWithUDF.cache()
>> dfWithUDF.write.saveAsTable("t")
>> dfWithUDF.write.saveAsTable("t1")
>>
>>
>> Cached data is not being used. It causes a big performance regression.
>>
>>
>>
>>
>> 2018-07-19 11:32 GMT-07:00 Sean Owen :
>>
>>> What regression are you referring to here? A -1 vote really needs a
>>> rationale.
>>>
>>> On Thu, Jul 19, 2018 at 1:27 PM Xiao Li  wrote:
>>>
>>>> I would first vote -1.
>>>>
>>>> I might find another regression caused by the analysis barrier. Will
>>>> keep you posted.
>>>>
>>>>
>>


Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Reynold Xin
Looking at the list of pull requests it looks like this is the ticket:
https://issues.apache.org/jira/browse/SPARK-24867



On Thu, Jul 19, 2018 at 5:25 PM Reynold Xin  wrote:

> I don't think my ticket should block this release. It's a big general
> refactoring.
>
> Xiao do you have a ticket for the bug you found?
>
>
> On Thu, Jul 19, 2018 at 5:24 PM Saisai Shao 
> wrote:
>
>> Hi Xiao,
>>
>> Are you referring to this JIRA (
>> https://issues.apache.org/jira/browse/SPARK-24865)?
>>
>> Xiao Li  于2018年7月20日周五 上午2:41写道:
>>
>>> dfWithUDF.cache()
>>> dfWithUDF.write.saveAsTable("t")
>>> dfWithUDF.write.saveAsTable("t1")
>>>
>>>
>>> Cached data is not being used. It causes a big performance regression.
>>>
>>>
>>>
>>>
>>> 2018-07-19 11:32 GMT-07:00 Sean Owen :
>>>
>>>> What regression are you referring to here? A -1 vote really needs a
>>>> rationale.
>>>>
>>>> On Thu, Jul 19, 2018 at 1:27 PM Xiao Li  wrote:
>>>>
>>>>> I would first vote -1.
>>>>>
>>>>> I might find another regression caused by the analysis barrier. Will
>>>>> keep you posted.
>>>>>
>>>>>
>>>


Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Reynold Xin
I don't think my ticket should block this release. It's a big general
refactoring.

Xiao do you have a ticket for the bug you found?


On Thu, Jul 19, 2018 at 5:24 PM Saisai Shao  wrote:

> Hi Xiao,
>
> Are you referring to this JIRA (
> https://issues.apache.org/jira/browse/SPARK-24865)?
>
> Xiao Li  于2018年7月20日周五 上午2:41写道:
>
>> dfWithUDF.cache()
>> dfWithUDF.write.saveAsTable("t")
>> dfWithUDF.write.saveAsTable("t1")
>>
>>
>> Cached data is not being used. It causes a big performance regression.
>>
>>
>>
>>
>> 2018-07-19 11:32 GMT-07:00 Sean Owen :
>>
>>> What regression are you referring to here? A -1 vote really needs a
>>> rationale.
>>>
>>> On Thu, Jul 19, 2018 at 1:27 PM Xiao Li  wrote:
>>>
 I would first vote -1.

 I might find another regression caused by the analysis barrier. Will
 keep you posted.


>>


[SPARK-24865] Remove AnalysisBarrier

2018-07-19 Thread Reynold Xin
We have had multiple bugs introduced by AnalysisBarrier. In hindsight I
think the original design before analysis barrier was much simpler and
required less developer knowledge of the infrastructure.

As long as analysis barrier is there, developers writing various code in
analyzer will have to be aware of this special node and we are bound to
have more bugs in the future due to people not considering it.


Filed this JIRA ticket: https://issues.apache.org/jira/browse/SPARK-24865



AnalysisBarrier was introduced in SPARK-20392
 to improve analysis
speed (don't re-analyze nodes that have already been analyzed).

Before AnalysisBarrier, we already had some infrastructure in place, with
analysis specific functions (resolveOperators and resolveExpressions).
These functions do not recursively traverse down subplans that are already
analyzed (with a mutable boolean flag _analyzed). The issue with the old
system was that developers started using transformDown, which does a
top-down traversal of the plan tree, because there was no top-down
resolution function, and as a result analyzer performance became pretty bad.

In order to fix the issue in SPARK-20392
, AnalysisBarrier was
introduced as a special node and for this special node,
transform/transformUp/transformDown don't traverse down. However, the
introduction of this special node caused a lot more troubles than it
solves. This implicit node breaks assumptions and code in a few places, and
it's hard to know when analysis barrier would exist, and when it wouldn't.
Just a simple search of AnalysisBarrier in PR discussions demonstrates it
is a source of bugs and additional complexity.

Instead, I think a much simpler fix to the original issue is to introduce
resolveOperatorsDown, and change all places that call transformDown in the
analyzer to use that. We can also ban accidental uses of the various
transform* methods by using a linter (which can only lint specific
packages), or in test mode inspect the stack trace and fail explicitly if
transform* are called in the analyzer.
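
As a toy model of the idea (not Spark's actual classes), a resolution-only
top-down transform can simply skip subtrees that are already flagged as
analyzed, which removes the need for a barrier node:

// Toy plan node with an "analyzed" flag, standing in for LogicalPlan.
case class Plan(name: String, children: Seq[Plan] = Nil, analyzed: Boolean = false)

def resolveOperatorsDown(plan: Plan)(rule: Plan => Plan): Plan = {
  if (plan.analyzed) {
    plan                                   // already analyzed: do not traverse or re-resolve
  } else {
    val resolved = rule(plan)              // apply the rule top-down first
    resolved.copy(children = resolved.children.map(resolveOperatorsDown(_)(rule)))
  }
}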





On Thu, Jul 19, 2018 at 11:41 AM Xiao Li  wrote:

> dfWithUDF.cache()
> dfWithUDF.write.saveAsTable("t")
> dfWithUDF.write.saveAsTable("t1")
>
>
> Cached data is not being used. It causes a big performance regression.
>
>
>
>
> 2018-07-19 11:32 GMT-07:00 Sean Owen :
>
>> What regression are you referring to here? A -1 vote really needs a
>> rationale.
>>
>> On Thu, Jul 19, 2018 at 1:27 PM Xiao Li  wrote:
>>
>>> I would first vote -1.
>>>
>>> I might find another regression caused by the analysis barrier. Will
>>> keep you posted.
>>>
>>>
>


Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-18 Thread Reynold Xin
+1 on this, on the condition that we can come up with a design that will
remove the existing plans.


On Tue, Jul 17, 2018 at 11:00 AM Ryan Blue  wrote:

> Hi everyone,
>
> From discussion on the proposal doc and the discussion thread, I think we
> have consensus around the plan to standardize logical write operations for
> DataSourceV2. I would like to call a vote on the proposal.
>
> The proposal doc is here: SPIP: Standardize SQL logical plans
> 
> .
>
> This vote is for the plan in that doc. The related SPIP with APIs to
> create/alter/drop tables will be a separate vote.
>
> Please vote in the next 72 hours:
>
> [+1]: Spark should adopt the SPIP
> [-1]: Spark should not adopt the SPIP because . . .
>
> Thanks for voting, everyone!
>
> --
> Ryan Blue
>


Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-15 Thread Reynold Xin
Makes sense. Thanks for looking into this.

On Sun, Jul 15, 2018 at 1:51 PM Sean Owen  wrote:

> Yesterday I cleaned out old Spark releases from the mirror system -- we're
> supposed to only keep the latest release from active branches out on
> mirrors. (All releases are available from the Apache archive site.)
>
> Having done so I realized quickly that the
> HiveExternalCatalogVersionsSuite relies on the versions it downloads being
> available from mirrors. It has been flaky, as sometimes mirrors are
> unreliable. I think now it will not work for any versions except 2.3.1,
> 2.2.2, 2.1.3.
>
> Because we do need to clean those releases out of the mirrors soon anyway,
> and because they're flaky sometimes, I propose adding logic to the test to
> fall back on downloading from the Apache archive site.
>
> ... and I'll do that right away to unblock
> HiveExternalCatalogVersionsSuite runs. I think it needs to be backported to
> other branches as they will still be testing against potentially
> non-current Spark releases.
>
> Sean
>


Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Reynold Xin
Yes I would just reuse the same function.

On Sun, Jul 8, 2018 at 5:01 AM Li Jin  wrote:

> Hi Linar,
>
> This seems useful. But perhaps reusing the same function name is better?
>
>
> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame
>
> Currently createDataFrame takes an RDD of any kind of SQL data
> representation(e.g. row, tuple, int, boolean, etc.), or list, or
> pandas.DataFrame.
>
> Perhaps we can support taking an RDD of *pandas.DataFrame* as the "data"
> arg too?
>
> What do other people think?
>
> Li
>
> On Sun, Jul 8, 2018 at 1:13 PM, Linar Savion 
> wrote:
>
>> We've created a snippet that creates a Spark DF from an RDD of many pandas
>> DFs in a distributed manner that does not require the driver to collect the
>> entire dataset.
>>
>> Early tests show a performance improvement of 6x-10x over using
>> pandasDF->Rows->sparkDF.
>>
>> I've seen that there are some open pull requests that change the way
>> arrow serialization works. Should I open a pull request to add this
>> functionality to SparkSession? (`createFromPandasDataframesRDD`)
>>
>> https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
>>
>> Thanks,
>> Linar
>>
>
>


Re: [DESIGN] Barrier Execution Mode

2018-07-08 Thread Reynold Xin
Xingbo,

Please reference the SPIP and JIRA ticket next time: [SPARK-24374] SPIP:
Support Barrier Scheduling in Apache Spark

On Sun, Jul 8, 2018 at 9:45 AM Xingbo Jiang  wrote:

> Hi All,
>
> I would like to invite you to review the design document for Barrier
> Execution Mode:
>
> https://docs.google.com/document/d/1GvcYR6ZFto3dOnjfLjZMtTezX0W5VYN9w1l4-tQXaZk/edit#
>
> TL;DR: We announced the project Hydrogen on recent Spark+AI Summit, a
> major part of the project involves significant changes to execution mode of
> Spark. This design doc proposes new APIs as well as new execution mode
> (known as barrier execution mode) to provide high-performance support for
> DL workloads.
>
> Major changes include:
>
>- Add RDDBarrier to support gang scheduling.
>- Add BarrierTaskContext to support global sync of all tasks in a
>stage;
>- Better fault tolerance approach for barrier stage, that in case some
>tasks fail in the middle, retry all tasks in the same stage.
>- Integrate barrier execution mode with Standalone cluster manager.
>
> Please feel free to review and discuss on the design proposal.
>
> Thanks,
> Xingbo
>
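
A rough usage sketch of the proposed APIs from the design doc (assuming an
existing SparkContext sc; names follow the doc but are illustrative only):

val rdd = sc.parallelize(1 to 100, numSlices = 4)

rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()   // proposed context available inside a barrier stage
  // ... set up / hook into the distributed DL framework on every task ...
  ctx.barrier()                        // global sync: waits for all tasks in the stage
  iter
}.collect()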
>


code freeze and branch cut for Apache Spark 2.4

2018-07-06 Thread Reynold Xin
FYI, the 6-month mark since the last release is coming up soon. We will cut the
branch and code freeze on Aug 1st in order to get 2.4 out on time.


Re: Beam's recent community development work

2018-07-02 Thread Reynold Xin
That's fair, and it's great to find high-quality contributors. But I also
feel the two projects have very different backgrounds and maturity phases.
There are 1300+ contributors to Spark, and only 300 to Beam, with the vast
majority of contributions coming from a single company for Beam (based on
my cursory look at the two pages of commits on github). With the recent
security and correctness storms, I actually worry more about quality (which
requires more infrastructure) than about just people adding more code to the
project.



On Mon, Jul 2, 2018 at 5:25 PM Holden Karau  wrote:

> As someone who floats a bit between both projects (as a contributor) I'd
> love to see us adopt some of these techniques to be pro-active about
> growing our committer-ship (I think perhaps we could do this by also moving
> some of the newer committers into the PMC faster so there are more eyes out
> looking for people to bring forward)?
>
> On Mon, Jul 2, 2018 at 4:54 PM, Sean Owen  wrote:
>
>> Worth, I think, a read and consideration from Spark folks. I'd be
>> interested in comments; I have a few reactions too.
>>
>>
>> -- Forwarded message -
>> From: Kenneth Knowles 
>> Date: Sat, Jun 30, 2018 at 1:15 AM
>> Subject: Beam's recent community development work
>> To: , , Griselda Cuevas <
>> g...@apache.org>, dev 
>>
>>
>> Hi all,
>>
>> The ASF board suggested that we (Beam) share some of what we've been
>> doing for community development with d...@community.apache.org and
>> memb...@apache.org. So here is a long description. I have included
>> d...@beam.apache.org because it is the subject, really, and this is &
>> should be all public knowledge.
>>
>> We would love feedback! We based a lot of this on reading the community
>> project site, and probably could have learned even more with more study.
>>
>> # Background
>>
>> We face two problems in our contributor/committer-base:
>>
>> 1. Not enough committers to review all the code being contributed, in
>> part due to recent departure of a few committers
>> 2. We want our contributor-base (hence committer-base) to be more spread
>> across companies and backgrounds, for the usual Apache reasons. Our user
>> base is not active and varied enough to make this automatic. One solution
>> is to make the right software to get a varied user base, but that is a
>> different thread :-) so instead we have to work hard to build our community
>> around the software we have.
>>
>> # What we did
>>
>> ## Committer guidelines
>>
>> We published committer guidelines [1] for transparency and as an
>> invitation. We start by emphasizing that there are many kinds of
>> contributions, not just code (we have committers from community
>> development, tech writing, training, etc). Then we have three aspects:
>>
>> 1. ASF code of conduct
>> 2. ASF committer responsibilities
>> 3. Beam-specific committer responsibilities
>>
>> The best way to understand is to follow the link at the bottom of this
>> email. The important part is that you shouldn't be proposing a committer
>> for other reasons, and you shouldn't be blocking a committer for other
>> reasons.
>>
>> ## Instead of just "[DISCUSS] Potential committer XYZ" we discuss every
>> layer
>>
>> Gris (CC'd) outlined this: people go through these phases of relationship
>> with our project:
>>
>> 1. aware of it
>> 2. interested in it / checking it out
>> 3. using it for real
>> 4. first-time contributor
>> 5. repeat contributor
>> 6. committer
>> 7. PMC
>>
>> As soon as we notice someone, like a user asking really deep questions,
>> we invite discussion on private@ on how we can move them to the next
>> level of engagement.
>>
>> ## Monthly cadence
>>
>> Every ~month, we call for new discussions and revisit ~all prior
>> discussions. This way we do not forget to keep up this effort.
>>
>> ## Individual discussions
>>
>> For each person we have a separate thread on private@. This ensures we
>> have quality focused discussions that lead to feedback. In collective
>> discussions that we used to do, we often didn't really come up with
>> actionable feedback and ended up not even contacting potential committers
>> to encourage them. And consensus was much less clear.
>>
>> ## Feedback!
>>
>> If someone is brought up for a discussion, that means they got enough
>> attention that we hope to engage them more. But unsolicited feedback is
>> never a good idea. For a potential committer, we did this:
>>
>> 1. Send an email saying something like "you were discussed as a potential
>> committer - do you want to become one? do you want feedback?"
>> 2. If they say yes (so far everyone) we send a few bullet points from the
>> discussion and *most important* tie each bullet to the committer
>> guidelines. If we have no feedback about which guidelines were a concern,
>> that is a red flag that we are being driven by bias.
>>
>> We saw a *very* significant increase in engagement from those we sent
>> feedback to, and the trend is that they almost all will become 

Re: Feature request: Java-specific transform method in Dataset

2018-07-01 Thread Reynold Xin
This wouldn’t be a problem with Scala 2.12 right?

On Sun, Jul 1, 2018 at 12:23 PM Sean Owen  wrote:

> I see, transform() doesn't have the same overload that other methods do in
> order to support Java 8 lambdas as you'd expect. One option is to introduce
> something like MapFunction for transform and introduce an overload.
>
> I think transform() isn't used much at all, so maybe that's why it wasn't
> Java-fied. Before Java 8 it wouldn't have made much sense in Java. Now it
> might. I think it could be OK to add the overload to match how map works.
>
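
A rough sketch of the kind of overload being discussed, with hypothetical names
(DatasetTransform, transformJ) and written here as an extension method rather than
an actual change to Dataset, so it can be tried as-is:

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Single-abstract-method interface a Java 8 lambda could target,
    // analogous to the existing MapFunction.
    trait DatasetTransform[T, U] extends Serializable {
      def call(ds: Dataset[T]): Dataset[U]
    }

    object TransformOverloadSketch {
      implicit class JavaFriendlyTransform[T](ds: Dataset[T]) {
        def transformJ[U](f: DatasetTransform[T, U]): Dataset[U] = f.call(ds)
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
        import spark.implicits._
        val doubled = Seq(1, 2, 3).toDS().transformJ(new DatasetTransform[Int, Int] {
          def call(ds: Dataset[Int]): Dataset[Int] = ds.map(_ * 2)
        })
        doubled.show()
        spark.stop()
      }
    }
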
> On Sun, Jul 1, 2018 at 1:33 PM Ismael Carnales 
> wrote:
>
>> No, because Function1 from Scala is not a functional interface.
>> You can see a simple example of what I'm trying to accomplish In the unit
>> test here:
>>
>> https://github.com/void/spark/blob/java-transform/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java#L73
>>
>>
>> On Sun, Jul 1, 2018 at 2:48 PM Sean Owen  wrote:
>>
>>> Don't Java 8 lambdas let you do this pretty immediately? Can you give an
>>> example here of what you want to do and how you are trying to do it?
>>>
>>> On Sun, Jul 1, 2018, 12:42 PM Ismael Carnales 
>>> wrote:
>>>
 Hi,
  it would be nice to have an easier way to use the Dataset transform
 method from Java than implementing a Function1 from Scala.

 I've made a simple implementation here:

 https://github.com/void/spark/tree/java-transform

 Should I open a JIRA?

 Ismael Carnales

>>>


Re: LICENSE and NOTICE file content

2018-06-21 Thread Reynold Xin
Thanks Justin. Can you submit a pull request?

On Thu, Jun 21, 2018 at 8:10 PM Justin Mclean 
wrote:

> Hi,
>
> We’ve recently had a number of incubating projects copy your LICENSE and
> NOTICE files as they see Spark as a popular project and they are a little
> sad when the IPMC votes -1 on their releases.
>
> Now I'm not on your PMC, don’t know your projects history and there may be
> valid reasons for the current LICENSE and NOTICE contents so take this as
> some friendly advice, you can choose to ignore it or not act on it. Looking
> at your latest source release (2.3.1), I can see there seems to be too much
> information in LICENSE and especially NOTICE for a source release. It may
> be that the LICENSE and NOTICE are intended for the binary release? [1] But
> even if that is the case, they also seem to be missing a couple of licenses
> for bundled software.
>
> But in general my alarm bells start ringing because:
> - Category B licenses are listed (which shouldn't be in a source release)
> - License information is listed in NOTICE when it should be in LICENSE
> - Dependencies are listed rather than what is actually bundled
>
> Taking a look at the release I can see this 3rd party code bundled:
>
> MIT licensed (some is dual licensed):
> dagre-d3
> datatables
> jquery cookies
> SortTable
> Modernizr
> matchMedia polyfill*
> respond*
> dataTables bootstrap*
> jQuery
> jQuery datatables*
> graphlib-dot
> jquery block UI
> anchorJS
> jsonFormatter
>
> Apache licensed:
> vis.js*
> bootstrap*
> bootstrap-tooltip*
> toposort.py*
> TimSort*
> LimitedInputStream.java*
>
> BSD licensed:
> d3
> cloudpickle
> join*
>
> Python licensed
> heapq3
>
> CC0 licensed:
> ./data/mllib/images/kittens/29.5.a_b_EGDP022204.jpg*
>
> * Are currently missing from license
>
> So that would end up with a number of licenses in LICENSE but nothing
> added to a boilerplate NOTICE file. The ALv2-licensed items don’t have
> NOTICE files, so there's no impact there. I could of course have missed
> something and could be wrong for a number of reasons but I cannot see how
> the above makes the NOTICE file 667 lines long :-)
>
> I also noticed some compiled code in the source release which probably
> shouldn’t be there. [2]
> spark-2.3.1/core/src/test/resources/TestUDTF.jar
> spark-2.3.1/sql/hive/src/test/resources/SPARK-21101-1.0.jar
> spark-2.3.1/sql/hive/src/test/resources/TestUDTF.jar
> spark-2.3.1/sql/hive/src/test/resources/hive-contrib-0.13.1.jar
>
> spark-2.3.1/sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar
> spark-2.3.1/sql/hive/src/test/resources/data/files/TestSerDe.jar
>
> spark-2.3.1/sql/hive/src/test/resources/regression-test-SPARK-8489/test-2.10.jar
>
> spark-2.3.1/sql/hive/src/test/resources/regression-test-SPARK-8489/test-2.11.jar
> spark-2.3.1/sql/hive-thriftserver/src/test/resources/TestUDTF.jar
>
> Thanks,
> Justin
>
> PS please cc me on replies as I’m not subscribed to your mailing list
>
> 1. http://www.apache.org/dev/licensing-howto.html#binary
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: What about additional support on deeply nested data?

2018-06-20 Thread Reynold Xin
Seems like you are also looking for transform and reduce for arrays?

https://issues.apache.org/jira/browse/SPARK-23908

https://issues.apache.org/jira/browse/SPARK-23911
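
As a sketch of what those two would allow once they land (this assumes a Spark
version that already has the transform/aggregate array functions and an existing
SparkSession named spark; the column names are made up for illustration):

    import org.apache.spark.sql.functions.expr
    import spark.implicits._

    val traj = Seq(("t1", Seq(10.0, 20.0, 30.0))).toDF("id", "speeds")

    traj.select(
      // per-element map over the array, no explode / group by needed
      expr("transform(speeds, s -> s * 3.6) AS speeds_kmh"),
      // reduce over the array to get an average per trajectory
      expr("aggregate(speeds, CAST(0 AS DOUBLE), (acc, s) -> acc + s) / size(speeds) AS avg_speed")
    ).show(false)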

On Wed, Jun 20, 2018 at 10:43 AM bobotu  wrote:

> I store some trajectories data in parquet with this schema:
>
> create table traj (
>   id     string,
>   points array<struct<
>     lat:    double,
>     lng:    double,
>     time:   bigint,
>     speed:  double,
>     ... lots of attributes here
>     candidate_roads: array<struct<linestring: string, score: double>>
>   >>
> )
>
> It contains a lot of attributes that come from sensors. It also has a nested
> array which contains information generated by the map-matching algorithm.
>
> All of the algorithms I run on this dataset are trajectory-oriented, which
> means they often iterate over points and use a subset of each point's
> attributes to do some computation. With this schema I can get the points of
> a trajectory without doing `group by` and `collect_list`.
>
> Because Parquet works very well with deeply nested data, I store it directly
> in Parquet format without flattening.
> It works very well with Impala, because Impala has some special support for
> nested data:
>
> select
>   id,
>   avg_speed
> from
>   traj t,
>   (select avg(speed) avg_speed from t.points where time < '2018-06-19')
>
> As you can see, Impala treats an array of structs as a table nested in each
> row, and can do computation on array elements at a per-row level. And Impala
> will use Parquet's features to prune unused attributes in the point struct.
>
> I use Spark for some complex algorithms which cannot be written in pure SQL.
> But I run into some trouble with the Spark DataFrame API:
> 1. Spark cannot do schema pruning and filter push-down on nested columns.
> And it seems like there is no handy syntax to work with deeply nested data.
> 2. `explode` does not help in my scenario, because I need to preserve the
> trajectory-points hierarchy. If I use `explode` here, I need to do an extra
> `group by` on `id`.
> 3. Although I can directly select `points.lat`, it loses its structure. If I
> need an array of (lat, lng) pairs, I need to zip two arrays. And it does not
> work at deeper nesting levels, such as selecting
> `points.candidate_road.score`.
> 4. Maybe I can use parquet-mr to read the files as an RDD, set the read
> schema and push down filters. But this approach loses Hive integration and
> the table abstraction.
>
> So I think it would be nice to have some additional support for nested data.
> Maybe an Impala-style subquery syntax on complex data, or something like a
> schema-projection function on nested data like:
>
> select id, extract(points, lat, lng, extract(candidate_road, score)) from
> traj
>
> which produces a schema like:
>
> |- id string
> |- points array of struct
>     |- lat double
>     |- lng double
>     |- candidate_road array of struct
>         |- score double
>
> And the user can work with points using the desired schema, with data
> pruning in Parquet.
>
> Or is there some existing syntax that can do this already?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: time for Apache Spark 3.0?

2018-06-15 Thread Reynold Xin
Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.


On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
wrote:

> I agree, I dont see pressing need for major version bump as well.
>
>
> Regards,
> Mridul
> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
> wrote:
> >
> > Changing major version numbers is not about new features or a vague
> notion that it is time to do something that will be seen to be a
> significant release. It is about breaking stable public APIs.
> >
> > I still remain unconvinced that the next version can't be 2.4.0.
> >
> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
> >>
> >> Dear all:
> >>
> >> It has been 2 months since this topic was proposed. Any progress
> >> now? 2018 is already about half over.
> >>
> >> I agree that the new version should bring some exciting new features.
> How about this one:
> >>
> >> 6. ML/DL framework to be integrated as core component and feature.
> (Such as Angel / BigDL / ……)
> >>
> >> 3.0 is a very important version for a good open source project. It
> >> would be better to shed the historical burden and focus on new areas.
> Spark has been widely used all over the world as a successful big data
> framework. And it can be better than that.
> >>
> >> Andy
> >>
> >>
> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:
> >>>
> >>> There was a discussion thread on scala-contributors about Apache Spark
> not yet supporting Scala 2.12, and that got me to think perhaps it is about
> time for Spark to work towards the 3.0 release. By the time it comes out,
> it will be more than 2 years since Spark 2.0.
> >>>
> >>> For contributors less familiar with Spark’s history, I want to give
> more context on Spark releases:
> >>>
> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016.
> If we were to maintain the ~ 2 year cadence, it is time to work on Spark
> 3.0 in 2018.
> >>>
> >>> 2. Spark’s versioning policy promises that Spark does not break stable
> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
> 2.0, 2.x to 3.0).
> >>>
> >>> 3. That said, a major version isn’t necessarily the playground for
> disruptive API changes to make it painful for users to update. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs.
> >>>
> >>> 4. Spark as a project has a culture of evolving architecture and
> developing major new features incrementally, so major releases are not the
> only time for exciting new features. For example, the bulk of the work in
> the move towards the DataFrame API was done in Spark 1.3, and Continuous
> Processing was introduced in Spark 2.3. Both were feature releases rather
> than major releases.
> >>>
> >>>
> >>> You can find more background in the thread discussing Spark 2.0:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
> >>>
> >>>
> >>> The primary motivating factor IMO for a major version bump is to
> support Scala 2.12, which requires minor API breaking changes to Spark’s
> APIs. Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for consideration):
> >>>
> >>> 1. Support Scala 2.12.
> >>>
> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 2.x.
> >>>
> >>> 3. Shade all dependencies.
> >>>
> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
> compliant, to prevent users from shooting themselves in the foot, e.g.
> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
> less painful for users to upgrade here, I’d suggest creating a flag for
> backward compatibility mode.
> >>>
> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
> standard compliant, and have a flag for backward compatibility.
> >>>
> >>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
> “JavaPairRDD flatMapValues requires function returning Iterable, not
> Iterator”, “Prevent column name duplication in temp

Re: [VOTE] SPIP ML Pipelines in R

2018-06-14 Thread Reynold Xin
+1 on the proposal.


On Fri, Jun 1, 2018 at 8:17 PM Hossein  wrote:

> Hi Shivaram,
>
> We converged on a CRAN release process that seems identical to current
> SparkR.
>
> --Hossein
>
> On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Hossein -- Can you clarify what the resolution on the repository /
>> release issue discussed on SPIP ?
>>
>> Shivaram
>>
>> On Thu, May 31, 2018 at 9:06 AM, Felix Cheung 
>> wrote:
>> > +1
>> > With my concerns in the SPIP discussion.
>> >
>> > 
>> > From: Hossein 
>> > Sent: Wednesday, May 30, 2018 2:03:03 PM
>> > To: dev@spark.apache.org
>> > Subject: [VOTE] SPIP ML Pipelines in R
>> >
>> > Hi,
>> >
>> > I started discussion thread for a new R package to expose MLlib
>> pipelines in
>> > R.
>> >
>> > To summarize we will work on utilities to generate R wrappers for MLlib
>> > pipeline API for a new R package. This will lower the burden for
>> exposing
>> > new API in future.
>> >
>> > Following the SPIP process, I am proposing the SPIP for a vote.
>> >
>> > +1: Let's go ahead and implement the SPIP.
>> > +0: Don't really care.
>> > -1: I do not think this is a good idea for the following reasons.
>> >
>> > Thanks,
>> > --Hossein
>>
>
>


Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Reynold Xin
The behavior change is not good...

On Thu, Jun 14, 2018 at 9:05 AM Li Jin  wrote:

> Ah, looks like it's this change:
>
> https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5
>
> It seems strange that by default Spark doesn't build with Hive but by
> default PySpark requires it...
>
> This might also be a behavior change to PySpark users that build Spark
> without Hive. The old behavior is "fall back to non-hive support" and the
> new behavior is "program won't start".
>
> On Thu, Jun 14, 2018 at 11:51 AM, Sean Owen  wrote:
>
>> I think you would have to build with the 'hive' profile? but if so that
>> would have been true for a while now.
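
For anyone hitting this: the build that includes the Hive classes is the one with
the hive profiles enabled. Command sketches, assuming the standard profile names:

    ./build/sbt -Phive -Phive-thriftserver package
    # or with Maven:
    ./build/mvn -Phive -Phive-thriftserver -DskipTests package
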
>>
>>
>> On Thu, Jun 14, 2018 at 10:38 AM Li Jin  wrote:
>>
>>> Hey all,
>>>
>>> I just did a clean checkout of github.com/apache/spark but failed to
>>> start PySpark, this is what I did:
>>>
>>> git clone g...@github.com:apache/spark.git; cd spark; build/sbt package;
>>> bin/pyspark
>>>
>>> And got this exception:
>>>
>>> (spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
>>>
>>> Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31)
>>>
>>> [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
>>>
>>> Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> 18/06/14 11:34:14 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>>
>>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>>
>>> Setting default log level to "WARN".
>>>
>>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>>> setLogLevel(newLevel).
>>>
>>> /Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py:45:
>>> UserWarning: Failed to initialize Spark session.
>>>
>>>   warnings.warn("Failed to initialize Spark session.")
>>>
>>> Traceback (most recent call last):
>>>
>>>   File
>>> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py", line
>>> 41, in <module>
>>>
>>> spark = SparkSession._create_shell_session()
>>>
>>>   File
>>> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/sql/session.py",
>>> line 564, in _create_shell_session
>>>
>>> SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
>>>
>>> TypeError: 'JavaPackage' object is not callable
>>>
>>> I also tried to delete hadoop deps from my ivy2 cache and reinstall them
>>> but no luck. I wonder:
>>>
>>>
>>>1. I have not seen this before, could this be caused by recent
>>>change to head?
>>>2. Am I doing something wrong in the build process?
>>>
>>>
>>> Thanks much!
>>> Li
>>>
>>>
>


Re: Optimizer rule ConvertToLocalRelation causes expressions to be eager-evaluated in Planning phase

2018-06-08 Thread Reynold Xin
But from the user's perspective, optimization is not run right? So it is
still lazy.
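
The distinction can be checked from the shell with the `df` defined in the quoted
example below: analysis stays lazy, and only inspecting the optimized plan fires the
rule. The config at the end is an assumption, since the rule-exclusion escape hatch
only exists in later releases:

    println(df.queryExecution.analyzed)        // lazy: does not print "UDF evaled"
    println(df.queryExecution.optimizedPlan)   // "UDF evaled": ConvertToLocalRelation fires here

    // Later Spark versions can switch the rule off entirely:
    spark.conf.set("spark.sql.optimizer.excludedRules",
      "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")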


On Fri, Jun 8, 2018 at 12:35 PM Li Jin  wrote:

> Hi All,
>
> Sorry for the long email title. I am a bit surprised to find that the
> current optimizer rule "ConvertToLocalRelation" causes expressions to be
> eager-evaluated in planning phase, this can be demonstrated with the
> following code:
>
> scala> val myUDF = udf((x: String) => { println("UDF evaled"); "result" })
>
> myUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
> UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
>
>
> scala> val df = Seq(("test", 1)).toDF("s", "i").select(myUDF($"s"))
>
> df: org.apache.spark.sql.DataFrame = [UDF(s): string]
>
>
> scala> println(df.queryExecution.optimizedPlan)
>
> UDF evaled
>
> LocalRelation [UDF(s)#9]
>
>  This is somewhat unexpected to me because of Spark's lazy execution model.
>
> I am wondering if this behavior is by design?
>
> Thanks!
> Li
>
>
>


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Reynold Xin
+1

On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.3.1.
>
> Given that I expect at least a few people to be busy with Spark Summit next
> week, I'm taking the liberty of setting an extended voting period. The vote
> will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
>
> It passes with a majority of +1 votes, which must include at least 3 +1
> votes
> from the PMC.
>
> [ ] +1 Release this package as Apache Spark 2.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
> https://github.com/apache/spark/tree/v2.3.1-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1272/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
>
> The list of bug fixes going into 2.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.1?
> ===
>
> The current list of open tickets targeted at 2.3.1 can be found at:
> https://s.apache.org/Q3Uo
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.3.1 (RC3)

2018-06-01 Thread Reynold Xin
Yes, everybody please cc the release manager on changes that merit a -1. It's
high overhead, so let's make this smoother.


On Fri, Jun 1, 2018 at 1:28 PM Marcelo Vanzin  wrote:

> Xiao,
>
> This is the third time in this release cycle that this is happening.
> Sorry to single you guys out, but can you please do a few things:
>
> - do not merge things in 2.3 you're not absolutely sure about
> - make sure that things you backport to 2.3 are not causing problems
> - let the RM know about these things as soon as you discover them, not
> when they send the next RC for voting.
>
> Even though I was in the middle of preparing the rc, I could have
> easily aborted that and skipped this whole thread.
>
> This vote is canceled. I'll prepare a new RC right away. I hope this
> does not happen again.
>
>
> On Fri, Jun 1, 2018 at 1:20 PM, Xiao Li  wrote:
> > Sorry, I need to say -1
> >
> > This morning, just found a regression in 2.3.1 and reverted
> > https://github.com/apache/spark/pull/21443
> >
> > Xiao
> >
> > 2018-06-01 13:09 GMT-07:00 Marcelo Vanzin :
> >>
> >> Please vote on releasing the following candidate as Apache Spark version
> >> 2.3.1.
> >>
> >> Given that I expect at least a few people to be busy with Spark Summit
> >> next
> >> week, I'm taking the liberty of setting an extended voting period. The
> >> vote
> >> will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
> >>
> >> It passes with a majority of +1 votes, which must include at least 3 +1
> >> votes
> >> from the PMC.
> >>
> >> [ ] +1 Release this package as Apache Spark 2.3.1
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v2.3.1-rc3 (commit 1cc5f68b):
> >> https://github.com/apache/spark/tree/v2.3.1-rc3
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc3-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1271/
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc3-docs/
> >>
> >> The list of bug fixes going into 2.3.1 can be found at the following
> URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/12342432
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >>
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark you can set up a virtual env and install
> >> the current RC and see if anything important breaks, in the Java/Scala
> >> you can add the staging repository to your projects resolvers and test
> >> with the RC (make sure to clean up the artifact cache before/after so
> >> you don't end up building with an out-of-date RC going forward).
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 2.3.1?
> >> ===
> >>
> >> The current list of open tickets targeted at 2.3.1 can be found at:
> >> https://s.apache.org/Q3Uo
> >>
> >> Committers should look at those and triage. Extremely important bug
> >> fixes, documentation, and API tweaks that impact compatibility should
> >> be worked on immediately. Everything else please retarget to an
> >> appropriate release.
> >>
> >> ==
> >> But my bug isn't fixed?
> >> ==
> >>
> >> In order to make timely releases, we will typically not hold the
> >> release unless the bug in question is a regression from the previous
> >> release. That being said, if there is something which is a regression
> >> that has not been correctly targeted please ping me or a committer to
> >> help target the issue.
> >>
> >>
> >> --
> >> Marcelo
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [SQL] Purpose of RuntimeReplaceable unevaluable unary expressions?

2018-05-30 Thread Reynold Xin
SQL expressions?
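
One way to read that short answer: a function invoked from SQL text has to resolve
to a registered Expression, not to the Column helpers in functions.scala, so a named
expression like ParseToTimestamp is still needed even if it only ever gets rewritten
into Casts. The rewrite itself is easy to observe (a sketch, assuming an existing
SparkSession named spark):

    val q = spark.sql("SELECT to_timestamp('2018-05-30 11:09:00')")
    println(q.queryExecution.analyzed)       // still shows the to_timestamp expression
    println(q.queryExecution.optimizedPlan)  // the RuntimeReplaceable node is gone here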

On Wed, May 30, 2018 at 11:09 AM Jacek Laskowski  wrote:

> Hi,
>
> I've been exploring RuntimeReplaceable expressions [1] and have been
> wondering what their purpose is.
>
> Quoting the scaladoc [2]:
>
> > An expression that gets replaced at runtime (currently by the optimizer)
> into a different expression for evaluation. This is mainly used to provide
> compatibility with other databases.
>
> For example, ParseToTimestamp expression is a RuntimeReplaceable
> expression and it is replaced by Cast(left, TimestampType)
> or Cast(UnixTimestamp(left, format), TimestampType) per to_timestamp
> function (there are two variants).
>
> My question is why is this RuntimeReplaceable better than simply using the
> Casts as the implementation of to_timestamp functions?
>
> def to_timestamp(s: Column, fmt: String): Column = withExpr {
>   // pseudocode
>   Cast(UnixTimestamp(left, format), TimestampType)
> }
>
> What's wrong with the above implementation compared to the current one?
>
> [1]
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L275
>
> [2]
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L266-L267
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>


Re: Running lint-java during PR builds?

2018-05-21 Thread Reynold Xin
Can we look into whether there is an sbt plugin that works, so that we can
put everything into one single builder?

On Mon, May 21, 2018 at 11:17 AM Dongjoon Hyun 
wrote:

> Thank you for reconsidering this, Hyukjin. :)
>
> Bests,
> Dongjoon.
>
>
> On Mon, May 21, 2018 at 9:20 AM, Marcelo Vanzin 
> wrote:
>
>> Is there a way to trigger it conditionally? e.g. only if the diff
>> touches java files.
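
A sketch of how such a conditional trigger could look in a CI script (the branch
name and paths are assumptions; dev/lint-java is the existing checkstyle wrapper):

    if git diff --name-only origin/master...HEAD | grep -q '\.java$'; then
      ./dev/lint-java
    fi
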
>>
>> On Mon, May 21, 2018 at 9:17 AM, Felix Cheung 
>> wrote:
>> > One concern is with the volume of test runs on Travis.
>> >
>> > In ASF projects Travis could get significantly
>> > backed up since - if I recall - all of ASF shares one queue.
>> >
>> > At the number of PRs Spark has this could be a big issue.
>> >
>> >
>> > 
>> > From: Marcelo Vanzin 
>> > Sent: Monday, May 21, 2018 9:08:28 AM
>> > To: Hyukjin Kwon
>> > Cc: Dongjoon Hyun; dev
>> > Subject: Re: Running lint-java during PR builds?
>> >
>> > I'm fine with it. I tried to use the existing checkstyle sbt plugin
>> > (trying to fix SPARK-22269), but it depends on an ancient version of
>> > checkstyle, and I don't know sbt enough to figure out how to hack
>> > classpaths and class loaders when applying rules, so gave up.
>> >
>> > On Mon, May 21, 2018 at 1:47 AM, Hyukjin Kwon 
>> wrote:
>> >> I am going to open an INFRA JIRA if there's no explicit objection in
>> few
>> >> days.
>> >>
>> >> 2018-05-21 13:09 GMT+08:00 Hyukjin Kwon :
>> >>>
>> >>> I would like to revive this proposal. Travis CI. Shall we give this
>> try?
>> >>> I
>> >>> think it's worth trying it.
>> >>>
>> >>> 2016-11-17 3:50 GMT+08:00 Dongjoon Hyun :
>> 
>>  Hi, Marcelo and Ryan.
>> 
>>  That was the main purpose of my proposal about Travis.CI.
>>  IMO, that is the only way to achieve that without any harmful
>>  side-effect
>>  on Jenkins infra.
>> 
>>  Spark is already ready for that. Like AppVoyer, if one of you files
>> an
>>  INFRA jira issue to enable that, they will turn on that. Then, we can
>>  try it
>>  and see the result. Also, you can turn off easily again if you don't
>>  want.
>> 
>>  Without this, we will consume more community efforts. For example, we
>>  merged lint-java error fix PR seven hours ago, but the master branch
>>  still
>>  has one lint-java error.
>> 
>>  https://travis-ci.org/dongjoon-hyun/spark/jobs/176351319
>> 
>>  Actually, I've been monitoring the history here. (It's synced every
>> 30
>>  minutes.)
>> 
>>  https://travis-ci.org/dongjoon-hyun/spark/builds
>> 
>>  Could we give this a chance?
>> 
>>  Bests,
>>  Dongjoon.
>> 
>>  On 2016-11-15 13:40 (-0800), "Shixiong(Ryan) Zhu"
>>   wrote:
>>  > I remember it's because you need to run `mvn install` before
>> running
>>  > lint-java if the maven cache is empty, and `mvn install` is pretty
>>  > heavy.
>>  >
>>  > On Tue, Nov 15, 2016 at 1:21 PM, Marcelo Vanzin <
>> van...@cloudera.com>
>>  > wrote:
>>  >
>>  > > Hey all,
>>  > >
>>  > > Is there a reason why lint-java is not run during PR builds? I
>> see
>>  > > it
>>  > > seems to be maven-only, is it really expensive to run after an
>> sbt
>>  > > build?
>>  > >
>>  > > I see a lot of PRs coming in to fix Java style issues, and those
>> all
>>  > > seem a little unnecessary. Either we're enforcing style checks or
>>  > > we're not, and right now it seems we aren't.
>>  > >
>>  > > --
>>  > > Marcelo
>>  > >
>>  > >
>>  > >
>> -
>>  > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>  > >
>>  > >
>>  >
>> 
>>  -
>>  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
>> >>>
>> >>
>> >
>> >
>> >
>> > --
>> > Marcelo
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>>
>>
>> --
>> Marcelo
>>
>
>


parser error?

2018-05-13 Thread Reynold Xin
Just saw this in one of my PRs that is doc-only:

[error] warning(154): SqlBase.g4:400:0: rule fromClause contains an
optional block with at least one alternative that can match an empty
string


Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Yes, that would be great if possible, but it’s non-trivial (it might be impossible
to do in general; we already have stack traces that point to line numbers when an
error occurs in UDFs, but clearly that’s not sufficient). Also, in environments like
the REPL it’s still more useful to show the error as soon as it occurs, rather than
showing it potentially 30 lines later.

On Tue, May 8, 2018 at 7:22 PM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> This may be technically impractical, but it would be fantastic if we could
> make it easier to debug Spark programs without needing to rely on eager
> execution. Sprinkling .count() and .checkpoint() at various points in my
> code is still a debugging technique I use, but it always makes me wish
> Spark could point more directly to the offending transformation when
> something goes wrong.
>
> Is it somehow possible to have each individual operator (is that the
> correct term?) in a DAG include metadata pointing back to the line of code
> that generated the operator? That way when an action triggers an error, the
> failing operation can point to the relevant line of code — even if it’s a
> transformation — and not just the action on the tail end that triggered the
> error.
>
> I don’t know how feasible this is, but addressing it would directly solve
> the issue of linking failures to the responsible transformation, as opposed
> to leaving the user to break up a chain of transformations with several
> debug actions. And this would benefit new and experienced users alike.
>
> Nick
>
> On Tue, May 8, 2018 at 7:09 PM, Ryan Blue rb...@netflix.com.invalid
> <http://mailto:rb...@netflix.com.invalid> wrote:
>
> I've opened SPARK-24215 to track this.
>>
>> On Tue, May 8, 2018 at 3:58 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Yup. Sounds great. This is something simple Spark can do and provide
>>> huge value to the end users.
>>>
>>>
>>> On Tue, May 8, 2018 at 3:53 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Would be great if it is something more turn-key.
>>>>
>>>> We can easily add the __repr__ and _repr_html_ methods and behavior to
>>>> PySpark classes. We could also add a configuration property to determine
>>>> whether the dataset evaluation is eager or not. That would make it turn-key
>>>> for anyone running PySpark in Jupyter.
>>>>
>>>> For JVM languages, we could also add a dependency on jvm-repr and do
>>>> the same thing.
>>>>
>>>> rb
>>>> ​
>>>>
>>>> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> s/underestimated/overestimated/
>>>>>
>>>>> On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Marco,
>>>>>>
>>>>>> There is understanding how Spark works, and there is finding bugs
>>>>>> early in their own program. One can perfectly understand how Spark works
>>>>>> and still find it valuable to get feedback asap, and that's why we built
>>>>>> eager analysis in the first place.
>>>>>>
>>>>>> Also I'm afraid you've significantly underestimated the level of
>>>>>> technical sophistication of users. In many cases they struggle to get
>>>>>> anything to work, and performance optimization of their programs is
>>>>>> secondary to getting things working. As John Ousterhout says, "the 
>>>>>> greatest
>>>>>> performance improvement of all is when a system goes from not-working to
>>>>>> working".
>>>>>>
>>>>>> I really like Ryan's approach. Would be great if it is something more
>>>>>> turn-key.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I am not sure how this is useful. For students, it is important to
>>>>>>> understand how Spark works. This can be critical in many decision they 
>>>>>>> have
>>>>>>> to take (whether and what to cache for instance) in order to have
>>>>>>> performant Spark application. Creating a eager execution probably can 
>>>>>>> help

Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Reynold Xin
IIRC we switched all internals to UnsafeRow for simplicity. It is easier to
serialize UnsafeRows, compute hash codes, etc. At some point we had a bug
with unioning two plans producing different types of rows, so we forced the
conversion at input.

Can't your "wish" be satisfied by having the public API producing the
internals of UnsafeRow (without actually exposing UnsafeRow)?


On Tue, May 8, 2018 at 4:16 PM Ryan Blue <rb...@netflix.com> wrote:

> Is the goal to design an API so the consumers of the API can directly
> produce what Spark expects internally, to cut down perf cost?
>
> No. That has already been done. The problem on the API side is that it
> makes little sense to force implementers to create UnsafeRow when it almost
> certainly causes them to simply use UnsafeProjection and copy it. If
> we’re just making a copy and we can defer that copy to get better
> performance, why would we make implementations handle it? Instead, I think
> we should accept InternalRow from v2 data sources and copy to unsafe when
> it makes sense to do so: after filters are run and only if there isn’t
> another projection that will do it already.
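
For readers following along, this is the copy in question, sketched with catalyst's
internal (unstable) classes purely to make the discussion concrete:

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
    import org.apache.spark.unsafe.types.UTF8String

    val schema = StructType(Seq(StructField("id", LongType), StructField("name", StringType)))
    val toUnsafe = UnsafeProjection.create(schema)
    val generic: InternalRow = InternalRow(1L, UTF8String.fromString("a"))
    val unsafe: UnsafeRow = toUnsafe(generic).copy()   // copy(), since the projection reuses its buffer
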
>
> But I don’t want to focus on the v2 API for this. What I’m asking in this
> thread is what the intent is for the SQL engine. Is this an accident that
> nearly everything works with InternalRow? If we were to make a choice
> here, should we mandate that rows passed into the SQL engine must be
> UnsafeRow?
>
> Personally, I think it makes sense to say that everything should accept
> InternalRow, but produce UnsafeRow, with the understanding that UnsafeRow
> will usually perform better.
>
> rb
> ​
>
> On Tue, May 8, 2018 at 4:09 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> What the internal operators do are strictly internal. To take one step
>> back, is the goal to design an API so the consumers of the API can directly
>> produce what Spark expects internally, to cut down perf cost?
>>
>>
>> On Tue, May 8, 2018 at 1:22 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> While moving the new data source API to InternalRow, I noticed a few odd
>>> things:
>>>
>>>- Spark scans always produce UnsafeRow, but that data is passed
>>>around as InternalRow with explicit casts.
>>>- Operators expect InternalRow and nearly all codegen works with
>>>InternalRow (I’ve tested this with quite a few queries.)
>>>- Spark uses unchecked casts
>>>
>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L254>
>>>from InternalRow to UnsafeRow in places, assuming that data will be
>>>unsafe, even though that isn’t what the type system guarantees.
>>>
>>> To me, it looks like the idea was to code SQL operators to the abstract
>>> InternalRow so we can swap out the implementation, but ended up with a
>>> general assumption that rows will always be unsafe. This is the worst of
>>> both options: we can’t actually rely on everything working with
>>> InternalRow but code must still use it, until it is inconvenient and an
>>> unchecked cast gets inserted.
>>>
>>> The main question I want to answer is this: *what data format should
>>> SQL use internally?* What was the intent when building catalyst?
>>>
>>> The v2 data source API depends on the answer, but I also found that this
>>> introduces a significant performance penalty in Parquet (and probably other
>>> formats). A quick check on one of our tables showed a 6% performance hit
>>> caused by unnecessary copies from InternalRow to UnsafeRow. So if we
>>> can guarantee that all operators should support InternalRow, then there
>>> is an easy performance win that also simplifies the v2 data source API.
>>>
>>> rb
>>> ​
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Reynold Xin
What the internal operators do are strictly internal. To take one step
back, is the goal to design an API so the consumers of the API can directly
produce what Spark expects internally, to cut down perf cost?


On Tue, May 8, 2018 at 1:22 PM Ryan Blue  wrote:

> While moving the new data source API to InternalRow, I noticed a few odd
> things:
>
>- Spark scans always produce UnsafeRow, but that data is passed around
>as InternalRow with explicit casts.
>- Operators expect InternalRow and nearly all codegen works with
>InternalRow (I’ve tested this with quite a few queries.)
>- Spark uses unchecked casts
>
> 
>from InternalRow to UnsafeRow in places, assuming that data will be
>unsafe, even though that isn’t what the type system guarantees.
>
> To me, it looks like the idea was to code SQL operators to the abstract
> InternalRow so we can swap out the implementation, but ended up with a
> general assumption that rows will always be unsafe. This is the worst of
> both options: we can’t actually rely on everything working with
> InternalRow but code must still use it, until it is inconvenient and an
> unchecked cast gets inserted.
>
> The main question I want to answer is this: *what data format should SQL
> use internally?* What was the intent when building catalyst?
>
> The v2 data source API depends on the answer, but I also found that this
> introduces a significant performance penalty in Parquet (and probably other
> formats). A quick check on one of our tables showed a 6% performance hit
> caused by unnecessary copies from InternalRow to UnsafeRow. So if we can
> guarantee that all operators should support InternalRow, then there is an
> easy performance win that also simplifies the v2 data source API.
>
> rb
> ​
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Yup. Sounds great. This is something simple Spark can do and provide huge
value to the end users.


On Tue, May 8, 2018 at 3:53 PM Ryan Blue <rb...@netflix.com> wrote:

> Would be great if it is something more turn-key.
>
> We can easily add the __repr__ and _repr_html_ methods and behavior to
> PySpark classes. We could also add a configuration property to determine
> whether the dataset evaluation is eager or not. That would make it turn-key
> for anyone running PySpark in Jupyter.
>
> For JVM languages, we could also add a dependency on jvm-repr and do the
> same thing.
>
> rb
> ​
>
> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> s/underestimated/overestimated/
>>
>> On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Marco,
>>>
>>> There is understanding how Spark works, and there is finding bugs early
>>> in their own program. One can perfectly understand how Spark works and
>>> still find it valuable to get feedback asap, and that's why we built eager
>>> analysis in the first place.
>>>
>>> Also I'm afraid you've significantly underestimated the level of
>>> technical sophistication of users. In many cases they struggle to get
>>> anything to work, and performance optimization of their programs is
>>> secondary to getting things working. As John Ousterhout says, "the greatest
>>> performance improvement of all is when a system goes from not-working to
>>> working".
>>>
>>> I really like Ryan's approach. Would be great if it is something more
>>> turn-key.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com>
>>> wrote:
>>>
>>>> I am not sure how this is useful. For students, it is important to
>>>> understand how Spark works. This can be critical in many decision they have
>>>> to take (whether and what to cache for instance) in order to have
>>>> performant Spark application. Creating a eager execution probably can help
>>>> them having something running more easily, but let them also using Spark
>>>> knowing less about how it works, thus they are likely to write worse
>>>> application and to have more problems in debugging any kind of problem
>>>> which may later (in production) occur (therefore affecting their experience
>>>> with the tool).
>>>>
>>>> Moreover, as Ryan also mentioned, there are tools/ways to force the
>>>> execution, helping in the debugging phase. So they can achieve without a
>>>> big effort the same result, but with a big difference: they are aware of
>>>> what is really happening, which may help them later.
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rb...@netflix.com.invalid>:
>>>>
>>>>> At Netflix, we use Jupyter notebooks and consoles for interactive
>>>>> sessions. For anyone interested, this mode of interaction is really easy 
>>>>> to
>>>>> add in Jupyter and PySpark. You would just define a different
>>>>> *repr_html* or *repr* method for Dataset that runs a take(10) or
>>>>> take(100) and formats the result.
>>>>>
>>>>> That way, the output of a cell or console execution always causes the
>>>>> dataframe to run and get displayed for that immediate feedback. But, there
>>>>> is no change to Spark’s behavior because the action is run by the REPL, 
>>>>> and
>>>>> only when a dataframe is a result of an execution in order to display it.
>>>>> Intermediate results wouldn’t be run, but that gives users a way to avoid
>>>>> too many executions and would still support method chaining in the
>>>>> dataframe API (which would be horrible with an aggressive execution 
>>>>> model).
>>>>>
>>>>> There are ways to do this in JVM languages as well if you are using a
>>>>> Scala or Java interpreter (see jvm-repr
>>>>> <https://github.com/jupyter/jvm-repr>). This is actually what we do
>>>>> in our Spark-based SQL interpreter to display results.
>>>>>
>>>>> rb
>>>>> ​
>>>>>
>>>>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <ko...@tresata.com>
>>>>> wrote:
>>>>>

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
s/underestimated/overestimated/

On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com> wrote:

> Marco,
>
> There is understanding how Spark works, and there is finding bugs early in
> their own program. One can perfectly understand how Spark works and still
> find it valuable to get feedback asap, and that's why we built eager
> analysis in the first place.
>
> Also I'm afraid you've significantly underestimated the level of technical
> sophistication of users. In many cases they struggle to get anything to
> work, and performance optimization of their programs is secondary to
> getting things working. As John Ousterhout says, "the greatest performance
> improvement of all is when a system goes from not-working to working".
>
> I really like Ryan's approach. Would be great if it is something more
> turn-key.
>
>
>
>
>
>
> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com> wrote:
>
>> I am not sure how this is useful. For students, it is important to
>> understand how Spark works. This can be critical in many decision they have
>> to take (whether and what to cache for instance) in order to have
>> performant Spark application. Creating a eager execution probably can help
>> them having something running more easily, but let them also using Spark
>> knowing less about how it works, thus they are likely to write worse
>> application and to have more problems in debugging any kind of problem
>> which may later (in production) occur (therefore affecting their experience
>> with the tool).
>>
>> Moreover, as Ryan also mentioned, there are tools/ways to force the
>> execution, helping in the debugging phase. So they can achieve without a
>> big effort the same result, but with a big difference: they are aware of
>> what is really happening, which may help them later.
>>
>> Thanks,
>> Marco
>>
>> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rb...@netflix.com.invalid>:
>>
>>> At Netflix, we use Jupyter notebooks and consoles for interactive
>>> sessions. For anyone interested, this mode of interaction is really easy to
>>> add in Jupyter and PySpark. You would just define a different
>>> *repr_html* or *repr* method for Dataset that runs a take(10) or
>>> take(100) and formats the result.
>>>
>>> That way, the output of a cell or console execution always causes the
>>> dataframe to run and get displayed for that immediate feedback. But, there
>>> is no change to Spark’s behavior because the action is run by the REPL, and
>>> only when a dataframe is a result of an execution in order to display it.
>>> Intermediate results wouldn’t be run, but that gives users a way to avoid
>>> too many executions and would still support method chaining in the
>>> dataframe API (which would be horrible with an aggressive execution model).
>>>
>>> There are ways to do this in JVM languages as well if you are using a
>>> Scala or Java interpreter (see jvm-repr
>>> <https://github.com/jupyter/jvm-repr>). This is actually what we do in
>>> our Spark-based SQL interpreter to display results.
>>>
>>> rb
>>> ​
>>>
>>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> yeah we run into this all the time with new hires. they will send
>>>> emails explaining there is an error in the .write operation and they are
>>>> debugging the writing to disk, focusing on that piece of code :)
>>>>
>>>> unrelated, but another frequent cause for confusion is cascading
>>>> errors. like the FetchFailedException. they will be debugging the reducer
>>>> task not realizing the error happened before that, and the
>>>> FetchFailedException is not the root cause.
>>>>
>>>>
>>>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Similar to the thread yesterday about improving ML/DL integration, I'm
>>>>> sending another email on what I've learned recently from Spark users. I
>>>>> recently talked to some educators that have been teaching Spark in their
>>>>> (top-tier) university classes. They are some of the most important users
>>>>> for adoption because of the multiplicative effect they have on the future
>>>>> generation.
>>>>>
>>>>> To my surprise the single biggest ask they want is to enable eager
>>>>> execution mode on 

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Marco,

There is understanding how Spark works, and there is finding bugs early in
their own program. One can perfectly understand how Spark works and still
find it valuable to get feedback asap, and that's why we built eager
analysis in the first place.

Also I'm afraid you've significantly underestimated the level of technical
sophistication of users. In many cases they struggle to get anything to
work, and performance optimization of their programs is secondary to
getting things working. As John Ousterhout says, "the greatest performance
improvement of all is when a system goes from not-working to working".

I really like Ryan's approach. Would be great if it is something more
turn-key.






On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com> wrote:

> I am not sure how this is useful. For students, it is important to
> understand how Spark works. This can be critical in many decision they have
> to take (whether and what to cache for instance) in order to have
> performant Spark application. Creating a eager execution probably can help
> them having something running more easily, but let them also using Spark
> knowing less about how it works, thus they are likely to write worse
> application and to have more problems in debugging any kind of problem
> which may later (in production) occur (therefore affecting their experience
> with the tool).
>
> Moreover, as Ryan also mentioned, there are tools/ways to force the
> execution, helping in the debugging phase. So they can achieve without a
> big effort the same result, but with a big difference: they are aware of
> what is really happening, which may help them later.
>
> Thanks,
> Marco
>
> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rb...@netflix.com.invalid>:
>
>> At Netflix, we use Jupyter notebooks and consoles for interactive
>> sessions. For anyone interested, this mode of interaction is really easy to
>> add in Jupyter and PySpark. You would just define a different *repr_html*
>> or *repr* method for Dataset that runs a take(10) or take(100) and
>> formats the result.
>>
>> That way, the output of a cell or console execution always causes the
>> dataframe to run and get displayed for that immediate feedback. But, there
>> is no change to Spark’s behavior because the action is run by the REPL, and
>> only when a dataframe is a result of an execution in order to display it.
>> Intermediate results wouldn’t be run, but that gives users a way to avoid
>> too many executions and would still support method chaining in the
>> dataframe API (which would be horrible with an aggressive execution model).
>>
>> There are ways to do this in JVM languages as well if you are using a
>> Scala or Java interpreter (see jvm-repr
>> <https://github.com/jupyter/jvm-repr>). This is actually what we do in
>> our Spark-based SQL interpreter to display results.
>>
>> rb
>> ​
>>
>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> yeah we run into this all the time with new hires. they will send emails
>>> explaining there is an error in the .write operation and they are debugging
>>> the writing to disk, focusing on that piece of code :)
>>>
>>> unrelated, but another frequent cause for confusion is cascading errors.
>>> like the FetchFailedException. they will be debugging the reducer task not
>>> realizing the error happened before that, and the FetchFailedException is
>>> not the root cause.
>>>
>>>
>>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Similar to the thread yesterday about improving ML/DL integration, I'm
>>>> sending another email on what I've learned recently from Spark users. I
>>>> recently talked to some educators that have been teaching Spark in their
>>>> (top-tier) university classes. They are some of the most important users
>>>> for adoption because of the multiplicative effect they have on the future
>>>> generation.
>>>>
>>>> To my surprise the single biggest ask they want is to enable eager
>>>> execution mode on all operations for teaching and debuggability:
>>>>
>>>> (1) Most of the students are relatively new to programming, and they
>>>> need multiple iterations to even get the most basic operation right. In
>>>> these cases, in order to trigger an error, they would need to explicitly
>>>> add actions, which is non-intuitive.
>>>>
>>>> (2) If they don't add explicit actions to every operation and there is
>>>> a mis

eager execution and debuggability

2018-05-08 Thread Reynold Xin
Similar to the thread yesterday about improving ML/DL integration, I'm
sending another email on what I've learned recently from Spark users. I
recently talked to some educators that have been teaching Spark in their
(top-tier) university classes. They are some of the most important users
for adoption because of the multiplicative effect they have on the future
generation.

To my surprise the single biggest ask they want is to enable eager
execution mode on all operations for teaching and debuggability:

(1) Most of the students are relatively new to programming, and they need
multiple iterations to even get the most basic operation right. In these
cases, in order to trigger an error, they would need to explicitly add
actions, which is non-intuitive.

(2) If they don't add explicit actions to every operation and there is a
mistake, the error pops up somewhere later where an action is triggered.
This is in a different position from the code that causes the problem, and
difficult for students to correlate the two.
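
A minimal illustration of (2), assuming a spark-shell session and a toy dataset: the
bug is in the UDF defined on the first line, but nothing fails until the action at
the end.

    import spark.implicits._
    import org.apache.spark.sql.functions.udf

    val parse  = udf((s: String) => s.toInt)            // blows up on "twenty"
    val scores = Seq("10", "twenty").toDF("raw").select(parse($"raw").as("score"))

    scores.show()   // only here does the NumberFormatException surface (wrapped in a SparkException)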

I suspect in the real world a lot of Spark users also struggle in similar
ways as these students. While eager execution is really not practical in
big data, in learning environments or in development against small, sampled
datasets it can be pretty helpful.


Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Reynold Xin
Yes, Nan, totally agree. To be on the same page, that's exactly what I
wrote, wasn't it?

On Tue, May 8, 2018 at 11:14 AM Nan Zhu <zhunanmcg...@gmail.com> wrote:

> besides that, one of the things which is needed by multiple frameworks is
> to schedule tasks in a single wave
>
> i.e.
>
> if some frameworks like xgboost/mxnet require 50 parallel workers, Spark
> should provide a capability to ensure that either we run all 50 tasks at
> once, or we quit the complete application/job after some timeout
> period
>
> Best,
>
> Nan
>
> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> I think that's what Xiangrui was referring to. Instead of retrying a
>> single task, retry the entire stage, and the entire stage of tasks need to
>> be scheduled all at once.
>>
>>
>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>>
>>>>
>>>>>- Fault tolerance and execution model: Spark assumes fine-grained
>>>>>task recovery, i.e. if something fails, only that task is rerun. This
>>>>>doesn’t match the execution model of distributed ML/DL frameworks that 
>>>>> are
>>>>>typically MPI-based, and rerunning a single task would lead to the 
>>>>> entire
>>>>>system hanging. A whole stage needs to be re-run.
>>>>>
>>>>> This is not only useful for integrating with 3rd-party frameworks, but
>>>> also useful for scaling MLlib algorithms. One of my earliest attempts in
>>>> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>>>> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up
>>>> with some compromised solutions. With the new execution model, we can set
>>>> up a hybrid cluster and do all-reduce properly.
>>>>
>>>>
>>> Is there a particular new execution model you are referring to or do we
>>> plan to investigate a new execution model ?  For the MPI-like model, we
>>> also need gang scheduling (i.e. schedule all tasks at once or none of them)
>>> and I don't think we have support for that in the scheduler right now.
>>>
>>>>
>>>>> --
>>>>
>>>> Xiangrui Meng
>>>>
>>>> Software Engineer
>>>>
>>>> Databricks Inc. [image: http://databricks.com] <http://databricks.com/>
>>>>
>>>
>>>
>


Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Reynold Xin
I think that's what Xiangrui was referring to. Instead of retrying a single
task, retry the entire stage, and the entire stage of tasks need to be
scheduled all at once.


On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

>
>>
>>>- Fault tolerance and execution model: Spark assumes fine-grained
>>>task recovery, i.e. if something fails, only that task is rerun. This
>>>doesn’t match the execution model of distributed ML/DL frameworks that 
>>> are
>>>typically MPI-based, and rerunning a single task would lead to the entire
>>>system hanging. A whole stage needs to be re-run.
>>>
>>> This is not only useful for integrating with 3rd-party frameworks, but
>> also useful for scaling MLlib algorithms. One of my earliest attempts in
>> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>> ). But we ended up
>> with some compromised solutions. With the new execution model, we can set
>> up a hybrid cluster and do all-reduce properly.
>>
>>
> Is there a particular new execution model you are referring to or do we
> plan to investigate a new execution model ?  For the MPI-like model, we
> also need gang scheduling (i.e. schedule all tasks at once or none of them)
> and I dont think we have support for that in the scheduler right now.
>
>>
>>> --
>>
>> Xiangrui Meng
>>
>> Software Engineer
>>
>> Databricks Inc. [image: http://databricks.com] 
>>
>
>


Re: Documenting the various DataFrame/SQL join types

2018-05-08 Thread Reynold Xin
Would be great to document. Probably best with examples.
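
For instance, a quick PySpark sketch (table and column names are made up)
contrasting left_semi and left_anti, which tend to be the least familiar of
the join types listed in Nick's message below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-types-demo").getOrCreate()

customers = spark.createDataFrame(
    [(1, "alice"), (2, "bob"), (3, "carol")], ["id", "name"])
orders = spark.createDataFrame([(1,), (3,)], ["customer_id"])

cond = customers["id"] == orders["customer_id"]

# left_semi: rows of `customers` that have at least one match in `orders`;
# only the left side's columns come back.
customers.join(orders, cond, "left_semi").show()

# left_anti: rows of `customers` with no match in `orders` at all.
customers.join(orders, cond, "left_anti").show()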

On Tue, May 8, 2018 at 6:13 AM Nicholas Chammas 
wrote:

> The documentation for DataFrame.join()
> 
> lists all the join types we support:
>
>- inner
>- cross
>- outer
>- full
>- full_outer
>- left
>- left_outer
>- right
>- right_outer
>- left_semi
>- left_anti
>
> Some of these join types are also listed on the SQL Programming Guide
> 
> .
>
> Is it obvious to everyone what all these different join types are? For
> example, I had never heard of a LEFT ANTI join until stumbling on it in the
> PySpark docs. It’s quite handy! But I had to experiment with it a bit just
> to understand what it does.
>
> I think it would be a good service to our users if we either documented
> these join types ourselves clearly, or provided a link to an external
> resource that documented them sufficiently. I’m happy to file a JIRA about
> this and do the work itself. It would be great if the documentation could
> be expressed as a series of simple doc tests, but brief prose describing
> how each join works would still be valuable.
>
> Does this seem worthwhile to folks here? And does anyone want to offer
> guidance on how best to provide this kind of documentation so that it’s
> easy to find by users, regardless of the language they’re using?
>
> Nick
> ​
>


Re: Integrating ML/DL frameworks with Spark

2018-05-07 Thread Reynold Xin
I don't think it's sufficient to have them in YARN (or any other services)
without Spark aware of them. If Spark is not aware of them, then there is
no way to really efficiently utilize these accelerators when you run
anything that require non-accelerators (which is almost 100% of the cases
in real world workloads).

For the other two, the point is not to implement all the ML/DL algorithms
in Spark, but make Spark integrate well with ML/DL frameworks. Otherwise
you will have the problems I described (super low performance when
exchanging data between Spark and ML/DL frameworks, and hanging issues with
MPI-based programs).


On Mon, May 7, 2018 at 10:05 PM Jörn Franke <jornfra...@gmail.com> wrote:

> Hadoop / Yarn 3.1 added GPU scheduling. 3.2 is planned to add FPGA
> scheduling, so it might be worth to have the last point generic that not
> only the Spark scheduler, but all supported schedulers can use GPU.
>
> For the other 2 points I just wonder if it makes sense to address this in
> the ml frameworks themselves or in Spark.
>
> On 8. May 2018, at 06:59, Xiangrui Meng <m...@databricks.com> wrote:
>
> Thanks Reynold for summarizing the offline discussion! I added a few
> comments inline. -Xiangrui
>
> On Mon, May 7, 2018 at 5:37 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Hi all,
>>
>> Xiangrui and I were discussing with a heavy Apache Spark user last week
>> on their experiences integrating machine learning (and deep learning)
>> frameworks with Spark and some of their pain points. Couple things were
>> obvious and I wanted to share our learnings with the list.
>>
>> (1) Most organizations already use Spark for data plumbing and want to be
>> able to run their ML part of the stack on Spark as well (not necessarily
>> re-implementing all the algorithms but by integrating various frameworks
>> like tensorflow, mxnet with Spark).
>>
>> (2) The integration is however painful, from the systems perspective:
>>
>>
>>- Performance: data exchange between Spark and other frameworks are
>>slow, because UDFs across process boundaries (with native code) are slow.
>>This works much better now with Pandas UDFs (given a lot of the ML/DL
>>frameworks are in Python). However, there might be some low hanging fruit
>>gaps here.
>>
>> The Arrow support behind Pandas UDFs can be reused to exchange data with
> other frameworks. And one possible performance improvement is to support
> pipelining when supplying data to other frameworks. For example, while
> Spark is pumping data from external sources into TensorFlow, TensorFlow
> starts the computation on GPUs. This would significantly improve speed and
> resource utilization.
>
>>
>>- Fault tolerance and execution model: Spark assumes fine-grained
>>task recovery, i.e. if something fails, only that task is rerun. This
>>doesn’t match the execution model of distributed ML/DL frameworks that are
>>typically MPI-based, and rerunning a single task would lead to the entire
>>system hanging. A whole stage needs to be re-run.
>>
>> This is not only useful for integrating with 3rd-party frameworks, but
> also useful for scaling MLlib algorithms. One of my earliest attempts in
> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up with
> some compromised solutions. With the new execution model, we can set up a
> hybrid cluster and do all-reduce properly.
>
>
>>
>>- Accelerator-aware scheduling: The DL frameworks leverage GPUs and
>>sometimes FPGAs as accelerators for speedup, and Spark’s scheduler isn’t
>>aware of those resources, leading to either over-utilizing the 
>> accelerators
>>or under-utilizing the CPUs.
>>
>>
>> The good thing is that none of these seem very difficult to address (and
>> we have already made progress on one of them). Xiangrui has graciously
>> accepted the challenge to come up with solutions and SPIP to these.
>>
>>
> I will do more homework, exploring existing JIRAs or creating new JIRAs
> for the proposal. We'd like to hear your feedback and past efforts along
> those directions if they were not fully captured by our JIRA.
>
>
>> Xiangrui - please also chime in if I didn’t capture everything.
>>
>>
>> --
>
> Xiangrui Meng
>
> Software Engineer
>
> Databricks Inc. [image: http://databricks.com] <http://databricks.com/>
>
>


Integrating ML/DL frameworks with Spark

2018-05-07 Thread Reynold Xin
Hi all,

Xiangrui and I were talking with a heavy Apache Spark user last week about
their experiences integrating machine learning (and deep learning)
frameworks with Spark and some of their pain points. A couple of things were
obvious, and I wanted to share our learnings with the list.

(1) Most organizations already use Spark for data plumbing and want to be
able to run their ML part of the stack on Spark as well (not necessarily
re-implementing all the algorithms but by integrating various frameworks
like tensorflow, mxnet with Spark).

(2) The integration is however painful, from the systems perspective:


   - Performance: data exchange between Spark and other frameworks is
   slow, because UDFs across process boundaries (with native code) are slow.
   This works much better now with Pandas UDFs (given a lot of the ML/DL
   frameworks are in Python). However, there might still be some low-hanging
   fruit here (see the sketch after this message).


   - Fault tolerance and execution model: Spark assumes fine-grained task
   recovery, i.e. if something fails, only that task is rerun. This doesn’t
   match the execution model of distributed ML/DL frameworks that are
   typically MPI-based, and rerunning a single task would lead to the entire
   system hanging. A whole stage needs to be re-run.


   - Accelerator-aware scheduling: The DL frameworks leverage GPUs and
   sometimes FPGAs as accelerators for speedup, and Spark’s scheduler isn’t
   aware of those resources, leading to either over-utilizing the accelerators
   or under-utilizing the CPUs.


The good thing is that none of these seem very difficult to address (and we
have already made progress on one of them). Xiangrui has graciously
accepted the challenge to come up with solutions and SPIPs for these.

Xiangrui - please also chime in if I didn’t capture everything.
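
As a rough illustration of the Performance bullet above, here is a minimal
scalar Pandas UDF sketch (assuming PyArrow is installed; the function and
column names are made up), where the per-batch arithmetic stands in for
handing a whole Arrow batch to an external ML/DL framework:

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType, col

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

df = spark.range(0, 1000).withColumn("x", col("id").cast("double"))

# A scalar Pandas UDF receives Arrow-backed pandas.Series batches rather
# than one row at a time, which is what makes the Python<->JVM exchange fast.
@pandas_udf("double", PandasUDFType.SCALAR)
def score(x):
    # Placeholder for calling into an external framework on the whole batch.
    return x * 2.0

df.withColumn("scored", score(col("x"))).show(5)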


Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-03 Thread Reynold Xin
Why do you need the underlying RDDs? Can't you just unpersist the
dataframes that you don't need?
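
As a rough sketch of that suggestion (the helper and names are hypothetical),
you can keep your own registry of the DataFrames you persist and then
unpersist everything except the ones you still need:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unpersist-demo").getOrCreate()

# Track persisted DataFrames yourself instead of going through the
# JVM-level getPersistentRDDs().
persisted = {}

def persist_df(name, df):
    persisted[name] = df.persist()
    return persisted[name]

a = persist_df("a", spark.range(10))
b = persist_df("b", spark.range(20))
c = persist_df("c", spark.range(30))

# Later in the pipeline: keep only "c", release everything else.
keep = {"c"}
for name, df in list(persisted.items()):
    if name not in keep:
        df.unpersist()
        del persisted[name]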


On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas 
wrote:

> This seems to be an underexposed part of the API. My use case is this: I
> want to unpersist all DataFrames except a specific few. I want to do this
> because I know at a specific point in my pipeline that I have a handful of
> DataFrames that I need, and everything else is no longer needed.
>
> The problem is that there doesn’t appear to be a way to identify specific
> DataFrames (or rather, their underlying RDDs) via getPersistentRDDs(),
> which is the only way I’m aware of to ask Spark for all currently persisted
> RDDs:
>
> >>> a = spark.range(10).persist()
> >>> a.rdd.id()
> 8
> >>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
> [(3, JavaObject id=o36)]
>
> As you can see, the id of the persisted RDD, 8, doesn’t match the id
> returned by getPersistentRDDs(), 3. So I can’t go through the RDDs
> returned by getPersistentRDDs() and know which ones I want to keep.
>
> id() itself appears to be an undocumented method of the RDD API, and in
> PySpark getPersistentRDDs() is buried behind the Java sub-objects
> , so I know I’m
> reaching here. But is there a way to do what I want in PySpark without
> manually tracking everything I’ve persisted myself?
>
> And more broadly speaking, do we want to add additional APIs, or formalize
> currently undocumented APIs like id(), to make this use case possible?
>
> Nick
> ​
>


Re: Process for backports?

2018-04-24 Thread Reynold Xin
1. We don't backport features.

2. In general we don't bump dependencies, unless they are for critical bug
fixes.

3. We weigh the risk of new regressions vs. the bug fixes. To state the
obvious, we wouldn't backport a bug fix if it only affects a very small
number of use cases but requires very complex changes. There is a huge gray
zone in between that relies on the committer's judgement.





On Tue, Apr 24, 2018 at 3:56 PM, Cody Koeninger  wrote:

>  https://issues.apache.org/jira/browse/SPARK-24067
>
> is asking to backport a change to the 2.3 branch.
>
> My questions
>
> - In general are there any concerns about what qualifies for backporting?
> This adds a configuration variable but shouldn't change default behavior.
>
> - Is a separate jira + pr actually necessary?
> Seems like the merge_spark_pr.py script is set up to handle cherry
> picking the original merged PR in a case like this.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Correlated subqueries in the DataFrame API

2018-04-19 Thread Reynold Xin
Perhaps we can just have a function that turns a DataFrame into a Column?
That'd work for both the correlated and uncorrelated cases, although in the
correlated case we'd need to turn off eager analysis (otherwise there is no
way to construct a valid DataFrame).
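
In the meantime, for the uncorrelated case, a rough sketch of the workaround
users typically reach for today (table and column names are hypothetical,
and this only works when the distinct value set is small enough to collect):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("isin-demo").getOrCreate()

table = spark.createDataFrame([(1, "x"), (2, "y"), (3, "z")], ["key", "v"])
source = spark.createDataFrame([(1,), (3,)], ["filter_col"])

# Materialize the sub-query on the driver -- fine only for small value sets.
values = [r["filter_col"]
          for r in source.select("filter_col").distinct().collect()]

table.filter(col("key").isin(values)).show()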


On Thu, Apr 19, 2018 at 4:08 PM, Ryan Blue 
wrote:

> Nick, thanks for raising this.
>
> It looks useful to have something in the DF API that behaves like
> sub-queries, but I’m not sure that passing a DF works. Making every method
> accept a DF that may contain matching data seems like it puts a lot of work
> on the API — which now has to accept a DF all over the place.
>
> What about exposing transforms that make it easy to coerce data to what
> the method needs? Instead of passing a dataframe, you’d pass df.toSet to
> isin:
>
> val subQ = spark.sql("select distinct filter_col from source")
> val df = table.filter($"col".isin(subQ.toSet))
>
> That also distinguishes between a sub-query and a correlated sub-query
> that uses values from the outer query. We would still need to come up with
> syntax for the correlated case, unless there’s a proposal already.
>
> rb
> ​
>
> On Mon, Apr 9, 2018 at 3:56 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I just submitted SPARK-23945
>>  but wanted to double
>> check here to make sure I didn't miss something fundamental.
>>
>> Correlated subqueries are tracked at a high level in SPARK-18455
>> , but it's not clear
>> to me whether they are "design-appropriate" for the DataFrame API.
>>
>> Are correlated subqueries a thing we can expect to have in the DataFrame
>> API?
>>
>> Nick
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Scala 2.12 support

2018-04-19 Thread Reynold Xin
Forking the thread to focus on Scala 2.12.

Dean,

There are a couple of different issues with Scala 2.12 (the closure cleaner,
API-breaking changes). Which one do you think we can address with a Scala
upgrade? (I haven't spent a lot of time looking at the closure cleaner one,
but it might involve more Spark-side changes.)

On Thu, Apr 19, 2018 at 3:28 AM, Dean Wampler  wrote:

> I spoke with Martin Odersky and Lightbend's Scala Team about the known API
> issue with method disambiguation. They offered to implement a small patch
> in a new release of Scala 2.12 to handle the issue without requiring a
> Spark API change. They would cut a 2.12.6 release for it. I'm told that
> Scala 2.13 should already handle the issue without modification (it's not
> yet released, to be clear). They can also offer feedback on updating the
> closure cleaner.
>
> So, this approach would support Scala 2.12 in Spark, but limited to
> 2.12.6+, without the API change requirement, but the closure cleaner would
> still need updating. Hence, it could be done for Spark 2.X.
>
> Let me know if you want to pursue this approach.
>
> dean
>
>
>
>
> *Dean Wampler, Ph.D.*
>
> *VP, Fast Data Engineering at Lightbend*
> Author: Programming Scala, 2nd Edition
> , Fast Data
> Architectures for Streaming Applications
> ,
> and other content from O'Reilly
> @deanwampler 
> http://polyglotprogramming.com
> https://github.com/deanwampler
>
> On Thu, Apr 5, 2018 at 8:13 PM, Marcelo Vanzin 
> wrote:
>
>> On Thu, Apr 5, 2018 at 10:30 AM, Matei Zaharia 
>> wrote:
>> > Sorry, but just to be clear here, this is the 2.12 API issue:
>> https://issues.apache.org/jira/browse/SPARK-14643, with more details in
>> this doc: https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HK
>> ixuNdxSEvo8nw_tgLgM/edit.
>> >
>> > Basically, if we are allowed to change Spark’s API a little to have
>> only one version of methods that are currently overloaded between Java and
>> Scala, we can get away with a single source tree for all Scala versions
>> and Java ABI compatibility against any type of Spark (whether using Scala
>> 2.11 or 2.12).
>>
>> Fair enough. To play devil's advocate, most of those methods seem to
>> be marked "Experimental / Evolving", which could be used as a reason
>> to change them for this purpose in a minor release.
>>
>> Not all of them are, though (e.g. foreach / foreachPartition are not
>> experimental).
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: Sorting on a streaming dataframe

2018-04-13 Thread Reynold Xin
Can you describe your use case more?

On Thu, Apr 12, 2018 at 11:12 PM Hemant Bhanawat 
wrote:

> Hi Guys,
>
> Why is sorting on streaming dataframes not supported(unless it is complete
> mode)? My downstream needs me to sort the streaming dataframe.
>
> Hemant
>


Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Reynold Xin
Seems like this would make sense... we usually make maintenance releases
for bug fixes after a month anyway.


On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson  wrote:

>
>
> On 11 April 2018 at 12:47, Ryan Blue  wrote:
>
>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>> Spark.
>>
>> To be clear though, this only affects Spark when reading data written by
>> Impala, right? Or does Parquet CPP also produce data like this?
>>
>
> I don't know about parquet-cpp, but yeah, the only implementation I've
> seen writing the half-completed stats is Impala. (as you know, that's
> compliant with the spec, just an unusual choice).
>
>
>>
>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson 
>> wrote:
>>
>>> Hi all -
>>>
>>> SPARK-23852 (where a query can silently give wrong results thanks to a
>>> predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
>>> I've been involved with, we've released maintenance releases for bugs of
>>> this severity.
>>>
>>> Since Spark 2.4.0 is probably a while away, I wanted to see if there was
>>> any consensus over whether we should consider (at least) a 2.3.1.
>>>
>>> The reason this particular issue is a bit tricky is that the Parquet
>>> community haven't yet produced a maintenance release that fixes the
>>> underlying bug, but they are in the process of releasing a new minor
>>> version, 1.10, which includes a fix. Having spoken to a couple of Parquet
>>> developers, they'd be willing to consider a maintenance release, but would
>>> probably only bother if we (or another affected project) asked them to.
>>>
>>> My guess is that we wouldn't want to upgrade to a new minor version of
>>> Parquet for a Spark maintenance release, so asking for a Parquet
>>> maintenance release makes sense.
>>>
>>> What does everyone think?
>>>
>>> Best,
>>> Henry
>>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


time for Apache Spark 3.0?

2018-04-04 Thread Reynold Xin
There was a discussion thread on scala-contributors

about Apache Spark not yet supporting Scala 2.12, and that got me to think
perhaps it is about time for Spark to work towards the 3.0 release. By the
time it comes out, it will be more than 2 years since Spark 2.0.

For contributors less familiar with Spark’s history, I want to give more
context on Spark releases:

1. Timeline: Spark 1.0 was released in May 2014, and Spark 2.0 in July 2016.
If we were to maintain the ~2-year cadence, it is time to work on Spark 3.0
in 2018.

2. Spark’s versioning policy promises that Spark does not break stable APIs
in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a
necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to
3.0).

3. That said, a major version isn’t necessarily the playground for
disruptive API changes to make it painful for users to update. The main
purpose of a major release is an opportunity to fix things that are broken
in the current API and remove certain deprecated APIs.

4. Spark as a project has a culture of evolving architecture and developing
major new features incrementally, so major releases are not the only time
for exciting new features. For example, the bulk of the work in the move
towards the DataFrame API was done in Spark 1.3, and Continuous Processing
was introduced in Spark 2.3. Both were feature releases rather than major
releases.


You can find more background in the thread discussing Spark 2.0:
http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html


The primary motivating factor IMO for a major version bump is to support
Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
Similar to Spark 2.0, I think there are also opportunities for other
changes that we know have been biting us for a long time but can’t be
changed in feature releases (to be clear, I’m actually not sure they are
all good ideas, but I’m writing them down as candidates for consideration):

1. Support Scala 2.12.

2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark
2.x.

3. Shade all dependencies.

4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant,
to prevent users from shooting themselves in the foot, e.g. “SELECT 2
SECOND” -- is “SECOND” an interval unit or an alias? To make it less
painful for users to upgrade here, I’d suggest creating a flag for backward
compatibility mode.

5. Similar to 4, make our type coercion rules in DataFrame/SQL more
standards-compliant, and have a flag for backward compatibility.

6. Miscellaneous other small changes documented in JIRA already (e.g.
“JavaPairRDD flatMapValues requires function returning Iterable, not
Iterator”, “Prevent column name duplication in temporary view”).


Now the reality of a major version bump is that the world often thinks in
terms of what exciting features are coming. I do think there are a number
of major changes happening already that can be part of the 3.0 release, if
they make it in:

1. Scala 2.12 support (listing it twice)
2. Continuous Processing non-experimental
3. Kubernetes support non-experimental
4. A more fleshed-out version of the data source API v2 (I don’t think it is
realistic to stabilize that in one release)
5. Hadoop 3.0 support
6. ...



Similar to the 2.0 discussion, this thread should focus on the framework
and whether it’d make sense to create Spark 3.0 as the next release, rather
than the individual feature requests. Those are important but are best done
in their own separate threads.


Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Reynold Xin
Ah ok. Thanks for commenting. Every day I learn something new about SQL.

For others to follow, SQL Server has a good explanation of the behavior:
https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql


Can somebody (Li?) update the API documentation to specify the gotchas, in
case users are not familiar with SQL window function semantics?



General Remarks
<https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#general-remarks>

More than one window function can be used in a single query with a single
FROM clause. The OVER clause for each function can differ in partitioning
and ordering.

If PARTITION BY is not specified, the function treats all rows of the query
result set as a single group.
Important!
<https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#important>

If ROWS/RANGE is specified and <window frame preceding> is used for <window
frame extent> (short syntax) then this specification is used for the window
frame boundary starting point and CURRENT ROW is used for the boundary
ending point. For example “ROWS 5 PRECEDING” is equal to “ROWS BETWEEN 5
PRECEDING AND CURRENT ROW”.

Note

If ORDER BY is not specified entire partition is used for a window frame.
This applies only to functions that do not require ORDER BY clause. If
ROWS/RANGE is not specified but ORDER BY is specified, RANGE UNBOUNDED
PRECEDING AND CURRENT ROW is used as default for window frame. This applies
only to functions that have can accept optional ROWS/RANGE specification.
For example, ranking functions cannot accept ROWS/RANGE, therefore this
window frame is not applied even though ORDER BY is present and ROWS/RANGE
is not.
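
For the docs, it may also be worth showing how to pin the frame explicitly so
that adding an ORDER BY does not silently change the aggregate. A minimal
PySpark sketch, reusing the example data from Li's message below:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import mean

spark = SparkSession.builder.appName("window-frame-demo").getOrCreate()
df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF("id", "v")

# With an explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING,
# the ordered window returns the same mean (2.0) for every row instead of a
# running mean from the default frame.
w = (Window.partitionBy("id")
           .orderBy("v")
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df.withColumn("v2", mean(df.v).over(w)).show()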





On Tue, Apr 3, 2018 at 5:50 PM, Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> This is actually by design, without a `ORDER BY` clause, all rows are
> considered as the peer row of the current row, which means that the frame
> is effectively the entire partition. This behavior follows the window
> syntax of PGSQL.
> You can refer to the comment by yhuai:
> https://github.com/apache/spark/pull/5604#discussion_r157931911
> :)
>
> 2018-04-04 6:27 GMT+08:00 Reynold Xin <r...@databricks.com>:
>
>> Do other (non-Hive) SQL systems do the same thing?
>>
>> On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier <
>> her...@databricks.com> wrote:
>>
>>> This is something we inherited from Hive:
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>>>
>>> When ORDER BY is specified with missing WINDOW clause, the WINDOW
>>>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND
>>>> CURRENT ROW.
>>>
>>> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>>>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND
>>>> UNBOUNDED FOLLOWING.
>>>
>>>
>>> It sort of makes sense if you think about it. If there is no ordering
>>> there is no way to have a bound frame. If there is ordering we default to
>>> the most commonly used deterministic frame.
>>>
>>>
>>> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> Seems like a bug.
>>>>
>>>>
>>>>
>>>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xell...@gmail.com> wrote:
>>>>
>>>>> Hi Devs,
>>>>>
>>>>> I am seeing some behavior with window functions that is a bit
>>>>> unintuitive and would like to get some clarification.
>>>>>
>>>>> When using aggregation function with window, the frame boundary seems
>>>>> to change depending on the order of the window.
>>>>>
>>>>> Example:
>>>>> (1)
>>>>>
>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>
>>>>> w1 = Window.partitionBy('id')
>>>>>
>>>>> df.withColumn('v2', mean(df.v).over(w1)).show()
>>>>>
>>>>> +---+---+---+
>>>>>
>>>>> | id|  v| v2|
>>>>>
>>>>> +---+---+---+
>>>>>
>>>>> |  0|  1|2.0|
>>>>>
>>>>> |  0|  2|2.0|
>>>>>
>>>>> |  0|  3|2.0|
>>>>>
>>>>> +---+---+---+
>>>>>
>>>>> (2)
>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>
>>>>> w2 = Window.partitionBy('id').orderBy('v')
>>>>>
>>>>> df.withColumn('v2', mean(df.v).over(w2)).show()
>>>>>
>>>>> +---+---+---+
>>>>>
>>>>> | id|  v| v2|
>>>>>
>>>>> +---+---+---+
>>>>>
>>>>> |  0|  1|1.0|
>>>>>
>>>>> |  0|  2|1.5|
>>>>>
>>>>> |  0|  3|2.0|
>>>>>
>>>>> +---+---+---+
>>>>>
>>>>> Seems like orderBy('v') in the example (2) also changes the frame
>>>>> boundaries from (
>>>>>
>>>>> unboundedPreceding, unboundedFollowing) to (unboundedPreceding,
>>>>> currentRow).
>>>>>
>>>>>
>>>>> I found this behavior a bit unintuitive. I wonder if this behavior is
>>>>> by design and if so, what's the specific rule that orderBy() interacts 
>>>>> with
>>>>> frame boundaries?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Li
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Reynold Xin
Do other (non-Hive) SQL systems do the same thing?

On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier <
her...@databricks.com> wrote:

> This is something we inherited from Hive:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>
> When ORDER BY is specified with missing WINDOW clause, the WINDOW
>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
>> ROW.
>
> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
>> FOLLOWING.
>
>
> It sort of makes sense if you think about it. If there is no ordering
> there is no way to have a bound frame. If there is ordering we default to
> the most commonly used deterministic frame.
>
>
> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Seems like a bug.
>>
>>
>>
>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xell...@gmail.com> wrote:
>>
>>> Hi Devs,
>>>
>>> I am seeing some behavior with window functions that is a bit
>>> unintuitive and would like to get some clarification.
>>>
>>> When using aggregation function with window, the frame boundary seems to
>>> change depending on the order of the window.
>>>
>>> Example:
>>> (1)
>>>
>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>
>>> w1 = Window.partitionBy('id')
>>>
>>> df.withColumn('v2', mean(df.v).over(w1)).show()
>>>
>>> +---+---+---+
>>>
>>> | id|  v| v2|
>>>
>>> +---+---+---+
>>>
>>> |  0|  1|2.0|
>>>
>>> |  0|  2|2.0|
>>>
>>> |  0|  3|2.0|
>>>
>>> +---+---+---+
>>>
>>> (2)
>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>
>>> w2 = Window.partitionBy('id').orderBy('v')
>>>
>>> df.withColumn('v2', mean(df.v).over(w2)).show()
>>>
>>> +---+---+---+
>>>
>>> | id|  v| v2|
>>>
>>> +---+---+---+
>>>
>>> |  0|  1|1.0|
>>>
>>> |  0|  2|1.5|
>>>
>>> |  0|  3|2.0|
>>>
>>> +---+---+---+
>>>
>>> Seems like orderBy('v') in the example (2) also changes the frame
>>> boundaries from (
>>>
>>> unboundedPreceding, unboundedFollowing) to (unboundedPreceding,
>>> currentRow).
>>>
>>>
>>> I found this behavior a bit unintuitive. I wonder if this behavior is by
>>> design and if so, what's the specific rule that orderBy() interacts with
>>> frame boundaries?
>>>
>>>
>>> Thanks,
>>>
>>> Li
>>>
>>>
>>
>


Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Reynold Xin
Seems like a bug.



On Tue, Apr 3, 2018 at 1:26 PM, Li Jin  wrote:

> Hi Devs,
>
> I am seeing some behavior with window functions that is a bit unintuitive
> and would like to get some clarification.
>
> When using aggregation function with window, the frame boundary seems to
> change depending on the order of the window.
>
> Example:
> (1)
>
> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>
> w1 = Window.partitionBy('id')
>
> df.withColumn('v2', mean(df.v).over(w1)).show()
>
> +---+---+---+
>
> | id|  v| v2|
>
> +---+---+---+
>
> |  0|  1|2.0|
>
> |  0|  2|2.0|
>
> |  0|  3|2.0|
>
> +---+---+---+
>
> (2)
> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>
> w2 = Window.partitionBy('id').orderBy('v')
>
> df.withColumn('v2', mean(df.v).over(w2)).show()
>
> +---+---+---+
>
> | id|  v| v2|
>
> +---+---+---+
>
> |  0|  1|1.0|
>
> |  0|  2|1.5|
>
> |  0|  3|2.0|
>
> +---+---+---+
>
> Seems like orderBy('v') in the example (2) also changes the frame
> boundaries from (
>
> unboundedPreceding, unboundedFollowing) to (unboundedPreceding,
> currentRow).
>
>
> I found this behavior a bit unintuitive. I wonder if this behavior is by
> design and if so, what's the specific rule that orderBy() interacts with
> frame boundaries?
>
>
> Thanks,
>
> Li
>
>


Re: [build system] experiencing network issues, git fetch timeouts likely

2018-04-02 Thread Reynold Xin
Thanks Shane for taking care of this!

On Mon, Apr 2, 2018 at 9:12 PM shane knapp  wrote:

> the problem was identified and fixed, and we should be good as of about an
> hour ago.
>
> sorry for any inconvenience!
>
> On Mon, Apr 2, 2018 at 4:15 PM, shane knapp  wrote:
>
>> hey all!
>>
>> we're having network issues on campus right now, and the jenkins workers
>> are experiencing up to 40% packet loss on our pings to github.
>>
>> this can cause builds to time out when attempting to git fetch.
>>
>> i'll post an update on the network status when i find out more about
>> what's going on.
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Hadoop 3 support

2018-04-02 Thread Reynold Xin
Is it difficult to upgrade the Hive execution version to the latest version?
The metastore used to be an issue, but that part has now been separated from
the execution part.


On Mon, Apr 2, 2018 at 1:57 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Saisai filed SPARK-23534, but the main blocking issue is really
> SPARK-18673.
>
>
> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin <r...@databricks.com> wrote:
> > Does anybody know what needs to be done in order for Spark to support
> Hadoop
> > 3?
> >
>
>
>
> --
> Marcelo
>


Re: Hadoop 3 support

2018-04-02 Thread Reynold Xin
That's just a nice-to-have improvement, right? I'm more curious about the
minimal amount of work required to support 3.0, without all the bells and
whistles. (Of course we can also do the bells and whistles, but those would
come after we can actually get 3.0 running.)


On Mon, Apr 2, 2018 at 1:50 PM, Mridul Muralidharan <mri...@gmail.com>
wrote:

> Specifically to run spark with hadoop 3 docker support, I have filed a
> few jira's tracked under [1].
>
> Regards,
> Mridul
>
> [1] https://issues.apache.org/jira/browse/SPARK-23717
>
>
> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin <r...@databricks.com> wrote:
> > Does anybody know what needs to be done in order for Spark to support
> Hadoop
> > 3?
> >
>


Hadoop 3 support

2018-04-02 Thread Reynold Xin
Does anybody know what needs to be done in order for Spark to support
Hadoop 3?


Re: [Spark R] Proposal: Exposing RBackend in RRunner

2018-03-28 Thread Reynold Xin
If you need the functionality, I would recommend just copying the code
over to your project and using it that way.

On Wed, Mar 28, 2018 at 9:02 AM Felix Cheung 
wrote:

> I think the difference is py4j is a public library whereas the R backend
> is specific to SparkR.
>
> Can you elaborate what you need JVMObjectTracker for? We have provided R
> convenient APIs to call into JVM: sparkR.callJMethod for example
>
> _
> From: Jeremy Liu 
> Sent: Tuesday, March 27, 2018 12:20 PM
> Subject: Re: [Spark R] Proposal: Exposing RBackend in RRunner
> To: 
>
>
>
> Spark Dev,
>
> On second thought, the below topic seems more appropriate for spark-dev
> rather than spark-users:
>
> Spark Users,
>>
>> In SparkR, RBackend is created in RRunner.main(). This in particular
>> makes it difficult to control or use the RBackend. For my use case, I am
>> looking to access the JVMObjectTracker that RBackend maintains for SparkR
>> dataframes.
>>
>> Analogously, pyspark starts a py4j.GatewayServer in PythonRunner.main().
>> It's then possible to start a ClientServer that then has access to the
>> object bindings between Python/Java.
>>
>> Is there something similar for SparkR? Or a reasonable way to expose
>> RBackend?
>>
>> Thanks!
>>
> --
> -
> Jeremy Liu
> jeremy.jl@gmail.com
>
>
>


Re: Reserved Words in Spark SQL as TableAliases

2018-03-19 Thread Reynold Xin
I agree but the issue was backward compatibility...

On Mon, Mar 19, 2018 at 4:02 PM Russell Spitzer 
wrote:

> I found
> https://issues.apache.org/jira/browse/SPARK-20964
>
> but currently it seems like strictIdentifiers are allowed to contain any
> reserved key words
>
>
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L501-L503
>
>
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L703-L707
>
>
>
> For example both of these things are valid
>
> scala> spark.sql("SELECT MAX(id) FROM FUN GROUP BY id").show
> scala> spark.sql("SELECT MAX(id) FROM FUN WHERE GROUP BY id").show
>
> == Parsed Logical Plan ==
> 'Aggregate ['id], [unresolvedalias('MAX('id), None)]
> +- 'SubqueryAlias WHERE
>+- 'UnresolvedRelation `FUN`
>
> Because the second reference allows "WHERE" as an identifier for the Table
> Identifier. This could allow some unintended SQL. I think it might make
> sense to allow for reserved key words but only if they are actually escaped.
>
> I think that strictIdentifier should not allow reserved words but I
> perhaps am missing some history on this.
>
>


Re: [Spark][Scheduler] Spark DAGScheduler scheduling performance hindered on JobSubmitted Event

2018-03-06 Thread Reynold Xin
It's mostly just hash maps from some ids to some state, and those can be
replaced just with concurrent hash maps?

(I haven't actually looked at code and am just guessing based on
recollection.)

On Tue, Mar 6, 2018 at 10:42 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> The problem with doing work in the callsite thread is that there are a
> number of data structures that are updated during job submission and
> these data structures are guarded by the event loop ensuring only one
> thread accesses them.  I dont think there is a very easy fix for this
> given the structure of the DAGScheduler.
>
> Thanks
> Shivaram
>
> On Tue, Mar 6, 2018 at 8:53 AM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
> > I agree with Reynold. We don't need to use a separate pool, which would
> have
> > the problem you raised about FIFO. We just need to do the planning
> outside
> > of the scheduler loop. The call site thread sounds like a reasonable
> place
> > to me.
> >
> > On Mon, Mar 5, 2018 at 12:56 PM, Reynold Xin <r...@databricks.com>
> wrote:
> >>
> >> Rather than using a separate thread pool, perhaps we can just move the
> >> prep code to the call site thread?
> >>
> >>
> >> On Sun, Mar 4, 2018 at 11:15 PM, Ajith shetty <ajith.she...@huawei.com>
> >> wrote:
> >>>
> >>> DAGScheduler becomes a bottleneck in cluster when multiple JobSubmitted
> >>> events has to be processed as DAGSchedulerEventProcessLoop is single
> >>> threaded and it will block other tasks in queue like TaskCompletion.
> >>>
> >>> The JobSubmitted event is time consuming depending on the nature of the
> >>> job (Example: calculating parent stage dependencies, shuffle
> dependencies,
> >>> partitions) and thus it blocks all the events to be processed.
> >>>
> >>>
> >>>
> >>> I see multiple JIRA referring to this behavior
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-2647
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-4961
> >>>
> >>>
> >>>
> >>> Similarly in my cluster some jobs partition calculation is time
> consuming
> >>> (Similar to stack at SPARK-2647) hence it slows down the spark
> >>> DAGSchedulerEventProcessLoop which results in user jobs to slowdown,
> even if
> >>> its tasks are finished within seconds, as TaskCompletion Events are
> >>> processed at a slower rate due to blockage.
> >>>
> >>>
> >>>
> >>> I think we can split a JobSubmitted Event into 2 events
> >>>
> >>> Step 1. JobSubmittedPreperation - Runs in separate thread on
> >>> JobSubmission, this will involve steps
> >>> org.apache.spark.scheduler.DAGScheduler#createResultStage
> >>>
> >>> Step 2. JobSubmittedExecution - If Step 1 is success, fire an event to
> >>> DAGSchedulerEventProcessLoop and let it process output of
> >>> org.apache.spark.scheduler.DAGScheduler#createResultStage
> >>>
> >>>
> >>>
> >>> I can see the effect of doing this may be that Job Submissions may not
> be
> >>> FIFO depending on how much time Step 1 mentioned above is going to
> consume.
> >>>
> >>>
> >>>
> >>> Does above solution suffice for the problem described? And is there any
> >>> other side effect of this solution?
> >>>
> >>>
> >>>
> >>> Regards
> >>>
> >>> Ajith
> >>
> >>
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>


Re: [Spark][Scheduler] Spark DAGScheduler scheduling performance hindered on JobSubmitted Event

2018-03-05 Thread Reynold Xin
Rather than using a separate thread pool, perhaps we can just move the prep
code to the call site thread?


On Sun, Mar 4, 2018 at 11:15 PM, Ajith shetty 
wrote:

> DAGScheduler becomes a bottleneck in the cluster when multiple JobSubmitted
> events have to be processed, as DAGSchedulerEventProcessLoop is single
> threaded and it will block other events in the queue, like TaskCompletion.
>
> The JobSubmitted event is time consuming depending on the nature of the
> job (Example: calculating parent stage dependencies, shuffle dependencies,
> partitions) and thus it blocks all the events to be processed.
>
>
>
> I see multiple JIRA referring to this behavior
>
> https://issues.apache.org/jira/browse/SPARK-2647
>
> https://issues.apache.org/jira/browse/SPARK-4961
>
>
>
> Similarly in my cluster some jobs partition calculation is time consuming
> (Similar to stack at SPARK-2647) hence it slows down the spark
> DAGSchedulerEventProcessLoop which results in user jobs to slowdown, even
> if its tasks are finished within seconds, as TaskCompletion Events are
> processed at a slower rate due to blockage.
>
>
>
> I think we can split a JobSubmitted Event into 2 events
>
> Step 1. JobSubmittedPreperation - Runs in separate thread on
> JobSubmission, this will involve steps org.apache.spark.scheduler.
> DAGScheduler#createResultStage
>
> Step 2. JobSubmittedExecution - If Step 1 is success, fire an event to
> DAGSchedulerEventProcessLoop and let it process output of
> org.apache.spark.scheduler.DAGScheduler#createResultStage
>
>
>
> I can see the effect of doing this may be that Job Submissions may not be
> FIFO depending on how much time Step 1 mentioned above is going to consume.
>
>
>
> Does above solution suffice for the problem described? And is there any
> other side effect of this solution?
>
>
>
> Regards
>
> Ajith
>


Re: Welcoming some new committers

2018-03-02 Thread Reynold Xin
Congrats and welcome!


On Fri, Mar 2, 2018 at 10:41 PM, Matei Zaharia 
wrote:

> Hi everyone,
>
> The Spark PMC has recently voted to add several new committers to the
> project, based on their contributions to Spark 2.3 and other past work:
>
> - Anirudh Ramanathan (contributor to Kubernetes support)
> - Bryan Cutler (contributor to PySpark and Arrow support)
> - Cody Koeninger (contributor to streaming and Kafka support)
> - Erik Erlandson (contributor to Kubernetes support)
> - Matt Cheah (contributor to Kubernetes support and other parts of Spark)
> - Seth Hendrickson (contributor to MLlib and PySpark)
>
> Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth as
> committers!
>
> Matei
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Please keep s3://spark-related-packages/ alive

2018-02-27 Thread Reynold Xin
This was actually an AMPLab bucket.

On Feb 27, 2018, 6:04 PM +1300, Holden Karau , wrote:
> Thanks Nick, we deprecated this during the roll over to the new release 
> managers. I assume this bucket was maintained by someone at databricks so 
> maybe they can chime in.
>
> > On Feb 26, 2018 8:57 PM, "Nicholas Chammas"  
> > wrote:
> > > If you go to the Downloads page and download Spark 2.2.1, you’ll get a 
> > > link to an Apache mirror. It didn’t use to be this way. As recently as 
> > > Spark 2.2.0, downloads were served via CloudFront, which was backed by an 
> > > S3 bucket named spark-related-packages.
> > > It seems that we’ve stopped using CloudFront, and the S3 bucket behind it 
> > > has stopped receiving updates (e.g. Spark 2.2.1 isn’t there). I’m 
> > > guessing this is part of an effort to use the Apache mirror network, like 
> > > other Apache projects do.
> > > From a user perspective, the Apache mirror network is several steps down 
> > > from using a modern CDN. Let me summarize why:
> > >
> > > 1. Apache mirrors are often slow. Apache does not impose any performance 
> > > requirements on its mirrors. The difference between getting a good mirror 
> > > and a bad one means downloading Spark in less than a minute vs. 20 
> > > minutes. The problem is so bad that I’ve thought about adding an Apache 
> > > mirror blacklist to Flintrock to avoid getting one of these dud mirrors.
> > > 2. Apache mirrors are inconvenient to use. When you download something 
> > > from an Apache mirror, you get a link like this one. Instead of 
> > > automatically redirecting you to your download, though, you need to 
> > > process the results you get back to find your download target. And you 
> > > need to handle the high download failure rate, since sometimes the mirror 
> > > you get doesn’t have the file it claims to have.
> > > 3. Apache mirrors are incomplete. Apache mirrors only keep around the 
> > > latest releases, save for a few “archive” mirrors, which are often slow. 
> > > So if you want to download anything but the latest version of Spark, you 
> > > are out of luck.
> > >
> > > Some of these problems can be mitigated by picking a specific mirror that 
> > > works well and hardcoding it in your scripts, but that defeats the 
> > > purpose of dynamically selecting a mirror and makes you a “bad” user of 
> > > the mirror network.
> > > I raised some of these issues over on INFRA-10999. The ticket sat for a 
> > > year before I heard anything back, and the bottom line was that none of 
> > > the above problems have a solution on the horizon. It’s fine. I 
> > > understand that Apache is a volunteer organization and that the 
> > > infrastructure team has a lot to manage as it is. I still find it 
> > > disappointing that an organization of Apache’s stature doesn’t have a 
> > > better solution for this in collaboration with a third party. Python 
> > > serves PyPI downloads using Fastly and Homebrew serves packages using 
> > > Bintray. They both work really, really well. Why don’t we have something 
> > > as good for Apache projects? Anyway, that’s a separate discussion.
> > > What I want to say is this:
> > > Dear whoever owns the spark-related-packages S3 bucket,
> > > Please keep the bucket up-to-date with the latest Spark releases, 
> > > alongside the past releases that are already on there. It’s a huge help 
> > > to the Flintrock project, and it’s an equally big help to those of us 
> > > writing infrastructure automation scripts that deploy Spark in other 
> > > contexts.
> > > I understand that hosting this stuff is not free, and that I am not 
> > > paying anything for this service. If it needs to go, so be it. But I 
> > > wanted to take this opportunity to lay out the benefits I’ve enjoyed 
> > > thanks to having this bucket around, and to make sure that if it did die, 
> > > it didn’t die a quiet death.
> > > Sincerely,
> > > Nick
> > >
>


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Reynold Xin
+1

On Feb 20, 2018, 5:51 PM +1300, Sameer Agarwal , wrote:
> > > this file shouldn't be included? 
> > > https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml
> >
> > I've now deleted this file
> >
> > > From: Sameer Agarwal 
> > > Sent: Saturday, February 17, 2018 1:43:39 PM
> > > To: Sameer Agarwal
> > > Cc: dev
> > > Subject: Re: [VOTE] Spark 2.3.0 (RC4)
> > >
> > > I'll start with a +1 once again.
> > >
> > > All blockers reported against RC3 have been resolved and the builds are 
> > > healthy.
> > >
> > > > On 17 February 2018 at 13:41, Sameer Agarwal  
> > > > wrote:
> > > > > Please vote on releasing the following candidate as Apache Spark 
> > > > > version 2.3.0. The vote is open until Thursday February 22, 2018 at 
> > > > > 8:00:00 am UTC and passes if a majority of at least 3 PMC +1 votes 
> > > > > are cast.
> > > > >
> > > > >
> > > > > [ ] +1 Release this package as Apache Spark 2.3.0
> > > > >
> > > > > [ ] -1 Do not release this package because ...
> > > > >
> > > > >
> > > > > To learn more about Apache Spark, please see https://spark.apache.org/
> > > > >
> > > > > The tag to be voted on is v2.3.0-rc4: 
> > > > > https://github.com/apache/spark/tree/v2.3.0-rc4 
> > > > > (44095cb65500739695b0324c177c19dfa1471472)
> > > > >
> > > > > List of JIRA tickets resolved in this release can be found here: 
> > > > > https://issues.apache.org/jira/projects/SPARK/versions/12339551
> > > > >
> > > > > The release files, including signatures, digests, etc. can be found 
> > > > > at:
> > > > > https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/
> > > > >
> > > > > Release artifacts are signed with the following key:
> > > > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > > > >
> > > > > The staging repository for this release can be found at:
> > > > > https://repository.apache.org/content/repositories/orgapachespark-1265/
> > > > >
> > > > > The documentation corresponding to this release can be found at:
> > > > > https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/index.html
> > > > >
> > > > >
> > > > > FAQ
> > > > >
> > > > > ===
> > > > > What are the unresolved issues targeted for 2.3.0?
> > > > > ===
> > > > >
> > > > > Please see https://s.apache.org/oXKi. At the time of writing, there 
> > > > > are currently no known release blockers.
> > > > >
> > > > > =
> > > > > How can I help test this release?
> > > > > =
> > > > >
> > > > > If you are a Spark user, you can help us test this release by taking 
> > > > > an existing Spark workload and running on this release candidate, 
> > > > > then reporting any regressions.
> > > > >
> > > > > If you're working in PySpark you can set up a virtual env and install 
> > > > > the current RC and see if anything important breaks, in the 
> > > > > Java/Scala you can add the staging repository to your projects 
> > > > > resolvers and test with the RC (make sure to clean up the artifact 
> > > > > cache before/after so you don't end up building with a out of date RC 
> > > > > going forward).
> > > > >
> > > > > ===
> > > > > What should happen to JIRA tickets still targeting 2.3.0?
> > > > > ===
> > > > >
> > > > > Committers should look at those and triage. Extremely important bug 
> > > > > fixes, documentation, and API tweaks that impact compatibility should 
> > > > > be worked on immediately. Everything else please retarget to 2.3.1 or 
> > > > > 2.4.0 as appropriate.
> > > > >
> > > > > ===
> > > > > Why is my bug not fixed?
> > > > > ===
> > > > >
> > > > > In order to make timely releases, we will typically not hold the 
> > > > > release unless the bug in question is a regression from 2.2.0. That 
> > > > > being said, if there is something which is a regression from 2.2.0 
> > > > > and has not been correctly targeted please ping me or a committer to 
> > > > > help target the issue (you can see the open issues listed as 
> > > > > impacting Spark 2.3.0 at https://s.apache.org/WmoI).
> > >
> > >
> > >
> > > --
> > > Sameer Agarwal
> > > Computer Science | UC Berkeley
> > > http://cs.berkeley.edu/~sameerag
>
>
>
> --
> Sameer Agarwal
> Computer Science | UC Berkeley
> http://cs.berkeley.edu/~sameerag


Re: Drop the Hadoop 2.6 profile?

2018-02-08 Thread Reynold Xin
Does it gain us anything to drop 2.6?

> On Feb 8, 2018, at 10:50 AM, Sean Owen  wrote:
> 
> At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both fairly old, 
> and actually, not different from 2.7 with respect to Spark. That is, I don't 
> know if we are actually maintaining anything here but a separate profile and 
> 2x the number of test builds.
> 
> The cost is, by the same token, low. However I'm floating the idea of 
> removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: data source v2 online meetup

2018-02-01 Thread Reynold Xin
Still would be good to join. We can also do an additional one in March to
give people more time.


On Thu, Feb 1, 2018 at 3:59 PM, Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I can try to do a quick scratch implementation to see how the connector
> fits in, but we are in the middle of release land so I don't have the
> amount of time I really need to think about this. I'd be glad to join any
> hangout to discuss everything though.
>
> On Thu, Feb 1, 2018 at 11:15 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> We don't mind updating Iceberg when the API improves. We are fully aware
>> that this is a very early implementation and will change. My hope is that
>> the community is receptive to our suggestions.
>>
>> A good example of an area with friction is filter and projection
>> push-down. The implementation for DSv2 isn't based on what the other read
>> paths do; it is brand new and mostly untested. I don't really understand
>> why DSv2 introduced a new code path, when reusing existing code for this
>> ended up being smaller and works for more cases (see my comments on
>> #20476 <https://github.com/apache/spark/pull/20476>). I understand
>> wanting to fix parts of push-down, just not why it is a good idea to mix
>> that substantial change into an unrelated API update. This is one area
>> where, I hope, our suggestion to get DSv2 working well and redesign
>> push-down as a parallel effort is heard.
>>
>> I also see a few areas where the integration of DSv2 conflicts with what
>> I understand to be design principles of the catalyst optimizer. The fact
>> that it should use immutable nodes in plans is mostly settled, but there
>> are other examples. The approach of the new push-down implementation fights
>> against the principle of small rules that don't need to process the entire
>> plan tree. I think this makes the component brittle, and I'd like to
>> understand the rationale for going with this design. I'd love to see a
>> design document that covers why this is a necessary choice (but again,
>> separately).
>>
>> rb
>>
>> On Thu, Feb 1, 2018 at 9:10 AM, Felix Cheung <felixcheun...@hotmail.com>
>> wrote:
>>
>>> +1 hangout
>>>
>>> --
>>> *From:* Xiao Li <gatorsm...@gmail.com>
>>> *Sent:* Wednesday, January 31, 2018 10:46:26 PM
>>> *To:* Ryan Blue
>>> *Cc:* Reynold Xin; dev; Wenchen Fen; Russell Spitzer
>>> *Subject:* Re: data source v2 online meetup
>>>
>>> Hi, Ryan,
>>>
>>> wow, your Iceberg already used data source V2 API! That is pretty cool!
>>> I am just afraid these new APIs are not stable. We might deprecate or
>>> change some data source v2 APIs in the next version (2.4). Sorry for the
>>> inconvenience it might introduce.
>>>
>>> Thanks for your feedback always,
>>>
>>> Xiao
>>>
>>>
>>> 2018-01-31 15:54 GMT-08:00 Ryan Blue <rb...@netflix.com.invalid>:
>>>
>>>> Thanks for suggesting this, I think it's a great idea. I'll definitely
>>>> attend and can talk about the changes that we've made DataSourceV2 to
>>>> enable our new table format, Iceberg
>>>> <https://github.com/Netflix/iceberg#about-iceberg>.
>>>>
>>>> On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Data source v2 API is one of the larger main changes in Spark 2.3, and
>>>>> whatever that has already been committed is only the first version and 
>>>>> we'd
>>>>> need more work post-2.3 to improve and stablize it.
>>>>>
>>>>> I think at this point we should stop making changes to it in
>>>>> branch-2.3, and instead focus on using the existing API and getting
>>>>> feedback for 2.4. Would people be interested in doing an online hangout to
>>>>> discuss this, perhaps in the month of Feb?
>>>>>
>>>>> It'd be more productive if people attending the hangout have tried the
>>>>> API by implementing some new sources or porting an existing source over.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-01-31 Thread Reynold Xin
You can just do that with mapPartitions pretty easily, can’t you?
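
For example, a rough PySpark sketch of a per-partition sort via
mapPartitions, with no repartitioning (the sample data and key function are
made up; note this buffers each partition in memory, unlike a spilling sort):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-within-partitions").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")], 2)

# Sort each partition locally by key; the partitioning itself is untouched.
sorted_within = rdd.mapPartitions(
    lambda it: iter(sorted(it, key=lambda kv: kv[0])),
    preservesPartitioning=True)

print(sorted_within.glom().collect())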

On Wed, Jan 31, 2018 at 11:08 PM Ruifeng Zheng  wrote:

> HI all:
>
>
>
>1, Dataset API supports operation “sortWithinPartitions”, but in
> RDD API there is no counterpart (I know there is
> “repartitionAndSortWithinPartitions”, but I don’t want to repartition the
> RDD), I have to convert RDD to Dataset for this function. Would it make
> sense to add a “sortWithinPartitions” for RDD?
>
>
>
>2, In “aggregateByKey”/”reduceByKey”, I want to do some special
> operation (like aggregator compression) after local aggregation on each
> partitions. A similar case may be: compute ‘ApproximatePercentile’ for
> different keys by ”reduceByKey”, it may be helpful if
> ‘QuantileSummaries#compress’ is called before network communication. So I
> wonder if it is useful to add a ‘aggregateWithinPartitions’ for RDD?
>
>
>
> Regards,
>
> Ruifeng
>
>
>
>
>
>
>
>
>


data source v2 online meetup

2018-01-31 Thread Reynold Xin
The data source v2 API is one of the larger changes in Spark 2.3, and what
has already been committed is only the first version; we'd need more work
post-2.3 to improve and stabilize it.

I think at this point we should stop making changes to it in branch-2.3,
and instead focus on using the existing API and getting feedback for 2.4.
Would people be interested in doing an online hangout to discuss this,
perhaps in the month of Feb?

It'd be more productive if people attending the hangout have tried the API
by implementing some new sources or porting an existing source over.


Re: [SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Reynold Xin
For the DataFrame/Dataset API, the optimizer actually rewrites orderBy followed
by a take into a priority-queue-based top-k implementation.
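
For illustration, the rewrite is visible in the physical plan (a quick sketch assuming a spark-shell style SparkSession named spark; in recent 2.x releases the operator shows up as TakeOrderedAndProject):

import org.apache.spark.sql.functions.col

val df = spark.range(0, 1000000).toDF("id")
// Instead of a global Sort followed by a Limit, the plan contains a single
// TakeOrderedAndProject node, i.e. a bounded per-partition top-k that is
// merged at the end.
df.orderBy(col("id").desc).limit(10).explain()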


On Tue, Jan 30, 2018 at 11:10 PM, Yacine Mazari  wrote:

> Hi All,
>
> Would it make sense to add a "top()" method to the Dataset API?
> This method would return a Dataset containing the top k elements, the
> caller
> may then do further processing on the Dataset or call collect(). This is in
> contrast with RDD's top() which returns a collected array.
>
> In terms of implementation, this would use a bounded priority queue, which
> will avoid sorting all elements and run in O(n log k).
>
> I know something similar can be achieved by "orderBy().take()", but I am
> not
> sure if this is optimized.
> If that's not the case, and it performs sorting of all elements (therefore
> running in n log n), it might be handy to add this method.
>
> What do you think?
>
> Regards,
> Yacine.
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: ***UNCHECKED*** [jira] [Resolved] (SPARK-23218) simplify ColumnVector.getArray

2018-01-26 Thread Reynold Xin
I have no idea. Some JIRA update? Might want to file an INFRA ticket.


On Fri, Jan 26, 2018 at 10:04 AM, Sean Owen  wrote:

> This is an example of the "*** UNCHECKED ***" message I was talking about
> -- it's part of the email subject rather than JIRA.
>
> -- Forwarded message -
> From: Xiao Li (JIRA) 
> Date: Fri, Jan 26, 2018 at 11:18 AM
> Subject: ***UNCHECKED*** [jira] [Resolved] (SPARK-23218) simplify
> ColumnVector.getArray
> To: 
>
>
>
>  [ https://issues.apache.org/jira/browse/SPARK-23218?page=
> com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Xiao Li resolved SPARK-23218.
> -
>Resolution: Fixed
> Fix Version/s: 2.3.0
>
> > simplify ColumnVector.getArray
> > --
> >
> > Key: SPARK-23218
> > URL: https://issues.apache.org/jira/browse/SPARK-23218
> > Project: Spark
> >  Issue Type: Sub-task
> >  Components: SQL
> >Affects Versions: 2.3.0
> >Reporter: Wenchen Fan
> >Assignee: Wenchen Fan
> >Priority: Major
> > Fix For: 2.3.0
> >
> >
>
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
> -
> To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
> For additional commands, e-mail: issues-h...@spark.apache.org
>
>


Re: What is "*** UNCHECKED ***"?

2018-01-26 Thread Reynold Xin
Examples?


On Fri, Jan 26, 2018 at 9:56 AM, Sean Owen  wrote:

> I probably missed this, but what is the new "*** UNCHECKED ***" message in
> the subject line of some JIRAs?
>


Re: Spark 3

2018-01-19 Thread Reynold Xin
We can certainly provide a build for Scala 2.12, even in 2.x.


On Fri, Jan 19, 2018 at 10:17 AM, Justin Miller <
justin.mil...@protectwise.com> wrote:

> Would that mean supporting both 2.12 and 2.11? Could be a while before
> some of our libraries are off of 2.11.
>
> Thanks,
> Justin
>
>
> On Jan 19, 2018, at 10:53 AM, Koert Kuipers  wrote:
>
> i was expecting to be able to move to scala 2.12 sometime this year
>
> if this cannot be done in spark 2.x then that could be a compelling reason
> to move spark 3 up to 2018 i think
>
> hadoop 3 sounds great but personally i have no use case for it yet
>
> On Fri, Jan 19, 2018 at 12:31 PM, Sean Owen  wrote:
>
>> Forking this thread to muse about Spark 3. Like Spark 2, I assume it
>> would be more about making all those accumulated breaking changes and
>> updating lots of dependencies. Hadoop 3 looms large in that list as well as
>> Scala 2.12.
>>
>> Spark 1 was released in May 2014, and Spark 2 in July 2016. If Spark 2.3
>> is out in Feb 2018 and it takes the now-usual 6 months until a next
>> release, Spark 3 could reasonably be next.
>>
>> However the release cycles are naturally slowing down, and it could also
>> be said that 2019 would be more on schedule for Spark 3.
>>
>> Nothing particularly urgent about deciding, but I'm curious if anyone had
>> an opinion on whether to move on to Spark 3 next or just continue with 2.4
>> later this year.
>>
>> On Fri, Jan 19, 2018 at 11:13 AM Sean Owen  wrote:
>>
>>> Yeah, if users are using Kryo directly, they should be insulated from a
>>> Spark-side change because of shading.
>>> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x.
>>> I am not sure if that causes problems for apps.
>>>
>>> Normally I'd avoid any major-version change in a minor release. This one
>>> looked potentially entirely internal.
>>> I think if there are any doubts, we can leave it for Spark 3. There was
>>> a bug report that needed a fix from Kryo 4, but it might be minor after all.
>>>


>
>


Re: Kryo 4 serialized form changes -- a problem?

2018-01-19 Thread Reynold Xin
I don’t think Spark relies on Kryo or Java serialization for persistence. User
programs might, though, so it would be great if we can shade it.

On Fri, Jan 19, 2018 at 5:55 AM Sean Owen  wrote:

> See:
>
> https://issues.apache.org/jira/browse/SPARK-23131
> https://github.com/apache/spark/pull/20301#issuecomment-358473199
>
> I expected a major Kryo upgrade to be problematic, but it worked fine. It
> picks up a number of fixes:
> https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
>
> It might be good for Spark 2.4.
>
> Its serialized format isn't entirely compatible though. I'm trying to
> recall whether this is a problem in practice. We don't guarantee wire
> compatibility across mismatched Spark versions, right?
>
> But does the Kryo serialized form show up in any persistent stored form? I
> don't believe any normal output, even that of saveAsObjectFile, uses it.
>
> I'm wondering if I am not recalling why this would be a problem to update?
>
> Sean
>


Re: Integration testing and Scheduler Backends

2018-01-09 Thread Reynold Xin
If we can actually get our act together and have integration tests in
Jenkins (perhaps not run on every commit, but weekly or as pre-release smoke
tests), that'd be great. Then we'd rely less on contributors testing manually.


On Tue, Jan 9, 2018 at 8:09 AM, Timothy Chen  wrote:

> 2) will be ideal but given the velocity of main branch, what Mesos
> ended up doing was simply having a separate repo since it will take
> too long to merge back to main.
>
> We ended up running it pre-release (or major PR merged) and not on
> every PR, I will also comment on asking users to run it.
>
> We did have conversations with Reynold about potentially having the
> ability to run the CI on every [Mesos] tagged PR, but we never got
> there.
>
> Tim
>
> On Mon, Jan 8, 2018 at 10:16 PM, Anirudh Ramanathan
>  wrote:
> > This is with regard to the Kubernetes Scheduler Backend and scaling the
> > process to accept contributions. Given we're moving past upstreaming
> changes
> > from our fork, and into getting new patches, I wanted to start this
> > discussion sooner than later. This is more of a post-2.3 question - not
> > something we're looking to solve right away.
> >
> > While unit tests are handy, they're not nearly as good at giving us
> > confidence as a successful run of our integration tests against
> > single/multi-node k8s clusters. Currently, we have integration testing
> setup
> > at https://github.com/apache-spark-on-k8s/spark-integration and it's
> running
> > continuously against apache/spark:master in pepperdata-jenkins (on
> minikube)
> > & k8s-testgrid (in GKE clusters). Now, the question is - how do we make
> > integration-tests part of the PR author's workflow?
> >
> > 1. Keep the integration tests in the separate repo and require that
> > contributors run them, add new tests prior to accepting their PRs as a
> > policy. Given minikube is easy to setup and can run on a single-node, it
> > would certainly be possible. Friction however, stems from contributors
> > potentially having to modify the integration test code hosted in that
> > separate repository when adding/changing functionality in the scheduler
> > backend. Also, it's certainly going to lead to at least brief
> > inconsistencies between the two repositories.
> >
> > 2. Alternatively, we check in the integration tests alongside the actual
> > scheduler backend code. This would work really well and is what we did in
> > our fork. It would have to be a separate package which would take certain
> > parameters (like cluster endpoint) and run integration test code against
> a
> > local or remote cluster. It would include least some code dealing with
> > accessing the cluster, reading results from K8s containers, test
> fixtures,
> > etc.
> >
> > I see value in adopting (2), given it's a clearer path for contributors
> and
> > lets us keep the two pieces consistent, but it seems uncommon elsewhere.
> How
> > do the other backends, i.e. YARN, Mesos and Standalone deal with
> accepting
> > patches and ensuring that they do not break existing clusters? Is there
> > automation employed for this thus far? Would love to get opinions on (1)
> v/s
> > (2).
> >
> > Thanks,
> > Anirudh
> >
> >
>


Re: [SPIP] as-of join in Spark SQL

2018-01-03 Thread Reynold Xin
I've replied on the ticket online ...


On Wed, Jan 3, 2018 at 11:41 AM, Li Jin  wrote:

> Hi community,
>
> Following the instructions at https://spark.apache.org/improvement-proposals.html,
> I'd like to propose a SPIP: as-of join in
> Spark SQL.
>
> Here is the Jira:
> https://issues.apache.org/jira/browse/SPARK-22947
>
> If you are interested, please take a look and let me know what you think.
> I am looking forward to feedbacks.
>
> Thank you
>


Re: A list of major features in 2.3

2018-01-03 Thread Reynold Xin
It hasn't been compiled yet, but you can look up all the features on JIRA
by setting a filter on the fix version. Usually the release manager compiles
the list towards the end of the release cycle (coming up soon).


On Mon, Dec 25, 2017 at 10:07 PM, Anoop Saxena 
wrote:

> Hi,
> Is there a list of major features going into 2.3 release available for
> consumption?
>
> Thanks,
> Anoop
>


Re: Result obtained before the completion of Stages

2017-12-27 Thread Reynold Xin
Is it possible there is a bug in the UI? If you can run jstack on the
executor process to see whether anything is actually running, that can help
narrow down the issue.

On Tue, Dec 26, 2017 at 10:28 PM ckhari4u  wrote:

> Hi Reynold,
>
> I am running a Spark SQL query.
>
> val df = spark.sql("select * from table1 t1 join table2 t2 on
> t1.col1=t2.col1")
> df.count()
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Result obtained before the completion of Stages

2017-12-26 Thread Reynold Xin
What did you run?


On Tue, Dec 26, 2017 at 10:21 PM, ckhari4u  wrote:

> Hi Sean,
>
> Thanks for the reply. I believe I am not facing the scenarios you
> mentioned.
>
> Timestamp conflict: I see the Spark driver logs on the console (tried with
> INFO and DEBUG). In all the scenarios, I see the result getting printed and
> the application execution continues for 4 more minutes.
> i.e., I have seen scenarios where the Spark History Server timestamp did not
> match the Spark driver logs. In this case, I am checking only the
> driver logs and I could see the logs getting printed on the console even
> after the result is generated.
>
> Stages of a different action: I am performing a join on 2 tables and doing
> a count operation. So there is only one action. The stage which is taking
> more time is the join phase (Sort merge join specifically). To improve the join,
> I tried to cache the smaller dataset. Then I do not see the issue.
>
> I am just wondering how Spark can get the result before the completion of
> the join operation.
>
> PS: My actual query in the application has many operators, UDF's etc. The
> above is the minimal operation query for which I am able to reproduce the
> issue.
>
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [01/51] [partial] spark-website git commit: 2.2.1 generated doc

2017-12-17 Thread Reynold Xin
There is an additional step that's needed to update the symlink, and that
step hasn't been done yet.


On Sun, Dec 17, 2017 at 12:32 PM, Jacek Laskowski  wrote:

> Hi Sean,
>
> What does "Not all the pieces are released yet" mean if you don't mind me
> asking? 2.2.1 has already been announced, hasn't it? [1]
>
> [1] http://spark.apache.org/news/spark-2-2-1-released.html
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> On Sun, Dec 17, 2017 at 4:19 PM, Sean Owen  wrote:
>
>> /latest does not point to 2.2.1 yet. Not all the pieces are released yet,
>> as I understand?
>>
>> On Sun, Dec 17, 2017 at 8:12 AM Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> I saw the following commit, but I can't seem to see 2.2.1 as the version
>>> in the header of the documentation pages under http://spark.apache.org/
>>> docs/latest/ (that is still 2.2.0). Is this being worked on?
>>>
>>> http://spark.apache.org/docs/2.2.1 is available and shows the proper
>>> version, but not http://spark.apache.org/docs/latest :(
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>>
>>>
>


Re: Decimals

2017-12-13 Thread Reynold Xin
Responses inline

On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido  wrote:

> Hi all,
>
> I have seen in recent weeks that there are a lot of problems related to decimal
> values (SPARK-22036, SPARK-22755, for instance). Some are related to
> historical choices, which I don't know, thus please excuse me if I am
> saying dumb things:
>
>  - why are we interpreting literal constants in queries as Decimal and not
> as Double? I think it is very unlikely that a user can enter a number which
> is beyond Double precision.
>

Probably just to be consistent with some popular databases.



>  - why are we returning null in case of precision loss? Is this approach
> better than just giving a result which might lose some accuracy?
>

The contract with decimal is that it should never lose precision (it is
created for financial reports, accounting, etc). Returning null is at least
telling the user the data type can no longer support the precision required.
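
A small illustration of that contract (a sketch in a spark-shell session; under the defaults of the 2.x line, an arithmetic result that cannot fit the decimal type comes back as null rather than a silently truncated value):

// Multiplying two decimals whose product cannot be represented in decimal(38, _):
spark.sql("""
  SELECT CAST('12345678901234567890123456789012345678' AS DECIMAL(38,0)) *
         CAST('12345678901234567890123456789012345678' AS DECIMAL(38,0)) AS product
""").show()
// The single row comes back as NULL instead of a rounded/overflowed number.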



>
> Thanks,
> Marco
>


Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-01 Thread Reynold Xin
Congrats.


On Fri, Dec 1, 2017 at 12:10 AM, Felix Cheung 
wrote:

> This vote passes. Thanks everyone for testing this release.
>
>
> +1:
>
> Sean Owen (binding)
>
> Herman van Hövell tot Westerflier (binding)
>
> Wenchen Fan (binding)
>
> Shivaram Venkataraman (binding)
>
> Felix Cheung
>
> Henry Robinson
>
> Hyukjin Kwon
>
> Dongjoon Hyun
>
> Kazuaki Ishizaki
>
> Holden Karau
>
> Weichen Xu
>
>
> 0: None
>
> -1: None
>
>
>
>
> On Wed, Nov 29, 2017 at 3:21 PM Weichen Xu 
> wrote:
>
>> +1
>>
>> On Thu, Nov 30, 2017 at 6:27 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1
>>>
>>> SHA, MD5 and signatures look fine. Built and ran Maven tests on my
>>> Macbook.
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Wed, Nov 29, 2017 at 10:43 AM, Holden Karau 
>>> wrote:
>>>
 +1 (non-binding)

 PySpark install into a virtualenv works, PKG-INFO looks correctly
 populated (mostly checking for the pypandoc conversion there).

 Thanks for your hard work Felix (and all of the testers :)) :)

 On Wed, Nov 29, 2017 at 9:33 AM, Wenchen Fan 
 wrote:

> +1
>
> On Thu, Nov 30, 2017 at 1:28 AM, Kazuaki Ishizaki  > wrote:
>
>> +1 (non-binding)
>>
>> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests
>> for core/sql-core/sql-catalyst/mllib/mllib-local have passed.
>>
>> $ java -version
>> openjdk version "1.8.0_131"
>> OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.
>> 16.04.3-b11)
>> OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
>>
>> % build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn
>> -Phadoop-2.7 -T 24 clean package install
>> % build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl
>> core -pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local
>> ...
>> Run completed in 13 minutes, 54 seconds.
>> Total number of tests run: 1118
>> Suites: completed 170, aborted 0
>> Tests: succeeded 1118, failed 0, canceled 0, ignored 6, pending 0
>> All tests passed.
>> [INFO] 
>> 
>> [INFO] Reactor Summary:
>> [INFO]
>> [INFO] Spark Project Core . SUCCESS
>> [17:13 min]
>> [INFO] Spark Project ML Local Library . SUCCESS [
>>  6.065 s]
>> [INFO] Spark Project Catalyst . SUCCESS
>> [11:51 min]
>> [INFO] Spark Project SQL .. SUCCESS
>> [17:55 min]
>> [INFO] Spark Project ML Library ... SUCCESS
>> [17:05 min]
>> [INFO] 
>> 
>> [INFO] BUILD SUCCESS
>> [INFO] 
>> 
>> [INFO] Total time: 01:04 h
>> [INFO] Finished at: 2017-11-30T01:48:15+09:00
>> [INFO] Final Memory: 128M/329M
>> [INFO] 
>> 
>> [WARNING] The requested profile "hive" could not be activated because
>> it does not exist.
>>
>> Kazuaki Ishizaki
>>
>>
>>
>> From: Dongjoon Hyun 
>> To: Hyukjin Kwon 
>> Cc: Spark dev list , Felix Cheung <
>> felixche...@apache.org>, Sean Owen 
>> Date: 2017/11/29 12:56
>> Subject: Re: [VOTE] Spark 2.2.1 (RC2)
>> --
>>
>>
>>
>> +1 (non-binding)
>>
>> RC2 is tested on CentOS, too.
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 28, 2017 at 4:35 PM, Hyukjin Kwon <*gurwls...@gmail.com*
>> > wrote:
>> +1
>>
>> 2017-11-29 8:18 GMT+09:00 Henry Robinson <*he...@apache.org*
>> >:
>> (My vote is non-binding, of course).
>>
>> On 28 November 2017 at 14:53, Henry Robinson <*he...@apache.org*
>> > wrote:
>> +1, tests all pass for me on Ubuntu 16.04.
>>
>> On 28 November 2017 at 10:36, Herman van Hövell tot Westerflier <
>> *hvanhov...@databricks.com* > wrote:
>> +1
>>
>> On Tue, Nov 28, 2017 at 7:35 PM, Felix Cheung <
>> *felixche...@apache.org* > wrote:
>> +1
>>
>> Thanks Sean. Please vote!
>>
>> Tested various scenarios with R package. Ubuntu, Debian, Windows
>> r-devel and release and on r-hub. Verified CRAN checks are clean (only 1
>> NOTE!) and no leaked files (.cache removed, /tmp clean)
>>
>>
>> On Sun, Nov 26, 2017 at 11:55 AM Sean Owen 

Re: OutputMetrics empty for DF writes - any hints?

2017-11-27 Thread Reynold Xin
Is this due to the insert command not having metrics? It's a problem we
should fix.
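
For anyone who wants to poke at this, a rough sketch of the listener side described below (assuming a spark-shell style SparkSession named spark; these outputMetrics accessors are the ones reported as zero for DataFrame writes):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Print per-task output metrics as tasks finish, to see whether
// recordsWritten/bytesWritten are populated for a given write path.
class OutputMetricsListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      val out = metrics.outputMetrics
      println(s"stage=${taskEnd.stageId} records=${out.recordsWritten} bytes=${out.bytesWritten}")
    }
  }
}

spark.sparkContext.addSparkListener(new OutputMetricsListener)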


On Mon, Nov 27, 2017 at 10:45 AM, Jason White 
wrote:

> I'd like to use the SparkListenerInterface to listen for some metrics for
> monitoring/logging/metadata purposes. The first ones I'm interested in
> hooking into are recordsWritten and bytesWritten as a measure of
> throughput.
> I'm using PySpark to write Parquet files from DataFrames.
>
> I'm able to extract a rich set of metrics this way, but for some reason the
> two that I want are always 0. This mirrors what I see in the Spark
> Application Master - the # records written field is always missing.
>
> I've filed a JIRA already for this issue:
> https://issues.apache.org/jira/browse/SPARK-22605
>
> I _think_ how this works is that inside the ResultTask.runTask method, the
> rdd.iterator call is incrementing the bytes read & records read via
> RDD.getOrCompute. Where would the equivalent be for the records written
> metrics?
>
> These metrics are populated properly if I save the data as an RDD via
> df.rdd.saveAsTextFile, so the code path exists somewhere. Any hints as to
> where I should be looking?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Faster and Lower memory implementation toPandas

2017-11-16 Thread Reynold Xin
Please send a PR. Thanks for looking at this.

On Thu, Nov 16, 2017 at 7:27 AM Andrew Andrade 
wrote:

> Hello devs,
>
> I know a lot of great work has been done recently with pandas to spark
> dataframes and vice versa using Apache Arrow, but I faced a specific pain
> point on a low memory setup without Arrow.
>
> Specifically I was finding a driver OOM running a toPandas on a small
> dataset (<100 MB compressed).  There was discussion about toPandas being
> slow
> 
> in March 2016 due to a self.collect().  A solution was found to create Pandas
> DataFrames or Numpy Arrays using MapPartitions for each partition
> , but it was never
> implemented back into dataframe.py
>
> I understand that using Apache Arrow will solve this, but in a setup
> without Arrow (like the one where I faced the pain point), I investigated
> memory usage of toPandas and to_pandas (dataframe per partition) and played
> with the number of partitions.  The findings are here
> 
> .
>
> The summary of the findings is that on a 147MB dataset, toPandas memory
> usage was about 784MB, while doing it partition by partition (with 100
> partitions) had an overhead of 76.30 MM and took almost half of the time to
> run.  I realize that Arrow solves this but the modification is quite small
> and would greatly assist anyone who isn't able to use Arrow.
>
> Would a PR [1] from me to address this issue be welcome?
>
> Thanks,
>
> Andrew
>
> [1] From Josh's Gist
>
> import pandas as pd
>
> def _map_to_pandas(rdds):
> """ Needs to be here due to pickling issues """
> return [pd.DataFrame(list(rdds))]
>
> def toPandas(df, n_partitions=None):
> """
> Returns the contents of `df` as a local `pandas.DataFrame` in a speedy
> fashion. The DataFrame is
> repartitioned if `n_partitions` is passed.
> :param df: pyspark.sql.DataFrame
> :param n_partitions: int or None
> :return: pandas.DataFrame
> """
> if n_partitions is not None: df = df.repartition(n_partitions)
> df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
> df_pand = pd.concat(df_pand)
> df_pand.columns = df.columns
> return df_pand
>


Re: [discuss][SQL] Partitioned column type inference proposal

2017-11-14 Thread Reynold Xin
Most of those thoughts from Wenchen make sense to me.


Rather than a list, can we create a table? The X-axis is one data type, the
Y-axis is the other, and each intersection gives the coerced type. Can we
also look at what Hive and standard SQL (Postgres?) do?


Also, this shouldn't be isolated to partition column inference. We should
make sure most of the type coercions are consistent across different
functionalities, with the caveat that we need to preserve backward
compatibility.
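
To make the comparison concrete, here is a tiny sketch of the five pairwise rules Wenchen quotes below, written as a plain function over Spark's DataType (illustration only, not the actual TypeCoercion/PartitioningUtils code; combinations the rules do not cover return None):

import org.apache.spark.sql.types._

def mergeForPartitionColumn(a: DataType, b: DataType): Option[DataType] = (a, b) match {
  case (t1, t2) if t1 == t2                                  => Some(t1)
  case (NullType, t)                                         => Some(t)             // rule 1
  case (t, NullType)                                         => Some(t)
  case (StringType, _) | (_, StringType)                     => Some(StringType)    // rule 2
  case (DoubleType, _) | (_, DoubleType)                     => None                // rule 4: DoubleType only merges with itself
  case (d: DecimalType, IntegerType | LongType)              => Some(d)             // rule 3
  case (IntegerType | LongType, d: DecimalType)              => Some(d)
  case (IntegerType, LongType) | (LongType, IntegerType)     => Some(LongType)
  case (DateType, TimestampType) | (TimestampType, DateType) => Some(TimestampType) // rule 5
  case _                                                     => None                // not covered by the rules above
}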



On Tue, Nov 14, 2017 at 8:33 AM, Wenchen Fan  wrote:

> My 2 cents:
>
> 1. when merging NullType with another type, the result should always be
> that type.
> 2. when merging StringType with another type, the result should always be
> StringType.
> 3. when merging integral types, the priority from high to low:
> DecimalType, LongType, IntegerType. This is because DecimalType is used as a
> big integer when parsing partition column values.
> 4. DoubleType can't be merged with other types, except DoubleType itself.
> 5. when merging TimestampType with DateType, return TimestampType.
>
>
> On Tue, Nov 14, 2017 at 3:54 PM, Hyukjin Kwon  wrote:
>
>> Hi dev,
>>
>> I would like to post a proposal about partitioned column type inference
>> (related with 'spark.sql.sources.partitionColumnTypeInference.enabled'
>> configuration).
>>
>> This thread focuses on the type coercion (finding the common type) in
>> partitioned columns, in particular when different forms of data are
>> inserted for the partition column and then read back with type
>> inference.
>>
>>
>> *Problem:*
>>
>>
>> val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
>> df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
>> spark.read.load("/tmp/foo").printSchema()
>> val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
>> df.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
>> spark.read.load("/tmp/bar").printSchema()
>>
>>
>>
>> It currently returns:
>>
>>
>> root
>>  |-- i: integer (nullable = true)
>>  |-- ts: date (nullable = true)
>>
>> root
>>  |-- i: integer (nullable = true)
>>  |-- decimal: integer (nullable = true)
>>
>>
>> The type coercion does not look well designed yet, and there are currently a
>> few holes, which is not ideal:
>>
>>
>> private val upCastingOrder: Seq[DataType] =
>>   Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
>> ...
>> literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
>>
>>
>>
>> The current way does not deal with types outside of the upCastingOrder;
>> it just returns the first type as the coerced one.
>>
>> This has been under discussion in https://github.com/apache/spark/pull/19389#discussion_r150426911,
>> but I would like to have more feedback from the community as it is possibly a
>> breaking change.
>>
>> For the current releases of Spark (2.2.0 <=), we support the types below
>> for partitioned column schema inference, given my investigation -
>> https://github.com/apache/spark/pull/19389#discussion_r150528207:
>>
>>   NullType
>>   IntegerType
>>   LongType
>>   DoubleType,
>>   *DecimalType(...)
>>   DateType
>>   TimestampType
>>   StringType
>>
>>   *DecimalType only when it's bigger than LongType:
>>
>> I believe this is something we should definitely fix.
>>
>>
>>
>> *Proposal:*
>>
>> I propose the change - https://github.com/apache/spark/pull/19389
>>
>> Simply, it reuses the case 2 specified in
>> https://github.com/apache/spark/blob/6412ea1759d39a2380c572ec24cfd8ae4f2d81f7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L40-L43
>>
>> Please refer to the chart I produced here -
>> https://github.com/apache/spark/pull/19389/files#r150528361. The current
>> proposal will bring the type coercion behaviour changes in the cases below:
>>
>>
>> Input types | Old output type | New output type
>> [NullType, DecimalType(38,0)] | StringType | DecimalType(38,0)
>> [NullType, DateType] | StringType | DateType
>> [NullType, TimestampType] | StringType | TimestampType
>> [IntegerType, DecimalType(38,0)] | IntegerType | DecimalType(38,0)
>> [IntegerType, DateType] | IntegerType | StringType
>> [IntegerType, TimestampType] | IntegerType | StringType
>> [LongType, DecimalType(38,0)] | LongType | DecimalType(38,0)
>> [LongType, DateType] | LongType | StringType
>> [LongType, TimestampType] | LongType | StringType
>> [DoubleType, DateType] | DoubleType | StringType
>> [DoubleType, TimestampType] | DoubleType | StringType
>> [DecimalType(38,0), NullType] | StringType | DecimalType(38,0)
>> [DecimalType(38,0), IntegerType] | IntegerType | DecimalType(38,0)
>> [DecimalType(38,0), LongType] | LongType | DecimalType(38,0)
>> [DecimalType(38,0), DateType] | DecimalType(38,0) | StringType
>> [DecimalType(38,0), TimestampType] | DecimalType(38,0) | StringType
>> [DateType, NullType] | StringType | DateType
>> [DateType, IntegerType] | IntegerType | StringType
>> [DateType, LongType] | LongType | StringType
>> [DateType, 

Re: how to replace hdfs with a custom distributed fs ?

2017-11-11 Thread Reynold Xin
You can implement the Hadoop FileSystem API for your distributed Java filesystem
and just plug it into Spark using the Hadoop API.
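
As a rough sketch of the wiring (the class name and "mydfs" scheme below are made up for illustration): implement org.apache.hadoop.fs.FileSystem for your filesystem, register it under a URI scheme in the Hadoop configuration, and Spark will resolve paths through it like any other filesystem:

// Assuming com.example.MyDfsFileSystem extends org.apache.hadoop.fs.FileSystem
// (overriding open/create/listStatus/getFileStatus/etc.) and is on the classpath,
// register it for the "mydfs://" scheme via spark.hadoop.* configuration.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("custom-fs-example")
  .config("spark.hadoop.fs.mydfs.impl", "com.example.MyDfsFileSystem")
  .getOrCreate()

// Then read/write through the custom scheme like any other path.
val data = spark.read.textFile("mydfs://host:port/path/to/data")
data.show()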


On Sat, Nov 11, 2017 at 9:37 AM, Cristian Lorenzetto <
cristian.lorenze...@gmail.com> wrote:

> Hi, I have my distributed Java filesystem and I would like to implement my
> class for storing data in Spark.
> How do I do this? Is there an example of how to do it?
>


Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-07 Thread Reynold Xin
The vote has passed with the following +1s:

Reynold Xin*
Debasish Das
Noman Khan
Wenchen Fan*
Matei Zaharia*
Weichen Xu
Vaquar Khan
Burak Yavuz
Xiao Li
Tom Graves*
Michael Armbrust*
Joseph Bradley*
Shixiong Zhu*


And the following +0s:

Sean Owen*


Thanks for the feedback!


On Wed, Nov 1, 2017 at 8:37 AM, Reynold Xin <r...@databricks.com> wrote:

> Earlier I sent out a discussion thread for CP in Structured Streaming:
>
> https://issues.apache.org/jira/browse/SPARK-20928
>
> It is meant to be a very small, surgical change to Structured Streaming to
> enable ultra-low latency. This is great timing because we are also
> designing and implementing data source API v2. If designed properly, we can
> have the same data source API working for both streaming and batch.
>
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> +1: Let's go ahead and design / implement the SPIP.
> +0: Don't really care.
> -1: I do not think this is a good idea for the following reasons.
>
>
>


Re: Jenkins upgrade/Test Parallelization & Containerization

2017-11-07 Thread Reynold Xin
My understanding is that AMP can actually provide more resources or adapt to
changes, while ASF needs to manage 200+ projects and it's hard for them to
accommodate much. I could be wrong, though.


On Tue, Nov 7, 2017 at 2:14 PM, Holden Karau  wrote:

> True, I think we've seen that the Amp Lab Jenkins needs to be more focused
> on running AMP Lab projects, and while I don't know how difficult the ASF
> Jenkins is I assume it might be an easier place to make changes going
> forward? (Of course this could be the grass is greener on the other side
> and I don't mean to say it's been hard to make changes on the AMP lab
> hardware, folks have been amazingly helpful - its just the projects on each
> have different needs).
>
> On Tue, Nov 7, 2017 at 12:52 PM, Sean Owen  wrote:
>
>> Faster tests would be great. I recall that the straightforward ways to
>> parallelize via Maven haven't worked because many tests collide with one
>> another. Is this about running each module's tests in a container? that
>> should work.
>>
>> I can see how this is becoming essential for repeatable and reliable
>> Python/R builds, which depend on the environment to a much greater extent
>> than the JVM does.
>>
>> I don't have a strong preference for AMPLab vs ASF builds. I suppose
>> using the ASF machinery is a little tidier. If it's got a later Jenkins
>> that's required, also a plus, but I assume updating AMPLab isn't so hard
>> here either. I think the key issue is which environment is easier to
>> control and customize over time.
>>
>>
>> On Wed, Nov 1, 2017 at 6:05 AM Xin Lu  wrote:
>>
>>> Hi everyone,
>>>
>>> I tried sending emails to this list and I'm not sure if it went through
>>> so I'm trying again.  Anyway, a couple months ago before I left Databricks
>>> I was working on a proof of concept that parallelized Spark tests on
>>> jenkins.  The way it worked was basically it build the spark jars and then
>>> ran all the tests in a docker container on a bunch of slaves in parallel.
>>> This cut the testing time down from 4 hours to approximately 1.5 hours.
>>> This required a newer version of jenkins and the Jenkins Pipeline plugin.
>>> I am wondering if it is possible to do this on amplab jenkins.  It looks
>>> like https://builds.apache.org/ has upgraded, so AMPLab's Jenkins is a
>>> year or so behind.  I am happy to help with this project if it is something
>>> that people think is worthwhile.
>>>
>>> Thanks
>>>
>>> Xin
>>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-06 Thread Reynold Xin
Thanks Tom. I'd imagine more details belong either in a full design doc, or
a PR description. Might make sense to do an additional design doc, if there
is enough delta from the current sketch doc.


On Mon, Nov 6, 2017 at 7:29 AM, Tom Graves <tgraves...@yahoo.com> wrote:

> +1 for the idea and feature, but I think the design is definitely lacking
> detail on the internal changes needed and how the execution pieces work and
> the communication.  Are you planning on posting more of those details or
> were you just planning on discussing in PR?
>
> Tom
>
> On Wednesday, November 1, 2017, 11:29:21 AM CDT, Debasish Das <
> debasish.da...@gmail.com> wrote:
>
>
> +1
>
> Is there any design doc related to API/internal changes? Will CP be the
> default in structured streaming, or is it a mode alongside the
> existing behavior?
>
> Thanks.
> Deb
>
> On Nov 1, 2017 8:37 AM, "Reynold Xin" <r...@databricks.com> wrote:
>
> Earlier I sent out a discussion thread for CP in Structured Streaming:
>
> https://issues.apache.org/ jira/browse/SPARK-20928
> <https://issues.apache.org/jira/browse/SPARK-20928>
>
> It is meant to be a very small, surgical change to Structured Streaming to
> enable ultra-low latency. This is great timing because we are also
> designing and implementing data source API v2. If designed properly, we can
> have the same data source API working for both streaming and batch.
>
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> +1: Let's go ahead and design / implement the SPIP.
> +0: Don't really care.
> -1: I do not think this is a good idea for the following reasons.
>
>
>
>


Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Reynold Xin
Why tie a maintenance release to a feature release? They are supposed to be
independent and we should be able to make a lot of maintenance releases as
needed.

On Thu, Nov 2, 2017 at 7:13 PM Sean Owen  wrote:

> The feature freeze is "mid November" :
> http://spark.apache.org/versioning-policy.html
> Let's say... Nov 15? any body have a better date?
>
> Although it'd be nice to get 2.2.1 out sooner rather than later in any event,
> and it kind of makes sense to get it out first, they need not go in order. It
> just might be distracting to deal with two at once.
>
> (BTW there was still one outstanding issue from the last release:
> https://issues.apache.org/jira/browse/SPARK-22401 )
>
> On Thu, Nov 2, 2017 at 6:06 PM Felix Cheung 
> wrote:
>
>> I think it will be great to set a feature freeze date for 2.3.0 first, as
>> a minor release. There is some new stuff that would be good to have, and
>> then we will likely need time to stabilize, before cutting RCs.
>>
>>

