incorporate the “options” specified for the
> data source into the catalog too?
>
> That may be helpful in some situations (e.g. the JDBC connect string being
> available from the catalog).
>
> *From: *Xiao Li
> *Date: *Monday, December 3, 2018 at 10:44 AM
> *To: *"Thakrar, Jay
> Should the Spark catalog be the common denominator of the other
> catalogs (least featured) or a super-feature catalog?
>
>
>
> *From: *Xiao Li
> *Date: *Saturday, December 1, 2018 at 10:49 PM
> *To: *Ryan Blue
> *Cc: *"u...@spark.apache.org"
> *Subject: *Re: DataSourceV2 co
"spark_catalog1.db3.tab2".
> The catalog will be used for registering all the external data sources,
> various Spark UDFs and so on.
>
> At the same time, we should NOT mix the table-level data sources with
> catalog support. That means, "Cassandra1.db1.tab1", "Kaf
the data source API V2 and catalog APIs are two separate projects.
> Hopefully, you understand my concern. If we really want to mix them
> together, I want to read the design of your multi-catalog support and
> understand more details.
>
> Thanks,
>
> Xiao
>
>
>
>
> R
Hi, Ryan Blue.
>
> I don't think it would be a good idea to add the sql-api module.
> I prefer to add sql-api to sql/core. SQL is just another representation
> of a dataset, so there is no need to add a new module for this. Besides, it
> would be easier to add sql-api in core.
>
we started: we can either expose
the v2 API from the catalyst package, or we can keep the v2 API, logical
plans, and rules in core instead of catalyst.
Anyone want to weigh in with a preference for how to move forward?
rb
--
Ryan Blue
Software Engineer
Netflix
> function, database
> and column is resolved? Do we have nickname, mapping, wrapper?
>
> Or I might miss the design docs you send? Could you post the doc?
>
> Thanks,
>
> Xiao
>
>
>
>
> Ryan Blue 于2018年11月29日周四 下午3:06写道:
>
>> Xiao,
>>
>> Please
t stage, but we need to know how the new
> proposal works. For example, how to plug in a new Hive metastore? How to
> plug in a Glue? How do users implement a new external catalog without
> adding any new data sources? Without knowing more details, it is hard to
> say whether this TableCatalog can
that TableCatalog is compatible with
future decisions and the best path forward is to build incrementally. An
exhaustive design process blocks progress on v2.
On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue wrote:
> Hi everyone,
>
> I just sent out an invite for the next DSv2 commu
and I’ll add you. Thanks!
rb
--
Ryan Blue
Software Engineer
Netflix
a while something would silently
> > > break, because PR builds only check the default. And the jenkins
> > > builds, which are less monitored, would stay broken for a while.
> > >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Ryan Blue
Software Engineer
Netflix
a
couple of tests, it looks like live streams only work within an
organization. In the future, I won’t add a live stream since no one but
people from Netflix can join.
Last, here are the notes:
*Attendees*
Ryan Blue - Netflix
John Zhuge - Netflix
Yuanjian Li - Baidu - Interested in Catalog API
Felix
the meet up.
I'll also plan on joining earlier than I did last time, in case the
meet/hangout needs to be up for people to view the live stream.
rb
On Tue, Nov 13, 2018 at 4:00 PM Ryan Blue wrote:
> Hi everyone,
> I just wanted to send out a reminder that there’s a DSv2 sync tomorrow a
ies of a micro-batch) and maybe then the
>> 'latest' offset is not needed at all.
>>
>> - Arun
>>
>>
>> On Tue, 13 Nov 2018 at 16:01, Ryan Blue
>> wrote:
>>
>>> Hi everyone,
>>> I just wanted to send out a reminder that there’
InputPartition[] parts = stream.planInputPartitions(start)
// returns when needsReconfiguration is true or all tasks finish
runTasks(parts, factory, end)
// the stream's current offset has been updated at the last epoch
}
--
Ryan Blue
Software Engineer
Netflix
Another solution to the decimal case is using the capability API: use a
capability to signal that the table knows about `supports-decimal`. So
before the decimal support check, it would check
`table.isSupported("type-capabilities")`.
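A minimal, self-contained sketch of that negotiation (the `Table` interface and the capability strings below are assumptions for illustration, not the final Spark API):

```java
// Illustrative sketch only: a table advertises string capabilities and Spark
// checks the coarse flag before consulting the fine-grained one.
import java.util.Set;

public class CapabilityDemo {
    public interface Table {
        Set<String> capabilities();
        default boolean isSupported(String capability) {
            return capabilities().contains(capability);
        }
    }

    // Only consult the fine-grained decimal flag when the table knows about
    // type capabilities at all; older sources keep the current behavior.
    public static boolean supportsDecimal(Table t) {
        return !t.isSupported("type-capabilities") || t.isSupported("supports-decimal");
    }

    public static void main(String[] args) {
        Table legacy = () -> Set.of(); // declares no capabilities
        Table modern = () -> Set.of("type-capabilities", "supports-decimal");
        System.out.println(supportsDecimal(legacy)); // true: old behavior kept
        System.out.println(supportsDecimal(modern)); // true: explicitly supported
    }
}
```

The point of the two-level check is backward compatibility: a source that predates the capability never sees a behavior change, because the coarse flag gates the new check.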
On Fri, Nov 9, 2018 at 12:45 PM Ryan B
to
> throw exceptions when they don't support a specific operation.
>
>
> On Fri, Nov 9, 2018 at 11:54 AM Ryan Blue wrote:
>
>> Do you have an example in mind where we might add a capability and break
>> old versions of data sources?
>>
>> These are really for
pporting that property, and thus throwing an
> exception.
>
>
> On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue wrote:
>
>> I'd have two places. First, a class that defines properties supported and
>> identified by Spark, like the SQLConf definitions. Second, in documentation
>>
ting data*
>
> However it does not specify behavior when the table does not exist.
> Does that throw exception or create the table or a NO-OP?
>
> Thanks,
> Shubham
>
--
Ryan Blue
Software Engineer
Netflix
ned?
>
>
> --
> *From:* Ryan Blue
> *Sent:* Thursday, November 8, 2018 2:09 PM
> *To:* Reynold Xin
> *Cc:* Spark Dev List
> *Subject:* Re: DataSourceV2 capability API
>
>
> Yes, we currently use traits that have methods. Something like
ll evolve (e.g. how many different
> capabilities there will be).
>
>
> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue
> wrote:
>
>> Hi everyone,
>>
>> I’d like to propose an addition to DataSourceV2 tables, a capability API.
>> This API would allow Spark t
le. To fix
this problem, I would use a table capability, like
read-missing-columns-as-null.
Any comments on this approach?
rb
--
Ryan Blue
Software Engineer
Netflix
this in Spark community.
>>
>> Thanks,
>>
>> DB Tsai | Siri Open Source Technologies [not a contribution] |
>> Apple, Inc
>>
>>
> --
> Robert Stupp
> @snazy
>
>
--
Ryan Blue
Software Engineer
Netflix
hnologies [not a contribution] |
> Apple, Inc
>
>
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Thanks to everyone that attended the sync! We had some good discussions.
Here are my notes for anyone that missed it or couldn’t join the live
stream. If anyone wants to add to this, please send additional thoughts or
corrections.
*Attendees:*
- Ryan Blue - Netflix - Using v2 to integrate
>>> >> >> What should happen to JIRA tickets still targeting 2.4.0?
>>> >> >> ===
>>> >> >>
>>> >> >> The current list of open tickets targeted at 2.4.0 can be found at:
>>> >> >> https://issues.apache.org/jira/projects/SPARK and search for
>>> "Target Version/s" = 2.4.0
>>> >> >>
>>> >> >> Committers should look at those and triage. Extremely important bug
>>> >> >> fixes, documentation, and API tweaks that impact compatibility
>>> should
>>> >> >> be worked on immediately. Everything else please retarget to an
>>> >> >> appropriate release.
>>> >> >>
>>> >> >> ==
>>> >> >> But my bug isn't fixed?
>>> >> >> ==
>>> >> >>
>>> >> >> In order to make timely releases, we will typically not hold the
>>> >> >> release unless the bug in question is a regression from the
>>> previous
>>> >> >> release. That being said, if there is something which is a
>>> regression
>>> >> >> that has not been correctly targeted please ping me or a committer
>>> to
>>> >> >> help target the issue.
>>> >> >
>>> >> >
>>>
>>
>> --
>>
>
--
Ryan Blue
Software Engineer
Netflix
end up with so many people that we can't
actually get the discussion going. Here's a link to the stream:
https://stream.meet.google.com/stream/6be59d80-04c7-44dc-9042-4f3b597fc8ba
Thanks!
rb
On Thu, Oct 25, 2018 at 1:09 PM Ryan Blue wrote:
> Hi everyone,
>
> There's been some great d
>
> I didn't know I live in the same timezone as you, Wenchen :D.
> Monday or Wednesday at 5PM PDT sounds good to me too FWIW.
>
> 2018년 10월 26일 (금) 오전 8:29, Ryan Blue 님이 작성:
>
>> Good point. How about Monday or Wednesday at 5PM PDT then?
>>
>> Everyone, please repl
day at my side, it will be great if we can
> pick a day from Monday to Thursday.
>
> On Fri, Oct 26, 2018 at 8:08 AM Ryan Blue wrote:
>
>> Since not many people have replied with a time window, how about we aim
>> for 5PM PDT? That should work for Wenchen and most peo
eting is definitely helpful to discuss, move certain effort
>>>>> forward and keep people on the same page. Glad to see this kind of working
>>>>> group happening.
>>>>>
>>>>> On Thu, Oct 25, 2018 at 5:58 PM John Zhuge wrote:
>>>
--
Ryan Blue
Software Engineer
Netflix
though Apache Spark provides the binary distributions, it would be
> great if this succeeds out of the box.
> >
> > Bests,
> > Dongjoon.
> >
>
>
>
--
Ryan Blue
Software Engineer
Netflix
elson, Assaf
>>> wrote:
>>>
>>> Could you add a fuller code example? I tried to reproduce it in my
>>> environment and I am getting just one instance of the reader…
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Assaf
>
inded message, I will probably have more as I continue
> to explore this.
>
> Thanks,
>Assaf.
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>
--
Ryan Blue
Software Engineer
Netflix
that converts from the parsed SQL plan to CatalogTable-based v1
plans. It is also cleaner to have the logic for converting to CatalogTable
in DataSourceAnalysis instead of in the parser itself.
Are there objections to this approach for integrating v2 plans?
--
Ryan Blue
Software Engineer
Netflix
add Hive compatible syntax later.
>
> On Tue, Oct 2, 2018 at 11:50 PM Ryan Blue
> wrote:
>
>> I'd say that it was important to be compatible with Hive in the past, but
>> that's becoming less important over time. Spark is well established with
>> Hadoop users and I think the f
.
>
> I am personally following this PR with a lot of interest, thanks for all
> the work along this direction.
>
> Best regards,
> Alessandro
>
> On Mon, 1 Oct 2018 at 20:21, Ryan Blue wrote:
>
>> What do you mean by consistent with the syntax in SqlBase.g4? These
>>
lowing the Hive DDL syntax:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
>
> Ryan Blue 于2018年9月28日周五 下午3:47写道:
>
>> Hi everyone,
>>
>> I’m currently working on new table DDL statements for v2 tables. F
>
>
--
Ryan Blue
Software Engineer
Netflix
if you have suggestions based on a different
SQL engine or want this syntax to be different for another reason. Thanks!
rb
--
Ryan Blue
Software Engineer
Netflix
ttp://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>
--
Ryan Blue
Software Engineer
Netflix
ion
> items as well.
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>
--
Ryan Blue
Software Engineer
Netflix
> Hi, Ryan.
>
> Could you share the result on 2.3.1 since this is 2.3.2 RC? That would be
> helpful to narrow down the scope.
>
> Bests,
> Dongjoon.
>
> On Thu, Sep 20, 2018 at 11:56 Ryan Blue wrote:
>
>> -0
>>
>> My DataSourceV2 implementation for Iceberg is f
>
>>>>>>>> > If you are a Spark user, you can help us test this release by
>>>>>>>> taking
>>>>>>>> > an existing Spark workload and running on this release candidate,
>>>>>>>> then
>>>>>>>> > reporting any regressions.
>>>>>>>> >
>>>>>>>> > If you're working in PySpark you can set up a virtual env and
>>>>>>>> install
>>>>>>>> > the current RC and see if anything important breaks, in the
>>>>>>>> Java/Scala
>>>>>>>> > you can add the staging repository to your projects resolvers and
>>>>>>>> test
>>>>>>>> > with the RC (make sure to clean up the artifact cache
>>>>>>>> before/after so
>>>>>>>> > you don't end up building with an out-of-date RC going forward).
>>>>>>>> >
>>>>>>>> > ===
>>>>>>>> > What should happen to JIRA tickets still targeting 2.3.2?
>>>>>>>> > ===
>>>>>>>> >
>>>>>>>> > The current list of open tickets targeted at 2.3.2 can be found
>>>>>>>> at:
>>>>>>>> > https://issues.apache.org/jira/projects/SPARK and search for
>>>>>>>> "Target Version/s" = 2.3.2
>>>>>>>> >
>>>>>>>> > Committers should look at those and triage. Extremely important
>>>>>>>> bug
>>>>>>>> > fixes, documentation, and API tweaks that impact compatibility
>>>>>>>> should
>>>>>>>> > be worked on immediately. Everything else please retarget to an
>>>>>>>> > appropriate release.
>>>>>>>> >
>>>>>>>> > ==
>>>>>>>> > But my bug isn't fixed?
>>>>>>>> > ==
>>>>>>>> >
>>>>>>>> > In order to make timely releases, we will typically not hold the
>>>>>>>> > release unless the bug in question is a regression from the
>>>>>>>> previous
>>>>>>>> > release. That being said, if there is something which is a
>>>>>>>> regression
>>>>>>>> > that has not been correctly targeted please ping me or a
>>>>>>>> committer to
>>>>>>>> > help target the issue.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>> --
>>> John
>>>
>>
--
Ryan Blue
Software Engineer
Netflix
partition loading in Hive and Oracle.
>
>
>
> So in short, I agree that partition management should be an optional
> interface.
>
>
>
> *From: *Ryan Blue
> *Reply-To: *"rb...@netflix.com"
> *Date: *Wednesday, September 19, 2018 at 2:58 PM
>
>
> I ask because those are the most widely used data sources and have a lot
> of effort and thinking behind them, and if they have ported over to V2,
> then they can serve as excellent production examples of V2 API.
>
>
>
> Thanks,
>
> Jayesh
>
>
>
> *F
er
>> +- Project [Mort AS Mort#7, 1000 AS 1000#8]
>>+- OneRowRelation
>>
>> My DefaultSource V2 implementation extends DataSourceV2 with ReadSupport
>> with ReadSupportWithSchema with WriteSupport
>>
>> I'm wondering if there is something I'm not implementing, or if there is
>> a bug in my implementation or its an issue with Spark?
>>
>> Any pointers would be great,
>>
>> Ross
>>
>
--
Ryan Blue
Software Engineer
Netflix
m generically in the API,
> allowing pass-through commands to manipulate them, or by some other
> means.
>
> Regards,
> Dale.
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
>> namespace that needs it.
>>
>> If the data source requires TLS support then we also need to support
>> passing
>> all the configuration values under "spark.ssl.*"
>>
>> What do people think? Placeholder Issue has been added at SPARK-25329.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>>
>>
--
Ryan Blue
Software Engineer
Netflix
>>>>>> receives "prepared" from all the tasks, a "commit" would be invoked at
>>>>>> each
>>>>>> of the individual tasks). Right now the responsibility of the final
>>>>>> "commit" is with the driver and it may not always be possible for the
>>>>>> driver to take over the transact
and is
>> discoverable - thereby breaking the documented contract.
>>
>> I was wondering how other databases systems plan to implement this API
>> and meet the contract as per the Javadoc?
>>
>> Many thanks
>>
>> Ross
>>
>
--
Ryan Blue
Software Engineer
Netflix
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
--
Ryan Blue
Software Engineer
Netflix
8 at 3:02 PM Hyukjin Kwon wrote:
>
>> BTW, do we hold Datasource V2 related PRs for now until we finish this
>> refactoring just for clarification?
>>
>> 2018년 9월 7일 (금) 오전 12:52, Ryan Blue 님이 작성:
>>
>>> Wenchen,
>>>
>>> I'm not really su
> state.
>
>
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue
> wrote:
>
>> It would be great to get more features out incrementally. For
>> experimental features, do we have more relaxed constraints?
>>
>> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin wrote:
>
> trait Table {
> LogicalWrite newAppendWrite();
>
> LogicalWrite newDeleteWrite(deleteExprs);
> }
>
>
> It looks to me that the API is simpler without WriteConfig, what do you
> think?
>
> Thanks,
> Wenchen
>
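Wenchen's WriteConfig-free shape can be sketched in a self-contained way; the `LogicalWrite` and `Table` types below are mocks standing in for the proposed API, not Spark's real classes:

```java
// Sketch of a write API where the table hands back a fully configured write,
// with no intermediate WriteConfig object to thread through the planner.
import java.util.List;

public class WriteApiSketch {
    public interface LogicalWrite {
        String describe();
    }

    public interface Table {
        String name();

        // Append and delete are separate factory methods on the table itself.
        default LogicalWrite newAppendWrite() {
            return () -> "append to " + name();
        }

        default LogicalWrite newDeleteWrite(List<String> deleteExprs) {
            return () -> "delete from " + name()
                + " where " + String.join(" AND ", deleteExprs);
        }
    }

    public static void main(String[] args) {
        Table t = () -> "events";
        System.out.println(t.newAppendWrite().describe());
        System.out.println(t.newDeleteWrite(List.of("day < '2018-09-01'")).describe());
    }
}
```

The design trade-off under discussion is visible here: without a WriteConfig, all configuration must be captured when the write object is created, which is simpler but leaves no shared place for options common to several write modes.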
> On Wed, Sep 5, 2018 at 4:24 AM Ryan Blue
> wrote:
also jettison a lot of
>> older dependencies, code, fix some long standing issues, etc.
>>
>> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>>
>> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue
>> wrote:
>>
>>> My concern is that the v2 data
a major version update to get it?
>
> I generally support moving on to 3.x so we can also jettison a lot of
> older dependencies, code, fix some long standing issues, etc.
>
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>
> On Thu, Sep 6, 2018 at 9:10 AM
dates for
>>>>>> consideration):
>>>>>> >>>
>>>>>> >>> 1. Support Scala 2.12.
>>>>>> >>>
>>>>>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel)
>>>>>> deprecated in Spark 2.x.
>>>>>> >>>
>>>>>> >>> 3. Shade all dependencies.
>>>>>> >>>
>>>>>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>>>>>> compliant, to prevent users from shooting themselves in the foot, e.g.
>>>>>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>>>>>> less painful for users to upgrade here, I’d suggest creating a flag for
>>>>>> backward compatibility mode.
>>>>>> >>>
>>>>>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL
>>>>>> more standard compliant, and have a flag for backward compatibility.
>>>>>> >>>
>>>>>> >>> 6. Miscellaneous other small changes documented in JIRA already
>>>>>> (e.g. “JavaPairRDD flatMapValues requires function returning Iterable,
>>>>>> not
>>>>>> Iterator”, “Prevent column name duplication in temporary view”).
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Now the reality of a major version bump is that the world often
>>>>>> thinks in terms of what exciting features are coming. I do think there
>>>>>> are
>>>>>> a number of major changes happening already that can be part of the 3.0
>>>>>> release, if they make it in:
>>>>>> >>>
>>>>>> >>> 1. Scala 2.12 support (listing it twice)
>>>>>> >>> 2. Continuous Processing non-experimental
>>>>>> >>> 3. Kubernetes support non-experimental
>>>>>> >>> 4. A more fleshed out version of data source API v2 (I don’t
>>>>>> think it is realistic to stabilize that in one release)
>>>>>> >>> 5. Hadoop 3.0 support
>>>>>> >>> 6. ...
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Similar to the 2.0 discussion, this thread should focus on the
>>>>>> framework and whether it’d make sense to create Spark 3.0 as the next
>>>>>> release, rather than the individual feature requests. Those are important
>>>>>> but are best done in their own separate threads.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>> +1 -224-436-0783
>>>> Greater Chicago
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>> +1 -224-436-0783
>>> Greater Chicago
>>>
>>
>>
>>
>> --
>> Regards,
>> Vaquar Khan
>> +1 -224-436-0783
>> Greater Chicago
>>
>
--
Ryan Blue
Software Engineer
Netflix
1 gives Spark the opportunity
> to enforce column references are valid (but not the actual function names),
> whereas option 2 would be up to the data sources to validate.
>
>
>
> On Wed, Aug 15, 2018 at 2:27 PM Ryan Blue wrote:
>
>> I think I found a good solution to th
Latest from Wenchen in case it was dropped.
-- Forwarded message -
From: Wenchen Fan
Date: Mon, Sep 3, 2018 at 6:16 AM
Subject: Re: data source api v2 refactoring
To:
Cc: Ryan Blue , Reynold Xin , <
dev@spark.apache.org>
Hi Mridul,
I'm not sure what's going on, my
with ScanConfig.
> For streaming source, stream is the one to take care of the pushdown
> result. For batch source, it's the scan.
>
> It's a little tricky because stream is an abstraction for streaming source
> only. Better ideas are welcome!
>
> On Sat, Sep 1, 2018 at 7:26 AM Ry
th the
> above?
>
> At a high level, I think the Heilmeier's Catechism emphasizes less about
> the "how", and more the "why" and "what", which is what I'd argue SPIPs
> should be about. The hows should be left in design docs for larger projects.
>
>
>
--
Ryan Blue
Software Engineer
Netflix
the above:
>>
>> 1. Creates an explicit Table abstraction, and an explicit Scan
>> abstraction.
>>
>> 2. Have an explicit Stream level and makes it clear pushdowns and options
>> are handled there, rather than at the individual scan (ReadSupport) level.
>> Data source implementations don't need to worry about pushdowns or options
>> changing mid-stream. For batch, those happen when the scan object is
>> created.
>>
>>
>>
>> This email is just a high level sketch. I've asked Wenchen to prototype
>> this, to see if it is actually feasible and the degree of hacks it removes,
>> or creates.
>>
>>
>>
--
Ryan Blue
Software Engineer
Netflix
entation
>
> Thanks for your time,
> Russ
>
> On Mon, Aug 20, 2018 at 11:33 AM Ryan Blue
> wrote:
>
>> Thanks for posting this discussion to the dev list, it would be great to
>> hear what everyone thinks about the idea that USING should be a
>> catalog-specific
e should be supported anyway, I was
> thinking we could just orthogonally proceed. If you guys think other issues
> should be resolved first, I think we (at least I will) should take a look
> at the set of catalog APIs.
>
>
--
Ryan Blue
Software Engineer
Netflix
I think I found a good solution to the problem of using Expression in the
TableCatalog API and in the DeleteSupport API.
For DeleteSupport, there is already a stable and public subset of
Expression named Filter that can be used to pass filters. The reason why
DeleteSupport would use Expression is
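A minimal sketch of that idea follows; the `Filter`, `EqualTo`, and `GreaterThan` classes below are simplified stand-ins for the public filter classes, and the in-memory table is purely hypothetical:

```java
// Illustrative sketch: a delete API that accepts stable, public filter
// objects instead of internal Expression instances.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class DeleteSupportSketch {
    public interface Filter {
        boolean matches(Map<String, Integer> row);
    }

    public static class EqualTo implements Filter {
        final String attribute;
        final int value;
        public EqualTo(String attribute, int value) {
            this.attribute = attribute;
            this.value = value;
        }
        public boolean matches(Map<String, Integer> row) {
            return row.containsKey(attribute) && row.get(attribute) == value;
        }
    }

    public static class GreaterThan implements Filter {
        final String attribute;
        final int value;
        public GreaterThan(String attribute, int value) {
            this.attribute = attribute;
            this.value = value;
        }
        public boolean matches(Map<String, Integer> row) {
            return row.containsKey(attribute) && row.get(attribute) > value;
        }
    }

    // A source with delete support removes the rows matched by ALL filters;
    // a request it cannot express should be rejected up front.
    public static class InMemoryTable {
        public final List<Map<String, Integer>> rows;
        public InMemoryTable(List<Map<String, Integer>> initial) {
            rows = new ArrayList<>(initial);
        }
        public void deleteWhere(Filter[] filters) {
            rows.removeIf(row -> Arrays.stream(filters).allMatch(f -> f.matches(row)));
        }
    }

    public static void main(String[] args) {
        InMemoryTable t = new InMemoryTable(
            List.of(Map.of("id", 1), Map.of("id", 2), Map.of("id", 3)));
        t.deleteWhere(new Filter[] { new GreaterThan("id", 1) });
        System.out.println(t.rows); // [{id=1}]
    }
}
```

Passing a closed set of filter classes rather than arbitrary expressions is what keeps the contract stable: sources only ever see shapes they can enumerate.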
PI, similar to what we did for
> dsv1.
>
> If we are depending on Expressions on the more common APIs in dsv2
> already, we should revisit that.
>
>
>
>
> On Mon, Aug 13, 2018 at 1:59 PM Ryan Blue wrote:
>
>> Reynold, did you get a chance to look at my response about
. Anyone else want to raise
an issue with the proposal, or is it about time to bring up a vote thread?
rb
On Thu, Jul 26, 2018 at 5:00 PM Ryan Blue wrote:
> I don’t think that we want to block this work until we have a public and
> stable Expression. Like our decision to expose Internal
rce can provide catalog
> functionalities.
>
> Under the hood, I feel this proposal is very similar to my second
> proposal, except that a catalog implementation must provide a default data
> source/storage, and different rule for looking up tables.
>
>
> On Sun, Jul 29,
Wenchen, what I'm suggesting is a bit of both of your proposals.
I think that USING should be optional like your first option. USING (or
format(...) in the DF side) should configure the source or implementation,
while the catalog should be part of the table identifier. They serve two
different
ersions.
>
>
> On Tue, Jul 24, 2018 at 9:26 AM Ryan Blue
> wrote:
>
>> The recently adopted SPIP to standardize logical plans requires a way for
>> to plug in providers for table metadata operations, so that the new plans
>> can create and drop tables. I proposed an A
Quick update: I've updated my PR to add the table catalog API to implement
this proposal. Here's the PR: https://github.com/apache/spark/pull/21306
On Mon, Jul 23, 2018 at 5:01 PM Ryan Blue wrote:
> Lately, I’ve been working on implementing the new SQL logical plans. I’m
> currently b
IP is for the APIs and does not cover how multiple catalogs would be
exposed. I started a separate discussion thread on how to access multiple
catalogs and maintain compatibility with Spark’s current behavior (how to
get the catalog instance in the above example).
Please use this thread to discuss the proposed APIs. Thanks, everyone!
rb
--
Ryan Blue
Software Engineer
Netflix
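As a toy illustration of the multi-catalog lookup discussed above (the resolution rules here are assumptions for the example, not Spark's actual behavior):

```java
// Hypothetical multi-part identifier resolution: the first part selects a
// catalog when it names a registered catalog, otherwise the session's
// current catalog is assumed.
import java.util.List;
import java.util.Set;

public class IdentifierSketch {
    public static String[] resolve(List<String> parts, Set<String> catalogs, String current) {
        if (parts.size() == 3 && catalogs.contains(parts.get(0))) {
            return new String[] {parts.get(0), parts.get(1), parts.get(2)};
        }
        if (parts.size() == 2) {
            return new String[] {current, parts.get(0), parts.get(1)};
        }
        if (parts.size() == 1) {
            // Single-part names fall back to a default database.
            return new String[] {current, "default", parts.get(0)};
        }
        throw new IllegalArgumentException("cannot resolve: " + String.join(".", parts));
    }

    public static void main(String[] args) {
        String[] id = resolve(List.of("prod", "db1", "t1"), Set.of("prod"), "spark_catalog");
        System.out.println(String.join(".", id)); // prod.db1.t1
    }
}
```

This is the compatibility question raised in the thread: two-part names must keep resolving against the current catalog so existing queries do not change meaning when new catalogs are registered.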
continue to use
the property to determine the table’s data source or format implementation.
Other table catalog implementations would be free to interpret the format
string as they choose or to use it to choose a data source implementation
as in the default catalog.
rb
--
Ryan Blue
Software Engineer
Netflix
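The provider-property lookup described above can be sketched as follows; the property name, the fallback format, and the implementation names are all assumptions for the example, not the actual default-catalog code:

```java
// Hypothetical sketch: a catalog stores a "provider" table property and maps
// it to a source implementation, the way the default catalog interprets the
// USING/format string. Other catalogs may interpret the string differently.
import java.util.Map;

public class ProviderLookupSketch {
    public static String implFor(Map<String, String> tableProperties,
                                 Map<String, String> knownSources) {
        // Fall back to a default format when the table declares no provider.
        String provider = tableProperties.getOrDefault("provider", "parquet");
        String impl = knownSources.get(provider);
        if (impl == null) {
            throw new IllegalArgumentException("unknown provider: " + provider);
        }
        return impl;
    }

    public static void main(String[] args) {
        Map<String, String> known = Map.of(
            "parquet", "ParquetFileFormat",
            "jdbc", "JdbcRelationProvider");
        System.out.println(implFor(Map.of("provider", "jdbc"), known)); // JdbcRelationProvider
    }
}
```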
This vote passes with 4 binding +1s and 9 community +1s.
Thanks for taking the time to vote, everyone!
Binding votes:
Wenchen Fan
Xiao Li
Reynold Xin
Felix Cheung
Non-binding votes:
Ryan Blue
John Zhuge
Takeshi Yamamuro
Marco Gaido
Russel Spitzer
Alessandro Solimando
Henry Robinson
Dongjoon
>>>>>>> Note. RC2 was cancelled because of one blocking issue SPARK-24781
>>>>>>> during release preparation.
>>>>>>>
>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
--
Ryan Blue
Software Engineer
Netflix
+1 (not binding)
On Tue, Jul 17, 2018 at 10:59 AM Ryan Blue wrote:
> Hi everyone,
>
> From discussion on the proposal doc and the discussion thread, I think we
> have consensus around the plan to standardize logical write operations for
> DataSourceV2. I would like
ple can
> jump
> > in during the development. I'm interested in the new API and like to
> work on
> > it after the vote passes.
> >
> > Thanks,
> > Wenchen
> >
> > On Fri, Jul 13, 2018 at 7:25 AM Ryan Blue wrote:
> >>
> >> Thanks! I'm a
[+1]: Spark should adopt the SPIP
[-1]: Spark should not adopt the SPIP because . . .
Thanks for voting, everyone!
--
Ryan Blue
> Thanks,
> Wenchen
>
> On Fri, Apr 20, 2018 at 5:01 AM Ryan Blue
> wrote:
>
>> Hi everyone,
>>
>> A few weeks ago, I wrote up a proposal to standardize SQL logical plans
>> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?
>>>>>>> stream-stream
>>>>>> > join. Users can hit this bug if one of the join side is partitioned
>>>>>> by a
>>>>>> > subset of the join keys.
>>>>>> >
>>>>>> > SPARK-24552: Task at
level of parallelism (searching for a given object id when sorted
> by time needs to scan all/more the groups for larger times).
> One question here - is Parquet reader reading & decoding the projection
> columns even if the predicate columns should filter the record out?
>
> Unfortunatel
e know if there is anybody currently working on it
> or maybe you have it in a roadmap for the future?
> Or maybe you could give me some suggestions how to avoid / resolve this
> problem? I’m using Spark 2.2.1.
>
> Best regards,
> Jakub Wozniak
>
>
>
>
>
> nodes and see how many jobs
> was triggered and number of tasks and their duration. Now it's hard to
> debug it, especially for newbies.
>
> Pozdrawiam / Best regards,
> Tomek Gawęda
>
>
> On 2018-05-10 18:31, Ryan Blue wrote:
>
> > it would be fantastic if
(by
> >> replying here or updating the bug in Jira), otherwise I'm volunteering
> >> to prepare the first RC soon-ish (around the weekend).
> >>
> >> Thanks!
> >>
> >>
> >> --
> >> Marcelo
> >>
e weekend).
>
> Thanks!
>
>
> --
> Marcelo
>
>
>
--
Ryan Blue
Software Engineer
Netflix
end that triggered the
>>> error.
>>>
>>> I don’t know how feasible this is, but addressing it would directly
>>> solve the issue of linking failures to the responsible transformation, as
>>> opposed to leaving the user to break up a chain of trans
ent types of rows, so we forced
> the conversion at input.
>
> Can't your "wish" be satisfied by having the public API producing the
> internals of UnsafeRow (without actually exposing UnsafeRow)?
>
>
> On Tue, May 8, 2018 at 4:16 PM Ryan Blue <rb...@netflix.com>
, Reynold Xin <r...@databricks.com> wrote:
> What the internal operators do are strictly internal. To take one step
> back, is the goal to design an API so the consumers of the API can directly
> produces what Spark expects internally, to cut down perf cost?
>
>
> On Tue, May 8, 2
I've opened SPARK-24215 to track this.
On Tue, May 8, 2018 at 3:58 PM, Reynold Xin <r...@databricks.com> wrote:
> Yup. Sounds great. This is something simple Spark can do and provide huge
> value to the end users.
>
>
> On Tue, May 8, 2018 at 3:53 PM Ryan Blue <
are tools/ways to force the
>>> execution, helping in the debugging phase. So they can achieve without a
>>> big effort the same result, but with a big difference: they are aware of
>>> what is really happening, which may help them later.
>>>
>>> Thanks
InternalRow, then there is an
easy performance win that also simplifies the v2 data source API.
rb
--
Ryan Blue
Software Engineer
Netflix
also struggle in similar
>> ways as these students. While eager execution is really not practical in
>> big data, in learning environments or in development against small, sampled
>> datasets it can be pretty helpful.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
--
Ryan Blue
Software Engineer
Netflix
or shouldn't
> come. Let me know if this understanding is correct
>
> On Tue, May 1, 2018 at 9:37 PM, Ryan Blue <rb...@netflix.com> wrote:
>
>> This is usually caused by skew. Sometimes you can work around it by
>> increasing the number of partitions like you tri
re to Spark users. So a design that we can't migrate
> file sources to without a side channel would be worrying; won't we end up
> regressing to the same situation?
>
> On Mon, Apr 30, 2018 at 11:59 AM, Ryan Blue <rb...@netflix.com> wrote:
>
>> Should we really plan the API fo
kFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419)
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349)
>
>
--
Ryan Blue
Software Engineer
Netflix
has ever read. In order
> to parse this efficiently, the stream connector needs detailed control over
> how it's stored; the current implementation even has complex
> compaction and retention logic.
>
>
> On Mon, Apr 30, 2018 at 10:48 AM, Ryan Blue <rb...@netflix.com> wrot
t;>
>> After enabling checkpointing, I do see a folder being created under the
>> checkpoint folder, but there's nothing else in there.
>>
>>
>>
>> Same question for write-ahead and recovery?
>>
>> And on a restart from a failed streaming session - who should set the
>> offsets?
>>
>> The driver/Spark or the datasource?
>>
>>
>>
>> Any pointers to design docs would also be greatly appreciated.
>>
>>
>>
>> Thanks,
>>
>> Jayesh
>>
>>
>>
>
>
--
Ryan Blue
Software Engineer
Netflix
ache.org/jira/browse/SPARK-18455>, but it's not clear
> to me whether they are "design-appropriate" for the DataFrame API.
>
> Are correlated subqueries a thing we can expect to have in the DataFrame
> API?
>
> Nick
>
>
--
Ryan Blue
Software Engineer
Netflix