Re: [SQL] [Suggestion] Add top() to Dataset

2018-02-02 Thread Yacine Mazari
I see, thanks a lot for the clarifications.

Re: DataSourceV2: support for named tables

2018-02-02 Thread Ryan Blue
I don’t have a good answer for that yet. My initial motivation here is mainly to get consensus around this:
- DSv2 should support table names through SQL and the API, and
- it should use the existing classes in the logical plan (i.e., TableIdentifier).
To contrast, I think Wenchen is

Re: DataSourceV2: support for named tables

2018-02-02 Thread Michael Armbrust
I am definitely in favor of first-class / consistent support for tables and data sources. One thing that is not clear to me from this proposal is exactly what the interfaces are between:
- Spark
- a (the?) metastore
- a data source
If we pass in the table identifier, is the data source then

Re: SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-02 Thread Michael Armbrust
> So here are my recommendations for moving forward, with DataSourceV2 as a starting point:
>
> 1. Use well-defined logical plan nodes for all high-level operations: insert, create, CTAS, overwrite table, etc.
> 2. Use rules that match on these high-level plan nodes, so that it

Re: [MLlib] Gaussian Process regression in MLlib

2018-02-02 Thread Simon Dirmeier
Hey, I wanted to see that for a long time, too. :) If you'd plan on implementing this, I could contribute. However, I am not too familiar with variational inference for the GPs which is what you would need I guess. Or do you think it is feasible to compute the full kernel for the GP? Cheers,
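The question above about "computing the full kernel for the GP" refers to the Gram matrix of pairwise kernel evaluations, whose O(n^2) size is exactly why exact GP regression does not scale and why variational approximations come up. As a minimal illustration (pure Python, RBF kernel on 1-D inputs; the function name and `length_scale` parameter are this sketch's own, not an MLlib API):

```python
import math

def rbf_kernel_matrix(xs, length_scale=1.0):
    """Compute the full Gram matrix K[i][j] = exp(-(x_i - x_j)^2 / (2 l^2))
    for 1-D inputs. Storing all n*n entries is O(n^2) memory, which is the
    scaling bottleneck for exact Gaussian Process regression."""
    n = len(xs)
    return [[math.exp(-((xs[i] - xs[j]) ** 2) / (2.0 * length_scale ** 2))
             for j in range(n)]
            for i in range(n)]

K = rbf_kernel_matrix([0.0, 1.0, 2.0])
```

The matrix is symmetric with ones on the diagonal; a distributed implementation would need to either materialize it block-wise or avoid it entirely via an inducing-point / variational approximation.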

DataSourceV2: support for named tables

2018-02-02 Thread Ryan Blue
There are two main ways to load tables in Spark: by name (db.table) and by a path. Unfortunately, the integration for DataSourceV2 has no support for identifying tables by name. I propose supporting the use of TableIdentifier, which is the standard way to pass around table names. The reason I
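Spark's TableIdentifier (in the catalyst package) is essentially a table name plus an optional database, which is what "loading by name (db.table)" resolves to. The following is a rough, hypothetical stand-in in pure Python to illustrate the shape of that identifier; it is not Spark's actual class or parsing logic:

```python
from typing import NamedTuple, Optional

class TableIdentifier(NamedTuple):
    # Illustrative stand-in mirroring the shape of Spark's catalyst
    # TableIdentifier: a table name plus an optional database.
    table: str
    database: Optional[str] = None

def parse_table_name(name: str) -> TableIdentifier:
    """Split 'db.table' into its parts; a bare 'table' has no database."""
    parts = name.split(".")
    if len(parts) == 2:
        return TableIdentifier(table=parts[1], database=parts[0])
    return TableIdentifier(table=parts[0])
```

The point of the proposal is that a name-based DSv2 path would hand the source an identifier like this, rather than only a filesystem path.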

Re: Kryo serialization failed: Buffer overflow : Broadcast Join

2018-02-02 Thread Pralabh Kumar
I am using spark 2.1.0

On Fri, Feb 2, 2018 at 5:08 PM, Pralabh Kumar wrote:
> Hi
>
> I am performing a broadcast join where my small table is 1 GB. I am
> getting the following error.
>
> I am using
>
> org.apache.spark.SparkException:
> . Available: 0, required:

Kryo serialization failed: Buffer overflow : Broadcast Join

2018-02-02 Thread Pralabh Kumar
Hi, I am performing a broadcast join where my small table is 1 GB. I am getting the following error: org.apache.spark.SparkException: . Available: 0, required: 28869232. To avoid this, increase spark.kryoserializer.buffer.max value. I increased the value to
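The error message itself names the relevant setting, `spark.kryoserializer.buffer.max`. A sketch of how it might be raised in `spark-defaults.conf` (the `1g` value is illustrative; the property is capped below 2 GiB, and it must be set before the application starts, not at runtime):

```properties
# spark-defaults.conf -- illustrative values
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  1g
```

The same pair can be passed per job via `--conf` flags to spark-submit.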

Re: data source v2 online meetup

2018-02-02 Thread Jacek Laskowski
Hi Reynold, That in general is a very good idea to get the community engaged (even if most people would just listen / hide in the dark like myself). I know of no other open source project, at ASF or elsewhere, where such an initiative was even tried. Kudos for the idea! Regards, Jacek Laskowski