Re: Dataset Set Operations

2016-05-24 Thread Michael Armbrust
What is the schema of the case class? On Tue, May 24, 2016 at 3:46 PM, Tim Gautier wrote: > Hello All, > > I've been trying to subtract one dataset from another. Both datasets > contain case classes of the same type. When I subtract B from A, I end up > with a copy of A

Re: Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Michael Armbrust
Can you open a JIRA? On Sun, May 22, 2016 at 2:50 PM, Amit Sela wrote: > I've been using Encoders with Kryo to support encoding of generically > typed Java classes, mostly with success, in the following manner: > > public static Encoder encoder() { > return

Re: Dataset API and avro type

2016-05-23 Thread Michael Armbrust
AccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) > > 2016-05-22 22:02 GMT+02:00 Michael Armbrust <mich...@databricks.com>: > >

Re: Dataset API and avro type

2016-05-20 Thread Michael Armbrust
What is the error? I would definitely expect it to work with kryo at least. On Fri, May 20, 2016 at 2:37 AM, Han JU wrote: > Hello, > > I'm looking at the Dataset API in 1.6.1 and also in upcoming 2.0. However > it does not seems to work with Avro data types: > > >

Re: Wide Datasets (v1.6.1)

2016-05-20 Thread Michael Armbrust
> > I can provide an example/open a Jira if there is a chance this will be > fixed. > Please do! Ping me on it. Michael

Re: [Spark 2.0 state store] Streaming wordcount using spark state store

2016-05-18 Thread Michael Armbrust
The state store for structured streaming is an internal concept, and isn't designed to be consumed by end users. I'm hoping to write some documentation on how to do aggregation, but support for reading from Kafka and other sources will likely come in Spark 2.1+ On Wed, May 18, 2016 at 3:50 AM,

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Michael Armbrust
In 2.0 you won't be able to do this. The long term vision would be to make this possible, but a window will be required (like the 24 hours you suggest). On Tue, May 17, 2016 at 1:36 AM, Todd wrote: > Hi, > We have a requirement to do count(distinct) in a processing batch

Re: Inferring schema from GenericRowWithSchema

2016-05-17 Thread Michael Armbrust
I don't think that you will be able to do that. ScalaReflection is based on the TypeTag of the object, and thus the schema of any particular object won't be available to it. Instead I think you want to use the register functions in UDFRegistration that take a schema. Does that make sense? On

Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Michael Armbrust
> > > logical plan after optimizer execution: > > Project [id#0L,id#1L] > !+- Filter (id#0L = cast(1 as bigint)) > ! +- Join Inner, Some((id#0L = id#1L)) > ! :- Subquery t > ! : +- Relation[id#0L] JSONRelation > ! +- Subquery u > ! +- Relation[id#1L] JSONRelation >

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Michael Armbrust
That is a forward looking design doc and not all of it has been implemented yet. With Spark 2.0 the main sources and sinks will be file based, though we hope to quickly expand that now that a lot of infrastructure is in place. On Fri, May 6, 2016 at 2:11 PM, Ted Yu wrote:

Re: Accessing JSON array in Spark SQL

2016-05-05 Thread Michael Armbrust
use df.selectExpr to evaluate complex expression (instead of just column names). On Thu, May 5, 2016 at 11:53 AM, Xinh Huynh wrote: > Hi, > > I am having trouble accessing an array element in JSON data with a > dataframe. Here is the schema: > > val json1 = """{"f1":"1",

Re: How can I bucketize / group a DataFrame from parquet files?

2016-04-27 Thread Michael Armbrust
Unfortunately, I don't think there is an easy way to do this in 1.6. In Spark 2.0 we will make DataFrame = Dataset[Row], so this should work out of the box. On Mon, Apr 25, 2016 at 11:08 PM, Brandon White wrote: > I am creating a dataFrame from parquet files. The

Re: XML Data Source for Spark

2016-04-25 Thread Michael Armbrust
You are using a version of the library that was compiled for a different version of Scala than the version of Spark that you are using. Make sure that they match up. On Mon, Apr 25, 2016 at 5:19 PM, Mohamed ismail wrote: > here is an example with code. >

Re: Defining case class within main method throws "No TypeTag available for Accounts"

2016-04-25 Thread Michael Armbrust
When you define a class inside of a method, it implicitly has a pointer to the outer scope of the method. Spark doesn't have access to this scope, so this makes it hard (impossible?) for us to construct new instances of that class. So, define your classes that you plan to use with Spark at the

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-25 Thread Michael Armbrust
Spark SQL's query planner has always delayed building the RDD, so has never needed to eagerly calculate the range boundaries (since Spark 1.0). On Mon, Apr 25, 2016 at 2:04 AM, Praveen Devarao wrote: > Thanks Reynold for the reason as to why sortBykey invokes a Job > >

Re: Dataset aggregateByKey equivalent

2016-04-23 Thread Michael Armbrust
Have you looked at aggregators? https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html On Fri, Apr 22, 2016 at 6:45 PM, Lee Becker wrote: > Is there a way to do aggregateByKey on Datasets the way one can on an RDD? > > Consider the
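
For illustration, a minimal sketch of the Aggregator pattern under the Spark 1.6 typed API (the object, sample data and names below are invented, and a sqlContext with its implicits is assumed to be in scope; in 2.0 an Aggregator also needs encoder members):

    import org.apache.spark.sql.expressions.Aggregator
    import sqlContext.implicits._

    // sums the Int half of (String, Int) pairs within each group
    object SumValue extends Aggregator[(String, Int), Int, Int] {
      def zero: Int = 0
      def reduce(b: Int, a: (String, Int)): Int = b + a._2
      def merge(b1: Int, b2: Int): Int = b1 + b2
      def finish(b: Int): Int = b
    }

    val ds = Seq(("a", 1), ("a", 2), ("b", 5)).toDS()
    val sums = ds.groupBy(_._1).agg(SumValue.toColumn)   // Dataset[(String, Int)]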

Re: prefix column Spark

2016-04-19 Thread Michael Armbrust
A few comments: - Each withColumnRenamed is adding a new level to the logical plan. We have optimized this significantly in newer versions of Spark, but it is still not free. - Transforming to an RDD is going to do fairly expensive conversion back and forth between the internal binary format. -

Re: Will nested field performance improve?

2016-04-15 Thread Michael Armbrust
> > If we expect fields nested in structs to always be much slower than flat > fields, then I would be keen to address that in our ETL pipeline with a > flattening step. If it's a known issue that we expect will be fixed in > upcoming releases, I'll hold off. > The difference might be even larger

Re: Strange bug: Filter problem with parenthesis

2016-04-13 Thread Michael Armbrust
You need to use `backticks` to reference columns that have non-standard characters. On Wed, Apr 13, 2016 at 6:56 AM, wrote: > Hi, > > I am debugging a program, and for some reason, a line calling the > following is failing: > > df.filter("sum(OpenAccounts) >
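
A minimal sketch of the backtick workaround (the sample data and sqlContext here are assumptions for illustration, not from the original thread):

    import sqlContext.implicits._

    // a column whose name contains parentheses must be quoted with backticks
    val df = Seq((1, 10), (2, 3)).toDF("id", "sum(OpenAccounts)")
    df.filter("`sum(OpenAccounts)` > 5").show()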

Re: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Armbrust
; How would you connect to Hive for some data and then reach out to lets say > Oracle or DB2 for some other data that you may want but isn’t available on > your cluster? > > > On Apr 12, 2016, at 10:52 AM, Michael Armbrust <mich...@databricks.com> > wrote: > > You can, bu

Re: Aggregator support in DataFrame

2016-04-12 Thread Michael Armbrust
). > > On Mon, Apr 11, 2016 at 10:53 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> saw that, dont think it solves it. i basically want to add some children >> to the expression i guess, to indicate what i am operating on? not sure if >> even makes sense >>

Re: ordering over structs

2016-04-12 Thread Michael Armbrust
> .groupBy("customer_id")\ > .agg(min("vs").alias("final"))\ > .select("customer_id", "final.dt", "final.product") > df.head() > > My log from the non-cached run: > http://pastebin.com/F88sSv1B > > Log fro

Re: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Armbrust
You can, but I'm not sure why you would want to. If you want to isolate different users just use hiveContext.newSession(). On Tue, Apr 12, 2016 at 1:48 AM, Natu Lauchande wrote: > Hi, > > Is it possible to have both a sqlContext and a hiveContext in the same > application

Re: Aggregator support in DataFrame

2016-04-11 Thread Michael Armbrust
I'll note this interface has changed recently: https://github.com/apache/spark/commit/520dde48d0d52de1710a3275fdd5355dd69d I'm not sure that solves your problem though... On Mon, Apr 11, 2016 at 4:45 PM, Koert Kuipers wrote: > i like the Aggregator a lot

Re: ordering over structs

2016-04-08 Thread Michael Armbrust
or saying: > > AttributeError: 'StructType' object has no attribute 'alias' > > Can I do this without aliasing the struct? Or am I doing something > incorrectly? > > > regards, > > imran > > On Wed, Apr 6, 2016 at 4:16 PM, Michael Armbrust

Re: ordering over structs

2016-04-06 Thread Michael Armbrust
> > Ordering for a struct goes in order of the fields. So the max struct is > the one with the highest TotalValue (and then the highest category > if there are multiple entries with the same hour and total value). > > Is this due to "InterpretedOrdering" in StructType? > That is one

Re: ordering over structs

2016-04-06 Thread Michael Armbrust
> > 1) Is a struct in Spark like a struct in C++? > Kinda. It's an ordered collection of data with known names/types. > 2) What is an alias in this context? > It is assigning a name to the column, similar to doing AS in SQL. > 3) How does this code even work? > Ordering for a struct

Re: Using an Option[some primitive type] in Spark Dataset API

2016-04-06 Thread Michael Armbrust
> We only define implicits for a subset of the types we support in > SQLImplicits. > We should probably consider adding Option[T] for common T as the internal > infrastructure

Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-04-06 Thread Michael Armbrust
> > Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text] Use toDS() and you can skip the .as[Text] > Sure! It works with map, but not with select. Wonder if it's by design > or...will soon be fixed? Thanks again for your help. This is by design. select is relational and works with column
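
A small sketch of the difference, assuming a top-level case class Text(id: Int, text: String) (the field names are a guess) and the usual sqlContext implicits:

    case class Text(id: Int, text: String)

    val ds = Seq(Text(0, "hello"), Text(1, "world")).toDS()  // no .as[Text] needed
    ds.map(_.text.toUpperCase)     // typed: works with lambdas
    ds.select($"text")             // relational: works with column expressions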

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
act scenario when it will prune partitions. I am bit > confused now. Isnt there a way to see the exact partition pruning? > > Thanks > > On Tue, Apr 5, 2016 at 8:59 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> For the in-memory cache, we still launch tasks,

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
9.2 KB (memory) / 1 42.0 B / 1 > 5 32 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 5 ms 2 > ms 5 ms 0 ms 0 ms 0.0 B 60.3 KB (memory) / 1 42.0 B / 1 > 6 33 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 5 ms 3 > ms 4 ms 0 ms 0 ms 0.0 B 70.3 KB (memor

Re: dataframe sorting and find the index of the maximum element

2016-04-05 Thread Michael Armbrust
You should generally think of a DataFrame as unordered, unless you are explicitly asking for an order. One way to order and assign an index is with window functions. On Tue, Apr 5, 2016 at 4:17 AM, Angel
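
For example, a sketch of assigning an index with a window function (df and its column names are hypothetical):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    // impose an explicit ordering, then number the rows within it
    val w = Window.orderBy($"value".desc)
    val indexed = df.withColumn("index", row_number().over(w))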

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
Can you show your full code. How are you partitioning the data? How are you reading it? What is the resulting query plan (run explain() or EXPLAIN). On Tue, Apr 5, 2016 at 10:02 AM, dsing001 wrote: > HI, > > I am using 1.5.2. I have a dataframe which is partitioned

Re: [Spark SQL]: UDF with Array[Double] as input

2016-04-01 Thread Michael Armbrust
What error are you getting? Here is an example. External types are documented here:

Re: Support for time column type?

2016-04-01 Thread Michael Armbrust
There is also CalendarIntervalType. Is that what you are looking for? On Fri, Apr 1, 2016 at 1:11 PM, Philip Weaver wrote: > Hi, I don't see any mention of a time type in the documentation (there is > DateType and TimestampType, but not TimeType), and have been unable

Re: pyspark read json file with high dimensional sparse data

2016-03-30 Thread Michael Armbrust
You can force the data to be loaded as a sparse map assuming the key/value types are consistent. Here is an example. On Wed, Mar 30,
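
A minimal sketch of forcing a map column with an explicit schema (field names and the path are made up):

    import org.apache.spark.sql.types._

    // an explicit MapType keeps sparse keys in a single map column
    // instead of inferring thousands of mostly-null top-level columns
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("features", MapType(StringType, DoubleType))))
    val df = sqlContext.read.schema(schema).json("/path/to/sparse.json")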

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
cs/").json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*") > > If so, it returns the same error: > > java.lang.AssertionError: assertion failed: Conflicting directory > structures detected. Suspicious paths:? > hdfs://user/hdfs/analytics/app1/PAGEVIEW > hdfs://user/hdf

Re: DataFrameWriter.save fails job with one executor failure

2016-03-25 Thread Michael Armbrust
I would not recommend using the direct output committer with HDFS. Its intended only as an optimization for S3. On Fri, Mar 25, 2016 at 4:03 AM, Vinoth Chandar wrote: > Hi, > > We are doing the following to save a dataframe in parquet (using > DirectParquetOutputCommitter) as

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
Have you tried setting a base path for partition discovery? Starting from Spark 1.6.0, partition discovery only finds partitions under > the given paths by default. For the above example, if users pass > path/to/table/gender=male to either SQLContext.read.parquet or > SQLContext.read.load, gender
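
A sketch of the basePath option applied to the paths quoted above (whether it fully resolves this particular layout is an assumption):

    // point partition discovery at the common root so the per-app
    // subdirectories are treated as partitions, not conflicting roots
    val df = sqlContext.read
      .option("basePath", "hdfs://user/hdfs/analytics/")
      .json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")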

Re: Column explode a map

2016-03-24 Thread Michael Armbrust
If you know the map keys ahead of time then you can just extract them directly. Here are a few examples. On Thu, Mar 24, 2016 at 12:01

Re: calling individual columns from spark temporary table

2016-03-24 Thread Michael Armbrust
).map(x => > (x.getString(0),x.getString(1).) > > Can you give an example of column expression please > like > > df.filter(col("paid") > "").col("firstcolumn").getString ? > > > > > On Thursday, 24 March 2016, 0:45, Micha

Re: calling individual columns from spark temporary table

2016-03-23 Thread Michael Armbrust
s there anyway one can keep the csv column names using databricks when > mapping > > val r = df.filter(col("paid") > "").map(x => > (x.getString(0),x.getString(1).) > > can I call example x.getString(0).as.(firstcolumn) in above when mapping > if poss

Re: calling individual columns from spark temporary table

2016-03-23 Thread Michael Armbrust
You probably need to use `backticks` to escape `_1` since I don't think that it's a valid SQL identifier. On Wed, Mar 23, 2016 at 5:10 PM, Ashok Kumar wrote: > Gurus, > > If I register a temporary table as below > > r.toDF > res58: org.apache.spark.sql.DataFrame =
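
A minimal sketch of the backtick escaping (the sample data and table name are invented):

    import sqlContext.implicits._

    // tuple-derived column names like _1 are not plain SQL identifiers,
    // so quote them with backticks in the query
    val df = Seq((1, "a"), (2, "b")).toDF()   // columns default to _1, _2
    df.registerTempTable("pairs")
    sqlContext.sql("SELECT `_1`, `_2` FROM pairs WHERE `_1` = 1").show()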

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-21 Thread Michael Armbrust
> > But when tired using Spark streamng I could not find a way to store the > data with the avro schema information. The closest that I got was to create > a Dataframe using the json RDDs and store them as parquet. Here the parquet > files had a spark specific schema in their footer. > Does this

Re: Spark SQL Optimization

2016-03-21 Thread Michael Armbrust
It's helpful if you can include the output of EXPLAIN EXTENDED or df.explain(true) whenever asking about query performance. On Mon, Mar 21, 2016 at 6:27 AM, gtinside wrote: > Hi , > > I am trying to execute a simple query with join on 3 tables. When I look at > the execution

Re: Subquery performance

2016-03-20 Thread Michael Armbrust
t? > > > > y > > > > *From:* Michael Armbrust [mailto:mich...@databricks.com] > *Sent:* March-17-16 8:59 PM > *To:* Younes Naguib > *Cc:* user@spark.apache.org > *Subject:* Re: Subquery performance > > > > Try running EXPLAIN on both version of the query.

Re: Subquery performance

2016-03-19 Thread Michael Armbrust
Try running EXPLAIN on both versions of the query. Likely when you cache the subquery we know that it's going to be small, so we use a broadcast join instead of shuffling the data. On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi all, > > > > I’m running

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Michael Armbrust
On Mon, Mar 14, 2016 at 1:30 PM, Prabhu Joseph wrote: > > Thanks for the recommendation. But can you share what are the > improvements made above Spark-1.2.1 and how which specifically handle the > issue that is observed here. > Memory used for query execution is

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Michael Armbrust
+1 to upgrading Spark. 1.2.1 has none of the memory management improvements that were added in 1.4-1.6. On Mon, Mar 14, 2016 at 2:03 AM, Prabhu Joseph wrote: > The issue is the query hits OOM on a Stage when reading Shuffle Output > from previous stage.How come

Re: Spark SQL / Parquet - Dynamic Schema detection

2016-03-14 Thread Michael Armbrust
> > Each json file is of a single object and has the potential to have > variance in the schema. > How much variance are we talking? JSON->Parquet is going to do well with 100s of different columns, but at 10,000s many things will probably start breaking.

Re: Can someone fix this download URL?

2016-03-14 Thread Michael Armbrust
Yeah, sorry. I'll make sure this gets fixed. On Mon, Mar 14, 2016 at 12:48 AM, Sean Owen wrote: > Yeah I can't seem to download any of the artifacts via the direct download > / cloudfront URL. The Apache mirrors are fine, so use those for the moment. > @marmbrus were you

Re: adding rows to a DataFrame

2016-03-11 Thread Michael Armbrust
Or look at explode on DataFrame On Fri, Mar 11, 2016 at 10:45 AM, Stefan Panayotov wrote: > Hi, > > I have a problem that requires me to go through the rows in a DataFrame > (or possibly through rows in a JSON file) and conditionally add rows > depending on a value in one of

Re: udf StructField to JSON String

2016-03-11 Thread Michael Armbrust
df.select("event").toJSON On Fri, Mar 11, 2016 at 9:53 AM, Caires Vinicius wrote: > Hmm. I think my problem is a little more complex. I'm using > https://github.com/databricks/spark-redshift and when I read from JSON > file I got this schema. > > root > > |-- app: string

[ANNOUNCE] Announcing Spark 1.6.1

2016-03-10 Thread Michael Armbrust
Spark 1.6.1 is a maintenance release containing stability fixes. This release is based on the branch-1.6 maintenance branch of Spark. We *strongly recommend* all 1.6.0 users to upgrade to this release. Notable fixes include: - Workaround for OOM when writing large partitioned tables SPARK-12546

Re: AVRO vs Parquet

2016-03-10 Thread Michael Armbrust
A few clarifications: > 1) High memory and cpu usage. This is because Parquet files can't be > streamed into as records arrive. I have seen a lot of OOMs in reasonably > sized MR/Spark containers that write out Parquet. When doing dynamic > partitioning, where many writers are open at once,

Re: Spark structured streaming

2016-03-08 Thread Michael Armbrust
This is in active development, so there is not much that can be done from an end user perspective. In particular the only sink that is available in apache/master is a testing sink that just stores the data in memory. We are working on a parquet based file sink and will eventually support all the

Re: Spark 1.5.2 : change datatype in programaticallly generated schema

2016-03-04 Thread Michael Armbrust
Change the type of a subset of the columns using withColumn, after you have loaded the DataFrame. Here is an example. On Thu, Mar
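
For example, a hedged one-liner (df and the column name are hypothetical):

    import org.apache.spark.sql.types.DoubleType

    // replace a column with a cast copy of itself after loading
    val converted = df.withColumn("amount", df("amount").cast(DoubleType))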

Re: Spark SQL - udf with entire row as parameter

2016-03-04 Thread Michael Armbrust
You have to use SQL to call it (but you will be able to do it with dataframes in Spark 2.0 due to a better parser). You need to construct a struct(*) and then pass that to your function since a function must have a fixed number of arguments. Here is an example
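
A rough sketch of the struct(*) approach (the UDF, table and column names are made up; the UDF receives the struct as a Row):

    import org.apache.spark.sql.Row

    // register a UDF taking a single struct argument, then pass the
    // whole row to it by wrapping every column with struct(*)
    sqlContext.udf.register("rowInfo", (r: Row) => r.mkString("|"))
    sqlContext.sql("SELECT rowInfo(struct(*)) AS info FROM people").show()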

Re: Does Spark 1.5.x really still support Hive 0.12?

2016-03-04 Thread Michael Armbrust
Read the docs at the link that you pasted: http://spark.apache.org/docs/latest/sql-programming-guide.html#interacting-with-different-versions-of-hive-metastore Spark will always compile against the same version of Hive (1.2.1), but it can dynamically load jars to speak to other versions. On Fri,

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Michael Armbrust
-dev +user StructType(StructField(data,ArrayType(StructType(StructField( > *stuff,ArrayType(*StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true), > StructField(name,StringType,true)),true),true), StructField(othertype, >

Re: DataSet Evidence

2016-03-01 Thread Michael Armbrust
Hey Steve, This isn't possible today, but it would not be hard to allow. You should open a feature request JIRA. Michael On Mon, Feb 29, 2016 at 4:55 PM, Steve Lewis wrote: > I have a relatively complex Java object that I would like to use in a > dataset > > if I say

Re: Mapper side join with DataFrames API

2016-03-01 Thread Michael Armbrust
It's helpful to always include the output of df.explain(true) when you are asking about performance. On Mon, Feb 29, 2016 at 6:14 PM, Deepak Gopalakrishnan wrote: > Hello All, > > I'm trying to join 2 dataframes A and B with a > > sqlContext.sql("SELECT * FROM A INNER JOIN B ON

Re: Spark SQL support for sub-queries

2016-02-26 Thread Michael Armbrust
There will probably be some subquery support in 2.0. That particular query would be more efficient to express as an argmax however. Here is an example in Spark 1.6

Re: d.filter("id in max(id)")

2016-02-26 Thread Michael Armbrust
You can do max on a struct to get the max value for the first column, along with the values for other columns in the row (an argmax). Here is an example
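
A minimal argmax sketch along these lines (the column names are illustrative):

    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    // max over a struct orders by its first field, so putting the value
    // first returns the whole winning row inside one struct
    val best = df
      .groupBy($"customer_id")
      .agg(max(struct($"total", $"item")).as("best"))
      .select($"customer_id", $"best.total", $"best.item")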

Re: Filter on a column having multiple values

2016-02-24 Thread Michael Armbrust
You can do this either with expr("... IN ...") or isin. Here is a full example. On Wed, Feb 24, 2016 at 2:40 PM, Ashok Kumar
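
The same multi-value filter written both ways, as a small sketch (the column and values are hypothetical):

    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    df.filter($"status".isin("NEW", "OPEN", "PENDING"))
    df.filter(expr("status IN ('NEW', 'OPEN', 'PENDING')"))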

Re: How to Exploding a Map[String,Int] column in a DataFrame (Scala)

2016-02-24 Thread Michael Armbrust
You can do this using the explode function defined in org.apache.spark.sql.functions. Here is some example code. On Wed, Feb 24,
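
A short sketch of exploding a map column (the sample data is invented):

    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    // explode on a map column yields one row per entry,
    // in columns named key and value
    val df = Seq(("a", Map("x" -> 1, "y" -> 2))).toDF("id", "counts")
    df.select($"id", explode($"counts")).show()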

Re: Serializing collections in Datasets

2016-02-22 Thread Michael Armbrust
I think this will be fixed in 1.6.1. Can you test when we post the first RC? (hopefully later today) On Mon, Feb 22, 2016 at 1:51 PM, Daniel Siegmann < daniel.siegm...@teamaol.com> wrote: > Experimenting with datasets in Spark 1.6.0 I ran into a serialization > error when using case classes

Re: equalTo isin not working as expected with a constructed column with DataFrames

2016-02-19 Thread Michael Armbrust
Can you include the output of explain(true) on the dataframe in question. It would also be really helpful to see a small code fragment that reproduces the issue. On Thu, Feb 18, 2016 at 9:10 AM, Mehdi Ben Haj Abbes wrote: > Hi, > I forgot to mention that I'm using the

Re: Spark Job Hanging on Join

2016-02-19 Thread Michael Armbrust
Please include the output of running explain() when reporting performance issues with DataFrames. On Fri, Feb 19, 2016 at 9:31 AM, Tamara Mendt wrote: > Hi all, > > I am running a Spark job that gets stuck attempting to join two > dataframes. The dataframes are not very

Re: trouble using Aggregator with DataFrame

2016-02-17 Thread Michael Armbrust
Glad you like it :) This sounds like a bug, and we should fix it as we merge DataFrame / Dataset for 2.0. Could you open a JIRA targeted at 2.0? On Wed, Feb 17, 2016 at 2:22 PM, Koert Kuipers wrote: > first of all i wanted to say that i am very happy to see >

Re: cartesian with Dataset

2016-02-17 Thread Michael Armbrust
You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely. On Wed, Feb 17, 2016 at 4:08 AM, Alex Dzhagriev wrote: > Hello all, > > Is anybody aware of any plans to support cartesian for
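
For illustration, a hedged one-liner (dsA and dsB stand in for any two Datasets):

    import org.apache.spark.sql.functions.lit

    // a cross product expressed as a typed join whose condition is always true
    val pairs = dsA.joinWith(dsB, lit(true))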

Re: Dataset takes more memory compared to RDD

2016-02-15 Thread Michael Armbrust
What algorithm? Can you provide code? On Fri, Feb 12, 2016 at 3:22 PM, Raghava Mutharaju < m.vijayaragh...@gmail.com> wrote: > Hello All, > > I implemented an algorithm using both the RDDs and the Dataset API (in > Spark 1.6). Dataset version takes lot more memory than the RDDs. Is this >

Re: GroupedDataset needs a mapValues

2016-02-13 Thread Michael Armbrust
Instead of grouping with a lambda function, you can do it with a column expression to avoid materializing an unnecessary tuple: df.groupBy($"_1") Regarding the mapValues, you can do something similar using an Aggregator

Re: org.apache.spark.sql.AnalysisException: undefined function lit;

2016-02-13 Thread Michael Armbrust
selectExpr just uses the SQL parser to interpret the string you give it. So to get a string literal you would use quotes: df.selectExpr("*", "'" + time.miliseconds() + "' AS ms") On Fri, Feb 12, 2016 at 6:19 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > I am trying to add a column
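
A small sketch of the quoting (System.currentTimeMillis stands in here for the streaming batch time from the thread):

    // selectExpr parses its argument as SQL, so a string literal
    // needs its own single quotes inside the expression
    val ms = System.currentTimeMillis()
    val withTime = df.selectExpr("*", s"'$ms' AS ms")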

Re: broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Michael Armbrust
> > My question is that is "NOSCAN" option a must? If I execute "ANALYZE TABLE > compute statistics" command in Hive shell, is the statistics > going to be used by SparkSQL to decide broadcast join? Yes, spark SQL will only accept the simple no scan version. However, as long as the sizeInBytes

Re: Dataset Encoders for SparseVector

2016-02-04 Thread Michael Armbrust
We are hoping to add better support for UDTs in the next release, but for now you can use kryo to generate an encoder for any class: implicit val vectorEncoder = org.apache.spark.sql.Encoders.kryo[SparseVector] On Thu, Feb 4, 2016 at 12:22 PM, raj.kumar wrote: > Hi, >
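
A minimal sketch of using that encoder (the sample vector and sqlContext are assumptions):

    import org.apache.spark.mllib.linalg.SparseVector
    import org.apache.spark.sql.Encoders

    // a kryo-based encoder serializes the object into a single binary column
    implicit val vectorEncoder = Encoders.kryo[SparseVector]
    val vectors = sqlContext.createDataset(
      Seq(new SparseVector(3, Array(0, 2), Array(1.0, 5.0))))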

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-03 Thread Michael Armbrust
On Wed, Feb 3, 2016 at 1:42 PM, Nirav Patel wrote: > Awesome! I just read design docs. That is EXACTLY what I was talking > about! Looking forward to it! > Great :) Most of the API is there in 1.6. For the next release I would like to unify DataFrame <-> Dataset and do

Re: optimal way to load parquet files with partition

2016-02-02 Thread Michael Armbrust
It depends how many partitions you have and if you are only doing a single operation. Loading all the data and filtering will require us to scan the directories to discover all the months. This information will be cached. Then we should prune and avoid reading unneeded data. Option 1 does not

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Michael Armbrust
> > A principal difference between RDDs and DataFrames/Datasets is that the > latter have a schema associated to them. This means that they support only > certain types (primitives, case classes and more) and that they are > uniform, whereas RDDs can contain any serializable object and must not >

Re: Spark 2.0.0 release plan

2016-01-29 Thread Michael Armbrust
ark builds to > Scala > > 2.11 with Spark 2.0? > > > > Regards > > Deenar > > > > On 27 January 2016 at 19:55, Michael Armbrust <mich...@databricks.com> > > wrote: > >> > >> We do maintenance releases on demand when there is enough to justify >

Re: Broadcast join on multiple dataframes

2016-01-28 Thread Michael Armbrust
Can you provide the analyzed and optimized plans (explain(true)) On Thu, Jan 28, 2016 at 12:26 PM, Srikanth wrote: > Hello, > > I have a use case where one large table has to be joined with several > smaller tables. > I've added broadcast hint for all small tables in the

Re: Spark 2.0.0 release plan

2016-01-27 Thread Michael Armbrust
We do maintenance releases on demand when there is enough to justify doing one. I'm hoping to cut 1.6.1 soon, but have not had time yet. On Wed, Jan 27, 2016 at 8:12 AM, Daniel Siegmann < daniel.siegm...@teamaol.com> wrote: > Will there continue to be monthly releases on the 1.6.x branch during

Re: NPE from sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply?

2016-01-26 Thread Michael Armbrust
That is a bug in generated code. It would be great if you could post a reproduction. On Tue, Jan 26, 2016 at 9:15 AM, Jacek Laskowski wrote: > Hi, > > Does this say anything to anyone? :) It's with Spark 2.0.0-SNAPSHOT > built today. Is this something I could fix myself in my

Re: Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-25 Thread Michael Armbrust
Looks like you found a bug. I've filed them here: SPARK-12987 - Drop fails when columns contain dots SPARK-12988 - Can't drop columns that contain dots On Fri, Jan 22, 2016 at 3:18 PM, Joshua

Re: Datasets and columns

2016-01-25 Thread Michael Armbrust
The encoder is responsible for mapping your class onto some set of columns. Try running: datasetMyType.printSchema() On Mon, Jan 25, 2016 at 1:16 PM, Steve Lewis wrote: > assume I have the following code > > SparkConf sparkConf = new SparkConf(); > > JavaSparkContext

Re: Datasets and columns

2016-01-25 Thread Michael Armbrust
he schema it looks like KRYO makes one column is there > a way to do a custom encoder with my own columns > On Jan 25, 2016 1:30 PM, "Michael Armbrust" <mich...@databricks.com> > wrote: > >> The encoder is responsible for mapping your class onto some set of >>

Re: I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Michael Armbrust
The analog to PairRDD is a GroupedDataset (created by calling groupBy), which offers similar functionality, but doesn't require you to construct new objects that are in the form of key/value pairs. It doesn't matter if they are complex objects, as long as you can create an encoder for them

Re: I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Michael Armbrust
the issue of looking at schools in > neighboring regions > > On Wed, Jan 20, 2016 at 10:43 AM, Michael Armbrust <mich...@databricks.com > > wrote: > >> The analog to PairRDD is a GroupedDataset (created by calling groupBy), >> which offers similar functionality, b

Re: Redundant common columns of nature full outer join

2016-01-20 Thread Michael Armbrust
If you use the join that takes USING columns it should automatically coalesce (take the non null value from) the left/right columns: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L405 On Tue, Jan 19, 2016 at 10:51 PM, Zhong Wang
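
For example, a hedged sketch of the USING-style join (left, right and the id column are hypothetical):

    // joining on a column list keeps a single coalesced "id" column
    // instead of left.id and right.id side by side
    val joined = left.join(right, Seq("id"), "outer")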

Re: Spark SQL -Hive transactions support

2016-01-19 Thread Michael Armbrust
We don't support Hive-style transactions. On Tue, Jan 19, 2016 at 11:32 AM, hnagar wrote: > Hive has transactions support since version 0.14. > > I am using Spark 1.6, and Hive 1.2.1, are transactions supported in Spark > SQL now. I tried in the Spark-Shell and it

Re: Spark Dataset doesn't have api for changing columns

2016-01-19 Thread Michael Armbrust
In Spark 2.0 we are planning to combine DataFrame and Dataset so that all the methods will be available on either class. On Tue, Jan 19, 2016 at 3:42 AM, Milad khajavi wrote: > Hi Spark users, > > when I want to map the result of count on groupBy, I need to convert the >

Re: Serializing DataSets

2016-01-18 Thread Michael Armbrust
What error? On Mon, Jan 18, 2016 at 9:01 AM, Simon Hafner <reactorm...@gmail.com> wrote: > And for deserializing, > `sqlContext.read.parquet("path/to/parquet").as[T]` and catch the > error? > > 2016-01-14 3:43 GMT+08:00 Michael Armbrust <mich...@databricks.

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-15 Thread Michael Armbrust
See here for some workarounds: https://issues.apache.org/jira/browse/SPARK-12546 On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam wrote: > Hi Arkadiusz, > > the partitionBy is not designed to have many distinct value the last time > I used it. If you search in the mailing list,

Re: SQL UDF problem (with re to types)

2016-01-14 Thread Michael Armbrust
We automatically convert types for UDFs defined in Scala, but we can't do it in Java because the types are erased by the compiler. If you want to use double you should cast before calling the UDF. On Wed, Jan 13, 2016 at 8:10 PM, Raghu Ganti wrote: > So, when I try
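
A small sketch of casting before the call (the UDF, table and column names are invented):

    import org.apache.spark.sql.functions.callUDF
    import sqlContext.implicits._

    // make sure the Java UDF sees the exact type it was registered with
    sqlContext.sql("SELECT myJavaUdf(CAST(score AS DOUBLE)) FROM scores")
    df.select(callUDF("myJavaUdf", $"score".cast("double")))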

Re: SQL UDF problem (with re to types)

2016-01-14 Thread Michael Armbrust
type erasure is solved through proper generics > implementation in Java 1.8). > > On Thu, Jan 14, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> We automatically convert types for UDFs defined in Scala, but we can't do >> it in Java because

Re: How to make Dataset api as fast as DataFrame

2016-01-13 Thread Michael Armbrust
The focus of this release was to get the API out there and there's a lot of low hanging performance optimizations. That said, there is likely always going to be some cost of materializing objects. Another note: anytime you're comparing performance, it's useful to include the output of explain so we

Re: Serializing DataSets

2016-01-13 Thread Michael Armbrust
Yeah, that's the best way for now (note the conversion is purely logical so there is no cost of calling toDF()). We'll likely be combining the classes in Spark 2.0 to remove this awkwardness. On Tue, Jan 12, 2016 at 11:20 PM, Simon Hafner wrote: > What's the proper way to
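
A minimal round-trip sketch of that workaround (Person, ds and the path are hypothetical):

    // write the typed Dataset through its DataFrame view,
    // then read the parquet back and re-attach the type with as[T]
    ds.toDF().write.parquet("/tmp/people.parquet")
    val restored = sqlContext.read.parquet("/tmp/people.parquet").as[Person]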

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Michael Armbrust
> > df1.as[TestCaseClass].map(_.toMyMap).show() //fails > > This looks like a bug. What is the error? It might be fixed in branch-1.6/master if you can test there. > Please advice on what I may be missing here? > > > Also for join, may I suggest to have a custom encoder / transformation to >

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Michael Armbrust
.scala:861) >> at >> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1607) >> at >> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) >> at >> org.apache.spark.scheduler.DAGSchedulerEventProce

Re: [Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Michael Armbrust
There can be data loss when you are using the DirectOutputCommitter and speculation is turned on, so we disable it automatically. On Tue, Jan 12, 2016 at 1:11 PM, Jerry Lam wrote: > Hi spark users and developers, > > I wonder if the following observed behaviour is expected.
