[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488496#comment-15488496 ] Michael Armbrust commented on SPARK-15406: -- For the types that are coming out, the SQL way would

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488477#comment-15488477 ] Michael Armbrust commented on SPARK-15406: -- Streaming is labeled experimental, we can continue

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487875#comment-15487875 ] Michael Armbrust commented on SPARK-15406: -- Hey Cody, thanks for the input and for sharing your

Re: [SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!

2016-08-16 Thread Michael Armbrust
Try running explain on each of these. My guess would be that caching is broken in some cases. On Tue, Aug 16, 2016 at 6:05 PM, Jacek Laskowski wrote: > Hi, > > Can anyone explain why spark.read.csv("people.csv").cache.show ends up > with a WARN while

Re:

2016-08-14 Thread Michael Armbrust
skowski > > > On Sun, Aug 14, 2016 at 9:51 AM, Michael Armbrust > <mich...@databricks.com> wrote: > > Have you tried doing the join in two parts (id == 0 and id != 0) and then > > doing a union of the results? It is possible that with this technique, > that &g

Re:

2016-08-14 Thread Michael Armbrust
Have you tried doing the join in two parts (id == 0 and id != 0) and then doing a union of the results? It is possible that with this technique, that the join which only contains skewed data would be filtered enough to allow broadcasting of one side. On Sat, Aug 13, 2016 at 11:15 PM, Jestin Ma

Re: call a mysql stored procedure from spark

2016-08-14 Thread Michael Armbrust
As described here , you can use the DataSource API to connect to an external database using JDBC. While the dbtable option is usually just a table name, it can also be any valid SQL command that returns a
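A minimal sketch of the technique described above — passing a derived-table subquery as the `dbtable` option instead of a bare table name. The URL, credentials, and table/column names here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object JdbcSubqueryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-subquery").getOrCreate()

    // dbtable accepts anything valid in a FROM clause, so a parenthesized
    // subquery with an alias works in place of a table name.
    val query = "(SELECT id, name FROM people WHERE age > 21) AS adults"

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db.example.com:3306/mydb") // hypothetical
      .option("dbtable", query)
      .option("user", "spark")
      .option("password", "secret")
      .load()

    df.show()
  }
}
```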

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-14 Thread Michael Armbrust
Anytime you see JaninoRuntimeException you are seeing a bug in our code generation. If you can come up with a small example that causes the problem it would be very helpful if you could open a JIRA. On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar wrote: > I see a similar

Re: [SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-14 Thread Michael Armbrust
There are two type systems in play here: Spark SQL's and Scala's. From the Scala side, this is type-safe. After calling as[String] the Dataset will only return Strings. It is impossible to ever get a class cast exception unless you do your own incorrect casting after the fact. Underneath the
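A small illustration of the two type systems at work, assuming a Spark 2.x `SparkSession` named `spark` with its implicits imported:

```scala
import spark.implicits._

// Spark SQL's type system: the underlying column is still an integer.
// Scala's type system: .as[String] types the Dataset as String, and
// Spark inserts an upcast from int to string to make that safe.
val ds = (0 to 9).toDF("num").as[String]

// collect() therefore returns strings, never throwing a ClassCastException.
val strings: Array[String] = ds.collect() // Array("0", "1", ..., "9")
```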

Re: Does Spark SQL support indexes?

2016-08-14 Thread Michael Armbrust
Using df.write.partitionBy is similar to a coarse-grained, clustered index in a traditional database. You can't use it on temporary tables, but it will let you efficiently select small parts of a much larger table. On Sat, Aug 13, 2016 at 11:13 PM, Jörn Franke wrote: >
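A sketch of the coarse-grained-index effect described above, assuming `df` is a DataFrame with a `year` column and the output path is hypothetical:

```scala
// partitionBy lays data out as key=value directories on disk, e.g.
// /data/events/year=2015/, /data/events/year=2016/, ...
df.write.partitionBy("year").parquet("/data/events")

// A filter on the partition column prunes to matching directories,
// so only a small part of the larger table is actually read.
val recent = spark.read.parquet("/data/events").filter($"year" === 2016)
```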

Re: Source API requires unbounded distributed storage?

2016-08-04 Thread Michael Armbrust
Yeah, this API is in the private execution package because we are planning to continue to iterate on it. Today, we will only ever go back one batch, though that might change in the future if we do async checkpointing of internal state. You are totally right that we should relay this info back to

Re: How to set nullable field when create DataFrame using case class

2016-08-04 Thread Michael Armbrust
Nullable is an optimization for Spark SQL. It is telling spark to not even do an if check when accessing that field. In this case, your data *is* nullable, because timestamp is an object in java and you could put null there. On Thu, Aug 4, 2016 at 2:56 PM, luismattor

Re: error while running filter on dataframe

2016-07-31 Thread Michael Armbrust
You are hitting a bug in code generation. If you can come up with a small reproduction of the problem, it would be very helpful if you could open a JIRA. On Sun, Jul 31, 2016 at 9:14 AM, Tony Lane wrote: > Can someone help me understand this error which occurs while

Re: calling dataset.show on a custom object - displays toString() value as first column and blank for rest

2016-07-31 Thread Michael Armbrust
Can you share your code? This does not happen for me. On Sun, Jul 31, 2016 at 7:16 AM, Rohit Chaddha

Re: spark 2.0 readStream from a REST API

2016-07-31 Thread Michael Armbrust
You have to add a file in resource too (example ). Either that or give a full class name. On Sun, Jul 31, 2016 at 9:45 AM, Ayoub Benali

Re: [Spark 2.0] Why MutableInt cannot be cast to MutableLong?

2016-07-31 Thread Michael Armbrust
Are you sure you are running Spark 2.0? In your stack trace I see SqlNewHadoopRDD, which was removed in #12354 . On Sun, Jul 31, 2016 at 2:12 AM, Chanh Le wrote: > Hi everyone, > Why *MutableInt* cannot be cast to *MutableLong?*

Re: libraryDependencies

2016-07-26 Thread Michael Armbrust
park > [error] import org.apache.spark.mllib.linalg.SingularValueDecomposition > [error] ^ > [error] > /Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/src/main/scala/MyApp.scala:5: > object mllib is not a member of package org.apache.spark &

Re: libraryDependencies

2016-07-26 Thread Michael Armbrust
Also, you'll want all of the various spark versions to be the same. On Tue, Jul 26, 2016 at 12:34 PM, Michael Armbrust <mich...@databricks.com> wrote: > If you are using %% (double) then you do not need _2.11. > > On Tue, Jul 26, 2016 at 12:18 PM, Martin Somers <sono..

Re: Outer Explode needed

2016-07-25 Thread Michael Armbrust
I don't think this would be hard to implement. The physical explode operator supports it (for our HiveQL compatibility). Perhaps comment on this JIRA? https://issues.apache.org/jira/browse/SPARK-13721 It could probably just be another argument to explode() Michael On Mon, Jul 25, 2016 at 6:12

[jira] [Created] (SPARK-16724) Expose DefinedByConstructorParams

2016-07-25 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-16724: Summary: Expose DefinedByConstructorParams Key: SPARK-16724 URL: https://issues.apache.org/jira/browse/SPARK-16724 Project: Spark Issue Type: Bug

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Michael Armbrust
+1 On Fri, Jul 22, 2016 at 2:42 PM, Holden Karau wrote: > +1 (non-binding) > > Built locally on Ubuntu 14.04, basic pyspark sanity checking & tested with > a simple structured streaming project (spark-structured-streaming-ml) & > spark-testing-base &

Re: transtition SQLContext to SparkSession

2016-07-18 Thread Michael Armbrust
+ dev, reynold Yeah, that's a good point. I wonder if SparkSession.sqlContext should be public/deprecated? On Mon, Jul 18, 2016 at 8:37 AM, Koert Kuipers wrote: > in my codebase i would like to gradually transition to SparkSession, so > while i start using SparkSession i

[jira] [Created] (SPARK-16609) Single function for parsing timestamps/dates

2016-07-18 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-16609: Summary: Single function for parsing timestamps/dates Key: SPARK-16609 URL: https://issues.apache.org/jira/browse/SPARK-16609 Project: Spark Issue

[jira] [Updated] (SPARK-16609) Single function for parsing timestamps/dates

2016-07-18 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16609: - Target Version/s: 2.1.0 > Single function for parsing timestamps/da

[jira] [Resolved] (SPARK-16531) Remove TimeZone from DataFrameTimeWindowingSuite

2016-07-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-16531. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 14170

[jira] [Updated] (SPARK-16483) Unifying struct fields and columns

2016-07-11 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16483: - Target Version/s: 2.1.0 > Unifying struct fields and colu

Re: Saving Table with Special Characters in Columns

2016-07-11 Thread Michael Armbrust
This is protecting you from a limitation in parquet. The library will let you write out invalid files that can't be read back, so we added this check. You can call .format("csv") (in spark 2.0) to switch it to CSV. On Mon, Jul 11, 2016 at 11:16 AM, Tobi Bosede wrote: > Hi
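A minimal sketch of the workaround mentioned above, assuming Spark 2.0 and a DataFrame `df` whose column names contain characters parquet rejects (the output path is hypothetical):

```scala
// CSV does not share parquet's restriction on special characters in
// column names, so switching the output format sidesteps the check.
df.write
  .format("csv")
  .option("header", "true")
  .save("/tmp/out-csv")
```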

Re: DataFrame Min By Column

2016-07-09 Thread Michael Armbrust
riented Data Scientist > UC Berkeley AMPLab Alumni > > pedrorodriguez.io | 909-353-4423 > github.com/EntilZha | LinkedIn > <https://www.linkedin.com/in/pedrorodriguezscience> > > On July 9, 2016 at 2:19:11 PM, Michael Armbrust (mich...@databricks.com) > wrote: > >

Re: DataFrame Min By Column

2016-07-09 Thread Michael Armbrust
You can do what's called an *argmax/argmin*, where you take the min/max of a couple of columns that have been grouped together as a struct. We sort in column order, so you can put the timestamp first. Here is an example
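The argmin trick described above, sketched for a hypothetical DataFrame `df` with `id`, `ts`, and `value` columns. Because structs compare field by field in order, putting the timestamp first makes `min` select the whole row with the earliest timestamp:

```scala
import org.apache.spark.sql.functions.{min, struct}

// For each id, find the value at the earliest timestamp.
val earliest = df
  .groupBy($"id")
  .agg(min(struct($"ts", $"value")).as("earliest"))
  // Pull the winning row's fields back out of the struct.
  .select($"id", $"earliest.ts".as("ts"), $"earliest.value".as("value"))
```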

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Michael Armbrust
We are planning to address this issue in the future. At a high level, we'll have to add a delta mode so that updates can be communicated from one operator to the next. On Thu, Jul 7, 2016 at 8:59 AM, Arnaud Bailly wrote: > Indeed. But nested aggregation does not work

[jira] [Commented] (SPARK-8360) Structured Streaming (aka Streaming DataFrames)

2016-07-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359516#comment-15359516 ] Michael Armbrust commented on SPARK-8360: - This kind of question would be better asked

Re: Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

2016-06-28 Thread Michael Armbrust
n RDD inside of a > custom Sink and then doing your operations on that be a reasonable work > around? > > > On Tuesday, June 28, 2016, Michael Armbrust <mich...@databricks.com> > wrote: > >> This is not too broadly worded, and in general I would caution that any

Re: Logging trait in Spark 2.0

2016-06-28 Thread Michael Armbrust
I'd suggest using the slf4j APIs directly. They provide a nice stable API that works with a variety of logging backends. This is what Spark does internally. On Sun, Jun 26, 2016 at 4:02 AM, Paolo Patierno wrote: > Yes ... the same here ... I'd like to know the best way for

[jira] [Closed] (SPARK-16188) Spark sql create a lot of small files

2016-06-28 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-16188. Resolution: Not A Bug This is by design and changes would likely be too disruptive

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-22 Thread Michael Armbrust
+1 On Wed, Jun 22, 2016 at 11:33 AM, Jonathan Kelly wrote: > +1 > > On Wed, Jun 22, 2016 at 10:41 AM Tim Hunter > wrote: > >> +1 This release passes all tests on the graphframes and tensorframes >> packages. >> >> On Wed, Jun 22, 2016 at 7:19

Re: cast only some columns

2016-06-21 Thread Michael Armbrust
Use `withColumn`. It will replace a column if you give it the same name. On Tue, Jun 21, 2016 at 4:16 AM, pseudo oduesp wrote: > Hi , > with fillna we can select some columns to perform replace some values > with chosing columns with dict > {columns :values } > but
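A one-line sketch of the replacement behavior, assuming `df` has an `age` column (hypothetical name):

```scala
// Because "age" already exists, withColumn replaces that column in
// place rather than appending a new one; other columns are untouched.
val casted = df.withColumn("age", $"age".cast("double"))
```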

Re: Question about equality of o.a.s.sql.Row

2016-06-20 Thread Michael Armbrust
> > This is because two objects are compared by "o1 != o2" instead of > "o1.equals(o2)" at > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L408 Even equals(...) does not do what you want on the JVM: scala> Array(1,2).equals(Array(1,2))
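The pitfall can be reproduced in a plain Scala REPL, independent of Spark:

```scala
// Arrays are plain Java objects: equals() is reference equality,
// so structurally identical arrays do not compare equal.
val a = Array(1, 2)
val b = Array(1, 2)
a.equals(b)        // false -- reference comparison
a.sameElements(b)  // true  -- element-wise comparison
a.toSeq == b.toSeq // true  -- Seq defines structural equality
```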

[jira] [Resolved] (SPARK-16050) Flaky Test: Complete aggregation with Console sink

2016-06-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-16050. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13776

Re: Hello

2016-06-17 Thread Michael Armbrust
Another good signal is the "target version" (which by convention is only set by committers). When I set this for the upcoming version it means I think it's important enough that I will prioritize reviewing a patch for it. On Fri, Jun 17, 2016 at 3:22 PM, Pedro Rodriguez

Re: Encoder Guide / Option[T] Encoder

2016-06-16 Thread Michael Armbrust
There is no public API for writing encoders at the moment, though we are hoping to open this up in Spark 2.1. What is not working about encoders for options? Which version of Spark are you running? This is working as I would expect?

Re: cutting 1.6.2 rc and 2.0.0 rc this week?

2016-06-15 Thread Michael Armbrust
+1 to both of these! On Wed, Jun 15, 2016 at 12:21 PM, Sean Owen wrote: > 1.6.2 RC seems fine to me; I don't know of outstanding issues. Clearly > we need to keep the 1.x line going for a bit, so a bug fix release > sounds good, > > Although we've got some work to do before

[jira] [Resolved] (SPARK-15964) Assignment to RDD-typed val fails

2016-06-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15964. -- Resolution: Won't Fix > Assignment to RDD-typed val fa

[jira] [Commented] (SPARK-15964) Assignment to RDD-typed val fails

2016-06-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332224#comment-15332224 ] Michael Armbrust commented on SPARK-15964: -- Thanks for reporting this, but I believe

[jira] [Updated] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.

2016-06-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15915: - Assignee: Takuya Ueshin > CacheManager should use canonicalized plan for planToCa

[jira] [Resolved] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.

2016-06-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15915. -- Resolution: Fixed Fix Version/s: 2.0.0 > CacheManager should use canonicali

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Michael Armbrust
> > 1) What does this really mean to an Application developer? > It means there are fewer concepts to learn. > 2) Why this unification was needed in Spark 2.0? > To simplify the API and reduce the number of concepts that needed to be learned. We only didn't do it in 1.6 because we didn't want

[jira] [Updated] (SPARK-15934) Return binary mode in ThriftServer

2016-06-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15934: - Assignee: Egor Pakhomov Target Version/s: 2.0.0 Priority

[jira] [Updated] (SPARK-15934) Return binary mode in ThriftServer

2016-06-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15934: - Assignee: (was: Egor Pakhomov) > Return binary mode in ThriftSer

Re: Databricks SparkPerf with Spark 2.0

2016-06-14 Thread Michael Armbrust
NoSuchMethodError always means that you are compiling against a different classpath than is available at runtime, so it sounds like you are on the right track. The project is not abandoned, we're just busy with the release. It would be great if you could open a pull request. On Tue, Jun 14,

Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Michael Armbrust
You might try with the Spark 2.0 preview. We spent a bunch of time improving the handling of many small files. On Mon, Jun 13, 2016 at 11:19 AM, khaled.hammouda wrote: > I'm trying to use Spark SQL to load json data that are split across about > 70k > files across 24

Re: Spark Thrift Server in CDH 5.3

2016-06-13 Thread Michael Armbrust
I'd try asking on the cloudera forums. On Sun, Jun 12, 2016 at 9:51 PM, pooja mehta wrote: > Hi, > > How do I start Spark Thrift Server with cloudera CDH 5.3? > > Thanks. >

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Michael Armbrust
Here's a talk I gave on the topic: https://www.youtube.com/watch?v=i7l3JQRx7Qw http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel <arunp.bigd...@gmail.com> wrote: > In Spark 2.0, D

[jira] [Resolved] (SPARK-15489) Dataset kryo encoder won't load custom user settings

2016-06-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15489. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13424

[jira] [Resolved] (SPARK-6320) Adding new query plan strategy to SQLContext

2016-06-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6320. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13147 [https

[jira] [Resolved] (SPARK-15743) Prevent saving with all-column partitioning

2016-06-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15743. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13486

Re: Spark 2.0 Streaming and Event Time

2016-06-09 Thread Michael Armbrust
There is no special setting for event time (though we will be adding one for setting a watermark in 2.1 to allow us to reduce the amount of state that needs to be kept around). Just window/groupBy on the on the column that is your event time. On Wed, Jun 8, 2016 at 4:12 PM, Chang Lim

Re: Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

2016-06-09 Thread Michael Armbrust
Look at the explain(). For a Seq we know it's just local data, so we avoid Spark jobs for simple operations. In contrast, an RDD is opaque to Catalyst, so we can't perform that optimization. On Wed, Jun 8, 2016 at 7:49 AM, Jacek Laskowski wrote: > Hi, > > I just noticed it today
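A quick way to see the difference yourself, assuming a `SparkSession` named `spark` with its implicits imported:

```scala
import spark.implicits._

// Local Seq: Catalyst sees a LocalRelation, so simple operations
// like show() need no Spark job at all.
Seq(1, 2, 3).toDF("n").explain()

// RDD: opaque to Catalyst; the plan contains a scan of the RDD,
// so even simple operations launch a job.
spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("n").explain()
```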

[jira] [Updated] (SPARK-15743) Prevent saving with all-column partitioning

2016-06-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15743: - Labels: releasenotes (was: ) > Prevent saving with all-column partition

[jira] [Updated] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow

2016-06-06 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15786: - Target Version/s: 2.0.0 > joinWith bytecode generation calling ByteBuffer.w

Re: Dataset Outer Join vs RDD Outer Join

2016-06-06 Thread Michael Armbrust
t, int)" > > The generated code is passing InternalRow objects into the ByteBuffer > > Starting from two Datasets of types Dataset[(Int, Int)] with expression > $"left._1" === $"right._1". I'll have to spend some time getting a better > understanding of

[jira] [Updated] (SPARK-15732) Dataset generated code "generated.java" Fails with Certain Case Classes

2016-06-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15732: - Priority: Critical (was: Major) > Dataset generated code "generated.jav

[jira] [Updated] (SPARK-15732) Dataset generated code "generated.java" Fails with Certain Case Classes

2016-06-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15732: - Target Version/s: 2.0.0 > Dataset generated code "generated.java" Fails

[jira] [Commented] (SPARK-12931) Improve bucket read path to only create one single RDD

2016-06-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312702#comment-15312702 ] Michael Armbrust commented on SPARK-12931: -- It was fixed in: https://github.com/apache/spark

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-01 Thread Michael Armbrust
> > I'd think we want less effort, not more, to let people test it? for > example, right now I can't easily try my product build against > 2.0.0-preview. I don't feel super strongly one way or the other, so if we need to publish it permanently we can. However, either way you can still test

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-01 Thread Michael Armbrust
Yeah, we don't usually publish RCs to central, right? On Wed, Jun 1, 2016 at 1:06 PM, Reynold Xin wrote: > They are here ain't they? > > https://repository.apache.org/content/repositories/orgapachespark-1182/ > > Did you mean publishing them to maven central? My

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Michael Armbrust
thing else, I guess Option doesn't have a first class Encoder or DataType > yet and maybe for good reasons. > > I did find the RDD join interface elegant, though. In the ideal world an > API comparable the following would be nice: > https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed9

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Michael Armbrust
Thanks for the feedback. I think this will address at least some of the problems you are describing: https://github.com/apache/spark/pull/13425 On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher wrote: > Hi, > > I've been working on transitioning from RDD to Datasets in

[jira] [Resolved] (SPARK-15686) Move user-facing structured streaming classes into sql.streaming

2016-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15686. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13429

Re: Map tuple to case class in Dataset

2016-06-01 Thread Michael Armbrust
t;> +---+ >>> >>> FYI >>> >>> On Tue, May 31, 2016 at 7:35 PM, Tim Gautier <tim.gaut...@gmail.com> >>> wrote: >>> >>>> 1.6.1 The exception is a null pointer exception. I'll paste the whole >>>> thing after I fire

Re: Map tuple to case class in Dataset

2016-05-31 Thread Michael Armbrust
Version of Spark? What is the exception? On Tue, May 31, 2016 at 4:17 PM, Tim Gautier wrote: > How should I go about mapping from say a Dataset[(Int,Int)] to a > Dataset[]? > > I tried to use a map, but it throws exceptions: > > case class Test(a: Int) >

[jira] [Commented] (SPARK-15654) Reading gzipped files results in duplicate rows

2016-05-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306831#comment-15306831 ] Michael Armbrust commented on SPARK-15654: -- Thanks for pointing this out! Looks like we need

[jira] [Updated] (SPARK-15654) Reading gzipped files results in duplicate rows

2016-05-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15654: - Target Version/s: 2.0.0 > Reading gzipped files results in duplicate r

[jira] [Updated] (SPARK-15654) Reading gzipped files results in duplicate rows

2016-05-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15654: - Priority: Blocker (was: Critical) > Reading gzipped files results in duplicate r

[jira] [Commented] (SPARK-15489) Dataset kryo encoder won't load custom user settings

2016-05-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306826#comment-15306826 ] Michael Armbrust commented on SPARK-15489: -- As soon as you open a PR it will auto assign

Re: Undocumented left join constraint?

2016-05-27 Thread Michael Armbrust
Sounds like: https://issues.apache.org/jira/browse/SPARK-15441, for which a fix is in progress. Please do keep reporting issues though, these are great! Michael On Fri, May 27, 2016 at 1:01 PM, Tim Gautier wrote: > Is it truly impossible to left join a Dataset[T] on the

Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Michael Armbrust
You can also just make sure that each user is using their own directory. A rough example can be found in TestHive. Note: in Spark 2.0 there should be no need to use HiveContext unless you need to talk to a metastore. On Thu, May 26, 2016 at 1:36 PM, Mich Talebzadeh

[jira] [Updated] (SPARK-15483) IncrementalExecution should use extra strategies.

2016-05-25 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15483: - Assignee: Takuya Ueshin > IncrementalExecution should use extra strateg

[jira] [Resolved] (SPARK-15483) IncrementalExecution should use extra strategies.

2016-05-25 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15483. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13261

Re: feedback on dataset api explode

2016-05-25 Thread Michael Armbrust
These APIs predate Datasets / encoders, so that is why they are Row instead of objects. We should probably rethink that. Honestly, I usually end up using the column expression version of explode now that it exists (i.e. explode($"arrayCol").as("Item")). It would be great to understand more why
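The column-expression form mentioned above, sketched for a hypothetical DataFrame `df` with an `id` column and an array column named `arrayCol`:

```scala
import org.apache.spark.sql.functions.explode

// Column-expression explode: one output row per array element,
// with the element exposed under the alias "Item".
val exploded = df.select($"id", explode($"arrayCol").as("Item"))
```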

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-24 Thread Michael Armbrust
> > i can't give you permissions -- that has to be (most likely) through > someone @ databricks, like michael. > Another clarification: not databricks, but the Apache Spark PMC grants access to the JIRA / wiki. That said... I'm not actually sure how it's done.

Re: Dataset Set Operations

2016-05-24 Thread Michael Armbrust
What is the schema of the case class? On Tue, May 24, 2016 at 3:46 PM, Tim Gautier wrote: > Hello All, > > I've been trying to subtract one dataset from another. Both datasets > contain case classes of the same type. When I subtract B from A, I end up > with a copy of A

[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-24 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298883#comment-15298883 ] Michael Armbrust commented on SPARK-15489: -- It should run in the same JVM when running in local

[jira] [Updated] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-24 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15489: - Target Version/s: 2.0.0 > Dataset kryo encoder fails on Collecti

[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297095#comment-15297095 ] Michael Armbrust commented on SPARK-15489: -- Wild guess... https://github.com/apache/spark/blob

[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296964#comment-15296964 ] Michael Armbrust commented on SPARK-15489: -- Also, does this problem exist in the 2.0 preview

[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296963#comment-15296963 ] Michael Armbrust commented on SPARK-15489: -- Is your registration making into the instance

Re: Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Michael Armbrust
Can you open a JIRA? On Sun, May 22, 2016 at 2:50 PM, Amit Sela wrote: > I've been using Encoders with Kryo to support encoding of generically > typed Java classes, mostly with success, in the following manner: > > public static Encoder encoder() { > return

Re: Dataset API and avro type

2016-05-23 Thread Michael Armbrust
AccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) > > 2016-05-22 22:02 GMT+02:00 Michael Armbrust <mich...@databricks.com>: > >

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Michael Armbrust
We did turn on Travis a few years ago, but ended up turning it off because it was failing (I believe because of insufficient resources), which was confusing for developers. I wouldn't be opposed to turning it on if it provides more/faster signal, but it's not obvious to me that it would. In

[jira] [Commented] (SPARK-15140) ensure input object of encoder is not null

2016-05-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296805#comment-15296805 ] Michael Armbrust commented on SPARK-15140: -- I don't think you should ever get a null row back

[jira] [Resolved] (SPARK-15471) ScalaReflection cleanup

2016-05-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15471. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13250

Re: Dataset API and avro type

2016-05-20 Thread Michael Armbrust
What is the error? I would definitely expect it to work with kryo at least. On Fri, May 20, 2016 at 2:37 AM, Han JU wrote: > Hello, > > I'm looking at the Dataset API in 1.6.1 and also in upcoming 2.0. However > it does not seems to work with Avro data types: > > >

Re: Wide Datasets (v1.6.1)

2016-05-20 Thread Michael Armbrust
> > I can provide an example/open a Jira if there is a chance this will be > fixed. > Please do! Ping me on it. Michael

[jira] [Resolved] (SPARK-15190) Support using SQLUserDefinedType for case classes

2016-05-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15190. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12965

[jira] [Updated] (SPARK-15308) RowEncoder should preserve nested column name.

2016-05-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15308: - Target Version/s: 2.0.0 > RowEncoder should preserve nested column n

[jira] [Updated] (SPARK-15313) EmbedSerializerInFilter rule should keep exprIds of output of surrounded SerializeFromObject.

2016-05-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15313: - Target Version/s: 2.0.0 > EmbedSerializerInFilter rule should keep exprIds of out

[jira] [Updated] (SPARK-15416) Display a better message for not finding classes removed in Spark 2.0

2016-05-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15416: - Assignee: Shixiong Zhu > Display a better message for not finding classes remo

[jira] [Resolved] (SPARK-15416) Display a better message for not finding classes removed in Spark 2.0

2016-05-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15416. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13201

Re: Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Michael Armbrust
> > 1. “val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)” > doesn’t work because “HiveContext not a member of > org.apache.spark.sql.hive” I checked the documentation, and it looks like > it should still work for spark-2.0.0-preview-bin-hadoop2.7.tgz > HiveContext has been
