[jira] [Updated] (SPARK-16026) Cost-based Optimizer framework

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16026: - Target Version/s: 2.3.0 (was: 2.2.0) > Cost-based Optimizer framew

[jira] [Updated] (SPARK-18543) SaveAsTable(CTAS) using overwrite could change table definition

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18543: - Target Version/s: 2.3.0 (was: 2.2.0) > SaveAsTable(CTAS) using overwrite could cha

[jira] [Updated] (SPARK-18950) Report conflicting fields when merging two StructTypes.

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18950: - Target Version/s: 2.3.0 (was: 2.2.0) > Report conflicting fields when merging

[jira] [Updated] (SPARK-15117) Generate code that get a value in each compressed column from CachedBatch when DataFrame.cache() is called

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15117: - Target Version/s: 2.3.0 (was: 2.2.0) > Generate code that get a value in e

[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17626: - Target Version/s: 2.3.0 (was: 2.2.0) > TPC-DS performance improvements using s

[jira] [Updated] (SPARK-15867) Use bucket files for TABLESAMPLE BUCKET

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15867: - Target Version/s: 2.3.0 (was: 2.2.0) > Use bucket files for TABLESAMPLE BUC

[jira] [Updated] (SPARK-16275) Implement all the Hive fallback functions

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16275: - Target Version/s: 2.3.0 (was: 2.2.0) > Implement all the Hive fallback functi

[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12978: - Target Version/s: 2.3.0 (was: 2.2.0) > Skip unnecessary final group-by when input d

[jira] [Updated] (SPARK-16412) Generate Java code that gets an array in each column of CachedBatch when DataFrame.cache() is called

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16412: - Target Version/s: 2.3.0 (was: 2.2.0) > Generate Java code that gets an array in e

[jira] [Updated] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4502: Target Version/s: 2.3.0 (was: 2.2.0) > Spark SQL reads unneccesary nested fields f

[jira] [Updated] (SPARK-16217) Support SELECT INTO statement

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16217: - Target Version/s: 2.3.0 (was: 2.2.0) > Support SELECT INTO statem

[jira] [Updated] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18084: - Target Version/s: 2.3.0 (was: 2.2.0) > write.partitionBy() does not recognize nes

[jira] [Updated] (SPARK-17924) Consolidate streaming and batch write path

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17924: - Target Version/s: 2.3.0 (was: 2.2.0) > Consolidate streaming and batch write p

[jira] [Updated] (SPARK-19150) completely support using hive as data source to create tables

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19150: - Target Version/s: 2.3.0 (was: 2.2.0) > completely support using hive as data sou

[jira] [Updated] (SPARK-16452) basic INFORMATION_SCHEMA support

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16452: - Target Version/s: 2.3.0 (was: 2.2.0) > basic INFORMATION_SCHEMA supp

[jira] [Updated] (SPARK-16483) Unifying struct fields and columns

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16483: - Target Version/s: 2.3.0 (was: 2.2.0) > Unifying struct fields and colu

[jira] [Updated] (SPARK-16390) Dataset API improvements

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16390: - Target Version/s: 2.3.0 (was: 2.2.0) > Dataset API improveme

[jira] [Updated] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16196: - Target Version/s: 2.3.0 (was: 2.2.0) > Optimize in-memory scan performance us

[jira] [Updated] (SPARK-7768) Make user-defined type (UDT) API public

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7768: Target Version/s: 2.3.0 (was: 2.2.0) > Make user-defined type (UDT) API pub

[jira] [Updated] (SPARK-17203) data source options should always be case insensitive

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17203: - Target Version/s: 2.3.0 (was: 2.2.0) > data source options should always be c

[jira] [Updated] (SPARK-19242) SHOW CREATE TABLE should generate new syntax to create hive table

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19242: - Target Version/s: 2.3.0 (was: 2.2.0) > SHOW CREATE TABLE should generate new syn

[jira] [Updated] (SPARK-17528) MutableProjection should not cache content from the input row

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17528: - Target Version/s: 2.3.0 (was: 2.2.0) > MutableProjection should not cache content f

[jira] [Updated] (SPARK-16323) Avoid unnecessary cast when doing integral divide

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16323: - Target Version/s: 2.3.0 (was: 2.2.0) > Avoid unnecessary cast when doing integ

[jira] [Resolved] (SPARK-20854) extend hint syntax to support any expression, not just identifiers or strings

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-20854. -- Resolution: Fixed https://github.com/apache/spark/pull/18086 > extend hint syn

[jira] [Updated] (SPARK-15420) Repartition and sort before Parquet writes

2017-06-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15420: - Target Version/s: 2.3.0 (was: 2.2.0) > Repartition and sort before Parquet wri

[jira] [Updated] (SPARK-20940) AccumulatorV2 should not throw IllegalAccessError

2017-05-31 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20940: - Target Version/s: 2.2.0 > AccumulatorV2 should not throw IllegalAccessEr

Re: Running into the same problem as JIRA SPARK-20325

2017-05-31 Thread Michael Armbrust
> > So, my question is the same as stated in the following ticket which is Do > we need create a checkpoint directory for each individual query? > Yes. Checkpoints record what data has been processed. Thus two different queries need their own checkpoints.
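As a minimal sketch of the point above (each query records its own progress, so each needs its own checkpoint), the following starts two queries over one input with separate checkpoint directories. The helper `checkpointDir`, the base path `/tmp/checkpoints`, and the query names are assumptions for illustration, and the built-in `rate` test source requires Spark 2.2 or later.

```scala
import org.apache.spark.sql.SparkSession

object PerQueryCheckpoints {
  // Hypothetical helper: one checkpoint directory per query name under a shared base.
  def checkpointDir(base: String, queryName: String): String =
    s"${base.stripSuffix("/")}/$queryName"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("per-query-checkpoints")
      .getOrCreate()
    import spark.implicits._

    val base = "/tmp/checkpoints" // assumed location, not from the thread
    val input = spark.readStream.format("rate").load() // columns: timestamp, value

    // Query 1: running count, with its own checkpoint directory.
    input.groupBy().count()
      .writeStream
      .queryName("counts")
      .outputMode("complete")
      .format("console")
      .option("checkpointLocation", checkpointDir(base, "counts"))
      .start()

    // Query 2: filtered rows, with a different checkpoint directory.
    input.filter($"value" % 2 === 0)
      .writeStream
      .queryName("evens")
      .outputMode("append")
      .format("console")
      .option("checkpointLocation", checkpointDir(base, "evens"))
      .start()

    spark.streams.awaitAnyTermination()
  }
}
```

Pointing both queries at the same directory would mix up their progress records, which is why the sketch derives the path from the query name.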

[jira] [Created] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-05-30 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-20928: Summary: Continuous Processing Mode for Structured Streaming Key: SPARK-20928 URL: https://issues.apache.org/jira/browse/SPARK-20928 Project: Spark

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-30 Thread Michael Armbrust
> > Michael, > > If you haven't started cutting the new RC, I'm working on a documentation > PR right now I'm hoping we can get into Spark 2.2 as a migration note, even > if it's just a mention: https://issues.apache.org/jira/browse/SPARK-20888. > > Michael > >

[jira] [Updated] (SPARK-20462) Spark-Kinesis Direct Connector

2017-05-26 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20462: - Component/s: (was: Input/Output) DStreams > Spark-Kinesis Dir

[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die

2017-05-26 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026634#comment-16026634 ] Michael Armbrust commented on SPARK-20843: -- I don't have much context here /cc [~zsxwing

[jira] [Commented] (SPARK-20897) cached self-join should not fail

2017-05-26 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026588#comment-16026588 ] Michael Armbrust commented on SPARK-20897: -- Is this a regression? If so, can you please make

Re: Running into the same problem as JIRA SPARK-19268

2017-05-24 Thread Michael Armbrust
-dev Have you tried clearing out the checkpoint directory? Can you also give the full stack trace? On Wed, May 24, 2017 at 3:45 PM, kant kodali wrote: > Even if I do simple count aggregation like below I get the same error as >

[jira] [Updated] (SPARK-20865) caching dataset throws "Queries with streaming sources must be executed with writeStream.start()"

2017-05-24 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20865: - Description: {code} SparkSession .builder .master("local[*]")

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Michael Armbrust
coalesce is nice because it does not shuffle, but the consequence of avoiding a shuffle is it will also reduce parallelism of the preceding computation. Have you tried using repartition instead? On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi < andrii.bilets...@yahoo.com.invalid> wrote: > Hi
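The trade-off described above can be sketched as follows. The input (`spark.range` over 8 partitions) is a stand-in for an expensive upstream computation and is an assumption, not the original poster's job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CoalesceVsRepartition {
  // Stand-in for an expensive upstream computation spread over 8 partitions.
  def source(spark: SparkSession): DataFrame =
    spark.range(0L, 100000L, 1L, 8).selectExpr("id * 2 AS doubled")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[4]")
      .appName("coalesce-vs-repartition")
      .getOrCreate()

    val df = source(spark)

    // No shuffle: the upstream projection is pulled into the single remaining task.
    val narrowed = df.coalesce(1)

    // Shuffle: the upstream projection still runs as 8 tasks, then rows are exchanged.
    val shuffled = df.repartition(1)

    narrowed.explain() // physical plan contains a Coalesce node
    shuffled.explain() // physical plan contains an Exchange node

    println(s"coalesce: ${narrowed.rdd.getNumPartitions} partition(s)")
    println(s"repartition: ${shuffled.rdd.getNumPartitions} partition(s)")
    spark.stop()
  }
}
```

Both variants end with one output partition; the difference is how many tasks execute the work that precedes the write.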

Re: Are there any Kafka forEachSink examples?

2017-05-23 Thread Michael Armbrust
There is an example in this post: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html On Tue, May 23, 2017 at 11:35 AM, kant kodali wrote: > Hi All, > > Are there any Kafka forEachSink examples

Re: 2.2. release date ?

2017-05-23 Thread Michael Armbrust
Mark is right. I will cut another RC as soon as the known issues are resolved. In the meantime, it would be very helpful for people to test RC2 and report issues. On Tue, May 23, 2017 at 11:10 AM, Mark Hamstra wrote: > I heard that once we reach release candidates it's

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-22 Thread Michael Armbrust
>>> We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they are >>> essentially all for documentation. >>> >>> Joseph >>> >>> On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin <van...@cloudera.com> >>> wrote: >>> >

[jira] [Created] (SPARK-20844) Remove experimental from API and docs

2017-05-22 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-20844: Summary: Remove experimental from API and docs Key: SPARK-20844 URL: https://issues.apache.org/jira/browse/SPARK-20844 Project: Spark Issue Type

[jira] [Updated] (SPARK-20599) ConsoleSink should work with write (batch)

2017-05-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20599: - Summary: ConsoleSink should work with write (batch) (was: KafkaSourceProvider should

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-22 Thread Michael Armbrust
There is an RC here. Please test! http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-2-0-RC2-td21497.html On Fri, May 19, 2017 at 4:07 PM, kant kodali wrote: > Hi Patrick, > > I am using 2.1.1 and I tried the above code you sent and I get > >

Re: How to see the full contents of dataset or dataframe is structured streaming?

2017-05-18 Thread Michael Armbrust
You can write it to the memory sink. df.writeStream.format("memory").queryName("myStream").start() spark.table("myStream").show() On Wed, May 17, 2017 at 7:55 PM, kant kodali wrote: > Hi All, > > How to see the full contents of dataset or dataframe is structured >
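A slightly fuller sketch of the memory-sink approach above, assuming the built-in `rate` test source (Spark 2.2 or later) as input and an arbitrary wait before reading the table. The memory sink keeps all rows on the driver, so it is only suitable for small debugging runs.

```scala
import org.apache.spark.sql.SparkSession

object MemorySinkPreview {
  // Runs a short streaming query into the memory sink and returns how many rows landed.
  def run(spark: SparkSession, waitMs: Long): Long = {
    val df = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val query = df.writeStream
      .format("memory")
      .queryName("myStream") // becomes the name of the in-memory table
      .outputMode("append")
      .start()

    query.awaitTermination(waitMs) // returns after the timeout; lets some batches complete

    val table = spark.table("myStream")
    table.show(100, truncate = false) // untruncated cell contents
    val rows = table.count()
    query.stop()
    rows
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("memory-sink-preview")
      .getOrCreate()
    println(s"rows in memory table: ${run(spark, 5000L)}")
    spark.stop()
  }
}
```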

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread Michael Armbrust
is in all spark machines under SPARK_HOME/jars. > > Still same error seems to persist. Is that the right jar or is there > anything else I need to add? > > Thanks! > > > > On Tue, May 16, 2017 at 1:40 PM, Michael Armbrust <mich...@databricks.com> > wrote: >

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread Michael Armbrust
Looks like you are missing the kafka dependency. On Tue, May 16, 2017 at 1:04 PM, kant kodali wrote: > Looks like I am getting the following runtime exception. I am using Spark > 2.1.0 and the following jars > > *spark-sql_2.11-2.1.0.jar* > >

Re: what is the difference between json format vs kafka format?

2017-05-15 Thread Michael Armbrust
For that simple count, you don't actually have to even parse the JSON data. You can just do a count. The following code assumes you are running Spark 2.2.

Re: Spark SQL DataFrame to Kafka Topic

2017-05-15 Thread Michael Armbrust
The foreach sink from that blog post requires that you have a DataFrame with two columns in the form of a Tuple2, (String, String), whereas your DataFrame has only a single column `payload`. You could change the KafkaSink to extend ForeachWriter[KafkaMessage] and then it would work. I'd also

Re: Reading Avro messages from Kafka using Structured Streaming in Spark 2.1

2017-05-12 Thread Michael Armbrust
I believe that Avro/Kafka messages have a few bytes at the beginning of the message to denote which schema is being used. Have you tried using the KafkaAvroDecoder inside of the map instead? On Fri, May 12, 2017 at 9:26 AM, Revin Chalil wrote: > Just following up on this;

Re: Convert DStream into Streaming Dataframe

2017-05-12 Thread Michael Armbrust
Are there any particular things that the DataFrame or Dataset API are missing? On Fri, May 12, 2017 at 9:49 AM, Tejinder Aulakh wrote: > Hi, > > Is there any way to convert a DStream to a streaming dataframe? I want to > use Structured streaming in a new common module

[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError

2017-05-11 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20666: - Target Version/s: 2.2.0 > Flaky test - SparkListenerBus randomly fail

[jira] [Commented] (SPARK-20376) Make StateStoreProvider plugable

2017-05-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003729#comment-16003729 ] Michael Armbrust commented on SPARK-20376: -- /cc [~tdas] > Make StateStoreProvider pluga

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Michael Armbrust
s. if that's clear, I could probably annotate my > bean class properly > > On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust <mich...@databricks.com> > wrote: > >> I think you are supposed to set BeanProperty on a var as they do here >> <https://github.com/apache/

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Michael Armbrust
eDataFrame( > SparkSession.scala:251) > at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:278) > ... 54 elided > > On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust <mich...@databricks.com> > wrote: > >> I think you are supposed to set Bea

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Michael Armbrust
I think you are supposed to set BeanProperty on a var as they do here. If you are using Scala though I'd consider using the case

[VOTE] Apache Spark 2.2.0 (RC2)

2017-05-04 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Tues, May 9th, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because ...

[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification

2017-05-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17939: - Target Version/s: 2.3.0 (was: 2.2.0) > Spark-SQL Nullability: Optimizations

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-03 Thread Michael Armbrust
h Sean. Spark only pulls in parquet-avro for tests. For >>>>>>> execution, it implements the record materialization APIs in Parquet to >>>>>>> go >>>>>>> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8

Re: What are Analysis Errors With respect to Spark Sql DataFrames and DataSets?

2017-05-03 Thread Michael Armbrust
> > if I do dataset.select("nonExistentColumn") then the Analysis Error is > thrown at compile time right? > if you do df.as[MyClass].map(_.badFieldName) you will get a compile error. However, if df doesn't have the right columns for MyClass, that error will only be thrown at runtime (whether DF

Re: What are Analysis Errors With respect to Spark Sql DataFrames and DataSets?

2017-05-03 Thread Michael Armbrust
An analysis exception occurs whenever the scala/java/python program is valid, but the dataframe operations being performed are not. For example, df.select("nonExistentColumn") would throw an analysis exception. On Wed, May 3, 2017 at 1:38 PM, kant kodali wrote: > Hi All, >
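To illustrate the distinction drawn above, here is a small sketch. The `Person` case class and the sample rows are assumptions made up for the example.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.Try

object AnalysisErrors {
  case class Person(name: String, age: Int) // hypothetical type for illustration

  // The program compiles, but the column does not exist: AnalysisException at runtime.
  def missingColumnFails(spark: SparkSession): Boolean = {
    import spark.implicits._
    val df = Seq(("alice", 30)).toDF("name", "age")
    Try(df.select("nonExistentColumn")).failed.toOption
      .exists(_.isInstanceOf[AnalysisException])
  }

  // The typed API catches bad field names at compile time:
  //   df.as[Person].map(_.badFieldName)   // rejected by the Scala compiler
  // but a DataFrame whose columns do not match Person only fails at runtime.
  def wrongShapeFails(spark: SparkSession): Boolean = {
    import spark.implicits._
    val other = Seq("x").toDF("other")
    Try(other.as[Person].collect()).isFailure
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("analysis-errors")
      .getOrCreate()
    println(s"select on missing column fails: ${missingColumnFails(spark)}")
    println(s"as[Person] on mismatched columns fails: ${wrongShapeFails(spark)}")
    spark.stop()
  }
}
```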

[jira] [Updated] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20569: - Affects Version/s: 2.2.0 > RuntimeReplaceable functions accept invalid third parame

[jira] [Commented] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995423#comment-15995423 ] Michael Armbrust commented on SPARK-20569: -- [~rxin] this does seem like a bug

[jira] [Updated] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20569: - Summary: RuntimeReplaceable functions accept invalid third parameter (was: In spark-sql

[jira] [Updated] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-05-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19104: - Affects Version/s: 2.2.0 Target Version/s: 2.2.0 > CompileException with

[jira] [Updated] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-05-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19104: - Description: The following code will run with Spark 2.0.2 but not with Spark 2.1.0

Re: [ANNOUNCE] Apache Spark 2.1.1

2017-05-03 Thread Michael Armbrust
fir.ma...@equalum.io > > On Wed, May 3, 2017 at 1:18 AM, Michael Armbrust <mich...@databricks.com> > wrote: > >> We are happy to announce the availability of Spark 2.1.1! >> >> Apache Spark 2.1.1 is a maintenance release, based on the branch-2.1 >> main

[jira] [Commented] (SPARK-20570) The main version number on docs/latest/index.html

2017-05-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995248#comment-15995248 ] Michael Armbrust commented on SPARK-20570: -- Hmmm, I did push them, and they show up on the [asf

[jira] [Created] (SPARK-20567) Failure to bind when using explode and collect_set in streaming

2017-05-02 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-20567: Summary: Failure to bind when using explode and collect_set in streaming Key: SPARK-20567 URL: https://issues.apache.org/jira/browse/SPARK-20567 Project

[ANNOUNCE] Apache Spark 2.1.1

2017-05-02 Thread Michael Armbrust
We are happy to announce the availability of Spark 2.1.1! Apache Spark 2.1.1 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release. To download Apache Spark 2.1.1 visit

Re: Spark 2.2.0 or Spark 2.3.0?

2017-05-02 Thread Michael Armbrust
An RC for 2.2.0 was released last week. Please test. Note that update mode has been supported since 2.0. On Mon, May 1, 2017 at 10:43 PM, kant kodali wrote: > Hi All, > > If I understand the Spark

[jira] [Updated] (SPARK-20547) ExecutorClassLoader's findClass may not work correctly when a task is cancelled.

2017-05-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20547: - Affects Version/s: 2.2.0 Target Version/s: 2.2.0 > ExecutorClassLoader's findCl

Re: Schema Evolution for nested Dataset[T]

2017-05-01 Thread Michael Armbrust
Oh, and if you want a default other than null: import org.apache.spark.sql.functions._ df.withColumn("address", coalesce($"address", lit())) On Mon, May 1, 2017 at 10:29 AM, Michael Armbrust <mich...@databricks.com> wrote: > The following should work

Re: Schema Evolution for nested Dataset[T]

2017-05-01 Thread Michael Armbrust
The following should work: val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema spark.read.schema(schema).parquet("data.parquet").as[Course] Note this will only work for nullable fields (i.e. if you add a primitive like Int you need to make it an Option[Int]) On Sun, Apr 30, 2017
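A worked sketch of this approach, under stated assumptions: the fields of `Course`, the older `CourseV1` shape, the temporary path, and the default value "unknown" are all invented for illustration (the thread does not show them).

```scala
import org.apache.spark.sql.{Encoder, SparkSession}
import org.apache.spark.sql.functions.{coalesce, lit}

object SchemaEvolution {
  case class CourseV1(id: Int, name: String)                       // assumed old shape
  case class Course(id: Int, name: String, address: Option[String]) // new field is optional

  // Writes old-shape data, then reads it back with the new type's schema.
  def readEvolved(spark: SparkSession): (Seq[Course], Seq[Course]) = {
    import spark.implicits._
    val path = java.nio.file.Files.createTempDirectory("courses")
      .resolve("data.parquet").toString
    Seq(CourseV1(1, "algebra")).toDS().write.parquet(path)

    // Schema derived from the target type; the column missing on disk is read as null.
    val schema = implicitly[Encoder[Course]].schema
    val courses = spark.read.schema(schema).parquet(path).as[Course]

    // Optional: substitute a default for the missing value instead of null.
    val withDefault = courses.toDF()
      .withColumn("address", coalesce($"address", lit("unknown")))
      .as[Course]

    (courses.collect().toSeq, withDefault.collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("schema-evolution")
      .getOrCreate()
    val (raw, defaulted) = readEvolved(spark)
    println(raw)
    println(defaulted)
    spark.stop()
  }
}
```

Declaring the added field as an `Option` is what lets rows written before the field existed decode cleanly.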

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

2017-05-01 Thread Michael Armbrust
He's just suggesting that since the DataStreamWriter start() method can fill in an option named "path", we should make that a synonym for "topic". Then you could do something like: df.writeStream.format("kafka").start("topic") Seems reasonable if people don't think that is confusing. On Mon,

[jira] [Updated] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results

2017-04-28 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20364: - Target Version/s: 2.2.0 Priority: Critical (was: Major) > Parquet predic

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Michael Armbrust
t we normally cut an RC after those things are ready? > > On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <mich...@databricks.com> > wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00

Re: [VOTE] Apache Spark 2.1.1 (RC4)

2017-04-27 Thread Michael Armbrust
I'll also +1 On Thu, Apr 27, 2017 at 4:20 AM, Sean Owen <so...@cloudera.com> wrote: > +1 , same result as with the last RC. All checks out for me. > > On Thu, Apr 27, 2017 at 1:29 AM Michael Armbrust <mich...@databricks.com> > wrote: > >> Please vote on

[VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because ...

[VOTE] Apache Spark 2.1.1 (RC4)

2017-04-26 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.1.1. The vote is open until Sat, April 29th, 2017 at 18:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.1 [ ] -1 Do not release this package because ...

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983820#comment-15983820 ] Michael Armbrust commented on SPARK-18057: -- I guess I'd like to understand more about what

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Michael Armbrust
>>>>>> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will >>>>>> only scan all table files only once, and write back the inferred schema >>>>>> to >>>>>> metastore so that we don't need to do the

Re: Arraylist is empty after JavaRDD.foreach

2017-04-24 Thread Michael Armbrust
Foreach runs on the executors and so is not able to modify an array list that is only present on the driver. You should just call collectAsList on the DataFrame. On Mon, Apr 24, 2017 at 10:36 AM, Devender Yadav < devender.ya...@impetus.co.in> wrote: > Hi All, > > > I am using Spark 1.6.2 and
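The behaviour described above can be sketched like this (in Scala rather than the Java of the original question; the sample data is an assumption). The closure passed to `foreach` is serialized and shipped to the tasks, so additions go to a copy of the list rather than to the driver's instance.

```scala
import org.apache.spark.sql.SparkSession

object ForeachVsCollect {
  // Returns (size of the driver-side list after foreach, size of the collected list).
  def compare(spark: SparkSession): (Int, Int) = {
    import spark.implicits._
    val df = Seq("a", "b", "c").toDF("value")

    // Anti-pattern: each task adds to its own deserialized copy of `local`,
    // so the driver's list typically stays empty.
    val local = new java.util.ArrayList[String]()
    df.rdd.foreach(row => local.add(row.getString(0)))

    // Recommended: bring the rows back to the driver explicitly.
    val collected = df.collectAsList()

    (local.size(), collected.size())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("foreach-vs-collect")
      .getOrCreate()
    val (viaForeach, viaCollect) = compare(spark)
    println(s"driver list after foreach: $viaForeach element(s)")
    println(s"collectAsList: $viaCollect element(s)")
    spark.stop()
  }
}
```

`collectAsList` pulls every row to the driver, so it is only appropriate when the result fits in driver memory.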

[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979156#comment-15979156 ] Michael Armbrust edited comment on SPARK-18057 at 4/21/17 9:10 PM

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979382#comment-15979382 ] Michael Armbrust commented on SPARK-18057: -- Yes, 0.10.2.0 is the first release that promises

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979156#comment-15979156 ] Michael Armbrust commented on SPARK-18057: -- [~srowen], thanks for reporting, but based

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979126#comment-15979126 ] Michael Armbrust commented on SPARK-18057: -- If there are multiple reports of 0.10.2.0 being more

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-21 Thread Michael Armbrust
e actively investigating to find the > root cause of this problem, and specifically whether this is a problem in > the Spark codebase or not. I will report back when I have an answer to that > question. > > Michael > > > On Apr 18, 2017, at 11:59 AM, Michael Armbrust <mich...@databrick

[jira] [Reopened] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data

2017-04-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-16548: -- I'm not sure I agree. The default behavior for parsing corrupted JSON is to return

[jira] [Reopened] (SPARK-18891) Support for specific collection types

2017-04-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-18891: -- > Support for specific collection ty

[jira] [Resolved] (SPARK-18891) Support for specific collection types

2017-04-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18891. -- Resolution: Fixed Fix Version/s: 2.2.0 > Support for specific collection ty

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-18 Thread Michael Armbrust
In case it wasn't obvious by the appearance of RC3, this vote failed. On Thu, Mar 30, 2017 at 4:09 PM, Michael Armbrust <mich...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.1.1. The vote is open until Sun, April 2nd, 2017

[VOTE] Apache Spark 2.1.1 (RC3)

2017-04-18 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.1 [ ] -1 Do not release this package because ...

branch-2.2 has been cut

2017-04-18 Thread Michael Armbrust
I just cut the release branch for Spark 2.2. If you are merging important bug fixes, please backport as appropriate. If you have doubts if something should be backported, please ping me. I'll follow with an RC later this week.

Re: 2.2 branch

2017-04-17 Thread Michael Armbrust
I'm going to cut branch-2.2 tomorrow morning. On Thu, Apr 13, 2017 at 11:02 AM, Michael Armbrust <mich...@databricks.com> wrote: > Yeah, I was delaying until 2.1.1 was out and some of the hive questions > were resolved. I'll make progress on that by the end of the week. Let's

[jira] [Commented] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset

2017-04-17 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971491#comment-15971491 ] Michael Armbrust commented on SPARK-20299: -- What input are you looking

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Michael Armbrust
the Jenkins cluster is a bit on the older side). >> >>>> >> >>>> On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau <hol...@pigscanfly.ca> >> >>>> wrote: >> >>>>> >> >>>>> So the fix is installing pandoc on

Re: SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Michael Armbrust
> > 1) could we update documentation for Structured Streaming and describe > that checkpointing could be specified by > spark.sql.streaming.checkpointLocation > on SparkSession level and thus automatically checkpoint dirs will be > created per foreach query? > > Sure, please open a pull request.

[jira] [Resolved] (SPARK-16899) Structured Streaming Checkpointing Example invalid

2017-04-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-16899. -- Resolution: Not A Problem This has been fixed. I believe you are using an old version

[jira] [Updated] (SPARK-16899) Structured Streaming Checkpointing Example invalid

2017-04-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16899: - Component/s: Structured Streaming > Structured Streaming Checkpointing Example inva

Re: 2.2 branch

2017-04-13 Thread Michael Armbrust
Yeah, I was delaying until 2.1.1 was out and some of the Hive questions were resolved. I'll make progress on that by the end of the week. Let's aim for 2.2 branch cut next week. On Thu, Apr 13, 2017 at 8:56 AM, Koert Kuipers wrote: > i see there is no 2.2 branch yet for
