[jira] [Created] (SPARK-19065) Bad error when using dropDuplicates in Streaming

2017-01-03 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-19065: Summary: Bad error when using dropDuplicates in Streaming Key: SPARK-19065 URL: https://issues.apache.org/jira/browse/SPARK-19065 Project: Spark

[jira] [Created] (SPARK-19031) JDBC Streaming Source

2016-12-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-19031: Summary: JDBC Streaming Source Key: SPARK-19031 URL: https://issues.apache.org/jira/browse/SPARK-19031 Project: Spark Issue Type: New Feature

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread Michael Armbrust
We don't support this yet, but I've opened this JIRA as it sounds generally useful: https://issues.apache.org/jira/browse/SPARK-19031 In the mean time you could try implementing your own Source, but that is pretty low level and is not yet a stable API. On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2016-12-27 Thread Michael Armbrust
An encoder uses reflection to generate expressions that can extract data out of an object (by calling methods on the object) and encode its contents directly into the

[jira] [Updated] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-12-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17344: - Target Version/s: (was: 2.1.1) > Kafka 0.8 support for Structured Stream

[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-12-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762756#comment-15762756 ] Michael Armbrust commented on SPARK-17344: -- [KAFKA-4462] aims to give us backwards compatibility

[jira] [Updated] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-12-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17344: - Target Version/s: 2.1.1 > Kafka 0.8 support for Structured Stream

[jira] [Updated] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18908: - Target Version/s: 2.1.1 > It's hard for the user to see the failure if StreamExecut

[jira] [Created] (SPARK-18932) Partial aggregation for collect_set / collect_list

2016-12-19 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18932: Summary: Partial aggregation for collect_set / collect_list Key: SPARK-18932 URL: https://issues.apache.org/jira/browse/SPARK-18932 Project: Spark

Re: How to get recent value in spark dataframe

2016-12-16 Thread Michael Armbrust
Oh and to get the null for missing years, you'd need to do an outer join with a table containing all of the years you are interested in. On Fri, Dec 16, 2016 at 3:24 PM, Michael Armbrust <mich...@databricks.com> wrote: > Are you looking for argmax? Here is an example > <https://

Re: How to get recent value in spark dataframe

2016-12-16 Thread Michael Armbrust
Are you looking for argmax? Here is an example. On Wed, Dec 14, 2016 at 8:49 PM, Milin korath wrote: > Hi
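
A minimal sketch of the argmax trick referenced above (the id/year/value column names are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("argmax-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one row per (id, year); we want each id's value for its latest year.
    val df = Seq((1, 2014, 10.0), (1, 2016, 30.0), (2, 2015, 20.0)).toDF("id", "year", "value")

    // "argmax": max over a struct orders by its first field, so the winning struct
    // carries the value from the most recent year along with it.
    df.groupBy($"id")
      .agg(max(struct($"year", $"value")) as "latest")
      .select($"id", $"latest.year", $"latest.value")
      .show()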

[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752929#comment-15752929 ] Michael Armbrust commented on SPARK-5632: - Hmm, I agree that error is confusing. It does work

[jira] [Updated] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-12-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18084: - Target Version/s: 2.2.0 > write.partitionBy() does not recognize nested colu

Re: Dataset encoders for further types?

2016-12-15 Thread Michael Armbrust
I would have sworn there was a ticket, but I can't find it. So here you go: https://issues.apache.org/jira/browse/SPARK-18891 A work around until that is fixed would be for you to manually specify the kryo encoder
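
Until SPARK-18891 is addressed, the Kryo workaround mentioned above looks roughly like this (a sketch; the wrapper type is a placeholder for whatever collection the built-in encoders reject):

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
    import scala.collection.immutable.Queue

    val spark = SparkSession.builder.appName("kryo-encoder-sketch").getOrCreate()

    // Placeholder type holding a collection the built-in encoders do not support.
    case class Wrapper(items: Queue[String])

    // Fall back to an opaque, Kryo-serialized binary encoder for the whole object.
    implicit val wrapperEncoder: Encoder[Wrapper] = Encoders.kryo[Wrapper]

    val ds = spark.createDataset(Seq(Wrapper(Queue("a", "b"))))
    ds.count()  // the payload round-trips, but is stored as a single binary blob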

[jira] [Created] (SPARK-18891) Support for specific collection types

2016-12-15 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18891: Summary: Support for specific collection types Key: SPARK-18891 URL: https://issues.apache.org/jira/browse/SPARK-18891 Project: Spark Issue Type

[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752771#comment-15752771 ] Michael Armbrust commented on SPARK-5632: - If you expand the commit you'll see its included

[jira] [Resolved] (SPARK-12777) Dataset fields can't be Scala tuples

2016-12-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12777. -- Resolution: Fixed Fix Version/s: 2.1.0 This works in 2.1: https://databricks

Re: When will multiple aggregations be supported in Structured Streaming?

2016-12-15 Thread Michael Armbrust
What is your use case? On Thu, Dec 15, 2016 at 10:43 AM, ljwagerfield wrote: > The current version of Spark (2.0.2) only supports one aggregation per > structured stream (and will throw an exception if multiple aggregations are > applied). > > Roughly when will

Re: [DataFrames] map function - 2.0

2016-12-15 Thread Michael Armbrust
Experimental in Spark really just means that we are not promising binary compatibly for those functions in the 2.x release line. For Datasets in particular, we want a few releases to make sure the APIs don't have any major gaps before removing the experimental tag. On Thu, Dec 15, 2016 at 1:17

Re: Cached Tables SQL Performance Worse than Uncached

2016-12-15 Thread Michael Armbrust
Its hard to comment on performance without seeing query plans. I'd suggest posting the result of an explain. On Thu, Dec 15, 2016 at 2:14 PM, Warren Kim wrote: > Playing with TPC-H and comparing performance between cached (serialized > in-memory tables) and
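
For reference, the plan he asks for comes from Dataset.explain (a one-line sketch, run in spark-shell where spark is predefined; the TPC-H table and column names are hypothetical):

    // Print the physical plan; pass true to also include the parsed, analyzed and optimized plans.
    spark.table("lineitem").groupBy("l_returnflag").count().explain(true)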

[jira] [Commented] (SPARK-17890) scala.ScalaReflectionException

2016-12-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749860#comment-15749860 ] Michael Armbrust commented on SPARK-17890: -- If I had to guess, I would guess that [this line

Re: [Spark-SQL] collect_list() support for nested collection

2016-12-13 Thread Michael Armbrust
Yes https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/4464261896877850/2840265927289860/latest.html On Tue, Dec 13, 2016 at 10:43 AM, Ninad Shringarpure wrote: > > Hi Team, > > Does Spark 2.0 support

[jira] [Updated] (SPARK-17689) _temporary files breaks the Spark SQL streaming job.

2016-12-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17689: - Target Version/s: 2.2.0 Description: Steps to reproduce: 1) Start a streaming

[jira] [Updated] (SPARK-18272) Test topic addition for subscribePattern on Kafka DStream and Structured Stream

2016-12-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18272: - Issue Type: Test (was: Bug) > Test topic addition for subscribePattern on Kafka DStr

[jira] [Updated] (SPARK-18790) Keep a general offset history of stream batches

2016-12-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18790: - Target Version/s: 2.1.0 > Keep a general offset history of stream batc

[jira] [Updated] (SPARK-18796) StreamingQueryManager should not hold a lock when starting a query

2016-12-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18796: - Target Version/s: 2.1.0 > StreamingQueryManager should not hold a lock when start

Re: When will Structured Streaming support stream-to-stream joins?

2016-12-08 Thread Michael Armbrust
I would guess Spark 2.3, but maybe sooner maybe later depending on demand. I created https://issues.apache.org/jira/browse/SPARK-18791 so people can describe their requirements / stay informed. On Thu, Dec 8, 2016 at 11:16 AM, ljwagerfield wrote: > Hi there, > >

[jira] [Created] (SPARK-18791) Stream-Stream Joins

2016-12-08 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18791: Summary: Stream-Stream Joins Key: SPARK-18791 URL: https://issues.apache.org/jira/browse/SPARK-18791 Project: Spark Issue Type: New Feature

Re: few basic questions on structured streaming

2016-12-08 Thread Michael Armbrust
> > 1. what happens if an event arrives few days late? Looks like we have an > unbound table with sorted time intervals as keys but I assume spark doesn't > keep several days worth of data in memory but rather it would checkpoint > parts of the unbound table to a storage at a specified interval

[jira] [Commented] (SPARK-17890) scala.ScalaReflectionException

2016-12-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15729905#comment-15729905 ] Michael Armbrust commented on SPARK-17890: -- Can you reproduce this with 2.1? If so, I think we

[jira] [Resolved] (SPARK-16902) Custom ExpressionEncoder for primitive array is not effective

2016-12-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-16902. -- Resolution: Not A Problem The encoder that is used is picked by scala's implicit

[jira] [Updated] (SPARK-18754) Rename recentProgresses to recentProgress

2016-12-06 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18754: - Target Version/s: 2.1.0 > Rename recentProgresses to recentProgr

[jira] [Created] (SPARK-18754) Rename recentProgresses to recentProgress

2016-12-06 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18754: Summary: Rename recentProgresses to recentProgress Key: SPARK-18754 URL: https://issues.apache.org/jira/browse/SPARK-18754 Project: Spark Issue Type

[jira] [Closed] (SPARK-18749) CLONE - checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-12-06 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-18749. Resolution: Invalid > CLONE - checkpointLocation being set in memory streams fail af

[jira] [Created] (SPARK-18749) CLONE - checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-12-06 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18749: Summary: CLONE - checkpointLocation being set in memory streams fail after restart. Should fail fast Key: SPARK-18749 URL: https://issues.apache.org/jira/browse/SPARK

[jira] [Closed] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-12-06 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-17921. Resolution: Won't Fix > checkpointLocation being set in memory streams fail after rest

Re: get corrupted rows using columnNameOfCorruptRecord

2016-12-06 Thread Michael Armbrust
.where("xxx IS NOT NULL") will give you the rows that couldn't be parsed. On Tue, Dec 6, 2016 at 6:31 AM, Yehuda Finkelstein < yeh...@veracity-group.com> wrote: > Hi all > > > > I’m trying to parse json using existing schema and got rows with NULL’s > > //get schema > > val df_schema =

Re: Writing DataFrame filter results to separate files

2016-12-05 Thread Michael Armbrust
> > 1. In my case, I'd need to first explode my data by ~12x to assign each > record to multiple 12-month rolling output windows. I'm not sure Spark SQL > would be able to optimize this away, combining it with the output writing > to do it incrementally. > You are right, but I wouldn't worry

Re: Writing DataFrame filter results to separate files

2016-12-05 Thread Michael Armbrust
If you repartition($"column") and then do .write.partitionBy("column") you should end up with a single file for each value of the partition column. On Mon, Dec 5, 2016 at 10:59 AM, Everett Anderson wrote: > Hi, > > I have a DataFrame of records with dates, and I'd like
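
A minimal sketch of that combination (the column name and output path are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("partitionBy-sketch").getOrCreate()
    import spark.implicits._

    val records = Seq(("2016-01", "a"), ("2016-01", "b"), ("2016-02", "c")).toDF("month", "payload")

    // Repartition by the same column first so each output directory gets a single file.
    records
      .repartition($"month")
      .write
      .partitionBy("month")
      .parquet("/path/to/output")  // hypothetical path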

Re: ability to provide custom serializers

2016-12-05 Thread Michael Armbrust
Lets start with a new ticket, link them and we can merge if the solution ends up working out for both cases. On Sun, Dec 4, 2016 at 5:39 PM, Erik LaBianca <erik.labia...@gmail.com> wrote: > Thanks Michael! > > On Dec 2, 2016, at 7:29 PM, Michael Armbrust <mich...@databricks.c

Re: ability to provide custom serializers

2016-12-02 Thread Michael Armbrust
I would love to see something like this. The closest related ticket is probably https://issues.apache.org/jira/browse/SPARK-7768 (though maybe there are enough people using UDTs in their current form that we should just make a new ticket) A few thoughts: - even if you can do implicit search, we

Re: Flink event session window in Spark

2016-12-02 Thread Michael Armbrust
Here is the JIRA for adding this feature: https://issues.apache.org/jira/browse/SPARK-10816 On Fri, Dec 2, 2016 at 11:20 AM, Fritz Budiyanto wrote: > Hi All, > > I need help on how to implement Flink event session window in Spark. Is > this possible? > > For instance, I

[jira] [Updated] (SPARK-18234) Update mode in structured streaming

2016-12-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18234: - Target Version/s: 2.2.0 > Update mode in structured stream

[jira] [Created] (SPARK-18682) Batch Source for Kafka

2016-12-01 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18682: Summary: Batch Source for Kafka Key: SPARK-18682 URL: https://issues.apache.org/jira/browse/SPARK-18682 Project: Spark Issue Type: New Feature

Re: [structured streaming] How to remove outdated data when use Window Operations

2016-12-01 Thread Michael Armbrust
Yes ! On Thu, Dec 1, 2016 at 12:57 PM, ayan guha wrote: > Thanks TD. Will it be available in pyspark too? > On 1 Dec 2016 19:55, "Tathagata Das" wrote: > >> In

[jira] [Reopened] (SPARK-18122) Fallback to Kryo for unknown classes in ExpressionEncoder

2016-11-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-18122: -- I'm going to reopen this. I think the benefits outweigh the compatibility concerns

[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification

2016-11-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17939: - Target Version/s: 2.1.0 > Spark-SQL Nullability: Optimizations vs. Enforcem

[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification

2016-11-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17939: - Target Version/s: 2.2.0 (was: 2.1.0) > Spark-SQL Nullability: Optimizations

[jira] [Created] (SPARK-18657) Persist UUID across query restart

2016-11-30 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18657: Summary: Persist UUID across query restart Key: SPARK-18657 URL: https://issues.apache.org/jira/browse/SPARK-18657 Project: Spark Issue Type: Bug

[jira] [Updated] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky

2016-11-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18588: - Target Version/s: 2.1.0 > KafkaSourceStressForDontFailOnDataLossSuite is fl

[jira] [Resolved] (SPARK-16545) Structured Streaming : foreachSink creates the Physical Plan multiple times per TriggerInterval

2016-11-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-16545. -- Resolution: Later > Structured Streaming : foreachSink creates the Physical P

[jira] [Updated] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server

2016-11-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18655: - Fix Version/s: (was: 2.1.0) > Ignore Structured Streaming 2.0.2 logs in hist

[jira] [Updated] (SPARK-18655) Ignore Structured Streaming 2.0.2 logs in history server

2016-11-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18655: - Target Version/s: 2.1.0 > Ignore Structured Streaming 2.0.2 logs in history ser

[jira] [Resolved] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18516. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15954

[jira] [Reopened] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-18475: -- > Be able to provide higher parallelization for StructuredStreaming Kafka Sou

[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706692#comment-15706692 ] Michael Armbrust commented on SPARK-18475: -- I think that this suggestion was closed prematurely

[jira] [Resolved] (SPARK-18498) Clean up HDFSMetadataLog API for better testing

2016-11-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18498. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15924

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-28 Thread Michael Armbrust
or us but the code below doesn't require me to pass > schema at all. > > import org.apache.spark.sql._ > val rdd = df2.rdd.map { case Row(j: String) => j } > spark.read.json(rdd).show() > > > On Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust <mich...@databricks

[jira] [Commented] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API

2016-11-28 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15703146#comment-15703146 ] Michael Armbrust commented on SPARK-18541: -- I don't think you can use {{as}} in python, as I

[jira] [Updated] (SPARK-18498) Clean up HDFSMetadataLog API for better testing

2016-11-28 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18498: - Target Version/s: 2.1.0 (was: 2.2.0) > Clean up HDFSMetadataLog API for better test

Re: Any equivalent method lateral and explore

2016-11-22 Thread Michael Armbrust
Both collect_list and explode are available in the function library. The following is an example of using it: df.select($"*", explode($"myArray") as 'arrayItem) On Tue, Nov 22, 2016 at 2:42 PM, Mahender Sarangam
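
A slightly fuller sketch (hypothetical column names) showing both directions: exploding an array column and collecting it back with collect_list:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{collect_list, explode}

    val spark = SparkSession.builder.appName("explode-collect-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "myArray")

    // Lateral-view-style flattening: one output row per array element.
    val exploded = df.select($"id", explode($"myArray") as "arrayItem")

    // And the inverse: gather the elements back into an array per id.
    val regrouped = exploded.groupBy($"id").agg(collect_list($"arrayItem") as "myArray")

    exploded.show()
    regrouped.show()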

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-22 Thread Michael Armbrust
ently. Any idea on > when 2.1 will be released? > > Thanks, > kant > > On Mon, Nov 21, 2016 at 5:12 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> In Spark 2.1 we've added a from_json >> <https://github.com/apache/spark/blob/master/sql/co

Re: How do I persist the data after I process the data with Structured streaming...

2016-11-22 Thread Michael Armbrust
We are looking to add a native JDBC sink in Spark 2.2. Until then you can write your own connector using df.writeStream.foreach. On Tue, Nov 22, 2016 at 12:55 PM, shyla deshpande wrote: > Hi, > > Structured streaming works great with Kafka source but I need to persist
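
Until a native JDBC sink exists, a connector written with foreach looks roughly like this (a sketch only; the JDBC URL, target table, and column positions are placeholders):

    import java.sql.{Connection, DriverManager, PreparedStatement}
    import org.apache.spark.sql.{ForeachWriter, Row}

    // Writes each row of a streaming query to a JDBC table.
    class JdbcWriter(url: String) extends ForeachWriter[Row] {
      var conn: Connection = _
      var stmt: PreparedStatement = _

      def open(partitionId: Long, version: Long): Boolean = {
        conn = DriverManager.getConnection(url)
        stmt = conn.prepareStatement("INSERT INTO events(key, value) VALUES (?, ?)")
        true
      }

      def process(row: Row): Unit = {
        stmt.setString(1, row.getString(0))
        stmt.setString(2, row.getString(1))
        stmt.executeUpdate()
      }

      def close(errorOrNull: Throwable): Unit = {
        if (stmt != null) stmt.close()
        if (conn != null) conn.close()
      }
    }

    // Usage: streamingDf.writeStream.foreach(new JdbcWriter("jdbc:postgresql://...")).start()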

Re: How do I persist the data after I process the data with Structured streaming...

2016-11-22 Thread Michael Armbrust
Forgot the link: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach On Tue, Nov 22, 2016 at 2:40 PM, Michael Armbrust <mich...@databricks.com> wrote: > We are looking to add a native JDBC sink in Spark 2.2. Until then you can > wr

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-21 Thread Michael Armbrust
In Spark 2.1 we've added a from_json function that I think will do what you want. On Fri, Nov 18, 2016 at 2:29 AM, kant kodali wrote: > This seem to work > >
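
A sketch of from_json as of Spark 2.1 (the column name and schema are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("from_json-sketch").getOrCreate()
    import spark.implicits._

    val blobs = Seq("""{"id": 1, "name": "alice"}""").toDF("json")

    val schema = new StructType().add("id", LongType).add("name", StringType)

    // Parse the JSON string column into a struct, then flatten it.
    blobs
      .select(from_json($"json", schema) as "parsed")
      .select("parsed.*")
      .show()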

Re: Stateful aggregations with Structured Streaming

2016-11-21 Thread Michael Armbrust
We are planning on adding mapWithState or something similar in a future release. In the mean time, standard Dataframe aggregations should work (count, sum, etc). If you are looking to do something custom, I'd suggest looking at Aggregators
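
The shape of a custom Aggregator he points to, as a minimal sketch (a trivial typed sum, just to show the API):

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
    import org.apache.spark.sql.expressions.Aggregator

    // A trivial typed Aggregator: sums the Long values fed to it.
    object LongSum extends Aggregator[Long, Long, Long] {
      def zero: Long = 0L
      def reduce(buffer: Long, value: Long): Long = buffer + value
      def merge(b1: Long, b2: Long): Long = b1 + b2
      def finish(reduction: Long): Long = reduction
      def bufferEncoder: Encoder[Long] = Encoders.scalaLong
      def outputEncoder: Encoder[Long] = Encoders.scalaLong
    }

    val spark = SparkSession.builder.appName("aggregator-sketch").getOrCreate()
    import spark.implicits._

    val ds = Seq(1L, 2L, 3L).toDS()
    ds.select(LongSum.toColumn).show()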

Re: Spark 2.0.2, Structured Streaming with kafka source... Unable to parse the value to Object..

2016-11-21 Thread Michael Armbrust
You could also do this with Datasets, which will probably be a little more efficient (since you are telling us you only care about one column) ds1.select($"value".as[Array[Byte]]).map(Student.parseFrom) On Thu, Nov 17, 2016 at 1:05 PM, shyla deshpande wrote: > Hello

Re: Create a Column expression from a String

2016-11-21 Thread Michael Armbrust
You are looking for org.apache.spark.sql.functions.expr() On Sat, Nov 19, 2016 at 6:12 PM, Stuart White wrote: > I'd like to allow for runtime-configured Column expressions in my > Spark SQL application. For example, if my application needs a 5-digit > zip code, but
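
A minimal sketch of expr() for a runtime-configured column expression (the column name and expression string are just examples):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    val spark = SparkSession.builder.appName("expr-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("12345-6789", "98765").toDF("zip_plus_4")

    // The expression comes from configuration at runtime rather than compile-time code.
    val zipExpr = "substring(zip_plus_4, 1, 5)"

    df.select(expr(zipExpr) as "zip5").show()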

[jira] [Created] (SPARK-18530) Kafka timestamp should be TimestampType

2016-11-21 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18530: Summary: Kafka timestamp should be TimestampType Key: SPARK-18530 URL: https://issues.apache.org/jira/browse/SPARK-18530 Project: Spark Issue Type

[jira] [Updated] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming

2016-11-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18339: - Priority: Critical (was: Major) > Don't push down current_timestamp for filt

[jira] [Updated] (SPARK-18513) Record and recover watermark

2016-11-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18513: - Priority: Blocker (was: Major) > Record and recover waterm

[jira] [Updated] (SPARK-18513) Record and recover watermark

2016-11-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18513: - Target Version/s: 2.1.0 > Record and recover waterm

[jira] [Created] (SPARK-18529) Timeouts shouldn't be AssertionErrors

2016-11-21 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18529: Summary: Timeouts shouldn't be AssertionErrors Key: SPARK-18529 URL: https://issues.apache.org/jira/browse/SPARK-18529 Project: Spark Issue Type

[jira] [Updated] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18516: - Description: There are two types of information that you want to be able to extract from

[jira] [Updated] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18516: - Description: There are two types of information that you want to be able to extract from

[jira] [Created] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-20 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18516: Summary: Separate instantaneous state from progress performance statistics Key: SPARK-18516 URL: https://issues.apache.org/jira/browse/SPARK-18516 Project

Re: Analyzing and reusing cached Datasets

2016-11-19 Thread Michael Armbrust
You are hitting a weird optimization in withColumn. Specifically, to avoid building up huge trees with chained calls to this method, we collapse projections eagerly (instead of waiting for the optimizer). Typically we look for cached data in between analysis and optimization, so that

Re: Multiple streaming aggregations in structured streaming

2016-11-18 Thread Michael Armbrust
Doing this generally is pretty hard. We will likely support algebraic aggregate eventually, but this is not currently slotted for 2.2. Instead I think we will add something like mapWithState that lets users compute arbitrary stateful things. What is your use case? On Wed, Nov 16, 2016 at 6:58

[jira] [Updated] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming

2016-11-18 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18339: - Assignee: Tyson Condie > Don't push down current_timestamp for filt

[jira] [Updated] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming

2016-11-18 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18339: - Target Version/s: 2.1.0 (was: 2.2.0) > Don't push down current_timestamp for filt

[jira] [Updated] (SPARK-18497) ForeachSink fails with "assertion failed: No plan for EventTimeWatermark"

2016-11-17 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18497: - Target Version/s: 2.1.0 > ForeachSink fails with "assertion failed:

[jira] [Updated] (SPARK-18497) ForeachSink fails with "assertion failed: No plan for EventTimeWatermark"

2016-11-17 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18497: - Priority: Critical (was: Major) > ForeachSink fails with "assertion failed:

[jira] [Resolved] (SPARK-18461) Improve docs on StreamingQueryListener and StreamingQuery.status

2016-11-16 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18461. -- Resolution: Fixed Issue resolved by pull request 15897 [https://github.com/apache

[jira] [Commented] (SPARK-17977) DataFrameReader and DataStreamReader should have an ancestor class

2016-11-16 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671287#comment-15671287 ] Michael Armbrust commented on SPARK-17977: -- No, they were actually the same class for a while

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Michael Armbrust
On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon wrote: > Maybe it sounds like you are looking for from_json/to_json functions after > en/decoding properly. > Which are new built-in functions that will be released with Spark 2.1.

[jira] [Resolved] (SPARK-18440) Fix FileStreamSink with aggregation + watermark + append mode

2016-11-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18440. -- Resolution: Fixed Issue resolved by pull request 15885 [https://github.com/apache

Re: getting encoder implicits to be more accurate

2016-11-14 Thread Michael Armbrust
ray-json and spray-json-shapeless to implement Row marshalling for >> arbitrary case classes. It's checked and generated at compile time, >> supports arbitrary/nested case classes, and allows custom types. It is also >> entirely pluggable meaning you can bypass the default implementation

[jira] [Updated] (SPARK-18407) Inferred partition columns cause assertion error

2016-11-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18407: - Description: [This assertion|https://github.com/apache/spark/blob

[jira] [Created] (SPARK-18407) Inferred partition columns cause assertion error

2016-11-10 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18407: Summary: Inferred partition columns cause assertion error Key: SPARK-18407 URL: https://issues.apache.org/jira/browse/SPARK-18407 Project: Spark

Re: type-safe join in the new DataSet API?

2016-11-10 Thread Michael Armbrust
You can groupByKey and then cogroup. On Thu, Nov 10, 2016 at 10:44 AM, Yang wrote: > the new DataSet API is supposed to provide type safety and type checks at > compile time https://spark.apache.org/docs/latest/structured- >
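
A sketch of the groupByKey plus cogroup pattern he suggests (the case classes and fields are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("cogroup-sketch").getOrCreate()
    import spark.implicits._

    case class Order(userId: Long, amount: Double)
    case class User(userId: Long, name: String)

    val orders = Seq(Order(1L, 10.0), Order(1L, 5.0), Order(2L, 7.0)).toDS()
    val users = Seq(User(1L, "alice"), User(2L, "bob")).toDS()

    // Type-safe "join": group both sides by key, then combine the grouped iterators.
    val joined = users.groupByKey(_.userId).cogroup(orders.groupByKey(_.userId)) {
      (key, userIter, orderIter) =>
        val total = orderIter.map(_.amount).sum  // materialize once; iterators are single-pass
        userIter.map(u => (u.name, total))
    }

    joined.show()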

[jira] [Commented] (SPARK-17691) Add aggregate function to collect list with maximum number of elements

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652723#comment-15652723 ] Michael Armbrust commented on SPARK-17691: -- I think that should be able to use mutable buffers

Re: Aggregations on every column on dataframe causing StackOverflowError

2016-11-09 Thread Michael Armbrust
wse/SPARK-18388 > > On Wed, Nov 9, 2016 at 3:08 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> Which version of Spark? Does seem like a bug. >> >> On Wed, Nov 9, 2016 at 10:06 AM, Raviteja Lokineni < >> raviteja.lokin...@gmail.com> wrote: >

[jira] [Updated] (SPARK-18388) Running aggregation on many columns throws SOE

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18388: - Component/s: (was: Spark Core) SQL > Running aggregation on m

Re: Aggregations on every column on dataframe causing StackOverflowError

2016-11-09 Thread Michael Armbrust
Which version of Spark? Does seem like a bug. On Wed, Nov 9, 2016 at 10:06 AM, Raviteja Lokineni < raviteja.lokin...@gmail.com> wrote: > Does this stacktrace look like a bug guys? Definitely seems like one to me. > > Caused by: java.lang.StackOverflowError > at

[jira] [Assigned] (SPARK-18211) Spark SQL ignores split.size

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-18211: Assignee: Michael Armbrust > Spark SQL ignores split.s

[jira] [Resolved] (SPARK-18211) Spark SQL ignores split.size

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18211. -- Resolution: Not A Problem As of Spark 2.0 we do our own splitting/bin-packing of files

[jira] [Updated] (SPARK-10816) EventTime based sessionization

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10816: - Target Version/s: (was: 2.2.0) > EventTime based sessionizat

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17937: - Target Version/s: (was: 2.1.0) > Clarify Kafka offset semantics for Structu

[jira] [Resolved] (SPARK-17879) Don't compact metadata logs constantly into a single compacted file

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-17879. -- Resolution: Not A Problem > Don't compact metadata logs constantly into a sin
