Re: [vote] Apache Spark 3.0 RC3

2020-06-08 Thread Michael Armbrust
+1 (binding) On Mon, Jun 8, 2020 at 1:22 PM DB Tsai wrote: > +1 (binding) > > Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP Key ID: 42E5B25A8F7A82C1 > > On Mon, Jun 8, 2020 at 1:03 PM Dongjoon Hyun > wrote: > > > > +1 >

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-04-02 Thread Michael Armbrust (Jira)
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073951#comment-17073951 ] Michael Armbrust commented on SPARK-29358: -- Sure, but it is very easy to make
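For context, this request eventually landed: since Spark 3.1, `unionByName` accepts an `allowMissingColumns` flag. A minimal sketch, assuming `spark` is a running SparkSession:

```scala
// Sketch (Spark 3.1+): unionByName with allowMissingColumns fills
// columns absent on either side with nulls.
import spark.implicits._

val left  = Seq((1, "a")).toDF("id", "name")
val right = Seq((2, 0.5)).toDF("id", "score")

// Without the flag this fails analysis; with it, `score` is null for
// rows from `left` and `name` is null for rows from `right`.
val merged = left.unionByName(right, allowMissingColumns = true)
```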

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-03-31 Thread Michael Armbrust (Jira)
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071968#comment-17071968 ] Michael Armbrust commented on SPARK-29358: -- I think we should reconsider closing this as won't

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Michael Armbrust
> > What I'd oppose is to just ban char for the native data sources, and do > not have a plan to address this problem systematically. > +1 > Just forget about padding, like what Snowflake and MySQL have done. > Document that char(x) is just an alias for string. And then move on. Almost > no

[jira] [Commented] (SPARK-31136) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-03-12 Thread Michael Armbrust (Jira)
[ https://issues.apache.org/jira/browse/SPARK-31136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058287#comment-17058287 ] Michael Armbrust commented on SPARK-31136: -- How hard would it be to add support for "LOAD

[jira] [Commented] (SPARK-31136) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-03-12 Thread Michael Armbrust (Jira)
[ https://issues.apache.org/jira/browse/SPARK-31136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058125#comment-17058125 ] Michael Armbrust commented on SPARK-31136: -- What was the default before, hive sequence files

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-11 Thread Michael Armbrust
Thank you for the discussion everyone! This vote passes. I'll work to get this posted on the website. +1 Michael Armbrust Sean Owen Jules Damji 大啊 Ismaël Mejía Wenchen Fan Matei Zaharia Gengliang Wang Takeshi Yamamuro Denny Lee Xiao Li Xingbo Jiang Takuya UESHIN Michael Heuer John Zhuge Reynold Xin

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-06 Thread Michael Armbrust
I'll start off the vote with a strong +1 (binding). On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust wrote: > I propose to add the following text to Spark's Semantic Versioning policy > <https://spark.apache.org/versioning-policy.html> and adopt it as the > rubric that shou

[VOTE] Amend Spark's Semantic Versioning Policy

2020-03-06 Thread Michael Armbrust
I propose to add the following text to Spark's Semantic Versioning policy and adopt it as the rubric that should be used when deciding to break APIs (even at major versions such as 3.0). I'll leave the vote open until Tuesday, March 10th at 2pm.

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-27 Thread Michael Armbrust
Thanks for the discussion! A few responses: The decision needs to happen at api/config change time, otherwise the > deprecated warning has no purpose if we are never going to remove them. > Even if we never remove an API, I think deprecation warnings (when done right) can still serve a purpose.

Re: Clarification on the commit protocol

2020-02-27 Thread Michael Armbrust
No, it is not. Although the commit protocol has mostly been superseded by Delta Lake, which is available as a separate open-source project that works natively with Apache Spark. In contrast to the commit protocol, Delta can guarantee full ACID (rather than just partition level

[Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-24 Thread Michael Armbrust
Hello Everyone, As more users have started upgrading to Spark 3.0 preview (including myself), there have been many discussions around APIs that have been broken compared with Spark 2.x. In many of these discussions, one of the rationales for breaking an API seems to be "Spark follows semantic

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-21 Thread Michael Armbrust
This plan for evolving the TRIM function to be more standards compliant sounds much better to me than the original change to just switch the order. It pushes users in the right direction and cleans up our tech debt without silently breaking existing workloads. It means that programs won't return

[jira] [Commented] (SPARK-27911) PySpark Packages should automatically choose correct scala version

2019-07-16 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886479#comment-16886479 ] Michael Armbrust commented on SPARK-27911: -- You are right, there is nothing pyspark specific

Re: Announcing Delta Lake 0.2.0

2019-06-21 Thread Michael Armbrust
> > Thanks for confirmation. We are using the workaround to create a separate > Hive external table STORED AS PARQUET with the exact location of Delta > table. Our use case is batch-driven and we are running VACUUM with 0 > retention after every batch is completed. Do you see any potential problem

[jira] [Updated] (SPARK-27911) PySpark Packages should automatically choose correct scala version

2019-05-31 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-27911: - Description: Today, users of pyspark (and Scala) need to manually specify the version

[jira] [Created] (SPARK-27911) PySpark Packages should automatically choose correct scala version

2019-05-31 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-27911: Summary: PySpark Packages should automatically choose correct scala version Key: SPARK-27911 URL: https://issues.apache.org/jira/browse/SPARK-27911 Project

[jira] [Commented] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing

2019-05-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16837552#comment-16837552 ] Michael Armbrust commented on SPARK-27676: -- I tend to agree that all cases where we chose

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-19 Thread Michael Armbrust
+1 (binding), we've tested this and it LGTM. On Thu, Apr 18, 2019 at 7:51 PM Wenchen Fan wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.4.2. > > The vote is open until April 23 PST and passes if a majority +1 PMC votes > are cast, with > a minimum of 3 +1

Re: Spark 2.4.2

2019-04-16 Thread Michael Armbrust
and looks safe enough to me. I was just a > little surprised since I was expecting a correctness issue if this is > prompting a release. I'm definitely on the side of case-by-case judgments > on what to allow in patch releases and this looks fine. > > On Tue, Apr 16, 2019 at 4:27

Re: Spark 2.4.2

2019-04-16 Thread Michael Armbrust
by this behavior. Do you have a different proposal about how this should be handled? On Tue, Apr 16, 2019 at 4:23 PM Ryan Blue wrote: > Is this a bug fix? It looks like a new feature to me. > > On Tue, Apr 16, 2019 at 4:13 PM Michael Armbrust > wrote: > >> Hello All, >> >&

Spark 2.4.2

2019-04-16 Thread Michael Armbrust
Hello All, I know we just released Spark 2.4.1, but in light of fixing SPARK-27453 I was wondering if it might make sense to follow up quickly with 2.4.2. Without this fix it's very hard to build a datasource that correctly handles partitioning

[jira] [Assigned] (SPARK-27453) DataFrameWriter.partitionBy is Silently Dropped by DSV1

2019-04-12 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-27453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-27453: Assignee: Liwen Sun > DataFrameWriter.partitionBy is Silently Dropped by D

[jira] [Created] (SPARK-27453) DataFrameWriter.partitionBy is Silently Dropped by DSV1

2019-04-12 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-27453: Summary: DataFrameWriter.partitionBy is Silently Dropped by DSV1 Key: SPARK-27453 URL: https://issues.apache.org/jira/browse/SPARK-27453 Project: Spark

[jira] [Commented] (SPARK-23831) Add org.apache.derby to IsolatedClientLoader

2018-11-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685629#comment-16685629 ] Michael Armbrust commented on SPARK-23831: -- Why was it reverted? > Add org.apache.de

Re: Plan on Structured Streaming in next major/minor release?

2018-10-30 Thread Michael Armbrust
> > Agree. Just curious, could you explain what do you mean by "negation"? > Does it mean applying retraction on aggregated? > Yeah exactly. Our current streaming aggregation assumes that the input is in append-mode and multiple aggregations break this.

Re: Plan on Structured Streaming in next major/minor release?

2018-10-30 Thread Michael Armbrust
Thanks for bringing up some possible future directions for streaming. Here are some thoughts: - I personally view all of the activity on Spark SQL also as activity on Structured Streaming. The great thing about building streaming on catalyst / tungsten is that continued improvement to these

[jira] [Commented] (SPARK-6459) Warn when Column API is constructing trivially true equality

2018-07-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553330#comment-16553330 ] Michael Armbrust commented on SPARK-6459: - [~tenstriker] this will never happen from a SQL query

Re: Spark 2.3.0 Structured Streaming Kafka Timestamp

2018-05-11 Thread Michael Armbrust
Hmm yeah that does look wrong. Would be great if someone opened a PR to correct the docs :) On Thu, May 10, 2018 at 5:13 PM Yuta Morisawa wrote: > The problem is solved. > The actual schema of Kafka message is different from documentation. > > >

[jira] [Resolved] (SPARK-5517) Add input types for Java UDFs

2018-05-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5517. - Resolution: Unresolved > Add input types for Java U

[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2018-05-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466410#comment-16466410 ] Michael Armbrust commented on SPARK-18165: -- This is great!  I'm glad there are more connectors

[jira] [Updated] (SPARK-18165) Kinesis support in Structured Streaming

2018-05-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18165: - Component/s: (was: DStreams) Structured Streaming > Kinesis supp

Re: Sorting on a streaming dataframe

2018-04-30 Thread Michael Armbrust
t; would be efficient in terms of performance as compared to implementing this > functionality inside the applications. > > Hemant > > On Thu, Apr 26, 2018 at 11:59 PM, Michael Armbrust <mich...@databricks.com > > wrote: > >> The basic tenet of structured streaming is that

Re: Sorting on a streaming dataframe

2018-04-26 Thread Michael Armbrust
The basic tenet of structured streaming is that a query should return the same answer in streaming or batch mode. We support sorting in complete mode because we have all the data and can sort it correctly and return the full answer. In update or append mode, sorting would only return a correct
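The distinction above can be sketched in code; `events` is an assumed streaming DataFrame with a `word` column:

```scala
// Sketch: a sort is accepted only on top of an aggregation written in
// complete mode, where the engine has the full result to order correctly.
import org.apache.spark.sql.functions.desc

val ranked = events.groupBy("word").count().orderBy(desc("count"))

ranked.writeStream
  .outputMode("complete") // append/update would reject the orderBy
  .format("console")
  .start()
```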

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Michael Armbrust
You can calculate argmax using a struct. df.groupBy($"id").agg(max(struct($"my_timestamp", struct($"*").as("data"))).getField("data").as("data")).select($"data.*") You could transcode this to SQL; it'll just be complicated nested queries. On Wed, Apr 18, 2018 at 3:40 PM, kant kodali wrote: >

[jira] [Commented] (SPARK-23337) withWatermark raises an exception on struct objects

2018-04-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433222#comment-16433222 ] Michael Armbrust commented on SPARK-23337: -- The checkpoint will only grow if you are doing

[jira] [Commented] (SPARK-23835) When Dataset.as converts column from nullable to non-nullable type, null Doubles are converted silently to -1

2018-04-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423046#comment-16423046 ] Michael Armbrust commented on SPARK-23835: -- I believe the correct semantics are to throw

[jira] [Commented] (SPARK-23835) When Dataset.as converts column from nullable to non-nullable type, null Doubles are converted silently to -1

2018-03-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420809#comment-16420809 ] Michael Armbrust commented on SPARK-23835: -- /cc [~cloud_fan] > When Dataset.as converts col

Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread Michael Armbrust
Those options will not affect structured streaming. You are looking for .option("maxOffsetsPerTrigger", "1000") We are working on improving this by building a generic mechanism into the Streaming DataSource V2 so that the engine can do admission control on the amount of data returned in a
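The suggested option can be sketched as follows; the broker address and topic name are placeholders:

```scala
// Sketch: admission control on a Kafka source via maxOffsetsPerTrigger.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "1000") // cap offsets consumed per batch
  .load()
```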

[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.

2018-03-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391861#comment-16391861 ] Michael Armbrust commented on SPARK-23325: -- It doesn't seem like it would be that hard to stabilize

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-26 Thread Michael Armbrust
+1 all our pipelines have been running the RC for several days now. On Mon, Feb 26, 2018 at 10:33 AM, Dongjoon Hyun wrote: > +1 (non-binding). > > Bests, > Dongjoon. > > > > On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue > wrote: > >> +1

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2018-02-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373611#comment-16373611 ] Michael Armbrust commented on SPARK-18057: -- My only concern is that it is stable and backwards

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2018-02-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372138#comment-16372138 ] Michael Armbrust commented on SPARK-18057: -- We generally tend towards "don't break t

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Michael Armbrust
I'm -1 on any changes that aren't fixing major regressions from 2.2 at this point. Also, in any cases where it's possible, we should be flipping new features off if they are still regressing, rather than continuing to attempt to fix them. Since it's experimental, I would support backporting the

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2018-02-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370780#comment-16370780 ] Michael Armbrust commented on SPARK-18057: -- +1 to upgrading and it would also be great to add

[jira] [Commented] (SPARK-23337) withWatermark raises an exception on struct objects

2018-02-16 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367933#comment-16367933 ] Michael Armbrust commented on SPARK-23337: -- This is essentially the same issue as SPARK-18084

[jira] [Updated] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-02-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-23173: - Labels: release-notes (was: ) > from_json can produce nulls for fields which are mar

Re: [Structured Streaming] Deserializing avro messages from kafka source using schema registry

2018-02-09 Thread Michael Armbrust
This isn't supported yet, but there is on going work at spark-avro to enable this use case. Stay tuned. On Fri, Feb 9, 2018 at 3:07 PM, Bram wrote: > Hi, > > I couldn't find any documentation about avro message

Re: [Structured Streaming] Commit protocol to move temp files to dest path only when complete, with code

2018-02-09 Thread Michael Armbrust
We didn't go this way initially because it doesn't work on storage systems that have weaker guarantees than HDFS with respect to rename. That said, I'm happy to look at other options if we want to make this configurable. On Fri, Feb 9, 2018 at 2:53 PM, Dave Cameron

Re: DataSourceV2: support for named tables

2018-02-02 Thread Michael Armbrust
I am definitely in favor of first-class / consistent support for tables and data sources. One thing that is not clear to me from this proposal is exactly what the interfaces are between: - Spark - A (The?) metastore - A data source If we pass in the table identifier is the data source then

Re: SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-02 Thread Michael Armbrust
> > So here are my recommendations for moving forward, with DataSourceV2 as a > starting point: > >1. Use well-defined logical plan nodes for all high-level operations: >insert, create, CTAS, overwrite table, etc. >2. Use rules that match on these high-level plan nodes, so that it >

Re: Prefer Structured Streaming over Spark Streaming (DStreams)?

2018-01-31 Thread Michael Armbrust
At this point I recommend that new applications be built using Structured Streaming. The engine was GA-ed as of Spark 2.2 and I know of several very large (trillions of records) production jobs that are running in Structured Streaming. All of our production pipelines at Databricks are written

Re: Max number of streams supported ?

2018-01-31 Thread Michael Armbrust
-dev +user > Similarly for structured streaming, Would there be any limit on number of > of streaming sources I can have ? > There is no fundamental limit, but each stream will have a thread on the driver that is doing coordination of execution. We comfortably run 20+ streams on a single

[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming

2018-01-18 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331141#comment-16331141 ] Michael Armbrust commented on SPARK-20928: -- There is more work to do so I might leave

[jira] [Commented] (SPARK-23050) Structured Streaming with S3 file source duplicates data because of eventual consistency.

2018-01-12 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323883#comment-16323883 ] Michael Armbrust commented on SPARK-23050: -- [~zsxwing] is correct. While it is possible

Re: Dataset API inconsistencies

2018-01-10 Thread Michael Armbrust
I wrote Datasets, and I'll say I only use them when I really need to (i.e. when it would be very cumbersome to express what I am trying to do relationally). Dataset operations are almost always going to be slower than their DataFrame equivalents since they usually require materializing objects

[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310423#comment-16310423 ] Michael Armbrust commented on SPARK-22947: -- +1 to [~rxin]'s question. This seems like it might

[jira] [Commented] (SPARK-22929) Short name for "kafka" doesn't work in pyspark with packages

2017-12-31 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307144#comment-16307144 ] Michael Armbrust commented on SPARK-22929: -- Haha, thanks [~sowen], you are right. Kafka

[jira] [Created] (SPARK-22929) Short name for "kafka" doesn't work in pyspark with packages

2017-12-30 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-22929: Summary: Short name for "kafka" doesn't work in pyspark with packages Key: SPARK-22929 URL: https://issues.apache.org/jira/browse/SPARK-22929 Proj

[jira] [Created] (SPARK-22862) Docs on lazy elimination of columns missing from an encoder.

2017-12-21 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-22862: Summary: Docs on lazy elimination of columns missing from an encoder. Key: SPARK-22862 URL: https://issues.apache.org/jira/browse/SPARK-22862 Project: Spark

[jira] [Commented] (SPARK-22739) Additional Expression Support for Objects

2017-12-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299232#comment-16299232 ] Michael Armbrust commented on SPARK-22739: -- Sounds good to me. I'm happy to provide pointers

[jira] [Commented] (SPARK-22739) Additional Expression Support for Objects

2017-12-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299167#comment-16299167 ] Michael Armbrust commented on SPARK-22739: -- Any progress on this? Branch cut is January 1st

[jira] [Updated] (SPARK-22739) Additional Expression Support for Objects

2017-12-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-22739: - Target Version/s: 2.3.0 > Additional Expression Support for Obje

Re: Spark error while trying to spark.read.json()

2017-12-19 Thread Michael Armbrust
- dev java.lang.AbstractMethodError almost always means that you have different libraries on the classpath than at compilation time. In this case I would check to make sure you have the correct version of Scala (and only have one version of scala) on the classpath. On Tue, Dec 19, 2017 at 5:42

Re: Timeline for Spark 2.3

2017-12-19 Thread Michael Armbrust
Do people really need to be around for the branch cut (modulo the person cutting the branch)? 1st or 2nd doesn't really matter to me, but I am +1 kicking this off as soon as we enter the new year :) Michael On Tue, Dec 19, 2017 at 4:39 PM, Holden Karau wrote: > Sounds

[jira] [Commented] (SPARK-22824) Spark Structured Streaming Source trait breaking change

2017-12-18 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295676#comment-16295676 ] Michael Armbrust commented on SPARK-22824: -- This is technically an internal API (as is all

[jira] [Assigned] (SPARK-22824) Spark Structured Streaming Source trait breaking change

2017-12-18 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-22824: Assignee: Jose Torres > Spark Structured Streaming Source trait breaking cha

[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming

2017-12-12 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288391#comment-16288391 ] Michael Armbrust commented on SPARK-20928: -- An update on this. We've started to create subtasks

Re: queryable state & streaming

2017-12-08 Thread Michael Armbrust
https://issues.apache.org/jira/browse/SPARK-16738 I don't believe anyone is working on it yet. I think the most useful thing is to start enumerating requirements and use cases and then we can talk about how to build it. On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <

Re: Kafka version support

2017-11-30 Thread Michael Armbrust
Oh good question. I was saying that the stock structured streaming connector should be able to talk to 0.11 or 1.0 brokers. On Thu, Nov 30, 2017 at 1:12 PM, Cody Koeninger wrote: > Are you talking about the broker version, or the kafka-clients artifact > version? > > On

Re: Kafka version support

2017-11-30 Thread Michael Armbrust
I would expect that to work. On Wed, Nov 29, 2017 at 10:17 PM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > Just wondering if anyone has tried spark structured streaming kafka > connector (2.2) with Kafka 0.11 or Kafka 1.0 version > > Thanks > Raghav >

Re: Generate windows on processing time in Spark Structured Streaming

2017-11-10 Thread Michael Armbrust
Hmmm, we should allow that. current_timestamp() is actually deterministic within any given batch. Could you open a JIRA ticket? On Fri, Nov 10, 2017 at 1:52 AM, wangsan wrote: > Hi all, > > How can I use current processing time to generate windows in streaming > processing? >

Timeline for Spark 2.3

2017-11-09 Thread Michael Armbrust
According to the timeline posted on the website, we are nearing branch cut for Spark 2.3. I'd like to propose pushing this out towards mid to late December for a couple of reasons and would like to hear what people think. 1. I've done release management during the Thanksgiving / Christmas time

Re: [Spark Structured Streaming] Changing partitions of (flat)MapGroupsWithState

2017-11-08 Thread Michael Armbrust
The relevant config is spark.sql.shuffle.partitions. Note that once you start a query, this number is fixed. The config will only affect queries starting from an empty checkpoint. On Wed, Nov 8, 2017 at 7:34 AM, Teemu Heikkilä wrote: > I have spark structured streaming job
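A sketch of the point about the config being frozen at first start; `sessions` and `updateFn` are illustrative:

```scala
// Sketch: the number of state-store partitions is fixed from
// spark.sql.shuffle.partitions when the query FIRST starts; an existing
// checkpoint keeps its original value, so set this before the first run.
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}

spark.conf.set("spark.sql.shuffle.partitions", "40")

val query = sessions
  .groupByKey(_.userId)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(updateFn)
  .writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "/tmp/new-checkpoint") // must be a fresh dir
  .start()
```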

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-06 Thread Michael Armbrust
+1 On Sat, Nov 4, 2017 at 11:02 AM, Xiao Li wrote: > +1 > > 2017-11-04 11:00 GMT-07:00 Burak Yavuz : > >> +1 >> >> On Fri, Nov 3, 2017 at 10:02 PM, vaquar khan >> wrote: >> >>> +1 >>> >>> On Fri, Nov 3, 2017 at 8:14 PM, Weichen Xu

Re: Structured Stream equivalent of reduceByKey

2017-11-06 Thread Michael Armbrust
ate > store interactions. > > Also anyone aware of any design doc or some example about how we can add > new operation on dataSet and corresponding physical plan. > > > > On Thu, Oct 26, 2017 at 5:54 PM, Michael Armbrust <mich...@databricks.com> > wrote: >

Re: Structured Stream equivalent of reduceByKey

2017-10-26 Thread Michael Armbrust
- dev I think you should be able to write an Aggregator. You probably want to run in update mode if you are looking for it to output any group that has changed in the batch. On Wed, Oct 25,
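The suggested approach can be sketched as a reduceByKey-style typed Aggregator (names are illustrative):

```scala
// Sketch: a typed Aggregator that sums values per key.
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

val sumValues = new Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(acc: Long, v: Long): Long = acc + v
  def merge(a: Long, b: Long): Long = a + b
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}.toColumn

// `pairs` is an assumed streaming Dataset[(String, Long)] of (key, value).
val reduced = pairs.groupByKey(_._1).mapValues(_._2).agg(sumValues)

// Update mode emits only the groups that changed in each batch.
reduced.writeStream.outputMode("update").format("console").start()
```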

Re: Implement Dataset reader from SEQ file with protobuf to Dataset

2017-10-08 Thread Michael Armbrust
spark-avro would be a good example to start with. On Sun, Oct 8, 2017 at 3:00 AM, Serega Sheypak wrote: > Hi, did anyone try to implement Spark SQL dataset reader from SEQ file > with protobuf inside to Dataset? > > Imagine I

Re: Chaining Spark Streaming Jobs

2017-09-18 Thread Michael Armbrust
uery(StreamingQueryManager.scala:278) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:222) > > While running on the EMR cluster all paths poin

Re: [SS] Any way to optimize memory consumption of SS?

2017-09-14 Thread Michael Armbrust
rt() > > query.awaitTermination() > > *and I use play json to parse input logs from kafka ,the parse function is > like* > > def parseFunction(str: String): (Long, String) = { > val json = Json.parse(str) > val timestamp = (json \ "time").get.toString(

Re: [SS]How to add a column with custom system time?

2017-09-14 Thread Michael Armbrust
.groupBy($"date") >>> .count() >>> .withColumn("window", window(current_timestamp(), "15 minutes")) >>> >>> /** >>> * output >>> */ >>> val query = results >>> .writeSt

Re: [Structured Streaming] Multiple sources best practice/recommendation

2017-09-14 Thread Michael Armbrust
I would probably suggest that you partition by format (though you can get the file name from the build in function input_file_name()). You can load multiple streams from different directories and union them together as long as the schema is the same after parsing. Otherwise you can just run
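The union-of-streams suggestion can be sketched as follows; paths and `commonSchema` are illustrative:

```scala
// Sketch: load several directory streams, parse each to a common schema,
// then union; input_file_name() records which file a row came from.
import org.apache.spark.sql.functions.input_file_name

val jsonStream    = spark.readStream.schema(commonSchema).json("/in/json")
val parquetStream = spark.readStream.schema(commonSchema).parquet("/in/parquet")

val unioned = jsonStream
  .union(parquetStream)
  .withColumn("source_file", input_file_name())
```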

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-14 Thread Michael Armbrust
orage (in the same transaction with the > data) and initialize the custom sink with right batch id when application > re-starts. After this just ignore batch if current batchId <= > latestBatchId. > > Dmitry > > > 2017-09-13 22:12 GMT+03:00 Michael Armbrust <mich...@dat

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-13 Thread Michael Armbrust
ing with my own re-try logic (which is > basically, just ignore intermediate data, re-read from Kafka and re-try > processing and load)? > > Dmitry > > > 2017-09-12 22:43 GMT+03:00 Michael Armbrust <mich...@databricks.com>: > >> In the checkpoint directory t

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-12 Thread Michael Armbrust
ore detail, please? Is there some kind of > offset manager API that works as get-offset by batch id lookup table? > > Dmitry > > 2017-09-12 20:29 GMT+03:00 Michael Armbrust <mich...@databricks.com>: > >> I think that we are going to have to change the Sink API as par

Re: [SS]How to add a column with custom system time?

2017-09-12 Thread Michael Armbrust
Can you show all the code? This works for me. On Tue, Sep 12, 2017 at 12:05 AM, 张万新 <kevinzwx1...@gmail.com> wrote: > The spark version is 2.2.0 > > Michael Armbrust <mich...@databricks.com>于2017年9月12日周二 下午12:32写道: > >> Which version of spark? >> >

Re: [SS] Any way to optimize memory consumption of SS?

2017-09-12 Thread Michael Armbrust
Can you show the full query you are running? On Tue, Sep 12, 2017 at 10:11 AM, 张万新 wrote: Hi, I'm using structured streaming to count unique visits of our website. I use Spark on YARN mode with 4 executor instances and from 2 cores * 5g memory to 4 cores * 10g

Re: Easy way to get offset metadata with Spark Streaming API

2017-09-12 Thread Michael Armbrust
I think that we are going to have to change the Sink API as part of SPARK-20928, which is why I linked these tickets together. I'm still targeting an initial version for Spark 2.3, which should happen sometime towards the end of the year.

Re: [SS]How to add a column with custom system time?

2017-09-11 Thread Michael Armbrust
…nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window". Can you give more advice? Michael Armbrust <mich...@databricks.com> wrote on Tue, Sep 12, 2017 at 4:48 AM: import org.apache.spark.sql.functions._ df.withColumn("w

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-11 Thread Michael Armbrust
The following will convert the whole row to JSON. import org.apache.spark.sql.functions._ df.select(to_json(struct(col("*")))) On Sat, Sep 9, 2017 at 6:27 PM, kant kodali wrote: Thanks Ryan! In this case, I will have a Dataset, so is there a way to convert Row to JSON
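The effect of `to_json(struct(col("*")))` — packing every column of a row into a single JSON string — can be mimicked per row with the standard `json` module. A semantic sketch only, not Spark code; the `row_to_json` name is hypothetical.

```python
import json

def row_to_json(row):
    """Serialize a whole row (dict of column -> value) to one JSON string,
    roughly what to_json(struct(col("*"))) produces for each row."""
    return json.dumps(row, sort_keys=True)

row = {"user": "kant", "visits": 2}
print(row_to_json(row))  # {"user": "kant", "visits": 2}
```

In Spark the struct() call first bundles all columns into a single nested column, and to_json then serializes that column; the sketch above collapses both steps into one.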

Re: Need some Clarification on checkpointing w.r.t Spark Structured Streaming

2017-09-11 Thread Michael Armbrust
Checkpoints record what has been processed for a specific query, and as such only need to be defined when writing (which is how you "start" a query). You can use the DataFrame created with readStream to start multiple queries, so it wouldn't really make sense to have a single checkpoint there.
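The point that a checkpoint belongs to a query rather than to the input DataFrame can be modeled with a tiny progress tracker: one shared source, one offset log per started query. All names here (`QueryProgress`, `start_query`, `commit`) are hypothetical, not Spark's API.

```python
class QueryProgress:
    """Sketch: each started query keeps its own checkpoint (last committed
    offset); the shared input stream itself has no checkpoint of its own."""

    def __init__(self):
        self.checkpoints = {}  # query name -> last committed offset

    def start_query(self, name):
        self.checkpoints.setdefault(name, -1)  # fresh query starts from scratch

    def commit(self, name, offset):
        self.checkpoints[name] = offset

progress = QueryProgress()
for q in ("counts", "raw_dump"):   # two queries over the same readStream
    progress.start_query(q)
progress.commit("counts", 5)       # each query advances independently
print(progress.checkpoints)        # {'counts': 5, 'raw_dump': -1}
```

This mirrors why, in Structured Streaming, `checkpointLocation` is an option on `writeStream` (per query) and not on `readStream`.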

Re: [SS]How to add a column with custom system time?

2017-09-11 Thread Michael Armbrust
import org.apache.spark.sql.functions._ df.withColumn("window", window(current_timestamp(), "15 minutes")) On Mon, Sep 11, 2017 at 3:03 AM, 张万新 wrote: Hi, In structured streaming how can I add a column to a dataset with current system time aligned with 15
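The `window(current_timestamp(), "15 minutes")` call assigns each row to a 15-minute bucket; the alignment itself is simple floor arithmetic on the timestamp, sketched here without Spark (the `window_start` name is hypothetical).

```python
from datetime import datetime

def window_start(ts, minutes=15):
    """Floor a timestamp to the start of its window, mirroring the
    bucket boundary that window(col, "15 minutes") computes per row.
    Assumes the width divides evenly into an hour."""
    floored_minute = ts.minute - ts.minute % minutes
    return ts.replace(minute=floored_minute, second=0, microsecond=0)

t = datetime(2017, 9, 11, 10, 7, 42)
print(window_start(t))  # 2017-09-11 10:00:00
```

Spark's version produces a struct with both `start` and `end` (`start + 15 minutes`), but the bucketing logic is the same floor operation.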

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Michael Armbrust
+1 On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue wrote: +1 (non-binding). Thanks for making the updates reflected in the current PR. It would be great to see the doc updated before it is finally published though. Right now it feels like this SPIP is focused

[jira] [Comment Edited] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-08-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148227#comment-16148227 ] Michael Armbrust edited comment on SPARK-20928 at 8/30/17 11:52 PM

[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-08-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148227#comment-16148227 ] Michael Armbrust commented on SPARK-20928: -- Hey everyone, thanks for your interest

Re: Increase Timeout or optimize Spark UT?

2017-08-23 Thread Michael Armbrust
I think we already set the number of partitions to 5 in tests? On Tue, Aug 22, 2017 at 3:25 PM, Maciej Szymkiewicz

Re: Joining 2 dataframes, getting result as nested list/structure in dataframe

2017-08-23 Thread Michael Armbrust
You can create a nested struct that contains multiple columns using struct(). Here's a pretty complete guide on working with nested data: https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html On Wed, Aug 23, 2017 at 2:30 PM, JG Perrin
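The idea of collecting join results into a nested structure — each left-side row carrying a list of its matching right-side rows, as one would build with struct() plus collect_list() — can be sketched in plain Python. The function and field names here are illustrative, not from the source.

```python
# Plain-Python sketch: join two row sets on a key and nest the matches,
# similar in spirit to groupBy(...).agg(collect_list(struct(...))) in Spark.

orders = [{"id": 1, "item": "pen"}, {"id": 1, "item": "ink"}]
users  = [{"id": 1, "name": "jg"}]

def nest_join(left, right, key):
    """Attach to each left row a list of right rows sharing its key."""
    by_key = {}
    for row in right:
        # drop the join key from the nested copy to avoid duplication
        by_key.setdefault(row[key], []).append(
            {k: v for k, v in row.items() if k != key})
    return [dict(row, orders=by_key.get(row[key], [])) for row in left]

print(nest_join(users, orders, "id"))
# [{'id': 1, 'name': 'jg', 'orders': [{'item': 'pen'}, {'item': 'ink'}]}]
```

The blog post linked above covers the Spark-native way to build and query such nested columns.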
