Re: Is the trigger interval the same as batch interval in structured streaming?

2017-04-10 Thread Michael Armbrust
s > updates and produce results every second. I also need to reset the state > (the count) back to zero every 24 hours. > > > > > > > On Mon, Apr 10, 2017 at 11:49 AM, Michael Armbrust <mich...@databricks.com > > wrote: > >> Nope, structured streaming

[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)

2017-04-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963353#comment-15963353 ] Michael Armbrust commented on SPARK-19067: -- No, this will be available in Spark 2.2.0

Re: Cant convert Dataset to case class with Option fields

2017-04-10 Thread Michael Armbrust
Options should work. Can you give a full example that is freezing? Which version of Spark are you using? On Fri, Apr 7, 2017 at 6:59 AM, Dirceu Semighini Filho < dirceu.semigh...@gmail.com> wrote: > Hi Devs, > I've some case classes here, and its fields are all optional > case class

Re: Is the trigger interval the same as batch interval in structured streaming?

2017-04-10 Thread Michael Armbrust
Nope, structured streaming eliminates the limitation that micro-batching affects the results of your streaming query. The trigger is just an indication of how often you want to produce results (and if you leave it blank we just run as quickly as possible). To control how tuples are grouped

[jira] [Commented] (SPARK-20216) Install pandoc on machine(s) used for packaging

2017-04-04 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15956045#comment-15956045 ] Michael Armbrust commented on SPARK-20216: -- I think it all runs on https

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-04 Thread Michael Armbrust
l...@pigscanfly.ca> > *Sent:* Friday, March 31, 2017 6:25:20 PM > *To:* Xiao Li > *Cc:* Michael Armbrust; dev@spark.apache.org > *Subject:* Re: [VOTE] Apache Spark 2.1.1 (RC2) > > -1 (non-binding) > > Python packaging doesn't seem to have quite worked out (looking > at

Re: map transform on array in spark sql

2017-04-04 Thread Michael Armbrust
If you can find the name of the struct field from the schema you can just do: df.select($"arrayField.a") Selecting a field from an array returns an array with that field selected from each element. On Mon, Apr 3, 2017 at 8:18 PM, Koert Kuipers wrote: > i have a DataFrame

Re: Convert Dataframe to Dataset in pyspark

2017-04-03 Thread Michael Armbrust
You don't need encoders in Python since it's all dynamically typed anyway. You can just do the following if you want the data as a string. sqlContext.read.text("/home/spark/1.6/lines").rdd.map(lambda row: row.value) 2017-04-01 5:36 GMT-07:00 Selvam Raman : > In Scala, > val ds

[VOTE] Apache Spark 2.1.1 (RC2)

2017-03-30 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.1.1. The vote is open until Sun, April 2nd, 2017 at 16:30 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.1 [ ] -1 Do not release this package because ...

Re: Why VectorUDT private?

2017-03-30 Thread Michael Armbrust
I think really the right way to think about things that are marked private is, "this may disappear or change in a future minor release". If you are okay with that, working around the visibility restrictions is reasonable. On Thu, Mar 30, 2017 at 5:52 AM, Koert Kuipers wrote:

[jira] [Updated] (SPARK-20103) Spark structured steaming from kafka - last message processed again after resume from checkpoint

2017-03-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20103: - Fix Version/s: 2.2.0 > Spark structured steaming from kafka - last message proces

[jira] [Commented] (SPARK-20103) Spark structured steaming from kafka - last message processed again after resume from checkpoint

2017-03-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948035#comment-15948035 ] Michael Armbrust commented on SPARK-20103: -- It is fixed in 2.2 but by [SPARK-19876]. > Sp

[jira] [Updated] (SPARK-20103) Spark structured steaming from kafka - last message processed again after resume from checkpoint

2017-03-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20103: - Description: When the application starts after a failure or a graceful shutdown

[jira] [Updated] (SPARK-20103) Spark structured steaming from kafka - last message processed again after resume from checkpoint

2017-03-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20103: - Docs Text: (was: object StructuredStreaming { def main(args: Array[String]): Unit

Re: Outstanding Spark 2.1.1 issues

2017-03-28 Thread Michael Armbrust
Thanks, > Asher Krim > Senior Software Engineer > > On Wed, Mar 22, 2017 at 7:44 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> An update: I cut the tag for RC1 last night. Currently fighting with the >> release process. Will post RC1 once I ge

Re: unable to stream kafka messages

2017-03-27 Thread Michael Armbrust
t; .select("message.*") // unnest the json > .as(Encoders.bean(Tweet.class)) > > Thanks > Kaniska > > - > > On Mon, Mar 27, 2017 at 1:25 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >>

Re: unable to stream kafka messages

2017-03-27 Thread Michael Armbrust
ified in the email) ? > > col("value").cast("string") - throwing an error 'cannot find symbol > method col(java.lang.String)' > I tried $"value" which results into similar compilation error. > > Thanks > Kaniska > > > > On Mon, Mar 27, 2017 at

Re: unable to stream kafka messages

2017-03-27 Thread Michael Armbrust
, kaniska Mandal <kaniska.man...@gmail.com> wrote: > Hi Michael, > > Thanks much for the suggestion. > > I was wondering - whats the best way to deserialize the 'value' field > > > On Fri, Mar 24, 2017 at 11:47 AM, Michael Armbrust <mich...@databricks.com > >

Re: How to insert nano seconds in the TimestampType in Spark

2017-03-27 Thread Michael Armbrust
The timestamp type is only microsecond precision. You would need to store it on your own (as binary or limited range long or something) if you require nanosecond precision. On Mon, Mar 27, 2017 at 5:29 AM, Devender Yadav < devender.ya...@impetus.co.in> wrote: > Hi All, > > I am using spark

Re: unable to stream kafka messages

2017-03-24 Thread Michael Armbrust
Encoders can only map data into an object if those columns already exist. When we are reading from Kafka, we just get a binary blob and you'll need to help Spark parse that first. Assuming your data is stored in JSON it should be pretty straight forward. streams = spark .readStream()

Re: how to read object field within json file

2017-03-24 Thread Michael Armbrust
I'm not sure you can parse this as an Array, but you can hint to the parser that you would like to treat source as a map instead of as a struct. This is a good strategy when you have dynamic columns in your data. Here is an example of the schema you can use to parse this JSON and also how to use

[jira] [Commented] (SPARK-10816) EventTime based sessionization

2017-03-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939156#comment-15939156 ] Michael Armbrust commented on SPARK-10816: -- Just a quick note for people interested

[jira] [Resolved] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18970. -- Resolution: Fixed Fix Version/s: 2.1.0 I'm going to close this, but please

[jira] [Closed] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-17344. Resolution: Won't Fix Unless someone really wants to work on this, I think the fact

[jira] [Updated] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19965: - Target Version/s: 2.2.0 > DataFrame batch reader may fail to infer partitions w

[jira] [Updated] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19767: - Component/s: (was: Structured Streaming) DStreams > API Doc pa

[jira] [Resolved] (SPARK-19013) java.util.ConcurrentModificationException when using s3 path as checkpointLocation

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-19013. -- Resolution: Later It seems like [HADOOP-13345] is the right solution here, but since

[jira] [Resolved] (SPARK-19788) DataStreamReader/DataStreamWriter.option shall accept user-defined type

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-19788. -- Resolution: Won't Fix Thanks for the suggestion. However, as [~zsxwing] said

[jira] [Resolved] (SPARK-19932) Disallow a case that might cause OOM for steaming deduplication

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-19932. -- Resolution: Won't Fix Thanks for working on this. While I think it would be helpful

[jira] [Assigned] (SPARK-19876) Add OneTime trigger executor

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-19876: Assignee: Tyson Condie > Add OneTime trigger execu

[jira] [Updated] (SPARK-19876) Add OneTime trigger executor

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19876: - Target Version/s: 2.2.0 > Add OneTime trigger execu

[jira] [Updated] (SPARK-19989) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19989: - Description: This test failed recently here: https://amplab.cs.berkeley.edu/jenkins

[jira] [Updated] (SPARK-19989) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19989: - Target Version/s: 2.2.0 > Flaky Test: org.apache.spark.sql.kafka

Re: Outstanding Spark 2.1.1 issues

2017-03-22 Thread Michael Armbrust
seem like regression from 2.1 so we should be good to start the RC >> process. >> >> On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust <mich...@databricks.com >> > wrote: >> >> Please speak up if I'm wrong, but none of these seem like critical >> regressi

[jira] [Created] (SPARK-20063) Trigger without delay when falling behind

2017-03-22 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-20063: Summary: Trigger without delay when falling behind Key: SPARK-20063 URL: https://issues.apache.org/jira/browse/SPARK-20063 Project: Spark Issue

[jira] [Commented] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs

2017-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936876#comment-15936876 ] Michael Armbrust commented on SPARK-20009: -- Yeah, the DDL format is certainly a lot easier

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Michael Armbrust
Please speak up if I'm wrong, but none of these seem like critical regressions from 2.1. As such I'll start the RC process later today. On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau wrote: > I'm not super sure it should be a blocker for 2.1.1 -- is it a regression? >

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread Michael Armbrust
ing to save to the cassandra DB and try to keep shuffle operations to > a strict minimum (at best none). As of now we are not entirely pleased with > our current performances, that's why I'm doing a kafka topic sharding POC > and getting the executor to handle the specificied partitio

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread Michael Armbrust
Sorry, typo. Should be a repartition not a groupBy. > spark.readStream > .format("kafka") > .option("kafka.bootstrap.servers", "...") > .option("subscribe", "t0,t1") > .load() > .repartition($"partition") > .writeStream > .foreach(... code to write to cassandra ...) >

[jira] [Commented] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"

2017-03-16 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929275#comment-15929275 ] Michael Armbrust commented on SPARK-19982: -- I'm not sure if changing weak to strong references

Re: Streaming 2.1.0 - window vs. batch duration

2017-03-16 Thread Michael Armbrust
Have you considered trying event time aggregation in structured streaming instead? On Thu, Mar 16, 2017 at 12:34 PM, Dominik Safaric wrote: > Hi all, > > As I’ve implemented a streaming application pulling data from Kafka every > 1 second (batch interval), I am

Re: [Spark Streaming+Kafka][How-to]

2017-03-16 Thread Michael Armbrust
I think it should be straightforward to express this using structured streaming. You could ensure that data from a given partition ID is processed serially by performing a group by on the partition column. spark.readStream .format("kafka") .option("kafka.bootstrap.servers", "...")

Spark 2.2 Code-freeze - 3/20

2017-03-15 Thread Michael Armbrust
Hey Everyone, Just a quick announcement that I'm planning to cut the branch for Spark 2.2 this coming Monday (3/20). Please try and get things merged before then and also please begin retargeting of any issues that you don't think will make the release. Michael

Re: Should we consider a Spark 2.1.1 release?

2017-03-15 Thread Michael Armbrust
Hey Holden, Thanks for bringing this up! I think we usually cut patch releases when there are enough fixes to justify it. Sometimes just a few weeks after the release. I guess if we are at 3 months Spark 2.1.0 was a pretty good release :) That said, it is probably time. I was about to start

Re: Structured Streaming - Can I start using it?

2017-03-13 Thread Michael Armbrust
I think it's very, very unlikely that it will get withdrawn. The primary reason that the APIs are still marked experimental is that we like to have several releases before committing to interface stability (in particular the interfaces to write custom sources and sinks are likely to evolve). Also,

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-03-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15922715#comment-15922715 ] Michael Armbrust commented on SPARK-18057: -- So to summarize, it'll be unfortunate if Kafka

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-03-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905755#comment-15905755 ] Michael Armbrust commented on SPARK-18057: -- It seems like we can upgrade the existing Kafka10

[jira] [Updated] (SPARK-19888) Seeing offsets not resetting even when reset policy is configured explicitly

2017-03-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19888: - Component/s: (was: Spark Core) DStreams > Seeing offs

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2017-03-10 Thread Michael Armbrust
'm experiencing a similar issue. Will this not be fixed in Spark > Streaming? > > Best, > Justin > > On Mar 10, 2017, at 8:34 AM, Michael Armbrust <mich...@databricks.com> > wrote: > > One option here would be to try Structured Streaming. We've added an > option

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2017-03-10 Thread Michael Armbrust
One option here would be to try Structured Streaming. We've added an option "failOnDataLoss" that will cause Spark to just skip ahead when this exception is encountered (it's off by default, though, so you don't silently miss data). On Fri, Mar 18, 2016 at 4:16 AM, Ramkumar Venkataraman <

[jira] [Updated] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2017-03-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18055: - Target Version/s: 2.2.0 > Dataset.flatMap can't work with types from customized

[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2017-03-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-18055: Assignee: Michael Armbrust > Dataset.flatMap can't work with types f

Re: How to unit test spark streaming?

2017-03-07 Thread Michael Armbrust
> > Basically you abstract your transformations to take in a dataframe and > return one, then you assert on the returned df > +1 to this suggestion. This is why we wanted streaming and batch dataframes to share the same API.

[jira] [Updated] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource

2017-03-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19813: - Target Version/s: 2.2.0 > maxFilesPerTrigger combo latestFirst may miss old fi

[jira] [Updated] (SPARK-19690) Join a streaming DataFrame with a batch DataFrame may not work

2017-03-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19690: - Priority: Critical (was: Major) > Join a streaming DataFrame with a batch DataFrame

[jira] [Updated] (SPARK-18258) Sinks need access to offset representation

2017-03-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18258: - Target Version/s: (was: 2.2.0) > Sinks need access to offset representat

Re: Why Spark cannot get the derived field of case class in Dataset?

2017-02-28 Thread Michael Armbrust
We only serialize things that are in the constructor. You would have access to it in the typed API (df.map(_.day)). I'd suggest making a factory method that fills these in and put them in the constructor if you need to get to it from other dataframe operations. On Tue, Feb 28, 2017 at 12:03 PM,

[jira] [Commented] (SPARK-19715) Option to Strip Paths in FileSource

2017-02-24 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883642#comment-15883642 ] Michael Armbrust commented on SPARK-19715: -- This isn't a hypothetical. A user of structured

[jira] [Created] (SPARK-19721) Good error message for version mismatch in log files

2017-02-23 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-19721: Summary: Good error message for version mismatch in log files Key: SPARK-19721 URL: https://issues.apache.org/jira/browse/SPARK-19721 Project: Spark

[jira] [Commented] (SPARK-19715) Option to Strip Paths in FileSource

2017-02-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881731#comment-15881731 ] Michael Armbrust commented on SPARK-19715: -- [~lwlin] another file source features you might want

[jira] [Updated] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-02-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18057: - Summary: Update structured streaming kafka from 10.0.1 to 10.2.0 (was: Update

[jira] [Created] (SPARK-19715) Option to Strip Paths in FileSource

2017-02-23 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-19715: Summary: Option to Strip Paths in FileSource Key: SPARK-19715 URL: https://issues.apache.org/jira/browse/SPARK-19715 Project: Spark Issue Type: New

[jira] [Commented] (SPARK-19637) add to_json APIs to SQL

2017-02-17 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872470#comment-15872470 ] Michael Armbrust commented on SPARK-19637: -- From JSON is harder because the second argum

Re: Pretty print a dataframe...

2017-02-16 Thread Michael Armbrust
The toString method of Dataset.queryExecution includes the various plans. I usually just log that directly. On Thu, Feb 16, 2017 at 8:26 AM, Muthu Jayakumar wrote: > Hello there, > > I am trying to write to log-line a dataframe/dataset queryExecution and/or > its logical

Re: Structured Streaming Spark Summit Demo - Databricks people

2017-02-16 Thread Michael Armbrust
Thanks for your interest in Apache Spark Structured Streaming! There is nothing secret in that demo, though I did make some configuration changes in order to get the timing right (gotta have some dramatic effect :) ). Also I think the visualizations based on metrics output by the

[jira] [Created] (SPARK-19633) FileSource read from FileSink

2017-02-16 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-19633: Summary: FileSource read from FileSink Key: SPARK-19633 URL: https://issues.apache.org/jira/browse/SPARK-19633 Project: Spark Issue Type: New

Re: Case class with POJO - encoder issues

2017-02-13 Thread Michael Armbrust
You are right, you need that PR. I pinged the author, but otherwise it would be great if someone could carry it over the finish line. On Sat, Feb 11, 2017 at 4:19 PM, Jason White wrote: > I'd like to create a Dataset using some classes from Geotools to do some >

[jira] [Commented] (SPARK-19553) Add GroupedData.countApprox()

2017-02-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864326#comment-15864326 ] Michael Armbrust commented on SPARK-19553: -- It seems like there are a couple of distinct feature

[jira] [Commented] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-10 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861918#comment-15861918 ] Michael Armbrust commented on SPARK-19477: -- If a lot of people are confused by this being lazy

Re: benefits of code gen

2017-02-10 Thread Michael Armbrust
Function1 is specialized, but nullSafeEval is Any => Any, so that's still going to box in the non-codegened execution path. On Fri, Feb 10, 2017 at 1:32 PM, Koert Kuipers wrote: > based on that i take it that math functions would be primary beneficiaries > since they work on

Re: Un-exploding / denormalizing Spark SQL help

2017-02-07 Thread Michael Armbrust
I think the fastest way is likely to use a combination of conditionals (when / otherwise), first (ignoring nulls), while grouping by the id. This should get the answer with only a single shuffle. Here is an example

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
leReferenceSink",tableName2) > .option("checkpointLocation","checkpoint") > .start() > > > On Tue, Feb 7, 2017 at 7:24 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> Read the JSON log of files that is in `/your/path/_spark_me

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
not the case then how would I go about ensuring no duplicates? > > > Thanks again for the awesome support! > > Regards > Sam > On Tue, 7 Feb 2017 at 18:05, Michael Armbrust <mich...@databricks.com> > wrote: > >> Sorry, I think I was a little unclear. There a

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
d not be any jobs left because I can see in the log > that its now polling for new changes, the latest offset is the right one > > After I kill it and relaunch it picks up that same file? > > > Sorry if I misunderstood you > > On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust <m

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
2017 at 4:58 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > >> Thanks Micheal! >> >> >> >> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <mich...@databricks.com> >> wrote: >> >>> Here a JIRA: https://issues.apache.org/jira

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-19497 We should add this soon. On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin wrote: > Hi All > > When trying to read a stream off S3 and I try and drop duplicates I get > the following error: > > Exception in thread

[jira] [Created] (SPARK-19497) dropDuplicates with watermark

2017-02-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-19497: Summary: dropDuplicates with watermark Key: SPARK-19497 URL: https://issues.apache.org/jira/browse/SPARK-19497 Project: Spark Issue Type: New

[jira] [Updated] (SPARK-19478) JDBC Sink

2017-02-06 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19478: - Issue Type: New Feature (was: Bug) > JDBC Sink > - > >

[jira] [Created] (SPARK-19478) JDBC Sink

2017-02-06 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-19478: Summary: JDBC Sink Key: SPARK-19478 URL: https://issues.apache.org/jira/browse/SPARK-19478 Project: Spark Issue Type: Bug Components

Re: specifing schema on dataframe

2017-02-05 Thread Michael Armbrust
ns? > > Wouldnt it be simpler to just regex replace the numbers to remove the > quotes? > > > Regards > Sam > > On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> Specifying the schema when parsing JSON wil

Re: specifing schema on dataframe

2017-02-05 Thread Michael Armbrust
olving this? > > My current approach is to iterate over the JSON and identify which fields > are numbers and which arent then recreate the json > > But to be honest that doesnt seem like the cleanest approach, so happy for > advice on this > > Rega

Re: specifing schema on dataframe

2017-02-05 Thread Michael Armbrust
-dev You can use withColumn to change the type after the data has been loaded. On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin

Re: frustration with field names in Dataset

2017-02-02 Thread Michael Armbrust
That might be reasonable. At least I can't think of any problems with doing that. On Thu, Feb 2, 2017 at 7:39 AM, Koert Kuipers wrote: > since a dataset is a typed object you ideally don't have to think about > field names. > > however there are operations on Dataset that

Re: Parameterized types and Datasets - Spark 2.1.0

2017-02-01 Thread Michael Armbrust
wing code snippet so the import spark.implicits._ > would take effect: > > // ugly hack to get around Encoder can't be found compile time errors > > private object myImplicits extends SQLImplicits { > > protected override def _sqlContext: SQLContext = MySparkSingleton. > get

Re: Parameterized types and Datasets - Spark 2.1.0

2017-02-01 Thread Michael Armbrust
You need to enforce that an Encoder is available for the type A using a context bound. import org.apache.spark.sql.Encoder abstract class RawTable[A : Encoder](inDir: String) { ... } On Tue, Jan 31, 2017 at 8:12 PM, Don Drake

Re: using withWatermark on Dataset

2017-02-01 Thread Michael Armbrust
Can you give the full stack trace? Also which version of Spark are you running? On Wed, Feb 1, 2017 at 10:38 AM, Jerry Lam wrote: > Hi everyone, > > Anyone knows how to use withWatermark on Dataset? > > I have tried the following but hit this exception: > > dataset

[jira] (SPARK-16454) Consider adding a per-batch transform for structured streaming

2017-01-30 Thread Michael Armbrust (JIRA)
Michael Armbrust commented on SPARK-16454

Re: kafka structured streaming source refuses to read

2017-01-30 Thread Michael Armbrust
Thanks for following up! I've linked the relevant tickets to SPARK-18057 and I targeted it for Spark 2.2. On Sat, Jan 28, 2017 at 10:15 AM, Koert Kuipers wrote: > there was also already an existing spark ticket for

[jira] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.1.0

2017-01-30 Thread Michael Armbrust (JIRA)
Michael Armbrust updated an issue

Re: kafka structured streaming source refuses to read

2017-01-27 Thread Michael Armbrust
Yeah, kafka server client compatibility can be pretty confusing and does not give good errors in the case of mismatches. This should be addressed in the next release of kafka (they are adding an API to query the servers capabilities). On Fri, Jan 27, 2017 at 12:56 PM, Koert Kuipers

Re: printSchema showing incorrect datatype?

2017-01-25 Thread Michael Armbrust
Encoders are just an object-based view on a Dataset. Until you actually materialize an object, they are not used and thus will not change the schema of the dataframe. On Tue, Jan 24, 2017 at 8:28 AM, Koert Kuipers wrote: > scala> val x = Seq("a", "b").toDF("x") > x:

Re: Setting startingOffsets to earliest in structured streaming never catches up

2017-01-23 Thread Michael Armbrust
+1 to Ryan's suggestion of setting maxOffsetsPerTrigger. This way you can at least see how quickly it is making progress towards catching up. On Sun, Jan 22, 2017 at 7:02 PM, Timothy Chan wrote: > I'm using version 2.02. > > The difference I see between using latest and

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Michael Armbrust
+1, we should just fix the error to explain why months aren't allowed and suggest that you manually specify some number of days. On Wed, Jan 18, 2017 at 9:52 AM, Maciej Szymkiewicz wrote: > Thanks for the response Burak, > > As any sane person I try to steer away from

[jira] [Updated] (SPARK-18682) Batch Source for Kafka

2017-01-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18682: - Assignee: Tyson Condie > Batch Source for Ka

[jira] [Closed] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2017-01-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-18475. Resolution: Won't Fix > Be able to provide higher parallelization for StructuredStream

[jira] [Updated] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2017-01-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18970: - Description: Spark streaming application uses S3 files as streaming sources. After

Re: Dataset Type safety

2017-01-10 Thread Michael Armbrust
> > As I've specified *.as[Person]* which does schema inference then > *"option("inferSchema","true")" *is redundant and not needed! The resolution of fields is done by name, not by position for case classes. This is what allows us to support more complex things like JSON or nested structures.

Re: StateStoreSaveExec / StateStoreRestoreExec

2017-01-03 Thread Michael Armbrust
You might also be interested in this: https://issues.apache.org/jira/browse/SPARK-19031 On Tue, Jan 3, 2017 at 3:36 PM, Michael Armbrust <mich...@databricks.com> wrote: > I think we should add something similar to mapWithState in 2.2. It would > be great if you could add the descrip

[jira] [Created] (SPARK-19067) mapWithState Style API

2017-01-03 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-19067: Summary: mapWithState Style API Key: SPARK-19067 URL: https://issues.apache.org/jira/browse/SPARK-19067 Project: Spark Issue Type: New Feature
