[jira] [Comment Edited] (SPARK-18227) Parquet file stream sink create a hidden directory "_spark_metadata" cause the DataFrame read from directory failed

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651778#comment-15651778 ] Michael Armbrust edited comment on SPARK-18227 at 11/9/16 7:12 PM

[jira] [Commented] (SPARK-18227) Parquet file stream sink create a hidden directory "_spark_metadata" cause the DataFrame read from directory failed

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651778#comment-15651778 ] Michael Armbrust commented on SPARK-18227: -- The {{_spark_metadata}} directory holds

[jira] [Resolved] (SPARK-18227) Parquet file stream sink create a hidden directory "_spark_metadata" cause the DataFrame read from directory failed

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18227. -- Resolution: Not A Problem > Parquet file stream sink create a hidden direct

[jira] [Resolved] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15406. -- Resolution: Done Fix Version/s: 2.1.0 > Structured streaming supp

[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18371: - Component/s: (was: Structured Streaming) DStreams > Spark Stream

[jira] [Updated] (SPARK-17815) Report committed offsets

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17815: - Issue Type: New Feature (was: Sub-task) Parent: (was: SPARK-15406) > Rep

[jira] [Updated] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17344: - Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-15406) > Ka

[jira] [Updated] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.1.0

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18057: - Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-15406) > Upd

[jira] [Updated] (SPARK-18373) Make KafkaSource's failOnDataLoss=false work with Spark jobs

2016-11-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18373: - Target Version/s: 2.1.0 > Make KafkaSource's failOnDataLoss=false work with Spark j

[jira] [Updated] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming

2016-11-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18339: - Labels: (was: correctness) > Don't push down current_timestamp for filt

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-08 Thread Michael Armbrust
+1 On Tue, Nov 8, 2016 at 1:17 PM, Sean Owen wrote: > +1 binding > > (See comments on last vote; same results, except, the regression we > identified is fixed now.) > > > On Tue, Nov 8, 2016 at 6:10 AM Reynold Xin wrote: > >> Please vote on releasing

[jira] [Commented] (SPARK-17691) Add aggregate function to collect list with maximum number of elements

2016-11-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1564#comment-1564 ] Michael Armbrust commented on SPARK-17691: -- +1 > Add aggregate function to collect l

[jira] [Updated] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming

2016-11-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18339: - Target Version/s: 2.2.0 > Don't push down current_timestamp for filt

[jira] [Resolved] (SPARK-18295) Match up to_json to from_json in null safety

2016-11-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18295. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15792

[jira] [Updated] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-11-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15044: - Target Version/s: 2.2.0 > spark-sql will throw "input path does not exist"

[jira] [Reopened] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-11-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-15044: -- I've actually heard this complaint from several different large Hive users. Sometimes

Re: Upgrading to Spark 2.0.1 broke array in parquet DataFrame

2016-11-07 Thread Michael Armbrust
If you can reproduce the issue with Spark 2.0.2 I'd suggest opening a JIRA. On Fri, Nov 4, 2016 at 5:11 PM, Sam Goodwin wrote: > I have a table with a few columns, some of which are arrays. Since > upgrading from Spark 1.6 to Spark 2.0.1, the array fields are always

Re: NoSuchElementException

2016-11-07 Thread Michael Armbrust
What are you trying to do? It looks like you are mixing multiple SparkContexts together. On Fri, Nov 4, 2016 at 5:15 PM, Lev Tsentsiper wrote: > My code throws an exception when I am trying to create new DataSet from > within SteamWriter sink > > Simplified version

[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637622#comment-15637622 ] Michael Armbrust commented on SPARK-18277: -- We've been talking about better support for nested

[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637595#comment-15637595 ] Michael Armbrust commented on SPARK-18258: -- What sort of failures are you anticipating here

[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637491#comment-15637491 ] Michael Armbrust commented on SPARK-18258: -- I agree that we don't want to lock people in, which

[jira] [Updated] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18258: - Target Version/s: 2.2.0 > Sinks need access to offset representat

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Michael Armbrust
to some ordering is an >> operation that can be done efficiently in a single shuffle without first >> figuring out range boundaries. and it is needed for quite a few algos, >> including Window and lots of timeseries stuff. but it seems there is no way >> to express

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-04 Thread Michael Armbrust
+1 On Tue, Nov 1, 2016 at 9:51 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.2. The vote is open until Fri, Nov 4, 2016 at 22:00 PDT and passes if a > majority of at least 3+1 PMC votes are cast. > > [ ] +1 Release

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
s also a total sort under the hood, but it's on >> (hashCode(key), secondarySortColumn) which is easier to distribute and >> therefore can be implemented more efficiently. >> >> >> >> >> >> On Thu, Nov 3, 2016 at 8:59 PM, Michael Armbrust <mich...@databricks.

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
> > It is still unclear to me why we should remember all these tricks (or add > lots of extra little functions) when this elegantly can be expressed in a > reduce operation with a simple one-line lambda function. > I think you can do that too. KeyValueGroupedDataset has a reduceGroups function.
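The group-then-reduce idea in this reply can be illustrated without Spark. What follows is a minimal plain-Python sketch of the semantics only, assuming records are (key, value) tuples. The helper name `reduce_groups` and the sample data are hypothetical and are not Spark's `KeyValueGroupedDataset.reduceGroups` itself.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter


def reduce_groups(records, key_fn, reduce_fn):
    """Group records by key_fn, then fold each group with reduce_fn.

    Hypothetical helper that mimics the shape of a group-then-reduce
    operation; it is not part of any Spark API.
    """
    ordered = sorted(records, key=key_fn)  # groupby needs contiguous keys
    return {
        key: reduce(reduce_fn, group)
        for key, group in groupby(ordered, key=key_fn)
    }


# Example data (made up): (user, timestamp) pairs.
events = [("a", 3), ("b", 7), ("a", 9), ("b", 2)]

# Keep the latest event per user with a one-line lambda.
latest = reduce_groups(
    events,
    key_fn=itemgetter(0),
    reduce_fn=lambda x, y: x if x[1] >= y[1] else y,
)
```

The reduce function must be associative and commutative for the result to be independent of how a distributed engine orders the pairwise merges; the "keep the larger timestamp" lambda above satisfies that as long as timestamps are unique per key.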

[jira] [Updated] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe

2016-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18260: - Target Version/s: 2.1.0 Priority: Blocker (was: Major) > from_json

[jira] [Commented] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe

2016-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634766#comment-15634766 ] Michael Armbrust commented on SPARK-18260: -- We should return null if the input is null

[jira] [Updated] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets

2016-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18212: - Assignee: Cody Koeninger > Flaky test: org.apache.spark.sql.kafka

[jira] [Resolved] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets

2016-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18212. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15737

[jira] [Updated] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18254: - Target Version/s: 2.1.0 > UDFs don't see aliased column na

[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634122#comment-15634122 ] Michael Armbrust commented on SPARK-18254: -- Is this yet another bug caused by the generic

Re: incomplete aggregation in a GROUP BY

2016-11-03 Thread Michael Armbrust
Sounds like a bug, if you can reproduce on 1.6.3 (currently being voted on), then please open a JIRA. On Thu, Nov 3, 2016 at 8:05 AM, Donald Matthews wrote: > While upgrading a program from Spark 1.5.2 to Spark 1.6.2, I've run into a > HiveContext GROUP BY that no longer

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
You are looking to perform an *argmax*, which you can do with a single aggregation. Here is an example. On Thu, Nov 3, 2016 at 4:53
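To make the "argmax in a single aggregation" idea concrete, here is a plain-Python sketch under stated assumptions: rows are dictionaries, and `argmax_by_key`, the column names, and the sample rows are all made up for illustration. It is not the example linked from the original message, and it does not use any Spark API.

```python
def argmax_by_key(rows, key, metric):
    """Return, for each key, the full row with the largest metric.

    Single pass: one running "best" row per key, with no sort step.
    """
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[metric] > best[k][metric]:
            best[k] = row
    return best


# Example data (made up for illustration).
rows = [
    {"user": "a", "ts": 3, "page": "home"},
    {"user": "b", "ts": 7, "page": "cart"},
    {"user": "a", "ts": 9, "page": "search"},
]

result = argmax_by_key(rows, key="user", metric="ts")
```

The point of the design is that no ordering of the input is required: each key carries only one candidate row as state, which is why this can be a single aggregation rather than an order-by followed by a group-by.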

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17937: - Priority: Critical (was: Major) > Clarify Kafka offset semantics for Structu

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17937: - Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-15406

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17937: - Target Version/s: 2.1.0 > Clarify Kafka offset semantics for Structured Stream

[jira] [Commented] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634061#comment-15634061 ] Michael Armbrust commented on SPARK-17937: -- I'm going to pull this out from the parent JIRA as I

Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Michael Armbrust
+1 On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a > majority of at least 3+1 PMC votes are cast. > > [ ] +1 Release

Re: Structured streaming aggregation - update mode

2016-11-02 Thread Michael Armbrust
Yeah, agreed. As mentioned here, it's near the top of my list. I just opened SPARK-18234 to track. On Wed, Nov 2, 2016 at 3:24 PM, Cristian Opris wrote: > Hi, > > I've

[jira] [Created] (SPARK-18234) Update mode

2016-11-02 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18234: Summary: Update mode Key: SPARK-18234 URL: https://issues.apache.org/jira/browse/SPARK-18234 Project: Spark Issue Type: New Feature

Re: How to return a case class in map function?

2016-11-02 Thread Michael Armbrust
That's a bug. Which version of Spark are you running? Have you tried 2.0.2? On Wed, Nov 2, 2016 at 12:01 AM, 颜发才(Yan Facai) wrote: > Hi, all. > When I use a case class as the return value in a map function, Spark always > raises a ClassCastException. > > I wrote a demo, like: > >

Re: error: Unable to find encoder for type stored in a Dataset. when trying to map through a DataFrame

2016-11-02 Thread Michael Armbrust
Spark doesn't know how to turn a Seq[Any] back into a row. You would need to create a case class or something where we can figure out the schema. What are you trying to do? If you don't care about specific fields and you just want to serialize the type you can use kryo: implicit val anyEncoder

[jira] [Commented] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets

2016-11-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15628170#comment-15628170 ] Michael Armbrust commented on SPARK-18212: -- +1 to upping the timeout. We run with {{.option

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Michael Armbrust
ael, > > > > Thanks for the reply. > > > > The following link says there is a open unresolved Jira for Structured > > streaming support for consuming from Kafka. > > > > https://issues.apache.org/jira/browse/SPARK-15406 > > > > Appreciate your

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17937: - Component/s: Structured Streaming > Clarify Kafka offset semantics for Structu

[jira] [Updated] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.1.0

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18057: - Component/s: Structured Streaming > Update structured streaming kafka from 10.

[jira] [Updated] (SPARK-17343) Prerequisites for Kafka 0.8 support in Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17343: - Component/s: (was: DStreams) Structured Streaming > Prerequisi

[jira] [Updated] (SPARK-17837) Disaster recovery of offsets from WAL

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17837: - Component/s: Structured Streaming > Disaster recovery of offsets from

[jira] [Updated] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17834: - Component/s: (was: SQL) Structured Streaming > Fetch the earli

[jira] [Updated] (SPARK-17346) Kafka 0.10 support in Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17346: - Component/s: (was: DStreams) Structured Streaming > Kafka 0

[jira] [Updated] (SPARK-17345) Prerequisites for Kafka 0.10 support in Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17345: - Component/s: (was: DStreams) Structured Streaming > Prerequisi

[jira] [Updated] (SPARK-17815) Report committed offsets

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17815: - Component/s: (was: SQL) Structured Streaming > Report commit

[jira] [Updated] (SPARK-17812) More granular control of starting offsets (assign)

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17812: - Component/s: (was: SQL) Structured Streaming > More granu

[jira] [Updated] (SPARK-17813) Maximum data per trigger

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17813: - Component/s: (was: SQL) Structured Streaming > Maximum data

[jira] [Updated] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17344: - Component/s: (was: DStreams) Structured Streaming > Kafka

[jira] [Closed] (SPARK-17345) Prerequisites for Kafka 0.10 support in Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-17345. Resolution: Fixed > Prerequisites for Kafka 0.10 support in Structured Stream

[jira] [Updated] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15406: - Component/s: Structured Streaming > Structured streaming support for consuming f

[jira] [Resolved] (SPARK-18025) Port streaming to use the commit protocol API

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18025. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15710

[jira] [Updated] (SPARK-18025) Port streaming to use the commit protocol API

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18025: - Component/s: Structured Streaming > Port streaming to use the commit protocol

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Michael Armbrust
I'm not aware of any open issues against the kafka source for structured streaming. On Tue, Nov 1, 2016 at 4:45 PM, shyla deshpande wrote: > I am building a data pipeline using Kafka, Spark streaming and Cassandra. > Wondering if the issues with Kafka source fixed in

[jira] [Resolved] (SPARK-16411) Add textFile API to structured streaming.

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-16411. -- Resolution: Fixed Assignee: Prashant Sharma Fix Version/s: 2.1.0 >

[jira] [Commented] (SPARK-16738) Queryable state for Spark State Store

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627148#comment-15627148 ] Michael Armbrust commented on SPARK-16738: -- You can already query the state store today, if you

[jira] [Updated] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18187: - Priority: Critical (was: Major) > CompactibleFileStreamLog should not r

[jira] [Commented] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627134#comment-15627134 ] Michael Armbrust commented on SPARK-18187: -- I think the configuration should only be used when

[jira] [Commented] (SPARK-16454) Consider adding a per-batch transform for structured streaming

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627125#comment-15627125 ] Michael Armbrust commented on SPARK-16454: -- What specifically is missing from the {{foreach

[jira] [Updated] (SPARK-17879) Don't compact metadata logs constantly into a single compacted file

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17879: - Target Version/s: 2.1.0 > Don't compact metadata logs constantly into a single compac

[jira] [Commented] (SPARK-17879) Don't compact metadata logs constantly into a single compacted file

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627116#comment-15627116 ] Michael Armbrust commented on SPARK-17879: -- What is the status here? From my perspective I

[jira] [Updated] (SPARK-17475) HDFSMetadataLog should not leak CRC files

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17475: - Affects Version/s: 2.0.1 Target Version/s: 2.1.0 > HDFSMetadataLog should not l

[jira] [Updated] (SPARK-17475) HDFSMetadataLog should not leak CRC files

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17475: - Assignee: Frederick Reiss > HDFSMetadataLog should not leak CRC fi

[jira] [Updated] (SPARK-18124) Observed delay based event time watermarks

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18124: - Target Version/s: 2.1.0 > Observed delay based event time waterma

[jira] [Updated] (SPARK-18124) Observed delay based event time watermarks

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18124: - Summary: Observed delay based event time watermarks (was: Implement watermarking

[jira] [Resolved] (SPARK-8360) Structured Streaming (aka Streaming DataFrames)

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-8360. - Resolution: Implemented Assignee: Michael Armbrust Fix Version/s: 2.1.0

[jira] [Closed] (SPARK-10823) API design: external state management

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-10823. Resolution: Later We have a state store implemented today. I think we should revisit

[jira] [Updated] (SPARK-10815) Public API: Streaming Sources and Sinks

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10815: - Priority: Critical (was: Major) Summary: Public API: Streaming Sources and Sinks

[jira] [Updated] (SPARK-10815) API design: data sources and sinks

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10815: - Issue Type: New Feature (was: Sub-task) Parent: (was: SPARK-8360) >

[jira] [Commented] (SPARK-10816) EventTime based sessionization

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627072#comment-15627072 ] Michael Armbrust commented on SPARK-10816: -- Windows were added in [SPARK-14160], so I'm going

[jira] [Updated] (SPARK-10816) API design: window and session specification

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10816: - Issue Type: New Feature (was: Sub-task) Parent: (was: SPARK-8360) >

[jira] [Updated] (SPARK-10816) EventTime based sessionization

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10816: - Target Version/s: 2.2.0 > EventTime based sessionizat

[jira] [Updated] (SPARK-10816) EventTime based sessionization

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10816: - Summary: EventTime based sessionization (was: API design: window and session

[jira] [Updated] (SPARK-15472) Add support for writing partitioned `csv`, `json`, `text` formats in Structured Streaming

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15472: - Issue Type: New Feature (was: Sub-task) Parent: (was: SPARK-8360) >

[jira] [Updated] (SPARK-17829) Stable format for offset log

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17829: - Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-8360) > Sta

[jira] [Updated] (SPARK-18124) Implement watermarking for handling late data

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18124: - Issue Type: New Feature (was: Sub-task) Parent: (was: SPARK-8360

[jira] [Updated] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18187: - Issue Type: Bug (was: Sub-task) Parent: (was: SPARK-8360

[jira] [Updated] (SPARK-18144) StreamingQueryListener.QueryStartedEvent is not written to event log

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18144: - Issue Type: Bug (was: Sub-task) Parent: (was: SPARK-8360

[jira] [Commented] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627050#comment-15627050 ] Michael Armbrust commented on SPARK-18212: -- /cc [~c...@koeninger.org] > Flaky t

Re: JIRA Components for Streaming

2016-11-01 Thread Michael Armbrust
I did this <https://issues.apache.org/jira/browse/SPARK/component/12331043/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-issues-panel>. Please help me correct any issues I may have missed. On Mon, Oct 31, 2016 at 11:37 AM, Michael Armbrust <mich...@databricks.com> w

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael Armbrust
registerTempTable is backed by an in-memory hash table that maps table name (a string) to a logical query plan. Fragments of that logical query plan may or may not be cached (but calling register alone will not result in any materialization of results). In Spark 2.0 we renamed this function to
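The behaviour described in this reply (a name-to-plan hash table, with no materialization at registration time) can be modeled in a few lines. This is a toy plain-Python sketch, not Spark's catalog implementation: `TempViewRegistry` is a hypothetical class, and a "logical plan" is modeled as a zero-argument callable.

```python
class TempViewRegistry:
    """Toy name -> plan registry; a plan is modeled as a zero-argument callable."""

    def __init__(self):
        self._plans = {}  # in-memory hash table: table name -> plan

    def register(self, name, plan):
        # Only stores the plan; nothing is computed here.
        self._plans[name] = plan

    def query(self, name):
        # The plan runs only when the table is actually queried.
        return self._plans[name]()


calls = []  # records each time the plan is executed


def expensive_plan():
    calls.append("ran")
    return [x * x for x in range(5)]


registry = TempViewRegistry()
registry.register("squares", expensive_plan)
ran_on_register = len(calls)  # registering alone did not execute the plan
rows = registry.query("squares")
```

In this model, caching would be a separate, explicit step layered on top of the registry, which matches the reply's point that registration by itself says nothing about whether results are held in memory.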

[jira] [Updated] (SPARK-17267) Long running structured streaming requirements

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17267: - Component/s: (was: DStreams) (was: SQL

[jira] [Updated] (SPARK-8360) Structured Streaming (aka Streaming DataFrames)

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8360: Component/s: (was: DStreams) (was: SQL) Structured

[jira] [Updated] (SPARK-17604) Support purging aged file entry for FileStreamSource metadata log

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17604: - Issue Type: New Feature (was: Sub-task) Parent: (was: SPARK-17267

[jira] [Commented] (SPARK-17604) Support purging aged file entry for FileStreamSource metadata log

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626803#comment-15626803 ] Michael Armbrust commented on SPARK-17604: -- I think it would be good for us to support data

[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626800#comment-15626800 ] Michael Armbrust commented on SPARK-17935: -- You should be able to use casts (which I'd expect

[jira] [Resolved] (SPARK-17764) to_json function for parsing Structs to json Strings

2016-11-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-17764. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15354

JIRA Components for Streaming

2016-10-31 Thread Michael Armbrust
I'm planning to do a little maintenance on JIRA to hopefully improve the visibility into the progress / gaps in Structured Streaming. In particular, while we share a lot of optimization / execution logic with SQL, the set of desired features and bugs is fairly different. Proposal: - Structured

[jira] [Assigned] (SPARK-18124) Implement watermarking for handling late data

2016-10-28 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-18124: Assignee: Michael Armbrust (was: Tathagata Das) > Implement watermark

[jira] [Updated] (SPARK-18147) Broken Spark SQL Codegen

2016-10-27 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18147: - Target Version/s: 2.1.0 Priority: Critical (was: Minor) > Broken Spark

[jira] [Commented] (SPARK-18147) Broken Spark SQL Codegen

2016-10-27 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613555#comment-15613555 ] Michael Armbrust commented on SPARK-18147: -- /cc [~cloud_fan] > Broken Spark SQL Code

Re: importing org.apache.spark.Logging class

2016-10-27 Thread Michael Armbrust
This was made internal to Spark. I'd suggest that you use slf4j directly. On Thu, Oct 27, 2016 at 2:42 PM, Reth RM wrote: > Updated spark to version 2.0.0 and have issue with importing > org.apache.spark.Logging > > Any suggested fix for this issue? >

Re: Reading AVRO from S3 - No parallelism

2016-10-27 Thread Michael Armbrust
How big are your avro files? We collapse many small files into a single partition to eliminate scheduler overhead. If you need explicit parallelism you can also repartition. On Thu, Oct 27, 2016 at 5:19 AM, Prithish wrote: > I am trying to read a bunch of AVRO files from a
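
The "collapse many small files into a single partition" behaviour mentioned above can be illustrated with a simple greedy packing sketch. It is an assumption-laden illustration, not Spark's actual planner logic: `pack_files` and `max_partition_bytes` are hypothetical names, and the sizes are made up.

```python
from typing import List


def pack_files(sizes: List[int], max_partition_bytes: int) -> List[List[int]]:
    """Greedily group file sizes into partitions up to a byte budget."""
    partitions: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in sizes:
        # Start a new partition once adding this file would exceed the budget.
        if current and current_bytes + size > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions


# Eight small files fit in one partition, so there is one task and no parallelism.
small_files = [10] * 8
packed = pack_files(small_files, max_partition_bytes=128)

# Larger files each get their own partition.
large_packed = pack_files([100, 100, 100], max_partition_bytes=128)
```

Under this model, a directory of small files yields very few partitions, which matches the reported lack of parallelism; as the reply notes, an explicit repartition after reading is one way to spread the work out again.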
