[jira] [Commented] (SPARK-12141) Use Jackson to serialize all events when writing event log

2016-05-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291971#comment-15291971 ] Michael Armbrust commented on SPARK-12141: -- My issue with the catch-all case that was added

Re: [Spark 2.0 state store] Streaming wordcount using spark state store

2016-05-18 Thread Michael Armbrust
The state store for structured streaming is an internal concept, and isn't designed to be consumed by end users. I'm hoping to write some documentation on how to do aggregation, but support for reading from Kafka and other sources will likely come in Spark 2.1+ On Wed, May 18, 2016 at 3:50 AM,

[jira] [Updated] (SPARK-15384) Codegen CompileException "mapelements_isNull" is not an rvalue

2016-05-18 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15384: - Assignee: Wenchen Fan Target Version/s: 2.0.0 > Codegen CompileExcept

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-18 Thread Michael Armbrust
+1, excited for 2.0! On Wed, May 18, 2016 at 10:06 AM, Krishna Sankar wrote: > +1. Looks Good. > The mllib results are in line with 1.6.1. Deprecation messages. I will > convert to ml and test later in the day. > Also will try GraphX exercises for our Strata London Tutorial

Re: CompileException for spark-sql generated code in 2.0.0-SNAPSHOT

2016-05-17 Thread Michael Armbrust
Yeah, can you open a JIRA with that reproduction please? You can ping me on it. On Tue, May 17, 2016 at 4:55 PM, Reynold Xin wrote: > It seems like the problem here is that we are not using unique names > for mapelements_isNull? > > > > On Tue, May 17, 2016 at 3:29 PM,

[jira] [Updated] (SPARK-15367) Add refreshTable back

2016-05-17 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15367: - Priority: Critical (was: Major) > Add refreshTable b

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Michael Armbrust
In 2.0 you won't be able to do this. The long term vision would be to make this possible, but a window will be required (like the 24 hours you suggest). On Tue, May 17, 2016 at 1:36 AM, Todd wrote: > Hi, > We have a requirement to do count(distinct) in a processing batch

Re: Inferring schema from GenericRowWithSchema

2016-05-17 Thread Michael Armbrust
I don't think that you will be able to do that. ScalaReflection is based on the TypeTag of the object, and thus the schema of any particular object won't be available to it. Instead I think you want to use the register functions in UDFRegistration that take a schema. Does that make sense? On

[jira] [Resolved] (SPARK-10216) Avoid creating empty files during overwrite into Hive table with group by query

2016-05-17 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10216. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12855

Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-13 Thread Michael Armbrust
+1 to the general structure of Reynold's proposal. I've found what we do currently a little confusing. In particular, it doesn't make much sense that @DeveloperApi things are always labeled as possibly changing. For example the Data Source API should arguably be one of the most stable

Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Michael Armbrust
> > > logical plan after optimizer execution: > > Project [id#0L,id#1L] > !+- Filter (id#0L = cast(1 as bigint)) > ! +- Join Inner, Some((id#0L = id#1L)) > ! :- Subquery t > ! : +- Relation[id#0L] JSONRelation > ! +- Subquery u > ! +- Relation[id#1L] JSONRelation >

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Michael Armbrust
That is a forward looking design doc and not all of it has been implemented yet. With Spark 2.0 the main sources and sinks will be file based, though we hope to quickly expand that now that a lot of infrastructure is in place. On Fri, May 6, 2016 at 2:11 PM, Ted Yu wrote:

[jira] [Commented] (SPARK-15140) ensure input object of encoder is not null

2016-05-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273100#comment-15273100 ] Michael Armbrust commented on SPARK-15140: -- The 2.0 behavior seems correct. Ideally .toDS

[jira] [Updated] (SPARK-14959) Problem Reading partitioned ORC or Parquet files

2016-05-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14959: - Priority: Critical (was: Major) > Problem Reading partitioned ORC or Parquet fi

[jira] [Updated] (SPARK-14959) Problem Reading partitioned ORC or Parquet files

2016-05-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14959: - Target Version/s: 2.0.0 Component/s: (was: Input/Output

Re: Accessing JSON array in Spark SQL

2016-05-05 Thread Michael Armbrust
use df.selectExpr to evaluate complex expression (instead of just column names). On Thu, May 5, 2016 at 11:53 AM, Xinh Huynh wrote: > Hi, > > I am having trouble accessing an array element in JSON data with a > dataframe. Here is the schema: > > val json1 = """{"f1":"1",

[jira] [Resolved] (SPARK-15077) StreamExecution.awaitOffset may take too long because of thread starvation

2016-05-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15077. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12852

[jira] [Updated] (SPARK-15062) Show on DataFrame causes OutOfMemoryError, NegativeArraySizeException or segfault

2016-05-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15062: - Assignee: Bo Meng > Show on DataFrame causes OutOfMemoryEr

[jira] [Resolved] (SPARK-15062) Show on DataFrame causes OutOfMemoryError, NegativeArraySizeException or segfault

2016-05-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15062. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12849

[jira] [Resolved] (SPARK-14747) Add assertStreaming/assertNoneStreaming checks in DataFrameWriter

2016-05-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14747. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12521

[jira] [Resolved] (SPARK-14830) Add RemoveRepetitionFromGroupExpressions optimizer

2016-05-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14830. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12590

[jira] [Resolved] (SPARK-14579) Fix a race condition in StreamExecution.processAllAvailable

2016-05-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14579. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12582

[jira] [Resolved] (SPARK-14637) object expressions cleanup

2016-05-02 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14637. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12399

[jira] [Resolved] (SPARK-14970) DataSource enumerates all files in FileCatalog to infer schema even if there is user specified schema

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14970. -- Resolution: Fixed Fix Version/s: 2.0.0 > DataSource enumerates all fi

[jira] [Updated] (SPARK-14997) Files in subdirectories are incorrectly considered in sqlContext.read.json()

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14997: - Labels: regresion (was: ) > Files in subdirectories are incorrectly conside

[jira] [Updated] (SPARK-15011) org.apache.spark.sql.hive.StatisticsSuite.analyze MetastoreRelations fails when hadoop 2.3 or hadoop 2.4 is used

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15011: - Priority: Critical (was: Major) > org.apache.spark.sql.hive.StatisticsSuite.anal

[jira] [Updated] (SPARK-14993) Inconsistent behavior of partitioning discovery

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14993: - Priority: Critical (was: Major) > Inconsistent behavior of partitioning discov

[jira] [Resolved] (SPARK-14337) Push down casts beneath CaseWhen and If expressions

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14337. -- Resolution: Later Target Version/s: (was: 2.0.0) Closing as later since

[jira] [Resolved] (SPARK-14981) CatalogTable should contain sorting directions of sorting columns

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14981. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12759

[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264836#comment-15264836 ] Michael Armbrust commented on SPARK-6817: - [~shivaram] Still trying to get any of this in Spark 2.0

[jira] [Updated] (SPARK-14997) Files in subdirectories are incorrectly considered in sqlContext.read.json()

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14997: - Priority: Critical (was: Major) > Files in subdirectories are incorrectly conside

[jira] [Updated] (SPARK-12854) Vectorize Parquet reader

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12854: - Assignee: Nong Li > Vectorize Parquet rea

[jira] [Resolved] (SPARK-12854) Vectorize Parquet reader

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12854. -- Resolution: Fixed Closing since all subtasks are done > Vectorize Parquet rea

[jira] [Updated] (SPARK-13421) Make output of a SparkPlan configurable

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13421: - Target Version/s: 2.1.0 (was: 2.0.0) > Make output of a SparkPlan configura

[jira] [Resolved] (SPARK-12852) Support create table DDL with bucketing

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12852. -- Resolution: Later > Support create table DDL with bucket

[jira] [Resolved] (SPARK-12849) Bucketing improvements follow-up

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12849. -- Resolution: Later Target Version/s: (was: 2.0.0) > Bucketing improveme

[jira] [Resolved] (SPARK-12851) Add the ability to understand tables bucketed by Hive

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12851. -- Resolution: Later > Add the ability to understand tables bucketed by H

[jira] [Resolved] (SPARK-13571) Track current database in SQL/HiveContext

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13571. -- Resolution: Fixed I think this was done as part of another PR. > Track curr

[jira] [Resolved] (SPARK-13424) Improve test coverage of EnsureRequirements

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13424. -- Resolution: Later Closing along with the PR, reopen when you have time. > Impr

[jira] [Updated] (SPARK-14273) Add FileFormat.isSplittable to indicate whether a format is splittable

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14273: - Target Version/s: 2.1.0 (was: 2.0.0) > Add FileFormat.isSplittable to indicate whet

[jira] [Updated] (SPARK-14237) De-duplicate partition value appending logic in various buildReader() implementations

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14237: - Target Version/s: 2.1.0 (was: 2.0.0) > De-duplicate partition value appending lo

[jira] [Updated] (SPARK-14237) De-duplicate partition value appending logic in various buildReader() implementations

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14237: - Parent Issue: SPARK-13682 (was: SPARK-13664) > De-duplicate partition value append

[jira] [Updated] (SPARK-13683) Finalize the public interface for OutputWriter[Factory]

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13683: - Target Version/s: 2.1.0 (was: 2.0.0) > Finalize the public interface for OutputWri

[jira] [Updated] (SPARK-14273) Add FileFormat.isSplittable to indicate whether a format is splittable

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14273: - Parent Issue: SPARK-13682 (was: SPARK-13664) > Add FileFormat.isSplittable to indic

[jira] [Updated] (SPARK-13683) Finalize the public interface for OutputWriter[Factory]

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13683: - Parent Issue: SPARK-13682 (was: SPARK-13664) > Finalize the public interf

[jira] [Resolved] (SPARK-15016) Simplify and Speedup HadoopFSRelation (follow-up)

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15016. -- Resolution: Duplicate > Simplify and Speedup HadoopFSRelation (follow

[jira] [Updated] (SPARK-13682) Finalize the public API for FileFormat

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13682: - Target Version/s: 2.1.0 (was: 2.0.0) > Finalize the public API for FileFor

[jira] [Updated] (SPARK-13682) Finalize the public API for FileFormat

2016-04-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13682: - Issue Type: New Feature (was: Sub-task) Parent: (was: SPARK-13664

Re: How can I bucketize / group a DataFrame from parquet files?

2016-04-27 Thread Michael Armbrust
Unfortunately, I don't think there is an easy way to do this in 1.6. In Spark 2.0 we will make DataFrame = Dataset[Row], so this should work out of the box. On Mon, Apr 25, 2016 at 11:08 PM, Brandon White wrote: > I am creating a dataFrame from parquet files. The

[jira] [Resolved] (SPARK-14874) Remove the obsolete Batch representation

2016-04-27 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14874. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12638

Re: XML Data Source for Spark

2016-04-25 Thread Michael Armbrust
You are using a version of the library that was compiled for a different version of Scala than the version of Spark that you are using. Make sure that they match up. On Mon, Apr 25, 2016 at 5:19 PM, Mohamed ismail wrote: > here is an example with code. >

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-25 Thread Michael Armbrust
Spark SQL's query planner has always delayed building the RDD, so has never needed to eagerly calculate the range boundaries (since Spark 1.0). On Mon, Apr 25, 2016 at 2:04 AM, Praveen Devarao wrote: > Thanks Reynold for the reason as to why sortBykey invokes a Job > >

Re: Defining case class within main method throws "No TypeTag available for Accounts"

2016-04-25 Thread Michael Armbrust
When you define a class inside of a method, it implicitly has a pointer to the outer scope of the method. Spark doesn't have access to this scope, so this makes it hard (impossible?) for us to construct new instances of that class. So, define your classes that you plan to use with Spark at the

Re: Dataset aggregateByKey equivalent

2016-04-23 Thread Michael Armbrust
Have you looked at aggregators? https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html On Fri, Apr 22, 2016 at 6:45 PM, Lee Becker wrote: > Is there a way to do aggregateByKey on Datasets the way one can on an RDD? > > Consider the

[jira] [Resolved] (SPARK-14678) Add a file sink log to support versioning and compaction

2016-04-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14678. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12435

[jira] [Updated] (SPARK-14767) Codegen "no constructor found" errors with Maps inside case classes in Datasets

2016-04-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14767: - Priority: Critical (was: Major) > Codegen "no constructor found" err

[jira] [Resolved] (SPARK-14741) Streaming from partitioned directory structure captures unintended partition columns

2016-04-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14741. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12517

[jira] [Resolved] (SPARK-14555) Python API for methods introduced for Structured Streaming

2016-04-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14555. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12320

[jira] [Resolved] (SPARK-13929) Use Scala reflection for UDFs

2016-04-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13929. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12149

Re: prefix column Spark

2016-04-19 Thread Michael Armbrust
A few comments: - Each withColumnRenamed is adding a new level to the logical plan. We have optimized this significantly in newer versions of Spark, but it is still not free. - Transforming to an RDD is going to do a fairly expensive conversion back and forth between the internal binary format. -

Re: Will nested field performance improve?

2016-04-15 Thread Michael Armbrust
> > If we expect fields nested in structs to always be much slower than flat > fields, then I would be keen to address that in our ETL pipeline with a > flattening step. If it's a known issue that we expect will be fixed in > upcoming releases, I'll hold off. > The difference might be even larger

Re: Skipping Type Conversion and using InternalRows for UDF

2016-04-15 Thread Michael Armbrust
This would also probably improve performance: https://github.com/apache/spark/pull/9565 On Fri, Apr 15, 2016 at 8:44 AM, Hamel Kothari wrote: > Hi all, > > So we have these UDFs which take <1ms to operate and we're seeing pretty > poor performance around them in

[jira] [Updated] (SPARK-14648) Spark EC2 script creates cluster but spark is not installed properly.

2016-04-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14648: - Assignee: Josh Rosen > Spark EC2 script creates cluster but spark is not instal

Re: Strange bug: Filter problem with parenthesis

2016-04-13 Thread Michael Armbrust
You need to use `backticks` to reference columns that have non-standard characters. On Wed, Apr 13, 2016 at 6:56 AM, wrote: > Hi, > > I am debugging a program, and for some reason, a line calling the > following is failing: > > df.filter("sum(OpenAccounts) >
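The quoting rule can be tried without a Spark cluster. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption: SQLite happens to accept the same backtick quoting for identifiers). The table name, the sample values and the threshold of 5 are hypothetical; only the column name sum(OpenAccounts) comes from the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The column name literally contains parentheses, as in the thread.
# Table name and values are made up for illustration.
conn.execute("CREATE TABLE accounts (`sum(OpenAccounts)` INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?)", [(3,), (8,), (12,)])

# Quoted: the whole name is treated as a single column identifier.
quoted = conn.execute(
    "SELECT * FROM accounts WHERE `sum(OpenAccounts)` > 5"
).fetchall()

# Unquoted: the parser reads this as a call to sum() on a column
# named OpenAccounts, which does not exist in this table.
try:
    conn.execute(
        "SELECT * FROM accounts WHERE sum(OpenAccounts) > 5"
    ).fetchall()
    unquoted_error = None
except sqlite3.OperationalError as exc:
    unquoted_error = str(exc)

print(sorted(quoted))
print(unquoted_error)
```

In Spark the same idea would apply to the filter string itself, with the backticks placed around the column name inside the expression.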

Re: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Armbrust
> How would you connect to Hive for some data and then reach out to lets say > Oracle or DB2 for some other data that you may want but isn’t available on > your cluster? > > > On Apr 12, 2016, at 10:52 AM, Michael Armbrust <mich...@databricks.com> > wrote: > > You can, bu

[jira] [Updated] (SPARK-13753) Column nullable is derived incorrectly

2016-04-12 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13753: - Description: There is a problem in spark sql to derive nullable column and used

[jira] [Updated] (SPARK-13753) Column nullable is derived incorrectly

2016-04-12 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13753: - Target Version/s: 2.0.0 Priority: Critical (was: Major) > Column nulla

Re: Aggregator support in DataFrame

2016-04-12 Thread Michael Armbrust
). > > On Mon, Apr 11, 2016 at 10:53 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> saw that, dont think it solves it. i basically want to add some children >> to the expression i guess, to indicate what i am operating on? not sure if >> even makes sense >>

Re: ordering over structs

2016-04-12 Thread Michael Armbrust
> .groupBy("customer_id")\ > .agg(min("vs").alias("final"))\ > .select("customer_id", "final.dt", "final.product") > df.head() > > My log from the non-cached run: > http://pastebin.com/F88sSv1B > > Log fro

[jira] [Resolved] (SPARK-14474) Move FileSource offset log into checkpointLocation

2016-04-12 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14474. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12247

Re: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Armbrust
You can, but I'm not sure why you would want to. If you want to isolate different users just use hiveContext.newSession(). On Tue, Apr 12, 2016 at 1:48 AM, Natu Lauchande wrote: > Hi, > > Is it possible to have both a sqlContext and a hiveContext in the same > application

Re: Aggregator support in DataFrame

2016-04-11 Thread Michael Armbrust
I'll note this interface has changed recently: https://github.com/apache/spark/commit/520dde48d0d52de1710a3275fdd5355dd69d I'm not sure that solves your problem though... On Mon, Apr 11, 2016 at 4:45 PM, Koert Kuipers wrote: > i like the Aggregator a lot

[jira] [Resolved] (SPARK-14494) Fix the race conditions in MemoryStream and MemorySink

2016-04-11 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14494. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12261

Re: ordering over structs

2016-04-08 Thread Michael Armbrust
or saying: > > AttributeError: 'StructType' object has no attribute 'alias' > > Can I do this without aliasing the struct? Or am I doing something > incorrectly? > > > regards, > > imran > > On Wed, Apr 6, 2016 at 4:16 PM, Michael Armbrust

[jira] [Created] (SPARK-14463) read.text broken for partitioned tables

2016-04-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-14463: Summary: read.text broken for partitioned tables Key: SPARK-14463 URL: https://issues.apache.org/jira/browse/SPARK-14463 Project: Spark Issue Type

[jira] [Commented] (SPARK-14463) read.text broken for partitioned tables

2016-04-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230946#comment-15230946 ] Michael Armbrust commented on SPARK-14463: -- [~rxin] > read.text broken for partitioned tab

[jira] [Resolved] (SPARK-14456) Remove unused variables and logics in DataSource

2016-04-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14456. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12237

[jira] [Created] (SPARK-14449) SparkContext should use SparkListenerInterface

2016-04-06 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-14449: Summary: SparkContext should use SparkListenerInterface Key: SPARK-14449 URL: https://issues.apache.org/jira/browse/SPARK-14449 Project: Spark Issue

Re: ordering over structs

2016-04-06 Thread Michael Armbrust
> > Ordering for a struct goes in order of the fields. So the max struct is > the one with the highest TotalValue (and then the highest category > if there are multiple entries with the same hour and total value). > > Is this due to "InterpretedOrdering" in StructType? > That is one
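The field-by-field ordering described in this thread can be illustrated with plain Python tuples, which also compare left to right. This is only a sketch of the comparison rule, not Spark code: the Entry type, its field names and the sample values are hypothetical, and the field order is chosen so that the total value is compared first and the category breaks ties.

```python
from collections import namedtuple

# Hypothetical record type standing in for a Spark struct column.
# Field order matters: total_value is compared before category.
Entry = namedtuple("Entry", ["total_value", "category"])

entries = [
    Entry(10, "books"),
    Entry(25, "games"),
    Entry(25, "music"),
    Entry(7, "toys"),
]

# Tuples compare field by field, so max() picks the highest
# total_value and falls back to category only on a tie.
largest = max(entries)
smallest = min(entries)

print(largest)
print(smallest)
```

Here the two entries with a total of 25 tie on the first field, so the later category name wins for max. Putting a date or score first in the struct is what makes a min or max aggregate over it return the whole matching row.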

Re: ordering over structs

2016-04-06 Thread Michael Armbrust
> > 1) Is a struct in Spark like a struct in C++? > Kinda. It's an ordered collection of data with known names/types. > 2) What is an alias in this context? > It assigns a name to the column, similar to doing AS in SQL. > 3) How does this code even work? > Ordering for a struct

Re: Using an Option[some primitive type] in Spark Dataset API

2016-04-06 Thread Michael Armbrust
> We only define implicits for a subset of the types we support in > SQLImplicits > . > We should probably consider adding Option[T] for common T as the internal > infrastructure

Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-04-06 Thread Michael Armbrust
> > Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text] Use toDS() and you can skip the .as[Text] > Sure! It works with map, but not with select. Wonder if it's by design > or...will soon be fixed? Thanks again for your help. This is by design. select is relational and works with column

[jira] [Resolved] (SPARK-14411) Add a note to warn that onQueryProgress is asynchronous

2016-04-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14411. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12180

[jira] [Resolved] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14402. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12175

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
act scenario when it will prune partitions. I am bit > confused now. Isnt there a way to see the exact partition pruning? > > Thanks > > On Tue, Apr 5, 2016 at 8:59 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> For the in-memory cache, we still launch tasks,

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
9.2 KB (memory) / 1 42.0 B / 1 > 5 32 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 5 ms 2 > ms 5 ms 0 ms 0 ms 0.0 B 60.3 KB (memory) / 1 42.0 B / 1 > 6 33 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 5 ms 3 > ms 4 ms 0 ms 0 ms 0.0 B 70.3 KB (memor

Re: dataframe sorting and find the index of the maximum element

2016-04-05 Thread Michael Armbrust
You should generally think of a DataFrame as unordered, unless you are explicitly asking for an order. One way to order and assign an index is with window functions. On Tue, Apr 5, 2016 at 4:17 AM, Angel
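A minimal sketch of the window-function approach mentioned above, assuming the Spark 2.x API with Spark SQL on the classpath. The column names value and idx and the sample data are illustrative assumptions, not taken from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object WindowIndexSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window-index-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data; "value" is a hypothetical column name.
    val df = Seq(3.0, 9.5, 1.2).toDF("value")

    // An explicit ordering, largest value first. Without partitionBy,
    // Spark moves all rows into a single partition to compute this.
    val byValueDesc = Window.orderBy(col("value").desc)

    // row_number() assigns 1, 2, 3, ... following the window ordering.
    val indexed = df.withColumn("idx", row_number().over(byValueDesc))

    // The row with idx == 1 holds the maximum under this ordering.
    indexed.filter(col("idx") === 1).show()

    spark.stop()
  }
}
```

The unpartitioned window is fine for small data; for large inputs a plain aggregation for the maximum is likely cheaper than ranking every row.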

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
Can you show your full code? How are you partitioning the data? How are you reading it? What is the resulting query plan (run explain() or EXPLAIN)? On Tue, Apr 5, 2016 at 10:02 AM, dsing001 wrote: > HI, > > I am using 1.5.2. I have a dataframe which is partitioned

[jira] [Resolved] (SPARK-14257) Allow multiple continuous queries to be started from the same DataFrame

2016-04-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14257. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12049

[jira] [Updated] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14402: - Target Version/s: 2.0.0 > initcap UDF doesn't match Hive/Oracle behavior in lowercas

[jira] [Updated] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14402: - Labels: releasenotes (was: ) > initcap UDF doesn't match Hive/Oracle behav

[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226779#comment-15226779 ] Michael Armbrust commented on SPARK-14389: -- Are you changing the broadcast threshold? >

[jira] [Resolved] (SPARK-14345) Decouple deserializer expression resolution from ObjectOperator

2016-04-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14345. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12131

[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226744#comment-15226744 ] Michael Armbrust commented on SPARK-14389: -- This is a little surprising since the larger

[jira] [Updated] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14389: - Target Version/s: 2.0.0 > OOM during BroadcastNestedLoopJ

[jira] [Resolved] (SPARK-14287) Method to determine if Dataset is bounded or not

2016-04-04 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14287. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12080

Re: [SQL] Dataset.map gives error: missing parameter type for expanded function?

2016-04-04 Thread Michael Armbrust
It is called groupByKey now. Similar to joinWith, the schema produced by relational joins and aggregations is different than what you would expect when working with objects. So, when combining DataFrame+Dataset we renamed these functions to make this distinction clearer. On Sun, Apr 3, 2016 at
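To illustrate the renamed function, here is a small sketch of groupByKey on a typed Dataset. It assumes the Spark 2.x API with Spark SQL on the classpath; the sample data is an assumption for illustration only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object GroupByKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupByKey-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative input data.
    val words: Dataset[String] = Seq("a", "b", "a").toDS()

    // groupByKey takes a function on the objects, not a column expression,
    // and the result pairs each key with its count.
    val counts: Dataset[(String, Long)] =
      words.groupByKey((w: String) => w).count()

    counts.show()

    // For contrast, the relational groupBy works on columns and
    // returns an untyped DataFrame.
    words.groupBy($"value").count().show()

    spark.stop()
  }
}
```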

[jira] [Resolved] (SPARK-14176) Add processing time trigger

2016-04-04 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14176. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11976
