[jira] [Updated] (SPARK-4366) Aggregation Improvement

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4366: Target Version/s: (was: 1.6.0) > Aggregation Improvem

[jira] [Updated] (SPARK-10129) math function: stddev_samp

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10129: - Target Version/s: (was: 1.6.0) > math function: stddev_s

[jira] [Updated] (SPARK-9218) Falls back to getAllPartitions when getPartitionsByFilter fails

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9218: Target Version/s: (was: 1.6.0) > Falls back to getAllPartitions w

[jira] [Updated] (SPARK-3864) Specialize join for tables with unique integer keys

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3864: Target Version/s: (was: 1.6.0) > Specialize join for tables with unique integer k

[jira] [Updated] (SPARK-3863) Cache broadcasted tables and reuse them across queries

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3863: Target Version/s: (was: 1.6.0) > Cache broadcasted tables and reuse them across quer

[jira] [Commented] (SPARK-6377) Set the number of shuffle partitions for Exchange operator automatically based on the size of input tables and the reduce-side operation.

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987104#comment-14987104 ] Michael Armbrust commented on SPARK-6377: - How does this relate to what we have done? Are we

[jira] [Updated] (SPARK-3860) Improve dimension joins

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3860: Target Version/s: (was: 1.6.0) > Improve dimension jo

[jira] [Updated] (SPARK-11328) Correctly propagate error message in the case of failures when writing parquet

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11328: - Target Version/s: (was: 1.6.0) > Correctly propagate error message in the c

Re: SparkSQL implicit conversion on insert

2015-11-03 Thread Michael Armbrust
Today you have to do an explicit conversion. I'd really like to open up a public UDT interface as part of Spark Datasets (SPARK-) that would allow you to register custom classes with conversions, but this likely won't happen until Spark 1.7. On Mon, Nov 2, 2015 at 8:40 PM, Bryan Jeffrey

[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8513: Target Version/s: (was: 1.6.0) > _temporary may be left undeleted when a write

[jira] [Updated] (SPARK-10345) Flaky test: HiveCompatibilitySuite.nonblock_op_deduplicate

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10345: - Target Version/s: (was: 1.6.0) > Flaky t

[jira] [Commented] (SPARK-11412) Support merge schema for ORC

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987127#comment-14987127 ] Michael Armbrust commented on SPARK-11412: -- This is only currently supported for parquet

[jira] [Commented] (SPARK-9557) Refactor ParquetFilterSuite and remove old ParquetFilters code

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987129#comment-14987129 ] Michael Armbrust commented on SPARK-9557: - [~lian cheng] ping. > Refactor ParquetFilterSu

[jira] [Updated] (SPARK-9272) Persist information of individual partitions when persisting partitioned data source tables to metastore

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9272: Target Version/s: (was: 1.6.0) > Persist information of individual partitions w

[jira] [Updated] (SPARK-7659) Sort by attributes that are not present in the SELECT clause when there is windowfunction analysis error

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7659: Target Version/s: (was: 1.6.0) > Sort by attributes that are not present in the SEL

[jira] [Updated] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11415: - Target Version/s: 1.6.0 > Catalyst DateType Shifts Input Data by Local Timez

[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987154#comment-14987154 ] Michael Armbrust commented on SPARK-6189: - We should support field names with periods now. You'll

[jira] [Resolved] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6189. - Resolution: Fixed > Pandas to DataFrame conversion should check field names for peri

[jira] [Resolved] (SPARK-11404) groupBy on column expressions

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11404. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9359

[jira] [Resolved] (SPARK-11393) CoGroupedIterator should respect the fact that GroupedIterator.hasNext is not idempotent

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11393. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9346

[jira] [Resolved] (SPARK-10727) Dataframe count is zero after 'except' operation

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10727. -- Resolution: Duplicate > Dataframe count is zero after 'except' operat

[jira] [Updated] (SPARK-9205) org.apache.spark.sql.hive.HiveSparkSubmitSuite failing for Scala 2.11

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9205: Target Version/s: (was: 1.6.0) > org.apache.spark.sql.hive.HiveSparkSubmitSuite fail

[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987069#comment-14987069 ] Michael Armbrust commented on SPARK-9241: - Thanks for working on this. It looks like

[jira] [Updated] (SPARK-9241) Supporting multiple DISTINCT columns

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9241: Target Version/s: (was: 1.6.0) > Supporting multiple DISTINCT colu

[jira] [Resolved] (SPARK-7160) Support converting DataFrames to typed RDDs.

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7160. - Resolution: Fixed Fix Version/s: 1.6.0 As I commented on the PR, I think the goals

[jira] [Commented] (SPARK-6819) Support nested types in SparkR DataFrame

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987099#comment-14987099 ] Michael Armbrust commented on SPARK-6819: - Can we resolve this? Looks like all the subtasks

[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987101#comment-14987101 ] Michael Armbrust commented on SPARK-6817: - Should we bump this now that we are past code freeze

[jira] [Updated] (SPARK-6380) Resolution of equi-join key in post-join projection

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6380: Target Version/s: (was: 1.6.0) > Resolution of equi-join key in post-join project

[jira] [Resolved] (SPARK-11477) support create Dataset from RDD

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11477. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9434

[jira] [Commented] (SPARK-11470) Figure out a good name for the public API

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987032#comment-14987032 ] Michael Armbrust commented on SPARK-11470: -- How about {{stateful}} or {{unsubstitutable

[jira] [Updated] (SPARK-9271) Concurrency bug triggered by partition predicate push-down

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9271: Target Version/s: (was: 1.6.0) > Concurrency bug triggered by partition predicate p

[jira] [Updated] (SPARK-7712) Window Function Improvements

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7712: Target Version/s: (was: 1.6.0) > Window Function Improveme

[jira] [Updated] (SPARK-7768) Make user-defined type (UDT) API public

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7768: Target Version/s: 2+ (was: 1.6.0) > Make user-defined type (UDT) API pub

[jira] [Updated] (SPARK-10448) Parquet schema merging should NOT merge UDT

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10448: - Target Version/s: (was: 1.6.0) > Parquet schema merging should NOT merge

[jira] [Updated] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9999: Target Version/s: 2+ (was: 1.6.0) > Dataset API on top of Catalyst/DataFr

[jira] [Updated] (SPARK-9932) Data source API improvement (Spark 1.6)

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9932: Target Version/s: (was: 1.6.0) > Data source API improvement (Spark

[jira] [Updated] (SPARK-9983) Local physical operators for query execution

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9983: Target Version/s: (was: 1.6.0) > Local physical operators for query execut

[jira] [Commented] (SPARK-8829) Improve expression performance

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987115#comment-14987115 ] Michael Armbrust commented on SPARK-8829: - Is this ready to be closed? > Improve express

[jira] [Updated] (SPARK-9213) Improve regular expression performance (via joni)

2015-11-03 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9213: Target Version/s: (was: 1.6.0) > Improve regular expression performance (via j

Spark 1.6 Release Schedule

2015-10-31 Thread Michael Armbrust
Hey All, Just a friendly reminder that today (October 31st) is the scheduled code freeze for Spark 1.6. Since a lot of developers were busy with the Spark Summit last week I'm going to delay cutting the branch until Monday, November 2nd. After that point, we'll package a release for testing and

[jira] [Commented] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-31 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983945#comment-14983945 ] Michael Armbrust commented on SPARK-11431: -- Have you looked at the explode that works on a column

Re: Pulling data from a secured SQL database

2015-10-31 Thread Michael Armbrust
I would try using the JDBC Data Source and save the data to Parquet. You can then put that data on your Spark cluster (probably

Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-31 Thread Michael Armbrust
This is a bug in DataFrame caching. You can avoid caching or turn off compression. It is fixed in Spark 1.5.1 On Sat, Oct 31, 2015 at 2:31 AM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > I don’t believe I have it on 1.5.1. Are you able to test the data locally > to confirm, or is

Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Michael Armbrust
> > We have tried schema merging feature, but it's too slow, there're hundreds > of partitions. > Which version of Spark?

Re: SparkSQL: What is the cost of DataFrame.registerTempTable(String)? Can I have multiple tables referencing to the same DataFrame?

2015-10-29 Thread Michael Armbrust
It's super cheap. It's just a hashtable stored on the driver. Yes, you can have more than one name for the same DF. On Wed, Oct 28, 2015 at 6:17 PM, Anfernee Xu wrote: > Hi, > > I just want to understand the cost of DataFrame.registerTempTable(String), > is it just a
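The reply above can be pictured with a toy model in plain Scala. This is only a sketch: `TempTableSketch`, `FakeDataFrame`, and `lookup` are hypothetical names invented for illustration, and nothing here is Spark's actual catalog code. It shows why registration is cheap (one hash-map entry per name) and why two names can point at the same DataFrame.

```scala
import scala.collection.mutable

object TempTableSketch {
  // Hypothetical stand-in for a DataFrame; only object identity matters here.
  final class FakeDataFrame(val label: String)

  // Toy catalog: table name -> reference to an existing in-memory object.
  // Registering a name stores a reference; no data is copied or computed.
  private val tables = mutable.HashMap.empty[String, FakeDataFrame]

  def registerTempTable(name: String, df: FakeDataFrame): Unit =
    tables.update(name, df)

  def lookup(name: String): Option[FakeDataFrame] = tables.get(name)
}
```

Registering the same `FakeDataFrame` under two names leaves both entries pointing at one object, which mirrors the "more than one name for the same DF" answer.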

Re: Collect Column as Array in Grouped DataFrame

2015-10-29 Thread Michael Armbrust
You can use a Hive UDF. import org.apache.spark.sql.functions._ callUDF("collect_set", $"columnName") or just SELECT collect_set(columnName) FROM ... Note that in 1.5 I think this actually does not use tungsten. In 1.6 it should though. I'll add that the experimental Dataset API (preview in
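For readers unfamiliar with `collect_set`, its per-group result can be mimicked on ordinary Scala collections. The sketch below is an analogy only: it does not use Spark or Hive, and `CollectSetAnalogy` and `collectSetByKey` are hypothetical names chosen for illustration.

```scala
object CollectSetAnalogy {
  // Groups (key, value) pairs by key and keeps the distinct values per key,
  // which is roughly what GROUP BY key with collect_set(value) returns.
  def collectSetByKey[K, V](rows: Seq[(K, V)]): Map[K, Set[V]] =
    rows.groupBy(_._1).map { case (k, group) => k -> group.map(_._2).toSet }
}
```

Calling `CollectSetAnalogy.collectSetByKey(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)))` yields one deduplicated set per key.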

Re: Inconsistent Persistence of DataFrames in Spark 1.5

2015-10-29 Thread Michael Armbrust
There were several bugs in Spark 1.5 and we strongly recommend you upgrade to 1.5.1. If the issue persists it would be helpful to see the result of calling explain. On Wed, Oct 28, 2015 at 4:46 PM, wrote: > Hi, just a couple cents. > > > > are your joining columns

[jira] [Created] (SPARK-11404) groupBy on column expressions

2015-10-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11404: Summary: groupBy on column expressions Key: SPARK-11404 URL: https://issues.apache.org/jira/browse/SPARK-11404 Project: Spark Issue Type: Sub-task

[jira] [Updated] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions

2015-10-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11188: - Assignee: Dilip Biswal (was: Michael Armbrust) > Elide stacktraces in bin/spark-

Re: SPARK SQL- Parquet projection pushdown for nested data

2015-10-29 Thread Michael Armbrust
Yeah, this is unfortunate. It would be good to fix this, but it's a non-trivial change. Tracked here if you'd like to vote on the issue: https://issues.apache.org/jira/browse/SPARK-4502 On Thu, Oct 29, 2015 at 6:00 PM, Sadhan Sood wrote: > I noticed when querying struct

[jira] [Assigned] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions

2015-10-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-11188: Assignee: Michael Armbrust > Elide stacktraces in bin/spark-

[jira] [Resolved] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions

2015-10-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11188. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9194

[jira] [Reopened] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions

2015-10-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-11188: -- Reopening to track backporting. > Elide stacktraces in bin/spark-

[jira] [Resolved] (SPARK-11370) fix a bug in GroupedIterator and create unit test for it

2015-10-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11370. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9330

[jira] [Resolved] (SPARK-11379) ExpressionEncoder can't handle top level primitive type correctly

2015-10-29 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11379. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9337

[jira] [Resolved] (SPARK-11313) Implement cogroup

2015-10-28 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11313. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9324

[jira] [Updated] (SPARK-11313) Implement cogroup

2015-10-28 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11313: - Assignee: Wenchen Fan > Implement cogroup > - > >

Re: Hive Version

2015-10-28 Thread Michael Armbrust
Documented here: http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#interacting-with-different-versions-of-hive-metastore In 1.4.1 we compile against 0.13.1 On Wed, Oct 28, 2015 at 2:26 PM, Bryan Jeffrey wrote: > All, > > I am using a HiveContext to create

[jira] [Commented] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-28 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978406#comment-14978406 ] Michael Armbrust commented on SPARK-11303: -- I picked it into branch-1.5, but I'm not sure

[jira] [Created] (SPARK-11377) withNewChildren should not convert StructType to Seq

2015-10-28 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11377: Summary: withNewChildren should not convert StructType to Seq Key: SPARK-11377 URL: https://issues.apache.org/jira/browse/SPARK-11377 Project: Spark

[jira] [Updated] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-27 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11303: - Description: When sampling and then filtering DataFrame from python, we get inconsistent

[jira] [Resolved] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-27 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11303. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9294

[jira] [Resolved] (SPARK-11277) sort_array throws exception scala.MatchError

2015-10-27 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11277. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9247

Re: Spark SQL Persistent Table - joda DateTime Compatability

2015-10-27 Thread Michael Armbrust
You'll need to convert it to a java.sql.Timestamp. On Tue, Oct 27, 2015 at 4:33 PM, Bryan Jeffrey wrote: > Hello. > > I am working to create a persistent table using SparkSQL HiveContext. I > have a basic Windows event case class: > > case class WindowsEvent( >
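A minimal sketch of the conversion suggested above, assuming the Joda value is first reduced to epoch milliseconds (Joda-Time's `DateTime` exposes this through `getMillis`). `TimestampConversion` and `toSqlTimestamp` are hypothetical helper names; the function takes the millisecond value directly so the example needs no Joda dependency.

```scala
import java.sql.Timestamp

object TimestampConversion {
  // With Joda-Time on the classpath the call site would look like
  // toSqlTimestamp(dateTime.getMillis); here the caller passes the millis.
  def toSqlTimestamp(epochMillis: Long): Timestamp = new Timestamp(epochMillis)
}
```

A case class field declared as `java.sql.Timestamp` rather than a Joda `DateTime` could then be populated with this helper before the data is handed to Spark SQL.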

[jira] [Created] (SPARK-11347) Support for joining two datasets, returning a tuple of objects

2015-10-27 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11347: Summary: Support for joining two datasets, returning a tuple of objects Key: SPARK-11347 URL: https://issues.apache.org/jira/browse/SPARK-11347 Project

Re: How to implement zipWithIndex as a UDF?

2015-10-23 Thread Michael Armbrust
The user facing type mapping is documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types On Fri, Oct 23, 2015 at 12:10 PM, Benyi Wang wrote: > If I have two columns > > StructType(Seq( > StructField("id", LongType), >

[jira] [Reopened] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0

2015-10-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-11229: -- > NPE in JoinedRow.isNullAt when spark.shuffle.memoryFractio

[jira] [Closed] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0

2015-10-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-11229. Resolution: Fixed Fix Version/s: 1.6.0 > NPE in JoinedRow.isNullAt w

[jira] [Resolved] (SPARK-11216) add encoder/decoder for external row

2015-10-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11216. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9184

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Michael Armbrust
Unfortunately, the mechanisms that we use to differentiate columns automatically don't work particularly well in the presence of self joins. However, you can get it to work if you use the $"column" syntax consistently: val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key",

[jira] [Updated] (SPARK-10743) keep the name of expression if possible when do cast

2015-10-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10743: - Assignee: Wenchen Fan > keep the name of expression if possible when do c

[jira] [Resolved] (SPARK-10743) keep the name of expression if possible when do cast

2015-10-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10743. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8859

[jira] [Resolved] (SPARK-11197) Run SQL query on files directly without create a table

2015-10-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11197. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9173

[jira] [Resolved] (SPARK-9740) first/last aggregate NULL behavior

2015-10-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-9740. - Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8113 [https

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Michael Armbrust
--+ > | 10| > | 20| > +-+ > > > scala> j.select(largeValues("lv.value")).show > +-+ > |value| > +-+ > |1| > |5| > +-+ > > Or does this behavior have the same root cause as detailed in Michael's > email? > > > -I

[jira] [Resolved] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast

2015-10-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-8654. - Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9036 [https

[jira] [Resolved] (SPARK-9210) checkValidAggregateExpression() throws exceptions with bad error messages

2015-10-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-9210. - Resolution: Fixed Assignee: Yin Huai Fix Version/s: 1.6.0

[jira] [Resolved] (SPARK-11208) Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config

2015-10-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11208. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9178

[jira] [Closed] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0

2015-10-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-11229. Resolution: Cannot Reproduce Closing, please reopen if you have additional info about

[jira] [Resolved] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions

2015-10-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11179. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9167

Re: hive thriftserver and fair scheduling

2015-10-20 Thread Michael Armbrust
Not the most obvious place in the docs... but this is probably helpful: https://spark.apache.org/docs/latest/sql-programming-guide.html#scheduling You likely want to put each user in their own pool. On Tue, Oct 20, 2015 at 11:55 AM, Sadhan Sood wrote: > Hi All, > > Does

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Michael Armbrust
For compatibility reasons, we always write data out as nullable in Parquet. Given that the nullability bit is only an optimization that we don't actually make much use of, I'm curious why you are worried that it's changing to true? On Tue, Oct 20, 2015 at 8:24 AM, Jerry Lam wrote: >

Re: Hive custom transform scripts in Spark?

2015-10-20 Thread Michael Armbrust
; org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > > at scala.Option.foreach(Option.scala:236) > > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) > > at > org

Re: Hive custom transform scripts in Spark?

2015-10-20 Thread Michael Armbrust
We support TRANSFORM. Are you having a problem using it? On Tue, Oct 20, 2015 at 8:21 AM, wuyangjack wrote: > How to reuse hive custom transform scripts written in python or c++? > > These scripts process data from stdin and print to stdout in spark. > They use the

Re: [spark1.5.1] HiveQl.parse throws org.apache.spark.sql.AnalysisException: null

2015-10-20 Thread Michael Armbrust
That's not really intended to be a public API as there is some internal setup that needs to be done for Hive to work. Have you created a HiveContext in the same thread? Is there more to that stacktrace? On Tue, Oct 20, 2015 at 2:25 AM, Ayoub wrote: > Hello, > >

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Michael Armbrust
> > First, this is not documented in the official document. Maybe we should do > it? http://spark.apache.org/docs/latest/sql-programming-guide.html > Pull requests welcome. > Second, nullability is a significant concept in the database people. It is > part of schema. Extra codes are needed for

[jira] [Commented] (SPARK-11220) SQL data source gives confusing error message when file not found

2015-10-20 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965927#comment-14965927 ] Michael Armbrust commented on SPARK-11220: -- My hunch is this has something to do with file

Re: Concurrency/Multiple Users

2015-10-19 Thread Michael Armbrust
Unfortunately the implementation of SPARK-2087 didn't have enough tests and got broken in 1.4. In Spark 1.6 we will have a much more solid fix: https://github.com/apache/spark/commit/3390b400d04e40f767d8a51f1078fcccb4e64abd On Mon, Oct 19, 2015 at 2:13 PM, GooniesNeverSayDie

Re: flattening a JSON data structure

2015-10-19 Thread Michael Armbrust
A quick fix is probably to use Seq[Row] instead of Array (the types that are returned are documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types). Really though, you probably want to be using explode. Perhaps something like this would help? import

[jira] [Created] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions

2015-10-19 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11188: Summary: Elide stacktraces in bin/spark-sql for AnalysisExceptions Key: SPARK-11188 URL: https://issues.apache.org/jira/browse/SPARK-11188 Project: Spark

Re: Spark SQL: what does an exclamation mark mean in the plan?

2015-10-19 Thread Michael Armbrust
It means that there is an invalid attribute reference (i.e. a #n where the attribute is missing from the child operator). On Sun, Oct 18, 2015 at 11:38 PM, Xiao Li wrote: > Hi, all, > > After turning on the trace, I saw a strange exclamation mark in > the intermediate

[jira] [Updated] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions

2015-10-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11188: - Target Version/s: 1.4.2, 1.5.2, 1.6.0 (was: 1.6.0) > Elide stacktraces in bin/spark-

[jira] [Created] (SPARK-11196) Support for equality and pushdown of filters on some UDTs

2015-10-19 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11196: Summary: Support for equality and pushdown of filters on some UDTs Key: SPARK-11196 URL: https://issues.apache.org/jira/browse/SPARK-11196 Project: Spark

[jira] [Updated] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-10-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10978: - Target Version/s: 1.6.0 > Allow PrunedFilterScan to eliminate predicates from furt

[jira] [Resolved] (SPARK-11088) Optimize DataSourceStrategy.mergeWithPartitionValues

2015-10-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11088. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9104

Re: Dynamic partition pruning

2015-10-16 Thread Michael Armbrust
We don't support dynamic partition pruning yet. On Fri, Oct 16, 2015 at 10:20 AM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi all > > > > I’m running sqls on spark 1.5.1 and using tables based on parquets. > > My tables are not pruned when joined on partition columns. > > Ex: >

[jira] [Commented] (SPARK-10165) Nested Hive UDF resolution fails in Analyzer

2015-10-16 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961159#comment-14961159 ] Michael Armbrust commented on SPARK-10165: -- That sounds like a different issue. Please open up

[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961147#comment-14961147 ] Michael Armbrust commented on SPARK-11153: -- It's actually corrupted statistics in data

[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-16 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961518#comment-14961518 ] Michael Armbrust commented on SPARK-9999: - Yeah, I think tuples are a pretty important use case

[jira] [Resolved] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering

2015-10-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11135. -- Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved
