Re: [Spark SQL]: UDF with Array[Double] as input

2016-04-01 Thread Michael Armbrust
What error are you getting? Here is an example. External types are documented here:
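
For readers hitting the problem in the subject line, a minimal sketch follows. It is not the example linked in the original message. It assumes Spark 2.x or later on the classpath (`SparkSession` postdates the 1.6 release current at the time), and the column names and data are made up. The relevant point is that an array column is handed to a Scala UDF as a `Seq`, so the parameter is declared `Seq[Double]` rather than `Array[Double]`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: Spark 2.x+; inside spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[1]").appName("array-udf-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: an id plus an array<double> column.
val df = Seq((1, Seq(1.0, 2.0, 3.0)), (2, Seq(0.5, 0.5))).toDF("id", "values")

// The array column arrives in the UDF as a Seq, so declare Seq[Double].
val sumValues = udf((xs: Seq[Double]) => xs.sum)

val totals = df.select(col("id"), sumValues(col("values")).as("total"))
totals.show()
```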

[jira] [Resolved] (SPARK-14255) Streaming Aggregation

2016-04-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14255. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12048

Re: Support for time column type?

2016-04-01 Thread Michael Armbrust
There is also CalendarIntervalType. Is that what you are looking for? On Fri, Apr 1, 2016 at 1:11 PM, Philip Weaver wrote: > Hi, I don't see any mention of a time type in the documentation (there is > DateType and TimestampType, but not TimeType), and have been unable

Re: What influences the space complexity of Spark operations?

2016-04-01 Thread Michael Armbrust
Blocking operators like Sort, Join or Aggregate will put all of the data for a whole partition into a hash table or array. However, if you are running Spark 1.5+ we should be spilling to disk. In Spark 1.6 if you are seeing OOMs for SQL operations you should report it as a bug. On Thu, Mar 31,

[jira] [Updated] (SPARK-14160) Windowing for structured streaming

2016-04-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14160: - Assignee: Burak Yavuz > Windowing for structured stream

[jira] [Resolved] (SPARK-14160) Windowing for structured streaming

2016-04-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14160. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12008

[jira] [Resolved] (SPARK-14070) Use ORC data source for SQL queries on ORC tables

2016-04-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14070. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11891

[jira] [Resolved] (SPARK-14191) Fix Expand operator constraints

2016-04-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14191. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11995

[jira] [Resolved] (SPARK-13995) Extract correct IsNotNull constraints for Expression

2016-04-01 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13995. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11809

[jira] [Created] (SPARK-14288) Memory Sink

2016-03-31 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-14288: Summary: Memory Sink Key: SPARK-14288 URL: https://issues.apache.org/jira/browse/SPARK-14288 Project: Spark Issue Type: Sub-task

Re: pyspark read json file with high dimensional sparse data

2016-03-30 Thread Michael Armbrust
You can force the data to be loaded as a sparse map assuming the key/value types are consistent. Here is an example. On Wed, Mar 30,
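
One way to do this is to pass an explicit schema with a map-typed field, sketched below in Scala (the thread is about PySpark, where a user-supplied schema works along the same lines). This is not the example linked in the message. The field names and records are hypothetical, and `json(Dataset[String])` assumes Spark 2.2 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, MapType, StringType, StructField, StructType}

// Assumption: Spark 2.2+ (for reading JSON from a Dataset[String]).
val spark = SparkSession.builder().master("local[1]").appName("sparse-map-sketch").getOrCreate()
import spark.implicits._

// Hypothetical records: each "features" object carries a different, sparse set of keys.
val lines = Seq(
  """{"id": "a", "features": {"f1": 1.0, "f907": 0.5}}""",
  """{"id": "b", "features": {"f42": 2.0}}"""
).toDS()

// Declaring "features" as a map avoids inferring one column per distinct key.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("features", MapType(StringType, DoubleType))
))

val sparse = spark.read.schema(schema).json(lines)
sparse.printSchema()
```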

Re: Spark SQL UDF Returning Rows

2016-03-30 Thread Michael Armbrust
Some answers and more questions inline - UDFs can pretty much only take in Primitives, Seqs, Maps and Row objects > as parameters. I cannot take in a case class object in place of the > corresponding Row object, even if the schema matches because the Row object > will always be passed in at

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Michael Armbrust
+1 to Matei's reasoning. On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia wrote: > I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the > entire 2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's > the default version we built with in

[jira] [Resolved] (SPARK-14268) rename toRowExpressions and fromRowExpression to serializer and deserializer in ExpressionEncoder

2016-03-30 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14268. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12058

[jira] [Created] (SPARK-14255) Streaming Aggregation

2016-03-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-14255: Summary: Streaming Aggregation Key: SPARK-14255 URL: https://issues.apache.org/jira/browse/SPARK-14255 Project: Spark Issue Type: Sub-task

[jira] [Updated] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects

2016-03-28 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13531: - Priority: Major (was: Minor) > Some DataFrame joins stopped work

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
cs/").json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*") > > If so, it returns the same error: > > java.lang.AssertionError: assertion failed: Conflicting directory > structures detected. Suspicious paths:? > hdfs://user/hdfs/analytics/app1/PAGEVIEW > hdfs://user/hdf

Re: DataFrameWriter.save fails job with one executor failure

2016-03-25 Thread Michael Armbrust
I would not recommend using the direct output committer with HDFS. It's intended only as an optimization for S3. On Fri, Mar 25, 2016 at 4:03 AM, Vinoth Chandar wrote: > Hi, > > We are doing the following to save a dataframe in parquet (using > DirectParquetOutputCommitter) as

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
Have you tried setting a base path for partition discovery? Starting from Spark 1.6.0, partition discovery only finds partitions under > the given paths by default. For the above example, if users pass > path/to/table/gender=male to either SQLContext.read.parquet or > SQLContext.read.load, gender
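
A small sketch of the base path setting, assuming Spark 2.x or later and a throwaway table written to a temporary directory (the data and directory names are made up). The `basePath` read option tells partition discovery where the table root is, so the partition column is kept even when only one partition directory is read.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("basepath-sketch").getOrCreate()
import spark.implicits._

// Hypothetical table root; "table" does not exist yet, so the write below creates it.
val root = Files.createTempDirectory("basepath-sketch").resolve("table").toString
Seq(("alice", "female"), ("bob", "male")).toDF("name", "gender")
  .write.partitionBy("gender").parquet(root)

// Reading one partition directory directly: "gender" is not discovered as a column.
val withoutBase = spark.read.parquet(s"$root/gender=male")

// Same path with basePath set to the table root: "gender" comes back as a partition column.
val withBase = spark.read.option("basePath", root).parquet(s"$root/gender=male")

println(withoutBase.columns.mkString(", "))
println(withBase.columns.mkString(", "))
```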

[jira] [Resolved] (SPARK-12443) encoderFor should support Decimal

2016-03-25 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12443. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10399

[jira] [Updated] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-03-25 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14048: - Target Version/s: 2.0.0 > Aggregation operations on structs fail when the structs h

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Michael Armbrust
On Thu, Mar 24, 2016 at 4:54 PM, Mark Hamstra wrote: > It's a pain in the ass. Especially if some of your transitive > dependencies never upgraded from 2.10 to 2.11. > Yeah, I'm going to have to agree here. It is not as bad as it was in the 2.9 days, but it's still

Re: Column explode a map

2016-03-24 Thread Michael Armbrust
If you know the map keys ahead of time then you can just extract them directly. Here are a few examples. On Thu, Mar 24, 2016 at 12:01
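
A sketch of direct extraction, not the examples linked in the message. It assumes Spark 2.x or later; the `attrs` column and its keys are hypothetical. Both `getItem` and the `Column` apply syntax pull a single key out of a map column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("map-keys-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: one map<string,double> column with known keys.
val df = Seq((1, Map("height" -> 1.8, "weight" -> 75.0))).toDF("id", "attrs")

// Each known key becomes its own column; a missing key would yield null.
val flat = df.select(
  col("id"),
  col("attrs").getItem("height").as("height"),
  col("attrs")("weight").as("weight")
)
flat.show()
```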

Re: calling individual columns from spark temporary table

2016-03-24 Thread Michael Armbrust
).map(x => > (x.getString(0),x.getString(1).) > > Can you give an example of column expression please > like > > df.filter(col("paid") > "").col("firstcolumn").getString ? > > > > > On Thursday, 24 March 2016, 0:45, Micha

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-24 Thread Michael Armbrust
> $ wget https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz > --2016-03-18 07:55:30--

Re: calling individual columns from spark temporary table

2016-03-23 Thread Michael Armbrust
s there anyway one can keep the csv column names using databricks when > mapping > > val r = df.filter(col("paid") > "").map(x => > (x.getString(0),x.getString(1).) > > can I call example x.getString(0).as.(firstcolumn) in above when mapping > if poss

Re: calling individual columns from spark temporary table

2016-03-23 Thread Michael Armbrust
You probably need to use `backticks` to escape `_1` since I don't think that it's a valid SQL identifier. On Wed, Mar 23, 2016 at 5:10 PM, Ashok Kumar wrote: > Gurus, > > If I register a temporary table as below > > r.toDF > res58: org.apache.spark.sql.DataFrame =
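
A sketch of the backtick form, assuming Spark 2.x or later (`createOrReplaceTempView` is the 2.x counterpart of the 1.6 `registerTempTable`). The table name `t` and the data are made up. Newer SQL parsers may accept `_1` unquoted as well; the quoted form is the safe one.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("backtick-sketch").getOrCreate()
import spark.implicits._

// toDF() on tuples without names yields the default columns _1 and _2.
val r = Seq(("alice", 10), ("bob", 20))
r.toDF().createOrReplaceTempView("t")

// Backticks quote the identifier so the parser treats _1 as a column name.
val firstCol = spark.sql("SELECT `_1` FROM t")
firstCol.show()
```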

[jira] [Resolved] (SPARK-14078) Simple FileSink for Parquet

2016-03-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14078. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11897

[jira] [Updated] (SPARK-14070) Use ORC data source for SQL queries on ORC tables

2016-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14070: - Target Version/s: 2.0.0 > Use ORC data source for SQL queries on ORC tab

[jira] [Created] (SPARK-14078) Simple FileSink for Parquet

2016-03-22 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-14078: Summary: Simple FileSink for Parquet Key: SPARK-14078 URL: https://issues.apache.org/jira/browse/SPARK-14078 Project: Spark Issue Type: Sub-task

[jira] [Updated] (SPARK-14070) Use ORC data source for SQL queries on ORC tables

2016-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14070: - Assignee: Tejas Patil > Use ORC data source for SQL queries on ORC tab

[jira] [Updated] (SPARK-14070) Use ORC data source for SQL queries on ORC tables

2016-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14070: - Shepherd: Michael Armbrust > Use ORC data source for SQL queries on ORC tab

[jira] [Resolved] (SPARK-13985) WAL for determistic batches with IDs

2016-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13985. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11804

[jira] [Updated] (SPARK-14029) Improve BooleanSimplification optimization by implementing `Not` canonicalization

2016-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14029: - Assignee: Dongjoon Hyun > Improve BooleanSimplification optimization by implement

[jira] [Resolved] (SPARK-14029) Improve BooleanSimplification optimization by implementing `Not` canonicalization

2016-03-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-14029. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11851

[jira] [Resolved] (SPARK-13883) buildReader implementation for parquet

2016-03-21 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13883. -- Resolution: Fixed Issue resolved by pull request 11709 [https://github.com/apache

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-21 Thread Michael Armbrust
> > But when I tried using Spark Streaming I could not find a way to store the > data with the avro schema information. The closest that I got was to create > a Dataframe using the json RDDs and store them as parquet. Here the parquet > files had a spark specific schema in their footer. > Does this

Re: Spark SQL Optimization

2016-03-21 Thread Michael Armbrust
It's helpful if you can include the output of EXPLAIN EXTENDED or df.explain(true) whenever asking about query performance. On Mon, Mar 21, 2016 at 6:27 AM, gtinside wrote: > Hi , > > I am trying to execute a simple query with join on 3 tables. When I look at > the execution
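
For reference, a sketch of producing that output, assuming Spark 2.x or later; the two small tables are hypothetical. `explain(true)` prints the parsed, analyzed and optimized logical plans followed by the physical plan.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("explain-sketch").getOrCreate()
import spark.implicits._

// Hypothetical tables standing in for the ones in the question.
val orders = Seq((1, "a", 10.0), (2, "b", 5.0)).toDF("id", "cust", "amount")
val customers = Seq(("a", "Alice"), ("b", "Bob")).toDF("cust", "name")

val q = orders.join(customers, "cust").filter(col("amount") > 6.0)

// Prints all plan stages; this is the text worth pasting into a performance question.
q.explain(true)
```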

Re: Subquery performance

2016-03-20 Thread Michael Armbrust
t? > > > > y > > > > *From:* Michael Armbrust [mailto:mich...@databricks.com] > *Sent:* March-17-16 8:59 PM > *To:* Younes Naguib > *Cc:* user@spark.apache.org > *Subject:* Re: Subquery performance > > > > Try running EXPLAIN on both versions of the query.

Re: Subquery performance

2016-03-19 Thread Michael Armbrust
Try running EXPLAIN on both versions of the query. Likely when you cache the subquery we know that it's going to be small, so we use a broadcast join instead of shuffling the data. On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi all, > > > > I’m running
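
The difference between the two plans can be reproduced with a small sketch, assuming Spark 2.x or later and made-up tables. Automatic broadcasting is switched off here only so that the effect of the explicit `broadcast` hint is visible; note that this changes a session-wide setting.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[1]").appName("broadcast-sketch").getOrCreate()
import spark.implicits._

// Disable size-based automatic broadcasting for the demonstration (session-wide).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val big = spark.range(0, 1000).toDF("id")
val small = Seq((1L, "x"), (2L, "y")).toDF("id", "tag")

// Without a hint the planner falls back to a shuffle-based join.
val shuffled = big.join(small, "id")

// With the hint the small side is shipped to every executor instead of shuffling both sides.
val hinted = big.join(broadcast(small), "id")

shuffled.explain()
hinted.explain()
```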

[jira] [Created] (SPARK-13985) WAL for determistic batches with IDs

2016-03-19 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13985: Summary: WAL for determistic batches with IDs Key: SPARK-13985 URL: https://issues.apache.org/jira/browse/SPARK-13985 Project: Spark Issue Type: Sub

[jira] [Resolved] (SPARK-13427) Support USING clause in JOIN

2016-03-19 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13427. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11297

[jira] [Updated] (SPARK-13945) Enable native view flag by default

2016-03-18 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13945: - Target Version/s: 2.0.0 > Enable native view flag by defa

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-18 Thread Michael Armbrust
Patrick reuploaded the artifacts, so it should be fixed now. On Mar 16, 2016 5:48 PM, "Nicholas Chammas" wrote: > Looks like the other packages may also be corrupt. I’m getting the same > error for the Spark 1.6.1 / Hadoop 2.4 package. > > >

[jira] [Commented] (SPARK-12546) Writing to partitioned parquet table can fail with OOM

2016-03-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195967#comment-15195967 ] Michael Armbrust commented on SPARK-12546: -- There is no partitioning in that example so

[jira] [Resolved] (SPARK-13876) Strategy for planning scans of files

2016-03-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13876. -- Resolution: Fixed Fix Version/s: 2.0.0 Resolved by https://github.com/apache

[jira] [Reopened] (SPARK-13664) Simplify and Speedup HadoopFSRelation

2016-03-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-13664: -- > Simplify and Speedup HadoopFSRelat

Re: question about catalyst and TreeNode

2016-03-15 Thread Michael Armbrust
Trees are immutable, and TreeNode takes care of copying unchanged parts of the tree when you are doing transformations. As a result, even if you do construct a DAG with the Dataset API, the first transformation will turn it back into a tree. The only exception to this rule is when we share the
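
The idea can be illustrated with a toy tree in plain Scala. This is only a sketch of the general technique (immutable nodes, rebuilding a node only when a child changed), not Catalyst's actual `TreeNode`; the `Expr`, `Lit`, `Add` and `Mul` names are invented for the illustration.

```scala
// Illustrative only: a tiny immutable expression tree, not Catalyst's TreeNode.
sealed trait Expr {
  def children: Seq[Expr]
  def withNewChildren(newChildren: Seq[Expr]): Expr

  // Bottom-up rewrite: a node is rebuilt only if one of its children changed.
  def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
    val newChildren = children.map(_.transformUp(rule))
    val changed = newChildren.zip(children).exists { case (n, o) => !(n eq o) }
    val node = if (changed) withNewChildren(newChildren) else this
    rule.applyOrElse(node, identity[Expr])
  }
}

case class Lit(value: Int) extends Expr {
  def children: Seq[Expr] = Nil
  def withNewChildren(newChildren: Seq[Expr]): Expr = this
}

case class Add(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
  def withNewChildren(newChildren: Seq[Expr]): Expr = Add(newChildren(0), newChildren(1))
}

case class Mul(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
  def withNewChildren(newChildren: Seq[Expr]): Expr = Mul(newChildren(0), newChildren(1))
}

val product = Mul(Lit(2), Lit(3))
val original = Add(product, Add(Lit(1), Lit(0)))

// Rule: x + 0 => x. The input tree is never mutated; a new root is returned.
val simplified = original.transformUp { case Add(e, Lit(0)) => e }

println(original)
println(simplified)
```

The untouched `product` subtree is shared between the old and new trees, which is what keeps such rewrites cheap.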

[jira] [Created] (SPARK-13883) buildReader implementation for parquet

2016-03-14 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13883: Summary: buildReader implementation for parquet Key: SPARK-13883 URL: https://issues.apache.org/jira/browse/SPARK-13883 Project: Spark Issue Type

[jira] [Resolved] (SPARK-13791) Add MetadataLog and HDFSMetadataLog

2016-03-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13791. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11625

[jira] [Updated] (SPARK-10380) Confusing examples in pyspark SQL docs

2016-03-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10380: - Assignee: Reynold Xin > Confusing examples in pyspark SQL d

[jira] [Resolved] (SPARK-10380) Confusing examples in pyspark SQL docs

2016-03-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10380. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11698

[jira] [Resolved] (SPARK-13664) Simplify and Speedup HadoopFSRelation

2016-03-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13664. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11646

[jira] [Commented] (SPARK-13118) Support for classes defined in package objects

2016-03-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15194436#comment-15194436 ] Michael Armbrust commented on SPARK-13118: -- It's likely that we have fixed this with other

[jira] [Updated] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects

2016-03-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13531: - Target Version/s: 2.0.0 > Some DataFrame joins stopped work

[jira] [Created] (SPARK-13876) Strategy for planning scans of files

2016-03-14 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13876: Summary: Strategy for planning scans of files Key: SPARK-13876 URL: https://issues.apache.org/jira/browse/SPARK-13876 Project: Spark Issue Type: Sub

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Michael Armbrust
On Mon, Mar 14, 2016 at 1:30 PM, Prabhu Joseph wrote: > > Thanks for the recommendation. But can you share what improvements were > made after Spark 1.2.1 and which of them specifically handle the > issue that is observed here? > Memory used for query execution is

[jira] [Updated] (SPARK-13658) BooleanSimplification rule is slow with large boolean expressions

2016-03-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13658: - Assignee: Liang-Chi Hsieh > BooleanSimplification rule is slow with large bool

[jira] [Resolved] (SPARK-13658) BooleanSimplification rule is slow with large boolean expressions

2016-03-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13658. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11647

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Michael Armbrust
+1 to upgrading Spark. 1.2.1 has none of the memory management improvements that were added in 1.4-1.6. On Mon, Mar 14, 2016 at 2:03 AM, Prabhu Joseph wrote: > The issue is the query hits OOM on a Stage when reading Shuffle Output > from previous stage. How come

[jira] [Assigned] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-13855: Assignee: Michael Armbrust > Spark 1.6.1 artifacts not found in S3 buc

Re: Spark SQL / Parquet - Dynamic Schema detection

2016-03-14 Thread Michael Armbrust
> > Each json file is of a single object and has the potential to have > variance in the schema. > How much variance are we talking? JSON->Parquet is going to do well with 100s of different columns, but at 10,000s many things will probably start breaking.

Re: Can someone fix this download URL?

2016-03-14 Thread Michael Armbrust
Yeah, sorry. I'll make sure this gets fixed. On Mon, Mar 14, 2016 at 12:48 AM, Sean Owen wrote: > Yeah I can't seem to download any of the artifacts via the direct download > / cloudfront URL. The Apache mirrors are fine, so use those for the moment. > @marmbrus were you

Re: adding rows to a DataFrame

2016-03-11 Thread Michael Armbrust
Or look at explode on DataFrame On Fri, Mar 11, 2016 at 10:45 AM, Stefan Panayotov wrote: > Hi, > > I have a problem that requires me to go through the rows in a DataFrame > (or possibly through rows in a JSON file) and conditionally add rows > depending on a value in one of
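
A sketch of conditionally adding rows with `explode`, assuming Spark 2.x or later and using the `explode` column function (the 1.6-era `DataFrame.explode` method was later deprecated). The columns and the condition are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, lit, when}

val spark = SparkSession.builder().master("local[1]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input: rows with n > 1 should turn into two rows, others stay single.
val df = Seq(("a", 1), ("b", 3)).toDF("key", "n")

// Build an array whose length depends on the row, then emit one row per element.
val parts = when(col("n") > 1, array(lit(0), lit(1))).otherwise(array(lit(0)))
val expanded = df.withColumn("part", explode(parts))

expanded.show()
```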

Re: udf StructField to JSON String

2016-03-11 Thread Michael Armbrust
df.select("event").toJSON On Fri, Mar 11, 2016 at 9:53 AM, Caires Vinicius wrote: > Hmm. I think my problem is a little more complex. I'm using > https://github.com/databricks/spark-redshift and when I read from JSON > file I got this schema. > > root > > |-- app: string
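
A runnable sketch of that one-liner, assuming Spark 2.x or later, where `toJSON` returns a `Dataset[String]` (in 1.6 it returns an `RDD[String]`). The `event` struct and its fields are hypothetical stand-ins for the schema in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().master("local[1]").appName("tojson-sketch").getOrCreate()
import spark.implicits._

// Hypothetical schema: a string column plus a nested "event" struct.
val df = Seq(("app1", "click", 3)).toDF("app", "type", "count")
  .select(col("app"), struct(col("type"), col("count")).as("event"))

// Each row becomes one JSON string containing only the selected struct.
val json = df.select("event").toJSON
json.collect().foreach(println)
```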

[ANNOUNCE] Announcing Spark 1.6.1

2016-03-10 Thread Michael Armbrust
Spark 1.6.1 is a maintenance release containing stability fixes. This release is based on the branch-1.6 maintenance branch of Spark. We *strongly recommend* all 1.6.0 users to upgrade to this release. Notable fixes include: - Workaround for OOM when writing large partitioned tables SPARK-12546

Re: AVRO vs Parquet

2016-03-10 Thread Michael Armbrust
A few clarifications: > 1) High memory and cpu usage. This is because Parquet files can't be > streamed into as records arrive. I have seen a lot of OOMs in reasonably > sized MR/Spark containers that write out Parquet. When doing dynamic > partitioning, where many writers are open at once,

[RESULT] [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
This vote passes with nine +1s (five binding) and one binding +0! Thanks to everyone who tested/voted. I'll start work on publishing the release today. +1: Mark Hamstra* Moshe Eshel Egor Pahomov Reynold Xin* Yin Huai* Andrew Or* Burak Yavuz Kousuke Saruta Michael Armbrust* 0: Sean Owen* -1

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
> at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) > at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)

[jira] [Resolved] (SPARK-13781) Use ExpressionSets in ConstraintPropagationSuite

2016-03-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13781. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11611

[jira] [Updated] (SPARK-13527) Prune Filters based on Constraints

2016-03-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13527: - Assignee: Xiao Li > Prune Filters based on Constrai

[jira] [Resolved] (SPARK-13527) Prune Filters based on Constraints

2016-03-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13527. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11406

[jira] [Resolved] (SPARK-13728) Fix ORC PPD

2016-03-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13728. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11593

[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15187626#comment-15187626 ] Michael Armbrust commented on SPARK-13393: -- No user is going to write {{df("a"

[jira] [Resolved] (SPARK-13763) Remove Project when its projectList is Empty

2016-03-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13763. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11599

[jira] [Updated] (SPARK-13754) Keep old data source name for backwards compatibility

2016-03-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13754: - Assignee: Hossein Falaki > Keep old data source name for backwards compatibil

[jira] [Resolved] (SPARK-13754) Keep old data source name for backwards compatibility

2016-03-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13754. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11589

[jira] [Resolved] (SPARK-13750) Fix sizeInBytes for HadoopFSRelation

2016-03-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13750. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11590

[jira] [Created] (SPARK-13750) Fix sizeInBytes for HadoopFSRelation

2016-03-08 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13750: Summary: Fix sizeInBytes for HadoopFSRelation Key: SPARK-13750 URL: https://issues.apache.org/jira/browse/SPARK-13750 Project: Spark Issue Type: Sub

Re: Spark structured streaming

2016-03-08 Thread Michael Armbrust
This is in active development, so there is not much that can be done from an end user perspective. In particular the only sink that is available in apache/master is a testing sink that just stores the data in memory. We are working on a parquet based file sink and will eventually support all the

[jira] [Updated] (SPARK-13728) Fix ORC PPD

2016-03-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13728: - Assignee: Hyukjin Kwon > Fix ORC PPD > --- > > Key:

[jira] [Commented] (SPARK-13728) Fix ORC PPD

2016-03-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185463#comment-15185463 ] Michael Armbrust commented on SPARK-13728: -- That sounds like a good lead to follow! > Fix

[jira] [Commented] (SPARK-13665) Initial separation of concerns in HadoopFSRelation

2016-03-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185460#comment-15185460 ] Michael Armbrust commented on SPARK-13665: -- I think what everyone is going to want to see

[jira] [Updated] (SPARK-13665) Initial separation of concerns in HadoopFSRelation

2016-03-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13665: - Summary: Initial separation of concerns in HadoopFSRelation (was: Initial separation

[jira] [Created] (SPARK-13738) Clean up ResolveDataSource

2016-03-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13738: Summary: Clean up ResolveDataSource Key: SPARK-13738 URL: https://issues.apache.org/jira/browse/SPARK-13738 Project: Spark Issue Type: Sub-task

[jira] [Resolved] (SPARK-13648) org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on IBM JDK

2016-03-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13648. -- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved

[jira] [Updated] (SPARK-13648) org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on IBM JDK

2016-03-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13648: - Fix Version/s: (was: 1.6.1) 1.6.2

[jira] [Updated] (SPARK-13722) No Push Down for Non-deterministic Predicates through Generate

2016-03-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13722: - Assignee: Xiao Li > No Push Down for Non-deterministic Predicates through Gener

[jira] [Resolved] (SPARK-13722) No Push Down for Non-deterministic Predicates through Generate

2016-03-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13722. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11562

[jira] [Updated] (SPARK-13730) Nulls in dataframes getting converted to 0 with spark 2.0 SNAPSHOT

2016-03-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13730: - Target Version/s: 2.0.0 > Nulls in dataframes getting converted to 0 with spark

Re: Nulls getting converted to 0 with spark 2.0 SNAPSHOT

2016-03-07 Thread Michael Armbrust
That looks like a bug to me. Open a JIRA? On Mon, Mar 7, 2016 at 11:30 AM, Franklyn D'souza < franklyn.dso...@shopify.com> wrote: > Just wanted to confirm that this is the expected behaviour. > > Basically I'm putting nulls into a non-nullable LongType column and doing > a transformation

[jira] [Created] (SPARK-13729) Reimplement the planning tests on SimpleTextRelation

2016-03-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13729: Summary: Reimplement the planning tests on SimpleTextRelation Key: SPARK-13729 URL: https://issues.apache.org/jira/browse/SPARK-13729 Project: Spark

[jira] [Created] (SPARK-13728) Fix ORC PPD

2016-03-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13728: Summary: Fix ORC PPD Key: SPARK-13728 URL: https://issues.apache.org/jira/browse/SPARK-13728 Project: Spark Issue Type: Sub-task

[jira] [Resolved] (SPARK-13694) QueryPlan.expressions should always include all expressions

2016-03-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13694. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11532

[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns

2016-03-04 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13605: - Fix Version/s: (was: 1.6.0) > Bean encoder cannot handle nonbean properties - no

[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns

2016-03-04 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13605: - Target Version/s: 2.0.0 (was: 1.6.0) > Bean encoder cannot handle nonbean propert

[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns

2016-03-04 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13605: - Component/s: SQL > Bean encoder cannot handle nonbean properties - no way to Enc

[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns

2016-03-04 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13605: - Description: in the current environment the only way to turn a List or JavaRDD
