[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959304#comment-14959304 ] Michael Armbrust commented on SPARK-10943: -- Yeah, parquet doesn't have a concept of null type

[jira] [Resolved] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10943. -- Resolution: Won't Fix > NullType Column cannot be written to Parq

[jira] [Updated] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10943: - Description: {code} var data02 = sqlContext.sql("select 1 as id, \"cat

[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959347#comment-14959347 ] Michael Armbrust commented on SPARK-9999: - Yeah, that Scala code should work. Regarding the Java

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Michael Armbrust
> > In hive, the ambiguous name can be resolved by using the table name as > prefix, but seems DataFrame don't support it ( I mean DataFrame API rather > than SparkSQL) You can do the same using pure DataFrames. Seq((1,2)).toDF("a", "b").registerTempTable("y") Seq((1,4)).toDF("a",

Re: Spark SQL running totals

2015-10-15 Thread Michael Armbrust
Check out: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html On Thu, Oct 15, 2015 at 11:35 AM, Deenar Toraskar wrote: > you can do a self join of the table with itself with the join clause being > a.col1 >= b.col1 > > select
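The window-functions approach from the blog post linked above can be sketched as follows for a running total. This is an illustrative sketch only: the `df` variable and the `account`/`date`/`amount` column names are hypothetical, and window functions in Spark 1.4/1.5 require a HiveContext.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Running total per account, ordered by date: sum over all rows from
// the start of the partition up to and including the current row.
val w = Window
  .partitionBy("account")
  .orderBy("date")
  .rowsBetween(Long.MinValue, 0)

val withRunningTotal = df.withColumn("running_total", sum("amount").over(w))
```

Unlike the self-join in the quoted reply, this computes the running sum in a single pass per partition.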

[jira] [Commented] (SPARK-8658) AttributeReference equals method only compare name, exprId and dataType

2015-10-15 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959451#comment-14959451 ] Michael Armbrust commented on SPARK-8658: - There is no query that exposes the problem as its

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
This won't help, for two reasons: 1) It's all still just creating lineage since you aren't caching the partitioned data. It will still fetch the shuffled blocks for each query. 2) The query optimizer is not aware of RDD-level partitioning since it's mostly a black box. 1) could be fixed by

Re: thriftserver: access temp dataframe from in-memory of spark-shell

2015-10-14 Thread Michael Armbrust
Yes, call startWithContext from the spark shell: https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56 On Wed, Oct 14, 2015 at 7:10 AM, wrote: > Hi, > > Is it possible to

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
Caching the partitioned_df <- this one, but you have to do the partitioning using something like sql("SELECT * FROM ... CLUSTER BY a") as there is no such operation exposed on dataframes. 2) Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-5354
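A hedged sketch of the pattern described above: partition via CLUSTER BY in SQL (there was no equivalent operation on DataFrames in these versions), then cache the result so later queries reuse the co-partitioned data. The table name `events` and column `a` are illustrative placeholders.

```scala
// Register the source table, repartition by column "a" with CLUSTER BY,
// then cache the clustered result for repeated querying.
df.registerTempTable("events")
val partitioned = sqlContext.sql("SELECT * FROM events CLUSTER BY a")
partitioned.registerTempTable("events_clustered")
sqlContext.cacheTable("events_clustered")
```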

[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-14 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957926#comment-14957926 ] Michael Armbrust commented on SPARK-9999: - [~sandyr] did you look at the test cases [in scala

[jira] [Created] (SPARK-11116) Initial API Draft

2015-10-14 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11116: Summary: Initial API Draft Key: SPARK-11116 URL: https://issues.apache.org/jira/browse/SPARK-11116 Project: Spark Issue Type: Sub-task

Re: PySpark - Hive Context Does Not Return Results but SQL Context Does for Similar Query.

2015-10-14 Thread Michael Armbrust
No link to the original stack overflow so I can up my reputation? :) This is likely not a difference between HiveContext/SQLContext, but instead a difference between a table where the metadata is coming from the HiveMetastore vs the SparkSQL Data Source API. I would guess that if you create the

Re: Spark DataFrame GroupBy into List

2015-10-14 Thread Michael Armbrust
> Can you be more specific on `collect_set`? Is it a built-in function or, > if it is an UDF, how it is defined? > > BR, > Todd Leo > > On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust <mich...@databricks.com> > wrote: > > import org.apache.spark.sql.functions._ > >

Re: Reusing Spark Functions

2015-10-14 Thread Michael Armbrust
Unless its a broadcast variable, a new copy will be deserialized for every task. On Wed, Oct 14, 2015 at 10:18 AM, Starch, Michael D (398M) < michael.d.sta...@jpl.nasa.gov> wrote: > All, > > Is a Function object in Spark reused on a given executor, or is sent and > deserialized with each new

[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955368#comment-14955368 ] Michael Armbrust commented on SPARK-9999: - Other compatibility breaking things include: getting

Re: Spark DataFrame GroupBy into List

2015-10-13 Thread Michael Armbrust
import org.apache.spark.sql.functions._ df.groupBy("category") .agg(callUDF("collect_set", df("id")).as("id_list")) On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu wrote: > Hey Spark users, > > I'm trying to group by a dataframe, by appending occurrences into a list >

[jira] [Created] (SPARK-11090) Initial code generated construction of Product classes from InternalRow

2015-10-13 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11090: Summary: Initial code generated construction of Product classes from InternalRow Key: SPARK-11090 URL: https://issues.apache.org/jira/browse/SPARK-11090

[jira] [Resolved] (SPARK-11080) NamedExpression.newExprId should only be called on driver

2015-10-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11080. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9093

[jira] [Resolved] (SPARK-11090) Initial code generated construction of Product classes from InternalRow

2015-10-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11090. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9100

[jira] [Updated] (SPARK-10389) support order by non-attribute grouping expression on Aggregate

2015-10-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10389: - Fix Version/s: 1.5.2 > support order by non-attribute grouping expression on Aggreg

[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955878#comment-14955878 ] Michael Armbrust commented on SPARK-9999: - I think improving Java compatibility and getting rid

[jira] [Resolved] (SPARK-11032) Failure to resolve having correctly

2015-10-13 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11032. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9105

Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-09 Thread Michael Armbrust
> > How about just fixing the warning? I get it; it doesn't stop this from > happening again, but still seems less drastic than tossing out the > whole mechanism. > +1 It also does not seem that expensive to test only compilation for Scala 2.11 on PR builds.

Re: Using a variable (a column name) in an IF statement in Spark SQL

2015-10-09 Thread Michael Armbrust
wso2.com> wrote: > Spark version: 1.4.1 > The schema is "barcode STRING, items INT" > > On Thu, Oct 8, 2015 at 10:48 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> Hmm, that looks like it should work to me. What version of Spark? What >

[jira] [Created] (SPARK-11032) Failure to resolve having correctly

2015-10-09 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11032: Summary: Failure to resolve having correctly Key: SPARK-11032 URL: https://issues.apache.org/jira/browse/SPARK-11032 Project: Spark Issue Type: Bug

Re: error in sparkSQL 1.5 using count(1) in nested queries

2015-10-09 Thread Michael Armbrust
Thanks for reporting: https://issues.apache.org/jira/browse/SPARK-11032 You can probably workaround this by aliasing the count and just doing a filter on that value afterwards. On Thu, Oct 8, 2015 at 8:47 PM, Jeff Thompson < jeffreykeatingthomp...@gmail.com> wrote: > After upgrading from 1.4.1
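The suggested workaround can be sketched like this (the table name `my_table` and column `key` are hypothetical): compute the count under an alias in the aggregation, then filter on the alias afterwards instead of repeating count(1) in a HAVING clause.

```scala
// Alias count(1) in the aggregation, then filter on the alias in a
// follow-up step rather than referencing count(1) from HAVING.
val counts = sqlContext.sql(
  "SELECT key, count(1) AS cnt FROM my_table GROUP BY key")
val filtered = counts.where("cnt > 1")
```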

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Michael Armbrust
You can use callUDF("percentile_approx", col("mycol"), lit(0.25)) to call Hive UDFs from dataframes. On Fri, Oct 9, 2015 at 12:01 PM, unk1102 wrote: > Hi how to calculate percentile of a column in a DataFrame? I cant find any > percentile_approx function in Spark aggregation functions. For
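A fuller sketch of calling a Hive UDAF such as percentile_approx from the DataFrame API: the first argument to callUDF is the function name, and a HiveContext is required in these versions. The `df` variable and column name `mycol` are illustrative.

```scala
import org.apache.spark.sql.functions._

// 25th percentile of a hypothetical numeric column via Hive's
// percentile_approx, invoked by name through callUDF.
val p25 = df.agg(callUDF("percentile_approx", col("mycol"), lit(0.25)).as("p25"))
```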

[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2015-10-09 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951092#comment-14951092 ] Michael Armbrust commented on SPARK-9776: - I want to get rid of hive context and only have

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Michael Armbrust
*From:* Umesh Kacha [mailto:umesh.ka...@gmail.com] > *Sent:* Friday, October 09, 2015 4:10 PM > *To:* Ellafi, Saif A. > *Cc:* Michael Armbrust; user > > *Subject:* Re: How to calculate percentile of a column of DataFrame? > > > > I found it in 1.3 documentation lit says somethi

Re: Using Sqark SQL mapping over an RDD

2015-10-08 Thread Michael Armbrust
You can't do nested operations on RDDs or DataFrames (i.e. you can't create a DataFrame from within a map function). Perhaps if you explain what you are trying to accomplish someone can suggest another way. On Thu, Oct 8, 2015 at 10:10 AM, Afshartous, Nick wrote: > >

Re: Default size of a datatype in SparkSQL

2015-10-08 Thread Michael Armbrust
It's purely for estimation, when guessing whether it's safe to do a broadcast join. We picked a random number that we thought was larger than the common case (it's better to overestimate to avoid OOM). On Wed, Oct 7, 2015 at 10:11 PM, vivek bhaskar wrote: > I want to understand

[jira] [Resolved] (SPARK-10998) Show non-children in default Expression.toString

2015-10-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10998. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9022

[jira] [Resolved] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast

2015-10-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-8654. - Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8983 [https

Re: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Michael Armbrust
Which version of Spark? On Thu, Oct 8, 2015 at 7:25 AM, wrote: > Hi all, would this be a bug?? > > val ws = Window. > partitionBy("clrty_id"). > orderBy("filemonth_dtt") > > val nm = "repeatMe" >

Re: Using a variable (a column name) in an IF statement in Spark SQL

2015-10-08 Thread Michael Armbrust
Hmm, that looks like it should work to me. What version of Spark? What is the schema of goods? On Thu, Oct 8, 2015 at 6:13 AM, Maheshakya Wijewardena wrote: > Hi, > > Suppose there is data frame called goods with columns "barcode" and > "items". Some of the values in the

Re: Using Sqark SQL mapping over an RDD

2015-10-08 Thread Michael Armbrust
val withoutAnalyticsId = sqlContext.sql("select * from ad_info > where deviceId = '%1s' order by messageTime desc limit 1" format (deviceId)) > >withoutAnalyticsId.take(1)(0) >} > }) > > > > > > Fro

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Michael Armbrust
Don't worry, the ability to work with domain objects and lambda functions is not going to go away. However, we are looking at ways to leverage Tungsten's improved performance when processing structured data. More details can be found here: https://issues.apache.org/jira/browse/SPARK-9999 On

[jira] [Reopened] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast

2015-10-08 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-8654: - This broke tests and got reverted. > Analysis exception when using "NULL IN (...)"

Re: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Michael Armbrust
RowNumber in HiveContext returns null or negative values > > > > Hi, thanks for looking into. v1.5.1. I am really worried. > > I dont have hive/hadoop for real in the environment. > > > > Saif > > > > *From:* Michael Armbrust [mailto:mich...@databricks.com &

Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-07 Thread Michael Armbrust
-dev +user 1). Is that the reason why it's always slow in the first run? Or are there > any other reasons? Apparently it loads data to memory every time so it > shouldn't be something to do with disk read should it? > You are probably seeing the effect of the JVM's JIT. The first run is

[jira] [Created] (SPARK-10998) Show non-children in default Expression.toString

2015-10-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-10998: Summary: Show non-children in default Expression.toString Key: SPARK-10998 URL: https://issues.apache.org/jira/browse/SPARK-10998 Project: Spark

Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-10-07 Thread Michael Armbrust
> >> That sounds fine to me, we already do the filtering so populating that >> field would be pretty simple. >> >> On Sun, Sep 27, 2015 at 2:08 PM Michael Armbrust <mich...@databricks.com> >> wrote: >> >>> We have to try and maintain binary compatib

[jira] [Resolved] (SPARK-10966) Code-generation framework cleanup

2015-10-07 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10966. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9006

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread Michael Armbrust
> > At my company we use Avro heavily and it's not been fun when i've tried to > work with complex avro schemas and python. This may not be relevant to you > however...otherwise I found Python to be a great fit for Spark :) > Have you tried using https://github.com/databricks/spark-avro ? It

[jira] [Created] (SPARK-10966) Code-generation framework cleanup

2015-10-06 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-10966: Summary: Code-generation framework cleanup Key: SPARK-10966 URL: https://issues.apache.org/jira/browse/SPARK-10966 Project: Spark Issue Type: Bug

Re: ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread Michael Armbrust
I believe this is fixed in Spark 1.5.1 as long as the table is only using types that Hive understands and is not partitioned. The problem with partitioned tables is that Hive does not support dynamic discovery unless you manually run the repair command. On Tue, Oct 6, 2015 at 9:33 AM, Umesh

Re: spark hive branch location

2015-10-05 Thread Michael Armbrust
I think this is the most up to date branch (used in Spark 1.5): https://github.com/pwendell/hive/tree/release-1.2.1-spark On Mon, Oct 5, 2015 at 1:03 PM, weoccc wrote: > Hi, > > I would like to know where is the spark hive github location where spark > build depend on ? I was

Re: String operation in filter with a special character

2015-10-05 Thread Michael Armbrust
Double quotes (") are used to create string literals in HiveQL / Spark SQL. So you are asking if the string A+B equals the number 2.0. You should use backticks (`) to escape weird characters in column names. On Mon, Oct 5, 2015 at 12:59 AM, Hemminger Jeff wrote: > I have a
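To illustrate the distinction (the `df` variable and the column named `A+B` are hypothetical):

```scala
// Backticks refer to the column literally named A+B.
// Double quotes ("A+B") would instead produce the string literal "A+B",
// so the comparison would never involve the column at all.
val matches = df.filter("`A+B` = 2.0")
```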

Re: Spark context on thrift server

2015-10-05 Thread Michael Armbrust
Isolation for different sessions will hopefully be fixed by https://github.com/apache/spark/pull/8909 On Mon, Oct 5, 2015 at 8:38 AM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi, > > > > We’re using a spark thrift server and we connect using jdbc to run queries. > > Every time

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-10-05 Thread Michael Armbrust
> Fernando Paladini. > > 2015-10-05 15:23 GMT-03:00 Fernando Paladini <fnpalad...@gmail.com>: > >> Thank you for the replies and sorry about the delay, my e-mail client >> send this conversation to Spam (??). >> >> I'll take a look in your tips and c

[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2015-10-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944340#comment-14944340 ] Michael Armbrust commented on SPARK-9776: - You should not create a HiveContext in the spark-shell

Re: Spark SQL "SELECT ... LIMIT" scans the entire Hive table?

2015-10-05 Thread Michael Armbrust
It does do a take. Run explain to make sure that is the case. Why do you think it's reading the whole table? On Mon, Oct 5, 2015 at 1:53 PM, YaoPau wrote: > I'm using SqlCtx connected to Hive in CDH 5.4.4. When I run "SELECT * FROM > my_db.my_tbl LIMIT 5", it scans the

[jira] [Assigned] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-9999: --- Assignee: Michael Armbrust > RDD-like API on top of Catalyst/DataFr

[jira] [Updated] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-05 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9999: Description: The RDD API is very flexible, and as a result harder to optimize its execution

Re: performance difference between Thrift server and SparkSQL?

2015-10-03 Thread Michael Armbrust
Underneath the covers, the thrift server is just calling hiveContext.sql(...) so this is surprising. Maybe running EXPLAIN or EXPLAIN

Re: How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread Michael Armbrust
import org.apache.spark.sql.functions.* callUDF("MyUDF", col("col1"), col("col2")) On Fri, Oct 2, 2015 at 6:25 AM, unk1102 wrote: > Hi I have registered my hive UDF using the following code: > > hiveContext.udf().register("MyUDF", new UDF1<String,String>() { > public String

Re: SparkSQL: Reading data from hdfs and storing into multiple paths

2015-10-02 Thread Michael Armbrust
Once you convert your data to a dataframe (look at spark-csv), try df.write.partitionBy("", "mm").save("..."). On Thu, Oct 1, 2015 at 4:11 PM, haridass saisriram < haridass.saisri...@gmail.com> wrote: > Hi, > > I am trying to find a simple example to read a data file on HDFS. The > file
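A minimal sketch of the suggested pattern, assuming hypothetical `year`/`month` columns and an illustrative output path (the original message's partition column names are truncated in the snippet above):

```scala
// Write one directory per (year, month) combination, e.g.
// .../year=2015/month=10/, using the DataFrame writer (Spark >= 1.4).
df.write
  .partitionBy("year", "month")
  .format("parquet")
  .save("hdfs:///output/path")
```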

Re: How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread Michael Armbrust
me for resultant columns? For e.g. > when using > > hiveContext.sql("select MyUDF("test") as mytest from myTable"); > > how do we do that in DataFrame callUDF > > callUDF("MyUDF", col("col1"))??? > > On Fri, Oct 2, 2015 at 8:23 PM,

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-09-30 Thread Michael Armbrust
I think the problem here is that you are passing in parsed JSON that is stored as a dictionary (which is converted to a hashmap when going into the JVM). You should instead be passing in the path to the JSON file (formatted as Akhil suggests) so that Spark can do the parsing in parallel. The other

Re: Spark SQL: Implementing Custom Data Source

2015-09-29 Thread Michael Armbrust
Thats a pretty advanced example that uses experimental APIs. I'd suggest looking at https://github.com/databricks/spark-avro as a reference. On Mon, Sep 28, 2015 at 9:00 PM, Ted Yu wrote: > See this thread: > > http://search-hadoop.com/m/q3RTttmiYDqGc202 > > And: > >

Re: unintended consequence of using coalesce operation

2015-09-29 Thread Michael Armbrust
coalesce is generally to avoid launching too many tasks on a bunch of small files. As a result, the goal is to reduce parallelism (when the overhead of that parallelism is more costly than the gain). You are correct that in your case repartition sounds like a better choice. On Tue, Sep 29, 2015

Re: Spark SQL: Implementing Custom Data Source

2015-09-29 Thread Michael Armbrust
ce, it can be used in all spark supported language? That is Scala, > Java, Python and R. :) > I want to take advantage of the interoperability that is already built in > spark. > > Thanks! > > Jerry > > On Tue, Sep 29, 2015 at 11:31 AM, Michael Armbrust <mich...@databr

Re: Spark SQL deprecating Hive? How will I access Hive metadata in the future?

2015-09-29 Thread Michael Armbrust
We are not deprecating HiveQL, nor the ability to read metadata from the metastore. On Tue, Sep 29, 2015 at 12:24 PM, YaoPau wrote: > I've heard that Spark SQL will be or has already started deprecating HQL. > We > have Spark SQL + Python jobs that currently read from the

Re: Performance when iterating over many parquet files

2015-09-28 Thread Michael Armbrust
Another note: for best performance you are going to want your parquet files to be pretty big (100s of mb). You could coalesce them and write them out for more efficient repeat querying. On Mon, Sep 28, 2015 at 2:00 PM, Michael Armbrust <mich...@databricks.com> wrote: > sqlContext.rea

Re: Performance when iterating over many parquet files

2015-09-28 Thread Michael Armbrust
sqlContext.read.parquet takes lists of files. val fileList = sc.textFile("file_list.txt").collect() // this works but using spark is possibly overkill val dataFrame =
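The snippet above is cut off mid-assignment; a sketch of how it presumably continues, expanding the collected paths into the varargs overload of read.parquet (file_list.txt is the hypothetical listing file from the snippet):

```scala
// Collect the list of paths (using Spark to read a small listing file
// is overkill, as noted above), then pass them all to read.parquet.
val fileList = sc.textFile("file_list.txt").collect()
val dataFrame = sqlContext.read.parquet(fileList: _*)
```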

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-27 Thread Michael Armbrust
ach purchase_items? > Since purchase_items is an array of item and each item has a number of > fields (for example product_id and price), is it possible to just explode > these two fields directly using dataframe? > > Best Regards, > > > Jerry > > On Fri, Sep 25, 2015 at 7

Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-09-27 Thread Michael Armbrust
We have to try and maintain binary compatibility here, so probably the easiest thing to do here would be to add a method to the class. Perhaps something like: def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters By default, this could return all filters so behavior would remain

Re: Reading Hive Tables using SQLContext

2015-09-25 Thread Michael Armbrust
lude Hive > tables from SQLContext. > > -Sathish > > On Thu, Sep 24, 2015 at 7:46 PM Michael Armbrust <mich...@databricks.com> > wrote: > >> No, you have to use a HiveContext. >> >> On Thu, Sep 24, 2015 at 2:47 PM, Sathish Kumaran Vairave

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-25 Thread Michael Armbrust
The SQL parser without HiveContext is really simple, which is why I generally recommend users use HiveContext. However, you can do it with dataframes: import org.apache.spark.sql.functions._ table("purchases").select(explode(df("purchase_items")).as("item")) On Fri, Sep 25, 2015 at 4:21 PM,

Re: Spark for Oracle sample code

2015-09-25 Thread Michael Armbrust
In most cases predicates that you add to jdbcDF will be push down into oracle, preventing the whole table from being sent over. df.where("column = 1") Another common pattern is to save the table to parquet or something for repeat querying. Michael On Fri, Sep 25, 2015 at 3:13 PM, Cui Lin

Re: Querying on multiple Hive stores using Apache Spark

2015-09-24 Thread Michael Armbrust
This is not supported yet, though, we laid a lot of the ground work for doing this in Spark 1.4. On Wed, Sep 23, 2015 at 11:17 PM, Karthik wrote: > Any ideas or suggestions? > > Thanks, > Karthik. > > > > -- > View this message in context: >

Re: Reading Hive Tables using SQLContext

2015-09-24 Thread Michael Armbrust
No, you have to use a HiveContext. On Thu, Sep 24, 2015 at 2:47 PM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Hello, > > Is it possible to access Hive tables directly from SQLContext instead of > HiveContext? I am facing with errors while doing it. > > Please let me know >

[jira] [Resolved] (SPARK-10494) Multiple Python UDFs together with aggregation or sort merge join may cause OOM (failed to acquire memory)

2015-09-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10494. -- Resolution: Fixed Assignee: Reynold Xin Fix Version/s: 1.5.1

[jira] [Updated] (SPARK-10297) When save data to a data source table, we should bound the size of a saved file

2015-09-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10297: - Target Version/s: 1.6.0 (was: 1.6.0, 1.5.1) > When save data to a data source table,

[jira] [Commented] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2015-09-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904909#comment-14904909 ] Michael Armbrust commented on SPARK-10659: -- /cc [~lian cheng] > DataFrames and Spark

[jira] [Updated] (SPARK-10294) When Parquet writer's close method throws an exception, we will call close again and trigger a NPE

2015-09-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10294: - Target Version/s: 1.6.0 (was: 1.5.1) > When Parquet writer's close method thr

[jira] [Commented] (SPARK-10727) Dataframe count is zero after 'except' operation

2015-09-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904900#comment-14904900 ] Michael Armbrust commented on SPARK-10727: -- Thanks for reporting. This should be fixed in 1.5.1

[jira] [Updated] (SPARK-10765) use new aggregate interface for hive UDAF

2015-09-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10765: - Assignee: Wenchen Fan > use new aggregate interface for hive U

[jira] [Updated] (SPARK-10448) Parquet schema merging should NOT merge UDT

2015-09-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10448: - Target Version/s: 1.6.0 (was: 1.6.0, 1.5.1) > Parquet schema merging should NOT me

[jira] [Resolved] (SPARK-10403) UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)

2015-09-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10403. -- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Issue resolved

[jira] [Updated] (SPARK-10765) use new aggregate interface for hive UDAF

2015-09-23 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10765: - Target Version/s: 1.6.0 > use new aggregate interface for hive U

[jira] [Resolved] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType

2015-09-22 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10485. -- Resolution: Fixed I tested on 1.5 and it seems fixed to me. Please reopen if you have

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Michael Armbrust
e.co.uk > 07714140812 > > > > On 22 September 2015 at 19:28, Michael Armbrust <mich...@databricks.com> > wrote: > >> I think that you are hitting a bug (which should be fixed in Spark >> 1.5.1). I'm hoping we can cut an RC for that this week. Until then

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Michael Armbrust
I think that you are hitting a bug (which should be fixed in Spark 1.5.1). I'm hoping we can cut an RC for that this week. Until then you could try building branch-1.5. On Tue, Sep 22, 2015 at 11:13 AM, Deenar Toraskar wrote: > Hi > > I am trying to write an UDAF

Re: Count for select not matching count for group by

2015-09-22 Thread Michael Armbrust
This looks like something is wrong with predicate pushdown. Can you include the output of calling explain, and tell us what format the data is stored in? On Mon, Sep 21, 2015 at 8:06 AM, Michael Kelly wrote: > Hi, > > I'm seeing some strange behaviour with spark

Re: column identifiers in Spark SQL

2015-09-22 Thread Michael Armbrust
Are you using a SQLContext or a HiveContext? The programming guide suggests the latter, as the former is really only there because some applications may have conflicts with Hive dependencies. SQLContext is case sensitive by default, whereas the HiveContext is not. The parser in HiveContext is

Re: column identifiers in Spark SQL

2015-09-22 Thread Michael Armbrust
identifiers are swallowed up: > > // this now returns rows consisting of the string literal "cd" > sqlContext.sql("""select "c""d" from test_data""").show > > Thanks, > -Rick > > Michael Armbrust <mich...@databric

Re: HiveQL Compatibility (0.12.0, 0.13.0???)

2015-09-21 Thread Michael Armbrust
In general we welcome pull requests for these kind of updates. In this case its already been fixed in master and branch-1.5 and will be updated when we release 1.5.1 (hopefully soon). On Mon, Sep 21, 2015 at 1:21 PM, Dominic Ricard < dominic.ric...@tritondigital.com> wrote: > Hi, >here's a

Re: How to add sparkSQL into a standalone application

2015-09-17 Thread Michael Armbrust
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.4.1" Though, I would consider using spark-hive and HiveContext, as the query parser is more powerful and you'll have access to window functions and other features. On Thu, Sep 17, 2015 at 10:59 AM, Cui Lin

[jira] [Resolved] (SPARK-10650) Spark docs include test and other extra classes

2015-09-17 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10650. -- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Issue resolved

Re: How to add sparkSQL into a standalone application

2015-09-17 Thread Michael Armbrust
Context, do I have to setup Hive on server? Can > I run this on my local laptop? > > On Thu, Sep 17, 2015 at 11:02 AM, Michael Armbrust <mich...@databricks.com > > wrote: > >> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.4.1" &g

Re: Can we do dataframe.query like Pandas dataframe in spark?

2015-09-17 Thread Michael Armbrust
from pyspark.sql.functions import * df = sqlContext.range(10).select(rand().alias("a"), rand().alias("b")) df.where("a > b").show()

[jira] [Resolved] (SPARK-10639) Need to convert UDAF's result from scala to sql type

2015-09-17 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10639. -- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Issue resolved

[jira] [Resolved] (SPARK-10667) Add 86 Runnable TPCDS Queries into spark-sql-perf

2015-09-17 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10667. -- Resolution: Invalid Yeah, can we move this over to github issues? > Add 86 Runna

[jira] [Updated] (SPARK-10058) Flaky test: HeartbeatReceiverSuite: normal heartbeat

2015-09-17 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10058: - Priority: Critical (was: Blocker) > Flaky test: HeartbeatReceiverSuite: nor

[jira] [Updated] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType

2015-09-16 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10485: - Affects Version/s: 1.5.0 > IF expression is not correctly resolved when

[jira] [Updated] (SPARK-10650) Spark docs include test and other extra classes

2015-09-16 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10650: - Assignee: Michael Armbrust (was: Andrew Or) > Spark docs include test and other ex

[jira] [Resolved] (SPARK-6504) Cannot read Parquet files generated from different versions at once

2015-09-16 Thread Michael Armbrust (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6504. - Resolution: Fixed Fix Version/s: 1.3.1 This should be fixed. Please reopen if you
