Re: String operation in filter with a special character

2015-10-05 Thread Michael Armbrust
Double quotes (") are used to create string literals in HiveQL / Spark SQL. So you are asking if the string A+B equals the number 2.0. You should use backticks (`) to escape weird characters in column names. On Mon, Oct 5, 2015 at 12:59 AM, Hemminger Jeff wrote: > I have a

Re: Spark context on thrift server

2015-10-05 Thread Michael Armbrust
Isolation for different sessions will hopefully be fixed by https://github.com/apache/spark/pull/8909 On Mon, Oct 5, 2015 at 8:38 AM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi, > > > > We’re using a spark thrift server and we connect using jdbc to run queries. > > Every time

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-10-05 Thread Michael Armbrust
> Fernando Paladini. > > 2015-10-05 15:23 GMT-03:00 Fernando Paladini <fnpalad...@gmail.com>: > >> Thank you for the replies and sorry about the delay, my e-mail client >> send this conversation to Spam (??). >> >> I'll take a look in your tips and c

Re: Spark SQL "SELECT ... LIMIT" scans the entire Hive table?

2015-10-05 Thread Michael Armbrust
It does do a take. Run explain to make sure that is the case. Why do you think it's reading the whole table? On Mon, Oct 5, 2015 at 1:53 PM, YaoPau wrote: > I'm using SqlCtx connected to Hive in CDH 5.4.4. When I run "SELECT * FROM > my_db.my_tbl LIMIT 5", it scans the

Re: performance difference between Thrift server and SparkSQL?

2015-10-03 Thread Michael Armbrust
Underneath the covers, the thrift server is just calling hiveContext.sql(...) so this is surprising. Maybe running EXPLAIN or EXPLAIN

Re: How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread Michael Armbrust
import org.apache.spark.sql.functions.* callUDF("MyUDF", col("col1"), col("col2")) On Fri, Oct 2, 2015 at 6:25 AM, unk1102 wrote: > Hi I have registed my hive UDF using the following code: > > hiveContext.udf().register("MyUDF",new UDF1(String,String)) { > public String
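A hedged Scala sketch of the full round trip (the original question used a Java UDF1; the UDF body below is hypothetical):

    import org.apache.spark.sql.functions._
    hiveContext.udf.register("MyUDF", (a: String, b: String) => a + ":" + b)   // stand-in implementation
    df.select(callUDF("MyUDF", col("col1"), col("col2")).as("combined"))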

Re: SparkSQL: Reading data from hdfs and storing into multiple paths

2015-10-02 Thread Michael Armbrust
Once you convert your data to a dataframe (look at spark-csv), try df.write.partitionBy("", "mm").save("..."). On Thu, Oct 1, 2015 at 4:11 PM, haridass saisriram < haridass.saisri...@gmail.com> wrote: > Hi, > > I am trying to find a simple example to read a data file on HDFS. The > file
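A hedged sketch of that flow, assuming spark-csv is on the classpath; the paths and the partition columns "year"/"month" are hypothetical:

    val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("hdfs:///input/data.csv")
    df.write.partitionBy("year", "month").parquet("hdfs:///output/data")   // one sub-directory per year=/month= value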

Re: How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread Michael Armbrust
me for resultant columns? For e.g. > when using > > hiveContext.sql("select MyUDF("test") as mytest from myTable"); > > how do we do that in DataFrame callUDF > > callUDF("MyUDF", col("col1"))??? > > On Fri, Oct 2, 2015 at 8:23 PM,
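Not shown in the snippet, but the aliasing being asked about is normally done with .as()/.alias() on the Column returned by callUDF; a hedged sketch:

    import org.apache.spark.sql.functions._
    df.select(callUDF("MyUDF", col("col1")).as("mytest"))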

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-09-30 Thread Michael Armbrust
I think the problem here is that you are passing in parsed JSON that is stored as a dictionary (which is converted to a hashmap when going into the JVM). You should instead be passing in the path to the json file (formatted as Akhil suggests) so that Spark can do the parsing in parallel. The other

Re: Spark SQL: Implementing Custom Data Source

2015-09-29 Thread Michael Armbrust
Thats a pretty advanced example that uses experimental APIs. I'd suggest looking at https://github.com/databricks/spark-avro as a reference. On Mon, Sep 28, 2015 at 9:00 PM, Ted Yu wrote: > See this thread: > > http://search-hadoop.com/m/q3RTttmiYDqGc202 > > And: > >

Re: unintended consequence of using coalesce operation

2015-09-29 Thread Michael Armbrust
coalesce is generally used to avoid launching too many tasks on a bunch of small files. As a result, the goal is to reduce parallelism (when the overhead of that parallelism is more costly than the gain). You are correct that in your case repartition sounds like a better choice. On Tue, Sep 29, 2015

Re: Spark SQL: Implementing Custom Data Source

2015-09-29 Thread Michael Armbrust
ce, it can be used in all spark supported language? That is Scala, > Java, Python and R. :) > I want to take advantage of the interoperability that is already built in > spark. > > Thanks! > > Jerry > > On Tue, Sep 29, 2015 at 11:31 AM, Michael Armbrust <mich...@databr

Re: Spark SQL deprecating Hive? How will I access Hive metadata in the future?

2015-09-29 Thread Michael Armbrust
We are not deprecating HiveQL, nor the ability to read metadata from the metastore. On Tue, Sep 29, 2015 at 12:24 PM, YaoPau wrote: > I've heard that Spark SQL will be or has already started deprecating HQL. > We > have Spark SQL + Python jobs that currently read from the

Re: Performance when iterating over many parquet files

2015-09-28 Thread Michael Armbrust
Another note: for best performance you are going to want your parquet files to be pretty big (100s of MB). You could coalesce them and write them out for more efficient repeat querying. On Mon, Sep 28, 2015 at 2:00 PM, Michael Armbrust <mich...@databricks.com> wrote: > sqlContext.rea

Re: Performance when iterating over many parquet files

2015-09-28 Thread Michael Armbrust
sqlContext.read.parquet takes lists of files. val fileList = sc.textFile("file_list.txt").collect() // this works but using spark is possibly overkill val dataFrame =
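A hedged completion of the idea, since read.parquet accepts multiple paths as varargs (file_list.txt is the hypothetical listing from the question):

    val fileList = sc.textFile("file_list.txt").collect()   // or read the list locally; Spark is overkill here
    val dataFrame = sqlContext.read.parquet(fileList: _*)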

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-27 Thread Michael Armbrust
ach purchase_items? > Since purchase_items is an array of item and each item has a number of > fields (for example product_id and price), is it possible to just explode > these two fields directly using dataframe? > > Best Regards, > > > Jerry > > On Fri, Sep 25, 2015 at 7

Re: Reading Hive Tables using SQLContext

2015-09-25 Thread Michael Armbrust
lude Hive > tables from SQLContext. > > -Sathish > > On Thu, Sep 24, 2015 at 7:46 PM Michael Armbrust <mich...@databricks.com> > wrote: > >> No, you have to use a HiveContext. >> >> On Thu, Sep 24, 2015 at 2:47 PM, Sathish Kumaran Vairave

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-25 Thread Michael Armbrust
The SQL parser without HiveContext is really simple, which is why I generally recommend users use HiveContext. However, you can do it with dataframes: import org.apache.spark.sql.functions._ table("purchases").select(explode(df("purchase_items")).as("item")) On Fri, Sep 25, 2015 at 4:21 PM,
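A hedged follow-up to the question quoted in the later message: fields of the exploded struct can then be selected by name (product_id and price are the asker's example fields):

    import org.apache.spark.sql.functions._
    sqlContext.table("purchases")
      .select(explode(col("purchase_items")).as("item"))
      .select(col("item.product_id"), col("item.price"))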

Re: Spark for Oracle sample code

2015-09-25 Thread Michael Armbrust
In most cases predicates that you add to jdbcDF will be pushed down into Oracle, preventing the whole table from being sent over. df.where("column = 1") Another common pattern is to save the table to parquet or something for repeat querying. Michael On Fri, Sep 25, 2015 at 3:13 PM, Cui Lin
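A minimal sketch of both patterns; the Oracle connection details and the column name are hypothetical:

    val jdbcDF = sqlContext.read.format("jdbc")
      .options(Map("url" -> "jdbc:oracle:thin:@//dbhost:1521/service", "dbtable" -> "MY_TABLE"))
      .load()
    jdbcDF.where("column = 1").show()        // the predicate is pushed down to Oracle where possible
    jdbcDF.write.parquet("/data/my_table")   // materialize once for repeated querying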

Re: Querying on multiple Hive stores using Apache Spark

2015-09-24 Thread Michael Armbrust
This is not supported yet, though, we laid a lot of the ground work for doing this in Spark 1.4. On Wed, Sep 23, 2015 at 11:17 PM, Karthik wrote: > Any ideas or suggestions? > > Thanks, > Karthik. > > > > -- > View this message in context: >

Re: Reading Hive Tables using SQLContext

2015-09-24 Thread Michael Armbrust
No, you have to use a HiveContext. On Thu, Sep 24, 2015 at 2:47 PM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Hello, > > Is it possible to access Hive tables directly from SQLContext instead of > HiveContext? I am facing with errors while doing it. > > Please let me know >

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Michael Armbrust
e.co.uk > 07714140812 > > > > On 22 September 2015 at 19:28, Michael Armbrust <mich...@databricks.com> > wrote: > >> I think that you are hitting a bug (which should be fixed in Spark >> 1.5.1). I'm hoping we can cut an RC for that this week. Until then

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Michael Armbrust
I think that you are hitting a bug (which should be fixed in Spark 1.5.1). I'm hoping we can cut an RC for that this week. Until then you could try building branch-1.5. On Tue, Sep 22, 2015 at 11:13 AM, Deenar Toraskar wrote: > Hi > > I am trying to write an UDAF

Re: Count for select not matching count for group by

2015-09-22 Thread Michael Armbrust
This looks like something is wrong with predicate pushdown. Can you include the output of calling explain, and tell us what format the data is stored in? On Mon, Sep 21, 2015 at 8:06 AM, Michael Kelly wrote: > Hi, > > I'm seeing some strange behaviour with spark

Re: HiveQL Compatibility (0.12.0, 0.13.0???)

2015-09-21 Thread Michael Armbrust
In general we welcome pull requests for these kinds of updates. In this case it's already been fixed in master and branch-1.5 and will be updated when we release 1.5.1 (hopefully soon). On Mon, Sep 21, 2015 at 1:21 PM, Dominic Ricard < dominic.ric...@tritondigital.com> wrote: > Hi, > here's a

Re: How to add sparkSQL into a standalone application

2015-09-17 Thread Michael Armbrust
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.4.1" Though, I would consider using spark-hive and HiveContext, as the query parser is more powerful and you'll have access to window functions and other features. On Thu, Sep 17, 2015 at 10:59 AM, Cui Lin
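A minimal sketch of the spark-hive alternative mentioned above, using the same version; the coordinates are the standard Maven Central ones:

    libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.4.1"
    // then, in the application:
    // val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)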

Re: How to add sparkSQL into a standalone application

2015-09-17 Thread Michael Armbrust
Context, do I have to setup Hive on server? Can > I run this on my local laptop? > > On Thu, Sep 17, 2015 at 11:02 AM, Michael Armbrust <mich...@databricks.com > > wrote: > >> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.4.1" &g

Re: Can we do dataframe.query like Pandas dataframe in spark?

2015-09-17 Thread Michael Armbrust
from pyspark.sql.functions import * df = sqlContext.range(10).select(rand().alias("a"), rand().alias("b")) df.where("a > b").show() | a| b| |0.6697439215581628|0.23420961030968923|

Re: selecting columns with the same name in a join

2015-09-11 Thread Michael Armbrust
Here is what I get on branch-1.5: x = sc.parallelize([dict(k=1, v="Evert"), dict(k=2, v="Erik")]).toDF() y = sc.parallelize([dict(k=1, v="Ruud"), dict(k=3, v="Vincent")]).toDF() x.registerTempTable('x') y.registerTempTable('y') sqlContext.sql("select y.v, x.v FROM x INNER JOIN y ON

Re: Custom UDAF Evaluated Over Window

2015-09-10 Thread Michael Armbrust
The only way to do this today is to write it as a Hive UDAF. We hope to improve the window functions to use our native aggregation in a future release. On Thu, Sep 10, 2015 at 12:26 AM, xander92 wrote: > While testing out the new UserDefinedAggregateFunction in

Re: Creating Parquet external table using HiveContext API

2015-09-10 Thread Michael Armbrust
Easiest is to just use SQL: hiveContext.sql("CREATE TABLE USING parquet OPTIONS (path '')") When you specify the path its automatically created as an external table. The schema will be discovered. On Wed, Sep 9, 2015 at 9:33 PM, Mohammad Islam wrote: > Hi, > I
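A hedged sketch with a hypothetical table name and path:

    hiveContext.sql("CREATE TABLE my_table USING parquet OPTIONS (path '/data/my_table')")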

Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Michael Armbrust
I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so this is surprising. In my experiments Spark 1.5 is either the same or faster than 1.4 with only small exceptions. A few thoughts: - 600 partitions is probably way too many for 6G of data. - Providing the output of

Re: Avoiding SQL Injection in Spark SQL

2015-09-10 Thread Michael Armbrust
Either that or use the DataFrame API, which directly constructs query plans and thus doesn't suffer from injection attacks (and runs on the same execution engine). On Thu, Sep 10, 2015 at 12:10 AM, Sean Owen wrote: > I don't think this is Spark-specific. Mostly you need to
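A minimal sketch of the point, assuming a DataFrame df with a hypothetical column name; user input is bound as a literal value in the query plan rather than parsed as SQL:

    import org.apache.spark.sql.functions._
    val userInput = "Robert'); DROP TABLE students; --"
    df.filter(col("name") === userInput).show()   // the payload is just a string literal, never executed as SQL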

Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
Also, do you mean two partitions or two partition columns? If there are many partitions it can be much slower. In Spark 1.5 I'd consider setting spark.sql.hive.metastorePartitionPruning=true if you have predicates over the partition columns. On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust

Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
What format is this table? For parquet and other optimized formats we cache a bunch of file metadata on first access to make interactive queries faster. On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan wrote: > Hello, > > I am using SparkSQL to query some Hive tables. Most of

Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
ny can move to 1.5. Would you know some workaround > for this bug? > If I cannot find workaround for this, will have to change our schema > design to reduce number of partitions. > > > Thanks, > > Isabelle > > > > On Fri, Sep 4, 2015 at 12:56 PM, Michael Armb

Re: spark 1.5 sort slow

2015-09-02 Thread Michael Armbrust
Can you include the output of `explain()` for each of the runs? On Tue, Sep 1, 2015 at 1:06 AM, patcharee wrote: > Hi, > > I found spark 1.5 sorting is very slow compared to spark 1.4. Below is my > code snippet > > val sqlRDD = sql("select date, u, v, z from

Re: wild cards in spark sql

2015-09-02 Thread Michael Armbrust
That query should work. On Wed, Sep 2, 2015 at 1:50 PM, Hafiz Mujadid wrote: > Hi > > does spark sql support wild cards to filter data in sql queries just like > we > can filter data in sql queries in RDBMS with different wild cards like % > and > ? etc. In other words

Re: Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

2015-09-02 Thread Michael Armbrust
Before Spark 1.5, tables created using saveAsTable cannot be queried by Hive because we only store Spark SQL metadata. In Spark 1.5 for parquet and ORC we store both, but this will not work with partitioned tables because hive does not support dynamic partition discovery. On Wed, Sep 2, 2015 at

Re: query avro hive table in spark sql

2015-08-27 Thread Michael Armbrust
BTY, spark-avro works great for our experience, but still, some non-tech people just want to use as a SQL shell in spark, like HIVE-CLI. To clarify: you can still use the spark-avro library with pure SQL. Just use the CREATE TABLE ... USING com.databricks.spark.avro OPTIONS (path '...')
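A hedged sketch of the pure-SQL route, with a hypothetical path; the resulting table then behaves like any other:

    sqlContext.sql("CREATE TEMPORARY TABLE events USING com.databricks.spark.avro OPTIONS (path '/data/events.avro')")
    sqlContext.sql("SELECT count(*) FROM events").show()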

Re: Data Frame support CSV or excel format ?

2015-08-27 Thread Michael Armbrust
Check out spark-csv: http://spark-packages.org/package/databricks/spark-csv On Thu, Aug 27, 2015 at 11:48 AM, spark user spark_u...@yahoo.com.invalid wrote: Hi all , Can we create data frame from excels sheet or csv file , in below example It seems they support only json ? DataFrame df =

Re: Differing performance in self joins

2015-08-26 Thread Michael Armbrust
-dev +user I'd suggest running .explain() on both dataframes to understand the performance better. The problem is likely that we have a pattern that looks for cases where you have an equality predicate where either side can be evaluated using one side of the join. We turn this into a hash join.

Re: query avro hive table in spark sql

2015-08-26 Thread Michael Armbrust
I'd suggest looking at http://spark-packages.org/package/databricks/spark-avro On Wed, Aug 26, 2015 at 11:32 AM, gpatcham gpatc...@gmail.com wrote: Hi, I'm trying to query hive table which is based on avro in spark SQL and seeing below errors. 15/08/26 17:51:12 WARN avro.AvroSerdeUtils:

Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Michael Armbrust
I'd suggest setting sbt to fork when running tests. On Wed, Aug 26, 2015 at 10:51 AM, Mike Trienis mike.trie...@orcsol.com wrote: Thanks for your response Yana, I can increase the MaxPermSize parameter and it will allow me to run the unit test a few more times before I run out of memory.

Re: What does Attribute and AttributeReference mean in Spark SQL

2015-08-25 Thread Michael Armbrust
Attribute is the Catalyst name for an input column from a child operator. An AttributeReference has been resolved, meaning we know which input column in particular it is referring to. An AttributeReference also has a known DataType. In contrast, before analysis there might still exist

Re: Drop table and Hive warehouse

2015-08-24 Thread Michael Armbrust
Thats not the expected behavior. What version of Spark? On Mon, Aug 24, 2015 at 1:32 AM, Kevin Jung itsjb.j...@samsung.com wrote: When I store DataFrame as table with command saveAsTable and then execute DROP TABLE in SparkSQL, it doesn't actually delete files in hive warehouse. The table

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
footers within the folders? On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott scott.sere...@nielsen.com wrote: Can you please remove me from this distribution list? (Filling up my inbox too fast) *From:* Michael Armbrust [mailto:mich...@databricks.com] *Sent:* Monday, August 24, 2015

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
Follow the directions here: http://spark.apache.org/community.html On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott scott.sere...@nielsen.com wrote: Can you please remove me from this distribution list? (Filling up my inbox too fast) *From:* Michael Armbrust [mailto:mich

Re: Array Out OF Bound Exception

2015-08-24 Thread Michael Armbrust
This top line here is indicating that the exception is being thrown from your code (i.e. code written in the console). at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:40) Check to make sure that you are properly handling data

Re: DataFrame/JDBC very slow performance

2015-08-24 Thread Michael Armbrust
Much appreciated! I am not comparing with select count(*) for performance, but it was one simple thing I tried to check the performance :). I think it now makes sense since Spark tries to extract all records before doing the count. I thought having an aggregated function query submitted over

Re: SparkSQL concerning materials

2015-08-23 Thread Michael Armbrust
Here's a longer version of that talk that I gave, which goes into more detail on the internals: http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune On Fri, Aug 21, 2015 at 8:28 AM, Sameer Farooqui same...@databricks.com wrote: Have you seen the Spark SQL paper?:

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Michael Armbrust
We should not be actually scanning all of the data of all of the partitions, but we do need to at least list all of the available directories so that we can apply your predicates to the actual values that are present when we are deciding which files need to be read in a given spark job. While

Re: DataFrameWriter.jdbc is very slow

2015-08-20 Thread Michael Armbrust
We will probably fix this in Spark 1.6 https://issues.apache.org/jira/browse/SPARK-10040 On Thu, Aug 20, 2015 at 5:18 AM, Aram Mkrtchyan aram.mkrtchyan...@gmail.com wrote: We want to migrate our data (approximately 20M rows) from parquet to postgres, when we are using dataframe writer's jdbc

Re: Data frame created from hive table and its partition

2015-08-20 Thread Michael Armbrust
There is no such thing as primary keys in the Hive metastore, but Spark SQL does support partitioned hive tables: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables DataFrameWriter also has a partitionBy method. On Thu, Aug 20, 2015 at 7:29

Re: Json Serde used by Spark Sql

2015-08-18 Thread Michael Armbrust
Under the covers we use Jackson's Streaming API as of Spark 1.4. On Tue, Aug 18, 2015 at 1:12 PM, Udit Mehta ume...@groupon.com wrote: Hi, I was wondering what json serde does spark sql use. I created a JsonRDD out of a json file and then registered it as a temp table to query. I can then

Re: Does Spark optimization might miss to run transformation?

2015-08-13 Thread Michael Armbrust
-dev If you want to guarantee the side effects happen you should use foreach or foreachPartition. A `take`, for example, might only evaluate a subset of the partitions until it finds enough results. On Wed, Aug 12, 2015 at 7:06 AM, Eugene Morozov fathers...@list.ru wrote: Hi! I’d like to

Re: About Databricks's spark-sql-perf

2015-08-13 Thread Michael Armbrust
Hey sorry, I've been doing a bunch of refactoring on this project. Most of the data generation was a huge hack (it was done before we supported partitioning natively) and used some private APIs that don't exist anymore. As a result, while doing the regression tests for 1.5 I deleted a bunch of

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Michael Armbrust
Older versions of Spark (i.e. when it was still called SchemaRDD instead of DataFrame) did not support merging different parquet schema. However, Spark 1.4+ should. On Sat, Aug 8, 2015 at 8:58 PM, sim s...@swoop.com wrote: Adam, did you find a solution for this? -- View this message in

Re: How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread Michael Armbrust
You can't create a DataFrame from an arbitrary object since we don't know how to figure out the schema. You can either create a JavaBean https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema or manually create a row + specify the schema
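A hedged sketch of the second route (manual Row + schema); the input RDD and field mapping are stand-ins:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    val customRDD = sc.parallelize(Seq("a", "b", "c"))   // stand-in for an RDD built from a custom InputFormat
    val schema = StructType(Seq(StructField("key", StringType), StructField("length", IntegerType)))
    val rows = customRDD.map(rec => Row(rec, rec.length))
    val df = sqlContext.createDataFrame(rows, schema)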

Re: Pagination on big table, splitting joins

2015-08-10 Thread Michael Armbrust
I think to use *toLocalIterator* method and something like that, but I have doubts about memory and parallelism and sure there is a better way to do it. It will still run all earlier parts of the job in parallel. Only the actual retrieving of the final partitions will be serial. This is

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Michael Armbrust
You will need to use a HiveContext for window functions to work. On Mon, Aug 10, 2015 at 1:26 PM, Jerry jerry.c...@gmail.com wrote: Hello, Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries to a data frame and I'm trying to figure out if I just have a bad setup or if

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Michael Armbrust
this works with the schema changing over time? Must the Hive tables be set up as external tables outside of saveAsTable? In my experience, in 1.4.1, writing to a table with SaveMode.Append fails if the schema don't match. Thanks, Sim From: Michael Armbrust mich...@databricks.com Date: Monday

Re: Spark SQL query AVRO file

2015-08-07 Thread Michael Armbrust
You can register your data as a table using this library and then query it using HiveQL: CREATE TEMPORARY TABLE episodes USING com.databricks.spark.avro OPTIONS (path "src/test/resources/episodes.avro") On Fri, Aug 7, 2015 at 11:42 AM, java8964 java8...@hotmail.com wrote: Hi, Michael: I am not

Re: Spark SQL Hive - merge small files

2015-08-05 Thread Michael Armbrust
This feature isn't currently supported. On Wed, Aug 5, 2015 at 8:43 AM, Brandon White bwwintheho...@gmail.com wrote: Hello, I would love to have hive merge the small files in my managed hive context after every query. Right now, I am setting the hive configuration in my Spark Job

Re: Spark SQL support for Hive 0.14

2015-08-04 Thread Michael Armbrust
I'll add that while Spark SQL 1.5 compiles against Hive 1.2.1, it has support for reading from metastores for Hive 0.12 - 1.2.1 On Tue, Aug 4, 2015 at 9:59 AM, Steve Loughran ste...@hortonworks.com wrote: Spark 1.3.1 1.4 only support Hive 0.13 Spark 1.5 is going to be released against Hive

Re: shutdown local hivecontext?

2015-08-03 Thread Michael Armbrust
TestHive takes care of creating a temporary directory for each invocation so that multiple test runs won't conflict. On Mon, Aug 3, 2015 at 3:09 PM, Cesar Flores ces...@gmail.com wrote: We are using a local hive context in order to run unit tests. Our unit tests runs perfectly fine if we run

Re: how to ignore MatchError then processing a large json file in spark-sql

2015-08-03 Thread Michael Armbrust
This sounds like a bug. What version of spark? and can you provide the stack trace? On Sun, Aug 2, 2015 at 11:27 AM, fuellee lee lifuyu198...@gmail.com wrote: I'm trying to process a bunch of large json log files with spark, but it fails every time with `scala.MatchError`, Whether I give it

Re: how to convert a sequence of TimeStamp to a dataframe

2015-08-03 Thread Michael Armbrust
In general it needs to be a Seq of Tuples for the implicit toDF to work (which is a little tricky when there is only one column). scala> Seq(Tuple1(new java.sql.Timestamp(System.currentTimeMillis))).toDF("a") res3: org.apache.spark.sql.DataFrame = [a: timestamp] or with multiple columns scala>
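A hedged sketch of the multi-column case being introduced at the end (the tuple contents are hypothetical; assumes a spark-shell where the toDF implicits are in scope):

    Seq((new java.sql.Timestamp(System.currentTimeMillis), "login")).toDF("ts", "event")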

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Michael Armbrust
outer joins work.* Cheers and thanks, Martin 2015-07-30 20:23 GMT+02:00 Michael Armbrust mich...@databricks.com: We don't yet updated nullability information based on predicates as we don't actually leverage this information in many places yet. Why do you want to update the schema? On Thu

Re: [Parquet + Dataframes] Column names with spaces

2015-07-30 Thread Michael Armbrust
You can't use these names due to limitations in parquet (and the library itself will silently generate corrupt files that can't be read, hence the error we throw). You can alias a column by df.select(df("old").alias("new")), which is essentially what withColumnRenamed does. Alias in this case means

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-30 Thread Michael Armbrust
We don't yet update nullability information based on predicates as we don't actually leverage this information in many places yet. Why do you want to update the schema? On Thu, Jul 30, 2015 at 11:19 AM, martinibus77 martin.se...@googlemail.com wrote: Hi all, 1. *Columns in dataframes can be

Re: How to read a Json file with a specific format?

2015-07-29 Thread Michael Armbrust
This isn't totally correct. Spark SQL does support JSON arrays and will implicitly flatten them. However, complete objects or arrays must exist one per line and cannot be split with newlines. On Wed, Jul 29, 2015 at 7:55 AM, Young, Matthew T matthew.t.yo...@intel.com wrote: The built-in

Re: DataFrame DAG recomputed even though DataFrame is cached?

2015-07-28 Thread Michael Armbrust
We will try to address this before Spark 1.5 is released: https://issues.apache.org/jira/browse/SPARK-9141 On Tue, Jul 28, 2015 at 11:50 AM, Kristina Rogale Plazonic kpl...@gmail.com wrote: Hi, I'm puzzling over the following problem: when I cache a small sample of a big dataframe, the

Re: GenericRowWithSchema is too heavy

2015-07-27 Thread Michael Armbrust
Internally I believe that we only actually create one struct object for each row, so you are really only paying the cost of the pointer in most use cases (as shown below). scala> val df = Seq((1,2), (3,4)).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: int, b: int] scala> df.collect() res1:

Re: Issue with column named count in a DataFrame

2015-07-22 Thread Michael Armbrust
Additionally have you tried enclosing count in `backticks`? On Wed, Jul 22, 2015 at 4:25 PM, Michael Armbrust mich...@databricks.com wrote: I believe this will be fixed in Spark 1.5 https://github.com/apache/spark/pull/7237 On Wed, Jul 22, 2015 at 3:04 PM, Young, Matthew T matthew.t.yo

Re: Issue with column named count in a DataFrame

2015-07-22 Thread Michael Armbrust
I believe this will be fixed in Spark 1.5 https://github.com/apache/spark/pull/7237 On Wed, Jul 22, 2015 at 3:04 PM, Young, Matthew T matthew.t.yo...@intel.com wrote: I'm trying to do some simple counting and aggregation in an IPython notebook with Spark 1.4.0 and I have encountered behavior

Re: Add column to DF

2015-07-21 Thread Michael Armbrust
Try instead: import org.apache.spark.sql.functions._ val determineDayPartID = udf((evntStDate: String, evntStHour: String) => { val stFormat = new java.text.SimpleDateFormat("yyMMdd") var stDateStr: String = evntStDate.substring(2,8) val stDate: Date = stFormat.parse(stDateStr) val

Re: Question on Spark SQL for a directory

2015-07-21 Thread Michael Armbrust
https://spark.apache.org/docs/latest/sql-programming-guide.html#loading-data-programmatically On Tue, Jul 21, 2015 at 4:06 PM, Ron Gonzalez zlgonza...@yahoo.com.invalid wrote: Hi, Question on using spark sql. Can someone give an example for creating table from a directory containing

Re: dataframes sql order by not total ordering

2015-07-20 Thread Michael Armbrust
An ORDER BY needs to be on the outermost query otherwise subsequent operations (such as the join) could reorder the tuples. On Mon, Jul 20, 2015 at 9:25 AM, Carol McDonald cmcdon...@maprtech.com wrote: the following query on the Movielens dataset , is sorting by the count of ratings for a

Re: Spark 1.3.1 + Hive: write output to CSV with header on S3

2015-07-17 Thread Michael Armbrust
Using a hive-site.xml file on the classpath. On Fri, Jul 17, 2015 at 8:37 AM, spark user spark_u...@yahoo.com.invalid wrote: Hi Roberto I have question regarding HiveContext . when you create HiveContext where you define Hive connection properties ? Suppose Hive is not in local machine i

Re: MapType vs StructType

2015-07-17 Thread Michael Armbrust
I'll add there is a JIRA to override the default past some threshold of # of unique keys: https://issues.apache.org/jira/browse/SPARK-4476 On Fri, Jul 17, 2015 at 1:32 PM, Michael Armbrust mich...@databricks.com wrote: The difference between

Re: MapType vs StructType

2015-07-17 Thread Michael Armbrust
The difference between a map and a struct here is that in a struct all possible keys are defined as part of the schema and can each can have a different type (and we don't support union types). JSON doesn't have differentiated data structures so we go with the one that gives you more information

Re: Data frames select and where clause dependency

2015-07-17 Thread Michael Armbrust
Each operation on a dataframe is completely independent and doesn't know what operations happened before it. When you do a selection, you are removing other columns from the dataframe and so the filter has nothing to operate on. On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis

Re: PairRDDFunctions and DataFrames

2015-07-16 Thread Michael Armbrust
Instead of using that RDD operation just use the native DataFrame function approxCountDistinct https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$ On Thu, Jul 16, 2015 at 6:58 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi, could someone point me to
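A minimal sketch (the column name is hypothetical):

    import org.apache.spark.sql.functions._
    df.select(approxCountDistinct("user_id")).show()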

Re: Basic Spark SQL question

2015-07-13 Thread Michael Armbrust
I'd look at the JDBC server (a long-running YARN job you can submit queries to): https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server On Mon, Jul 13, 2015 at 6:31 PM, Jerrick Hoang jerrickho...@gmail.com wrote: Well for adhoc queries you can use

Re: [Spark Hive SQL] Set the hive connection in hive context is broken in spark 1.4.1-rc1?

2015-07-10 Thread Michael Armbrust
Metastore configuration should be set in hive-site.xml. On Thu, Jul 9, 2015 at 8:59 PM, Terry Hole hujie.ea...@gmail.com wrote: Hi, I am trying to set the hive metadata destination to a mysql database in hive context, it works fine in spark 1.3.1, but it seems broken in spark 1.4.1-rc1,

Re: [SPARK-SQL] libgplcompression.so already loaded in another classloader

2015-07-08 Thread Michael Armbrust
Here's a related JIRA: https://issues.apache.org/jira/browse/SPARK-7819 Typically you can work around this by making sure that the classes are shared across the isolation boundary, as discussed in the comments. On Tue, Jul 7, 2015 at 3:29 AM, Sea

Re: DataFrame question

2015-07-07 Thread Michael Armbrust
You probably want to explode the array to produce one row per element: df.select(explode(df("links")).alias("link")) On Tue, Jul 7, 2015 at 10:29 AM, Naveen Madhire vmadh...@umail.iu.edu wrote: Hi All, I am working with dataframes and have been struggling with this thing, any pointers would be

Re: Converting spark JDBCRDD to DataFrame

2015-07-06 Thread Michael Armbrust
Use the built in JDBC data source: https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases On Mon, Jul 6, 2015 at 6:42 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi all! what is the most efficient way to convert jdbcRDD to DataFrame. any example?

Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-01 Thread Michael Armbrust
You should probably write a UDF that uses a regular expression or other string munging to canonicalize the subject and then group on that derived column. On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya surajshet...@gmail.com wrote: Thanks Salih. :) The output of the groupby is as below.
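A hedged sketch of that approach; the regex and the column name are only illustrative:

    import org.apache.spark.sql.functions._
    val canonicalize = udf((subject: String) => subject.replaceAll("(?i)^(re|fwd?):\\s*", "").trim.toLowerCase)
    df.groupBy(canonicalize(col("subject")).as("canonical_subject")).count()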

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Michael Armbrust
I would still look at your executor logs. A count() is rewritten by the optimizer to be much more efficient because you don't actually need any of the columns. Also, writing parquet allocates quite a few large buffers. On Wed, Jul 1, 2015 at 5:42 AM, Pooja Jain pooja.ja...@gmail.com wrote:

Re: Custom order by in Spark SQL

2015-07-01 Thread Michael Armbrust
Easiest way to do this today is to define a UDF that maps from string to a number. On Wed, Jul 1, 2015 at 10:25 AM, Mick Davies michael.belldav...@gmail.com wrote: Hi, Is there a way to specify a custom order by (Ordering) on a column in Spark SQL In particular I would like to have the
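A hedged sketch of such a UDF, mapping a hypothetical size column onto a sortable rank:

    import org.apache.spark.sql.functions._
    val sizeRank = udf((s: String) => s match {
      case "small"  => 0
      case "medium" => 1
      case "large"  => 2
      case _        => 3
    })
    df.orderBy(sizeRank(col("size")))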

Re: Check for null in PySpark DataFrame

2015-07-01 Thread Michael Armbrust
There is an isNotNull function on any column. df._1.isNotNull or from pyspark.sql.functions import * col("myColumn").isNotNull On Wed, Jul 1, 2015 at 3:07 AM, Olivier Girardot ssab...@gmail.com wrote: I must admit I've been using the same back to SQL strategy for now :p So I'd be glad to have

Re: BroadcastHashJoin when RDD is not cached

2015-07-01 Thread Michael Armbrust
We don't know that the table is small unless you cache it. In Spark 1.5 you'll be able to give us a hint though ( https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L581 ) On Wed, Jul 1, 2015 at 8:30 AM, Srikanth srikanth...@gmail.com wrote:

Re: Subsecond queries possible?

2015-06-30 Thread Michael Armbrust
This brings up another question/issue - there doesn't seem to be a way to partition cached tables in the same way you can partition, say a Hive table. For example, we would like to partition the overall dataset (233m rows, 9.2Gb) by (product, coupon) so when we run one of these queries

Re: sql dataframe internal representation

2015-06-25 Thread Michael Armbrust
In many cases we use more efficient mutable implementations internally (i.e. mutable undecoded utf8 instead of java.lang.String, or a BigDecimal implementation that uses a Long when the number is small enough). On Thu, Jun 25, 2015 at 1:56 PM, Koert Kuipers ko...@tresata.com wrote: i noticed in

Re: [sparksql] sparse floating point data compression in sparksql cache

2015-06-24 Thread Michael Armbrust
Have you considered instead using the mllib SparseVector type (which is supported in Spark SQL)? On Wed, Jun 24, 2015 at 1:31 PM, Nikita Dolgov n...@beckon.com wrote: When my 22M Parquet test file ended up taking 3G when cached in-memory I looked closer at how column compression works in

Re: How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

2015-06-24 Thread Michael Armbrust
Starting in Spark 1.4 there is also an explode that you can use directly from the select clause (much like in HiveQL): import org.apache.spark.sql.functions._ df.select(explode($"entities.user_mentions").as("mention")) Unlike standard HiveQL, you can also include other attributes in the select or
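A hedged sketch of mixing explode with other attributes in the same select (the extra column is hypothetical, following the Twitter-style schema implied above):

    // assumes import sqlContext.implicits._ for the $"..." syntax
    df.select($"user.screen_name", explode($"entities.user_mentions").as("mention"))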

Re: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Michael Armbrust
You can also do this using a sequence of case classes (in the example stored in a tuple, though the outer container could also be a case class): case class MyRecord(name: String, location: String) val df = Seq((1, Seq(MyRecord("Michael", "Berkeley"), MyRecord("Andy", "Oakland")))).toDF("id", "people")

Re: SparkSQL: leftOuterJoin is VERY slow!

2015-06-19 Thread Michael Armbrust
Broadcast outer joins are on my short list for 1.5. On Fri, Jun 19, 2015 at 10:48 AM, Piero Cinquegrana pcinquegr...@marketshare.com wrote: Hello, I have two RDDs: tv and sessions. I need to convert these DataFrames into RDDs because I need to use the groupByKey function. The reduceByKey
