Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Michael Armbrust
You can create a partitioned hive table using Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables On Mon, Jan 26, 2015 at 5:40 AM, Danny Yates da...@codeaholics.org wrote: Hi, I've got a bunch of data stored in S3 under directories like this:

Re: [SQL] Self join with ArrayType columns problems

2015-01-26 Thread Michael Armbrust
It seems likely that there is some sort of bug related to the reuse of array objects that are returned by UDFs. Can you open a JIRA? I'll also note that the sql method on HiveContext does run HiveQL (configured by spark.sql.dialect) and the hql method has been deprecated since 1.1 (and will

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Michael Armbrust
I'm not actually using Hive at the moment - in fact, I'm trying to avoid it if I can. I'm just wondering whether Spark has anything similar I can leverage? Let me clarify, you do not need to have Hive installed, and what I'm suggesting is completely self-contained in Spark SQL. We support
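
A minimal sketch of what that self-contained partitioned table support looks like through the HiveQL dialect; the table name, columns and S3 paths below are illustrative, not from the thread:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // layout mirrors directories such as s3n://bucket/events/dt=2015-01-25/
    hiveContext.sql("""CREATE EXTERNAL TABLE events (id STRING, value DOUBLE)
                       PARTITIONED BY (dt STRING)
                       LOCATION 's3n://bucket/events'""")
    hiveContext.sql("ALTER TABLE events ADD PARTITION (dt='2015-01-25')")
    // the partition predicate prunes directories instead of scanning everything
    hiveContext.sql("SELECT count(*) FROM events WHERE dt = '2015-01-25'").collect()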

Re: spark 1.2 - Writing parquet fails for timestamp with Unsupported datatype TimestampType

2015-01-26 Thread Michael Armbrust
I'm aiming for 1.3. On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com wrote: Thanks Michael. I am sure there have been many requests for this support. Any release targeted for this? Thanks, On Sat, Jan 24, 2015 at 11:47 AM, Michael Armbrust mich...@databricks.com

Re: what is the roadmap for Spark SQL dialect in the coming releases?

2015-01-25 Thread Michael Armbrust
. So the HQL dialect provided by HiveContext, does it use catalyst optimizer? I thought HiveContext is only related to Hive integration in Spark! Would be grateful if you could clarify this cheers On Sun, Jan 25, 2015 at 1:23 AM, Michael Armbrust mich...@databricks.com wrote: I generally

Re: what is the roadmap for Spark SQL dialect in the coming releases?

2015-01-24 Thread Michael Armbrust
I generally recommend people use the HQL dialect provided by the HiveContext when possible: http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started I'll also note that this is distinct from the Hive on Spark project, which is based on the Hive query optimizer / execution

Re: Support for SQL on unions of tables (merge tables?)

2015-01-24 Thread Michael Armbrust
I have never used Hive, so I'll have to investigate further. To clarify, I wasn't recommending you use Apache Hive, but instead the HiveContext http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started provided by Spark SQL. This will allow you to create views in a hive

Re: spark 1.2 - Writing parquet fails for timestamp with Unsupported datatype TimestampType

2015-01-24 Thread Michael Armbrust
Those annotations actually don't work because the timestamp in SQL has optional nanosecond precision. However, there is a PR to add support using Parquet's INT96 type: https://github.com/apache/spark/pull/3820 On Fri, Jan 23, 2015 at 12:08 PM, Manoj Samel manojsamelt...@gmail.com wrote:

Re: Can't access nested types with sql

2015-01-24 Thread Michael Armbrust
You need to use lateral view explode: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView On Fri, Jan 23, 2015 at 7:02 AM, matthes mdiekst...@sensenetworks.com wrote: I try to work with nested parquet data. To read and write the parquet file is actually working now but
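
A sketch of the suggested query shape, assuming a nested Parquet schema roughly like events: array<struct<name:string, ts:bigint>>; the table and field names are illustrative:

    hiveContext.parquetFile("/data/nested.parquet").registerTempTable("nested")
    // LATERAL VIEW explode() turns each array element into its own row
    hiveContext.sql("""SELECT id, e.name, e.ts
                       FROM nested
                       LATERAL VIEW explode(events) exploded AS e""")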

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Michael Armbrust
1) The fields in the SELECT clause are not pushed down to the predicate pushdown API. I have many optimizations that allow fields to be filtered out before the resulting object is serialized on the Accumulo tablet server. How can I get the selection information from the execution plan? I'm a

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Michael Armbrust
, Jan 17, 2015 at 3:38 PM, Michael Armbrust mich...@databricks.com wrote: 1) The fields in the SELECT clause are not pushed down to the predicate pushdown API. I have many optimizations that allow fields to be filtered out before the resulting object is serialized on the Accumulo tablet server

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-16 Thread Michael Armbrust
: Monday, 12 January 2015 1:21 am To: Nathan nathan.mccar...@quantium.com.au, Michael Armbrust mich...@databricks.com Cc: user@spark.apache.org user@spark.apache.org Subject: Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats? On 1/11/15 1:40 PM, Nathan McCarthy

Re: Using Spark SQL with multiple (avro) files

2015-01-16 Thread Michael Armbrust
I'd open an issue on GitHub to ask us to allow you to use Hadoop's glob file format for the path. On Thu, Jan 15, 2015 at 4:57 AM, David Jones letsnumsperi...@gmail.com wrote: I've tried this now. Spark can load multiple avro files from the same directory by passing a path to a directory.

Re: Parquet compression codecs not applied

2015-01-09 Thread Michael Armbrust
This is a little confusing, but that code path is actually going through hive. So the spark sql configuration does not help. Perhaps, try: set parquet.compression=GZIP; On Fri, Jan 9, 2015 at 2:41 AM, Ayoub benali.ayoub.i...@gmail.com wrote: Hello, I tried to save a table created via the

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-09 Thread Michael Armbrust
The other thing to note here is that Spark SQL defensively copies rows when we switch into user code. This probably explains the difference between 1 & 2. The difference between 1 & 3 is likely the cost of decompressing the column buffers vs. accessing a bunch of uncompressed primitive objects.

Re: Spark with Hive cluster dependencies

2015-01-07 Thread Michael Armbrust
Have you looked at Spark SQL http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables? It supports HiveQL, can read from the hive metastore, and does not require hadoop. On Wed, Jan 7, 2015 at 8:27 AM, jamborta jambo...@gmail.com wrote: Hi all, We have been building a system

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-07 Thread Michael Armbrust
The cache command caches the entire table, with each column stored in its own byte buffer. When querying the data, only the columns that you are asking for are scanned in memory. I'm not sure what mechanism spark is using to report the amount of data read. If you want to read only the data that

Re: Parquet schema changes

2015-01-06 Thread Michael Armbrust
I want to support this but we don't yet. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3851 On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore dragoncu...@gmail.com wrote: Anyone got any further thoughts on this? I saw the _metadata file seems to store the schema of every single

Re: Add StructType column to SchemaRDD

2015-01-05 Thread Michael Armbrust
The types expected by applySchema are documented in the type reference section: http://spark.apache.org/docs/latest/sql-programming-guide.html#spark-sql-datatype-reference I'd certainly accept a PR to improve the docs and add a link to this from the applySchema section :) Can you explain why you

Re: Add StructType column to SchemaRDD

2015-01-05 Thread Michael Armbrust
Oh sorry, I'm rereading your email more carefully. It's only because you have some setup code that you want to amortize? On Mon, Jan 5, 2015 at 10:40 PM, Michael Armbrust mich...@databricks.com wrote: The types expected by applySchema are documented in the type reference section: http

Re: SparkSQL support for reading Avro files

2015-01-05 Thread Michael Armbrust
Did you follow the link on that page? THIS REPO HAS BEEN MOVED: https://github.com/marmbrus/sql-avro#please-go-to-the-version-hosted-by-databricks Please go to the version hosted by Databricks: https://github.com/databricks/spark-avro On Mon, Jan 5, 2015 at 1:12 PM, yanenli2 yane...@gmail.com

Re: Parquet predicate pushdown

2015-01-05 Thread Michael Armbrust
Predicate push down into the input format is turned off by default because there is a bug in the current Parquet library that causes null pointer exceptions when there are row groups that are entirely null. https://issues.apache.org/jira/browse/SPARK-4258 You can turn it on if you want:
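
The message is truncated here; the switch being referred to is the spark.sql.parquet.filterPushdown setting (default false in Spark 1.2), though the exact name may vary by version:

    // enable Parquet predicate pushdown, despite the SPARK-4258 caveat above
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    // or via SQL:
    sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")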

Re: JdbcRdd for Python

2015-01-05 Thread Michael Armbrust
I'll add that there is a JDBC connector for the Spark SQL data sources API in the works, and this will work with python (through the standard SchemaRDD type conversions). On Mon, Jan 5, 2015 at 7:09 AM, Cody Koeninger c...@koeninger.org wrote: JavaDataBaseConnectivity is, as far as I know, JVM

Re: spark 1.2: value toJSON is not a member of org.apache.spark.sql.SchemaRDD

2015-01-05 Thread Michael Armbrust
I think you are missing something: $ javap -cp ~/Downloads/spark-sql_2.10-1.2.0.jar org.apache.spark.sql.SchemaRDD|grep toJSON public org.apache.spark.rdd.RDDjava.lang.String toJSON(); On Mon, Jan 5, 2015 at 3:11 AM, bchazalet bchaza...@companywatch.net wrote: Hi everyone, I have just

Re: Why the major.minor version of the new hive-exec is 51.0?

2014-12-31 Thread Michael Armbrust
-protobuf-2.5.jar is not generated from Spark source code, right ? What would be done after the JIRA is opened ? Cheers On Wed, Dec 31, 2014 at 12:16 PM, Michael Armbrust mich...@databricks.com wrote: This was not intended, can you open a JIRA? On Tue, Dec 30, 2014 at 8:40 PM, Ted Yu

Re: Spark SQL implementation error

2014-12-30 Thread Michael Armbrust
Anytime you see java.lang.NoSuchMethodError it means that you have multiple conflicting versions of a library on the classpath, or you are trying to run code that was compiled against the wrong version of a library. On Tue, Dec 30, 2014 at 1:43 AM, sachin Singh sachin.sha...@gmail.com wrote: I

Re: Compile error from Spark 1.2.0

2014-12-29 Thread Michael Armbrust
Yeah, this looks like a regression in the API due to the addition of arbitrary decimal support. Can you open a JIRA? On Sun, Dec 28, 2014 at 12:23 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Zigen, Looks like they missed it. Thanks Best Regards On Sat, Dec 27, 2014 at 12:43 PM,

Re: A question about using insert into in rdd foreach in spark 1.2

2014-12-29 Thread Michael Armbrust
-dev +user In general you cannot create new RDDs inside closures that run on the executors (which is what sql inside of a foreach is doing). I think what you want here is something like: sqlContext.parquetFile("Data\\Test\\Parquet\\2").registerTempTable("temp2") sql("SELECT col1, col2 FROM
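
Spelled out, the driver-side pattern being suggested looks roughly like this (paths and column names are placeholders):

    // register once on the driver...
    sqlContext.parquetFile("/data/test/parquet/2").registerTempTable("temp2")
    // ...run the query once on the driver...
    val result = sqlContext.sql("SELECT col1, col2 FROM temp2")
    // ...and only then iterate; no SQLContext is touched inside an executor closure
    result.collect().foreach(println)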

Re: Mapping directory structure to columns in SparkSQL

2014-12-29 Thread Michael Armbrust
You can't do this now without writing a bunch of custom logic (see here for an example: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ) I would like to make this easier as part of improvements to the datasources api that we are

Re: How to set up spark sql on ec2

2014-12-29 Thread Michael Armbrust
I would expect this to work. Can you run the standard spark-shell? On Mon, Dec 29, 2014 at 2:34 AM, critikaled isasmani@gmail.com wrote: How to make the spark ec2 script to install hive and spark sql on ec2 when I run the spark ec2 script and go to bin and run ./spark-sql and execute

Re: SchemaRDD to RDD[String]

2014-12-24 Thread Michael Armbrust
You might also try the following, which I think is equivalent: schemaRDD.map(_.mkString(",")) On Wed, Dec 24, 2014 at 8:12 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: I want to convert a schemaRDD into RDD
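
Since a Row behaves like a Seq of its values in this API, the one-liner expands to something like the sketch below:

    // each Row becomes a comma-joined String; nested values fall back to their toString
    val asStrings: org.apache.spark.rdd.RDD[String] = schemaRDD.map(row => row.mkString(","))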

Re: Escape commas in file names

2014-12-24 Thread Michael Armbrust
No, there is not. Can you open a JIRA? On Tue, Dec 23, 2014 at 6:33 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I am trying to load a Parquet file which has a comma in its name. Yes, this is a valid file name in HDFS. However, sqlContext.parquetFile interprets this as a

Re: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-24 Thread Michael Armbrust
The various spark contexts generally aren't serializable because you can't use them on the executors anyway. We made SQLContext serializable just because it gets pulled into scope more often due to the implicit conversions it contains. You should try marking the variable that holds the context
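
A sketch of that workaround, assuming the streaming job keeps the context in a field of a serializable class; the class and method names here are made up:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    class EventProcessor(@transient val sqlContext: SQLContext) extends Serializable {
      // @transient keeps the context out of serialized closures; it is only used on the driver
      def process(lines: RDD[String]): Unit = {
        val events = sqlContext.jsonRDD(lines)
        events.registerTempTable("events")
        sqlContext.sql("SELECT count(*) FROM events").collect().foreach(println)
      }
    }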

Re: hiveContext.jsonFile fails with Unexpected close marker

2014-12-24 Thread Michael Armbrust
Each JSON object needs to be on a single line since this is the boundary the TextFileInputFormat uses when splitting up large files. On Wed, Dec 24, 2014 at 12:34 PM, elliott cordo elliottco...@gmail.com wrote: I have generally been impressed with the way jsonFile eats just about any json data

Re: Can Spark SQL thrift server UI provide JOB kill operate or any REST API?

2014-12-22 Thread Michael Armbrust
I would expect that killing a stage would kill the whole job. Are you not seeing that happen? On Mon, Dec 22, 2014 at 5:09 AM, Xiaoyu Wang wangxy...@gmail.com wrote: Hello everyone! Like the title. I start the Spark SQL 1.2.0 thrift server. Use beeline connect to the server to execute SQL.

Re: java.sql.SQLException: No suitable driver found

2014-12-21 Thread Michael Armbrust
With JDBC you often need to load the class so it can register the driver at the beginning of your program. Usually this is something like: Class.forName("com.mysql.jdbc.Driver"); On Fri, Dec 19, 2014 at 3:47 PM, durga durgak...@gmail.com wrote: Hi I am facing an issue with mysql jars with
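
For example, with the JdbcRDD from Spark core; the connection string, table and driver below are purely illustrative:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(
      sc,
      () => {
        // load the driver on the executor so it registers itself with DriverManager
        Class.forName("com.mysql.jdbc.Driver")
        DriverManager.getConnection("jdbc:mysql://dbhost:3306/mydb", "user", "secret")
      },
      "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
      1, 100000, 4,                       // lower bound, upper bound, number of partitions
      rs => (rs.getInt(1), rs.getString(2)))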

Re: Querying Temp table using JDBC

2014-12-19 Thread Michael Armbrust
This is experimental, but you can start the JDBC server from within your own programs https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L45 by passing it the HiveContext. On Fri, Dec 19, 2014 at 6:04

Re: does spark sql support columnar compression with encoding when caching tables

2014-12-19 Thread Michael Armbrust
a separate table for caching some partitions like 'cache table table_cached as select * from table where date = 201412XX' - the way we are doing right now. On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust mich...@databricks.com wrote: There is only column level encoding (run length

Re: When will Spark SQL support building DB index natively?

2014-12-18 Thread Michael Armbrust
with partition firstly? On Thursday, December 18, 2014 2:28 AM, Michael Armbrust mich...@databricks.com wrote: - Dev list Have you looked at partitioned table support? That would only scan data where the predicate matches the partition. Depending on the cardinality of the customerId

Re: does spark sql support columnar compression with encoding when caching tables

2014-12-18 Thread Michael Armbrust
There is only column level encoding (run length encoding, delta encoding, dictionary encoding) and no generic compression. On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood sadhan.s...@gmail.com wrote: Hi All, Wondering if when caching a table backed by lzo compressed parquet data, if spark also

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
You can create an RDD[String] using whatever method and pass that to jsonRDD. On Wed, Dec 17, 2014 at 8:33 AM, Jerry Lam chiling...@gmail.com wrote: Hi Ted, Thanks for your help. I'm able to read lzo files using sparkContext.newAPIHadoopFile but I couldn't do the same for sqlContext because

Re: When will Spark SQL support building DB index natively?

2014-12-17 Thread Michael Armbrust
- Dev list Have you looked at partitioned table support? That would only scan data where the predicate matches the partition. Depending on the cardinality of the customerId column that could be a good option for you. On Wed, Dec 17, 2014 at 2:25 AM, Xuelin Cao xuelin...@yahoo.com.invalid

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
because it is saving one stage. Did I do something wrong? Best Regards, Jerry On Wed, Dec 17, 2014 at 1:29 PM, Michael Armbrust mich...@databricks.com wrote: You can create an RDD[String] using whatever method and pass that to jsonRDD. On Wed, Dec 17, 2014 at 8:33 AM, Jerry Lam chiling

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
To be a little more clear, jsonRDD and jsonFile use the same implementation underneath. jsonFile is just a convenience method that does jsonRDD(sc.textFile(...)) On Wed, Dec 17, 2014 at 11:37 AM, Michael Armbrust mich...@databricks.com wrote: The first pass is inferring the schema of the JSON
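
In other words, for splittable input the two forms below end up doing the same work (the path is a placeholder):

    val a = sqlContext.jsonFile("/data/events.json")
    // equivalent: build the RDD[String] yourself and hand it to jsonRDD
    val b = sqlContext.jsonRDD(sc.textFile("/data/events.json"))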

Re: JSON Input files

2014-12-15 Thread Michael Armbrust
Underneath the covers, jsonFile uses TextInputFormat, which will split files correctly based on new lines. Thus, there is no fixed maximum size for a json object (other than the fact that it must fit into memory on the executors). On Mon, Dec 15, 2014 at 7:22 AM, Madabhattula Rajesh Kumar

Re: Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI

2014-12-15 Thread Michael Armbrust
Can you add this information to the JIRA? On Mon, Dec 15, 2014 at 10:54 AM, shenghua wansheng...@gmail.com wrote: Hello, I met a problem when using Spark sql CLI. A custom UDTF with lateral view throws ClassNotFound exception. I did a couple of experiments in same environment (spark version

Re: Intermittent test failures

2014-12-15 Thread Michael Armbrust
Is it possible that you are starting more than one SparkContext in a single JVM without stopping previous ones? I'd try testing with Spark 1.2, which will throw an exception in this case. On Mon, Dec 15, 2014 at 8:48 AM, Marius Soutier mps@gmail.com wrote: Hi, I’m seeing strange, random

Re: Spark-SQL JDBC driver

2014-12-14 Thread Michael Armbrust
I'll add that there is an experimental method that allows you to start the JDBC server with an existing HiveContext (which might have registered temporary tables).

Re: SchemaRDD partition on specific column values?

2014-12-14 Thread Michael Armbrust
I'm happy to discuss what it would take to make sure we can propagate this information correctly. Please open a JIRA (and mention me in it). Regarding including it in 1.2.1, it depends on how invasive the change ends up being, but it is certainly possible. On Thu, Dec 11, 2014 at 3:55 AM, nitin

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Michael Armbrust
BTW, I cannot use SparkSQL / case right now because my table has 200 columns (and I'm on Scala 2.10.3) You can still apply the schema programmatically: http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
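
A sketch of that programmatic route for a wide table, sidestepping the 22-field case-class limit on Scala 2.10; the column names, delimiter and path are placeholders:

    import org.apache.spark.sql._

    val schema = StructType((1 to 200).map(i => StructField(s"col$i", StringType, nullable = true)))
    val rowRdd = sc.textFile("/data/wide.tsv")
      .map(_.split("\t"))
      .map(parts => Row(parts: _*))
    val wide = sqlContext.applySchema(rowRdd, schema)
    wide.registerTempTable("wide_table")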

Re: Trouble with cache() and parquet

2014-12-14 Thread Michael Armbrust
a little better why the cache call trips this scenario On Wed, Dec 10, 2014 at 3:50 PM, Michael Armbrust mich...@databricks.com wrote: Have you checked to make sure the schema in the metastore matches the schema in the parquet file? One way to test would be to just use

Re: PhysicalRDD problem?

2014-12-10 Thread Michael Armbrust
can take this up as a task for myself if you want (since this is very crucial for our release). Thanks -Nitin On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust mich...@databricks.com wrote: val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD, existingSchemaRDD.schema

Re: Spark 1.1.1 SQLContext.jsonFile dumps trace if JSON has newlines ...

2014-12-10 Thread Michael Armbrust
Yep, because sc.textFile will only guarantee that lines will be preserved across splits, this is the semantic. It would be possible to write a custom input format, but that hasn't been done yet. From the documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

Re: Trouble with cache() and parquet

2014-12-10 Thread Michael Armbrust
Have you checked to make sure the schema in the metastore matches the schema in the parquet file? One way to test would be to just use sqlContext.parquetFile(...) which infers the schema from the file instead of using the metastore. On Wed, Dec 10, 2014 at 12:46 PM, Yana Kadiyska

Re: spark sql - save to Parquet file - Unsupported datatype TimestampType

2014-12-09 Thread Michael Armbrust
Not yet unfortunately. You could cast the timestamp to a long if you don't need nanosecond precision. Here is a related JIRA: https://issues.apache.org/jira/browse/SPARK-4768 On Mon, Dec 8, 2014 at 11:27 PM, ZHENG, Xu-dong dong...@gmail.com wrote: I meet the same issue. Any solution? On

Re: PhysicalRDD problem?

2014-12-09 Thread Michael Armbrust
val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD, existingSchemaRDD.schema) This line is throwing away the logical information about existingSchemaRDD and thus Spark SQL can't know how to push down projections or predicates past this operator. Can you describe more the problems

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-09 Thread Michael Armbrust
is to use a subquery to add a bunch of column alias. I'll try it later. Thanks, Jianshi On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust mich...@databricks.com wrote: This is by hive's design. From the Hive documentation: The column change command will only modify Hive's metadata

Re: Can HiveContext be used without using Hive?

2014-12-09 Thread Michael Armbrust
That is correct. The HiveContext will create an embedded metastore in the current directory if you have not configured Hive. On Tue, Dec 9, 2014 at 5:51 PM, Manoj Samel manojsamelt...@gmail.com wrote: From 1.1.1 documentation, it seems one can use HiveContext instead of SQLContext without
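
A minimal sketch, assuming there is no hive-site.xml anywhere on the classpath:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // with no configuration, an embedded Derby metastore and warehouse directory
    // are created under the current working directory
    hiveContext.sql("CREATE TABLE IF NOT EXISTS demo (key INT, value STRING)")
    hiveContext.sql("SHOW TABLES").collect().foreach(println)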

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Michael Armbrust
This is by hive's design. From the Hive documentation: The column change command will only modify Hive's metadata, and will not modify data. Users should make sure the actual data layout of the table/partition conforms with the metadata definition. On Sat, Dec 6, 2014 at 8:28 PM, Jianshi

Re: Is there a way to get column names using hiveContext ?

2014-12-08 Thread Michael Armbrust
You can call .schema on SchemaRDDs. For example: results.schema.fields.map(_.name) On Sun, Dec 7, 2014 at 11:36 PM, abhishek reachabhishe...@gmail.com wrote: Hi, I have iplRDD which is a json, and I do below steps and query through hivecontext. I get the results but without columns

Re: Including data nucleus tools

2014-12-06 Thread Michael Armbrust
On Sat, Dec 6, 2014 at 5:53 AM, spark.dubovsky.ja...@seznam.cz wrote: Bonus question: Should the class org.datanucleus.api.jdo.JDOPersistenceManagerFactory be part of assembly? Because it is not in jar now. No these jars cannot be put into the assembly because they have extra metadata files

Re: run JavaAPISuite with mavem

2014-12-06 Thread Michael Armbrust
Not sure about Maven, but you can run that test with sbt: sbt/sbt "sql/test-only org.apache.spark.sql.api.java.JavaAPISuite" On Sat, Dec 6, 2014 at 9:59 PM, Ted Yu yuzhih...@gmail.com wrote: I tried to run tests for core but there were failures. e.g. : ^[[32mExternalAppendOnlyMapSuite:^[[0m

Re: drop table if exists throws exception

2014-12-05 Thread Michael Armbrust
The command runs fine for me on master. Note that Hive does print an exception in the logs, but that exception does not propagate to user code. On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got exception saying Hive: NoSuchObjectException(message:table

Re: SchemaRDD partition on specific column values?

2014-12-05 Thread Michael Armbrust
It does not appear that the in-memory caching currently preserves the information about the partitioning of the data so this optimization will probably not work. On Thu, Dec 4, 2014 at 8:42 PM, nitin nitin2go...@gmail.com wrote: With some quick googling, I learnt that I can we can provide

Re: scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-05 Thread Michael Armbrust
All values in Hive are always nullable, though you should still not be seeing this error. It should be addressed by this patch: https://github.com/apache/spark/pull/3150 On Fri, Dec 5, 2014 at 2:36 AM, Hao Ren inv...@gmail.com wrote: Hi, I am using SparkSQL on 1.1.0 branch. The following

Re: Spark SQL with a sorted file

2014-12-04 Thread Michael Armbrust
I'll add that some of our data formats will actually infer this sort of useful information automatically. Both Parquet and cached in-memory tables keep statistics on the min/max value for each column. When you have predicates over these sorted columns, partitions will be eliminated if they can't

Re: How to make symbol for one column in Spark SQL.

2014-12-04 Thread Michael Armbrust
You need to import sqlContext._ On Thu, Dec 4, 2014 at 2:26 PM, Tim Chou timchou@gmail.com wrote: I have tried to use function where and filter in SchemaRDD. I have build class for tuple/record in the table like this: case class Region(num:Int, str1:String, str2:String) I also
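
A sketch of why the import matters for the language-integrated DSL; the Region class and CSV shape follow the quoted question, the rest is illustrative:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._   // brings in createSchemaRDD and the Symbol -> column conversions

    case class Region(num: Int, str1: String, str2: String)
    val regions = sc.textFile("/data/regions.csv")
      .map(_.split(","))
      .map(p => Region(p(0).toInt, p(1), p(2)))

    // 'num and 'str1 only resolve as columns because of the import above
    regions.where('num > 5).select('str1).collect()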

Re: [SQL] Wildcards in SQLContext.parquetFile?

2014-12-03 Thread Michael Armbrust
It won't work until this is merged: https://github.com/apache/spark/pull/3407 On Wed, Dec 3, 2014 at 9:25 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I'm wondering if someone has successfully used wildcards with a parquetFile call? I saw this thread and it makes me think no?

Re: Standard SQL tool access to SchemaRDD

2014-12-02 Thread Michael Armbrust
There is an experimental method that allows you to start the JDBC server with an existing HiveContext (which might have registered temporary tables). https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42
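
The rough shape of that experimental call; the JSON path and temp table name are stand-ins for whatever you have registered:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    // temp tables registered on this context become visible to JDBC clients
    hiveContext.jsonFile("/data/events.json").registerTempTable("my_temp_table")
    HiveThriftServer2.startWithContext(hiveContext)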

Re: Unresolved attributes

2014-12-02 Thread Michael Armbrust
A little bit about how to read this output. Resolution occurs from the bottom up and when you see a tick (') it means that a field is unresolved. So here it looks like Year_2011_Month_0_Week_0_Site is missing from your RDD. (We are working on more obvious error messages). Michael On Tue,

Re: Creating a SchemaRDD from an existing API

2014-12-01 Thread Michael Armbrust
, Nov 29, 2014 at 12:57 AM, Michael Armbrust mich...@databricks.com wrote: You probably don't need to create a new kind of SchemaRDD. Instead I'd suggest taking a look at the data sources API that we are adding in Spark 1.2. There is not a ton of documentation, but the test cases show how

Re: Creating a SchemaRDD from an existing API

2014-11-28 Thread Michael Armbrust
You probably don't need to create a new kind of SchemaRDD. Instead I'd suggest taking a look at the data sources API that we are adding in Spark 1.2. There is not a ton of documentation, but the test cases show how to implement the various interfaces

Re: SchemaRDD compute function

2014-11-26 Thread Michael Armbrust
Exactly how the query is executed actually depends on a couple of factors as we do a bunch of optimizations based on the top physical operator and the final RDD operation that is performed. In general the compute function is only used when you are doing SQL followed by other RDD operations (map,

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Michael Armbrust
In the past I have worked around this problem by avoiding sc.textFile(). Instead I read the data directly inside of a Spark job. Basically, you start with an RDD where each entry is a file in S3 and then flatMap that with something that reads the files and returns the lines. Here's an example:
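
A sketch of that pattern, assuming the S3 paths are already in a local Seq[String] and the executors can reach S3 through the Hadoop configuration:

    // filePaths: Seq[String] of s3n:// object paths, gathered beforehand (assumed)
    val files = sc.parallelize(filePaths, filePaths.size)
    val lines = files.flatMap { pathStr =>
      val path = new org.apache.hadoop.fs.Path(pathStr)
      val fs = path.getFileSystem(new org.apache.hadoop.conf.Configuration())
      val in = fs.open(path)
      try scala.io.Source.fromInputStream(in).getLines().toList
      finally in.close()
    }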

Re: can't get smallint field from hive on spark

2014-11-26 Thread Michael Armbrust
This has been fixed in Spark 1.1.1 and Spark 1.2 https://issues.apache.org/jira/browse/SPARK-3704 On Wed, Nov 26, 2014 at 7:10 PM, 诺铁 noty...@gmail.com wrote: hi, don't know whether this question should be asked here, if not, please point me out, thanks. we are currently using hive on

Re: Remapping columns from a schemaRDD

2014-11-25 Thread Michael Armbrust
Probably the easiest/closest way to do this would be with a UDF, something like: registerFunction("makeString", (s: Seq[String]) => s.mkString(",")) sql("SELECT *, makeString(c8) AS newC8 FROM jRequests") Note that this does not modify a column, but instead appends a new column. Another more
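
Put together as a runnable fragment; the jRequests table and c8 column follow the quoted question, everything else is illustrative:

    // a UDF that joins an array-of-strings column into a single comma-separated string
    sqlContext.registerFunction("makeString", (s: Seq[String]) => s.mkString(","))
    jRequests.registerTempTable("jRequests")
    // appends newC8 rather than modifying c8 in place
    val withJoined = sqlContext.sql("SELECT *, makeString(c8) AS newC8 FROM jRequests")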

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread Michael Armbrust
repartition and coalesce should both allow you to achieve what you describe. Can you maybe share the code that is not working? On Mon, Nov 24, 2014 at 8:24 PM, tridib tridib.sama...@live.com wrote: Hello, I am reading around 1000 input files from disk in an RDD and generating parquet. It

Re: Merging Parquet Files

2014-11-25 Thread Michael Armbrust
:30 PM, Michael Armbrust mich...@databricks.com wrote: Parquet does a lot of serial metadata operations on the driver which makes it really slow when you have a very large number of files (especially if you are reading from something like S3). This is something we are aware of and that I'd

Re: Remapping columns from a schemaRDD

2014-11-25 Thread Michael Armbrust
? Thanks again! Daniel On 25 Nov 2014, at 19:43, Michael Armbrust mich...@databricks.com wrote: Probably the easiest/closest way to do this would be with a UDF, something like: registerFunction("makeString", (s: Seq[String]) => s.mkString(",")) sql("SELECT *, makeString(c8) AS newC8 FROM

Re: Spark sql UDF for array aggergation

2014-11-25 Thread Michael Armbrust
We don't support native UDAs at the moment in Spark SQL. You can write a UDA using Hive's API and use that within Spark SQL. On Tue, Nov 25, 2014 at 10:10 AM, Barua, Seemanto seemanto.ba...@jpmchase.com.invalid wrote: Hi, I am looking for some resources/tutorials that will help me achieve

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread Michael Armbrust
RDDs are immutable, so calling coalesce doesn't actually change the RDD but instead returns a new RDD that has fewer partitions. You need to save that to a variable and call saveAsParquetFile on the new RDD. On Tue, Nov 25, 2014 at 10:07 AM, tridib tridib.sama...@live.com wrote: public
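
Concretely, the fix looks like the sketch below; the output path and partition count are arbitrary:

    // coalesce does not mutate the original RDD; capture the result and save that
    val fewerPartitions = schemaRDD.coalesce(10)
    fewerPartitions.saveAsParquetFile("/output/events.parquet")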

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread Michael Armbrust
I believe coalesce(..., true) and repartition are the same. If the input files are of similar sizes, then coalesce will be cheaper as it introduces a narrow dependency https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf, meaning there won't be a shuffle. However, if there

Re: How does Spark SQL traverse the physical tree?

2014-11-24 Thread Michael Armbrust
You are pretty close. The QueryExecution is what drives the phases from parsing to execution. Once we have a final SparkPlan (the physical plan), toRdd just calls execute() which recursively calls execute() on children until we hit a leaf operator. This gives us an RDD[Row] that will compute

Re: advantages of SparkSQL?

2014-11-24 Thread Michael Armbrust
Akshat is correct about the benefits of parquet as a columnar format, but I'll add that some of this is lost if you just use a lambda function to process the data. Since your lambda function is a black box Spark SQL does not know which columns it is going to use and thus will do a full table scan.

Re: How can I read this avro file using spark scala?

2014-11-24 Thread Michael Armbrust
I have to try to battle-test spark-avro. Thanks On Thu, Nov 20, 2014 at 6:30 PM, Michael Armbrust mich...@databricks.com wrote: One option (starting with Spark 1.2, which is currently in preview) is to use the Avro library for Spark SQL. This is very new, but we would love to get feedback

Re: How to insert complex types like map<string,map<string,int>> in spark sql

2014-11-24 Thread Michael Armbrust
Can you give the full stack trace. You might be hitting: https://issues.apache.org/jira/browse/SPARK-4293 On Sun, Nov 23, 2014 at 3:00 PM, critikaled isasmani@gmail.com wrote: Hi, I am trying to insert particular set of data from rdd to a hive table I have Map[String,Map[String,Int]] in

Re: Merging Parquet Files

2014-11-24 Thread Michael Armbrust
Parquet does a lot of serial metadata operations on the driver which makes it really slow when you have a very large number of files (especially if you are reading from something like S3). This is something we are aware of and that I'd really like to improve in 1.3. You might try the (brand new

Re: How can I read this avro file using spark scala?

2014-11-20 Thread Michael Armbrust
One option (starting with Spark 1.2, which is currently in preview) is to use the Avro library for Spark SQL. This is very new, but we would love to get feedback: https://github.com/databricks/spark-avro On Thu, Nov 20, 2014 at 10:19 AM, al b beanb...@googlemail.com wrote: I've read several

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Michael Armbrust
Which version are you running on again? On Thu, Nov 20, 2014 at 8:17 AM, Sadhan Sood sadhan.s...@gmail.com wrote: Also attaching the parquet file if anyone wants to take a further look. On Thu, Nov 20, 2014 at 8:54 AM, Sadhan Sood sadhan.s...@gmail.com wrote: So, I am seeing this issue

Re: Spark SQL Exception handling

2014-11-20 Thread Michael Armbrust
If you run master or the 1.2 preview release then it should automatically skip lines that fail to parse. The corrupted text will be in the column _corrupted_record and the other columns will be null. On Thu, Nov 20, 2014 at 7:34 AM, Daniel Haviv danielru...@gmail.com wrote: Hi Guys, I really

Re: NEW to spark and sparksql

2014-11-20 Thread Michael Armbrust
am wrong. Thanks On Wed, Nov 19, 2014 at 5:53 PM, Michael Armbrust mich...@databricks.com wrote: I would use just textFile unless you actually need a guarantee that you will be seeing a whole file at time (textFile splits on new lines). RDDs are immutable, so you cannot add data to them

Re: Best way to store RDD data?

2014-11-20 Thread Michael Armbrust
With SchemaRDD you can save out case classes to parquet (or JSON in Spark 1.2) automatically, and when you read it back in the structure will be preserved. However, you won't get case classes when it's loaded back; instead you'll get rows that you can query. There is some experimental support for
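
A small end-to-end sketch of that round trip; the Record class and paths are made up:

    case class Record(id: Long, name: String)

    import sqlContext.createSchemaRDD   // implicit RDD[Record] -> SchemaRDD conversion
    val records = sc.parallelize(Seq(Record(1L, "a"), Record(2L, "b")))
    records.saveAsParquetFile("/tmp/records.parquet")

    // reading it back preserves the schema, but yields Rows rather than Record instances
    val loaded = sqlContext.parquetFile("/tmp/records.parquet")
    loaded.registerTempTable("records")
    sqlContext.sql("SELECT name FROM records WHERE id = 1").collect()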

Re: Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Michael Armbrust
Looks like IntelliJ might be trying to load the wrong version of Spark? On Thu, Nov 20, 2014 at 4:35 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote: hey guys I am at AmpCamp 2014 at UCB right now :-) Funny Issue... This code works in Spark-Shell but throws a funny

Re: [SQL] HiveThriftServer2 failure detection

2014-11-19 Thread Michael Armbrust
This is not by design. Can you please file a JIRA? On Wed, Nov 19, 2014 at 9:19 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi all, I am running HiveThriftServer2 and noticed that the process stays up even though there is no driver connected to the Spark master. I started the server

Re: SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-19 Thread Michael Armbrust
On Tue, Nov 18, 2014 at 10:34 PM, Night Wolf nightwolf...@gmail.com wrote: Is there a better way to mock this out and test Hive/metastore with SparkSQL? I would use TestHive which creates a fresh metastore each time it is invoked.

Re: Merging Parquet Files

2014-11-19 Thread Michael Armbrust
On Wed, Nov 19, 2014 at 12:41 AM, Daniel Haviv danielru...@gmail.com wrote: Another problem I have is that I get a lot of small json files and as a result a lot of small parquet files, I'd like to merge the json files into a few parquet files.. how I do that? You can use `coalesce` on any

Re: How to apply schema to queried data from Hive before saving it as parquet file?

2014-11-19 Thread Michael Armbrust
I am not very familiar with the JSONSerDe for Hive, but in general you should not need to manually create a schema for data that is loaded from hive. You should just be able to call saveAsParquetFile on any SchemaRDD that is returned from hctx.sql(...). I'd also suggest you check out the

Re: tableau spark sql cassandra

2014-11-19 Thread Michael Armbrust
The whole stack trace/exception would be helpful. Hive is an optional dependency of Spark SQL, but you will need to include it if you are planning to use the thrift server to connect to Tableau. You can enable it by adding -Phive when you build Spark. You might also try asking on the cassandra

Re: Shuffle Intensive Job: sendMessageReliably failed because ack was not received within 60 sec

2014-11-19 Thread Michael Armbrust
That error can mean a whole bunch of things (and we've been working recently to make it more descriptive). Often the actual cause is in the executor logs. On Wed, Nov 19, 2014 at 10:50 AM, Gary Malouf malouf.g...@gmail.com wrote: Has anyone else received this type of error? We are not sure

Re: Converting a json struct to map

2014-11-19 Thread Michael Armbrust
You can override the schema inference by passing a schema as the second argument to jsonRDD, however that's not a super elegant solution. We are considering one option to make this easier here: https://issues.apache.org/jira/browse/SPARK-4476 On Tue, Nov 18, 2014 at 11:06 PM, Akhil Das
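
A sketch of that override; the field names and the decision to read tags as a map are assumptions about the JSON shape in the thread:

    import org.apache.spark.sql._

    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      // force a map type where schema inference would otherwise produce a struct per key
      StructField("tags", MapType(StringType, StringType), nullable = true)))
    val parsed = sqlContext.jsonRDD(jsonLines, schema)   // jsonLines: an RDD[String] of JSON objects (assumed)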
