Re: Mapping directory structure to columns in SparkSQL

2015-01-09 Thread Michael Davies
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mapping-directory-structure-to-columns-in-SparkSQL-tp20880.html Sent from the Apache Spark User

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-09 Thread Cheng Lian
Hey Nathan, Thanks for sharing, this is a very interesting post :) My comments are inlined below. Cheng On 1/7/15 11:53 AM, Nathan McCarthy wrote: Hi, I’m trying to use a combination of SparkSQL and ‘normal' Spark/Scala via rdd.mapPartitions(…). Using the latest release 1.2.0. Simple

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-08 Thread Nathan McCarthy
Any ideas? :) From: Nathan nathan.mccar...@quantium.com.au Date: Wednesday, 7 January 2015 2:53 pm To: user@spark.apache.org Subject: SparkSQL schemaRDD MapPartitions calls

SparkSQL

2015-01-08 Thread Abhi Basu
I am working with CDH5.2 (Spark 1.0.0) and wondering which version of Spark comes with SparkSQL by default. Also, will SparkSQL come enabled to access the Hive Metastore? Is there an easier way to enable Hive support without having to build the code with various switches? Thanks, Abhi -- Abhi

Re: SparkSQL

2015-01-08 Thread Marcelo Vanzin
Disclaimer: this seems more of a CDH question, I'd suggest sending these to the CDH mailing list in the future. CDH 5.2 actually has Spark 1.1. It comes with SparkSQL built-in, but it does not include the thrift server because of incompatibilities with the CDH version of Hive. To use Hive support

Re: Implement customized Join for SparkSQL

2015-01-08 Thread Rishi Yadav
Hi Kevin, Say A has 10 ids, so you are pulling data from B's data source only for these 10 ids? What if you load A and B as separate schemaRDDs and then do the join? Spark will optimize the path anyway when the action is fired. On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin yun...@ebay.com wrote: Hi,
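A minimal sketch of the two-schemaRDD approach suggested here (paths and table names are illustrative, and B is assumed to already be registered through its data source wrapper):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("join-sketch"))
    val sqlContext = new SQLContext(sc)

    // A comes from a file; register it as a temporary table.
    val a = sqlContext.parquetFile("/data/A")   // illustrative path/format
    a.registerTempTable("A")

    // B is assumed to be registered as table "B" by its data source wrapper.
    // Let Catalyst plan the join rather than pushing A's ids into B's source by hand.
    val joined = sqlContext.sql("SELECT * FROM A JOIN B ON A.id = B.id")
    joined.collect().foreach(println)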

Re: SparkSQL support for reading Avro files

2015-01-08 Thread Cheng Lian
This package is moved here: https://github.com/databricks/spark-avro On 1/6/15 5:12 AM, yanenli2 wrote: Hi All, I want to use the SparkSQL to manipulate the data with Avro format. I found a solution at https://github.com/marmbrus/sql-avro . However it doesn't compile successfully anymore

Re: Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-08 Thread Cheng Lian
to concatenate column values in the query like col1+'$'+col3. For some reason, this issue is not manifesting itself when I do a single IF query. Is there a concat function in SparkSQL? I can't find anything in the documentation. Thanks, RK On Sunday, January 4, 2015 7:42 PM, RK prk

Re: SparkSQL support for reading Avro files

2015-01-08 Thread yanenli2
thanks for the reply! Now I know that this package is moved here: https://github.com/databricks/spark-avro -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-support-for-reading-Avro-files-tp20981p21040.html Sent from the Apache Spark User List

SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-06 Thread Nathan McCarthy
Hi, I’m trying to use a combination of SparkSQL and ‘normal' Spark/Scala via rdd.mapPartitions(…). Using the latest release 1.2.0. Simple example; load up some sample data from parquet on HDFS (about 380m rows, 10 columns) on a 7 node cluster. val t = sqlC.parquetFile(/user/n/sales
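The pattern being described — load Parquet into a SchemaRDD, register it for SQL, then drop down to mapPartitions for the Scala part — looks roughly like this (a sketch; the sales schema and column types are assumptions):

    import org.apache.spark.sql.SQLContext

    val sqlC = new SQLContext(sc)

    // Load the Parquet data; a SchemaRDD is also an RDD[Row].
    val t = sqlC.parquetFile("/user/n/sales")
    t.registerTempTable("sales")

    // Aggregate with SQL first, then do "normal" Spark work per partition.
    val bySku = sqlC.sql("SELECT sku, SUM(qty) AS total_qty FROM sales GROUP BY sku")
    val enriched = bySku.mapPartitions { rows =>
      // Arbitrary per-partition Scala logic; Row fields accessed positionally here.
      rows.map(r => (r.getString(0), r.getLong(1) * 2))
    }
    enriched.take(10).foreach(println)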

Implement customized Join for SparkSQL

2015-01-05 Thread Dai, Kevin
Hi, All Suppose I want to join two tables A and B as follows: Select * from A join B on A.id = B.id A is a file while B is a database which is indexed by id, and I wrapped it with the Data Source API. The desired join flow is: 1. Generate A's RDD[Row] 2. Generate B's RDD[Row] from A by

RE: Implement customized Join for SparkSQL

2015-01-05 Thread Cheng, Hao
Can you paste the error log? From: Dai, Kevin [mailto:yun...@ebay.com] Sent: Monday, January 5, 2015 6:29 PM To: user@spark.apache.org Subject: Implement customized Join for SparkSQL Hi, All Suppose I want to join two tables A and B as follows: Select * from A join B on A.id = B.id

Re: SparkSQL support for reading Avro files

2015-01-05 Thread Michael Armbrust
: Hi All, I want to use the SparkSQL to manipulate the data with Avro format. I found a solution at https://github.com/marmbrus/sql-avro . However it doesn't compile successfully anymore with the latest code of Spark version 1.2.0 or 1.2.1. I then try to pull a copy from github stated

SparkSQL support for reading Avro files

2015-01-05 Thread yanenli2
Hi All, I want to use the SparkSQL to manipulate the data with Avro format. I found a solution at https://github.com/marmbrus/sql-avro . However it doesn't compile successfully anymore with the latest code of Spark version 1.2.0 or 1.2.1. I then try to pull a copy from github stated

Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-04 Thread RK
When I use a single IF statement like select IF(col1 != "", col1+'$'+col3, col2+'$'+col3) from my_table, it works fine. However, when I use a nested IF like select IF(col1 != "", col1+'$'+col3, IF(col2 != "", col2+'$'+col3, '$')) from my_table, I am getting the following exception. Exception in

Re: Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-04 Thread RK
BTW, I am seeing this issue in Spark 1.1.1. On Sunday, January 4, 2015 7:29 PM, RK prk...@yahoo.com.INVALID wrote: When I use a single IF statement like select IF(col1 != "", col1+'$'+col3, col2+'$'+col3) from my_table, it works fine. However, when I use a nested IF like select IF(col1

Re: Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-04 Thread RK
The issue is happening when I try to concatenate column values in the query like col1+'$'+col3. For some reason, this issue is not manifesting itself when I do a single IF query. Is there a concat function in SparkSQL? I can't find anything in the documentation. Thanks, RK On Sunday
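Hive's concat() UDF is available when the query goes through a HiveContext, which sidesteps the '+' string-concatenation problem; a hedged sketch of the rewritten query (column and table names as quoted in the thread):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // concat() is a Hive UDF, so this relies on HiveContext rather than the plain SQLContext.
    val result = hiveContext.sql(
      "SELECT IF(col1 != '', concat(col1, '$', col3), " +
      "          IF(col2 != '', concat(col2, '$', col3), '$')) " +
      "FROM my_table")
    result.collect().foreach(println)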

Re: SparkSQL 1.2.0 sources API error

2015-01-02 Thread Cheng Lian
Most of the time a NoSuchMethodError means wrong classpath settings, and some jar file is overridden by a wrong version. In your case it could be netty. On 1/3/15 1:36 PM, Niranda Perera wrote: Hi all, I am evaluating the spark sources API released with Spark 1.2.0. But I'm getting a

SparkSQL 1.2.0 sources API error

2015-01-02 Thread Niranda Perera
Hi all, I am evaluating the spark sources API released with Spark 1.2.0. But I'm getting a java.lang.NoSuchMethodError: org.jboss.netty.channel.socket.nio.NioWorkerPool.init(Ljava/util/concurrent/Executor;I)V error running the program. Error log: 15/01/03 10:41:30 ERROR ActorSystemImpl: Uncaught

Re: Mapping directory structure to columns in SparkSQL

2014-12-30 Thread Michael Davies
files. Does anyone know if something like this is supported, or whether this is a reasonable thing to request? Mick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mapping-directory-structure-to-columns-in-SparkSQL-tp20880.html

Mapping directory structure to columns in SparkSQL

2014-12-29 Thread Mickalas
.nabble.com/Mapping-directory-structure-to-columns-in-SparkSQL-tp20880.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands

Re: Mapping directory structure to columns in SparkSQL

2014-12-29 Thread Michael Armbrust
this is a reasonable thing to request? Mick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mapping-directory-structure-to-columns-in-SparkSQL-tp20880.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-24 Thread Cheng Lian
...@gmail.com] *Sent:* Wednesday, December 24, 2014 4:26 AM *To:* user@spark.apache.org *Subject:* SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD Hi spark users, I'm trying to create external table using HiveContext after creating a schemaRDD and saving the RDD into a parquet file on hdfs. I would

Re: SparkSQL Array type support - Unregonized Thrift TTypeId value: ARRAY_TYPE

2014-12-23 Thread David Allan
Doh...figured it out. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Array-type-support-Unregonized-Thrift-TTypeId-value-ARRAY-TYPE-tp20817p20832.html Sent from the Apache Spark User List mailing list archive at Nabble.com

SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Jerry Lam
Hi spark users, I'm trying to create external table using HiveContext after creating a schemaRDD and saving the RDD into a parquet file on hdfs. I would like to use the schema in the schemaRDD (rdd_table) when I create the external table. For example:
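One common way to do this — save the SchemaRDD as Parquet, then point an external table at that location — is sketched below; the column list, paths, and the STORED AS PARQUET clause are assumptions that depend on the Hive version bundled with Spark:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Write the SchemaRDD out as Parquet on HDFS.
    val rdd_table = hiveContext.sql("SELECT * FROM some_source")   // illustrative
    rdd_table.saveAsParquetFile("hdfs:///tmp/rdd_table_parquet")

    // Declare an external table over the saved files; the columns must mirror rdd_table's schema.
    hiveContext.sql(
      "CREATE EXTERNAL TABLE rdd_table_ext (field1 INT, field2 STRING) " +
      "STORED AS PARQUET LOCATION 'hdfs:///tmp/rdd_table_parquet'")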

RE: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Cheng, Hao
@spark.apache.org Subject: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD Hi spark users, I'm trying to create external table using HiveContext after creating a schemaRDD and saving the RDD into a parquet file on hdfs. I would like to use the schema in the schemaRDD (rdd_table) when I create

Re: SparkSQL 1.2.1-snapshot Left Join problem

2014-12-21 Thread Cheng Lian
Could you please file a JIRA together with the Git commit you're using? Thanks! On 12/18/14 2:32 AM, Hao Ren wrote: Hi, When running SparkSQL branch 1.2.1 on EC2 standalone cluster, the following query does not work: create table debug as select v1.* from t1 as v1 left join t2 as v2 on v1

SparkSQL 1.2.1-snapshot Left Join problem

2014-12-17 Thread Hao Ren
Hi, When running SparkSQL branch 1.2.1 on EC2 standalone cluster, the following query does not work: create table debug as select v1.* from t1 as v1 left join t2 as v2 on v1.sku = v2.sku where v2.sku is null Both t1 and t2 have 200 partitions. t1 has 10k rows, and t2 has 4k rows. this query

scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-05 Thread Hao Ren
Hi, I am using SparkSQL on 1.1.0 branch. The following code leads to a scala.MatchError at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247) val scm = StructType(inputRDD.schema.fields.init :+ StructField(list, ArrayType( StructType

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-05 Thread sahanbull
It worked man.. Thanks a lot :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-tp20228p20461.html Sent from the Apache Spark User List mailing list archive

Re: scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-05 Thread Michael Armbrust
All values in Hive are always nullable, though you should still not be seeing this error. It should be addressed by this patch: https://github.com/apache/spark/pull/3150 On Fri, Dec 5, 2014 at 2:36 AM, Hao Ren inv...@gmail.com wrote: Hi, I am using SparkSQL on 1.1.0 branch. The following
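Since all Hive values are nullable, the nested schema has to be built with nullable fields; a sketch of what that looks like with the public type aliases as they were exposed in Spark 1.1/1.2 (field names are illustrative; in later releases the types moved to org.apache.spark.sql.types):

    import org.apache.spark.sql._   // StructType, StructField, ArrayType, IntegerType, ...

    // Every field, and the array's elements, are declared nullable to match Hive's behaviour.
    val elementType = StructType(Seq(
      StructField("a", IntegerType, nullable = true),
      StructField("c", IntegerType, nullable = true)))

    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("list", ArrayType(elementType, containsNull = true), nullable = true)))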

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-04 Thread sahanbull
SahanB -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-tp20228p20364.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-04 Thread Davies Liu
what to do about this. Hope you can help :) Many thanks SahanB -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-tp20228p20364.html Sent from the Apache Spark User

Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-03 Thread sahanbull
Hi Guys, I am trying to use SparkSQL to convert an RDD to SchemaRDD so that I can save it in parquet format. A record in my RDD has the following format: RDD1 { field1:5, field2: 'string', field3: {'a':1, 'c':2} } I am using field3 to represent a sparse vector and it can have keys

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-03 Thread Davies Liu
, I am trying to use SparkSQL to convert an RDD to SchemaRDD so that I can save it in parquet format. A record in my RDD has the following format: RDD1 { field1:5, field2: 'string', field3: {'a':1, 'c':2} } I am using field3 to represent a sparse vector and it can have keys

Using SparkSQL to query Hbase entity takes very long time

2014-12-02 Thread bonnahu
Hi all, I am new to Spark and currently I am trying to run a SparkSQL query on HBase entity. For an entity with about 4000 rows, it will take about 12 seconds. Is it expected? Is there any way to shorten the query process? Here is the code snippet: SparkConf sparkConf = new SparkConf

Re: advantages of SparkSQL?

2014-11-25 Thread mrm
Thank you for answering, this is all very helpful! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/advantages-of-SparkSQL-tp19661p19753.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Classpath issue: Custom authentication with sparkSQL/Spark 1.2

2014-11-25 Thread arin.g
Hi, I am trying to launch a spark 1.2 cluster with SparkSQL and custom authentication. After launching the cluster using the ec2 scripts, I copied the following hive-site.xml file into the spark/conf dir: <configuration> <property> <name>hive.server2.authentication</name> <value>CUSTOM</value> </property>

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
Hi, Looks like the latest SparkSQL with Hive 0.12 has a bug in Parquet support. I got the following exceptions: org.apache.hadoop.hive.ql.parse.SemanticException: Output Format must implement HiveOutputFormat, otherwise it should be either IgnoreKeyTextOutputFormat or SequenceFileOutputFormat

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
/usr/lib/hive/lib doesn’t show any of the parquet jars, but ls /usr/lib/impala/lib shows the jar we’re looking for as parquet-hive-1.0.jar Is it removed from latest Spark? Jianshi On Wed, Nov 26, 2014 at 2:13 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, Looks like the latest SparkSQL

advantages of SparkSQL?

2014-11-24 Thread mrm
Hi, Is there any advantage to storing data as a parquet format, loading it using the sparkSQL context, but never registering as a table/using sql on it? Something like: data = sqc.parquetFile(path) results = data.map(lambda x: applyfunc(x.field)) Is this faster/more optimised

Re: advantages of SparkSQL?

2014-11-24 Thread Akshat Aranya
criterion. Other than that, you would also get compression, and likely save processor cycles when parsing lines from text files. On Mon, Nov 24, 2014 at 8:20 AM, mrm ma...@skimlinks.com wrote: Hi, Is there any advantage to storing data as a parquet format, loading it using the sparkSQL context

Re: advantages of SparkSQL?

2014-11-24 Thread Michael Armbrust
AM, mrm ma...@skimlinks.com wrote: Hi, Is there any advantage to storing data as a parquet format, loading it using the sparkSQL context, but never registering as a table/using sql on it? Something like: data = sqc.parquetFile(path) results = data.map(lambda x: applyfunc

Re: advantages of SparkSQL?

2014-11-24 Thread Cheng Lian
mailto:ma...@skimlinks.com wrote: Hi, Is there any advantage to storing data as a parquet format, loading it using the sparkSQL context, but never registering as a table/using sql on it? Something like: data

RE: SparkSQL Timestamp query failure

2014-11-23 Thread Wang, Daoyuan
Hi, I think you can try cast(l.timestamp as string)='2012-10-08 16:10:36.0' Thanks, Daoyuan -Original Message- From: whitebread [mailto:ale.panebia...@me.com] Sent: Sunday, November 23, 2014 12:11 AM To: u...@spark.incubator.apache.org Subject: Re: SparkSQL Timestamp query failure
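Spelled out, the suggested comparison looks like this (table and column names are illustrative; note the later reply pointing out that "timestamp" itself is a data-type keyword, so a differently named column avoids that clash):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // Compare the timestamp column as a string, exactly as suggested above.
    val hits = sqlContext.sql(
      "SELECT * FROM logs l WHERE cast(l.event_time AS string) = '2012-10-08 16:10:36.0'")
    hits.collect().foreach(println)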

Re: SparkSQL Timestamp query failure

2014-11-23 Thread Alessandro Panebianco
: SparkSQL Timestamp query failure Thanks for your answer Akhil, I have already tried that and the query actually doesn't fail but it doesn't return anything either as it should. Using single quotes I think it reads it as a string and not as a timestamp. I don't know how to solve

RE: SparkSQL Timestamp query failure

2014-11-23 Thread Cheng, Hao
” is the keyword of data type in Hive/Spark SQL.) From: Alessandro Panebianco [mailto:ale.panebia...@me.com] Sent: Monday, November 24, 2014 11:12 AM To: Wang, Daoyuan Cc: u...@spark.incubator.apache.org Subject: Re: SparkSQL Timestamp query failure Hey Daoyuan, following your suggestion I obtain

Re: SparkSQL Timestamp query failure

2014-11-23 Thread whitebread
=19613i=1 Subject: Re: SparkSQL Timestamp query failure Hey Daoyuan, following your suggestion I obtain the same result as when I do: where l.timestamp = '2012-10-08 16:10:36.0’ what happens using either your suggestion or simply using single quotes as I just typed

Re: SparkSQL Timestamp query failure

2014-11-22 Thread Akhil Das
) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL

Re: SparkSQL Timestamp query failure

2014-11-22 Thread whitebread
, Alessandro -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p19554.html Sent from the Apache Spark User List mailing list archive at Nabble.com

SparkSQL Timestamp query failure

2014-11-21 Thread whitebread
) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: SparkSQL - can we add new column(s) to parquet files

2014-11-21 Thread Evan Chan
I would expect an SQL query on c would fail because c would not be known in the schema of the older Parquet file. What I'd be very interested in is how to add a new column as an incremental new parquet file, and be able to somehow join the existing and new file, in an efficient way. IE, somehow

SparkSQL exception handling

2014-11-20 Thread Daniel Haviv
Hi, I'm loading a bunch of json files and there seems to be problems with specific files (either schema changes or incomplete files). I'd like to catch the inconsistent files but I'm not sure how to do it. This is the exception I get: 14/11/20 00:13:49 INFO cluster.YarnClientClusterScheduler:

Re: SparkSQL exception handling

2014-11-20 Thread Daniel Haviv
Update: I tried surrounding the problematic code with try and catch but that does not do the trick: try { val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ val jsonFiles=sqlContext.jsonFile("/requests.loading") } catch { case _: Throwable => // Catching all exceptions and

Re: NEW to spark and sparksql

2014-11-20 Thread Sam Flint
as well as SparkSQL. My question is more on how to build out the RDD files and best practices. I have data that is broken down by hour into files on HDFS in avro format. Do I need to create a separate RDD for each file? or using SparkSQL a separate SchemaRDD? I want to be able to pull lets say

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Sadhan Sood
went wrong while scanning this LZO compressed Parquet file. But unfortunately the stack trace at hand doesn’t indicate the root cause. Cheng On 11/15/14 5:28 AM, Sadhan Sood wrote: While testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Michael Armbrust
wrote: While testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I encountered this error: import org.apache.spark.sql.SchemaRDD import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.conf.Configuration; import

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Sadhan Sood
wrong while scanning this LZO compressed Parquet file. But unfortunately the stack trace at hand doesn’t indicate the root cause. Cheng On 11/15/14 5:28 AM, Sadhan Sood wrote: While testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I

Re: NEW to spark and sparksql

2014-11-20 Thread Michael Armbrust
to spark. I have began to read to understand sparks RDD files as well as SparkSQL. My question is more on how to build out the RDD files and best practices. I have data that is broken down by hour into files on HDFS in avro format. Do I need to create a separate RDD for each file? or using

Re: SparkSQL exception on cached parquet table

2014-11-20 Thread Sadhan Sood
scanning this LZO compressed Parquet file. But unfortunately the stack trace at hand doesn’t indicate the root cause. Cheng On 11/15/14 5:28 AM, Sadhan Sood wrote: While testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I encountered

Re: SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-19 Thread Michael Armbrust
On Tue, Nov 18, 2014 at 10:34 PM, Night Wolf nightwolf...@gmail.com wrote: Is there a better way to mock this out and test Hive/metastore with SparkSQL? I would use TestHive which creates a fresh metastore each time it is invoked.
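A minimal sketch of what that looks like; TestHive ships in the spark-hive test artifact, so the dependency setup is an assumption here:

    import org.apache.spark.sql.hive.test.TestHive

    // TestHive is a ready-made HiveContext backed by a temporary local metastore and warehouse.
    TestHive.sql("CREATE TABLE IF NOT EXISTS kv (key INT, value STRING)")
    TestHive.sql("SELECT COUNT(*) FROM kv").collect().foreach(println)

    // Wipe the scratch metastore so the next test starts from a clean slate.
    TestHive.reset()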

Re: NEW to spark and sparksql

2014-11-19 Thread Michael Armbrust
find it here: https://github.com/databricks/spark-avro Bug reports welcome! Michael On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint sam.fl...@magnetic.com wrote: Hi, I am new to spark. I have began to read to understand sparks RDD files as well as SparkSQL. My question is more on how

Re: SparkSQL exception on spark.sql.codegen

2014-11-18 Thread Michael Armbrust
, Michael Armbrust mich...@databricks.com wrote: What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi all, We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, we got exceptions as below, has anyone else saw these before

Re: SparkSQL exception on spark.sql.codegen

2014-11-18 Thread Eric Zhen
wrote: Hi Michael, We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4. On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust mich...@databricks.com wrote: What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi all, We run SparkSQL

SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-18 Thread Night Wolf
Hi, Just to give some context. We are using Hive metastore with csv Parquet files as a part of our ETL pipeline. We query these with SparkSQL to do some down stream work. I'm curious what's the best way to go about testing Hive SparkSQL? I'm using 1.1.0 I see that the LocalHiveContext has been

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Michael Armbrust
What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi all, We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, we got exceptions as below, has anyone else saw these before? java.lang.ExceptionInInitializerError

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
Hi Michael, We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4. On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust mich...@databricks.com wrote: What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi all, We run SparkSQL on TPCDS

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
...@gmail.com wrote: Hi all, We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, we got exceptions as below, has anyone else saw these before? java.lang.ExceptionInInitializerError at org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92

Re: SparkSQL exception on cached parquet table

2014-11-16 Thread Cheng Lian
On 11/15/14 5:28 AM, Sadhan Sood wrote: While testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I encountered this error: import org.apache.spark.sql.SchemaRDD import org.apache.hadoop.fs.FileSystem

Re: SparkSQL exception on cached parquet table

2014-11-16 Thread Sadhan Sood
/15/14 5:28 AM, Sadhan Sood wrote: While testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I encountered this error: import org.apache.spark.sql.SchemaRDD import org.apache.hadoop.fs.FileSystem; import

Re: SparkSQL exception on cached parquet table

2014-11-15 Thread Cheng Lian
testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I encountered this error: import org.apache.spark.sql.SchemaRDD import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; val

Re: SparkSQL exception on cached parquet table

2014-11-15 Thread sadhan
Hi Cheng, Thanks for your response. Here is the stack trace from yarn logs: -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-exception-on-cached-parquet-table-tp18978p19020.html Sent from the Apache Spark User List mailing list archive

SparkSQL exception on spark.sql.codegen

2014-11-15 Thread Eric Zhen
Hi all, We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, we got exceptions as below, has anyone else saw these before? java.lang.ExceptionInInitializerError at org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92
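For context, the flag being benchmarked is usually flipped like this (a sketch; codegen is an experimental option in this generation of Spark SQL):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Either through the programmatic API ...
    sqlContext.setConf("spark.sql.codegen", "true")
    // ... or as a SQL statement (e.g. from the CLI / thrift server).
    sqlContext.sql("SET spark.sql.codegen=true")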

SparkSQL exception on cached parquet table

2014-11-14 Thread Sadhan Sood
While testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I encountered this error: import org.apache.spark.sql.SchemaRDD import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path

Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Sadhan Sood
Thanks Cheng, that was helpful. I noticed from UI that only half of the memory per executor was being used for caching, is that true? We have a 2 TB sequence file dataset that we wanted to cache in our cluster with ~ 5TB memory but caching still failed and what looked like from the UI was that it

Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Cheng Lian
Hm… Have you tuned |spark.storage.memoryFraction|? By default, 60% of memory is used for caching. You may refer to details from here http://spark.apache.org/docs/latest/configuration.html On 11/15/14 5:43 AM, Sadhan Sood wrote: Thanks Cheng, that was helpful. I noticed from UI that only half
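Tuning that fraction happens on the SparkConf before the context is created; a small sketch (0.7 is just an example value):

    import org.apache.spark.{SparkConf, SparkContext}

    // Give a larger share of the executor heap to the block manager used for caching.
    val conf = new SparkConf()
      .setAppName("cache-tuning")
      .set("spark.storage.memoryFraction", "0.7")   // default is 0.6
    val sc = new SparkContext(conf)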

Re: loading, querying schemaRDD using SparkSQL

2014-11-13 Thread vdiwakar.malladi
on this? Thanks in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052p18841.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Sadhan Sood
Thanks Cheng, Just one more question - does that mean that we still need enough memory in the cluster to uncompress the data before it can be compressed again or does that just read the raw data as is? On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian lian.cs@gmail.com wrote: Currently there’s

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
No, the columnar buffer is built in a small batching manner, the batch size is controlled by the |spark.sql.inMemoryColumnarStorage.batchSize| property. The default value for this in master and branch-1.2 is 10,000 rows per batch. On 11/14/14 1:27 AM, Sadhan Sood wrote: Thanks Cheng, Just
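Adjusting that batch size is a one-liner on the SQLContext; a sketch (the table name is illustrative and assumed to be registered already):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Smaller batches reduce the peak memory needed while building the columnar buffers.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "1000")
    sqlContext.sql("CACHE TABLE my_table")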

Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
We are running spark on yarn with combined memory 1TB and when trying to cache a table partition(which is 100G), seeing a lot of failed collect stages in the UI and this never succeeds. Because of the failed collect, it seems like the mapPartitions keep getting resubmitted. We have more than

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
This is the log output: 2014-11-12 19:07:16,561 INFO thriftserver.SparkExecuteStatementOperation (Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS SELECT * FROM xyz where date_prefix = 20141112' 2014-11-12 19:07:17,455 INFO Configuration.deprecation

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
On re-running the cache statement, from the logs I see that when collect(stage 1) fails it always leads to mapPartition(stage 0) for one partition to be re-run. This can be seen from the collect log as well as on the container log: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an

Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Sadhan Sood
We noticed while caching data from our hive tables which contain data in compressed sequence file format that it gets uncompressed in memory when getting cached. Is there a way to turn this off and cache the compressed data as is ?

Re: Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Cheng Lian
Currently there’s no way to cache the compressed sequence file directly. Spark SQL uses in-memory columnar format while caching table rows, so we must read all the raw data and convert them into columnar format. However, you can enable in-memory columnar compression by setting
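The property being referred to here should be the in-memory columnar compression flag; treat the exact name as an assumption and check it against the configuration docs for your release:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Compress the in-memory columnar buffers (codecs are picked per column from statistics).
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    sqlContext.sql("CACHE TABLE my_table")   // illustrative, already-registered table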

Re: loading, querying schemaRDD using SparkSQL

2014-11-06 Thread Michael Armbrust
-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052p18137.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

loading, querying schemaRDD using SparkSQL

2014-11-04 Thread vdiwakar.malladi
://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr

Re: SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Michael Armbrust
This is not supported yet. It would be great if you could open a JIRA (though I think apache JIRA is down ATM). On Tue, Nov 4, 2014 at 9:40 AM, Terry Siu terry@smartfocus.com wrote: I’m trying to execute a subquery inside an IN clause and am encountering an unsupported language feature

Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread Michael Armbrust
/loading-querying-schemaRDD-using-SparkSQL-tp18052.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h

Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread vdiwakar.malladi
.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052p18137.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional

Re: SparkSQL performance

2014-11-03 Thread Marius Soutier
core is that it performs really well once you tune it properly. As far I understand SparkSQL under the hood performs many of these optimizations (order of Spark operations) and uses a more efficient storage format. Is this assumption correct? Has anyone done any comparison of SparkSQL

Re: Does SparkSQL work with custom defined SerDe?

2014-11-02 Thread Chirag Aggarwal
@spark.apache.org Subject: Re: Does SparkSQL work with custom defined SerDe? Looks like it may be related to https://issues.apache.org/jira/browse/SPARK-3807. I will build from branch 1.1 to see if the issue is resolved. Chen On Tue, Oct 14

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
. Cheng On Fri, Oct 31, 2014 at 7:04 AM, Jean-Pascal Billaud j...@tellapart.com wrote: Hi, While testing SparkSQL on top of our Hive metastore, I am getting some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD table. Basically, I have a table mtable partitioned by some date

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Jean-Pascal Billaud
to collect column statistics, which causes this issue. Filed SPARK-4182 to track this issue, will fix this ASAP. Cheng On Fri, Oct 31, 2014 at 7:04 AM, Jean-Pascal Billaud j...@tellapart.com wrote: Hi, While testing SparkSQL on top of our Hive metastore, I am getting some

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
AM, Jean-Pascal Billaud j...@tellapart.com wrote: Hi, While testing SparkSQL on top of our Hive metastore, I am getting some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD table. Basically, I have a table mtable partitioned by some date field in hive and below

Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
Hi, I am using the latest Cassandra-Spark Connector to access Cassandra tables from Spark. While I successfully managed to connect Cassandra using CassandraRDD, the similar SparkSQL approach does not work. Here is my code for both methods: import com.datastax.spark.connector._ import

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread Helena Edelson
the latest Cassandra-Spark Connector to access Cassandra tables form Spark. While I successfully managed to connect Cassandra using CassandraRDD, the similar SparkSQL approach does not work. Here is my code for both methods: import com.datastax.spark.connector._ import org.apache.spark

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
Cassandra tables form Spark. While I successfully managed to connect Cassandra using CassandraRDD, the similar SparkSQL approach does not work. Here is my code for both methods: import com.datastax.spark.connector._ import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread Helena Edelson
Cassandra tables form Spark. While I successfully managed to connect Cassandra using CassandraRDD, the similar SparkSQL approach does not work. Here is my code for both methods: import com.datastax.spark.connector._ import org.apache.spark.{SparkConf, SparkContext} import

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
On Oct 31, 2014, at 1:25 PM, shahab shahab.mok...@gmail.com wrote: Hi, I am using the latest Cassandra-Spark Connector to access Cassandra tables form Spark. While I successfully managed to connect Cassandra using CassandraRDD, the similar SparkSQL approach does not work. Here is my code for both

SparkSQL performance

2014-10-31 Thread Soumya Simanta
I was really surprised to see the results here, esp. SparkSQL not completing http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required
