Re: NEW to spark and sparksql

2014-11-19 Thread Michael Armbrust
amming-guide.html >> >> For Avro in particular, I have been working on a library for Spark SQL. >> It's very early code, but you can find it here: >> https://github.com/databricks/spark-avro >> >> Bug reports welcome! >> >> Michael >> >> On
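
For reference, a minimal sketch of reading such Avro data with that library, assuming the early spark-avro API (the avroFile method from its README) and Spark 1.1; treat it as illustrative, since the library was brand new at the time:

    import org.apache.spark.sql.SQLContext
    import com.databricks.spark.avro._  // brings avroFile onto SQLContext via implicits

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
    // Hypothetical path: one hour's worth of Avro files on HDFS
    val events = sqlContext.avroFile("hdfs:///data/events/2014/11/19/00/*.avro")
    events.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()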

Re: NEW to spark and sparksql

2014-11-19 Thread Michael Armbrust
RDD > files as well as SparkSQL. My question is more on how to build out the RDD > files and best practices. I have data that is broken down by hour into > files on HDFS in avro format. Do I need to create a separate RDD for each > file? Or, using SparkSQL, a separate SchemaRDD? >

NEW to spark and sparksql

2014-11-19 Thread Sam Flint
Hi, I am new to Spark. I have begun reading up on Spark's RDDs as well as SparkSQL. My question is more about how to build out the RDDs and about best practices. I have data that is broken down by hour into files on HDFS in Avro format. Do I need to create a separate RDD for

Re: SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-19 Thread Michael Armbrust
On Tue, Nov 18, 2014 at 10:34 PM, Night Wolf wrote: > > Is there a better way to mock this out and test Hive/metastore with > SparkSQL? > I would use TestHive which creates a fresh metastore each time it is invoked.
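
A minimal sketch of the TestHive approach, assuming Spark 1.1 with the spark-hive test artifact on the test classpath:

    import org.apache.spark.sql.hive.test.TestHive

    // TestHive is a HiveContext backed by a fresh local metastore and warehouse
    TestHive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    TestHive.sql("SELECT COUNT(*) FROM src").collect()
    TestHive.reset()  // wipe metastore state between test cases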

SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-18 Thread Night Wolf
Hi, Just to give some context: we are using the Hive metastore with CSV & Parquet files as part of our ETL pipeline. We query these with SparkSQL to do some downstream work. I'm curious what's the best way to go about testing Hive & SparkSQL? I'm using 1.1.0. I see that the Lo

Re: SparkSQL exception on spark.sql.codegen

2014-11-18 Thread Eric Zhen
nd not much has changed there. Is the error >>> deterministic? >>> >>> On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen wrote: >>> >>>> Hi Michael, >>>> >>>> We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4. >>

Re: SparkSQL exception on spark.sql.codegen

2014-11-18 Thread Michael Armbrust
nistic? >> >> On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen wrote: >> >>> Hi Michael, >>> >>> We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4. >>> >>> On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust < >>> mich...@dat

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
17, 2014 at 7:04 PM, Eric Zhen wrote: > >> Hi Michael, >> >> We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4. >> >> On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust > > wrote: >> >>> What version of Spark SQL? >>> >>

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Michael Armbrust
On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust > wrote: > >> What version of Spark SQL? >> >> On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen wrote: >> >>> Hi all, >>> >>> We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codege

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
Hi Michael, We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4. On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust wrote: > What version of Spark SQL? > > On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen wrote: > >> Hi all, >> >> We run SparkS

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Michael Armbrust
What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen wrote: > Hi all, > > We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, we > got exceptions as below, has anyone else seen these before? > > java.lang.ExceptionInInitializer

Re: SparkSQL exception on cached parquet table

2014-11-16 Thread Sadhan Sood
; to give COUNT(*). In the second case, however, the whole table is asked >>> to be cached lazily via the cacheTable call, thus it’s scanned to build >>> the in-memory columnar cache. Then things went wrong while scanning this LZO >>> compressed Parquet file. But unfortunate

Re: SparkSQL exception on cached parquet table

2014-11-16 Thread Cheng Lian
/14 5:28 AM, Sadhan Sood wrote: While testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I encountered this error: import org.apache.spark.sql.SchemaRDD import org.apache.hadoop

SparkSQL exception on spark.sql.codegen

2014-11-15 Thread Eric Zhen
Hi all, We run SparkSQL on the TPCDS benchmark Q19 with spark.sql.codegen=true, and we got the exceptions below. Has anyone else seen these before? java.lang.ExceptionInInitializerError at org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92) at
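
For reference, the flag under discussion is set per SQLContext; a sketch (spark.sql.codegen was experimental in 1.1):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
    // Compile expression evaluation into JVM bytecode at runtime
    sqlContext.setConf("spark.sql.codegen", "true")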

Re: SparkSQL exception on cached parquet table

2014-11-15 Thread sadhan
Hi Cheng, Thanks for your response.Here is the stack trace from yarn logs: -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-exception-on-cached-parquet-table-tp18978p19020.html Sent from the Apache Spark User List mailing list archive at

Re: SparkSQL exception on cached parquet table

2014-11-15 Thread Cheng Lian
testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I encountered this error: import org.apache.spark.sql.SchemaRDD import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; val

Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Cheng Lian
Hm… Have you tuned spark.storage.memoryFraction? By default, 60% of memory is used for caching. You may refer to the details here: http://spark.apache.org/docs/latest/configuration.html On 11/15/14 5:43 AM, Sadhan Sood wrote: Thanks Cheng, that was helpful. I noticed from UI that only half o
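
In SparkConf terms, the knob looks like this; a sketch with illustrative values only:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cache-tuning")  // hypothetical app name
      .set("spark.storage.memoryFraction", "0.6")  // default 0.6; raise cautiously if caching dominates
    val sc = new SparkContext(conf)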

Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Sadhan Sood
Thanks Cheng, that was helpful. I noticed from the UI that only half of the memory per executor was being used for caching; is that true? We have a 2 TB sequence-file dataset that we wanted to cache in our cluster with ~5 TB of memory, but caching still failed, and what it looked like from the UI was that it u

SparkSQL exception on cached parquet table

2014-11-14 Thread Sadhan Sood
While testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I encountered this error: import org.apache.spark.sql.SchemaRDD import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
No, the columnar buffer is built in small batches; the batch size is controlled by the spark.sql.inMemoryColumnarStorage.batchSize property. The default value for this in master and branch-1.2 is 10,000 rows per batch. On 11/14/14 1:27 AM, Sadhan Sood wrote: Thanks Cheng, Just one
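
A sketch of tuning that property before caching, given an existing SparkContext sc (table name hypothetical):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Smaller batches lower peak memory while building the columnar cache
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "1000")
    sqlContext.cacheTable("events")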

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Sadhan Sood
Thanks Cheng. Just one more question: does that mean that we still need enough memory in the cluster to uncompress the data before it can be compressed again, or does that just read the raw data as is? On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian wrote: > Currently there’s no way to cache the c

Re: loading, querying schemaRDD using SparkSQL

2014-11-13 Thread vdiwakar.malladi
? Thanks in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052p18841.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Cheng Lian
Currently there’s no way to cache the compressed sequence file directly. Spark SQL uses an in-memory columnar format while caching table rows, so we must read all the raw data and convert it into columnar format. However, you can enable in-memory columnar compression by setting spark.sql.inMemo
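
The full property name is spark.sql.inMemoryColumnarStorage.compressed (it also appears verbatim in the spark-defaults.conf excerpt later in this list); a sketch of enabling it before caching, with a hypothetical table name:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
    // Cached rows are re-encoded into compressed columnar batches
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    sqlContext.cacheTable("events")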

Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Sadhan Sood
We noticed, while caching data from our Hive tables that store data in compressed sequence-file format, that the data gets uncompressed in memory when cached. Is there a way to turn this off and cache the compressed data as is?

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
On re-running the cache statement, I see from the logs that when collect (stage 1) fails, it always leads to mapPartition (stage 0) being re-run for one partition. This can be seen in the collect log as well as in the container log: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an outp

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
This is the log output: 2014-11-12 19:07:16,561 INFO thriftserver.SparkExecuteStatementOperation (Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS SELECT * FROM xyz where date_prefix = 20141112' 2014-11-12 19:07:17,455 INFO Configuration.deprecation (Configuration.java:warn

Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
We are running Spark on YARN with combined memory > 1 TB, and when trying to cache a table partition (which is < 100 GB), we are seeing a lot of failed collect stages in the UI, and the caching never succeeds. Because of the failed collect, it seems like the mapPartitions keep getting resubmitted. We have more than en

Re: loading, querying schemaRDD using SparkSQL

2014-11-06 Thread Michael Armbrust
ntext: > http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052p18137.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscrib

Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread vdiwakar.malladi
-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052p18137.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For addit

Re: SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Terry Siu
"user@spark.apache.org" Subject: Re: SparkSQL - No support for subqueries in 1.2-snapshot? This is not supported yet. It would be great if you could open a JIRA (though I think apache JIRA is down ATM). On Tue, Nov 4, 2014 a

Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread Michael Armbrust
t; View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - &

Re: SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Michael Armbrust
This is not supported yet. It would be great if you could open a JIRA (though I think apache JIRA is down ATM). On Tue, Nov 4, 2014 at 9:40 AM, Terry Siu wrote: > I’m trying to execute a subquery inside an IN clause and am encountering > an unsupported language feature in the parser. > > java

SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Terry Siu
I’m trying to execute a subquery inside an IN clause and am encountering an unsupported language feature in the parser. java.lang.RuntimeException: Unsupported language features in query: select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (
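
Until IN subqueries are supported, one common HiveQL-style rewrite is a LEFT SEMI JOIN; a sketch against the same sparkbug table, with a hypothetical filter condition:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext
    // Equivalent to: SELECT customerid FROM sparkbug WHERE customerid IN (SELECT customerid FROM sparkbug WHERE <cond>)
    val result = hiveContext.sql("""
      SELECT a.customerid
      FROM sparkbug a
      LEFT SEMI JOIN (SELECT customerid FROM sparkbug WHERE customerid > 100) b
        ON a.customerid = b.customerid
    """)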

loading, querying schemaRDD using SparkSQL

2014-11-04 Thread vdiwakar.malladi
context: http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail

Re: SparkSQL performance

2014-11-03 Thread Marius Soutier
performs really > well once you tune it properly. > > As far as I understand, SparkSQL under the hood performs many of these > optimizations (order of Spark operations) and uses a more efficient storage > format. Is this assumption correct? > > Has anyone done any comparison

Re: Does SparkSQL work with custom defined SerDe?

2014-11-02 Thread Chirag Aggarwal
"user@spark.apache.org" Subject: Re: Does SparkSQL work with custom defined SerDe? Looks like it may be related to https://issues.apache.org/jira/browse/SPARK-3807. I will build from branch 1.1 to see if the issue is resolved. Chen On Tue, Oct 14,

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
> On Fri, Oct 31, 2014 at 7:04 AM, Jean-Pascal Billaud > wrote: > >> Hi, >> >> While testing SparkSQL on top of our Hive metastore, I am getting >> some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD >> table. >> >> Basically, I ha

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Jean-Pascal Billaud
sed to collect column statistics, which causes this > issue. Filed SPARK-4182 to track this issue, will fix this ASAP. > > Cheng > >> On Fri, Oct 31, 2014 at 7:04 AM, Jean-Pascal Billaud >> wrote: >> Hi, >> >> While t

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
. Cheng On Fri, Oct 31, 2014 at 7:04 AM, Jean-Pascal Billaud wrote: > Hi, > > While testing SparkSQL on top of our Hive metastore, I am getting > some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD > table. > > Basically, I have a table "mtable" part

Re: SparkSQL performance

2014-10-31 Thread Soumya Simanta
I agree. My personal experience with Spark core is that it performs really well once you tune it properly. As far as I understand, SparkSQL under the hood performs many of these optimizations (order of Spark operations) and uses a more efficient storage format. Is this assumption correct? Has anyone

Re: SparkSQL performance

2014-10-31 Thread Du Li
From: Soumya Simanta <soumya.sima...@gmail.com> Date: Friday, October 31, 2014 at 4:04 PM To: "user@spark.apache.org" Subject: SparkSQL performance I was really surprised to see the results here, e

SparkSQL performance

2014-10-31 Thread Soumya Simanta
I was really surprised to see the results here, esp. SparkSQL "not completing" http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that ar

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
, > > "org.clapper" %% "grizzled-slf4j" % "1.0.2", > > "log4j" % "log4j" % "1.2.17" > > On Fri, Oct 31, 2014 at 6:42 PM, Helena Edelson < > helena.edel...@datastax.com> wrote: > >> Hi Shahab, &g

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread Helena Edelson
", > > "org.slf4j" % "slf4j-simple" % "1.7.7", > > "org.clapper" %% "grizzled-slf4j" % "1.0.2", > > "log4j" % "log4j" % "1.2.17" > > > On Fri, Oct 31, 2014 at

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
Edelson wrote: > Hi Shahab, > > I’m just curious, are you explicitly needing to use thrift? Just using the > connector with spark does not require any thrift dependencies. > Simply: "com.datastax.spark" %% "spark-cassandra-connector" % > "1.1

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread Helena Edelson
: > Hi, > > I am using the latest Cassandra-Spark Connector to access Cassandra tables > from Spark. While I successfully managed to connect to Cassandra using > CassandraRDD, the similar SparkSQL approach does not work. Here is my code > for both methods: > > import com.datasta

Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
Hi, I am using the latest Cassandra-Spark Connector to access Cassandra tables from Spark. While I successfully managed to connect to Cassandra using CassandraRDD, the similar SparkSQL approach does not work. Here is my code for both methods: import com.datastax.spark.connector._ import
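
For comparison, the SparkSQL route in the 1.1 line of the connector went through CassandraSQLContext; a sketch, assuming your connector version ships that class (keyspace and table names are placeholders):

    import org.apache.spark.sql.cassandra.CassandraSQLContext

    val cc = new CassandraSQLContext(sc)  // sc: an existing SparkContext
    // Tables are addressed as keyspace.table inside the SQL statement
    val rows = cc.sql("SELECT * FROM my_keyspace.my_table")
    rows.collect().foreach(println)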

Re: SparkSQL + Hive Cached Table Exception

2014-10-30 Thread Michael Armbrust
Hmmm, this looks like a bug. Can you file a JIRA? On Thu, Oct 30, 2014 at 4:04 PM, Jean-Pascal Billaud wrote: > Hi, > > While testing SparkSQL on top of our Hive metastore, I am getting > some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD > table. > >

SparkSQL + Hive Cached Table Exception

2014-10-30 Thread Jean-Pascal Billaud
Hi, While testing SparkSQL on top of our Hive metastore, I am getting some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD table. Basically, I have a table "mtable" partitioned by some "date" field in hive and below is the scala code I am running

Re: SparkSQL: Nested Query error

2014-10-29 Thread SK
tract(deviceRDD).count(). The count comes out to be 1, but there are many UIDs in "tusers" that are not in "device", so the result is not correct. I would like to know the right way to frame this query in SparkSQL. Thanks. -- View this message in context: http://apache-s
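
One way to express the NOT IN semantics in 1.1-era SparkSQL is an outer join plus a null filter; a sketch, assuming tusers and device are registered tables and d_uid is never null:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
    // Count users whose u_uid has no matching d_uid in device
    val usersWithNoDevice = sqlContext.sql("""
      SELECT COUNT(t.u_uid)
      FROM tusers t LEFT OUTER JOIN device d ON t.u_uid = d.d_uid
      WHERE d.d_uid IS NULL
    """)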

Re: SparkSQL: Nested Query error

2014-10-29 Thread Sanjiv Mittal
as a string literal as follows: > val users_with_no_device = sql_cxt.sql("SELECT COUNT (u_uid) FROM tusers > WHERE tusers.u_uid NOT IN ("SELECT d_uid FROM device")") > But that resulted in a compilation error. > > What is the right way to frame the above query in

SparkSQL: Nested Query error

2014-10-29 Thread SK
frame the above query in Spark SQL? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Nested-Query-error-tp17691.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: RDD to Multiple Tables SparkSQL

2014-10-28 Thread critikaled
me_key where value operator 'some_thing' ". BTW what do you mean by "extract" could you direct me to api or code sample. thanks and regards, critikaled. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-Multiple-Tables-SparkSQL

Re: Re: SparkSql OutOfMemoryError

2014-10-28 Thread Zhanfeng Huo
It works, thanks very much. Zhanfeng Huo From: Yanbo Liang Date: 2014-10-28 18:50 To: Zhanfeng Huo CC: user Subject: Re: SparkSql OutOfMemoryError Try to increase the driver memory. 2014-10-28 17:33 GMT+08:00 Zhanfeng Huo : Hi friends: I use Spark SQL (Spark 1.1) to operate on data in Hive 0.12

Re: SparkSql OutOfMemoryError

2014-10-28 Thread Yanbo Liang
Try to increase the driver memory. 2014-10-28 17:33 GMT+08:00 Zhanfeng Huo : > Hi friends: > > I use Spark SQL (Spark 1.1) to operate on data in Hive 0.12, and the job fails > when the data is large. How should I tune it? > > spark-defaults.conf: > > spark.shuffle.consolidateFiles true > spark.shuffle

SparkSql OutOfMemoryError

2014-10-28 Thread Zhanfeng Huo
Hi friends: I use Spark SQL (Spark 1.1) to operate on data in Hive 0.12, and the job fails when the data is large. How should I tune it? spark-defaults.conf: spark.shuffle.consolidateFiles true spark.shuffle.manager SORT spark.akka.threads 4 spark.sql.inMemoryColumnarStorage.compressed true

Re: Reply: SparkSQL display wrong result

2014-10-27 Thread Cheng Lian
PATH '/home/data/testFolder/qrytblA.txt' INTO TABLE tblA; LOAD DATA LOCAL INPATH '/home/data/testFolder/qrytblB.txt' INTO TABLE tblB; From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: 2014-10-27 16:48 To: lyf刘钰帆; user@spark.apache.org Subject: Re: SparkSQL display

Re: SparkSQL display wrong result

2014-10-27 Thread Cheng Lian
Would you mind sharing the DDLs of all the involved tables? What format are these tables stored in? Is this issue specific to this query? I guess Hive, Shark and Spark SQL all read from the same HDFS dataset? On 10/27/14 3:45 PM, lyf刘钰帆 wrote: Hi, I am using SparkSQL 1.1.0 with cdh 4.6.0 recently

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
Hive guides, it looks like it only supports loading > data from files, but I want to query tables stored in memory only via JDBC. > Is that possible? > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits
query tables stored in memory only via JDBC. Is that possible? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196p17235.html Sent from the Apache Spark User List mailing list archive at Nabbl

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Sadhan Sood
ry. >>>> >>>> I see spark sql allows ad hoc querying through JDBC though I have never >>>> used >>>> that before. Will using JDBC offer any advantages (e.g does it have >>>> built in >>>> support for caching?) over rolling my o

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
ilt in >>> support for caching?) over rolling my own solution for this use case? >>> >>> Thanks! >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.c

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Sadhan Sood
er rolling my own solution for this use case? >> >> Thanks! >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html >> Sent from the

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
C offer any advantages (e.g does it have built > in > support for caching?) over rolling my own solution for this use case? > > Thanks! > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-f

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Aniket Bhatnagar
in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To un

Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits
-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For

Re: SparkSQL , best way to divide data into partitions?

2014-10-23 Thread Michael Armbrust
Spark SQL now supports Hive style dynamic partitioning: https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions This is a new feature so you'll have to build master or wait for 1.2. On Wed, Oct 22, 2014 at 7:03 PM, raymond wrote: > Hi > > I have a json file that can be load b
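
A sketch of the dynamic partition insert this enables, assuming a 1.2-era build; table, column, and staging names are hypothetical:

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)  // sc: an existing SparkContext
    hc.sql("SET hive.exec.dynamic.partition = true")
    hc.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
    hc.sql("CREATE TABLE events_by_date (payload STRING) PARTITIONED BY (dt STRING)")
    // 'staging' is the jsonFile-loaded table registered beforehand
    hc.sql("INSERT OVERWRITE TABLE events_by_date PARTITION (dt) SELECT payload, dt FROM staging")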

SparkSQL and columnar data

2014-10-23 Thread Marius Soutier
Hi guys, another question: what’s the approach to working with column-oriented data, i.e. data with more than 1000 columns? Using Parquet for this should be fine, but how well does SparkSQL handle such a large number of columns? Is there a limit? Should we use standard Spark instead? Thanks for

SparkSQL , best way to divide data into partitions?

2014-10-22 Thread raymond
Hi, I have a json file that can be loaded by sqlcontext.jsonfile into a table, but this table is not partitioned. I wish to transform this table into a partitioned table, say on the field “date”, etc. What would be the best approach to do this? It seems in Hive this is usually done

Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Pierre B
1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909p16942.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For

Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-21 Thread Terry Siu
Michael Armbrust <mich...@databricks.com> Cc: "user@spark.apache.org" Subject: Re: SparkSQL - TreeNodeException for unresolved attributes Hi Michael, Thanks again for the reply. Was hoping it was som

Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Michael Armbrust
No, analytic and window functions do not work yet. On Tue, Oct 21, 2014 at 3:00 AM, Pierre B < pierre.borckm...@realimpactanalytics.com> wrote: > Hi! > > The RANK function is available in hive since version 0.11. > When trying to use it in SparkSQL, I'm getting the foll

[SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Pierre B
Hi! The RANK function is available in hive since version 0.11. When trying to use it in SparkSQL, I'm getting the following exception (full stacktrace below): java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be ca

Re: RDD to Multiple Tables SparkSQL

2014-10-21 Thread Olivier Girardot
; > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-Multiple-Tables-SparkSQL-tp16807.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >

Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Terry Siu
Cc: "user@spark.apache.org" Subject: Re: SparkSQL - TreeNodeException for unresolved attributes Have you tried this on master? There were several problems with resolution of complex queries that were registered as ta

Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Michael Armbrust
titions. This task is > an effort to simulate the unsupported GROUPING SETS functionality in > SparkSQL. > > In my first attempt, I got really close using SchemaRDD.groupBy until I > realized that SchemaRDD.insertTo API does not support partitioned tables > yet. This prompted

SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Terry Siu
GROUP BY to write back out to a Hive rollup table that has two partitions. This task is an effort to simulate the unsupported GROUPING SETS functionality in SparkSQL. In my first attempt, I got really close using SchemaRDD.groupBy until I realized that SchemaRDD.insertTo API does not support

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-20 Thread Terry Siu
Hi Yin, Sorry for the delay; I’ll try the code change when I get a chance, but Michael’s initial response did solve my problem. In the meantime, I’m hitting another issue with SparkSQL, about which I will probably post another message if I can’t figure out a workaround. Thanks, -Terry From: Yin

RDD to Multiple Tables SparkSQL

2014-10-20 Thread critikaled
ew this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-Multiple-Tables-SparkSQL-tp16807.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user

Re: [SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-16 Thread Earthson
I'm trying to provide an API interface to Java users, and I need to accept their JavaSchemaRDDs and convert them to SchemaRDDs for Scala users. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Convert-JavaSchemaRDD-to-SchemaRDD-tp16482p16641.html Sent
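
A sketch of the conversion, assuming the baseSchemaRDD member of JavaSchemaRDD is accessible in your Spark 1.1 build:

    import org.apache.spark.sql.SchemaRDD
    import org.apache.spark.sql.api.java.JavaSchemaRDD

    // A JavaSchemaRDD is a thin wrapper around a Scala SchemaRDD
    def toScala(javaRdd: JavaSchemaRDD): SchemaRDD = javaRdd.baseSchemaRDD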

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-16 Thread Yin Huai
owed by the 2 partition > columns, coll_def_id and seg_def_id. Output shows 29 rows, but that looks > like it’s just counting the rows in the console output. Let me know if you > need more information. > > > Thanks > > -Terry > > > From: Yin Huai > Date: T

Re: [SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-16 Thread Cheng Lian
I want to confirm this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Convert-JavaSchemaRDD-to-SchemaRDD-tp16482.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: SparkSQL: set hive.metastore.warehouse.dir in CLI doesn't work

2014-10-16 Thread Cheng Lian
The warehouse location needs to be specified before the HiveContext initialization; you can set it via: ./bin/spark-sql --hiveconf hive.metastore.warehouse.dir=/home/spark/hive/warehouse On 10/15/14 8:55 PM, Hao Ren wrote: Hi, The following query in sparkSQL 1.1.0 CLI doesn't

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-15 Thread Terry Siu
"user@spark.apache.org" Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet Hello Terry, How many columns does pqt_rdt_snappy have? Thanks, Yin On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu <terry@smartfoc

SparkSQL: set hive.metastore.warehouse.dir in CLI doesn't work

2014-10-15 Thread Hao Ren
Hi, The following query in sparkSQL 1.1.0 CLI doesn't work. SET hive.metastore.warehouse.dir=/home/spark/hive/warehouse; create table test as select v1.*, v2.card_type, v2.card_upgrade_time_black, v2.card_upgrade_time_gold from customer v1 left join customer_loyalty v2 on v1.account_id

[SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-15 Thread Earthson
y One tell me that: Is it a good idea for me to *use catalyst as DSL's execution engine?* I am trying to build a DSL, And I want to confirm this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Convert-JavaSchemaRDD-to-SchemaRDD-tp16482.html S

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Yin Huai
e: Monday, October 13, 2014 at 5:05 PM > To: Terry Siu > Cc: "user@spark.apache.org" > Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet > > There are some known bugs with the Parquet SerDe and Spark 1.1. > > You can try setting spark.sql

Re: How to patch sparkSQL on EC2?

2014-10-14 Thread Christos Kozanitis
sions for sparkSQL (for version 1.1.0) and I am > trying to deploy my new jar files (one for catalyst and one for sql/core) on > ec2. > > My approach was to create a new > spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar that merged the contents of > the old one with the content

Re: Does SparkSQL work with custom defined SerDe?

2014-10-14 Thread Chen Song
Looks like it may be related to https://issues.apache.org/jira/browse/SPARK-3807. I will build from branch 1.1 to see if the issue is resolved. Chen On Tue, Oct 14, 2014 at 10:33 AM, Chen Song wrote: > Sorry for bringing this out again, as I have no clue what could have > caused this. > > I tu

Re: SparkSQL: StringType for numeric comparison

2014-10-14 Thread Michael Armbrust
e used to form the table schema. As for me, StringType is > enough, why do we need others ? > > Hao > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-StringType-for-numeric-comparison-tp16295p16361.html > Sent from th

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Terry Siu
"user@spark.apache.org" Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet There are some known bugs with the Parquet SerDe and Spark 1.1. You can try setting spark.sql.hive.convertMetastoreParquet=true to cause Spark SQL to use
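
A sketch of applying that suggestion programmatically, using the table name from this thread:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext
    // Read metastore Parquet tables with Spark SQL's native scanner instead of the Hive SerDe path
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
    hiveContext.sql("SELECT COUNT(*) FROM pqt_rdt_snappy").collect()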

Re: Does SparkSQL work with custom defined SerDe?

2014-10-14 Thread Chen Song
Sorry for bringing this out again, as I have no clue what could have caused this. I turned on DEBUG logging and did see the jar containing the SerDe class was scanned. More interestingly, I saw the same exception (org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attribut

Re: SparkSQL: select syntax

2014-10-14 Thread Hao Ren
Thank you, Gen. I will give hiveContext a try. =) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-tp16299p16368.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: SparkSQL: select syntax

2014-10-14 Thread Gen
ve. Look at how you'd write this in HiveQL, and then try doing that with HiveContext./ In fact, there are more problems than that. SparkSQL will keep (15+5=20) columns in the final table, if I remember correctly. Therefore, when you are joining two tables which have the same columns wil

Re: SparkSQL: select syntax

2014-10-14 Thread Hao Ren
o actually retype all the 19 columns' names when querying with select. This feature exists in Hive, but in SparkSQL it gives an exception. Any ideas? Thx, Hao -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-tp16299p16364.htm

Re: SparkSQL: StringType for numeric comparison

2014-10-14 Thread invkrh
.1001560.n3.nabble.com/SparkSQL-StringType-for-numeric-comparison-tp16295p16361.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Michael Armbrust
pache.spark.scheduler.Task.run(Task.scala:54) > > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > at > java.util.concurrent.ThreadPool

Does SparkSQL work with custom defined SerDe?

2014-10-13 Thread Chen Song
In Hive, the table was created with a custom SerDe, in the following way. row format serde "abc.ProtobufSerDe" with serdeproperties ("serialization.class"= "abc.protobuf.generated.LogA$log_a") When I start the spark-sql shell, I always get the following exception, even for a simple query. select user
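
One thing to rule out is whether the SerDe jar is actually registered with the session; a sketch, with hypothetical paths and table names:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext
    // The jar must contain abc.ProtobufSerDe and the generated protobuf classes
    hiveContext.sql("ADD JAR /path/to/abc-protobuf-serde.jar")
    hiveContext.sql("SELECT * FROM log_a LIMIT 10")  // hypothetical table using the SerDe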

SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Terry Siu
umns and two partitions defined. Does this error look familiar to anyone? Could my usage of SparkSQL with Hive be incorrect or is support with Hive/Parquet/partitioning still buggy at this point in Spark 1.1.0? Thanks, -Terry
