amming-guide.html
>>
>> For Avro in particular, I have been working on a library for Spark SQL.
>> It's very early code, but you can find it here:
>> https://github.com/databricks/spark-avro
>>
>> Bug reports welcome!
>>
>> Michael
>>
>> On
RDD
> files as well as SparkSQL. My question is more on how to build out the RDD
> files and best practices. I have data that is broken down by hour into
> files on HDFS in avro format. Do I need to create a separate RDD for each
> file? or using SparkSQL a separate SchemaRDD?
>
Hi,
I am new to Spark. I have begun reading to understand Spark's RDDs
as well as SparkSQL. My question is more on how to build out the RDDs
and best practices. I have data that is broken down by hour into files on
HDFS in avro format. Do I need to create a separate RDD for
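If the hourly files share a schema, one RDD/SchemaRDD can cover all of them. A minimal sketch, not from this thread, assuming the spark-avro library linked above exposes an avroFile() method on SQLContext and that the hourly files sit under one directory tree:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._
val sc = new SparkContext("local[*]", "avro-by-hour")
val sqlContext = new SQLContext(sc)
// A glob path lets the Hadoop input layer pick up every hourly file at once,
// so a single SchemaRDD covers the whole day instead of one RDD per file.
val day = sqlContext.avroFile("hdfs:///data/events/2014/11/18/*/*.avro")
day.registerTempTable("events")
sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)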
On Tue, Nov 18, 2014 at 10:34 PM, Night Wolf wrote:
>
> Is there a better way to mock this out and test Hive/metastore with
> SparkSQL?
>
I would use TestHive which creates a fresh metastore each time it is
invoked.
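A minimal sketch of that suggestion; which artifact TestHive ships in varies by Spark version, so treat the dependency as an assumption:
import org.apache.spark.sql.hive.test.TestHive
// TestHive stands up a throwaway local metastore and warehouse for the test run.
TestHive.sql("CREATE TABLE IF NOT EXISTS kv (key INT, value STRING)")
TestHive.sql("SELECT COUNT(*) FROM kv").collect().foreach(println)
TestHive.reset() // drop anything the test created before the next test runs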
Hi,
Just to give some context. We are using the Hive metastore with CSV & Parquet
files as a part of our ETL pipeline. We query these with SparkSQL to do
some downstream work.
I'm curious what's the best way to go about testing Hive & SparkSQL? I'm
using 1.1.0.
I see that the Lo
nd not much has changed there. Is the error
>>> deterministic?
>>>
>>> On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.
nistic?
>>
>> On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen wrote:
>>
>>> Hi Michael,
>>>
>>> We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.
>>>
>>> On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust <
>>> mich...@dat
17, 2014 at 7:04 PM, Eric Zhen wrote:
>
>> Hi Michael,
>>
>> We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.
>>
>> On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust > > wrote:
>>
>>> What version of Spark SQL?
>>>
On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust
> wrote:
>
>> What version of Spark SQL?
>>
>> On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen wrote:
>>
>>> Hi all,
>>>
>>> We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codege
Hi Michael,
We use Spark v1.1.1-rc1 with JDK 1.7.0_51 and Scala 2.10.4.
On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust
wrote:
> What version of Spark SQL?
>
> On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen wrote:
>
>> Hi all,
>>
>> We run SparkS
What version of Spark SQL?
On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen wrote:
> Hi all,
>
> We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, we
> got exceptions as below, has anyone else seen these before?
>
> java.lang.ExceptionInInitializer
; to give COUNT(*). In the second case, however, the whole table is asked
>>> to be cached lazily via the cacheTable call, thus it’s scanned to build
>>> the in-memory columnar cache. Then things went wrong while scanning this LZO
>>> compressed Parquet file. But unfortunate
/14 5:28 AM, Sadhan Sood wrote:
While testing SparkSQL on a bunch of parquet files (basically
used to be a partition for one of our hive tables), I
encountered this error:
import org.apache.spark.sql.SchemaRDD
import org.apache.hadoop
Hi all,
We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, we got
exceptions as below, has anyone else seen these before?
java.lang.ExceptionInInitializerError
at
org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
at
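For reference, a hedged sketch of how the codegen flag in question is usually toggled; everything except the configuration key is a placeholder, and whether this avoids the ExceptionInInitializerError above depends on the build:
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc) // assumes an existing SparkContext `sc`
hiveContext.setConf("spark.sql.codegen", "true")   // enable runtime code generation
// hiveContext.setConf("spark.sql.codegen", "false") // fall back if codegen misbehaves
val q19 = hiveContext.sql("...") // stand-in for the TPC-DS Q19 text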
Hi Cheng,
Thanks for your response. Here is the stack trace from the YARN logs:
testing SparkSQL on a bunch of parquet files (basically used to
be a partition for one of our hive tables), I encountered this error:
import org.apache.spark.sql.SchemaRDD
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
val
Hm… Have you tuned spark.storage.memoryFraction? By default, 60% of
memory is used for caching. You may refer to the details here:
http://spark.apache.org/docs/latest/configuration.html
On 11/15/14 5:43 AM, Sadhan Sood wrote:
Thanks Cheng, that was helpful. I noticed from UI that only half o
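A sketch of that tuning knob, with a purely illustrative value; it has to be set before the SparkContext is created:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("cache-tuning")
  .set("spark.storage.memoryFraction", "0.8") // default is 0.6, i.e. 60% of executor heap
val sc = new SparkContext(conf)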
Thanks Cheng, that was helpful. I noticed from UI that only half of the
memory per executor was being used for caching, is that true? We have a 2
TB sequence file dataset that we wanted to cache in our cluster with ~ 5TB
memory but caching still failed and what looked like from the UI was that
it u
While testing SparkSQL on a bunch of parquet files (basically used to be a
partition for one of our hive tables), I encountered this error:
import org.apache.spark.sql.SchemaRDD
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path
No, the columnar buffer is built in small batches; the batch
size is controlled by the spark.sql.inMemoryColumnarStorage.batchSize
property. The default value for this in master and branch-1.2 is 10,000
rows per batch.
On 11/14/14 1:27 AM, Sadhan Sood wrote:
Thanks Cheng, Just one
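A sketch of adjusting that batch size (the value is illustrative); smaller batches reduce the transient memory needed while the columnar buffers are built:
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "1000")
sqlContext.cacheTable("my_table") // assumes an existing SQLContext and a registered table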
Thanks Cheng, just one more question - does that mean that we still need
enough memory in the cluster to uncompress the data before it can be
compressed again or does that just read the raw data as is?
On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian wrote:
> Currently there’s no way to cache the c
?
Thanks in advance.
Currently there’s no way to cache the compressed sequence file directly.
Spark SQL uses in-memory columnar format while caching table rows, so we
must read all the raw data and convert them into columnar format.
However, you can enable in-memory columnar compression by setting
spark.sql.inMemo
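The property name is cut off above; it appears in full elsewhere in this digest as spark.sql.inMemoryColumnarStorage.compressed. A sketch of enabling it (the table name is a placeholder):
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.cacheTable("my_table") // compresses the in-memory columnar buffers, not the source file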
We noticed while caching data from our hive tables which contain data in
compressed sequence file format that it gets uncompressed in memory when
getting cached. Is there a way to turn this off and cache the compressed
data as is ?
On re-running the cache statement, from the logs I see that when
collect(stage 1) fails it always leads to mapPartition(stage 0) for one
partition to be re-run. This can be seen from the collect log as well on
the container log:
rg.apache.spark.shuffle.MetadataFetchFailedException: Missing an
outp
This is the log output:
2014-11-12 19:07:16,561 INFO thriftserver.SparkExecuteStatementOperation
(Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS
SELECT * FROM xyz where date_prefix = 20141112'
2014-11-12 19:07:17,455 INFO Configuration.deprecation
(Configuration.java:warn
We are running Spark on YARN with combined memory > 1TB and when trying to
cache a table partition (which is < 100G), we see a lot of failed collect
stages in the UI and this never succeeds. Because of the failed collect, it
seems like the mapPartitions keep getting resubmitted. We have more than
en
Subject: Re: SparkSQL - No support for subqueries in 1.2-snapshot?
This is not supported yet. It would be great if you could open a JIRA (though
I think apache JIRA is down ATM).
On Tue, Nov 4, 2014 a
This is not supported yet. It would be great if you could open a JIRA
(though I think apache JIRA is down ATM).
On Tue, Nov 4, 2014 at 9:40 AM, Terry Siu wrote:
> I’m trying to execute a subquery inside an IN clause and am encountering
> an unsupported language feature in the parser.
>
> java
I’m trying to execute a subquery inside an IN clause and am encountering an
unsupported language feature in the parser.
java.lang.RuntimeException: Unsupported language features in query: select
customerid from sparkbug where customerid in (select customerid from sparkbug
where customerid in (
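Since the parser rejects IN (SELECT ...), a common workaround with HiveContext is to express the same filter as a semi join. A hedged sketch using the table from the query above, with the truncated inner predicate left out:
val result = hiveContext.sql("""
  SELECT a.customerid
  FROM sparkbug a
  LEFT SEMI JOIN sparkbug b ON (a.customerid = b.customerid)
""") // add the original inner WHERE conditions to the right-hand side as needed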
performs really
> well once you tune it properly.
>
> As far I understand SparkSQL under the hood performs many of these
> optimizations (order of Spark operations) and uses a more efficient storage
> format. Is this assumption correct?
>
> Has anyone done any comparison
Subject: Re: Does SparkSQL work with custom defined SerDe?
Looks like it may be related to
https://issues.apache.org/jira/browse/SPARK-3807.
I will build from branch 1.1 to see if the issue is resolved.
Chen
On Tue, Oct 14,
> On Fri, Oct 31, 2014 at 7:04 AM, Jean-Pascal Billaud
> wrote:
>
>> Hi,
>>
>> While testing SparkSQL on top of our Hive metastore, I am getting
>> some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD
>> table.
>>
>> Basically, I ha
sed to collect column statistics, which causes this
> issue. Filed SPARK-4182 to track this issue, will fix this ASAP.
>
> Cheng
>
>> On Fri, Oct 31, 2014 at 7:04 AM, Jean-Pascal Billaud
>> wrote:
>> Hi,
>>
>> While t
.
Cheng
On Fri, Oct 31, 2014 at 7:04 AM, Jean-Pascal Billaud
wrote:
> Hi,
>
> While testing SparkSQL on top of our Hive metastore, I am getting
> some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD
> table.
>
> Basically, I have a table "mtable" part
I agree. My personal experience with Spark core is that it performs really
well once you tune it properly.
As far as I understand, SparkSQL under the hood performs many of these
optimizations (order of Spark operations) and uses a more efficient storage
format. Is this assumption correct?
Has anyone
From: Soumya Simanta
Date: Friday, October 31, 2014 at 4:04 PM
To: "user@spark.apache.org"
Subject: SparkSQL performance
I was really surprised to see the results here, e
I was really surprised to see the results here, esp. SparkSQL "not
completing"
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
I was under the impression that SparkSQL performs really well because it
can optimize the RDD operations and load only the columns that ar
,
>
> "org.clapper" %% "grizzled-slf4j" % "1.0.2",
>
> "log4j" % "log4j" % "1.2.17"
>
> On Fri, Oct 31, 2014 at 6:42 PM, Helena Edelson <
> helena.edel...@datastax.com> wrote:
>
>> Hi Shahab,
",
>
> "org.slf4j" % "slf4j-simple" % "1.7.7",
>
> "org.clapper" %% "grizzled-slf4j" % "1.0.2",
>
> "log4j" % "log4j" % "1.2.17"
>
>
> On Fri, Oct 31, 2014 at
Edelson wrote:
> Hi Shahab,
>
> I’m just curious, are you explicitly needing to use thrift? Just using the
> connector with spark does not require any thrift dependencies.
> Simply: "com.datastax.spark" %% "spark-cassandra-connector" %
> "1.1
:
> Hi,
>
> I am using the latest Cassandra-Spark Connector to access Cassandra tables
> from Spark. While I successfully managed to connect to Cassandra using
> CassandraRDD, the similar SparkSQL approach does not work. Here is my code
> for both methods:
>
> import com.datasta
Hi,
I am using the latest Cassandra-Spark Connector to access Cassandra tables
from Spark. While I successfully managed to connect to Cassandra using
CassandraRDD, the similar SparkSQL approach does not work. Here is my code
for both methods:
import com.datastax.spark.connector._
import
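The code in the message above is cut off, so here is a hedged sketch of the two approaches, assuming spark-cassandra-connector ~1.1 and its CassandraSQLContext class; keyspace and table names are placeholders:
import com.datastax.spark.connector._ // adds cassandraTable() to SparkContext
import org.apache.spark.sql.cassandra.CassandraSQLContext
val rdd = sc.cassandraTable("my_keyspace", "my_table")   // CassandraRDD route
val cc = new CassandraSQLContext(sc)
val rows = cc.sql("SELECT * FROM my_keyspace.my_table")  // SparkSQL route over the same table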
Hmmm, this looks like a bug. Can you file a JIRA?
On Thu, Oct 30, 2014 at 4:04 PM, Jean-Pascal Billaud
wrote:
> Hi,
>
> While testing SparkSQL on top of our Hive metastore, I am getting
> some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD
> table.
>
>
Hi,
While testing SparkSQL on top of our Hive metastore, I am getting
some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD
table.
Basically, I have a table "mtable" partitioned by some "date" field in Hive,
and below is the Scala code I am running:
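The Scala snippet itself is cut off above; a hedged reconstruction of the general pattern being described (select a partition, register it, cache it, reuse it), with illustrative names:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val day = hiveContext.sql("SELECT * FROM mtable WHERE date = '2014-10-30'")
day.registerTempTable("mtable_day")
hiveContext.cacheTable("mtable_day")
hiveContext.sql("SELECT COUNT(*) FROM mtable_day").collect() // reuse of the cached table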
tract(deviceRDD).count(). The count comes out to be 1, but
there are many UIDs in "tusers" that are not in "device" - so the result is
not correct.
I would like to know the right way to frame this query in SparkSQL.
thanks
as a string literal as follows:
> val users_with_no_device = sql_cxt.sql("SELECT COUNT (u_uid) FROM tusers
> WHERE tusers.u_uid NOT IN ("SELECT d_uid FROM device")")
> But that resulted in a compilation error.
>
> What is the right way to frame the above query in
frame the above query in Spark SQL?
thanks
me_key where value operator 'some_thing' ". BTW, what do you mean by
"extract"? Could you direct me to an API or code sample?
thanks and regards,
critikaled.
It works, thanks very much
Zhanfeng Huo
From: Yanbo Liang
Date: 2014-10-28 18:50
To: Zhanfeng Huo
CC: user
Subject: Re: SparkSql OutOfMemoryError
Try to increase the driver memory.
2014-10-28 17:33 GMT+08:00 Zhanfeng Huo :
Hi,friends:
I use spark(spark 1.1) sql operate data in hive-0.12
Try to increase the driver memory.
2014-10-28 17:33 GMT+08:00 Zhanfeng Huo :
> Hi,friends:
>
> I use spark(spark 1.1) sql operate data in hive-0.12, and the job fails
> when data is large. So how to tune it ?
>
> spark-defaults.conf:
>
> spark.shuffle.consolidateFiles true
> spark.shuffle
Hi, friends:
I use Spark (1.1) SQL to operate on data in Hive 0.12, and the job fails when
the data is large. How should I tune it?
spark-defaults.conf:
spark.shuffle.consolidateFiles true
spark.shuffle.manager SORT
spark.akka.threads 4
spark.sql.inMemoryColumnarStorage.compressed true
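Following Yanbo's suggestion above, driver (and executor) memory can be raised in the same spark-defaults.conf or on the spark-submit command line; the sizes below are purely illustrative, and depending on deploy mode the command-line flag may be the form that actually takes effect for the driver:
spark.driver.memory 8g
spark.executor.memory 8g
# or equivalently: ./bin/spark-submit --driver-memory 8g --executor-memory 8g ...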
PATH '/home/data/testFolder/qrytblA.txt' INTO TABLE
tblA;
LOAD DATA LOCAL INPATH '/home/data/testFolder/qrytblB.txt' INTO TABLE
tblB;
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: 2014-10-27 16:48
To: lyf刘钰帆; user@spark.apache.org
Subject: Re: SparkSQL display
Would you mind sharing the DDLs of all involved tables? What format are
these tables stored in? Is this issue specific to this query? I guess
Hive, Shark and Spark SQL all read from the same HDFS dataset?
On 10/27/14 3:45 PM, lyf刘钰帆 wrote:
Hi,
I have been using SparkSQL 1.1.0 with CDH 4.6.0 recently
Hive guides, it looks like it only supports loading
> data from files, but I want to query tables stored in memory only via JDBC.
> Is that possible?
query tables stored in memory only via JDBC.
Is that possible?
ry.
>>>>
>>>> I see spark sql allows ad hoc querying through JDBC though I have never
>>>> used
>>>> that before. Will using JDBC offer any advantages (e.g does it have
>>>> built in
>>>> support for caching?) over rolling my o
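For reference, a hedged sketch of the JDBC route being discussed, assuming a Hive-enabled Spark build with the bundled Thrift server scripts; CACHE TABLE is what gives JDBC clients a shared in-memory copy:
./sbin/start-thriftserver.sh --master spark://master:7077
./bin/beeline -u jdbc:hive2://localhost:10000
-- then, at the beeline prompt (table name is a placeholder):
CACHE TABLE my_table;
SELECT COUNT(*) FROM my_table; -- served from the cached columnar data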
ilt in
>>> support for caching?) over rolling my own solution for this use case?
>>>
>>> Thanks!
er rolling my own solution for this use case?
>>
>> Thanks!
C offer any advantages (e.g does it have built
> in
> support for caching?) over rolling my own solution for this use case?
>
> Thanks!
Spark SQL now supports Hive style dynamic partitioning:
https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions
This is a new feature so you'll have to build master or wait for 1.2.
On Wed, Oct 22, 2014 at 7:03 PM, raymond wrote:
> Hi
>
> I have a json file that can be load b
Hi guys,
another question: what’s the approach to working with column-oriented data,
i.e. data with more than 1000 columns. Using Parquet for this should be fine,
but how well does SparkSQL handle such a large number of columns? Is there a limit?
Should we use standard Spark instead?
Thanks for
Hi
I have a JSON file that can be loaded by sqlContext.jsonFile into a
table, but this table is not partitioned.
Then I wish to transform this table into a partitioned table, say on the
field "date". What would be the best approach to do this? It seems in Hive
this is usually done
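A hedged sketch of the dynamic-partitioning route Michael describes above (it needs a build that has the feature, i.e. master or 1.2 at the time); only the "date" field comes from the question, every other name is a placeholder:
val json = hiveContext.jsonFile("hdfs:///data/input.json")
json.registerTempTable("json_raw")
hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
hiveContext.sql("CREATE TABLE events_by_date (payload STRING) PARTITIONED BY (dt STRING)")
// the partition column is filled per row from the SELECT, one partition per distinct date
hiveContext.sql("INSERT OVERWRITE TABLE events_by_date PARTITION (dt) SELECT payload, `date` AS dt FROM json_raw")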
To: Michael Armbrust
Cc: "user@spark.apache.org"
Subject: Re: SparkSQL - TreeNodeException for unresolved attributes
Hi Michael,
Thanks again for the reply. Was hoping it was som
No, analytic and window functions do not work yet.
On Tue, Oct 21, 2014 at 3:00 AM, Pierre B <
pierre.borckm...@realimpactanalytics.com> wrote:
> Hi!
>
> The RANK function is available in hive since version 0.11.
> When trying to use it in SparkSQL, I'm getting the foll
Hi!
The RANK function is available in hive since version 0.11.
When trying to use it in SparkSQL, I'm getting the following exception (full
stacktrace below):
java.lang.ClassCastException:
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be
ca
Cc: "user@spark.apache.org"
Subject: Re: SparkSQL - TreeNodeException for unresolved attributes
Have you tried this on master? There were several problems with resolution of
complex queries that were registered as ta
titions. This task is
> an effort to simulate the unsupported GROUPING SETS functionality in
> SparkSQL.
>
> In my first attempt, I got really close using SchemaRDD.groupBy until I
> realized that SchemaRDD.insertTo API does not support partitioned tables
> yet. This prompted
GROUP BY to write back out
to a Hive rollup table that has two partitions. This task is an effort to
simulate the unsupported GROUPING SETS functionality in SparkSQL.
In my first attempt, I got really close using SchemaRDD.groupBy until I
realized that SchemaRDD.insertTo API does not support
Hi Yin,
Sorry for the delay. I'll try the code change when I get a chance, but
Michael’s initial response did solve my problem. In the meantime, I’m hitting
another issue with SparkSQL which I will probably post another message if I
can’t figure a workaround.
Thanks,
-Terry
From: Yin
I'm trying to provide an API to Java users, and I need to accept their
JavaSchemaRDDs and convert them to SchemaRDDs for Scala users.
owed by the 2 partition
> columns, coll_def_id and seg_def_id. Output shows 29 rows, but that looks
> like it’s just counting the rows in the console output. Let me know if you
> need more information.
>
>
> Thanks
>
> -Terry
>
>
> From: Yin Huai
> Date: T
I want to confirm this.
The warehouse location needs to be specified before the HiveContext
initialization; you can set it via:
./bin/spark-sql --hiveconf hive.metastore.warehouse.dir=/home/spark/hive/warehouse
On 10/15/14 8:55 PM, Hao Ren wrote:
Hi,
The following query in sparkSQL 1.1.0 CLI doesn't
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
Hello Terry,
How many columns does pqt_rdt_snappy have?
Thanks,
Yin
On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu
mailto:terry@smartfoc
Hi,
The following query in sparkSQL 1.1.0 CLI doesn't work.
SET hive.metastore.warehouse.dir=/home/spark/hive/warehouse;
create table test as
select v1.*, v2.card_type, v2.card_upgrade_time_black,
v2.card_upgrade_time_gold
from customer v1 left join customer_loyalty v2
on v1.account_id
y One tell me that: Is it a good idea for me to use Catalyst as a
DSL's execution engine?
I am trying to build a DSL, and I want to confirm this.
e: Monday, October 13, 2014 at 5:05 PM
> To: Terry Siu
> Cc: "user@spark.apache.org"
> Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
>
> There are some known bugs with the Parquet SerDe and Spark 1.1.
>
> You can try setting spark.sql
sions for sparkSQL (for version 1.1.0) and I am
> trying to deploy my new jar files (one for catalyst and one for sql/core) on
> ec2.
>
> My approach was to create a new
> spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar that merged the contents of
> the old one with the content
Looks like it may be related to
https://issues.apache.org/jira/browse/SPARK-3807.
I will build from branch 1.1 to see if the issue is resolved.
Chen
On Tue, Oct 14, 2014 at 10:33 AM, Chen Song wrote:
> Sorry for bringing this out again, as I have no clue what could have
> caused this.
>
> I tu
e used to form the table schema. As for me, StringType is
> enough, why do we need others ?
>
> Hao
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
There are some known bugs with the Parquet SerDe and Spark 1.1.
You can try setting spark.sql.hive.convertMetastoreParquet=true to cause Spark
SQL to use
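A short sketch of applying that setting; either form should be equivalent:
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
// or from the SQL side:
hiveContext.sql("SET spark.sql.hive.convertMetastoreParquet=true")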
Sorry for bringing this out again, as I have no clue what could have caused
this.
I turned on DEBUG logging and did see the jar containing the SerDe class
was scanned.
More interestingly, I saw the same exception
(org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attribut
Thank you, Gen.
I will give hiveContext a try. =)
ve.
Look at how you'd write this in HiveQL, and then try doing that with
HiveContext.
In fact, there are more problems than that. SparkSQL will keep all
(15+5=20) columns in the final table, if I remember correctly. Therefore, when
you are doing a join on two tables which have the same columns wil
o actually retype all 19 columns' names when querying with
select. This feature exists in Hive.
But in SparkSQL, it gives an exception.
Any ideas? Thanks
Hao
pache.spark.scheduler.Task.run(Task.scala:54)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPool
In Hive, the table was created with a custom SerDe, in the following way.
row format serde "abc.ProtobufSerDe"
with serdeproperties ("serialization.class"=
"abc.protobuf.generated.LogA$log_a")
When I start the spark-sql shell, I always get the following exception, even
for a simple query.
select user
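Not from this thread, but a common first check in this situation is whether the SerDe jar is actually on the SQL CLI's classpath; the paths and names below are placeholders:
./bin/spark-sql --jars /path/to/protobuf-serde.jar
ADD JAR /path/to/protobuf-serde.jar;   -- alternatively, register it from within the shell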
umns and two
partitions defined. Does this error look familiar to anyone? Could my usage of
SparkSQL with Hive be incorrect or is support with Hive/Parquet/partitioning
still buggy at this point in Spark 1.1.0?
Thanks,
-Terry