pseudo-distributed YARN cluster. Would you mind elaborating on the
steps to reproduce this bug?
Thanks
On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian
lian.cs@gmail.com
Please note that Spark 1.2.0 /only/ supports Hive 0.13.1 /or/ 0.12.0;
no other versions are supported.
Best,
Cheng
On 1/25/15 12:18 AM, guxiaobo1982 wrote:
Hi,
I built and started a single node standalone Spark 1.2.0 cluster along
with a single node Hive 0.14.0 instance installed by
Hi Ayoub,
The doc page isn’t wrong, but it’s indeed confusing.
|spark.sql.parquet.compression.codec| is used when you’re writing a Parquet
file with something like |data.saveAsParquetFile(...)|. However, you are
using Hive DDL in the example code. All Hive DDLs and commands like
|SET| are
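For reference, a minimal sketch of the case where the property does apply (the path and codec value are only examples):
  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")  // picked up by saveAsParquetFile
  data.saveAsParquetFile("hdfs://path/to/output.parquet")              // data is a SchemaRDD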
According to the Gist Ayoub provided, the schema is fine. I reproduced
this issue locally; it should be a bug, but I don't think it's related to
SPARK-5236. Will investigate this soon.
Ayoub - would you mind helping to file a JIRA for this issue? Thanks!
Cheng
On 1/30/15 11:28 AM, Michael
Yeah, currently there isn't such a repo. However, the Spark team is
working on this.
Cheng
On 1/30/15 8:19 AM, Ayoub wrote:
I am not personally aware of a repo for snapshot builds.
In my use case, I had to build spark 1.2.1-snapshot
see
What version of Spark and Hive are you using? Spark 1.1.0 and prior
versions /only/ support Hive 0.12.0. Spark 1.2.0 supports Hive 0.12.0
/or/ 0.13.1.
Cheng
On 1/29/15 6:36 PM, QiuxuanZhu wrote:
Dear all,
I have no idea why it raises an error when I run the following code.
def
On 1/21/15 10:39 AM, Cheng Lian wrote:
Oh yes, thanks for adding that using |sc.hadoopConfiguration.set| also
works :-)
On Wed, Jan 21, 2015 at 7:11 AM, Yana Kadiyska
yana.kadiy...@gmail.com wrote:
Thanks for looking Cheng. Just to clarify in case other
Hey Jorge,
This is expected, because there isn’t an obvious mapping from |Set[T]|
to any SQL type. Currently we have complex types like array, map, and
struct, which are inherited from Hive. In your case, I’d transform the
|Set[T]| into a |Seq[T]| first, then Spark SQL can map it to an
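For instance, a minimal sketch of that transformation (the case class and field names are hypothetical; the 1.2-era implicit SchemaRDD conversion is assumed):
  case class Record(id: Int, tags: Seq[String])                // Seq instead of Set
  import sqlContext.createSchemaRDD                            // implicit RDD -> SchemaRDD conversion
  val records = sc.parallelize(Seq(Record(1, Set("a", "b").toSeq)))
  records.registerTempTable("records")                         // tags becomes an array column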
On 1/27/15 5:55 PM, Cheng Lian wrote:
On 1/27/15 11:38 AM, Manoj Samel wrote:
Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.
Use case is Spark Yarn app will start and serve as query server for
multiple users i.e. always up and running. At startup, there is
option
Hey Alexey,
You need to use |HiveContext| in order to access Hive UDFs. You may try
it with |bin/spark-sql| (|src| is a Hive table):
|spark-sql> select key / 3 from src limit 10;
79.33
28.668
103.67
9.0
55.0
136.34
85.0
92.67
Currently no, if you don't want to use Spark SQL's HiveContext. But we're
working on adding partitioning support to the external data sources API,
with which you can create, for example, partitioned Parquet tables
without using Hive.
Cheng
On 1/26/15 8:47 AM, Danny Yates wrote:
Thanks
for future groupings (assuming we cache I suppose)
Mick
On 20 Jan 2015, at 20:44, Cheng Lian lian.cs@gmail.com wrote:
First of all, even if the underlying dataset is partitioned as
expected, a shuffle can’t be avoided, because Spark SQL knows
:07 PM, Cheng Lian lian.cs@gmail.com wrote:
Hey Yana,
Sorry for the late reply, missed this important thread somehow. And many
thanks for reporting this. It turned out to be a bug — filter pushdown is
only enabled when using client side metadata, which is not expected,
because task side
|IF| is implemented as a generic UDF in Hive (|GenericUDFIf|). It seems
that this function can’t be properly resolved. Could you provide a
minimal code snippet that reproduces this issue?
Cheng
On 1/20/15 1:22 AM, Xuelin Cao wrote:
Hi,
I'm trying to migrate some hive scripts to
Guess this can be helpful:
http://stackoverflow.com/questions/14252615/stack-function-in-hive-how-to-specify-multiple-aliases
On 1/19/15 8:26 AM, mucks17 wrote:
Hello
I use Hive on Spark and have an issue with assigning several aliases to the
output (several return values) of a UDF. I ran
I think you can resort to a Hive table partitioned by date
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
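A rough sketch of what that could look like, assuming a HiveContext named hiveContext (table and column names are hypothetical):
  hiveContext.sql("""CREATE TABLE events (id INT, payload STRING)
                     PARTITIONED BY (dt STRING) STORED AS PARQUET""")
  // a query filtering on the partition column only touches the matching partitions
  hiveContext.sql("SELECT COUNT(*) FROM events WHERE dt = '2015-01-11'").collect()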
On 1/11/15 9:51 PM, Paul Wais wrote:
Dear List,
What are common approaches for addressing over a union of tables /
RDDs? E.g.
-means model from its cluster centers. -Xiangrui
On Tue, Jan 20, 2015 at 11:55 AM, Cheng Lian lian.cs@gmail.com wrote:
This is because KMeansModel is neither a built-in type nor a user defined
type recognized by Spark SQL. I think you can write your own UDT version of
KMeansModel in this case
Hey Surbhit,
In this case, the web UI stats are not accurate. Please refer to this
thread for an explanation:
https://www.mail-archive.com/user@spark.apache.org/msg18919.html
Cheng
On 1/13/15 1:46 AM, Surbhit wrote:
Hi,
I am using spark 1.1.0.
I am using the spark-sql shell to run all the
Hey Yana,
Sorry for the late reply, missed this important thread somehow. And many
thanks for reporting this. It turned out to be a bug — filter pushdown
is only enabled when using client side metadata, which is not expected,
because task side metadata code path is more performant. And I
You need to provide key type, value type for map type, element type for
array type, and whether they contain null:
|StructType(Array(
  StructField("map_field", MapType(keyType = IntegerType, valueType = StringType,
    valueContainsNull = true), nullable = true),
For example, |Sort| does a defensive copy as it needs
to cache rows for sorting.
Keen to get the best performance and the best blend of SparkSQL and
functional Spark.
Cheers,
Nathan
From: Cheng Lian lian.cs@gmail.com
Date: Monday, 12 January 2015 1:21 am
First of all, even if the underlying dataset is partitioned as expected,
a shuffle can’t be avoided, because Spark SQL knows nothing about the
underlying data distribution. However, this does reduce network IO.
You can prepare your data like this (say |CustomerCode| is a string
field with
Hey Yi,
I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would
like to investigate this issue later. Would you please open a JIRA for
it? Thanks!
Cheng
On 1/19/15 1:00 AM, Yi Tian wrote:
Is there any way to support multiple users executing SQL on one thrift
server?
I
I had once worked on a named row feature but haven’t had time to finish
it. It looks like this:
|sql(...).named.map { row: NamedRow =>
  row[Int]('key) -> row[String]('value)
}
|
Basically the |named| method generates a field name to ordinal map for
each RDD partition. This map is then shared
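Until something like that lands, a workaround sketch is to look the ordinals up from the schema yourself (column names are hypothetical; 1.2-era SchemaRDD API assumed):
  val rdd = sql("SELECT key, value FROM src")
  val keyOrdinal = rdd.schema.fields.map(_.name).indexOf("key")      // computed once, on the driver
  val valueOrdinal = rdd.schema.fields.map(_.name).indexOf("value")
  rdd.map(row => row.getInt(keyOrdinal) -> row.getString(valueOrdinal))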
|spark.sql.parquet.filterPushdown| defaults to |false| because there’s a
bug in Parquet which may cause NPE, please refer to
http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
This bug hasn’t been fixed in Parquet master. We’ll turn this on once
the bug is fixed.
In Spark SQL, Parquet filter pushdown doesn’t cover |HiveTableScan| for
now. May I ask why you prefer |HiveTableScan| rather than
|ParquetTableScan|?
Cheng
On 1/19/15 5:02 PM, Xiaoyu Wang wrote:
The spark.sql.parquet.filterPushdown=true has been turned on. But
set
This is because |KMeansModel| is neither a built-in type nor a user
defined type recognized by Spark SQL. I think you can write your own UDT
version of |KMeansModel| in this case. You may refer to
|o.a.s.mllib.linalg.Vector| and |o.a.s.mllib.linalg.VectorUDT| as an
example.
Cheng
On 1/20/15
operator may also cache row objects. This is very
implementation specific and may change between versions.
Cheers,
~N
From: Michael Armbrust mich...@databricks.com
Date: Saturday, 10 January 2015 3:41 am
To: Cheng Lian lian.cs@gmail.com
Hey Nathan,
Thanks for sharing, this is a very interesting post :) My comments are
inlined below.
Cheng
On 1/7/15 11:53 AM, Nathan McCarthy wrote:
Hi,
I’m trying to use a combination of SparkSQL and 'normal' Spark/Scala
via rdd.mapPartitions(…). Using the latest release 1.2.0.
Simple
Hey Xuelin, which data item in the Web UI did you check?
On 1/7/15 5:37 PM, Xuelin Cao wrote:
Hi,
Curious and curious. I'm puzzled by the Spark SQL cached table.
Theoretically, the cached table should be a columnar table, and only
scan the columns included in my SQL.
However, in my
the input data for each task (in the stage
detail page). And the sum of the input data for each task is also 1212.5MB
On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian lian.cs@gmail.com wrote:
Hey Xuelin, which data item in the Web UI did you check?
On 1/7/15 5
Spark SQL supports the Hive insertion statement (Hive 0.14.0 style insertion
is not supported, though)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
The small SQL dialect provided in Spark SQL doesn't support insertion
This package is moved here: https://github.com/databricks/spark-avro
On 1/6/15 5:12 AM, yanenli2 wrote:
Hi All,
I want to use SparkSQL to manipulate data in Avro format. I
found a solution at https://github.com/marmbrus/sql-avro . However it
doesn't compile successfully anymore with
The |+| operator only handles numeric data types, you may register your
own concat function like this:
|sqlContext.registerFunction("concat", (s: String, t: String) => s + t)
sqlContext.sql("select concat('$', col1) from tbl")
|
Cheng
On 1/5/15 1:13 PM, RK wrote:
The issue is happening when I try
drops.
If you try it like this:
cacheTable("tbl")
sql("select * from tbl").collect()
sql("select name from tbl").collect()
sql("select * from tbl").collect()
Is the input data of the 3rd SQL bigger than 49.1KB?
On Thu, Jan 8, 2015 at 9:36 PM, Cheng Lian lian.cs@gmail.com
Most of the time a NoSuchMethodError means wrong classpath settings, and
some jar file is overridden by a wrong version. In your case it could be
netty.
On 1/3/15 1:36 PM, Niranda Perera wrote:
Hi all,
I am evaluating the spark sources API released with Spark 1.2.0. But
I'm getting a
Generally you can use |-Dsun.io.serialization.extendedDebugInfo=true| to
enable serialization debugging information when serialization exceptions
are raised.
On 12/24/14 1:32 PM, bigdata4u wrote:
I am trying to use SQL over Spark Streaming using Java. But I am getting
Serialization
Hao and Lam - I think the issue here is that |registerRDDAsTable| only
creates a temporary table, which is not seen by the Hive metastore.
And Michael had once given a workaround for creating external Parquet
table:
Hi Roc,
Spark SQL 1.2.0 can only work with Hive 0.12.0 or Hive 0.13.1
(controlled by compilation flags); versions prior to 1.2.0 only work with
Hive 0.12.0. So Hive 0.15.0-SNAPSHOT is not an option.
Would like to add that this is due to a backwards compatibility issue of
the Hive metastore, AFAIK
This depends on which output format you want. For Parquet, you can
simply do this:
|hiveContext.table("some_db.some_table").saveAsParquetFile("hdfs://path/to/file")
|
On 12/23/14 5:22 PM, LinQili wrote:
Hi Leo:
Thanks for your reply.
I am talking about using Hive from Spark to export data from
Here is a more cleaned up version, which can be used in |./sbt/sbt
hive/console| to easily reproduce this issue:
|sql("SELECT * FROM src WHERE key % 2 = 0").
  sample(withReplacement = false, fraction = 0.05).
  registerTempTable("sampled")
println(table("sampled").queryExecution)
val query = sql("SELECT
Could you please provide a complete stacktrace? Also it would be good if
you can share your hive-site.xml as well.
On 12/23/14 4:42 PM, Dai, Kevin wrote:
Hi, there
When I use the Hive UDF from_unixtime with HiveContext, the job blocks
and the log is as follows:
sun.misc.Unsafe.park(Native
Hi Ji,
Spark SQL 1.2 only works with either Hive 0.12.0 or 0.13.1 due to Hive
API/protocol compatibility issues. When interacting with Hive 0.11.x,
connections and simple queries may succeed, but things may go crazy in
unexpected corners (like UDF).
Cheng
On 12/22/14 4:15 PM, Ji ZHANG
secs.
Hari
On Wed, Dec 17, 2014 at 10:09 PM, Cheng Lian lian.cs@gmail.com wrote:
What kinds of tables underlie the SchemaRDDs? Could you
please provide the DDL of the tables and the query you executed?
On 12/18/14 6:15 AM
Evert - Thanks for the instructions, this is generally useful in other
scenarios, but I think this isn’t what Shahab needs, because
|saveAsTable| actually saves the contents of the SchemaRDD into Hive.
Shahab - As Michael has answered in another thread, you may try
On 12/17/14 1:43 PM, Jerry Raj wrote:
Hi,
I'm using the Scala DSL for Spark SQL, but I'm not able to do joins. I
have two tables (backed by Parquet files) and I need to do a join
across them using a common field (user_id). This works fine using
standard SQL but not using the
Could you please file a JIRA together with the Git commit you're using?
Thanks!
On 12/18/14 2:32 AM, Hao Ren wrote:
Hi,
When running SparkSQL branch 1.2.1 on EC2 standalone cluster, the following
query does not work:
create table debug as
select v1.*
from t1 as v1 left join t2 as v2
on
Hi Schweichler,
This is an interesting and practical question. I'm not familiar with how
Tableau works, but would like to share some thoughts.
In general, big data analytics frameworks like MR and Spark tend to
perform immutable functional transformations over immutable data. Whilst
in your
It seems that the Thrift server you connected to is the original
HiveServer2 rather than Spark SQL HiveThriftServer2.
On 12/19/14 4:08 PM, jeanlyn92 wrote:
when I run the cache table as statement in beeline, which communicates
with the thrift server, I got the following error:
14/12/19 15:57:05 ERROR
There isn’t a SQL statement that directly maps to |SQLContext.isCached|,
but you can use |EXPLAIN EXTENDED| to check whether the underlying
physical plan is an |InMemoryColumnarTableScan|.
On 12/13/14 7:14 AM, Judy Nash wrote:
Hello,
A few questions on Spark SQL:
1) Does Spark SQL support
There are several overloaded versions of both |jsonFile| and |jsonRDD|.
Schema inference is kinda expensive since it requires an extra Spark
job. You can avoid it by storing the inferred schema and
then using it together with the following two methods:
* |def jsonFile(path:
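For instance, a minimal sketch of reusing an inferred schema (paths are hypothetical):
  val sample = sqlContext.jsonFile("hdfs://path/to/sample.json")        // runs the schema inference job once
  val schema = sample.schema                                            // keep the inferred StructType
  val full = sqlContext.jsonFile("hdfs://path/to/full/*.json", schema)  // skips inference entirely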
Essentially, the Spark SQL JDBC Thrift server is just a Spark port of
HiveServer2. You don't need to run Hive, but you do need a working
Metastore.
On 12/9/14 3:59 PM, Anas Mosaad wrote:
Thanks Judy, this is exactly what I'm looking for. However, and please
forgive me if it's a dumb question:
(0.106 seconds)
0: jdbc:hive2://localhost:1
Kindly advise, what am I missing? I want to read the RDD using SQL
from outside spark-shell (i.e. like any other relational database)
On Tue, Dec 9, 2014 at 11:05 AM, Cheng Lian lian.cs@gmail.com wrote
)
On Tue, Dec 9, 2014 at 11:44 AM, Cheng Lian lian.cs@gmail.com wrote:
How did you register the table under spark-shell? Two things to
notice:
1. To interact with Hive, HiveContext instead of SQLContext must
be used.
2. `registerTempTable` doesn't
You may access it via something like |SELECT filterIp.element FROM tb|,
just like Hive. Or if you’re using Spark SQL DSL, you can use
|tb.select(filterIp.element.attr)|.
On 12/8/14 1:08 PM, Xuelin Cao wrote:
Hi,
I'm generating a Spark SQL table from an offline Json file.
The
++ seen1)
}).mapValues { case (count, seen) =>
  (count, seen.size)
}
|
On 12/5/14 3:47 AM, Arun Luthra wrote:
Is that Spark SQL? I'm wondering if it's possible without Spark SQL.
On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote:
You may do
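For reference, a complete sketch of the aggregate-based approach whose tail is quoted above (the sample data and pair extraction are hypothetical):
  // users is an RDD of (zip, user) pairs; compute COUNT(user) and COUNT(DISTINCT user) per zip
  val users = sc.parallelize(Seq(("94107", "alice"), ("94107", "bob"), ("94107", "alice")))
  val stats = users.aggregateByKey((0L, Set.empty[String]))(
    { case ((count, seen), user) => (count + 1L, seen + user) },
    { case ((count0, seen0), (count1, seen1)) => (count0 + count1, seen0 ++ seen1) }
  ).mapValues { case (count, seen) => (count, seen.size) }
  // note: the per-key Set can get large when there are many distinct users per zip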
Window functions are not supported yet, but there is a PR for it:
https://github.com/apache/spark/pull/2953
On 12/5/14 12:22 PM, Dai, Kevin wrote:
Hi, ALL
How can I group by one column and order by another one, then select
the first row for each group (which is just like window function
Hey Venkat,
This behavior seems reasonable. According to the table name, I guess
here |DAgents| should be the fact table and |ContactDetails| is the dim
table. Below is an explanation of a similar query, you may see |src| as
|DAgents| and |src1| as |ContactDetails|.
|0:
You may do this:
|table("users").groupBy('zip)('zip, count('user), countDistinct('user))
|
On 12/4/14 8:47 AM, Arun Luthra wrote:
I'm wondering how to do this kind of SQL query with PairRDDFunctions.
SELECT zip, COUNT(user), COUNT(DISTINCT user)
FROM users
GROUP BY zip
In the Spark scala API,
:37 GMT+09:00 Cheng Lian lian.cs@gmail.com:
Spark SQL supports complex types, but casting doesn't work for complex types
right now.
On 11/25/14 4:04 PM, critikaled wrote:
https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache
What’s the command line you used to build Spark? Notice that you need to
add |-Phive-thriftserver| to build the JDBC Thrift server. This profile
was once removed in v1.1.0, but added back in v1.2.0 because of a
dependency issue introduced by Scala 2.11 support.
On 11/27/14 12:53 AM,
What version are you trying to build? I was at first assuming you're
using the most recent master, but from your first mail it seems that you
were trying to build Spark v1.1.0?
On 11/27/14 12:57 PM, vdiwakar.malladi wrote:
Thanks for your response.
I'm using the following command.
mvn
Hm, then the command line you used should be fine. I just tried it
locally and it works. Make sure to run it in the root directory of the
Spark source tree (don’t |cd| into assembly).
On 11/27/14 1:35 PM, vdiwakar.malladi wrote:
Yes, I'm building it from Spark 1.1.0
Thanks in advance.
I see. As the exception stated, Maven can’t find |unzip| to help
build PySpark. So you need a Windows version of |unzip| (probably
from MinGW or Cygwin?)
On 11/27/14 2:10 PM, vdiwakar.malladi wrote:
Thanks for your prompt responses.
I'm generating the assembly jar file on Windows 7
Spark SQL supports complex types, but casting doesn't work for complex
types right now.
On 11/25/14 4:04 PM, critikaled wrote:
https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
Doesn't
Which version are you using? Or if you are using the most recent master
or branch-1.2, which commit are you using?
On 11/25/14 4:08 PM, david wrote:
Hi,
I have 2 files which come from csv import of 2 Oracle tables.
F1 has 46730613 rows
F2 has 3386740 rows
I build 2 tables with
SparkContext unsuccessfully.
Let me know if you need anything else.
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Friday, November 21, 2014 8:02 PM
To: Judy Nash; u...@spark.incubator.apache.org
*Subject:* Re: latest Spark 1.2 thrift server fail with
NoClassDefFoundError on Guava
Hi
For the “never register a table” part, actually you /can/ use Spark SQL
without registering a table via its DSL. Say you’re going to extract an
|Int| field named |key| from the table and double it:
|import org.apache.spark.sql.catalyst.dsl._
val data = sqc.parquetFile(path)
val double =
You're probably hitting this issue
https://issues.apache.org/jira/browse/SPARK-4532
Patrick made a fix for this https://github.com/apache/spark/pull/3398
On 11/22/14 10:39 AM, tridib wrote:
After taking today's build from the master branch I started getting this error
when running spark-sql:
Class
You may try |EXPLAIN EXTENDED sql| to see the logical plan, analyzed
logical plan, optimized logical plan and physical plan. Also
|SchemaRDD.toDebugString| shows storage related debugging information.
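A small sketch of both, assuming a registered table named my_table:
  sqlContext.sql("EXPLAIN EXTENDED SELECT * FROM my_table WHERE key > 10")
    .collect().foreach(println)                 // parsed / analyzed / optimized / physical plans
  val result = sqlContext.sql("SELECT * FROM my_table WHERE key > 10")
  println(result.queryExecution)                // the same information from the Scala side
  println(result.toDebugString)                 // RDD lineage and storage information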
On 11/21/14 4:11 AM, Gordon Benjamin wrote:
hey,
Can anyone tell me how to debug a sql
This thread might be helpful
http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282.html
On 11/20/14 4:11 AM, Mohammed Guller wrote:
Hi – I was curious if anyone is using the Spark SQL Thrift JDBC server
with Cassandra. It would be great if you could share
Hi Judy, could you please provide the commit SHA1 of the version you're
using? Thanks!
On 11/22/14 11:05 AM, Judy Nash wrote:
Hi,
Thrift server is failing to start for me on the latest Spark 1.2 branch.
I got the error below when I start thrift server.
Exception in thread main
When a field of an object is enclosed in a closure, the object itself is
also enclosed automatically, thus the object needs to be serializable.
On 11/19/14 6:39 PM, Hao Ren wrote:
Hi,
When reading through ALS code, I find that:
class ALS private (
private var numUserBlocks: Int,
Ah... Thanks Ted! And Hao, sorry for being the original trouble maker :)
On 11/18/14 1:50 AM, Ted Yu wrote:
Looks like this was where you got that commandline:
http://search-hadoop.com/m/JW1q5RlPrl
Cheers
On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren inv...@gmail.com
A not-so-efficient way would be this:
|val r0: RDD[OriginalRow] = ...
val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
val r2 = r1.keys.distinct().zipWithIndex()
val r3 = r2.join(r1).values
|
On 11/18/14 8:54 PM, shahab wrote:
Hi,
In my spark application, I am loading some
Hey Hao,
Which commit are you using? Just tried 64c6b9b with exactly the same
command line flags, couldn't reproduce this issue.
Cheng
On 11/17/14 10:02 PM, Hao Ren wrote:
Hi,
I am building spark on the most recent master branch.
I checked this page:
(Forgot to cc user mail list)
On 11/16/14 4:59 PM, Cheng Lian wrote:
Hey Sadhan,
Thanks for the additional information, this is helpful. Seems that
some Parquet internal contract was broken, but I'm not sure whether
it's caused by Spark SQL or Parquet, or even maybe the Parquet file
itself
|SQLContext.jsonFile| assumes one JSON record per line. Although I
haven’t tried yet, it seems that this |JsonInputFormat| [1] can be
helpful. You may read your original data set with
|SparkContext.hadoopFile| and |JsonInputFormat|, then transform the
resulting |RDD[String]| into a |JsonRDD|
Hi Sadhan,
Could you please provide the stack trace of the
|ArrayIndexOutOfBoundsException| (if any)? The reason why the first
query succeeds is that Spark SQL doesn’t bother reading all data from
the table to give |COUNT(*)|. In the second case, however, the whole
table is asked to be
Which version are you using? You probably hit this bug
https://issues.apache.org/jira/browse/SPARK-3421 if some field name in
the JSON contains characters other than [a-zA-Z0-9_].
This has been fixed in https://github.com/apache/spark/pull/2563
On 11/14/14 6:35 PM, vdiwakar.malladi wrote:
Hm, I'm not sure whether this is the official way to upgrade CDH Spark,
maybe you can check out https://github.com/cloudera/spark, apply required
patches, and then compile your own version.
On 11/14/14 8:46 PM, vdiwakar.malladi wrote:
Thanks for your response. I'm using Spark 1.1.0
Currently
13, 2014 at 10:50 PM, Cheng Lian lian.cs@gmail.com wrote:
No, the columnar buffer is built in a small batching manner, the
batch size is controlled by the
|spark.sql.inMemoryColumnarStorage.batchSize| property. The
default value for this in master
HTTP is not supported yet, and I don't think there's a JIRA ticket for it.
On 11/14/14 8:21 AM, vs wrote:
Does Spark JDBC thrift server allow connections over HTTP?
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#running-the-thrift-jdbc-server
doesn't seem to indicate this
one more question - does that mean that we still
need enough memory in the cluster to uncompress the data before it can
be compressed again or does that just read the raw data as is?
On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian lian.cs@gmail.com wrote
If you’re looking for executor side setup and cleanup functions, there
ain’t any yet, but you can achieve the same semantics via
|RDD.mapPartitions|.
Please check the “setup() and cleanup” section of this blog from
Cloudera for details:
can I write it like this?
rdd.mapPartitions { i => setup(); i }.map(...).mapPartitions { i => cleanup(); i }
So I don't need to mess up the logic and still can use map, filter and
other transformations for RDD.
Jianshi
On Fri, Nov 14, 2014 at 12:20 PM, Cheng Lian lian.cs@gmail.com
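A minimal sketch of that mapPartitions-based setup/cleanup pattern (setup, cleanup and process are hypothetical user functions):
  rdd.mapPartitions { iter =>
    setup()                          // runs once per partition, on the executor
    val out = iter.map(process)      // per-record work
    // note: calling cleanup() right here would run before the iterator is consumed;
    // wrap the iterator if cleanup must happen after the last record
    out
  }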
Currently there’s no way to cache the compressed sequence file directly.
Spark SQL uses in-memory columnar format while caching table rows, so we
must read all the raw data and convert them into columnar format.
However, you can enable in-memory columnar compression by setting
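If I remember right, the flag in question is spark.sql.inMemoryColumnarStorage.compressed; a small sketch (table name is hypothetical):
  sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
  sqlContext.cacheTable("my_table")  // cached rows are stored as compressed column batches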
You may use |RDD.zipWithIndex|.
On 11/10/14 10:03 PM, Lijun Wang wrote:
Hi,
I need a matrix with each row having a index, e.g., index = 0 for first
row, index = 1 for second row. Could someone tell me how to generate such
an IndexedRowMatrix from a RowMatrix?
Besides, is there anyone
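A minimal sketch of the |RDD.zipWithIndex| approach for the matrix question above (MLlib 1.x API assumed):
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
  val rowMatrix = new RowMatrix(sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0))))
  // zipWithIndex attaches a 0-based Long index to each row, which IndexedRow expects
  val indexed = new IndexedRowMatrix(
    rowMatrix.rows.zipWithIndex().map { case (v, i) => IndexedRow(i, v) })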
On 11/6/14 1:39 AM, Hao Ren wrote:
Hi,
I would like to understand the pipeline of Spark's operations (transformation
and action) and some details on block storage.
Let's consider the following code:
val rdd1 = sc.textFile("hdfs://...")
rdd1.map(func1).map(func2).count
For example, we
Hey Sadhan,
I really don't think this is a Spark log... Unlike Shark, Spark SQL
doesn't even provide a Hive mode to let you execute queries against
Hive. Would you please check whether there is an existing HiveServer2
running there? Spark SQL HiveThriftServer2 is just a Spark port of
Hi Jean,
Thanks for reporting this. This is indeed a bug: for some column types (Binary,
Array, Map and Struct, and unfortunately, for some reason, Boolean), a
NoopColumnStats is used to collect column statistics, which causes this
issue. Filed SPARK-4182 to track this issue, will fix this ASAP.
Just submitted a PR to fix this https://github.com/apache/spark/pull/3059
On Sun, Nov 2, 2014 at 12:36 AM, Jean-Pascal Billaud j...@tellapart.com
wrote:
Great! Thanks.
Sent from my iPad
On Nov 1, 2014, at 8:35 AM, Cheng Lian lian.cs@gmail.com wrote:
Hi Jean,
Thanks for reporting
Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0,
and related PRs are already merged or being merged to the master branch.
On 10/29/14 7:43 PM, arthur.hk.c...@gmail.com wrote:
Hi,
My Hive is 0.13.1, how to make Spark 1.1.0 run on Hive 0.13? Please advise.
Or, any news
Which version of Spark and Hadoop are you using? Could you please provide
the full stack trace of the exception?
On Tue, Oct 28, 2014 at 5:48 AM, Du Li l...@yahoo-inc.com.invalid wrote:
Hi,
I was trying to set up Spark SQL on a private cluster. I configured a
hive-site.xml under
Would you mind sharing the DDLs of all involved tables? What format are
these tables stored in? Is this issue specific to this query? I guess
Hive, Shark and Spark SQL all read from the same HDFS dataset?
On 10/27/14 3:45 PM, lyf刘钰帆 wrote:
Hi,
I am using SparkSQL 1.1.0 with cdh 4.6.0
LOCAL INPATH '/home/data/testFolder/qrytblB.txt' INTO TABLE
tblB;
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: 27 October 2014, 16:48
To: lyf刘钰帆; user@spark.apache.org
Subject: Re: SparkSQL display wrong result
Would you mind sharing the DDLs of all involved tables? What format
I have never tried this yet, but maybe you can use an in-memory Derby
database as metastore
https://db.apache.org/derby/docs/10.7/devguide/cdevdvlpinmemdb.html
I'll investigate this when I'm free; I guess we can use this for Spark SQL
Hive support testing.
On 10/27/14 4:38 PM, Jianshi Huang
https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-EmbeddedMetastore
Cheers
On Oct 27, 2014, at 6:20 AM, Cheng Lian lian.cs@gmail.com wrote:
I have never tried this yet, but maybe you can use an in-memory Derby
Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL
server. Currently Spark SQL can't run queries against SQL server. The
foreign data source API planned in Spark 1.2 can make this possible.
On 10/21/14 6:26 PM, Ashic Mahtab wrote:
Hi,
Is there a simple way to run spark
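A rough JdbcRDD sketch for pulling rows out of SQL Server (connection string, query and bounds are hypothetical; the query needs two '?' placeholders for the partition bounds):
  import java.sql.DriverManager
  import org.apache.spark.rdd.JdbcRDD
  val rows = new JdbcRDD(
    sc,
    () => DriverManager.getConnection("jdbc:sqlserver://host;databaseName=mydb;user=u;password=p"),
    "SELECT id, name FROM dbo.users WHERE id >= ? AND id <= ?",
    lowerBound = 1L, upperBound = 1000000L, numPartitions = 10,
    mapRow = rs => (rs.getInt(1), rs.getString(2)))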