This is very interesting: how to shut down a streaming job gracefully once there
has been no input data for some time.
One doable solution: count the input records with an Accumulator, and have
another thread (on the driver node) keep polling the latest accumulator value;
if there is no new value for a while, stop the context.
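A minimal sketch of that idea (assuming an input DStream `stream` and a
one-minute idle check; all names are illustrative, not from the original
thread):

  // Count records with an accumulator; a monitor thread stops the
  // StreamingContext once the count stops growing.
  val recordCount = ssc.sparkContext.accumulator(0L, "records")
  stream.foreachRDD { rdd => recordCount += rdd.count() }

  new Thread {
    override def run(): Unit = {
      var last = recordCount.value
      while (true) {
        Thread.sleep(60 * 1000)              // check once per minute
        val current = recordCount.value
        if (current == last) {               // no new data since last check
          ssc.stop(stopSparkContext = true, stopGracefully = true)
          return
        }
        last = current
      }
    }
  }.start()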
Another possibility is the parallelism: it may be 1 or some other small value,
since the input data size is not that big.
If that's the case, you can try something like:
df1.repartition(10).registerTempTable("hospitals");
df2.repartition(10).registerTempTable("counties");
…
And have you ever tried the DataFrame API, like:
sqlContext.read.json("/path/to/file.json")? Spark SQL will auto-infer the
type/schema for you.
And LATERAL VIEW will help with the flattening issues
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView),
as will nested field paths like "a.b[0].c".
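A hedged sketch of both (table and column names are illustrative; in the 1.x
line the LATERAL VIEW syntax needs a HiveContext):

  // Flatten an array column with LATERAL VIEW explode (Hive syntax).
  val flattened = hiveContext.sql(
    """SELECT id, item
      |FROM events
      |LATERAL VIEW explode(items) itemsTable AS item""".stripMargin)

  // Nested fields can be reached with path expressions.
  val nested = hiveContext.read.json("/path/to/file.json").selectExpr("a.b[0].c")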
Which version are you using? Have you tried 1.6?
From: Vadim Tkachenko [mailto:apache...@gmail.com]
Sent: Wednesday, December 30, 2015 10:17 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Problem with WINDOW functions?
When I allocate 200g to the executor, it is able to make better
Can you try to write the result into a file instead? Let's see if there is
any issue on the executor side.
sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day ORDER BY
pageviews DESC) as rank FROM d1").filter("rank <=
20").sort($"day",$"rank").write.parquet("/path/to/file")
Is there any improvement if you set a bigger memory for executors?
-Original Message-
From: va...@percona.com [mailto:va...@percona.com] On Behalf Of Vadim Tkachenko
Sent: Wednesday, December 30, 2015 9:51 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Problem with WINDOW
s etc. will be more helpful in understanding your problem.
From: Vadim Tkachenko [mailto:apache...@gmail.com]
Sent: Wednesday, December 30, 2015 10:49 AM
To: Cheng, Hao
Subject: Re: Problem with WINDOW functions?
I use 1.5.2.
Where can I get 1.6? I do not see it on http://spark.apache.org/downloads.html
T
Hi, currently the simple SQL parser of SQLContext is quite weak and doesn't
support ROLLUP, but you can check the code at
https://github.com/apache/spark/pull/5080/, which aimed to add that support,
in case you want to patch it into your own branch.
In Spark 2.0, the simple SQL Parser will
Or try Streaming SQL, which is a simple layer on top of Spark Streaming. ☺
https://github.com/Intel-bigdata/spark-streamingsql
From: Cassa L [mailto:lcas...@gmail.com]
Sent: Thursday, November 5, 2015 8:09 AM
To: Adrian Tanase
Cc: Stefano Baghino; user
Subject: Re: Rule Engine for Spark
Not as far as I can tell. @Michael @YinHuai @Reynold, any comments on this
optimization?
From: Jonathan Coveney [mailto:jcove...@gmail.com]
Sent: Tuesday, November 3, 2015 4:17 AM
To: Alex Nastetsky
Cc: Cheng, Hao; user
Subject: Re: Sort Merge Join
Additionally, I'm curious if there are any
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For
example, in the code below, the two datasets have different numbers of
partitions, but it still does a SortMergeJoin after a "hashpartitioning".
[Hao:] A distributed JOIN operation (either HashBased or SortBased Join)
Hi Jerry, I've filed a bug in JIRA, along with the fix:
https://issues.apache.org/jira/browse/SPARK-11364
It would be greatly appreciated if you could verify the PR with your case.
Thanks,
Hao
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Wednesday, October 28, 2015 8:51 AM
To: Jerry Lam
Hi Anand, can you paste the table creation statement? I'd like to reproduce
it locally first. And BTW, which version are you using?
Hao
From: Anand Nalya [mailto:anand.na...@gmail.com]
Sent: Tuesday, October 27, 2015 11:35 PM
To: spark users
Subject: SparkSQL on hive error
Hi,
I've a
At a quick glance, this seems to be a bug in Spark SQL; do you mind creating a
JIRA for it? Then I can start to fix it.
Thanks,
Hao
From: Jerry Lam [mailto:chiling...@gmail.com]
Sent: Wednesday, October 28, 2015 3:13 AM
To: Marcelo Vanzin
Cc: user@spark.apache.org
Subject: Re: [Spark-SQL]:
I am not sure we really want to support that with HiveContext, but a
workaround is to use the Spark package at https://github.com/databricks/spark-csv
From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Tuesday, October 27, 2015 10:54 AM
To: Daniel Haviv; user
Subject: RE: HiveContext
One option is to read the data via JDBC; however, it's probably the worst
option, as you likely need some hacky work to enable parallel reading in
Spark SQL.
Another option is to copy the hive-site.xml of your Hive server to
$SPARK_HOME/conf; then Spark SQL will see everything that Hive sees.
A join B join C === (A join B) join C
Semantically they are equivalent, right?
From: Richard Eggert [mailto:richard.egg...@gmail.com]
Sent: Monday, October 12, 2015 5:12 AM
To: Subhajit Purkayastha
Cc: User
Subject: Re: Saprk 1.5 - How to join 3 RDDs in a SQL DF?
It's the same as joining 2.
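For illustration, with hypothetical DataFrames a, b and c sharing a key
column, chaining is all it takes:

  // (a join b) join c; the order you chain them is the order they nest.
  val joined = a.join(b, "key").join(c, "key")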
Spark SQL supports a very basic join reordering optimization based on the raw
table data size; this was added a couple of major releases back.
And the "EXPLAIN EXTENDED <query>" command is a very informative tool to verify
whether the optimization is taking effect.
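For example, a hedged one-liner (table names illustrative) to dump the
extended plan:

  // Inspect the analyzed/optimized/physical plans for the join.
  sqlContext.sql("EXPLAIN EXTENDED SELECT * FROM a JOIN b ON a.k = b.k")
    .collect().foreach(println)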
From: Raajay [mailto:hih...@gmail.com]
Sent: Monday, October 12, 2015 8:37 AM
To: Cheng, Hao
Cc: Richard Eggert; Subhajit Purkayastha; User
Subject: Re: Saprk 1.5 - How to join 3 RDDs in a SQL DF?
Some weekend reading:
http://stackoverflow.com/questions/20022196/are-left-outer-joins-associative
Cheers
On Sun, Oct 11, 2015 a
You probably have to read the source code; I am not sure if there are any .ppt
files or slides.
Hao
From: VJ Anand [mailto:vjan...@sankia.com]
Sent: Monday, October 12, 2015 11:43 AM
To: Cheng, Hao
Cc: Raajay; user@spark.apache.org
Subject: Re: Join Order Optimization
Hi - Is there a design document
Sent: Monday, October 12, 2015 10:17 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Join Order Optimization
Hi Cheng,
Could you point me to the JIRA that introduced this change?
Also, is this SPARK-2211 the right issue to follow for cost-based optimization?
Thanks
Raajay
On Sun, Oct 11, 2015 at 7
I think the DataFrame API performs the same as the SQL API for multi-inserts,
if you don't use a cached table.
Hao
From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
Sent: Friday, October 9, 2015 3:09 PM
To: Cheng, Hao
Cc: user
Subject: Re: Insert via HiveContext is slow
Thanks Hao
I think that's a known performance issue (compared to Hive) of Spark SQL in
multi-inserts.
A workaround is to create a temporary cached table for the projection first,
and then do the multiple inserts based on the cached table, as sketched below.
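A hedged sketch of the workaround (table names and predicates are illustrative
only):

  // Cache the shared projection once, then run each insert against it.
  sqlContext.sql("CACHE TABLE projected AS SELECT key, value FROM source_table")
  sqlContext.sql("INSERT INTO TABLE target_a SELECT key, value FROM projected WHERE key % 2 = 0")
  sqlContext.sql("INSERT INTO TABLE target_b SELECT key, value FROM projected WHERE key % 2 = 1")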
We are actually working on a POC of some similar cases; hopefully it comes
Yes, they should be the same, as they are just different frontends over the
same optimization / execution.
-Original Message-
From: sanderg [mailto:s.gee...@wimionline.be]
Sent: Tuesday, September 22, 2015 10:06 PM
To: user@spark.apache.org
Subject: Performance Spark SQL vs Dataframe
A workable solution is probably to create your own SQLContext by extending the
class HiveContext, override the `analyzer`, and add your own rule to do the
hacking.
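A rough sketch against Spark 1.5-era internals (exact signatures vary by
version, so treat this as a starting point, not a recipe; since `analyzer` and
friends are protected[sql], the class has to live under the
org.apache.spark.sql package):

  package org.apache.spark.sql.hive

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.catalyst.analysis.Analyzer
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  class MySQLContext(sc: SparkContext) extends HiveContext(sc) {

    // Placeholder rule: rewrite the logical plan however you need.
    object MyRule extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    // NB: a real implementation would also keep HiveContext's own
    // extended resolution rules, not just replace them.
    override protected[sql] lazy val analyzer: Analyzer =
      new Analyzer(catalog, functionRegistry, conf) {
        override val extendedResolutionRules = MyRule :: Nil
      }
  }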
From: r7raul1...@163.com [mailto:r7raul1...@163.com]
Sent: Thursday, September 17, 2015 11:08 AM
To: Cheng, Hao; user
Subject: Re
Catalyst TreeNode is a very fundamental API; I'm not sure what kind of hook you
need. A concrete example would make your requirement easier to understand.
Hao
From: r7raul1...@163.com [mailto:r7raul1...@163.com]
Sent: Thursday, September 17, 2015 10:54 AM
To: user
Subject: spark sql hook
I
From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 2:17 PM
To: Cheng, Hao
Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with
spark 1.4.1 SQL
Thanks Hao for the reply.
I turned merge sort join off
Sent: Friday, September 11, 2015 3:39 PM
To: Todd
Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+
compared with spark 1.4.1 SQL
I add the following two options:
spark.sql.planner.sortMergeJoin=false
It is not a big surprise that SMJ is slower than HashJoin, as we do not
fully utilize the sorting yet; more details can be found at
https://issues.apache.org/jira/browse/SPARK-2926 .
Anyway, can you disable sort merge join via
"spark.sql.planner.sortMergeJoin=false" in Spark 1.5, and
Would it help to add JVM options like:
-XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, September 10, 2015 5:31 AM
To: Sandy Ryza
Cc: user@spark.apache.org
Subject: Re: Driver OOM after upgrading to 1.5
It's
Hi, can you try something like:
val rowRDD = sc.textFile("/user/spark/short_model").map { line =>
  val p = line.split("\t")
  if (p.length >= 72) {
    Row(p(0), p(1)…)
  } else {
    throw new RuntimeException(s"failed in parsing $line")
  }
}
From the log
Ok, I see, thanks for the correction, but this should be optimized.
From: Shixiong Zhu [mailto:zsxw...@gmail.com]
Sent: Tuesday, August 25, 2015 2:08 PM
To: Cheng, Hao
Cc: Jeff Zhang; user@spark.apache.org
Subject: Re: DataFrame#show cost 2 Spark Jobs ?
That's two jobs. `SparkPlan.executeTake
Oh, sorry, I missed reading your reply!
I know the minimum number of tasks will be 2 for scanning, but Jeff is talking
about 2 jobs, not 2 tasks.
From: Shixiong Zhu [mailto:zsxw...@gmail.com]
Sent: Tuesday, August 25, 2015 1:29 PM
To: Cheng, Hao
Cc: Jeff Zhang; user@spark.apache.org
Subject: Re: DataFrame
Did you register the temp table via beeline or in a new Spark SQL CLI?
As far as I know, a temp table cannot cross HiveContext instances.
Hao
From: Udit Mehta [mailto:ume...@groupon.com]
Sent: Wednesday, August 26, 2015 8:19 AM
To: user
Subject: Spark thrift server on yarn
Hi,
I am trying to start a
The first job is to infer the JSON schema, and the second one is the query you
actually mean to run.
You can provide the schema while loading the JSON file, like below:
sqlContext.read.schema(xxx).json("…")
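For instance, a hedged sketch with an explicit schema (field names are
illustrative), which skips the inference job entirely:

  import org.apache.spark.sql.types._

  // Declare the schema up front instead of letting Spark scan the file.
  val schema = StructType(Seq(
    StructField("id",   LongType,   nullable = true),
    StructField("name", StringType, nullable = true)))

  val df = sqlContext.read.schema(schema).json("/path/to/file.json")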
Hao
From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Monday, August 24, 2015 6:20 PM
To:
And be sure hive-site.xml is on the classpath or under
$SPARK_HOME/conf.
Hao
From: Ishwardeep Singh [mailto:ishwardeep.si...@impetus.co.in]
Sent: Monday, August 24, 2015 8:57 PM
To: user
Subject: Re: Loading already existing tables in spark shell
Hi Jeetendra,
I faced
loading the data for JSON, which probably causes a longer ramp-up time with a
large number of files/partitions.
From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Tuesday, August 25, 2015 8:11 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: DataFrame#show cost 2 Spark Jobs ?
Hi Cheng,
I
Yes, check the source code under:
https://github.com/apache/spark/tree/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst
From: Todd [mailto:bit1...@163.com]
Sent: Tuesday, August 25, 2015 1:01 PM
To: user@spark.apache.org
Subject: Test case for the spark sql catalyst
Hi, Are
Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to false.
BTW, which version are you using?
Hao
From: Jerrick Hoang [mailto:jerrickho...@gmail.com]
Sent: Thursday, August 20, 2015 12:16 PM
To: Philip Weaver
Cc: user
Subject: Re: Spark Sql behaves strangely with tables with
Sent: Thursday, August 20, 2015 1:46 PM
To: Cheng, Hao
Cc: Philip Weaver; user
Subject: Re: Spark Sql behaves strangely with tables with a lot of partitions
I cloned from ToT after the 1.5.0 cutoff. I noticed there were a couple of CLs
trying to speed up Spark SQL with tables with a huge number of partitions; I've
made
Refreshing a table only works for Spark SQL data sources in my understanding;
apparently here, you're querying a Hive table.
Can you try to create a table like:
CREATE TEMPORARY TABLE parquetTable (a int, b string)
USING org.apache.spark.sql.parquet.DefaultSource
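In the 1.x data source DDL this is presumably followed by an OPTIONS clause
pointing at the data; a hedged, runnable form (path illustrative):

  sqlContext.sql("""CREATE TEMPORARY TABLE parquetTable (a int, b string)
                   |USING org.apache.spark.sql.parquet.DefaultSource
                   |OPTIONS (path '/path/to/parquet')""".stripMargin)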
That's a good question. We don't support reading small files in a single
partition yet, but it's definitely an issue we need to optimize; do you mind
creating a JIRA issue for this? Hopefully we can merge it into the 1.6 release.
200 is the default partition number for parallel tasks after the
Definitely worth trying. And you can sort the records before writing out; then
you will get Parquet files without overlapping keys.
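A one-line sketch of that (column name and path are illustrative):

  // Sort by the key before the write so each output file covers a
  // distinct, non-overlapping key range.
  df.sort("key").write.parquet("/path/to/output")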
Let us know if that helps.
Hao
From: Philip Weaver [mailto:philip.wea...@gmail.com]
Sent: Wednesday, August 12, 2015 4:05 AM
To: Cheng Lian
Cc: user
Have you ever tried querying "select * from temp_table" from the spark shell?
Or can you try the --jars option when starting the spark shell?
From: Srikanth [mailto:srikanth...@gmail.com]
Sent: Thursday, July 16, 2015 9:36 AM
To: user
Subject: Re: HiveThriftServer2.startWithContext error with
Actually it's supposed to be part of the Spark 1.5 release; see
https://issues.apache.org/jira/browse/SPARK-8230
You're definitely welcome to contribute to it; let me know if you have any
questions about implementing it.
Cheng Hao
-Original Message-
From: pedro [mailto:ski.rodrig...@gmail.com
Can you describe how you cached the tables? In another HiveContext? AFAIK, a
cached table is only visible within the same HiveContext, so you probably need
to execute a SQL query like
"CACHE TABLE mytable AS SELECT xxx" in the JDBC connection as well.
Cheng Hao
From: Brandon White
So you're using different HiveContext instances for the caching. We would not
expect to see the tables cached with the other HiveContext instance.
From: Brandon White [mailto:bwwintheho...@gmail.com]
Sent: Wednesday, July 15, 2015 8:48 AM
To: Cheng, Hao
Cc: user
Subject: Re: How do you
Never mind, I’ve created the jira issue at
https://issues.apache.org/jira/browse/SPARK-8972.
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Friday, July 10, 2015 9:15 AM
To: yana.kadiy...@gmail.com; ayan guha
Cc: user
Subject: RE: [SparkSQL] Incorrect ROLLUP results
Yes, this is a bug; do you mind creating a JIRA issue for this? I will fix it
ASAP.
BTW, what's your Spark version?
From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Friday, July 10, 2015 12:16 AM
To: ayan guha
Cc: user
Subject: Re: [SparkSQL] Incorrect ROLLUP results
dataframe.limit(1).selectExpr(xxx).collect()?
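A hedged expansion of that one-liner (UDF and column names are illustrative;
the UDF itself must be available to a HiveContext-backed DataFrame):

  // Apply the Hive UDF to a single row only, rather than the whole table.
  val firstRow = dataframe.limit(1).selectExpr("my_hive_udf(col1)").collect()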
-Original Message-
From: chrish2312 [mailto:c...@palantir.com]
Sent: Wednesday, July 8, 2015 6:20 AM
To: user@spark.apache.org
Subject: Hive UDFs
I know the typical way to apply a hive UDF to a dataframe is basically
something like:
Caused by: java.lang.NoClassDefFoundError: Could not initialize class
org.apache.derby.jdbc.EmbeddedDriver
It is usually included in the assembly jar; not sure what's wrong. But can you
try adding the Derby jar to the driver classpath and running again?
-Original Message-
From: bdev
Yes, it should be with HiveContext, not SQLContext.
From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Tuesday, June 23, 2015 2:51 AM
To: smazumder
Cc: user
Subject: Re: Support for Windowing and Analytics functions in Spark SQL
1.4 supports it
On 23 Jun 2015 02:59, Sourav Mazumder
It's actually not that tricky.
SPARK_WORKER_CORES is the max size of the executor's task thread pool, as in
"one executor with 32 cores can execute 32 tasks simultaneously". Spark doesn't
care how many real physical CPUs/cores you have (the OS does), so
Yes, it is thread safe. That’s how Spark SQL JDBC Server works.
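As a hedged illustration of that claim (table name illustrative), one shared
SQLContext can serve several threads:

  import java.util.concurrent.Executors

  // Four threads issuing queries against the same shared context.
  val pool = Executors.newFixedThreadPool(4)
  (1 to 4).foreach { i =>
    pool.submit(new Runnable {
      override def run(): Unit =
        sqlContext.sql(s"SELECT count(*) FROM my_table WHERE part = $i").show()
    })
  }
  pool.shutdown()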
Cheng Hao
From: V Dineshkumar [mailto:developer.dines...@gmail.com]
Sent: Wednesday, June 17, 2015 9:44 PM
To: user@spark.apache.org
Subject: Is HiveContext Thread Safe?
Hi,
I have a HiveContext which I am using in multiple
It seems you're hitting the self-join case; currently Spark SQL won't cache any
result/logical tree for further analysis or computation of a self-join. Since
the logical tree is huge, it's understandable that generating its tree string
recursively takes a long time. And I also doubt the computation can finish
Not sure if the Spark RDD API will provide a way to fetch records one by one
from the final result set, instead of pulling them all (or whole partitions)
into driver memory.
Seems like a big change.
From: Cheng Lian [mailto:l...@databricks.com]
Sent: Friday, June 12, 2015 3:51 PM
To:
Not sure if Spark Core will provide an API to fetch records one by one from
the block manager, instead of pulling them all into driver memory.
From: Cheng Lian [mailto:l...@databricks.com]
Sent: Friday, June 12, 2015 3:51 PM
To: 姜超才; Hester wang; user@spark.apache.org
Subject: Re: Re:
Is it a large result set returned from the Thrift Server? And can you paste
the SQL and the physical plan?
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, June 9, 2015 12:01 PM
To: Sourav Mazumder
Cc: user
Subject: Re: Spark SQL with Thrift Server is very very slow and finally failing
Confirmed: with the latest master, we don't support complex data types for
simple Hive UDFs. Do you mind filing an issue in JIRA?
-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Friday, June 5, 2015 12:35 PM
To: ogoh; user@spark.apache.org
Subject: RE: SparkSQL : using
Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0?
-Original Message-
From: ogoh [mailto:oke...@gmail.com]
Sent: Friday, June 5, 2015 10:10 AM
To: user@spark.apache.org
Subject: SparkSQL : using Hive UDF returning Map throws error:
scala.MatchError: interface
Yes, but be sure to put hive-site.xml on your classpath.
Did you run into any problems?
Cheng Hao
From: Sanjay Subramanian [mailto:sanjaysubraman...@yahoo.com.INVALID]
Sent: Thursday, May 28, 2015 8:53 AM
To: user
Subject: Pointing SparkSQL to existing Hive Metadata with data file locations
Thanks for reporting this.
We intend to support multiple metastore versions in a single build
(hive-0.13.1) by introducing the IsolatedClientLoader, but you're probably
hitting a bug; please file a JIRA issue for this.
I will keep investigating this as well.
Hao
From: Mark Hamstra
Did you forget to import the implicit functions/classes?
import sqlContext.implicits._
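For context, a hedged version of the inferred-schema flow (case class, path
and fields are illustrative); the implicits import is what enables rdd.toDF():

  case class Person(name: String, age: Int)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._

  // Infer the schema from the case class and register a temp table.
  val people = sc.textFile("/path/to/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))
    .toDF()
  people.registerTempTable("people")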
From: Rajdeep Dua [mailto:rajdeep@gmail.com]
Sent: Monday, May 18, 2015 8:08 AM
To: user@spark.apache.org
Subject: InferredSchema Example in Spark-SQL
Hi All,
Was trying the inferred schema Spark example
Typo? Should be .toDF(), not .toRD()
From: Ram Sriharsha [mailto:sriharsha@gmail.com]
Sent: Monday, May 18, 2015 8:31 AM
To: Rajdeep Dua
Cc: user
Subject: Re: InferredSchema Example in Spark-SQL
you mean toDF()? (toDF converts the RDD to a DataFrame, in this case inferring
the schema from the
Spark SQL just takes JDBC as a new data source, the same as we need to support
loading data from a .csv or .json file.
From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
Sent: Friday, May 15, 2015 2:30 PM
To: User
Subject: What's the advantage features of Spark SQL(JDBC)
Hi All,
Comparing
You can probably try something like:
val df = sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key=T2.key
group by c1")
df.cache() // Cache the result, but it's a lazy execution.
df.registerTempTable("my_result")
sqlContext.sql("select * from my_result where c1=1").collect // the cache
Yes.
From: Yi Zhang [mailto:zhangy...@yahoo.com]
Sent: Friday, May 15, 2015 2:51 PM
To: Cheng, Hao; User
Subject: Re: What's the advantage features of Spark SQL(JDBC)
@Hao,
As you said, there is no particular advantage for JDBC; it just provides a
unified API to support different data sources
To: Cheng, Hao; Wang, Daoyuan; Olivier Girardot; user
Subject: Re: Re: sparksql running slow while joining_2_tables.
Hi guys,
attached are the pics of the physical plan and the logs. Thanks.
Thanks & best regards!
罗辉 San.Luo
- Original Message -
From: Cheng, Hao hao.ch
Can you print out the physical plan?
EXPLAIN SELECT xxx…
From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Monday, May 4, 2015 9:08 PM
To: Olivier Girardot; user
Subject: Re: Re: sparksql running slow while joining 2 tables.
Hi Olivier,
Spark 1.3.1, with Java 1.8.0_45,
and 2 pics attached.
I assume you’re using the DataFrame API within your application.
sql("SELECT…").explain(true)
From: Wang, Daoyuan
Sent: Tuesday, May 5, 2015 10:16 AM
To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user
Subject: RE: Re: RE: Re: Re: sparksql running slow while joining_2_tables.
You can use
Or, have you ever tried a broadcast join?
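One hedged way to encourage that in this era (threshold in bytes; table names
illustrative):

  // Raise the auto-broadcast threshold so the smaller table is shipped
  // to every executor instead of being shuffled.
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
    (100 * 1024 * 1024).toString)
  val joined = sqlContext.sql(
    "SELECT * FROM big_table b JOIN small_table s ON b.key = s.key")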
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Tuesday, May 5, 2015 8:33 AM
To: luohui20...@sina.com; Olivier Girardot; user
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
Can you print out the physical plan?
EXPLAIN SELECT xxx
Hi, can you describe a little how the ThriftServer crashed, or the steps to
reproduce it? It's probably a bug in the ThriftServer.
Thanks,
From: guoqing0...@yahoo.com.hk [mailto:guoqing0...@yahoo.com.hk]
Sent: Friday, April 24, 2015 9:55 AM
To: Arush Kharbanda
Cc: user
Subject: Re: Re: problem
The DataFrame API should be perfectly helpful in this case.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
A code snippet would look like:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
, but that's still ongoing.
Cheng Hao
From: Masf [mailto:masfwo...@gmail.com]
Sent: Thursday, April 2, 2015 11:47 PM
To: user@spark.apache.org
Subject: Spark SQL. Memory consumption
Hi.
I'm using Spark SQL 1.2. I have this query:
CREATE TABLE test_MA STORED AS PARQUET AS
SELECT
This is a very interesting issue. The root cause of the lower performance is
probably that, for a Scala UDF, Spark SQL converts the data from the internal
representation to the Scala representation via Scala reflection, recursively.
Can you create a JIRA issue to track this? I can start to work on
Not so sure of your intention, but do you mean something like: SELECT
sum(val1), sum(val2) FROM table GROUP BY src, dest?
-Original Message-
From: Shailesh Birari [mailto:sbirar...@gmail.com]
Sent: Friday, March 20, 2015 9:31 AM
To: user@spark.apache.org
Subject: Spark SQL Self join with agreegate
It seems the elasticsearch-hadoop project was built against an old version of
Spark, and then you upgraded the Spark version in the execution env; as far as
I know, StructField changed its definition in Spark 1.2. Can you confirm the
version problem first?
From: Todd Nist [mailto:tsind...@gmail.com]
Sent:
Or you need to specify the jars, either in the configuration or via:
bin/spark-sql --jars mysql-connector-xx.jar
From: fightf...@163.com [mailto:fightf...@163.com]
Sent: Monday, March 16, 2015 2:04 PM
To: sandeep vura; Ted Yu
Cc: user
Subject: Re: Re: Unable to instantiate
It doesn't take effect if you just put jar files under the lib-managed/jars
folder; you need to put them on the classpath explicitly.
From: sandeep vura [mailto:sandeepv...@gmail.com]
Sent: Monday, March 16, 2015 2:21 PM
To: Cheng, Hao
Cc: fightf...@163.com; Ted Yu; user
Subject: Re: Re: Unable
Check the configuration file under
$SPARK_HOME/conf/spark-xxx.conf?
Cheng Hao
From: Grandl Robert [mailto:rgra...@yahoo.com.INVALID]
Sent: Thursday, March 12, 2015 5:07 AM
To: user@spark.apache.org
Subject: Spark SQL using Hive metastore
Hi guys,
I am a newbie in running Spark SQL / Spark. My goal
You can add the additional jar when submitting your job, something like:
./bin/spark-submit --jars xx.jar …
More options can be listed by just typing ./bin/spark-submit
From: shahab [mailto:shahab.mok...@gmail.com]
Sent: Tuesday, March 10, 2015 8:48 PM
To: user@spark.apache.org
Subject: Does
Currently, Spark SQL doesn't provide an interface for developing custom UDTFs,
but it works seamlessly with Hive UDTFs.
I am working on the UDTF refactoring for Spark SQL; hopefully it will provide a
Hive-independent UDTF soon after that.
From: shahab [mailto:shahab.mok...@gmail.com]
Sent:
/pull/3247
From: shahab [mailto:shahab.mok...@gmail.com]
Sent: Wednesday, March 11, 2015 1:44 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how?
Thanks Hao,
But my question concerns UDAF (user-defined aggregation functions), not UDTF
I am not so sure Hive supports changing the metastore after it is initialized;
I guess not. Spark SQL totally relies on the Hive metastore in HiveContext;
that's probably why it doesn't work as expected for Q1.
BTW, in most cases, people configure the metastore settings in
hive-site.xml, and will not
Intel has a prototype for doing this; SaiSai and Jason are the authors.
You can probably ask them for some materials.
From: Mohit Anchlia [mailto:mohitanch...@gmail.com]
Sent: Wednesday, March 11, 2015 8:12 AM
To: user@spark.apache.org
Subject: SQL with Spark Streaming
Does Spark Streaming also
Can you run the query via Hive? Let's first confirm whether it's a bug of
Spark SQL or of your PHP code.
-Original Message-
From: fanooos [mailto:dev.fano...@gmail.com]
Sent: Thursday, March 5, 2015 4:57 PM
To: user@spark.apache.org
Subject: Connection PHP application to Spark Sql thrift server
We
I've tried with the latest code and it seems to work; which version are you using, Shahab?
From: yana [mailto:yana.kadiy...@gmail.com]
Sent: Wednesday, March 4, 2015 8:47 PM
To: shahab; user@spark.apache.org
Subject: RE: Does SparkSQL support . having count (fieldname) in SQL
statement?
I think the
Can you provide the detailed failure call stack?
From: shahab [mailto:shahab.mok...@gmail.com]
Sent: Tuesday, March 3, 2015 3:52 PM
To: user@spark.apache.org
Subject: Supporting Hive features in Spark SQL Thrift JDBC server
Hi,
According to Spark SQL documentation, Spark SQL supports the
Using where('age >= 10 || 'age <= 4) instead.
-Original Message-
From: Guillermo Ortiz [mailto:konstt2...@gmail.com]
Sent: Tuesday, March 3, 2015 5:14 PM
To: user
Subject: SparkSQL, executing an OR
I'm trying to execute a query with Spark.
(Example from the Spark Documentation)
val teenagers
Hive UDFs are only applicable for HiveContext and its subclass instances; is
CassandraAwareSQLContext a direct subclass of HiveContext or of SQLContext?
From: shahab [mailto:shahab.mok...@gmail.com]
Sent: Tuesday, March 3, 2015 5:10 PM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re
Use the SchemaRDD / DataFrame API via HiveContext.
Assuming you're using the latest code, something probably like:
val hc = new HiveContext(sc)
import hc.implicits._
existedRdd.toDF().insertInto("hivetable")
or
existedRdd.toDF().registerTempTable("mydata")
hc.sql("insert into hivetable as select xxx
As the call stack shows, the MongoDB connector is not compatible with the Spark
SQL Data Source interface. The Data Source API changed in 1.2; you probably
need to confirm which Spark version the MongoDB connector was built against.
By the way, a well-formatted call stack would be more
” while starting the spark shell.
From: Anusha Shamanur [mailto:anushas...@gmail.com]
Sent: Wednesday, March 4, 2015 5:07 AM
To: Cheng, Hao
Subject: Re: Spark SQL Thrift Server start exception :
java.lang.ClassNotFoundException:
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
Hi,
I am getting
instance.
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Tuesday, March 3, 2015 7:56 AM
To: Cheng, Hao; user
Subject: RE: Is SQLContext thread-safe?
Thanks for the response.
Then I have another question: when will we want to create multiple SQLContext
instances
Copy those jars into the $SPARK_HOME/lib/
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
see https://github.com/apache/spark/blob/master/bin/compute-classpath.sh#L120
-Original Message-
From: fanooos [mailto:dev.fano...@gmail.com]
Sent: Tuesday,
I am not so sure how Spark SQL is compiled in CDH, but if the -Phive and
-Phive-thriftserver flags weren't specified during the build, it most likely
will not work just by providing the Hive lib jars later on. For example, does
the HiveContext class exist in the assembly jar?
I am also quite
https://issues.apache.org/jira/browse/SPARK-2087
https://github.com/apache/spark/pull/4382
I am working on the prototype; it will be updated soon.
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Tuesday, March 3, 2015 8:32 AM
To: Cheng, Hao; user
Subject: RE
This is actually quite an open question. From my understanding, there are
probably ways to tune it, like:
* SQL configurations such as:

  Configuration Key                      Default Value
  spark.sql.autoBroadcastJoinThreshold   10 * 1024 * 1024
  spark.sql.defaultSizeInBytes           10 * 1024 * 1024 + 1
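For what it's worth, a hedged example of adjusting those at runtime (values
are illustrative, not recommendations):

  // Both knobs accept byte counts, passed as strings.
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
    (10 * 1024 * 1024).toString)
  sqlContext.setConf("spark.sql.defaultSizeInBytes",
    (10 * 1024 * 1024 + 1).toString)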
Yes, it is thread safe; at least it's supposed to be.
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Monday, March 2, 2015 4:43 PM
To: user
Subject: Is SQLContext thread-safe?
Hi, is it safe to use the same SQLContext to do Select operations in different
threads
$.main(SparkSQLCLIDriver.scala:202)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
Thanks,
Cheng Hao