I am trying to run a Hive query from Spark using HiveContext. Here is the
code:
val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")
conf.set("spark.executor.extraClassPath",
"/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
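A minimal runnable sketch of what this presumably looks like with the string
literals restored (the quotes were lost in transit); the parcel path is the one
from the message, and the SHOW TABLES at the end is only an illustrative smoke
test:
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("HiveSparkIntegrationTest")
  .set("spark.executor.extraClassPath",
    "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
val sc = new SparkContext(conf)
// HiveContext reads hive-site.xml from the classpath to reach the metastore
val hiveContext = new HiveContext(sc)
hiveContext.sql("SHOW TABLES").collect().foreach(println)
```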
On Fri, Mar 6, 2015 at 2:47 PM, nitinkak001 nitinkak...@gmail.com wrote:
I am trying to run a Hive query from Spark using HiveContext. Here is the
code:
val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")
conf.set("spark.executor.extraClassPath",
I added it.
On Fri, Mar 6, 2015 at 2:40 PM, Burak Yavuz brk...@gmail.com wrote:
Hi Koert,
Would you like to register this on spark-packages.org?
Burak
On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote:
Currently Spark provides many excellent algorithms for operations
Hi,
Reading through the Spark Streaming Programming Guide, I read in the
"Design Patterns for using foreachRDD" section:
"Finally, this can be further optimized by reusing connection objects
across multiple RDDs/batches.
One can maintain a static pool of connection objects that can be reused as
RDDs of"
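For reference, the pattern the guide describes looks roughly like this;
ConnectionPool here is a placeholder for any static, lazily initialized pool
(it is not a Spark API), and dstream/connection.send are assumed names:
```scala
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // borrow a connection from a static, lazily initialized pool
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return it for reuse across batches
  }
}
```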
Hm, why do you expect a factory method over a constructor? No, you
instantiate a SparkContext (if not working in the shell).
When you write your own program, you parse your own command-line args.
--master yarn-client doesn't do anything unless you make it do so.
That is an arg to *Spark*
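In other words (a minimal sketch; the app name and master here are chosen only
for illustration):
```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // parse your own command-line args here; Spark only sees what you set on the conf
    val conf = new SparkConf().setAppName("MyApp").setMaster("yarn-client")
    val sc = new SparkContext(conf)
    // ... your job ...
    sc.stop()
  }
}
```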
You can read this document:
http://spark.apache.org/docs/latest/building-spark.html
It might solve your question.
And if you compile Spark with Maven, you might need to set the Maven options
like this before you start compiling:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M"
Are you letting Spark download and run zinc for you? Maybe that copy
is incomplete or corrupted. You can try removing the downloaded zinc
from build/ and try again.
Or run your own zinc.
On Fri, Mar 6, 2015 at 7:51 AM, Night Wolf nightwolf...@gmail.com wrote:
Hey,
Trying to build latest spark
Hi tsingfu,
Thanks for your reply. I tried with other columns, but the problem is the same
with other Integer columns.
Regards,
Gaurav
Found this thread:
http://search-hadoop.com/m/JW1q5HMrge2
Cheers
On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com wrote:
This was discussed in the past and viewed as dangerous to enable. The
biggest problem, by far, comes when you have a job that outputs M
partitions,
Actually, besides setting spark.hadoop.validateOutputSpecs to false to disable
output validation for the whole program,
the Spark implementation uses a dynamic variable (in object PairRDDFunctions)
internally to disable it on a case-by-case basis:
val disableOutputSpecValidation:
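Note that disableOutputSpecValidation is internal to Spark, so application code
would normally use the configuration property instead. A sketch of the
program-wide switch:
```scala
import org.apache.spark.{SparkConf, SparkContext}

// skip the "output directory already exists" check for the whole program
val conf = new SparkConf()
  .setAppName("ValidationOff")
  .set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)
```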
Hi,
I submit spark jobs in yarn-cluster mode remotely from java code by calling
Client.submitApplication(). For some reason I want to use 1.3.0 jars on the
client side (e.g. spark-yarn_2.10-1.3.0.jar) but I have
spark-assembly-1.2.1* on the cluster.
The problem is that the ApplicationMaster can't
Does Spark-SQL require installation of Hive for it to run correctly or not?
I could not tell from this statement:
https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
Thank you,
Edmon
This was discussed in the past and viewed as dangerous to enable. The
biggest problem, by far, comes when you have a job that outputs M
partitions, 'overwriting' a directory of data containing N > M old
partitions. You suddenly have a mix of new and old data.
It doesn't match Hadoop's semantics
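The usual workaround, sketched below assuming sc and rdd are already in scope
and with an illustrative output path, is to delete the target directory
explicitly before saving. That avoids the old/new mix described above (nothing
old survives), but it is not atomic: a job that fails midway leaves you with no
data at all.
```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val outputPath = new Path("hdfs:///tmp/output") // illustrative path
val fs = outputPath.getFileSystem(sc.hadoopConfiguration)
if (fs.exists(outputPath)) fs.delete(outputPath, true) // recursive delete
rdd.saveAsTextFile(outputPath.toString)
```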
Adding support for an overwrite flag would make saveAsXXFile more user-friendly.
Cheers
On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:
Hi folks,
I found that RDD.saveXXFile has no overwrite flag, which I think would be very
helpful. Is there any reason for this?
--
The SchemaRDD supports the storage of user-defined classes. However, in
order to do that, the user class needs to extend the UserDefinedType interface
(see for example VectorUDT in org.apache.spark.mllib.linalg).
My question is: will the new DataFrame structure (to be released in Spark
1.3)
Hi Edmon,
No, you do not need to install Hive to use Spark SQL.
Thanks,
Yin
On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:
Does Spark-SQL require installation of Hive for it to run correctly or
not?
I could not tell from this statement:
Hello,
I want to execute an HQL script through the `spark-sql` command; my script
contains:
```
ALTER TABLE xxx
DROP PARTITION (date_key = ${hiveconf:CUR_DATE});
```
when I execute
```
spark-sql -f script.hql -hiveconf CUR_DATE=20150119
```
It throws an error like
```
cannot recognize input near
Hi there:
Yeah, I came to that same conclusion after tuning the Spark SQL shuffle
parameter. I also cut out some classes I was using to parse my dataset and
finally created the schema with only the fields needed for my model (before
that I was creating it with 63 fields while I just needed 15).
So I came
Dude,
please, attach the execution plan of the query and details about the
indexes.
2015-03-06 9:07 GMT-03:00 anu anamika.guo...@gmail.com:
I have a query that's like:
Could you help in providing me pointers as to how to start to optimize it
w.r.t. spark sql:
sqlContext.sql("""
SELECT
Do you have a reference paper for the algorithm implemented in TSQR.scala?
On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
There are a couple of solvers that I've written that are part of the AMPLab
ml-matrix repo [1,2]. These aren't part of MLlib yet
Currently Spark provides many excellent algorithms for operations per key,
as long as the data sent to the reducers per key fits in memory. Operations
like combineByKey, reduceByKey and foldByKey rely on pushing the operation
map-side so that the data reduce-side is small. And groupByKey simply
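A small illustration of the difference (assuming an existing sc):
```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey combines map-side, so only one value per key per partition
// crosses the network
val summed = pairs.reduceByKey(_ + _)

// groupByKey ships every value and must hold all of a key's values in
// memory on the reduce side
val grouped = pairs.groupByKey().mapValues(_.sum)
```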
Since we already have the spark.hadoop.validateOutputSpecs config, I think
there is not much need to expose disableOutputSpecValidation.
Cheers
On Fri, Mar 6, 2015 at 7:34 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
Actually, besides setting spark.hadoop.validateOutputSpecs to false to
disable
Hi Cesar,
Yes, you can define a UDT with the new DataFrame, the same way that
SchemaRDD did.
Jaonary
On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores ces...@gmail.com wrote:
The SchemaRDD supports the storage of user-defined classes. However, in
order to do that, the user class needs to
Hi Shivaram,
Thank you for the link. I'm trying to figure out how I can port this to
MLlib. Maybe you can help me understand how the pieces fit together.
Currently, MLlib has different types of distributed matrices:
BlockMatrix, CoordinateMatrix, IndexedRowMatrix and RowMatrix. Which one
-Pscala-2.11 and -Dscala-2.11 will happen to do the same thing for this profile.
Why are you running 'install package' and not just 'install'? Probably
doesn't matter.
This sounds like you are trying to only build core without building
everything else, which you can't do in general unless you
I have a query that's like:
Could you help in providing me pointers as to how to start to optimize it
w.r.t. spark sql:
sqlContext.sql("""
SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE
FROM (
SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS
SDP_USAGE
Thanks, your advice is useful. I submitted my job from my Spark client, which
was configured with a simple config file, so it failed;
when I run my job on the service machine everything is okay.
On Fri, Mar 06, 2015 at 02:10:04PM +0530, Akhil Das wrote:
Looks like an issue with your yarn setup, could you
Hi folks,
I found that RDD.saveXXFile has no overwrite flag, which I think would be very
helpful. Is there any reason for this?
--
Best Regards
Jeff Zhang
It's not required, but even if you don't have Hive installed you probably
still want to use the HiveContext. From earlier in that doc:
In addition to the basic SQLContext, you can also create a HiveContext,
which provides a superset of the functionality provided by the basic
SQLContext.
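For example (a sketch assuming an existing SparkContext sc; without a Hive
install, HiveContext creates a local metastore_db and warehouse directory on
first use):
```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("SELECT COUNT(*) FROM src").collect().foreach(println)
```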
First, thanks to everyone for their assistance and recommendations.
@Marcelo
I applied the patch that you recommended and am now able to get into the
shell. Thank you, it worked great after I realized that the pom was pointing to
the 1.3.0-SNAPSHOT for the parent; I needed to bump that down to 1.2.1.
@Zhan
Hello,
I am using Spark 1.2.1 along with Hive 0.13.1.
I run some Hive queries using beeline and the Thrift server.
Queries I tested so far worked well except the following:
I want to export the query output into a file on either HDFS or the local fs
(ideally the local fs).
Is this not yet supported?
The
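One workaround sketch, assuming you can run the same query through a
HiveContext on Spark 1.2 (where sql() returns a SchemaRDD, i.e. an RDD[Row])
and with illustrative table and path names:
```scala
val rows = hiveContext.sql("SELECT key, value FROM src")
rows.map(_.mkString("\t"))              // tab-separated text, one row per line
    .saveAsTextFile("hdfs:///tmp/query_output")
```
For a small result you could instead collect() the rows on the driver and write
them to the local filesystem yourself.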
Hi,
I am filtering the first DStream by the values in a second DStream. I also want
to keep the values of the second DStream. I have done the following and am
having problems returning the new RDD:
val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1:
RDD[(String,String)], rdd2 : RDD[Int])
Hi Rares,
If you dig into the descriptions for the two jobs, you will probably see
something like:
Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
...
Job ID: 0
I have submitted this PR. You can merge it and try it:
https://github.com/apache/spark/pull/2077
On Thu, Jan 15, 2015 at 12:50 AM, Kuromatsu, Nobuyuki
n.kuroma...@jp.fujitsu.com wrote:
Hi
I want to visualize tasks and stages in order to analyze Spark jobs.
I know the necessary metrics are written
Do you mean "--hiveconf" (two dashes) instead of "-hiveconf" (one dash)?
Thanks.
Zhan Zhang
On Mar 6, 2015, at 4:20 AM, James alcaid1...@gmail.com wrote:
Hello,
I want to execute an HQL script through the `spark-sql` command; my script
contains:
```
ALTER TABLE xxx
DROP PARTITION
Hello,
I am using takeSample from the Scala Spark 1.2.1 shell:
scala> sc.textFile("README.md").takeSample(false, 3)
and I notice that two jobs are generated on the Spark Jobs page:
Job Id Description
1 takeSample at console:13
0 takeSample at console:13
Any ideas why the two jobs are needed?
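For what it's worth, takeSample first runs a count() over the RDD to size the
sample, then a second job to gather the sampled elements (resampling if the
first pass comes up short), which accounts for the two jobs:
```scala
val lines = sc.textFile("README.md")
// one job: count() to determine the RDD size
// another job: collect the randomly sampled elements
val sample = lines.takeSample(withReplacement = false, num = 3)
```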
Hi,
For creating a Hive table, do I need to add hive-site.xml to the spark/conf
directory?
On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust mich...@databricks.com
wrote:
It's not required, but even if you don't have Hive installed you probably
still want to use the HiveContext. From earlier in
Sections 3, 4, and 5 in http://www.netlib.org/lapack/lawnspdf/lawn204.pdf are a
good reference.
Shivaram
On Mar 6, 2015 9:17 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Do you have a reference paper for the algorithm implemented in TSQR.scala?
On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman
Hi Todd,
Looks like the thrift server can connect to the metastore, but something is
wrong in the executors. You can try to get the log with "yarn logs
-applicationId xxx" to check why it failed. If there is no log (master or
executor is not started at all), you can go to the RM webpage, click the
On Fri, Mar 6, 2015 at 11:58 AM, sandeep vura sandeepv...@gmail.com wrote:
Can I get a document on how to create that setup? I mean, I need Hive
integration on Spark.
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
Hi Zhan,
I applied the patch you recommended,
https://github.com/apache/spark/pull/3409, and it now works. It was failing
with this:
Exception message:
/hadoop/yarn/local/usercache/root/appcache/application_1425078697953_0020/container_1425078697953_0020_01_02/launch_container.sh:
line 14:
Sorry, my misunderstanding. Looks like it already worked. If you still hit the
hdp.version problem, you can try it :)
Thanks.
Zhan Zhang
On Mar 6, 2015, at 11:40 AM, Zhan Zhang
zzh...@hortonworks.com wrote:
You are using 1.2.1, right? If so, please add java-opts in
You are using 1.2.1, right? If so, please add a java-opts file in the conf
directory and give it a try.
[root@c6401 conf]# more java-opts
-Dhdp.version=2.2.2.0-2041
Thanks.
Zhan Zhang
On Mar 6, 2015, at 11:35 AM, Todd Nist
tsind...@gmail.com wrote:
Working great now, after applying that patch; thanks again.
On Fri, Mar 6, 2015 at 2:42 PM, Zhan Zhang zzh...@hortonworks.com wrote:
Sorry, my misunderstanding. Looks like it already worked. If you still hit
the hdp.version problem, you can try it :)
Thanks.
Zhan Zhang
On Mar 6, 2015,
Hi Koert,
Would you like to register this on spark-packages.org?
Burak
On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote:
Currently Spark provides many excellent algorithms for operations per key,
as long as the data sent to the reducers per key fits in memory. Operations
Only if you want to configure the connection to an existing Hive metastore.
On Fri, Mar 6, 2015 at 11:08 AM, sandeep vura sandeepv...@gmail.com wrote:
Hi,
For creating a Hive table, do I need to add hive-site.xml to the spark/conf
directory?
On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust
No, the UDT API is not a public API, as we have not stabilized the
implementation. For this reason it's only accessible to projects inside of
Spark.
On Fri, Mar 6, 2015 at 8:25 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi Cesar,
Yes, you can define a UDT with the new DataFrame, the same
On Fri, Mar 6, 2015 at 11:56 AM, sandeep vura sandeepv...@gmail.com wrote:
Yes, I want to link with an existing Hive metastore. Is that the right way to
link to the Hive metastore?
Yes.
Yes, this is the problem. I want to return an RDD, but it is abstract and I
cannot instantiate it. So what are the other options? I have two streams and I
want to filter one stream on the basis of the other, and I also want to keep
the value of the other stream. I have also tried join. But one stream has more
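If the second stream can be keyed to match the first, one sketch (all names
here are assumed, and the second stream is assumed keyed, unlike the RDD[Int]
in the original snippet) is to let transformWith return the RDD produced by a
join, rather than trying to instantiate an RDD directly:
```scala
import org.apache.spark.SparkContext._ // pair-RDD ops on Spark 1.2
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

def filterWithAnomaly(
    fileAndTime: DStream[(String, String)],
    anomaly: DStream[(String, Int)]): DStream[(String, (String, Int))] =
  fileAndTime.transformWith(anomaly,
    (rdd1: RDD[(String, String)], rdd2: RDD[(String, Int)]) =>
      rdd1.join(rdd2) // keeps the values from both streams, matched by key
  )
```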
It is probably the time taken by the system to figure out that the worker
is down. Could you check the logs to see what happens when you kill the
worker?
TD
On Wed, Mar 4, 2015 at 6:20 AM, Nastooh Avessta (navesta) nave...@cisco.com
wrote:
Indeed. And I am wondering if this switchover time
Tried that. No luck. Same error on the abt-interface jar. I can see Maven
downloaded that jar into my .m2 cache.
On Friday, March 6, 2015, 鹰 980548...@qq.com wrote:
try it with mvn -DskipTests -Pscala-2.11 clean install package
Hi all,
Is it possible to store Spark's shuffle files in a distributed memory store
like Tachyon instead of spilling them to disk?
Looks like an issue with your yarn setup, could you try doing a simple
example with spark-shell?
Start the spark shell as:
$ MASTER=yarn-client bin/spark-shell
Then, inside the spark-shell: sc.parallelize(1 to 1000).collect
If that doesn't work, then make sure your yarn services are up and running
and in