Since you are running in yarn-cluster mode and you are supplying the Spark
assembly jar file, there is no need to install Spark on each node. Is it
possible the two Spark jars have different versions?
Chester
Sent from my iPad
On Jul 16, 2014, at 22:49, cmti95035 cmti95...@gmail.com wrote:
Hi,
I implemented a simple KNN classifier. I can run it successfully on a
single sample, but it throws an error when run on an RDD of test samples.
The source code is attached. Looking forward to your reply! Best
wishes to you!
The following is the source code.
import math
from pyspark
Is this what you are looking for?
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html
According to the doc, it says: "Operator that acts as a sink for queries on
RDDs and can be used to store the output inside a directory of Parquet
files." This
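For reference, a minimal sketch of writing Parquet through the Spark SQL 1.0 Scala API (the Person class and paths are made-up placeholders); saveAsParquetFile is, as far as I can tell, what exercises that operator under the hood:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)   // hypothetical schema

val sqlContext = new SQLContext(sc)         // sc: an existing SparkContext
import sqlContext.createSchemaRDD           // implicit RDD -> SchemaRDD

val people = sc.textFile("people.txt")      // hypothetical input
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// Stores the rows as a directory of Parquet files.
people.saveAsParquetFile("people.parquet")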
They're all the same version. Actually, even without the --jars parameter it
gets the same error. It looks like it needs to copy the assembly jar for running
the example jar anyway during staging.
You can try the following in the spark-shell:
1. Run it in *Cluster mode* by going inside the spark directory:
$ SPARK_MASTER=spark://masterip:7077 ./bin/spark-shell
val textFile = sc.textFile("hdfs://masterip/data/blah.csv")
textFile.take(10).foreach(println)
2. Now try running in *Local mode*:
Hi All,
The function *mapPartitions* in RDD.scala
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
takes a boolean parameter *preservesPartitioning*. It seems that if that parameter is
passed as *false*, the passed function f will operate on the data only
Hi Kamal,
This is not what preservesPartitioning does -- actually, what it means is that
if the RDD has a Partitioner set (which means it's an RDD of key-value pairs
and the keys are grouped in a known way, e.g. hashed or range-partitioned),
your map function is not changing the partition of
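To illustrate (a sketch, not from the thread): when the function leaves keys unchanged, passing preservesPartitioning = true keeps the RDD's partitioner, so a later co-partitioned operation can skip a shuffle:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  .partitionBy(new HashPartitioner(4))   // partitioner is now set

// Only values change, so it is safe to declare partitioning preserved.
val doubled = pairs.mapPartitions(
  _.map { case (k, v) => (k, v * 2) },
  preservesPartitioning = true)

doubled.partitioner                      // still Some(HashPartitioner)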
Yes, both run in parallel. Random is a baseline initialization
implementation, which may ignore small clusters. k-means++ improves on
random initialization by adding weights to points far away from the
current candidates. You can view k-means|| as a more scalable version
of k-means++. We don't
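As a concrete (hedged) example of choosing between the modes via the MLlib 1.0 KMeans API; the input path and parameters here are made up:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("kmeans_data.txt")   // hypothetical input
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// initializationMode may be "random" or "k-means||" (the default).
val model = KMeans.train(points, 3, 20, 1, "k-means||")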
Hi all,
I am a newbie Spark user with many doubts, so sorry if this is a silly
question.
I am dealing with tabular data formatted as text files, so when I first
load the data, my code is like this:
case class data_class(
  V1: String,
  V2: String,
  V3: String,
  V4: String,
  V5: String,
1) This is a miss, unfortunately ... We will add support for
regularization and intercept in the coming v1.1. (JIRA:
https://issues.apache.org/jira/browse/SPARK-2550)
2) It has overflow problems in Python but not in Scala. We can
stabilize the computation by ensuring exp only takes a negative argument.
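The usual fix (a sketch of the idea, not the actual patch) is to branch on the sign of the margin so exp never receives a large positive input:

// Numerically stable logistic: exp() only ever sees non-positive inputs.
def sigmoid(margin: Double): Double =
  if (margin >= 0) 1.0 / (1.0 + math.exp(-margin))
  else {
    val e = math.exp(margin)
    e / (1.0 + e)
  }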
Seems like there is some sort of stream corruption, causing the Kryo read to
read a weird class name from the stream (the name "arl Fridtjof Rode" in
the exception cannot be a class!).
Not sure how to debug this.
@Patrick: Any idea?
On Wed, Jul 16, 2014 at 10:14 PM, Hao Wang wh.s...@gmail.com wrote:
Not sure if this helps, but it does seem to be part of a name in a
Wikipedia article, and Wikipedia is the data set. So something is
reading this class name from the data.
http://en.wikipedia.org/wiki/Carl_Fridtjof_Rode
On Thu, Jul 17, 2014 at 9:40 AM, Tathagata Das
tathagata.das1...@gmail.com
Thank you for your fast reply.
We are considering this Map[String, String] solution, but there are some
details that we have not worked out yet. What would happen if we have different
data types for different fields? Also, with this solution, we have to
repeat the field names for every row that we
Hi Neethu,
Your application is running in local mode, and that's the reason why you are
not seeing the driver app in the 8080 web UI. You can pass the master IP to
your pyspark and get it running in cluster mode.
e.g.: IPYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark --master
Hi,
I am getting the following error while trying to save a large dataset to S3
using the saveAsHadoopFile command with Apache Spark 1.0.
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
S3 PUT failed for
Hi All,
Currently we are reading (multiple) topics from Apache Kafka and storing
them in HBase (multiple tables; 1 tuple is stored in 4
different tables) using Twitter Storm.
But we are facing some performance issues with HBase,
so we are replacing *HBase* with *Parquet* files and *Storm* with
Hi all,
I'm using HDP 2.0 with YARN. I'm running both MapReduce and Spark jobs on this
cluster. Is it possible to use the Capacity Scheduler for managing Spark jobs
as well as MR jobs? I mean, I'm able to send an MR job to a specific
queue; may I do the same with a Spark job?
Thank you in advance.
1. You can put multiple Kafka topics in the same Kafka input stream. See
the example KafkaWordCount
https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala
.
However, they will all be read
Is it v0.9? Did you run in local mode? Try setting --driver-memory 4g
and repartition your data to match the number of CPU cores so that the
data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store
the data. Make sure there is enough memory for caching. -Xiangrui
On Thu, Jul 17, 2014 at
Hi, all
Yes, it's the name of a Wikipedia article. I am running the WikipediaPageRank
example from Spark Bagel.
I am wondering whether there is any relation to the buffer size of Kryo.
The PageRank sometimes finishes successfully, and sometimes not, because this
kind of Kryo exception happens too many times, which
Hi Akhil,
That fixed the problem...Thanks
Thanks & Regards,
Meethu M
On Thursday, 17 July 2014 2:26 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
Hi Neethu,
Your application is running in local mode and that's the reason why you are not
seeing the driver app in the 8080 web UI. You
Hi
I am trying to implement the belief propagation algorithm in GraphX using the
Pregel API.

def pregel[A]
    (initialMsg: A,
     maxIter: Int = Int.MaxValue,
     activeDir: EdgeDirection = EdgeDirection.Out)
    (vprog: (VertexId, VD, A) => VD,
Hi,
To migrate data from *HBase* to *Parquet* we used the following query through
*Impala*:

INSERT INTO table PARQUET_HASHTAGS(
  key, city_name, country_name, hashtag_date, hashtag_text,
  hashtag_source, hashtag_month, posted_time, hashtag_time,
  tweet_id, user_id, user_name,
  hashtag_year
)
Hi Xiangrui,
Yes, I am using Spark v0.9 and am not running it in local mode.
I did the memory setting using export SPARK_MEM=4G before starting the Spark
instance.
Also, previously I was starting it with -c 1, but I changed it to -c 12 since
it is a 12-core machine. It did bring down the time
Could someone please help me resolve the "This post has NOT been accepted by the
mailing list yet." issue? I registered and subscribed to the mailing list
many days ago but my post is still in the unaccepted state.
Hi TD,
I have been able to filter the first WindowedRDD, but I am not sure how to make
a generic filter. The larger window is 8 seconds and I want to fetch 4 seconds
based on the application timestamp. I have seen an earlier post which suggests a
timestamp-based window, but I am not sure how to make
Do we have any equivalent Scala functions available for NVL() and CASE
expressions to use in Spark SQL?
Added the below 2 lines just before the SQL query line -
*...*
*file1_schema.count;*
*file2_schema.count;*
*...*
and it started working. But I couldn't figure out the reason.
Can someone please explain? What was happening earlier, and what is
happening with the addition of these 2 lines?
~Sarath
On Thu,
What version are you running? Could you provide a jstack of the driver and
executor when it is hanging?
On Thu, Jul 17, 2014 at 10:55 AM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Added the below 2 lines just before the SQL query line -
*...*
*file1_schema.count;*
If you intern the string it will be more efficient, but still significantly
more expensive than the class-based approach.
** VERY EXPERIMENTAL **
We are working with EPFL on a lightweight syntax for naming the results of
Spark transformations in Scala (and are going to make it interoperate with
Hi guys,
I am sure some of you have a similar use case, and I want to know how you deal
with it. In our application, we want to check the previous state of some keys
and compare it with their current state.
AFAIK, Spark Streaming does not have key-value access. So currently what I am
doing is storing the previous
It's possible using the --queue argument of spark-submit. Unfortunately this is
not documented on http://spark.apache.org/docs/latest/running-on-yarn.html but
it appears if you just type spark-submit --help or spark-submit with no
arguments.
Matei
On Jul 17, 2014, at 2:33 AM, Konstantin
unsubscribe
From: Matei Zaharia matei.zaha...@gmail.com
To: user@spark.apache.org
Date: 07/17/2014 12:41 PM
Subject: Re: Spark scheduling with Capacity scheduler
It's possible using the --queue argument of spark-submit. Unfortunately
this is not documented on
Hi,
I am receiving the below error while submitting any Spark example or Scala
application. I'd really appreciate any help.
spark version = 1.0.0
hadoop version = 2.4.0
Windows/Standalone mode
14/07/17 22:13:19 INFO TaskSchedulerImpl: Cancelling stage 0
Exception in thread "main"
Hello Experts,
I am facing a performance problem when I use a UDF call in the query. Please
help me tune the query.
Please find the details below:
shark> select count(*) from table1;
OK
151096
Time taken: 7.242 seconds
shark> select count(*) from table2;
OK
938
Time taken: 1.273 seconds
Without
On Wed, Jul 16, 2014 at 12:36 PM, Matt Work Coarr
mattcoarr.w...@gmail.com wrote:
Thanks Marcelo, I'm not seeing anything in the logs that clearly explains
what's causing this to break.
One interesting point that we just discovered is that if we run the driver
and the slave (worker) on the
Please try
val parsedData3 =
  data3.repartition(12).map(_.split("\t").map(_.toDouble)).cache()
and check the storage and driver/executor memory in the web UI. Make
sure the data is fully cached.
-Xiangrui
On Thu, Jul 17, 2014 at 5:09 AM, Ravishankar Rajagopalan
viora...@gmail.com wrote:
Hi
hi TD,
Thanks for the solutions to my previous post... I am running into another
issue... I am getting data from a json file, and I am trying to parse it and
map it to a record given below
val jsonf
For accessing the previous version, I would do it the same way. :)
1. Can you elaborate on what you mean by that with an example? What do you
mean by accessing keys?
2. Yeah, that is hard to do without the ability to do point lookups into an
RDD, which we don't support yet. You could try embedding the
This is a basic Scala problem. You cannot apply toInt to Any. Try doing
toString.toInt.
For such Scala issues, I recommend trying things out in the Scala shell. For
example, you could have tried this out as follows:
[tdas @ Xion streaming] scala
Welcome to Scala version 2.10.3 (Java HotSpot(TM)
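For instance, a trivial sketch of the fix:

val x: Any = "42"
// x.toInt               // does not compile: toInt is not a member of Any
val n = x.toString.toInt // n: Int = 42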
Thanks Sean,
1) Yes, I am trying to run locally without Hadoop.
2) I also see the error in the provided link while launching spark-shell, but
after launch I am able to execute the same code I have in the sample
application: read any local file and perform some reduction operation. But not through
You have to define what range of records needs to be filtered out
in every windowed RDD, right? For example, when the DStream.window has data
from times 0-8 seconds by DStream time, you only want to filter out
data that falls into, say, 4-8 seconds by application time. This latter
Hi Sean
RE: Windows and Hadoop 2.4.x
Hortonworks - all the hype aside - only supports Windows Server 2008/2012.
So this general concept of supporting Windows is bunk.
Given that - and since the vast majority of Windows users do not happen to
have Windows Server on their laptop - do you have any
If your sendMsg function needs to know the incoming messages as well as the
vertex value, you could define VD to be a tuple of the vertex value and the
last received message. The vprog function would then store the incoming
messages into the tuple, allowing sendMsg to access them.
For example, if
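A minimal sketch of that pattern (assuming a graph: Graph[(Double, Double), Double] whose vertex attribute is (value, lastMsg); the arithmetic is purely illustrative):

import org.apache.spark.graphx._

val result = graph.pregel(0.0)(
  // vprog: fold the message into the value and remember the message itself
  (id, attr, msg) => (attr._1 + msg, msg),
  // sendMsg: can now read the previously received message via srcAttr._2
  triplet => Iterator((triplet.dstId, triplet.srcAttr._1 * triplet.srcAttr._2)),
  // mergeMsg
  (a, b) => a + b)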
What is the preferred way of adding a custom metrics sink to Spark? I noticed
that the Sink trait has been private since April, so I cannot simply extend
Sink in an outside package, but I would like to avoid having to create a
custom build of Spark. Is this possible?
I am probably the wrong person to ask, as I never use Hadoop on Windows. But
from looking at the code just now, it is clearly trying to accommodate
Windows shell commands. Yes, I would not be surprised if it still needs
Cygwin.
A slightly broader point is that ideally it doesn't matter whether Hadoop
Hi TD,
Thank you for the quick reply and for backing my approach. :)
1) The example is this:
1. In the first 2-second interval, after updateStateByKey, I get a few keys
and their states, say, (a -> 1, b -> 2, c -> 3)
2. In the following 2-second interval, I only receive c and d and their
values. But
Thanks Tathagata. So can I say the RDD size (from the stream) is the window
size, and the overlap between 2 adjacent RDDs is the slide size?
But I still don't understand what the batch size is, and why we need it, since
data processing is RDD by RDD, right?
And does Spark chop the data into RDDs at the very
Hi Xiangrui,
Thanks. I have taken your advice and set all 5 of my slaves to be
c3.4xlarge. In this case /mnt and /mnt2 have plenty of space by default. I
now do sc.textFile(blah).repartition(N).map(...).cache() with N=80,
spark.executor.memory set to 20g, and --driver-memory 20g. So far things
Thanks!
On Wed, Jul 16, 2014 at 6:34 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Have you taken a look at DStream.transformWith( ... )? That allows you to
apply arbitrary transformations between RDDs (of the same timestamp) of two
different streams.
So you can do something like this.
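For example, a sketch (stream1/stream2 and the join are placeholders):

import org.apache.spark.SparkContext._   // pair-RDD functions (join) pre-1.3
import org.apache.spark.rdd.RDD

// stream1, stream2: DStream[(String, Int)] on the same StreamingContext.
// The function is applied to the two RDDs of the same batch time.
val joined = stream1.transformWith(stream2,
  (rdd1: RDD[(String, Int)], rdd2: RDD[(String, Int)]) => rdd1.join(rdd2))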
FYI: I've created SPARK-2560
https://issues.apache.org/jira/browse/SPARK-2560 to track creating SQL
reference docs for Spark SQL.
On Mon, Jul 14, 2014 at 2:06 PM, Michael Armbrust mich...@databricks.com
wrote:
You can find the parser here:
I used to use SPARK_LIBRARY_PATH to specify the location of native libs
for lzo compression when using Spark 0.9.0.
The references to that environment variable have disappeared from the docs for
Spark 1.0.1, and it's not clear how to specify the location for lzo.
Any guidance?
Hi,
I also have some issues with repartition. In my program, I consume data
from Kafka. After I consume data, I use repartition(N). However, although I
set N to 120, there are only around 18 executors allocated for my reduce
stage. I am not sure how the repartition command works to ensure the
Thanks Marcelo! This is a huge help!!
Looking at the executor logs (in a vanilla spark install, I'm finding them
in $SPARK_HOME/work/*)...
It launches the executor, but it looks like the
CoarseGrainedExecutorBackend is having trouble talking to the driver
(exactly what you said!!!).
Do you
One way is to set this in your conf/spark-defaults.conf:
spark.executor.extraLibraryPath /path/to/native/lib
The key is documented here:
http://spark.apache.org/docs/latest/configuration.html
On Thu, Jul 17, 2014 at 1:25 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
I used to use
Hi guys,
need some help with this problem. In our use case, we need to continuously
insert values into the database. So our approach is to create the JDBC
object in the main method and then do the insert operations in the
DStream foreachRDD operation. Is this approach reasonable?
Then the
Could you share some code (or pseudo-code)?
Sounds like you're instantiating the JDBC connection in the driver,
and using it inside a closure that would be run in a remote executor.
That means that the connection object would need to be serializable.
If that sounds like what you're doing, it
Good question.. I'll ask INFRA because I haven't seen other Apache mailing
lists provide this. It would indeed be helpful.
Matei
On Jul 17, 2014, at 12:59 PM, Nick Chammas nicholas.cham...@gmail.com wrote:
Can we modify the mailing list to include permalinks to the thread in the
footer of
The update function given to updateStateByKey should be called on ALL the
keys that are in the state, even if there is no new data in the batch for some
key. Is that not the behavior you see?
What do you mean by "show all the existing states"? You have access to the
latest state RDD by doing
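For instance (a sketch; stateDStream is whatever updateStateByKey returned):

// Each batch, this RDD contains the latest state for ALL keys.
stateDStream.foreachRDD { rdd =>
  rdd.collect().foreach(println)   // inspect every (key, state) pair
}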
And if Marcelo's guess is correct, then the right way to do this would be
to lazily / dynamically create the jdbc connection server as a singleton
in the workers/executors and use that. Something like this.
dstream.foreachRDD(rdd => {
  rdd.foreachPartition((iterator: Iterator[...]) => {
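Filled out, that might look like the following sketch, where ConnectionPool is a hypothetical helper you would write that lazily creates one connection per executor JVM and reuses it:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { iterator =>
    // Created lazily once per JVM, then shared across batches/partitions.
    val connection = ConnectionPool.getConnection()   // hypothetical helper
    iterator.foreach(record => connection.insert(record))
  }
}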
Hi Tathagata,
Thanks for your answer. Please see my further question below:
On Wed, Jul 16, 2014 at 6:57 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Answers inline.
On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I am currently using Spark
I am using spark 0.9.0 and I am able to submit job to YARN,
https://spark.apache.org/docs/0.9.0/running-on-yarn.html.
I am trying to turn on gc logging on executors but could not find a way to
set extra Java opts for workers.
I tried to set spark.executor.extraJavaOptions but that did not work.
Thanks Luis and Tobias.
On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote:
* Is there a way to control how far the Kafka DStream can read on a
topic-partition (via offset, for example)? By setting
val kafkaStream = KafkaUtils.createStream(... ) // see the example in my
previous post
val transformedStream = kafkaStream.map ... // whatever transformation
you want to do
transformedStream.foreachRDD((rdd: RDD[...], time: Time) => {
// save the rdd to parquet file, using time as the file
Hi Matt,
I'm not very familiar with the setup on EC2; the closest I can point you
at is to look at launch_cluster in ec2/spark_ec2.py, where the
ports seem to be configured.
On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
mattcoarr.w...@gmail.com wrote:
Thanks Marcelo! This is a huge help!!
Hi Marcelo and TD,
Thank you for the help. If I use TD's approach, it works and there is no
exception. The only drawback is that it will create many connections to the DB,
which I was trying to avoid.
Here is a snapshot of my code, with the important parts marked in red. What I
was thinking is that, if
We don't have support for partitioned parquet yet. There is a JIRA here:
https://issues.apache.org/jira/browse/SPARK-2406
On Thu, Jul 17, 2014 at 5:00 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
val kafkaStream = KafkaUtils.createStream(... ) // see the example in my
previous post
Hi Burak,
I tried running it through the Spark shell, but I still ended up with the same
error message as in Hadoop:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key
must be specified as the username or password (respectively) of a s3n URL,
or by setting the
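One way to supply the keys without embedding them in the URL is through the Hadoop configuration (a sketch; the key values and bucket path are placeholders):

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val data = sc.textFile("s3n://your-bucket/path")   // placeholder bucket/path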
Hi,
I am trying to develop a recommender system for about 1 million users and 10
thousand items. Currently it's a simple regression-based model where for
every (user, item) pair in the dataset we generate some features and learn a
model from them. Up through training and evaluation everything is fine; the
Thanks all! (And thanks Matei for the developer link!) I was able to
build using maven[1] but `./sbt/sbt assembly` results in build errors.
(Not familiar enough with the build to know why; in the past sbt
worked for me and maven did not).
I was able to run the master version of pyspark, which
Hi,
I am new to Spark and am trying out a stand-alone, 3-node (1 master, 2
workers) cluster.
From the Web UI at the master, I see that the workers are registered. But
when I try running the SparkPi example from the master node, I get the
following message and then an exception.
14/07/17
We are using the RegressionModels that come with the *mllib* package in Spark.
Hi,
Are you suggesting that taking simple vector dot products or applying a sigmoid
function on 10K * 1M data takes 5 hrs?
On Thu, Jul 17, 2014 at 3:59 PM, m3.sharma sharm...@umn.edu wrote:
We are using RegressionModels that comes with *mllib* package in SPARK.
I also have an issue consuming from Kafka. When I consume from Kafka, there
is always a single executor working on this job. Even when I use repartition,
it seems that there is still a single executor. Does anyone have an idea how
to add parallelism to this job?
On Thu, Jul 17, 2014 at 2:06 PM, Chen
Yes, that's what prediction should be doing: taking dot products or applying
the sigmoid function for each (user, item) pair. For 1 million users and 10K
items there are 10 billion pairs.
Hi, All
When I run Spark Streaming, in one of the flatMap stages, I want to access a
database.
The code looks like:

stream.flatMap(
  new FlatMapFunction {
    call() {
      // access database cluster
    }
  }
)

Since I don't want to create a database connection every time call() is
called, where
Hi Sean,
Thank you. I see your point. What I was thinking was to do the computation in
a distributed fashion and do the storing from a single place. But you are
right, having multiple DB connections actually is fine.
Thanks for answering my questions. That helps me understand the system.
Cheers,
Hello Spark Users,
I am new to Spark SQL and am now trying to get the HiveFromSpark example
working first.
However, I got the following error when running the HiveFromSpark.scala
program. May I get some help on this, please?
ERROR MESSAGE:
org.apache.thrift.TApplicationException: Invalid method name:
but be aware that spark-defaults.conf is only used if you use spark-submit
On Jul 17, 2014 4:29 PM, Zongheng Yang zonghen...@gmail.com wrote:
One way is to set this in your conf/spark-defaults.conf:
spark.executor.extraLibraryPath /path/to/native/lib
The key is documented here:
Hi TD,
Thank you. Yes, it behaves as you described. Sorry for missing this point.
Then my only concern is on the performance side - since Spark Streaming
operates on all the keys every time a new batch comes, I think it is fine
when the state size is small. When the state size becomes big, say, a
Hi Matt,
The security group shouldn't be an issue; the ports listed in
`spark_ec2.py` are only for communication with the outside world.
How did you launch your application? I notice you did not launch your
driver from your master node. What happens if you do? Another thing is
that there seems
On Jul 17, 2014, at 12:59 PM, Nick Chammas nicholas.cham...@gmail.com
wrote:
I often find myself wanting to reference one thread from another, or from
a JIRA issue. Right now I have to google the thread subject and find the
link that way.
+1
Bill,
are you saying that, after repartition(400), you have 400 partitions on one
host and the other hosts receive none of the data?
Tobias
On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
I also have an issue consuming from Kafka. When I consume from Kafka,
there
Can you check the environment tab of the Spark web UI to see whether this
configuration parameter is in effect?
TD
On Thu, Jul 17, 2014 at 2:05 PM, Chen Song chen.song...@gmail.com wrote:
I am using spark 0.9.0 and I am able to submit job to YARN,
You can create multiple Kafka streams to partition your topics across them,
which will run multiple receivers on multiple executors. This is covered in
the Spark Streaming guide:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
And for the
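Concretely, a sketch of creating several streams and unioning them (the ZooKeeper quorum, group id, and topic map are placeholders):

import org.apache.spark.streaming.kafka.KafkaUtils

val numReceivers = 4
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zkHost:2181", "myGroup", Map("myTopic" -> 1))
}
// One unified DStream; repartition to spread the downstream processing.
val unified = ssc.union(kafkaStreams).repartition(16)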
Hi ranjanp,
If you go to the master UI (masterIP:8080), what does the first line say?
Verify that this is the same as what you expect. Another thing is that
--master in spark-submit overrides whatever you set MASTER to, so the
environment variable won't actually take effect. Another obvious
Hi Chen,
spark.executor.extraJavaOptions was introduced in Spark 1.0, not in Spark
0.9. You need to
export SPARK_JAVA_OPTS="-Dspark.config1=value1 -Dspark.config2=value2"
in conf/spark-env.sh.
Let me know if that works.
Andrew
2014-07-17 18:15 GMT-07:00 Tathagata Das
Hi Jian,
In yarn-cluster mode, Spark submit automatically uploads the assembly jar
to a distributed cache that all executor containers read from, so there is
no need to manually copy the assembly jar to all nodes (or pass it through
--jars).
It seems there are two versions of the same jar in your
Step 1: use rdd.mapPartitions(). That's equivalent to the mapper of
MapReduce. You can open a connection, get all the data and buffer it, close
the connection, and return an iterator to the buffer.
Step 2: Make step 1 better by making it reuse connections. You can use
singletons / static vars to lazily
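Step 1 as a sketch (DBClient is a hypothetical, non-serializable client; it is created inside the partition so it never has to cross the wire):

val enriched = rdd.mapPartitions { iter =>
  val conn = DBClient.connect("db-host:5432")   // hypothetical: one per partition
  val buffered = iter.map(record => conn.lookup(record)).toList
  conn.close()                                  // safe: results are buffered
  buffered.iterator
}

For step 2, DBClient.connect could instead hand back a lazily initialized singleton kept in a companion object, so tasks in the same JVM share one connection.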
Hi Chris,
Did you ever figure this out? It should just work provided that your HDFS
is set up correctly. If you don't call setMaster, it actually uses the
spark://[master-node-ip]:7077 by default (this is configured in your
conf/spark-env.sh). However, even if you use a local master, it should
Yes, this is a limitation of the current implementation. But this will be
improved a lot when we have IndexedRDD
https://github.com/apache/spark/pull/1297 in Spark, which allows faster
single-value updates to a key-value pair (within each partition, without
processing the entire partition).
I think I know what is happening to you. I've looked into this some just this
week, so it's fresh in my brain :) hope this helps.
When no workers are known to the master, IIRC, you get this message.
I think this is how it works:
1) You start your master.
2) You start a slave, and give it
Hey everyone,
I know this was asked before, but I'm wondering if there have since been any
updates. Are there any plans to integrate IScala/Scala Notebook with Spark
in the near future?
This seems like something a lot of people would find very useful, so I was
just wondering if anyone has started
Seems like the MySQL connector jar is not included in the classpath.
Where can I set the jar in the classpath?
hive-site.xml:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
</property>
Thanks Andrew.
Say that I want to turn on CMS GC for each worker.
All I need to do is add the following line to conf/spark-env.sh on the node
where I submit the application:
-XX:+UseConcMarkSweepGC
Is that correct?
Will this option be propagated to each worker in YARN?
On Thu, Jul 17, 2014 at
Hi TD,
It worked... Thank you so much for all your help.
Thanks,
-Srinivas.
You will need to include that in the SPARK_JAVA_OPTS environment variable,
so add the following line to spark-env.sh:
export SPARK_JAVA_OPTS="-XX:+UseConcMarkSweepGC"
This should propagate to the executors. (Though you should double-check,
since 0.9 is a little old and I could be forgetting
Hello,
I have an issue where my Spark code is using too much memory in the final
step (a count for testing purposes; it will write the result to a DB when it
works). I'm really not too sure how I can break down the last step to use
less RAM.
So, basically, my data is log lines, and each log line
When I run a query on a Hadoop file:
mobile.registerAsTable("mobile")
val count = sqlContext.sql("select count(1) from mobile")
res5: org.apache.spark.sql.SchemaRDD =
SchemaRDD[21] at RDD at SchemaRDD.scala:100
== Query Plan ==
ExistingRdd [data_date#0,mobile#1,create_time#2], MapPartitionsRDD[4] at
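Note that sql() only builds a lazy SchemaRDD carrying the query plan shown above; nothing executes until an action is called. A sketch:

count.collect().foreach(println)   // forces execution and prints the count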