I am trying to run a spark application with
-Dspark.executor.memory=30g -Dspark.kryoserializer.buffer.max.mb=2000
-Dspark.akka.frameSize=1
and the job fails because one or more of the Akka frames are larger than
1 MB (12000-ish).
When I change -Dspark.akka.frameSize=1 to
1) Parameters like --num-executors should come before the jar. That is, you
want something like:
$SPARK_HOME/bin/spark-submit --num-executors 3 --driver-memory 6g \
  --executor-memory 7g --master yarn-cluster --class EDDApp \
  target/scala-2.10/eddjar \
  outputPath
That is, *your* parameters come after the jar,
Hi Michael,
judging from the logs, it seems that those tasks are just working a really
long time. If you have long-running tasks, then you wouldn't expect the
driver to output anything while those tasks are working.
What is unusual is that there is no activity during all that time the tasks
are
Hi Corey,
I see. Thanks for making it clear. I may be lucky in not hitting the code
paths of such Guava classes. But I hit some other jar conflicts when using
Spark to connect to AWS S3. Then I had to manually try each version of
org.apache.httpcomponents until I found a suitable older version.
Another
That work is from more than a year back and is not maintained anymore,
since we do not use it in-house now.
Also note that there have been quite a lot of changes in spark ...
including some which break assumptions made in the patch, so its value is
very low - having said that, do feel free to work
Also, do you see any lines in the YARN NodeManager logs where it says that
it's killing a container?
-Sandy
On Wed, Feb 4, 2015 at 8:56 AM, Imran Rashid iras...@cloudera.com wrote:
Hi Michael,
judging from the logs, it seems that those tasks are just working a really
long time. If you have
You can easily understand the flow by looking at the number of operations
in your program (like map, groupBy, join, etc.). First of all, list out
the operations happening in your application, and then from the web UI
you will be able to see how many operations have happened so far.
Joe, I also use S3 and gzip. So far, I/O has not been a problem. In my case,
the operation is SQLContext.jsonFile(), and I can see from Ganglia that the
whole cluster is CPU-bound (99% saturated). I have 160 cores and I can see
the network can sustain about 150 MBit/s.
Kelvin
On Wed, Feb 4, 2015 at
But there isn't a 1-1 mapping from operations to stages since multiple
operations will be pipelined into a single stage if no shuffle is
required. To determine the number of stages in a job you really need to be
looking for shuffle boundaries.
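For illustration, a minimal sketch of where a stage boundary appears (not from the thread; the input path is a placeholder):
val words = sc.textFile("hdfs:///path/to/input")  // placeholder path
  .flatMap(_.split(" "))
  .map(w => (w, 1))               // pipelined with the read: still stage 1
val counts = words.groupByKey()   // shuffle boundary: stage 2 starts here
  .mapValues(_.sum)
counts.collect()                  // one job, two stages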
On Wed, Feb 4, 2015 at 11:27 AM, Akhil Das
AFAIK, from Spark 1.2.0 you can have WAL (Write-Ahead Logs) for fault
tolerance, which means it can handle receiver/driver failures. You can
also look at the low-level Kafka consumer
(https://github.com/dibbhatt/kafka-spark-consumer), which has a better
fault-tolerance mechanism for the receiver
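A minimal sketch of enabling the WAL (not from the thread; the app name and checkpoint path are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf()
  .setAppName("wal-example")  // placeholder
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/checkpoints")  // the WAL is written under the checkpoint directory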
Greetings!
Thanks to all who have taken the time to look at this.
While the process is stalled, I see, in the yarn log on the head node,
repeating messages of the form "Trying to fulfill reservation for application
XXX on node YYY, but that node is reserved by XXX_01". Below is a chunk
of
I was using a hive context and not a sql context, therefore (SET
spark.sql.parquet.compression.codec=gzip) was ignored.
Michael Armbrust pointed out that parquet.compression should be used
instead, which solved the issue.
I am still wondering if this behavior is normal; it would be better if
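For reference, a minimal sketch of the fix described above, assuming an existing HiveContext named hiveContext:
hiveContext.sql("SET parquet.compression=GZIP")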
Hi Ayoub,
You could try using the sql format to set the compression type:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqc = SQLContext(sc)
sqc.sql("SET spark.sql.parquet.compression.codec=gzip")
You get a notification on screen while running the spark job when you set
the compression codec like this. I haven't compared
Hi Corey,
When you run on YARN, YARN's libraries are placed on the classpath,
and they take precedence over your app's. So, with Spark 1.2, you'll
get Guava 11 in your classpath (with Spark 1.1 and earlier you'd get
Guava 14 from Spark, so still a problem for you).
Right now, the option Markus
The latter would be faster. With S3, you want to maximize the number of
concurrent readers until you hit your network throughput limits.
On Wed, Feb 4, 2015 at 6:20 AM, Peter Rudenko petro.rude...@gmail.com
wrote:
Hi, if I have a 10GB file on S3 and set 10 partitions, would it download the
whole
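A minimal sketch of the concurrent-reader idea (bucket and file names are placeholders; note this only helps for splittable formats, so not for a single gzip file):
val lines = sc.textFile("s3n://my-bucket/big-file.txt", 10)  // ask for 10 input splits read in parallel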
the whole spark.files.userClassPathFirst never really worked for me in
standalone mode, since jars were added dynamically, which means they had
different classloaders, leading to real classloader hell if you tried to
add a newer version of a jar that spark already used. see:
My mistake Marcelo, I was looking at the wrong message. That reply was
meant for Bo Yang.
On Feb 4, 2015 4:02 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Corey,
On Wed, Feb 4, 2015 at 12:44 PM, Corey Nolet cjno...@gmail.com wrote:
Another suggestion is to build Spark by yourself.
sqlContext.table(tableName).schema
On Wed, Feb 4, 2015 at 1:07 PM, Ayoub benali.ayoub.i...@gmail.com wrote:
Given a hive context you could execute:
hiveContext.sql("describe TABLE_NAME")
and you would get the name of the fields and their types
2015-02-04 21:47 GMT+01:00 nitinkak001 [hidden
Spark 1.2
Data is read from parquet with 2 partitions and is cached as a table with 2
partitions. Verified in the UI that it shows an RDD with 2 partitions and that
it is fully cached in memory.
Cached data contains columns a, b, c. Column a has ~150 distinct values.
Next, run SQL on this table as select a, sum(b),
marcelo,
i was not aware of those fixes. it's a full-time job to keep up with spark...
i will take another look. it would be great if that works on spark
standalone also and resolves the issues i experienced before.
about putting stuff on classpath before spark or yarn... yeah you can shoot
Hi Koert,
On Wed, Feb 4, 2015 at 11:35 AM, Koert Kuipers ko...@tresata.com wrote:
do i understand it correctly that on yarn the custom jars are truly
placed before the yarn and spark jars on the classpath? meaning at container
construction time, on the same classloader? that would be great
I am running a job, part of which is to add some null values to the rows of
a SchemaRDD. The job fails with "Total size of serialized results of 2692
tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)".
This is the code:
val in = sqc.parquetFile(...)
..
val presentColProj:
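Not from the thread, but the usual remedies are to avoid funneling that much data back to the driver, or to raise the cap; a minimal sketch of the latter:
import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.driver.maxResultSize", "2g")  // default is 1g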
Given a hive context you could execute:
hiveContext.sql("describe TABLE_NAME")
and you would get the name of the fields and their types
2015-02-04 21:47 GMT+01:00 nitinkak001 nitinkak...@gmail.com:
I want to get Hive table schema details into Spark. Specifically, I want to
get the column name and
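Combining the two suggestions in this thread into a minimal sketch (the table name is a placeholder):
val schema = sqlContext.table("my_table").schema  // a StructType
schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))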
On Wed, Feb 4, 2015 at 1:12 PM, Koert Kuipers ko...@tresata.com wrote:
about putting stuff on classpath before spark or yarn... yeah you can shoot
yourself in the foot with it, but since the container is isolated it should
be ok, no? we have been using HADOOP_USER_CLASSPATH_FIRST forever with
Thank you very much for the detailed answer. I feel a little dumb asking,
but how would that work when using Scala (we use Spark 1.0.2)?
I cannot figure it out. E.g. I am having trouble using UnionPartition
and NewHadoopPartition, and even ds.values() is not an option for me (in the
IDE). Do you
i have never been a big fan of shading, since it leads to the same library
being packaged many times. for example, it's not unusual to have ASM 10
times in a jar because of the shading policy they promote. and all that
because they broke one signature, and without a really good reason.
i am a big
Hi Corey,
On Wed, Feb 4, 2015 at 12:44 PM, Corey Nolet cjno...@gmail.com wrote:
Another suggestion is to build Spark by yourself.
I'm having trouble seeing what you mean here, Marcelo. Guava is already
shaded to a different package for the 1.2.0 release. It shouldn't be causing
conflicts.
yep... it was unnecessary to create a 2.5 profile. i struggled a bit because
it wasn't clear that i *need* to select a profile using the -P option. i didn't
have to do that for earlier hadoop versions.
On Fri, Jan 30, 2015 at 12:11 AM, Sean Owen so...@cloudera.com wrote:
There is no need for a 2.5
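For reference, the kind of invocation this points at, as a sketch (assuming the hadoop-2.4 profile also covers 2.5.x, per the Spark build docs of that era; the version number is illustrative):
mvn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package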
Many operations in spark are lazy -- most likely your collect() statement
is actually forcing evaluation of several steps earlier in the pipeline.
The logs and the UI might give you some info about all the stages that are
being run when you get to collect().
I think collect() is just fine if you
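A minimal sketch of that laziness (not from the thread):
val nums = sc.parallelize(1 to 1000000)
val evens = nums.filter(_ % 2 == 0).map(_ * 2)  // nothing runs yet: just a plan
val result = evens.collect()                    // every earlier step executes here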
Hi,
I am trying out the tableau beta connector to Spark SQL. I have a few basic
questions:
Will this connector be able to fetch the schemaRDDs into tableau?
Will all the schemaRDDs be exposed to tableau?
Basically, I am not getting what tableau will fetch as the data source. Is it
existing files in HDFS?
Hey Spark gurus! Sorry for the confusing title. I do not know the exact
description of my problem; if you know, please tell me so I can change it :-)
Say I have two RDDs right now, and they are
val rdd1 = sc.parallelize(List((1,(3)), (2,(5)), (3,(6))))
val rdd2 = sc.parallelize(List((2,(1)),
Following up for closure on this thread ...
1. spark.sql.shuffle.partitions is not on the config page but is mentioned on
http://spark.apache.org/docs/1.2.0/sql-programming-guide.html. It would be
better to have it on the config page as well for the sake of completeness.
Should I file a doc bug?
2. Regarding my #2
Hi,
I would like to submit spark jobs one by one, such that the next job will not
be submitted until the previous one succeeds. spark-submit can only submit
jobs asynchronously. Is there any way I can submit jobs sequentially?
Thanks.
Ey-Chih Chow
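Not from the thread, but one common approach is to chain client-mode submissions in a shell, since spark-submit then blocks until the app finishes and && proceeds only on success (class and jar names are placeholders):
spark-submit --class JobOne app.jar && spark-submit --class JobTwo app.jar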
Hi Manoj,
You can set the number of partitions you want your sql query to use. By
default it is 200, and thus you see that number. You can update it using the
spark.sql.shuffle.partitions property:
spark.sql.shuffle.partitions (default: 200) - Configures the number of
partitions to use when shuffling data for joins or aggregations.
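A minimal sketch of changing it (the value is illustrative):
sqlContext.sql("SET spark.sql.shuffle.partitions=20")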
anyhow i am ranting... sorry
On Wed, Feb 4, 2015 at 5:54 PM, Koert Kuipers ko...@tresata.com wrote:
yeah i think we have been lucky so far. but i don't really see how i have a
choice. it would be fine if, say, hadoop exposed a very small set of
libraries as part of the classpath. but if i look
Hi,
Any thoughts on this?
Most of the 200 tasks in the collect phase take less than 20 ms, and a lot of
time is spent on scheduling them. I suspect the overall time will drop if the
number of tasks is reduced to a much smaller number (close to the number of
partitions?)
Any help is appreciated
Thanks
On Wed, Feb 4,
The unit of spark.akka.frameSize is MB. The max value is 2047.
Best Regards,
Shixiong Zhu
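A minimal sketch of setting it, with an illustrative value (the unit is MB, as noted above):
import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.akka.frameSize", "200")  // 200 MB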
2015-02-05 1:16 GMT+08:00 sahanbull sa...@skimlinks.com:
I am trying to run a spark application with
-Dspark.executor.memory=30g -Dspark.kryoserializer.buffer.max.mb=2000
-Dspark.akka.frameSize=1
Could you clarify why you need a 10G akka frame size?
Best Regards,
Shixiong Zhu
2015-02-05 9:20 GMT+08:00 Shixiong Zhu zsxw...@gmail.com:
The unit of spark.akka.frameSize is MB. The max value is 2047.
Best Regards,
Shixiong Zhu
2015-02-05 1:16 GMT+08:00 sahanbull sa...@skimlinks.com:
I
I'm trying to access a permanent Hive UDF:
scala> val data = hc.sql("select func.md5(some_string) from some_table")
data: org.apache.spark.sql.SchemaRDD =
SchemaRDD[19] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
java.lang.RuntimeException: Couldn't find function func.md5
Problem solved. A simple join will do the work:
val prefix = new PairRDDFunctions[Int, Set[Int]](sc.parallelize(List((9,
Set(4)), (1, Set(3)), (2, Set(5)), (2, Set(4)))))
val suffix = sc.parallelize(List((1, Set(1)), (2, Set(6)), (2, Set(5)), (2,
Set(7))))
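The join itself, as a minimal sketch following the thread's variable names:
val joined = prefix.join(suffix)  // (key, (prefixSet, suffixSet)) for each matching key
joined.collect().foreach(println)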
How do I save an RDD with gzip compression?
Thanks.
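Not answered in this excerpt, but the standard approach is to pass a codec class to saveAsTextFile; a minimal sketch (the RDD and output path are placeholders):
import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsTextFile("hdfs:///out/path", classOf[GzipCodec])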
Thanks Denny and Ismail.
Denny, I went through your blog; it was a great help. I guess the tableau beta
connector is also following the same procedure you described in the blog. I am
building Spark now.
Basically, what I don't get is where to put my data so that tableau can
extract it.
So Ismail, it's
Tableau connects to the Spark Thrift Server via an ODBC driver. So, none of the
RDD stuff applies; you just issue SQL queries from Tableau.
The table metadata can come from the Hive Metastore if you place your
hive-site.xml in the configuration directory of Spark.
On Thu, Feb 5, 2015 at 8:11 AM, ashu
Is YARN_CONF_DIR set?
--- Original Message ---
From: Aniket Bhatnagar aniket.bhatna...@gmail.com
Sent: February 4, 2015 6:16 AM
To: kundan kumar iitr.kun...@gmail.com, spark users
user@spark.apache.org
Subject: Re: Spark Job running on localhost on yarn cluster
Have you set master in
The context is that you would create your RDDs and then persist them in
Hive. Once in Hive, the data is accessible from the Tableau extract through
Spark thrift server.
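A minimal sketch of that flow, assuming Spark 1.2's HiveContext (input path and table name are placeholders):
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
val data = hc.jsonFile("hdfs:///input/data.json")  // any SchemaRDD works here
data.saveAsTable("my_table")  // lands in the Hive metastore, visible to the Thrift server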
On Wed, Feb 4, 2015 at 23:36 Ashutosh Trivedi (MT2013030)
ashutosh.triv...@iiitb.org wrote:
Thanks Denny and Ismail.
Hope someone helps me. Thanks.
On Wed, Feb 4, 2015 at 6:14 PM, Jander g jande...@gmail.com wrote:
We have a standalone spark cluster for a kerberos test.
But when reading from hdfs, I get the error output: "Can't get Master Kerberos
principal for use as renewer."
So, does standalone Spark
I'm sitting here looking at my application crunching gigabytes of data on a
cluster, and I have no idea if it's an hour away from completion or a
minute. The web UI shows progress through each stage, but not how many
stages remain. How can I work out how many stages my program will take
Hi!
We have a common use case with Spark - we go out to some database, e.g.
Cassandra, crunch through all of its data, but along the RDD pipeline we use a
pipe operator to some script. All the data before the pipe has some unique IDs,
but inside the pipe everything is lost.
The only current
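Not from the thread, but a common workaround is to serialize the ID into each line fed to the script and have the script echo it back; a minimal sketch (the records RDD, script path, and tab-separated layout are assumptions):
val records: org.apache.spark.rdd.RDD[(String, String)] = ???  // (id, payload), hypothetical
val withIds = records.map { case (id, payload) => s"$id\t$payload" }
val piped = withIds.pipe("/path/to/script.sh")  // the script must echo the id column back
val restored = piped.map { line =>
  val Array(id, rest) = line.split("\t", 2)
  (id, rest)
}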
It means you did not exclude the Servlet APIs from some dependency in
your app, and one of them is bringing it in every time. Look at the
dependency tree and exclude whatever brings in javax.servlet.
It should already be available in Spark, and the particular
javax.servlet JAR from Oracle has
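A minimal sketch of such an exclusion in sbt (the offending dependency is hypothetical; use whatever your dependency tree actually shows):
libraryDependencies += ("org.example" % "some-lib" % "1.0")
  .exclude("javax.servlet", "servlet-api")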
Hi,
I am trying to execute my code on a yarn cluster.
The command which I am using is:
$SPARK_HOME/bin/spark-submit --class EDDApp
target/scala-2.10/edd-application_2.10-1.0.jar --master yarn-cluster
--num-executors 3 --driver-memory 6g --executor-memory 7g outputPath
But, I can see that this
Bo Yang -
I am using Spark 1.2.0, and undoubtedly there are older Guava classes which
are being picked up and serialized with the closures when they are sent
from the driver to the executors, because the class serial version ids don't
match from the driver to the executors. Have you tried doing
It means what it says. You should not have multiple SparkContexts
running in one JVM. It was always the wrong thing to do, but now is an
explicit error. When you run spark-shell, you already have a
SparkContext (sc) so there is no need to make another one. Just don't
do that.
On Wed, Feb 4, 2015
I execute a job in Spark where I'm processing an 80 GB file in HDFS.
I have 5 slaves:
(32 cores / 256 GB / 7 physical disks) x 5
I have been trying many different configurations with YARN:
yarn.nodemanager.resource.memory-mb 196 GB
yarn.nodemanager.resource.cpu-vcores 24
I have tried to execute the
We have a standalone spark cluster for a kerberos test.
But when reading from hdfs, I get the error output: "Can't get Master Kerberos
principal for use as renewer."
So, does standalone Spark support Kerberos? Can anyone confirm it? Or what
did I miss?
Thanks in advance.
--
Thanks,
Jander
Hi Mridul,
do you think you'll keep working on this, or should this get picked up by
others? Looks like there was a lot of work put into LargeByteBuffer; it seems
promising.
thanks,
Imran
On Tue, Feb 3, 2015 at 7:32 PM, Mridul Muralidharan mri...@gmail.com
wrote:
That is fairly out of date (we
I think you are interested in secondary sort, which is still being worked
on:
https://issues.apache.org/jira/browse/SPARK-3655
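In the meantime, a hand-rolled secondary sort is possible with repartitionAndSortWithinPartitions (available since Spark 1.2); a minimal sketch with an illustrative composite key:
import org.apache.spark.Partitioner
case class CompositeKey(primary: Int, secondary: Int)
// partition on the primary key only, but order by (primary, secondary)
class PrimaryKeyPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(k: Any): Int =
    math.abs(k.asInstanceOf[CompositeKey].primary.hashCode) % numPartitions
}
implicit val ord: Ordering[CompositeKey] = Ordering.by(k => (k.primary, k.secondary))
val pairs = sc.parallelize(Seq(CompositeKey(1, 9), CompositeKey(1, 2), CompositeKey(2, 5)))
  .map(k => (k, ()))
val sorted = pairs.repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(2))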
On Tue, Feb 3, 2015 at 4:41 PM, Nitin kak nitinkak...@gmail.com wrote:
I thought that's what sort-based shuffle did, sort the keys going to the
same partition.
I
yeah i think we have been lucky so far. but i don't really see how i have a
choice. it would be fine if, say, hadoop exposed a very small set of
libraries as part of the classpath. but if i look at the jars on the hadoop
classpath it's a ton! and why? why does parquet need to be included with
hadoop for
Thanks Akhil for mentioning this Low Level Consumer
(https://github.com/dibbhatt/kafka-spark-consumer). Yes, it has better
fault tolerance than any existing Kafka consumer available. It has
no data loss on receiver failure and has the ability to replay or restart
itself in case of