Thanks RK. I can turn on speculative execution, but I am trying to find out the
actual reason for the delay, as it happens on any node. Any idea about the stack
trace in my previous mail?
Regards, Ajay
On Thursday, January 15, 2015 8:02 PM, RK prk...@yahoo.com.INVALID wrote:
If you don't want
Yes that's a typo. The API docs and source code are correct though.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
That and your IDE should show the correct signature. You can open a PR
to fix the typo in
Replying to all
Is this overhead memory allocation used for any specific purpose?
For example, will it be any different if I do --executor-memory 22G with the
overhead set to 0% (hypothetically) vs
--executor-memory 20G and the overhead memory at the default (9%), which
eventually brings the total
Hi,
I am running PageRank on a large dataset, which includes 200 million nodes and 2
billion edges.
Is Spark suitable for large-scale PageRank? How many cores and how much memory do
I need, and how long will it take?
Thanks
Xuewei Tang
You can also check rdd.partitions.size. This will be 0 for empty RDDs and
greater than 0 for RDDs with data.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-tp1678p21170.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
What's the version of Spark you are using?
On Wed, Jan 14, 2015 at 12:00 AM, Linda Terlouw linda.terl...@icris.nl wrote:
I'm new to Spark. When I use the Movie Lens dataset 100k
(http://grouplens.org/datasets/movielens/), Spark crashes when I run the
following code. The first call to
If you give the executor 22GB, it will run with ... -Xmx22g. If the JVM
heap gets nearly full, it will almost certainly consume more than 22GB of
physical memory, because the JVM needs memory for more than just the heap. But
in this scenario YARN was only asked for 22GB, so the executor gets killed. This is
Ajay,
Unless we are dealing with some synchronization/conditional variable
bug in Spark, try this per tuning guide:
Cache Size Tuning
One important configuration parameter for GC is the amount of memory that
should be used for caching RDDs. By default, Spark uses 60% of the configured
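For illustration, a minimal sketch of lowering that fraction (assuming Spark 1.x; the app name and the value are hypothetical, not recommendations):

// Leave more heap for task execution by shrinking the RDD cache fraction.
val conf = new org.apache.spark.SparkConf()
  .setAppName("cache-tuning-example")          // hypothetical app name
  .set("spark.storage.memoryFraction", "0.4")  // lower than the 0.6 default
val sc = new org.apache.spark.SparkContext(conf)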
Hi all,
Any help on the following is very much appreciated.
=
Problem:
On a schemaRDD read from a parquet file (data within file uses AVRO model)
using the HiveContext:
I can't figure out how to 'select' or use a 'where' clause to filter
rows on a field that
The AMPLab maintains a bunch of Docker files for Spark here:
https://github.com/amplab/docker-scripts
Hasn't been updated since 1.0.0, but might be a good starting point.
On Wed Jan 14 2015 at 12:14:13 PM Josh J joshjd...@gmail.com wrote:
We have dockerized Spark Master and worker(s)
This is a YARN setting. It just controls how much any container can
reserve, including Spark executors. That is not the problem.
You need Spark to ask for more memory from YARN, on top of the memory that
is requested by --executor-memory. Your output indicates the default of 7%
is too little. For
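For example, something along these lines (values purely illustrative) raises the overhead that Spark requests from YARN on top of the executor heap:

# spark.yarn.executor.memoryOverhead is in MB; the default is roughly 7% of executor memory
spark-submit --master yarn-cluster \
  --executor-memory 20G \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  your-app.jar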
You could try to use the hive context, which brings HiveQL; it would allow you to
query nested structures using LATERAL VIEW explode...
On Jan 15, 2015 4:03 PM, jvuillermet jeremy.vuiller...@gmail.com wrote:
let's say my json file lines look like this
{"user": "baz", "tags": ["foo", "bar"]}
Just throwing this out here, there is an existing PR to add docker support for
the spark framework to launch executors with a docker image.
https://github.com/apache/spark/pull/3074
Hopefully this will be merged sometime.
Tim
On Thu, Jan 15, 2015 at 9:18 AM, Nicholas Chammas
nicholas.cham...@gmail.com
Have you seen http://search-hadoop.com/m/JW1q5pE3P12 ?
Please also take a look at the end-to-end performance graph on
http://spark.apache.org/graphx/
Cheers
On Thu, Jan 15, 2015 at 9:29 AM, txw t...@outlook.com wrote:
Hi,
I am run PageRank on a large dataset, which include 200 million
let's say my json file lines look like this
{"user": "baz", "tags": ["foo", "bar"]}
sqlContext.jsonFile("data.json")
...
How could I query for users with the 'bar' tag using SQL?
sqlContext.sql("select user from users where tags ?contains? 'bar'")
I could simplify the request and use the returned RDD to
thanks
On Thu, Jan 15, 2015 at 7:35 PM, Prannoy [via Apache Spark User List]
ml-node+s1001560n21163...@n3.nabble.com wrote:
Hi,
You can take the schema line in another rdd and then do a union of the two
rdds.
List<String> schemaList = new ArrayList<String>();
schemaList.add("xyz");
// where
Those settings aren't relevant, I think. You're concerned with what
your app requests, and what Spark requests of YARN on your behalf. (Of
course, you can't request more than what your cluster allows for a
YARN container for example, but that doesn't seem to be what is
happening here.)
You do not
I am sorry for the formatting error; the value for
yarn.scheduler.maximum-allocation-mb
= 28G
On Thu, Jan 15, 2015 at 11:31 AM, Nitin kak nitinkak...@gmail.com wrote:
Thanks for sticking to this thread.
I am guessing what memory my app requests and what Yarn requests on my
part should be
Thanks for sticking with this thread.
I am guessing that the memory my app requests and what Yarn requests on my behalf
should be the same and is determined by the value of --executor-memory, which
I had set to 20G. Or can the two values be different?
I checked the Yarn configurations (below), so I think
Hi,
You can use the FileUtil.copyMerge API and specify the path to the folder where
saveAsTextFile saved the part text files.
Suppose your directory is /a/b/c/;
use FileUtil.copyMerge(FileSystem of source, a/b/c, FileSystem of
destination, Path to the merged file say (a/b/c.txt), true (to delete the
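Roughly like this in Scala (a sketch assuming Hadoop 2.x and that /a/b/c was written by saveAsTextFile; paths are just the example above):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
// merge the part-* files under /a/b/c into a single /a/b/c.txt,
// deleting the source directory afterwards
FileUtil.copyMerge(fs, new Path("/a/b/c"), fs, new Path("/a/b/c.txt"),
  true, hadoopConf, null)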
The cogroup() function seems to return (K, (Iterable<V>, Iterable<W>)), rather than
(K, Iterable<V>, Iterable<W>), as is pointed out in the docs (at least for
version 1.1.0):
https://spark.apache.org/docs/1.1.0/programming-guide.html
This
You're specifying the queue in the spark-submit command line:
--queue thequeue
Are you sure that queue exists?
On Thu, Jan 15, 2015 at 11:23 AM, Manoj Samel manojsamelt...@gmail.com wrote:
Hi,
Setup is as follows
Hadoop Cluster 2.3.0 (CDH5.0)
- Namenode HA
- Resource manager HA
-
yeah that's where I ended up. Thanks ! I'll give it a try.
On Thu, Jan 15, 2015 at 8:46 PM, Ayoub [via Apache Spark User List]
ml-node+s1001560n21172...@n3.nabble.com wrote:
You could try to use the hive context, which brings HiveQL; it would allow you
to query nested structures using LATERAL VIEW
Hi All,
I am testing Spark on an EMR cluster. The environment is a one-node r3.8xlarge
cluster with 32 vCores and 244G of memory.
But the command line I use to start up spark-shell doesn't work. For
example:
~/spark/bin/spark-shell --jars
/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar
You could try to use the hive context, which brings HiveQL; it would allow you to
query nested structures using LATERAL VIEW explode...
See the doc here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
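A rough sketch of what that can look like for the tags example above (assuming Spark 1.2 built with Hive support; sc and the file name are hypothetical):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val users = hiveContext.jsonFile("data.json")   // the example JSON file from the question
users.registerTempTable("users")
// explode the tags array into one row per tag, then filter on it
hiveContext.sql(
  "SELECT user FROM users LATERAL VIEW explode(tags) t AS tag WHERE tag = 'bar'"
).collect()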
--
View this message in context:
Hi All,
I am trying to clarify some executor behavior in Spark. Because I come from a
Hadoop background, I am trying to compare it to the mapper (or reducer) in
Hadoop.
1. Each node can have multiple executors, each running in its own process? This
is the same as a mapper process.
2. I thought the
I have a standalone spark cluster with only one node with 4 CPU cores. How can
I force spark to do parallel processing of my RDD using multiple threads? For
example I can do the following
Spark-submit --master local[4]
However I really want to use the cluster as follows
Spark-submit --master
I think Sampo's thought is to get a function that only tests whether an RDD is
empty. He does not want to know the size of the RDD, and getting the size of
an RDD is expensive for large data sets.
I myself have seen many times that my app threw exceptions because an empty
RDD cannot be saved. This is not
and to make things even more interesting:
The CDH *5.3* version of Spark 1.2 differs from the Apache Spark 1.2
release in using Akka version 2.2.3, the version used by Spark 1.1 and CDH
5.2. Apache Spark 1.2 uses Akka version 2.3.4.
So I just compiled a program that uses Akka against Apache
thanks for the replies. very useful.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-tp21124p21176.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi all,
just discovered that the streaming folder in pyspark is not included in the
assembly jar (spark-assembly-1.2.0-hadoop2.3.0.jar), but is included in the
python folder. Any reason why?
Thanks,
--
View this message in context:
I figured out the second question: if I don't pass in the number of
partitions for the test data, it will by default assume the max number of
executors (although I don't know what this default max number is).
val lines = sc.parallelize(List(-240990|161327,9051480,0,2,30.48,75,
Hi, BB
Ideally you can do the query like: select key, value.percent from
mytable_data lateral view explode(audiences) f as key, value limit 3;
But there is a bug in HiveContext:
https://issues.apache.org/jira/browse/SPARK-5237
I am working on it now; hopefully I can make a patch soon.
Cheng
I have seen this happen when the RDD contains null values. Essentially,
saveAsTextFile calls toString() on the elements of the RDD, so a call to
null.toString will result in an NPE.
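If that is the cause, a quick workaround is to drop the nulls before writing, e.g. (RDD name and output path are hypothetical):

val cleaned = rdd.filter(_ != null)            // drop null elements
cleaned.saveAsTextFile("hdfs:///tmp/output")   // hypothetical output path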
--
View this message in context:
Say there are some logs:
s3://log-collections/sys1/20141212/nginx.gz
s3://log-collections/sys1/20141213/nginx-part-1.gz
s3://log-collections/sys1/20141213/nginx-part-2.gz
I have a function that parses the logs for later analysis.
I want to parse all the files. So I do this:
logs =
Hi,
On Fri, Jan 16, 2015 at 7:31 AM, freedafeng freedaf...@yahoo.com wrote:
I myself have seen many times that my app threw exceptions because an empty
RDD cannot be saved. This is not a big issue, but annoying. Having a cheap
solution for testing if an RDD is empty would be nice if there is no such
An executor is specific to a Spark application, just as a mapper is
specific to a MapReduce job. So a machine will usually be running many
executors, and each is a JVM.
A Mapper is single-threaded; an executor can run many tasks (possibly
from different jobs within the application) at once. Yes,
How about checking whether take(1).length == 0? If I read the code
correctly, this will only examine the first partition, at least.
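A small sketch of wrapping that check (this is not an official API in Spark 1.2; the helper name and usage below are just illustrative):

def isEmpty[T](rdd: org.apache.spark.rdd.RDD[T]): Boolean =
  rdd.take(1).length == 0

// hypothetical usage: only save when there is something to save
if (!isEmpty(myRdd)) myRdd.saveAsTextFile("hdfs:///tmp/out")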
On Fri, Jan 16, 2015 at 4:12 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Fri, Jan 16, 2015 at 7:31 AM, freedafeng freedaf...@yahoo.com wrote:
I myself
Your understanding is basically correct. Each task creates its own
local accumulator, and just those results get merged together.
However, there are some performance limitations to be aware of. First you
need enough memory on the executors to build up whatever those intermediate
results are.
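For reference, a minimal accumulator sketch (Spark 1.x API; the input path and the thing being counted are made up):

val blankLines = sc.accumulator(0)   // per-task local copies are merged back on the driver
sc.textFile("hdfs:///logs/input").foreach { line =>
  if (line.trim.isEmpty) blankLines += 1
}
println(blankLines.value)            // read the merged value on the driver, after the action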
Thanks Nicos. GC does not contribute much to the execution time of the task. I
will debug it further today.
Regards, Ajay
On Thursday, January 15, 2015 11:55 PM, Nicos n...@hotmail.com wrote:
Ajay, Unless we are dealing with some synchronization/conditional variable bug
in Spark, try
We have our Analytics App built on Spark 1.1 Core, Parquet, Avro and Spray.
We are using Kryo serializer for the Avro objects read from Parquet and we
are using our custom Kryo registrator (along the lines of ADAM
I have document storage services in Accumulo that I'd like to expose to
Spark SQL. I am able to push down predicate logic to Accumulo to have it
perform only the seeks necessary on each tablet server to grab the results
being asked for.
I'm interested in using Spark SQL to push those predicates
Hi Dibyendu,
I am using kafka 0.8.1.1 and spark 1.2.0.
After modifying these versions in your pom, I rebuilt your code.
But I have not got any messages from ssc.receiverStream(new
KafkaReceiver(_props, i)).
I have found that in your code all the messages are retrieved correctly, but
Hi Kidong,
No, I have not tried with Spark 1.2 yet. I will try this out and let
you know how it goes.
By the way, has there been any change in the Receiver Store method in Spark
1.2?
Regards,
Dibyendu
On Fri, Jan 16, 2015 at 11:25 AM, mykidong mykid...@gmail.com wrote:
Hi
Did you try increasing the parallelism?
Thanks
Best Regards
On Fri, Jan 16, 2015 at 10:41 AM, Anand Mohan chinn...@gmail.com wrote:
We have our Analytics App built on Spark 1.1 Core, Parquet, Avro and Spray.
We are using Kryo serializer for the Avro objects read from Parquet and we
are using
There's a JSON endpoint in the web UI (running on port 8080):
http://masterip:8080/json/
Thanks
Best Regards
On Thu, Jan 15, 2015 at 6:30 PM, Shing Hing Man mat...@yahoo.com.invalid
wrote:
Hi,
I am using Spark 1.2. The Spark master UI has a status.
Is there a web service on the
You can use saveAsNewAPIHadoopFile:
http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#saveAsNewAPIHadoopFile
You can use it for compressing your output; here's some sample code
https://github.com/ScrapCodes/spark-1/blob/master/python/pyspark/tests.py#L1225
showing how to use the API.
Hi Kidong,
Just now I tested the Low Level Consumer with Spark 1.2 and I did not see
any issue with the Receiver.Store method. It is able to fetch messages from
Kafka.
Can you cross-check other configurations in your setup, like the Kafka broker IP,
topic name, zk host details, consumer id, etc.?
Dib
The Data Source API probably works for this purpose.
It supports column pruning and predicate push-down:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
Examples can also be found in the unit tests:
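Separately, for orientation, a very rough sketch of the shape such a relation can take (assuming the Spark 1.2 sql.sources API; all names are hypothetical and the actual Accumulo scan logic is omitted):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, StructType}
import org.apache.spark.sql.sources.{Filter, PrunedFilteredScan}

// Hypothetical relation that would translate pushed-down filters into Accumulo seeks.
class AccumuloRelation(val sqlContext: SQLContext, userSchema: StructType)
    extends PrunedFilteredScan {
  override def schema: StructType = userSchema
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // turn `filters` into Accumulo ranges/seeks here (omitted)
    sqlContext.sparkContext.emptyRDD[Row]
  }
}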
Hi,
When I try to run a program on a remote Spark machine, I get
the exception below:
15/01/16 11:14:39 ERROR UserGroupInformation: PriviledgedActionException
as:user1 (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures
timed out after [30 seconds]
Exception in
There was a simple example
https://github.com/dibbhatt/kafka-spark-consumer/blob/master/examples/scala/LowLevelKafkaConsumer.scala#L45
which you can run after changing a few lines of configuration.
Thanks
Best Regards
On Fri, Jan 16, 2015 at 12:23 PM, Dibyendu Bhattacharya
Hi,
I am experiencing a weird error that suddenly popped up in my unit tests. I
have a couple of HDFS files in JSON format and my test is basically
creating a JsonRDD and then issuing a very simple SQL query over it. This
used to work fine, but now suddenly I get:
15:58:49.039 [Executor task
Thanks Cheng!
Is there any API I can get access to (e.g. ParquetTableScan) which would allow
me to load up the low-level/base RDD of just RDD[Row], so I could avoid the
defensive copy (maybe losing out on columnar storage, etc.)?
We have parts of our pipeline using SparkSQL/SchemaRDDs and others
Scala for-loops are implemented as closures using anonymous inner classes
which are instantiated once and invoked many times. This means, though,
that the code inside the loop is actually sitting inside a class, which
confuses Spark's Closure Cleaner, whose job is to remove unused references
from
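A common way to avoid accidentally capturing the enclosing object (a generic sketch, not the original poster's code) is to copy the needed value into a local val before using it in the RDD operation:

class Job(sc: org.apache.spark.SparkContext) {
  val threshold = 10
  def run(rdd: org.apache.spark.rdd.RDD[Int]): Long = {
    val localThreshold = threshold           // the closure captures only this value,
    rdd.filter(_ > localThreshold).count()   // not the whole (possibly non-serializable) Job instance
  }
}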
Hi
I want to visualize tasks and stages in order to analyze Spark jobs.
I know the necessary metrics are written to spark.eventLog.dir.
Does anyone know the tool like swimlanes in Tez?
Regards,
Nobuyuki Kuromatsu
Hi,
I'm new to Apache Storm.
I'm receiving data at my UDP port 8060. I want to capture it and perform
some operations in real time, for which I'm using Spark Streaming. While
the code seems to be correct, I get the following output:
https://gist.github.com/d34th4ck3r/0e88896eac864d6d7193
A recent discussion says these won't be public. However there are many
optimized collection libs in Java. My favorite is Koloboke:
https://github.com/OpenHFT/Koloboke/wiki/Koloboke:-roll-the-collection-implementation-with-features-you-need
Carrot HPPC is good too. The only catch is that the
Same problem here...
Did you find a solution for this?
P.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/ScalaReflectionException-when-using-saveAsParquetFile-in-sbt-tp21020p21150.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
I'm quite interested in how Spark's fault tolerance works and I'd like to
ask a question here.
According to the paper, there are two kinds of dependencies--the wide
dependency and the narrow dependency. My understanding is, if the
operations I use are all narrow, then when one machine
Hi experts, I want to consume data from a Kinesis stream using Spark Streaming.
I am trying to create the Kinesis stream using Scala code. Here is my code:
def main(args: Array[String]) {
  println("Stream creation started")
  if (create(2))
    println("Stream is created successfully")
Hi,
I've searched but can't seem to find a PySpark example. How do I write
compressed text file output to S3 using PySpark saveAsTextFile?
Thanks,
Tom
I am also getting the same error after the 1.2 upgrade.
The application is crashing on this line:
rdd.registerTempTable("temp")
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-tp21149p21152.html
Sent from the Apache Spark User List
I found this, which might be useful:
https://github.com/deanwampler/spark-workshop/blob/master/project/Build.scala
It seems that forking is needed.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-tp21149p21153.html
Sent
Aaron,
thanks for your mail!
On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson ilike...@gmail.com wrote:
Scala for-loops are implemented as closures using anonymous inner classes
[...]
While loops, on the other hand, involve none of this trickery, and
everyone is happy.
Ah, I was suspecting
You can always define an RDD transpose function yourself. This is what I use in
PySpark to transpose an RDD of numpy vectors. It’s not optimal and the vectors
need to fit in memory on the worker nodes.
def rddTranspose(rdd):
# add an index to the rows and the columns, result in triplet
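For comparison, a roughly equivalent sketch in Scala (assumes dense rows of equal length and that each output row fits in a worker's memory; this is illustrative, not the original PySpark code):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def rddTranspose(rdd: RDD[Array[Double]]): RDD[Array[Double]] = {
  rdd.zipWithIndex().flatMap { case (row, rowIdx) =>
    // emit (columnIndex, (rowIndex, value)) triplets
    row.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
  }.groupByKey()        // one group per output row (i.e. per original column)
   .sortByKey()
   .map { case (_, entries) =>
     entries.toSeq.sortBy(_._1).map(_._2).toArray   // order values by original row index
   }
}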
I'm interested too and don't know for sure but I do not think this case is
optimized this way. However if you know your keys aren't split across
partitions and you have small enough partitions you can implement the same
grouping with mapPartitions and Scala.
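Something along these lines (a sketch; pairRdd is a hypothetical RDD[(K, V)] whose keys are not split across partitions):

// group within each partition only, avoiding a shuffle
val grouped = pairRdd.mapPartitions { iter =>
  iter.toSeq.groupBy(_._1).mapValues(_.map(_._2)).iterator
}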
On Jan 15, 2015 1:27 AM, Tobias
I have been using Spark SQL in cluster mode and I am noticing no distribution
or parallelization of the query execution. The performance seems to be very
slow compared to native Spark applications and does not offer any speedup
when compared to Hive. I am using Spark 1.1.0 with a cluster of 5
Hi,
I am using Spark 1.2. The Spark master UI has a status page. Is there a web
service on the Spark master that returns the status of the cluster in JSON?
Alternatively, what is the best way to determine if a cluster is up?
Thanks in advance for your assistance!
Shing
Are you using Spark in standalone mode, YARN, or Mesos? If it's YARN,
please mention the Hadoop distribution and version. What Spark distribution
are you using (it seems 1.2.0, but compiled with which Hadoop version)?
Thanks, Aniket
On Thu, Jan 15, 2015, 4:59 PM Hafiz Mujadid
Hi,
I'm trying to use native BLAS, and I followed all the threads I saw here,
but I still can't get rid of these warnings:
WARN netlib.BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
WARN netlib.BLAS: Failed to load implementation from:
I've tried this now. Spark can load multiple avro files from the same
directory by passing a path to a directory. However, passing multiple paths
separated with commas didn't work.
Is there any way to load all avro files in multiple directories using
sqlContext.avroFile?
On Wed, Jan 14, 2015 at
Added fork := true in the Scala build.
Command-line sbt is working fine but the Eclipse Scala IDE is still giving the
same error.
This was all working fine until Spark 1.1.
--
View this message in context:
hi experts!
I have an RDD[String] and I want to add a schema line at the beginning of this rdd.
I know RDDs are immutable. So is there any way to have a new rdd with one
schema line and the contents of the previous rdd?
Thanks
--
View this message in context:
In the spark job server bin folder, you will find the application.conf
file; put
context-settings {
  spark.cassandra.connection.host = your address
}
Hope this works.
--
View this message in context:
Hi,
My Spark job is taking a long time. I see that some tasks take longer than others
for the same amount of data and shuffle read/write. What could be the possible
reasons for this?
The thread dump sometimes shows that all the tasks in an executor are waiting
with the following stack trace -
Executor task
Hi,
Before saving the rdd, do a collect on it and print its contents.
Probably it contains a null value.
Thanks.
On Sat, Jan 3, 2015 at 5:37 PM, Pankaj Narang [via Apache Spark User List]
ml-node+s1001560n20953...@n3.nabble.com wrote:
If you can paste the code here I can certainly
Found a setting that seems to fix this problem, but it does not seem to be
available until Spark 1.3. See
https://issues.apache.org/jira/browse/SPARK-2165
However, glad to see a work is being done with the issue.
On Tue, Jan 13, 2015 at 8:00 PM, Anders Arpteg arp...@spotify.com wrote:
Yes
Hi,
Setup is as follows
Hadoop Cluster 2.3.0 (CDH5.0)
- Namenode HA
- Resource manager HA
- Secured with Kerberos
Spark 1.2
Run SparkPi as follows
- conf/spark-defaults.conf has following entries
spark.yarn.queue myqueue
spark.yarn.access.namenodes hdfs://namespace (remember this is namenode
Sure there is. Create a new RDD just containing the schema line (hint: use
sc.parallelize) and then union both the RDDs (the header RDD and data RDD)
to get a final desired RDD.
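For example (a sketch; the header string, dataRDD, and output path are hypothetical):

val headerRDD = sc.parallelize(Seq("id,name,age"))     // RDD containing only the schema line
val withHeader = headerRDD.union(dataRDD)              // dataRDD is the existing RDD[String]
withHeader.saveAsTextFile("hdfs:///tmp/with-header")   // header lands in the first partition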
On Thu Jan 15 2015 at 19:48:52 Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
hi experts!
I hav an RDD[String] and i
If you don't want a few slow tasks to slow down the entire job, you can turn on
speculation.
Here are the speculation settings from Spark Configuration - Spark 1.2.0
Documentation.
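The relevant properties (values shown are the Spark 1.2 defaults, except spark.speculation itself, which is off by default) can go in spark-defaults.conf or a SparkConf:

spark.speculation             true    # enable speculative re-launch of slow tasks
spark.speculation.interval    100     # how often (ms) to check for tasks to speculate
spark.speculation.multiplier  1.5     # how many times slower than the median before speculating
spark.speculation.quantile    0.75    # fraction of tasks that must finish before speculation starts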
Hi,
You can take the schema line in another rdd and then do a union of the two
rdds.
List<String> schemaList = new ArrayList<String>();
schemaList.add("xyz");
// where xyz is your schema line
JavaRDD<String> schemaRDDString = sc.parallelize(schemaList);
// where sc is your JavaSparkContext
JavaRDD<String> newRDDString =
Check the number of partitions in your input. It may be much less than
the available parallelism of your small cluster. For example, input
that lives in just 1 partition will spawn just 1 task.
Beyond that, parallelism just happens. You can see the parallelism of
each operation in the Spark UI.
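For example (a sketch; the path and partition count are arbitrary):

val input = sc.textFile("hdfs:///data/input")   // hypothetical input
println(input.partitions.size)                  // how many tasks the first stage will get
val widened = input.repartition(8)              // spread work across more tasks on a small cluster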
Maybe you are saying you already do this, but it's perfectly possible
to process as many RDDs as you like in parallel on the driver. That
may allow your current approach to eat up as much parallelism as you
like. I'm not sure if that's what you are describing with "submit
multi applications", but you
Thanks to Aniket's work there are two new options in the EMR install script for
Spark. See
https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md
The “-a” option can be used to bump the spark-assembly to the front of the
classpath.
-Christopher
From: Aniket Bhatnagar