local[3] spawns 3 threads on 1 core :)
Thanks
Best Regards
On Fri, Feb 20, 2015 at 12:50 PM, bit1...@163.com bit1...@163.com wrote:
Thanks Akhil, you are right.
I checked and find that I have only 1 core allocated to the program
I am running on a virtual machine, and have only allocated one core
Hi,
What file format is used to write files during the shuffle write?
Is it dependent on the spark shuffle manager or output format?
Is it possible to change the file format for shuffle, irrespective of the
output format of the file?
Thanks,
Twinkle
Hi All,
Thanks for your answers.
I have one more detail to point out.
It is clear now how the partition number is defined for an HDFS file.
However, what if I have my dataset replicated on all the machines at the same
absolute path?
In that case each machine has, for instance, an ext3 filesystem.
If I load
On Thu, Feb 19, 2015 at 2:49 PM, John Omernik j...@omernik.com wrote:
I am running Spark on Mesos and it works quite well. I have three
users, all who setup iPython notebooks to instantiate a spark instance
to work with on the notebooks. I love it so far.
Since I am auto instantiating (I
None of this really points to the problem. These indicate that workers
died but not why. I'd first go locate executor logs that reveal more
about what's happening. It sounds like a hard-er type of failure, like
JVM crash or running out of file handles, or GC thrashing.
On Fri, Feb 20, 2015 at
On Spark 1.2:
I am trying to capture # records read from a kafka topic:
val inRecords = ssc.sparkContext.accumulator(0, "InRecords")
..
kInStreams.foreach( k =>
{
  k.foreachRDD ( rdd => inRecords += rdd.count().toInt )
  inRecords.value
Question
Hello
Question regarding the new DataFrame API introduced here
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
I oftentimes use the zipWithUniqueId method of the SchemaRDD (as an RDD) to
replace string keys with more efficient long keys.
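For illustration, a rough sketch of that pattern (the data and names are made up, and sc is assumed to be an existing SparkContext as in spark-shell):
  // hypothetical (stringKey, value) pairs
  val rows = sc.parallelize(Seq(("user-a", 1.0), ("user-b", 2.0), ("user-a", 3.0)))
  // one unique Long per distinct string key
  val ids = rows.keys.distinct().zipWithUniqueId()
  // re-key the data with the Long surrogate keys
  val withLongKeys = rows.join(ids).map { case (_, (value, id)) => (id, value) }
  withLongKeys.collect().foreach(println)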
Hello Baris,
Giving your complete source code (if not very long, or maybe via
https://gist.github.com/) could be more helpful.
Also telling which Spark version you use, on which file system, and how you
run your application, together with any log / output info it produces
might make
Although I don't know if it's related, the Class.forName() method of
loading drivers is very old. You should be using DataSource and
javax.sql; this has been the usual practice since about Java 1.4.
Why do you say a different driver is being loaded? That's not the error here.
Try instantiating
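For example, a minimal check along these lines (connection details are made up; the point is only that a JDBC 4 driver on the classpath is found by DriverManager without Class.forName):
  import java.sql.DriverManager
  val conn = DriverManager.getConnection("jdbc:postgresql://dbhost:5432/mydb", "user", "secret")
  try {
    println(conn.getMetaData.getDriverVersion)  // prints the version of the driver that was actually loaded
  } finally {
    conn.close()
  }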
Hello,
Could you please add Big Industries to the Powered by Spark page at
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ?
Company Name: Big Industries
URL: http://www.bigindustries.be/
Spark Components: Spark Streaming
Use Case: Big Content Platform
Summary:
Thanks Akhil.
From: Akhil Das
Date: 2015-02-20 16:29
To: bit1...@163.com
CC: user
Subject: Re: Re: Spark streaming doesn't print output when working with
standalone master
local[3] spawns 3 threads on 1 core :)
Thanks
Best Regards
On Fri, Feb 20, 2015 at 12:50 PM, bit1...@163.com
Hi there,
I am trying to increase the number of executors per worker in standalone mode,
but I have failed to achieve that.
I followed a bit the instructions of this thread:
http://stackoverflow.com/questions/26645293/spark-configuration-memory-instance-cores
and did that:
spark.executor.memory 1g
Hello,
We are building a Spark Streaming application that listens to a directory
on HDFS, and uses the SolrJ library to send newly detected files to a Solr
server. When we put 10.000 files to the directory it is listening to, it
starts to process them by sending the files to our Solr server but
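The directory-watching part of such an application is usually just textFileStream; a minimal sketch (the path is hypothetical, the SolrJ call is only indicated by a comment, and ssc is assumed to be an existing StreamingContext):
  val newFiles = ssc.textFileStream("hdfs:///incoming/docs")
  newFiles.foreachRDD { rdd =>
    rdd.foreachPartition { lines =>
      // build SolrInputDocuments from the lines and send them to the Solr server here
      lines.foreach(line => ())
    }
  }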
Hi
Since we're approaching the GSOC2015 application process I have some
questions:
1) Will your organization be a part of GSOC2015 and what are the projects
that you will be interested in?
2) Since I'm not a contributor to apache spark, what are some starter tasks
I can work on to gain facility
Well, I understand the math (having two vectors), but the python
MatrixFactorizationModel object seems to be just a wrapper around a java class, so
I am not sure how to extract the two RDDs? thx, Antony.
On Thursday, 19 February 2015, 16:32, Ilya Ganelin ilgan...@gmail.com
wrote:
Yep. the
Thanks you all. Just changing RDD to Map structure saved me approx. 1
second.
Yes, I will check out IndexedRDD to see if it has better performance.
best,
/Shahab
On Thu, Feb 19, 2015 at 6:38 PM, Burak Yavuz brk...@gmail.com wrote:
If your dataset is large, there is a Spark Package called
I definitely delete the file on the right HDFS, I only have one HDFS instance.
The problem seems to be in the CassandraRDD - reading always fails in some way
when run on the cluster, but single-machine reads are okay.
On Feb 20, 2015, at 4:20 AM, Ilya Ganelin ilgan...@gmail.com wrote:
The
Baris,
I've tried the following piece of code:
https://gist.github.com/emres/10c509c1d69264fe6fdb
and built it using
sbt package
and then submitted it via
spark-submit --class
org.apache.spark.examples.mllib.StreamingLinearRegression --master local[4]
Awesome! This is exactly what I'd need. Unfortunately, I am not a
programmer of any talent or skill, but how could I assist with this
JIRA? From a User perspective, this is really the next step for my org
taking our Mesos cluster to user land with Spark. I don't want to be
pushy, but is there any
Hi,
Probably this is a silly question, but I couldn't find any clear
documentation explaining why one sees "Submitting ... missing tasks from
Stage ..." in the logs.
Especially in my case, when I do not have any failure in job execution, I
wonder why this should happen.
Does it have any relation to
Hi Emre,
Have you tried adjusting these:
.set("spark.akka.frameSize", "500").set("spark.akka.askTimeout",
"30").set("spark.core.connection.ack.wait.timeout", "600")
-Todd
On Fri, Feb 20, 2015 at 8:14 AM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello,
We are building a Spark Streaming application that
For a given batch, for a given partition, the messages will be processed in
order by the executor that is running that partition. That's because
messages for the given offset range are pulled by the executor, not pushed
from some other receiver.
If you have speculative execution, yes, another
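As a rough sketch of relying on that (stream creation is omitted; stream stands for an already-created direct Kafka stream of (key, value) pairs, and the handler is a placeholder):
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // records from a single Kafka partition are iterated here in offset order
      records.foreach { case (_, value) => println(value) }
    }
  }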
Hi
1. *To get a taste* of my talk at the 2015 Hadoop Summit, please find below a
few links to a similar talk that I gave at the Chicago Hadoop Users Group on
‘*Transitioning Compute Models: Apache MapReduce to Spark*’ on February 12,
2015 in front of 185 attendees:
- Video Recording:
Hi Antony, Is it easy for you to try Spark 1.3.0 or master? The ALS
performance should be improved in 1.3.0. -Xiangrui
On Fri, Feb 20, 2015 at 1:32 PM, Antony Mayi
antonym...@yahoo.com.invalid wrote:
Hi Ilya,
thanks for your insight, this was the right clue. I had default parallelism
already
Hi,
I am new to Spark and I think I missed something very basic.
I have the following use case (I use Java and run Spark locally on my
laptop):
I have a JavaRDD<String[]>
- The RDD contains around 72,000 arrays of strings (String[])
- Each array contains 80 words (on average).
What I want to
Thanks Marcelo! I will try to change the log4j.properties
On Fri, Feb 20, 2015 at 11:37 AM, Marcelo Vanzin van...@cloudera.com
wrote:
Hi Anny,
You could play with creating your own log4j.properties that will write
the output somewhere else (e.g. to some remote mount, or remote
syslog).
Correction,
should be HADOOP_CONF_DIR=/etc/hive/conf spark-shell --driver-class-path
'/data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*'
--driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/
parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*'
On Fri, Feb 20, 2015
Correction,
should be HADOOP_CONF_DIR=/etc/hive/conf --driver-class-path
'/data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*'
--driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/
parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*'
On Fri, Feb 20, 2015 at 3:43 PM,
A single vector of size 10^7 won't hit that bound. How many clusters
did you set? The broadcast variable size is 10^7 * k and you can
calculate the amount of memory it needs. Try to reduce the number of
tasks and see whether it helps. -Xiangrui
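For a rough sense of scale, the broadcast of the centers is roughly dimension * k * 8 bytes for double values; a back-of-envelope sketch with a hypothetical k:
  val dimension = 1e7
  val k = 100                                  // hypothetical number of clusters
  println(s"${dimension * k * 8 / 1e9} GB")    // ~8 GB for k = 100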
On Tue, Feb 17, 2015 at 7:20 PM, lihu
No problem, Antony. ML lib is tricky! I'd love to chat with you about your
use case - sounds like we're working on similar problems/scales.
On Fri, Feb 20, 2015 at 1:55 PM Xiangrui Meng men...@gmail.com wrote:
Hi Antony, Is it easy for you to try Spark 1.3.0 or master? The ALS
performance
That worked perfectly...thanks so much!
On Fri, Feb 20, 2015 at 3:49 PM, Sourigna Phetsarath
gna.phetsar...@teamaol.com wrote:
Correction,
should be HADOOP_CONF_DIR=/etc/hive/conf spark-shell --driver-class-path
'/data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*'
Hi Sandy,
I appreciate your clear explanation. Let me try again. It's the best way to
confirm I understand.
spark.executor.memory + spark.yarn.executor.memoryOverhead = the memory
that YARN will create a JVM
spark.executor.memory = the memory I can actually use in my jvm application
= part of
Try it without
--master yarn-cluster
if you are trying to run a spark-shell. :)
On Fri, Feb 20, 2015 at 3:18 PM, chirag lakhani chirag.lakh...@gmail.com
wrote:
I tried
spark-shell --master yarn-cluster --driver-class-path
Oh no worries at all. If you want, I'd be glad to make updates and PR for
anything I find, eh?!
On Fri, Feb 20, 2015 at 12:18 Michael Armbrust mich...@databricks.com
wrote:
Yeah, sorry. The programming guide has not been updated for 1.3. I'm
hoping to get to that this weekend / next week.
Hi,
I am new to Spark, and I am trying to test the Spark SQL performance vs Hive. I
setup a standalone box, with 24 cores and 64G memory.
We have one SQL in mind to test. Here is the basic setup on this one box
for the SQL we are trying to run:
1) Dataset 1, 6.6G AVRO file with snappy
Thanks! I am able to login to Spark now but I am still getting the same
error
scala> sqlContext.sql("FROM analytics.trainingdatafinal SELECT
*").collect().foreach(println)
15/02/20 14:40:22 INFO ParseDriver: Parsing command: FROM
analytics.trainingdatafinal SELECT *
15/02/20 14:40:22 INFO
Also, you might want to add the hadoop configs:
HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf --driver-class-path
'/data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*'
--driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/
That's all correct.
-Sandy
On Fri, Feb 20, 2015 at 1:23 PM, Kelvin Chu 2dot7kel...@gmail.com wrote:
Hi Sandy,
I appreciate your clear explanation. Let me try again. It's the best way
to confirm I understand.
spark.executor.memory + spark.yarn.executor.memoryOverhead = the memory
that
Hi Ilya,
thanks for your insight, this was the right clue. I had default parallelism
already set but it was quite low (hundreds) and moreover the number of
partitions of the input RDD was low as well so the chunks were really too big.
Increased parallelism and repartitioning seems to be
Are you using the capacity scheduler or fifo scheduler without multi
resource scheduling by any chance?
On Thu, Feb 12, 2015 at 1:51 PM, Anders Arpteg arp...@spotify.com wrote:
The nm logs only seems to contain similar to the following. Nothing else
in the same time range. Any help?
yeah, this is just the totally normal message when spark executes
something. The first time something is run, all of its tasks are
missing. I would not worry about cases when all tasks aren't missing
if you're new to spark; it's probably an advanced concept that you don't
care about (and would
I didn’t get any response. It’d be really appreciated if anyone using a special
OutputCommitter for S3 can comment on this!
Thanks,
Mingyu
From: Mingyu Kim m...@palantir.commailto:m...@palantir.com
Date: Monday, February 16, 2015 at 1:15 AM
To: user@spark.apache.orgmailto:user@spark.apache.org
Thanks for the suggestions.
I'm experimenting with different values for spark memoryOverhead and
explicitly giving the executors more memory, but still have not found the
golden medium to get it to finish in a proper time frame.
Is my cluster massively undersized at 5 boxes, 8gb 2cpu ?
Trying to
We (Databricks) use our own DirectOutputCommitter implementation, which is
a couple tens of lines of Scala code. The class would almost entirely be a
no-op except we took some care to properly handle the _SUCCESS file.
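For anyone curious, a bare-bones sketch of the idea (not the actual Databricks class, and without the _SUCCESS handling mentioned above):
  import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}
  // tasks write directly to the final output location, so there is nothing to move on commit
  class DirectOutputCommitter extends OutputCommitter {
    override def setupJob(jobContext: JobContext): Unit = {}
    override def setupTask(taskContext: TaskAttemptContext): Unit = {}
    override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
    override def commitTask(taskContext: TaskAttemptContext): Unit = {}
    override def abortTask(taskContext: TaskAttemptContext): Unit = {}
  }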
On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim m...@palantir.com wrote:
I
Is there a check you can put in place to not create pairs that aren't in
your set of 20M pairs? Additionally, once you have your arrays converted to
pairs you can do aggregateByKey with each pair being the key.
On Feb 20, 2015 1:57 PM, shlomib shl...@summerhq.com wrote:
Hi,
I am new to Spark
Hi,
In the spark streaming application, I write the code
FlumeUtils.createStream(ssc, "localhost", <port>), which means spark will listen on
the port, and wait for the Flume Sink to write to it.
My question is: when I submit the application to the Spark Standalone cluster,
will the port be opened only
Is there a technique for forcing the evaluation of an RDD?
I have used actions to do so but even the most basic count has a
non-negligible cost (even on a cached RDD, repeated calls to count take
time).
My use case is for logging the execution time of the major components in my
application. At
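One common trick (just a sketch, with sc assumed to be an existing SparkContext) is a no-op foreach, which runs a job that touches every element but does no per-record work:
  val rdd = sc.parallelize(1 to 1000000).map(_ * 2).cache()
  val start = System.nanoTime()
  rdd.foreach(_ => ())                                        // forces computation (and caching) of every partition
  println(s"materialized in ${(System.nanoTime() - start) / 1e6} ms")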
Hello,
I have a few tasks in a stage with lots of tasks that have a large amount
of shuffle spill.
I scouted the web to understand shuffle spill, and I did not find any
simple explanation of the spill mechanism. What I put together is:
1. the shuffle spill can happen when the shuffle is
Sean,
I know that Class.forName is not required since Java 1.4 :-) It was just a
desperate attempt to make sure that the Postgres driver is getting loaded.
Since Class.forName("org.postgresql.Driver") is not throwing an exception, I
assume that the driver is available in the classpath. Is that
You may also consider whether your use case really needs a very strict order,
because configuring spark so that it supports such a strict order means
rendering most of its benefits useless (failure handling, parallelism etc.).
Usually, in a distributed setting you can order events, but this also means
that
Thanks Jorn. Indeed, we do not need global ordering, since our data is
partitioned well. We do not need ordering based on wallclock time, that
would require waiting indefinitely. All we need is the execution of
batches (not job submission) to happen in the same order they are
generated, which
Thanks for the detailed response Cody. Our use case is to do some external
lookups (cached and all) for every event, match the event against the
looked up data, decide whether to write an entry in mysql and write it in
the order in which the events arrived within a kafka partition.
We don't need
There is typically some slack between when a batch finishes executing and
when the next batch is scheduled. You should be able to arrange your batch
sizes / cluster resources to ensure that. If there isn't slack, your
overall delay is going to keep increasing indefinitely.
If you're inserting
Has anyone found a solution to this? I was able to reproduce it here
http://stackoverflow.com/questions/28576439/getting-nosuchmethoderror-when-setting-up-spark-graphx-graph
but I'm unable to resolve it.
Hi!
I am trying to persist RDD in Avro format with spark API. I wonder if
someone has any experience or suggestions.
My converter with example can be viewed here
https://github.com/daria-sukhareva/spark/commit/2ba7b213572d6ce2056cfc2536b701ae689c7f98
and relevant question here
A bit more context on this issue. From the container logs on the executor
Given my cluster specs above what would be appropriate parameters to pass
into :
--num-executors --num-cores --executor-memory
I had tried it with --executor-memory 2500MB
015-02-20 06:50:09,056 WARN
Are you specifying the executor memory, cores, or number of executors
anywhere? If not, you won't be taking advantage of the full resources on
the cluster.
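For example, something along these lines, with the numbers treated as placeholders to be sized to the actual nodes:
  spark-submit --master yarn-cluster \
    --num-executors 4 \
    --executor-cores 2 \
    --executor-memory 2g \
    ...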
-Sandy
On Fri, Feb 20, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote:
None of this really points to the problem. These indicate
Have a look at spark.yarn.user.classpath.first and
spark.files.userClassPathFirst for a possible way to give your copy of
the libs precedence.
On Fri, Feb 20, 2015 at 5:20 PM, Mohammed Guller moham...@glassbeam.com wrote:
Sean,
I know that Class.forName is not required since Java 1.4 :-) It was
Hi Mohammed,
thanks a lot for the reply.
Ok, so from what I understand I cannot control the number of executors per
worker in standalone cluster mode.
Is that correct?
BR
On 20 February 2015 at 17:46, Mohammed Guller moham...@glassbeam.com
wrote:
SPARK_WORKER_MEMORY=8g
Will allocate 8GB
If that's the error you're hitting, the fix is to boost
spark.yarn.executor.memoryOverhead, which will put some extra room in
between the executor heap sizes and the amount of memory requested for them
from YARN.
-Sandy
On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:
A
SPARK_WORKER_MEMORY=8g
Will allocate 8GB memory to Spark on each worker node. Nothing to do with # of
executors.
Mohammed
From: Yiannis Gkoufas [mailto:johngou...@gmail.com]
Sent: Friday, February 20, 2015 4:55 AM
To: user@spark.apache.org
Subject: Setting the number of executors in standalone
Hi all,
Wanted to let you know I've forked PySpark Cassandra on
https://github.com/TargetHolding/pyspark-cassandra. Unfortunately the
original code didn't work for me and I couldn't figure out how it could
work. But it inspired! so I rewrote the majority of the project.
The rewrite implements
Hi,
I have a use case for creating a DStream from a single file. I have created
a custom receiver that reads the file, calls 'store' with the contents, then
calls 'stop'. However, I'm second guessing if this is the correct approach
due to the spark logs I see.
I always see these logs, and the
AFAIK, in stand-alone mode, each Spark application gets one executor on each
worker. You could run multiple workers on a machine though.
Mohammed
From: Yiannis Gkoufas [mailto:johngou...@gmail.com]
Sent: Friday, February 20, 2015 9:48 AM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject:
SPARK_CLASSPATH has been deprecated since 1.0. In any case, I tried and it
didn't work since it appends to the classpath. I need something that prepends
to the classpath.
Mohammed
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Friday, February 20, 2015 10:08 AM
Quickly reviewing the latest SQL Programming Guide
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
(in github) I had a couple of quick questions:
1) Do we need to instantiate the SparkContext as per
// sc is an existing SparkContext.
val sqlContext = new
It looks like spark.files.userClassPathFirst gives precedence to user libraries
only on the worker nodes. Is there something similar to achieve the same
behavior on the master?
BTW, I am running Spark in stand-alone mode.
Mohammed
-Original Message-
From: Sean Owen
Hm, others can correct me if I'm wrong, but is this what SPARK_CLASSPATH is for?
On Fri, Feb 20, 2015 at 6:04 PM, Mohammed Guller moham...@glassbeam.com wrote:
It looks like spark.files.userClassPathFirst gives precedence to user
libraries only on the worker nodes. Is there something similar
Hi,
Currently, there is only one executor per worker. There is a jira ticket to
relax this:
https://issues.apache.org/jira/browse/SPARK-1706
But, if you want to use more cores, maybe, you can try increasing
SPARK_WORKER_INSTANCES. It increases the number of workers per machine.
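For example, a spark-env.sh sketch (the numbers are made up; size them to the machine):
  export SPARK_WORKER_INSTANCES=2   # two workers per machine, so up to two executors per application per machine
  export SPARK_WORKER_CORES=2       # cores each worker can hand out
  export SPARK_WORKER_MEMORY=4g     # memory each worker can hand out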
Take a look here:
Hi Anny,
You could play with creating your own log4j.properties that will write
the output somewhere else (e.g. to some remote mount, or remote
syslog). Sorry, but I don't have an example handy.
Alternatively, if you can use Yarn, it will collect all logs after the
job is finished and make them
Hi,
I am wondering if there's some way to direct some of the worker stdout to
one place instead of leaving it in each worker's stdout. For example, I have the
following code
following code
RDD.foreach{ line =>
  try {
    // do something
  } catch {
    case e: Exception => println(line)
  }
}
Every time I want to check what's causing
We have a sophisticated Spark Streaming application that we have been using
successfully in production for over a year to process a time series of
events. Our application makes novel use of updateStateByKey() for state
management.
We now have the need to perform exactly the same processing on
I am trying to access a hive table using spark sql but I am having
trouble. I followed the instructions in a cloudera community board which
stated
1) Import hive jars into the class path
export SPARK_CLASSPATH=$(find
/data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/ -name
Hi Sandy,
I am also doing memory tuning on YARN. Just want to confirm, is it correct
to say:
spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I
can actually use in my jvm application
If it is not, what is the correct relationship? Any other variables or
config parameters
Hi Kelvin,
spark.executor.memory controls the size of the executor heaps.
spark.yarn.executor.memoryOverhead is the amount of memory to request from
YARN beyond the heap size. This accounts for the fact that JVMs use some
non-heap memory.
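Concretely, with hypothetical numbers: setting spark.executor.memory=4g and spark.yarn.executor.memoryOverhead=512 gives each executor a 4 GB heap, and the container requested from YARN is roughly 4g + 512m:
  spark-submit --master yarn-cluster \
    --conf spark.executor.memory=4g \
    --conf spark.yarn.executor.memoryOverhead=512 \
    ...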
The Spark heap is divided into
Chirag,
This worked for us:
spark-submit --master yarn-cluster --driver-class-path
'/opt/cloudera/parcels/CDH/lib/hive/lib/*' --driver-java-options
'-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*'
...
Let me know, if you have any issues.
On Fri, Feb 20, 2015 at 2:43
Yeah, sorry. The programming guide has not been updated for 1.3. I'm
hoping to get to that this weekend / next week.
On Fri, Feb 20, 2015 at 9:55 AM, Denny Lee denny.g@gmail.com wrote:
Quickly reviewing the latest SQL Programming Guide
I tried
spark-shell --master yarn-cluster --driver-class-path
'/data/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*'
--driver-java-options
'-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hive/lib/*'
and I get the following error
Error: Cluster