Hi Subshiri,
You may find these 2 blog posts useful:
https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html
https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-spark.html
On Tue, Sep 22, 2015 at 11:54 PM, Subshiri S
Have you seen the Spark SQL paper?
https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
On Thu, Aug 20, 2015 at 11:35 PM, Dawid Wysakowicz
wysakowicz.da...@gmail.com wrote:
Hi,
Thanks for the answers. I have read the answers you provided, but I'm
rather looking for some materials on the
Hi Sunil,
Have you seen this fix in Spark 1.5? It may address the locality issue:
https://issues.apache.org/jira/browse/SPARK-4352
On Thu, Aug 20, 2015 at 4:09 AM, Sunil sdhe...@gmail.com wrote:
Hello. I am seeing some unexpected issues with achieving HDFS data
locality. I expect the
Your point #1 is a bit misleading:
(1) The mappers are not executed in parallel when processing
independently the same RDD.
To clarify, I'd say: within one stage of execution, when pipelining occurs,
the mappers applied to the same RDD partition are executed sequentially
within a single task, not in parallel.
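Here's a minimal Scala sketch of what I mean (the input path and functions
are made up):

val rdd = sc.textFile("hdfs:///data/input")  // hypothetical path
// These two narrow transformations are pipelined into one stage: for each
// element of a partition, the two functions run back-to-back inside a
// single task, not as two parallel waves of mappers.
val parsed = rdd.map(line => line.split(","))
val lengths = parsed.map(fields => fields.length)
// Parallelism comes from running one such task per partition.
lengths.count()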
On Thu,
Hi Igor Nirandap,
There is a setting in Spark, spark.executor.cores (the --executor-cores
flag to spark-submit), that you should look into. This number sets how many
concurrent task threads run in each Executor JVM. The name of the setting
is a bit misleading: you don't have to match the executor's core count to
the actual number of CPU cores
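As a sketch, here's how it can be set programmatically (the values are
placeholders):

import org.apache.spark.{SparkConf, SparkContext}
// spark.executor.cores = number of concurrent task threads per Executor JVM
val conf = new SparkConf()
  .setAppName("cores-example")
  .set("spark.executor.cores", "8")  // 8 task threads per Executor
  .set("spark.cores.max", "32")      // cap on total cores for the app (Standalone)
val sc = new SparkContext(conf)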
Hi Marius,
Are you using the sort or hash shuffle?
Also, do you have the external shuffle service enabled (so that the Worker
JVM or NodeManager can still serve the map spill files after an Executor
crashes)?
How many partitions are in your RDDs before and after the problematic
shuffle
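In case it helps, a minimal sketch of the settings I'm asking about (values
are illustrative):

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")          // "sort" or "hash"; sort is the default since 1.2
  .set("spark.shuffle.service.enabled", "true")  // external shuffle service on the Worker/NodeManager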
pradhandeep1...@gmail.com
wrote:
How is task slot different from # of Workers?
"so don't read into any performance metrics you've collected to
extrapolate what may happen at scale."
I did not understand this part.
Thank You
On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui same...@databricks.com
In general you should first figure out how many task slots are in the
cluster and then repartition the RDD to maybe 2x that number. So if you have
100 slots, then RDDs with a partition count of 100-300 would be normal.
But the size of each partition also matters. You want a task to operate on a
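To make the slot math concrete, a rough Scala sketch (myRDD is a
hypothetical stand-in for your RDD):

val myRDD = sc.textFile("hdfs:///data/input")  // stand-in for your RDD
val slots = sc.defaultParallelism             // rough proxy for total task slots
val target = slots * 2                        // aim for ~2x the slot count
val repartitioned = myRDD.repartition(target)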
Do you guys have dynamic allocation turned on for YARN?
Anders, was Task 450 in your job acting like a Reducer and fetching the Map
spill output data from a different node?
If a Reducer task can't read the remote data it needs, that could cause the
stage to fail. Sometimes this forces the
TaskSetManager: Stage 1 contains a task of very
large size (9766 KB). The maximum recommended task size is 100 KB.
[1, 2, 3, 4, 5, 6, 7, 8, 9]
On Mon, Dec 15, 2014 at 1:33 PM, Sameer Farooqui same...@databricks.com
wrote:
Hi Genesis,
The 2nd case did work for me:
a = [1,2,3,4,5,6,7,8,9]
How much executor memory are you setting for the Executor JVM? What about
the Driver JVM's memory?
Also check the Windows Event Log for out-of-memory errors from either of
the two JVMs above.
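For reference, a sketch of where those two sizes are set (values are
placeholders):

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")  // heap for each Executor JVM
  .set("spark.driver.memory", "2g")    // heap for the Driver JVM (must be set before the Driver starts)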
On Dec 14, 2014 6:04 AM, genesis fatum genesis.fa...@gmail.com wrote:
Hi,
My environment is: standalone spark 1.1.1 on
Instead of doing this on the compute side, I would just write the file out
with its separate blocks into HDFS initially, and then use hadoop fs
-getmerge or HDFSConcat to produce one final output file.
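Something along these lines (the RDD and paths are illustrative):

val rdd = sc.parallelize(1 to 100)        // stand-in for your RDD
rdd.saveAsTextFile("hdfs:///out/result")  // writes one part-file per partition
// then merge the part-files outside of Spark:
//   hadoop fs -getmerge /out/result /local/result.txt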
- SF
On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis lordjoe2...@gmail.com wrote:
I have an RDD
Hi,
FYI - There are no Worker JVMs used when Spark is launched under YARN.
Instead the NodeManager in YARN does what the Worker JVM does in Spark
Standalone mode.
For YARN you'll want to look into the following settings:
--num-executors: controls how many executors will be allocated
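As a sketch, those spark-submit flags map to conf keys like these (values
are examples):

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.executor.instances", "10")  // --num-executors
  .set("spark.executor.cores", "4")       // --executor-cores
  .set("spark.executor.memory", "4g")     // --executor-memory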
Is the OOM happening in the Driver JVM or one of the Executor JVMs? What is
the memory size of each JVM?
How large is the data you're trying to broadcast? If it's large enough, you
may want to consider just persisting the data to distributed storage (like
HDFS) and read it in through the normal read RDD
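A minimal sketch of that alternative (path and RDD name are made up):

val bigData = sc.parallelize(1 to 1000000)        // stand-in for the data you were broadcasting
bigData.saveAsTextFile("hdfs:///shared/lookup")   // persist the large dataset once
val lookup = sc.textFile("hdfs:///shared/lookup") // read it back as a plain RDD when needed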
Just an FYI - I can submit the SparkPi app to YARN in cluster mode on a
1-node m3.xlarge EC2 instance and the app finishes running successfully in
about 40 seconds. I just figured the 30-40 sec run time was normal b/c of
the submission overhead that Andrew mentioned.
Denny, you can
Hi Ron,
Out of curiosity, why do you think that union is modifying an existing RDD
in place? In general all transformations, including union, will create new
RDDs, not modify old RDDs in place.
Here's a quick test:
scala> val firstRDD = sc.parallelize(1 to 5)
firstRDD: org.apache.spark.rdd.RDD[Int] = ...
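Continuing that session, here's roughly what I'd expect to see (exact
object details vary by REPL session):

scala> val secondRDD = sc.parallelize(6 to 10)
scala> val unioned = firstRDD.union(secondRDD)
scala> firstRDD.collect()
res0: Array[Int] = Array(1, 2, 3, 4, 5)                  // firstRDD is untouched
scala> unioned.collect()
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)  // union produced a new RDD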
On Tue, Dec 2, 2014 at 3:43 PM, Sameer Farooqui same...@databricks.com
wrote:
Have you tried taking thread dumps via the UI? There is a link to do so
on the Executors page (typically under http://<driver IP>:4040/executors).
By visualizing the thread call stack of the executors with slow running
In general, most use cases don't need the RDD to be replicated in memory
multiple times; it would be a rare exception to do this. If it's really
expensive (time consuming) to recompute a lost partition, or if the use
case is extremely time sensitive, then maybe you could replicate it in
memory.
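If you do need it, a one-line sketch using the replicated storage levels
(the RDD is a stand-in):

import org.apache.spark.storage.StorageLevel
val rdd = sc.parallelize(1 to 100)       // stand-in for your RDD
rdd.persist(StorageLevel.MEMORY_ONLY_2)  // keeps two in-memory replicas of each partition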
Are you running Spark in Local or Standalone mode? In either mode, you
should be able to hit port 4040 (to see the Spark
Jobs/Stages/Storage/Executors UI) on the machine where the driver is
running. However, in local mode, you won't have a Spark Master UI on 8080
or a Worker UI on 8081.
You can
Have you tried taking thread dumps via the UI? There is a link to do so on
the Executors page (typically under http://<driver IP>:4040/executors).
By visualizing the thread call stack of the executors with slow running
tasks, you can see exactly what code is executing at an instant in time. If
you
Hi Sunita,
This gitbook may also be useful for you to get Spark running in local mode
on your Windows machine:
http://blueplastic.gitbooks.io/how-to-light-your-spark-on-a-stick/content/
On Tue, Nov 25, 2014 at 11:09 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You could try following this
Hi Shahab,
Are you running Spark in Local, Standalone, YARN or Mesos mode?
If you're running in Standalone/YARN/Mesos, then the .count() action is
indeed automatically parallelized across multiple Executors.
When you run a .count() on an RDD, it is actually distributing tasks to
different
By the way, in case you haven't done so, do try to .cache() the RDD before
running a .count() on it as that could make a big speed improvement.
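A quick sketch of the pattern (the path is made up):

val data = sc.textFile("hdfs:///some/path")
data.cache()   // marks the RDD for caching; nothing is materialized yet
data.count()   // first action computes the RDD and populates the cache
data.count()   // later actions read from memory and should run much faster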
On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui same...@databricks.com
wrote:
Hi Shahab,
Are you running Spark in Local, Standalone, YARN
Hey Stuart,
The RDD won't show up under the Storage tab in the UI until it's been
cached. Basically Spark doesn't know what the RDD will look like until it's
cached, b/c up until then the RDD is just on disk (external to Spark). If
you launch some transformations + an action on an RDD that is
the trigger on my original email. I should have
added that I tried using persist() and cache() but no joy.
I'm doing this:
data = sc.textFile(somedata)
data.cache
data.count()
but I still can't see anything in the storage?
On 31 October 2014 10:42, Sameer Farooqui same...@databricks.com
That does seem a bit odd. How many Executors are running under this Driver?
Does the spark-submit process start out using ~60GB of memory right away,
or does it start smaller and slowly build up to that level? If so, how long
does it take to get there?
Also, which version of Spark are you
Hi Saiph,
Patrick McFadin and Helena Edelson from DataStax taught a tutorial at NYC
Strata last week where they created a prototype Spark Streaming + Kafka
application for time series data.
You can see the code here:
https://github.com/killrweather/killrweather
On Tue, Oct 21, 2014 at 4:33 PM,
Hi Keith,
It would be helpful if you could post the error message.
Are you running Spark in Standalone mode or with YARN?
In general, the Spark Master is only used for scheduling and it should be
fine with the default setting of 512 MB RAM.
Is it actually the Spark Driver's memory that you
Hi Sadhan,
Which port are you specifically trying to redirect? The driver program has
a web UI, typically on port 4040, and the Spark Standalone Master has a web
UI exposed on port 8080 (7077 is the Master's cluster port, not a UI).
Which setting did you update in which file to make this change?
And finally, which version of Spark are
Hi Shailesh,
Spark just leverages the Hadoop File Output Format to write out the RDD you
are saving.
This is really a Hadoop OutputFormat limitation, which requires that the
directory it writes into does not already exist. The idea is that a Hadoop
job should not be able to overwrite the results from a
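One common workaround is to delete the output directory yourself before
saving; a sketch using the Hadoop FileSystem API (RDD and path are
illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}
val rdd = sc.parallelize(1 to 100)  // stand-in for the RDD you're saving
val out = new Path("hdfs:///out/result")
val fs = FileSystem.get(sc.hadoopConfiguration)
if (fs.exists(out)) fs.delete(out, true)  // recursive delete
rdd.saveAsTextFile(out.toString)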