As I understand the matter:
Option 1) has benefits when you think that your network bandwidth may be a
bottleneck, because Spark opens several network connections on possibly
several different physical machines.
Option 2) - as you already pointed out - has the benefit that you occupy
less
Hi Sparkers,
I am trying to create a Hive table in Spark SQL, but I couldn't create it.
Below are the errors generated so far.
java.lang.RuntimeException: java.lang.RuntimeException: Unable to
instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at
To distill this a bit further, I don't think you actually want rdd2 to
wait on rdd1 in this case. What you want is for a request for
partition X to wait if partition X is already being calculated in a
persisted RDD. Otherwise the first partition of rdd2 waits on the
final partition of rdd1 even
In this case, it is slow to wait for rdd1.saveAsHadoopFile(...) to finish,
probably due to writing to HDFS. A workaround for this particular case may be
as follows.
val rdd1 = ... .cache()
val rdd2 = rdd1.map(...)
rdd1.count
future { rdd1.saveAsHadoopFile(...) }
future {
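A fuller sketch of this two-futures pattern, for illustration only (the input path, the transformations, and the use of saveAsTextFile rather than saveAsHadoopFile are placeholders to keep the example self-contained, not from the original mail):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Cache rdd1 and force it once, so both downstream actions
// read the persisted partitions instead of recomputing them.
val rdd1 = sc.textFile("hdfs:///input/path").cache()
rdd1.count()

val rdd2 = rdd1.map(_.length)

// Each future submits its own job, so the two actions run concurrently.
val save = Future { rdd1.saveAsTextFile("hdfs:///output/rdd1") }
val agg  = Future { rdd2.reduce(_ + _) }

Await.result(save, Duration.Inf)
val total = Await.result(agg, Duration.Inf)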
It may take some work to do online updates with a
MatrixFactorizationModel because you need to update some rows of the
user/item factors. You may be interested in spark-indexedrdd
(http://spark-packages.org/package/amplab/spark-indexedrdd).
We support save/load in Scala/Java. We are going to add
SPARK_CLASSPATH is definitely deprecated, but my understanding is that
spark.executor.extraClassPath is not, so maybe the documentation needs
fixing.
I'll let someone who might know otherwise comment, though.
On Thu, Feb 26, 2015 at 2:43 PM, Kannan Rajah kra...@maprtech.com wrote:
Hi Imran,
I can confirm this still happens when calling Kryo serialisation directly;
note I'm using Java. The output file is about 440 MB at the time of the
crash.
Kryo is version 2.21. When I get a chance I’ll see if I can make a
shareable test case and try on Kryo 3.0, I doubt they’d be
Seems you hit https://issues.apache.org/jira/browse/SPARK-4296. It has been
fixed in 1.2.1 and 1.3.
On Thu, Feb 26, 2015 at 1:22 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Can someone confirm if they can run UDFs in group by in Spark 1.2?
I have two builds running -- one from a custom
I've been searching around and see others have asked similar questions.
Given a SchemaRDD I extract a result set that contains numbers, both Ints and
Doubles. How do I construct an RDD[Vector]? In 1.2 I wrote the results to a
text file and then read them back in, splitting them with some code I found in
Short story: I want to write some parquet files so they are pre-partitioned by
the same key. Then, when I read them
back in, joining the two tables on that key should be about as fast as things
can be done.
Can I do that, and if so, how? I don't see how to control the partitioning of a
SQL
Imran, thanks for your explanation of the parallelism. That is very helpful.
In my test case, I am only using a one-box cluster with one executor. So if I set
10 cores, then 10 concurrent tasks will run within this one executor, which
will handle more data than in the 4-core case, and that led to the OOM.
Zhan,
I think it might be helpful to point out that I'm trying to run the RDDs in
different threads to maximize the amount of work that can be done
concurrently. Unfortunately, right now I have something like this:
val rdd1 = ... .cache()
val rdd2 = rdd1.map(...)
future {
Hi Kannan,
I believe you should be able to use --jars for this when invoking
spark-shell or performing a spark-submit. Per the docs:
--jars JARS    Comma-separated list of local jars to include on the
driver and executor classpaths.
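For example (class name, master URL, and jar paths below are made up):

spark-submit --class com.example.MyJob --master spark://master:7077 \
  --jars /opt/hbase/lib/hbase-client.jar,/opt/hbase/lib/hbase-common.jar \
  my-job.jar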
HTH.
-Todd
On Thu, Feb
Imran, I have also observed the phenomenon of reducing the cores helping
with OOM. I wanted to ask this (hopefully without straying off topic): we
can specify the number of cores and the executor memory. But we don't get
to specify _how_ the cores are spread among executors.
Is it possible that
Hey Mike,
I quickly looked through the example and I found a major performance issue.
You are collecting the RDDs to the driver and then sending them to Mongo in
a foreach. Why not do a distributed push to Mongo?
WHAT YOU HAVE
val mongoConnection = ...
WHAT YOU SHOULD DO
rdd.foreachPartition
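A sketch of the distributed-push pattern (illustrative only; createMongoClient and writeToMongo are hypothetical stand-ins for whatever your Mongo driver actually provides):

rdd.foreachPartition { records =>
  // Open one connection per partition, on the executor,
  // instead of collecting everything to the driver.
  val client = createMongoClient()   // hypothetical helper wrapping your Mongo driver
  try {
    records.foreach(record => writeToMongo(client, record))   // hypothetical write call
  } finally {
    client.close()   // assumes the client exposes close()
  }
}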
Here is my understanding.
When running on top of YARN, the number of cores means the number of tasks that can
run in one executor. But all these cores are located in the same JVM.
Parallelism typically controls the balance of tasks. For example, if you have
200 cores but only 50 partitions, there will be 150
When you use SQL (or the API from SchemaRDD/DataFrame) to read data from Parquet,
the optimizer will do column pruning, predicate pushdown, etc. Thus you get
the benefit of Parquet's columnar format. After that, you can operate on the
SchemaRDD (DF) like a regular RDD.
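For example, a sketch on the 1.2/1.3 APIs (the path and column names are made up):

val events = sqlContext.parquetFile("hdfs:///data/events.parquet")   // hypothetical path
events.registerTempTable("events")
// Only user_id and ts are read from the Parquet files (column pruning),
// and the ts filter can be pushed down to the Parquet reader.
val recent = sqlContext.sql("SELECT user_id, ts FROM events WHERE ts > 1424900000")
recent.take(5)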
Thanks.
Zhan Zhang
On Feb 26,
What confused me is the statement "The final result is that rdd1 is
calculated twice." Is it the expected behavior?
To be perfectly honest, performing an action on a cached RDD in two
different threads and having them (at the partition level) block until the
parents are cached would be the
Yeah, I believe Corey knows that much and is using foreachPartition(i
=> None) to materialize. The question is, how would you do this with
an arbitrary DAG? In this simple example we know what the answer is,
but he's trying to do it programmatically.
On Thu, Feb 26, 2015 at 11:54 PM, Zhan Zhang
I think we already covered that in this thread. You get dependencies
from RDD.dependencies()
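For what it's worth, a small sketch of walking the lineage programmatically (it just collects every ancestor RDD reachable through dependencies; the rdd1/rdd2 usage at the end is hypothetical):

import org.apache.spark.rdd.RDD

// Recursively collect all RDDs that `rdd` depends on, directly or transitively.
def ancestors(rdd: RDD[_]): Set[RDD[_]] = {
  val parents = rdd.dependencies.map(_.rdd).toSet
  parents ++ parents.flatMap(ancestors)
}

// e.g. check whether rdd2's lineage contains rdd1:
// val dependsOnRdd1 = ancestors(rdd2).contains(rdd1)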
On Fri, Feb 27, 2015 at 12:31 AM, Zhan Zhang zzh...@hortonworks.com wrote:
Currently in spark, it looks like there is no easy way to know the
dependencies. It is solved at run time.
Thanks.
Zhan
SparkConf.scala logs a warning saying SPARK_CLASSPATH is deprecated and we
should use spark.executor.extraClassPath instead. But the online
documentation states that spark.executor.extraClassPath is only meant for
backward compatibility.
Hi, All,
When I run a small program in spark-shell, I got the following error:
...
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in
java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
at java.lang.Runtime.loadLibrary0(Runtime.java:849)
at
I am able to run it without any issue in standalone as well as in cluster mode.
spark-submit --class org.graphx.test.GraphFromVerteXEdgeArray
--executor-memory 1g --driver-memory 6g --master spark://VM-Master:7077
spark-graphx.jar
code is exact same as above
The issue is that both RDDs are being evaluated at once. rdd1 is
cached, which means that as its partitions are evaluated, they are
persisted. Later requests for the partition hit the cached partition.
But we have two threads causing two jobs to evaluate partitions of
rdd1 at the same time. If
- https://wiki.apache.org/incubator/IgniteProposal has, I think, been updated
recently and has a good comparison.
- Although GridGain has been around since the Spark days, Apache Ignite is
quite new and just getting started, I think, so
- you will probably want to reach out to the developers
bq. whether or not rdd1 is a cached rdd
RDD has getStorageLevel method which would return the RDD's current storage
level.
SparkContext has this method:
* Return information about what RDDs are cached, if they are in mem or
* on disk, how much space they take, etc.
*/
@DeveloperApi
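For example, a sketch (assumes a live SparkContext sc and some RDD rdd):

import org.apache.spark.storage.StorageLevel

// True if the RDD has been marked for caching/persistence.
val isPersisted = rdd.getStorageLevel != StorageLevel.NONE

// Summary of what is actually cached right now (a DeveloperApi method).
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}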
Hi Dan,
This is a CDH issue, so I'd recommend using cdh-u...@cloudera.org for
those questions.
This is an issue that was fixed in recent CM 5.3 updates; if you're not
using CM, or want a workaround, you can manually configure
spark.driver.extraLibraryPath and spark.executor.extraLibraryPath
to
I should probably mention that my example case is much oversimplified.
Let's say I've got a tree, a fairly complex one where I begin a series of
jobs at the root which calculates a bunch of really really complex joins
and as I move down the tree, I'm creating reports from the data that's
already
I'm considering whether or not it is worth introducing Spark at my new
company. The data is nowhere near Hadoop size at this point (it sits in
an RDS Postgres cluster).
I'm wondering at which point it is worth the overhead of adding the Spark
infrastructure (deployment scripts, monitoring,
Ignite guys spoke at the bigtop workshop last week at Scale, posted slides
here:
https://cwiki.apache.org/confluence/display/BIGTOP/SCALE13x
A couple of main points around comments made during the preso: although incubating
at Apache (first code drop was last week, I believe), the tech is battle-tested
with
Zhan,
This is exactly what I'm trying to do except, as I mentioned in my first
message, I am being given rdd1 and rdd2 only and I don't necessarily know
at that point whether or not rdd1 is a cached rdd. Further, I don't know at
that point whether or not rdd2 depends on rdd1.
On Thu, Feb 26,
Hi all,
I'm incrementing several accumulators inside a foreach. Most of the time,
the accumulators will return the same value for the same dataset. However,
they sometimes differ.
I'm not sure how accumulators are implemented. Could this behavior be caused
by data not arriving before I print
Ted. That one I know. It was the dependency part I was curious about
On Feb 26, 2015 7:12 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. whether or not rdd1 is a cached rdd
RDD has getStorageLevel method which would return the RDD's current
storage level.
SparkContext has this method:
*
Try the following:
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vectors

df.map { case Row(id: Int, num: Int, value: Double, x: Float) => // replace those with your types
  (id, Vectors.dense(num, value, x))
}.toDF("id", "features")
-Xiangrui
On Thu, Feb 26, 2015 at 3:08 PM, mobsniuk mobsn...@gmail.com wrote:
I've been searching around and see others
Currently in spark, it looks like there is no easy way to know the
dependencies. It is solved at run time.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 4:20 PM, Corey Nolet
cjno...@gmail.commailto:cjno...@gmail.com wrote:
Ted. That one I know. It was the dependency part I was curious about
On Feb
Assign an alias to the count in the select clause and use that alias in the
order by clause.
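For example (table and column names are made up):

sqlContext.sql("""
  SELECT category, COUNT(*) AS cnt
  FROM events
  GROUP BY category
  ORDER BY cnt DESC
""")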
On Wed, Feb 25, 2015 at 11:17 PM, Tridib Samanta tridib.sama...@live.com
wrote:
Actually I just realized , I am using 1.2.0.
Thanks
Tridib
--
Date: Thu, 26 Feb 2015
The honest answer is that it is unclear to me at this point. I guess what
I am really wondering is if there are cases where one would find it
beneficial to use Spark against one or more RDBs?
On Thu, Feb 26, 2015 at 8:06 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Gary,
On Fri, Feb 27, 2015
It is not in MLlib. There is a JIRA for it:
https://issues.apache.org/jira/browse/SPARK-2336 and Ashutosh has an
implementation for integer values. -Xiangrui
On Thu, Feb 26, 2015 at 8:18 PM, Deep Pradhan pradhandeep1...@gmail.com wrote:
Has KNN classification algorithm been implemented on MLlib?
So when deciding whether to take on installing/configuring Spark, the size
of the data does not automatically make that decision in your mind.
Thanks,
Gary
On Thu, Feb 26, 2015 at 8:55 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi
On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf
Hi Kannan,
Issues with using --jars make sense. I believe you can set the classpath
via --conf spark.executor.extraClassPath=... when invoking spark-submit, or in
your driver with .set("spark.executor.extraClassPath", ...).
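For example (the jar path and app name are made up):

spark-submit --conf spark.executor.extraClassPath=/opt/hbase/lib/hbase-client.jar ...

or in the driver:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hbase-job")
  .set("spark.executor.extraClassPath", "/opt/hbase/lib/hbase-client.jar")
val sc = new SparkContext(conf)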
I believe you are correct about the localization as well, as long as you're
guaranteed that all
Thanks you all!
On Thu, Feb 26, 2015 at 5:50 PM, n...@reactor8.com wrote:
Ignite guys spoke at the bigtop workshop last week at Scale, posted slides
here:
https://cwiki.apache.org/confluence/display/BIGTOP/SCALE13x
Couple main pts around comments made during the preso.., although
Thanks Marcelo. Do you think it would be useful to make
spark.executor.extraClassPath pick up an environment variable
that can be set from spark-env.sh? Here is an example.
spark-env.sh
--
executor_extra_cp=$(get_hbase_jars_for_cp)
export executor_extra_cp
Hi
On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf malouf.g...@gmail.com wrote:
The honest answer is that it is unclear to me at this point. I guess what
I am really wondering is if there are cases where one would find it
beneficial to use Spark against one or more RDBs?
Well, RDBs are all
If anyone is curious to try exporting Spark metrics to Graphite, I just
published a post about my experience doing that, building dashboards in
Grafana (http://grafana.org/), and using them to monitor Spark jobs:
http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
Code
I'm trying to run a spark application using bin/spark-submit. When I
reference my application jar inside my local filesystem, it works. However,
when I copied my application jar to a directory in HDFS, I get the following
exception:
Warning: Skip remote jar
On Thu, Feb 26, 2015 at 5:12 PM, Kannan Rajah kra...@maprtech.com wrote:
Also, I would like to know if there is a localization overhead when we use
spark.executor.extraClassPath. Again, in the case of hbase, these jars would
typically be available on all nodes. So there is no need to localize
Cool, great job☺.
Thanks
Jerry
From: Ryan Williams [mailto:ryan.blake.willi...@gmail.com]
Sent: Thursday, February 26, 2015 6:11 PM
To: user; d...@spark.apache.org
Subject: Monitoring Spark with Graphite and Grafana
If anyone is curious to try exporting Spark metrics to Graphite, I just
I think the setting you are missing is 'spark.worker.cleanup.appDataTtl'.
This setting controls how old a file has to be before it is deleted. More
info here:
https://spark.apache.org/docs/1.0.1/spark-standalone.html.
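For example, the cleanup settings can go into SPARK_WORKER_OPTS in conf/spark-env.sh (the values below are illustrative, not recommendations):

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=604800"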
Also, the 'spark.worker.cleanup.interval' you have configured is pretty
Love to hear some input on this. I did get a standalone cluster up on my
local machine and the problem didn't present itself. I'm pretty confident
that means the problem is in the LocalBackend or something near it.
On Thu, Feb 26, 2015 at 1:37 PM, Victor Tso-Guillen v...@paxata.com wrote:
Okay
Of course, breakpointing on every status update and revive offers
invocation kept the problem from happening. Where could the race be?
On Thu, Feb 26, 2015 at 7:55 PM, Victor Tso-Guillen v...@paxata.com wrote:
Love to hear some input on this. I did get a standalone cluster up on my
local
Gary,
On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf malouf.g...@gmail.com wrote:
I'm considering whether or not it is worth introducing Spark at my new
company. The data is nowhere near Hadoop size at this point (it sits in
an RDS Postgres cluster).
Will it ever become Hadoop size? Looking
There is a usability concern I have with the current way of specifying
--jars. Imagine a use case like HBase where a lot of jobs need it in their
classpath. This needs to be set every time. If we use
spark.executor.extraClassPath,
then we just need to set it once. But there is no programmatic way to
On Fri, Feb 27, 2015 at 10:57 AM, Gary Malouf malouf.g...@gmail.com wrote:
So when deciding whether to take on installing/configuring Spark, the size
of the data does not automatically make that decision in your mind.
You got me there ;-)
Tobias
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the
shuffle setting: spark.sql.shuffle.partitions
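For example, to lower it for a given SQLContext (the value 50 is just illustrative):

sqlContext.setConf("spark.sql.shuffle.partitions", "50")
// or equivalently via SQL:
sqlContext.sql("SET spark.sql.shuffle.partitions=50")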
On Thu, Feb 26, 2015 at 5:51 PM, java8964 java8...@hotmail.com wrote:
Imran, thanks for your explaining about the parallelism. That is very
helpful.
In my test case, I am
Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949
I tried this as a workaround:
import org.apache.spark.scheduler._
import org.roaringbitmap._
...
kryo.register(classOf[org.roaringbitmap.RoaringBitmap])
kryo.register(classOf[org.roaringbitmap.RoaringArray])
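If it helps, the same registrations can also be packaged in a custom KryoRegistrator and wired up through the Spark conf (a sketch; the class name is made up):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[org.roaringbitmap.RoaringBitmap])
    kryo.register(classOf[org.roaringbitmap.RoaringArray])
  }
}

// in the driver, before creating the SparkContext:
// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// conf.set("spark.kryo.registrator", "MyRegistrator")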
Has KNN classification algorithm been implemented on MLlib?
Thank You
Regards,
Deep
http://apache-spark-user-list.1001560.n3.nabble.com/file/n21843/pyspark_error.jpg
I get the above error when I try to run pyspark with the ipython option. I
do not get this error when I run it without the ipython option.
I have Java 8, Scala 2.10.4 and Enthought Canopy Python on my box. OS Win