Sean,
Thanks. That worked.
Kevin
On Mon, Sep 15, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote:
This is more of a Java / Maven issue than Spark per se. I would use
the shade plugin to remove signature files in your final META-INF/
dir, as Spark does in its configuration:
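(A sketch of that filter — the standard shade-plugin signature-file exclusion, along the lines of what Spark's own pom uses:)

  <filters>
    <filter>
      <artifact>*:*</artifact>
      <excludes>
        <exclude>META-INF/*.SF</exclude>
        <exclude>META-INF/*.DSA</exclude>
        <exclude>META-INF/*.RSA</exclude>
      </excludes>
    </filter>
  </filters>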
It might not be related only to the memory issue. The memory issue is also
there, as you mentioned; I have seen that one too. The fine-grained mode issue
is mainly Spark thinking that it got two different block managers
for the same ID, whereas if I search for the ID on the Mesos slave, it
exists only on the one.
Seems like the thriftServer cannot connect to Zookeeper, so it cannot get the
lock.
This is how the log looks when I run SparkSQL:
load data inpath 'kv1.txt' into table src;
log:
14/09/16 14:40:47 INFO Driver: PERFLOG method=acquireReadWriteLocks
14/09/16 14:40:47 INFO ClientCnxn: Opening socket
hi,
I am trying to write some unit tests, following the spark programming guide
http://spark.apache.org/docs/latest/programming-guide.html#unit-testing.
But I observed that the unit test runs very slowly (the code is just a SparkPi), so I
turned the log level to trace and looked through the log output, and found
Hi, Hao Cheng.
I have done other tests. And the result shows the thriftServer can connect
to Zookeeper.
However, I found some more interesting things. And I think I have found a
bug!
Test procedure:
Test1:
(0) Use beeline to connect to the thriftServer.
(1) Switch database: use dw_op1; (OK)
The logs
Thanks Christian! I tried compiling from source but am still getting the
same hadoop client version error when reading from HDFS. Will have to poke
deeper... perhaps I've got some classpath issues. FWIW I compiled using:
$ MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
Hi Michael,
Please correct me if I am wrong. The error seems to originate from spark
only. Please have a look at the stack trace of the error which is as
follows:
[error] (run-main-0) java.lang.NoSuchMethodException: Cannot resolve any
suitable constructor for class
I connected my sample project to a hosted CI service; it only takes 3 seconds
to run there... while the same tests take 2 minutes on my MacBook Pro. So
maybe this is a Mac OS-specific problem?
On Tue, Sep 16, 2014 at 3:06 PM, 诺铁 noty...@gmail.com wrote:
hi,
I am trying to write some unit test,
Cool.. Well, let me try that.. any other suggestion(s) on things I can try?
On Mon, Sep 15, 2014 at 9:59 AM, Davies Liu dav...@databricks.com wrote:
I think 1.1 will be really helpful for you; it's all compatible
with 1.0, so it's
not hard to upgrade to 1.1.
On Mon, Sep 15, 2014 at
Is 1.0.8 working for you ?
You indicated your last known good version is 1.0.0
Maybe we can track down where it broke.
On Sep 16, 2014, at 12:25 AM, Paul Wais pw...@yelp.com wrote:
Thanks Christian! I tried compiling from source but am still getting the
same hadoop client version
it works, thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thrift-JDBC-server-deployment-for-production-tp13947p14345.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
sorry for the disturbance, please ignore this mail.
in the end, I found it was slow because of a lack of memory on my machine..
sorry again.
On Tue, Sep 16, 2014 at 3:26 PM, 诺铁 noty...@gmail.com wrote:
I connected my sample project to a hosted CI service; it only takes 3
seconds to run there... while the same
Writing to Parquet and querying the result via SparkSQL works great (except for
some strange SQL parser errors). However the problem remains: how do I get that
data back to a dashboard? So I guess I’ll have to use a database after all.
You can batch up data and store it into parquet partitions as
Thank you for pasting the steps, I will look at this, hopefully come out with a
solution soon.
-----Original Message-----
From: linkpatrickliu [mailto:linkpatrick...@live.com]
Sent: Tuesday, September 16, 2014 3:17 PM
To: u...@spark.incubator.apache.org
Subject: RE: SparkSQL 1.1 hang when DROP
Dear Ankur,
Thanks! :)
- from [1], and my understanding, the existing inactive feature in the graphx
pregel api is “if there are no in-edges from an active vertex to this vertex,
then we will say this one is inactive”, right?
For instance, there is a graph in which every vertex has at least one
At 2014-09-16 10:55:37 +0200, Yifan LI iamyifa...@gmail.com wrote:
- from [1], and my understanding, the existing inactive feature in the graphx
pregel api is “if there are no in-edges from an active vertex to this vertex,
then we will say this one is inactive”, right?
Well, that's true when
At 2014-09-16 12:23:10 +0200, Yifan LI iamyifa...@gmail.com wrote:
but I am wondering whether there is a message (none?) sent to the target vertex
(when the rank change is less than the tolerance) in the dynamic page rank
implementation below,
def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
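  // (body added for reference — roughly what Spark's PageRank.runUntilConvergence
  // does; tol is the tolerance captured from the enclosing scope)
  if (edge.srcAttr._2 > tol) {
    Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
  } else {
    Iterator.empty
  }
}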
From the caller / application perspective, you don't care what version
of Hadoop Spark is running on in the cluster. The Spark API you
compile against is the same. When you spark-submit the app, at
runtime, Spark is using the Hadoop libraries from the cluster, which
are the right version.
So when
How many cores do your machines have? --executor-cores should be the
number of cores each executor uses. Fewer cores means more executors
in general. From your data, it sounds like, for example, there are 7
nodes with 4+ cores available to YARN, and 2 more nodes with 2-3 cores
available. Hence
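(An illustrative invocation of the flags being discussed — all values hypothetical:)

  $ bin/spark-submit --master yarn --num-executors 7 --executor-cores 4 --executor-memory 8g your-app.jar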
I have a standalone spark cluster, and from within the same scala
application I'm creating 2 different spark contexts to run two different
spark streaming jobs, as the SparkConf is different for each of them.
I'm getting this error that... I don't really understand:
14/09/16 11:51:35 ERROR
It seems that, as I have a single scala application, the scheduler is the
same and there is a collision between executors of both spark contexts. Is
there a way to change how the executor ID is generated (maybe a UUID
instead of a sequential number..?)
2014-09-16 13:07 GMT+01:00 Luis Ángel
I am expanding my data set and executing pyspark on yarn:
I noticed that only 2 processes processed the data:
14210 yarn 20 0 2463m 2.0g 9708 R 100.0 4.3 8:22.63 python2.7
32467 yarn 20 0 2519m 2.1g 9720 R 99.3 4.4 7:16.97 python2.7
Question:
how to configure
When I said scheduler I meant executor backend.
2014-09-16 13:26 GMT+01:00 Luis Ángel Vicente Sánchez
langel.gro...@gmail.com:
It seems that, as I have a single scala application, the scheduler is the
same and there is a collision between executors of both spark contexts. Is
there a way to
I dug a bit more, and the executor ID is a number, so it seems there is no
possible workaround.
Looking at the code of the CoarseGrainedSchedulerBackend.scala:
From my map function I create Tuple2<Integer, Integer> pairs. Now I want to
reduce them, and get something like Tuple2<Integer, List<Integer>>.
The only way I found to do this was by treating all variables as String, and
in the reduceByKey do
return a._2 + "," + b._2; // in which both are numeric
If you mean you have (key, value) pairs, and want a pair of (key, all
values for that key), then you're looking for groupByKey.
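(A minimal Scala sketch; the Java API is analogous via JavaPairRDD.groupByKey:)

  // hypothetical pairs; groupByKey gathers all values per key in one shuffle
  val pairs = sc.parallelize(Seq((1, 10), (1, 11), (2, 20)))
  val grouped: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = pairs.groupByKey()
  grouped.collect()  // e.g. Array((1, [10, 11]), (2, [20]))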
On Tue, Sep 16, 2014 at 2:42 PM, Tom thubregt...@gmail.com wrote:
From my map function I create Tuple2Integer, Integer pairs. Now I want to
reduce them, and get
If your dashboard is doing ajax/pull requests against, say, a REST API, you
can always create a Spark context in your REST service and use SparkSQL to
query over the parquet files. The parquet files are already on disk, so it
seems silly to write both to parquet and to a DB... unless I'm missing
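(A minimal sketch of that setup with the 1.1 API — the path is hypothetical:)

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)  // long-lived, inside the REST service
  val metrics = sqlContext.parquetFile("hdfs:///metrics/parquet")
  metrics.registerTempTable("metrics")
  val top = sqlContext.sql("SELECT * FROM metrics LIMIT 10").collect()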
Hi All,
I suspect I am experiencing a bug. I've noticed that while running
larger jobs, they occasionally die with the exception
java.util.NoSuchElementException: key not found: xyz, where xyz
denotes the ID of some particular task. I've excerpted the log from
one job that died in this way below.
Hi,
I am new to SPARK. I just set up a small cluster and wanted to run some
simple MLLIB examples. By following the instructions of
https://spark.apache.org/docs/0.9.0/mllib-guide.html#binary-classification-1,
I could successfully run everything until the step of SVMWithSGD, where I got
the error
Hello,
Suppose I want to use Spark from an application that I already submit to run
in another container (e.g. Tomcat). Is this at all possible? Or do I have to
split the app into two components, and submit one to Spark and one to the other
container? In that case, what is the
Hello. I have a hadoopFile RDD and I tried to collect items to the driver
program, but it returns an array of identical records (all equal to the last
record of my file). My code is like this:
val rdd = sc.hadoopFile(
  "hdfs:///data.avro",
Hello,
I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read
from hbase. The Java code is attached. However, the problem is that TableInputFormat
does not even exist in the hbase-client API; is there any other way I can read from
hbase? Thanks
SparkConf sconf = new
This problem was caused by the fact that I used a package jar with a Spark
version (0.9.1) different from that of the cluster (0.9.0). When I used the
correct package jar
(spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar) instead the
application can run as expected.
2014-09-15 14:57
bq. TableInputFormat does not even exist in hbase-client API
It is in hbase-server module.
Take a look at http://hbase.apache.org/book.html#mapreduce.example.read
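(A minimal Scala sketch of that pattern — the table name is hypothetical; the
HBaseTest example in the Spark repo does essentially the same:)

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat

  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table
  val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  hbaseRDD.count()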
On Tue, Sep 16, 2014 at 8:18 AM, Y. Dong tq00...@gmail.com wrote:
Hello,
I’m currently using spark-core 1.1 and hbase 0.98.5 and
hbase-client module serves client facing APIs.
hbase-server module is supposed to host classes used on server side.
There is still some work to be done so that the above goal is achieved.
On Tue, Sep 16, 2014 at 9:06 AM, Y. Dong tq00...@gmail.com wrote:
Thanks Ted. It is indeed in
Can I load that plugin in spark-shell? Or perhaps, due to the 2-phase
compilation, quasiquotes won't work in the shell?
On Mon, Sep 15, 2014 at 7:15 PM, Mark Hamstra m...@clearstorydata.com
wrote:
Okay, that's consistent with what I was expecting. Thanks, Matei.
On Mon, Sep 15, 2014 at 5:20 PM, Matei
Hi,
I had a similar situation in which I needed to read data from HBase and work
with the data inside of a spark context. After much googling, I finally got
mine to work. There are a bunch of steps that you need to do to get this working -
The problem is that the spark context does not know
Btw, there are some examples in the Spark GitHub repo that you may find
helpful. Here's one
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
related to HBase.
On Tue, Sep 16, 2014 at 1:22 PM, abraham.ja...@thomsonreuters.com wrote:
If you want to run the computation on just one machine (using Spark's local
mode), it can probably run in a container. Otherwise you can create a
SparkContext there and connect it to a cluster outside. Note that I haven't
tried this though, so the security policies of the container might be too
It depends on what you want to do with Spark. The following has worked for
me.
Let the container handle the HTTP request and then talk to Spark using
another HTTP/REST interface. You can use the Spark Job Server for this.
Embedding Spark inside the container is not a great long term solution IMO
Yes that was very helpful… ☺
Here are a few more I found on my quest to get HBase working with Spark –
This one gives details on HBase dependencies and spark classpaths:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
This one has a code overview –
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related?
On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao hao.ch...@intel.com wrote:
Thank you for pasting the steps, I will look at this, hopefully come out
with a solution soon.
-----Original Message-----
From: linkpatrickliu
I meant it may be a Hive bug since we also call Hive's drop table
internally.
On Tue, Sep 16, 2014 at 1:44 PM, Yin Huai huaiyin@gmail.com wrote:
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related?
On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao hao.ch...@intel.com wrote:
Thank
Hi All,
I have data in the following format:
1st column is the userid, and the second column onward are class ids for various
products. I want to save this in Libsvm format, and an intermediate step is to
sort (in ascending order) the class ids. For example: I/P uid1 12433580
2670122
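(A Scala sketch of that intermediate step, assuming data: RDD[(String, Array[Int])]
of (userId, classIds):)

  val sorted = data.mapValues(_.sorted)  // class ids ascending per user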
Hello,
Thanks for the response and great to hear it is possible. But how do I
connect to Spark without using the submit script?
I know how to start up a master and some workers and then connect to the
master by packaging the app that contains the SparkContext and then submitting
the
Hi,
The test case is separated out as follows. The call to rdd2.first() breaks when
the spark version is changed to 1.1.0, reporting the exception "NullWritable not
serializable". However, the same test passed with spark 1.0.2. The pom.xml file
is attached. The test data README.md was copied from spark.
You can create a new SparkContext inside your container pointed to your
master. However, for your script to run you must call addJars to put the
code on your workers' classpaths (except when running locally).
Hopefully your webapp has some lib folder which you can point to as a
source for the
Hi,
The Spark Job Server by Ooyala is the right tool for the job. It exposes a REST
API, so calling it from a web app is suitable.
It is open source; you can find it on GitHub.
Best
Paolo Platter
From: Ruebenacker, Oliver A [mailto:oliver.ruebenac...@altisource.com]
Sent:
Hi Sean,
Great catch! Yes I was including Spark as a dependency and it was
making its way into my uber jar. Following the advice I just found at
Stackoverflow[1], I marked Spark as a provided dependency and that
appeared to fix my Hadoop client issue. Thanks for your help!!!
Perhaps they
Hi all,
Spark is taking too much time to start the first stage with many small
files in HDFS.
I am reading a folder that contains RC files:
sc.hadoopFile("hdfs://hostname:8020/test_data2gb/",
classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
classOf[LongWritable],
Hi,
I'm trying to implement a custom RDD that essentially works as a
distributed hash table, i.e. the key space is split up into partitions and
within a partition, an element can be looked up efficiently by the key.
However, the RDD lookup() function (in PairRDDFunctions) is implemented in
a way
Does MLlib provide utility functions to do this kind of encoding?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I think it's on the table but not yet merged?
https://issues.apache.org/jira/browse/SPARK-1216
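(In the meantime, a hand-rolled sketch, assuming categories: RDD[String] holding
one categorical column:)

  import org.apache.spark.mllib.linalg.Vectors

  // map each distinct category to an index, then to a 0/1 vector for k-means
  val index = categories.distinct().collect().zipWithIndex.toMap
  val encoded = categories.map { c =>
    val arr = Array.fill(index.size)(0.0)
    arr(index(c)) = 1.0
    Vectors.dense(arr)
  }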
On Tue, Sep 16, 2014 at 10:04 PM, st553 sthompson...@gmail.com wrote:
Does MLlib provide utility functions to do this kind of encoding?
--
View this message in context:
Hi, I'm a Spark newbie.
We had installed spark-1.0.2-bin-cdh4 on a 'super machine' with 256gb memory
and 48 cores.
Tried to allocate a task with 64gb memory but for whatever reason Spark is
only using around 9gb max.
Submitted spark job with the following command:
/bin/spark-submit --class
Perhaps your job does not use more than 9g. Even though the dashboard shows
64g, the process only uses what's needed and grows to 64g max.
On Tue, Sep 16, 2014 at 5:40 PM, francisco ftanudj...@nextag.com wrote:
Hi, I'm a Spark newbie.
We had installed spark-1.0.2-bin-cdh4 on a 'super machine'
Hi, guys
My current project is using Spark 0.9.1, and after increasing the level of
parallelism and partitions in our RDDs, stages and tasks seem to complete
much faster. However it also seems that our cluster becomes more unstable
after some time:
- stalled stages still showing under active
Thanks for the reply.
I doubt that's the case though ... the executor kept having to do a file
dump because memory is full.
...
14/09/16 15:00:18 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67
MB to disk (668 times so far)
14/09/16 15:00:21 WARN ExternalAppendOnlyMap: Spilling
Brand new to Apache Spark, and I'm a little confused about how to make updates to a
value that sits outside of a .mapTriplets iteration in GraphX. I'm aware
mapTriplets is really only for modifying values inside the graph. What about
using it in conjunction with other computations? See below:
def
I see, what does http://localhost:4040/executors/ show for memory usage?
I personally find it easier to work with a standalone cluster with a single
worker, by using sbin/start-master.sh and then connecting to the master.
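(Roughly, per the standalone docs — host and memory values illustrative:)

  $ sbin/start-master.sh
  $ bin/spark-class org.apache.spark.deploy.worker.Worker spark://master-host:7077
  $ bin/spark-submit --master spark://master-host:7077 --executor-memory 64g your-app.jar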
On Tue, Sep 16, 2014 at 6:04 PM, francisco ftanudj...@nextag.com wrote:
Hello Spark Community -
I am using the support vector machine / SVM implementation in MLlib with
the standard linear kernel; however, I noticed the Spark documentation
for StandardScaler *specifically* mentions that SVMs which use the RBF
kernel work really well when you have standardized
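(A minimal sketch of that preprocessing step, assuming training: RDD[LabeledPoint]
with dense features:)

  import org.apache.spark.mllib.feature.StandardScaler
  import org.apache.spark.mllib.regression.LabeledPoint

  val scaler = new StandardScaler(withMean = true, withStd = true)
    .fit(training.map(_.features))
  val scaled = training.map(p => LabeledPoint(p.label, scaler.transform(p.features)))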
This should be a really simple problem, but you haven't shared enough code
to determine what's going on here.
On Tue, Sep 16, 2014 at 8:08 AM, Hui Li littleleave...@gmail.com wrote:
Hi,
I am new to SPARK. I just set up a small cluster and wanted to run some
simple MLLIB examples. By
Hello friends:
Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN
distribution. Everything went fine, and everything seems
to work, but for the following.
Following are two invocations of the 'pyspark' script, one with
enclosing quotes around the options passed to
I have a use case where my RDD is set up such that:
Partition 0:
K1 -> [V1, V2]
K2 -> [V2]
Partition 1:
K3 -> [V1]
K4 -> [V3]
I want to invert this RDD, but only within a partition, so that the
operation does not require a shuffle. It doesn't matter if the partitions
of the inverted RDD have non-unique
If each partition can fit in memory, you can do this using
mapPartitions and then building an inverse mapping within each
partition. You'd need to construct a hash map within each partition
yourself.
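(A sketch of that approach, assuming rdd: RDD[(String, Seq[String])]; result keys
are only unique within a partition, as discussed:)

  import scala.collection.mutable

  val inverted = rdd.mapPartitions { iter =>
    val inverse = mutable.HashMap.empty[String, List[String]]
    for ((k, vs) <- iter; v <- vs)
      inverse(v) = k :: inverse.getOrElse(v, Nil)
    inverse.iterator
  }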
On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote:
I have a use case where
Thanks for the tip.
http://localhost:4040/executors/ is showing
Executors(1)
Memory: 0.0 B used (294.9 MB Total)
Disk: 0.0 B Used
However, running as standalone cluster does resolve the problem.
I can see a worker process running w/ the allocated memory.
My conclusion (I may be wrong) is for
http://spark.apache.org/docs/latest/quick-start.html#standalone-applications
Click on the java tab. There is a bug in the maven section:
<version>1.1.0-SNAPSHOT</version>
Should be
<version>1.1.0</version>
Hope this helps
Andy
Thank you Yin Huai. This is probably true.
I saw that in the hive-site.xml, Liu has changed the entry, which by default should
be false.
<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>
Someone is
Is there some way to ship a textfile just like shipping python libraries?
Thanks in advance
Daijia
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p14412.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
You can send an email like you just did or open an issue in the Spark issue
tracker http://issues.apache.org/jira/. This looks like a problem with
how the version is generated in this file
https://github.com/apache/spark/blob/branch-1.1/docs/quick-start.md.
On Tue, Sep 16, 2014 at 8:55 PM, Andy
Hi Sandy:
Thank you. I have not tried that mechanism (I wasn't aware of it). I will
try that instead.
Is it possible to also represent '--driver-memory' and
'--executor-memory' (and basically all properties)
using the '--conf' directive?
The Reason: I actually discovered the below issue
Hi
I need to get the CPU utilisation, RAM usage, Network IO and other metrics
using a Java program. Can anyone help me with this?
Thanks
Shalish.
Yes, sc.addFile() is what you want:
| addFile(self, path)
| Add a file to be downloaded with this Spark job on every node.
| The C{path} passed can be either a local file, a file in HDFS
| (or other Hadoop-supported filesystems), or an HTTP, HTTPS or
| FTP URI.
|
|
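(A minimal Scala sketch of the same pattern — PySpark's equivalent of
SparkFiles.get lives in pyspark.SparkFiles; the path is hypothetical:)

  import org.apache.spark.SparkFiles

  sc.addFile("hdfs:///shared/lookup.txt")
  val localPath = SparkFiles.get("lookup.txt")  // local copy on driver/executors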
Not particularly related to Spark, but you can check out the SIGAR API. It lets
you get CPU, Memory, Network, Filesystem and process-based metrics.
Amit
On Sep 16, 2014, at 20:14, VJ Shalish vjshal...@gmail.com wrote:
Hi
I need to get the CPU utilisation, RAM usage, Network IO and other
Thank you for the response, Amit.
So is it that we cannot measure the CPU consumption and RAM usage of a spark
job through a Java program?
On Tue, Sep 16, 2014 at 11:23 PM, Amit kumarami...@gmail.com wrote:
Not particularly related to Spark, but you can check out the SIGAR API. It
lets you get CPU,
Sorry for the confusion Team.
My requirement is to measure the CPU utilisation, RAM usage, Network IO and
other metrics of a SPARK JOB using Java program.
Please help on the same.
On Tue, Sep 16, 2014 at 11:23 PM, Amit kumarami...@gmail.com wrote:
Not particularly related to Spark, but you can
Hi,
I am running Spark in cluster mode with Hadoop YARN as the underlying
cluster manager. I get this error when trying to initialize the
SparkContext.
Exception in thread "main" org.apache.spark.SparkException: YARN mode not
available ?
at
Hi,
I am new to spark. I tried to run a job like the wordcount example in
python. But when I tried to get the top 10 popular words in the file, I got
the message: AttributeError: 'PipelinedRDD' object has no attribute
'sortByKey'.
So my question is what is the difference between
I am running spark on a shared yarn cluster.
My user ID is "online", but I found that when I run my spark application,
local directories are created by the yarn user ID.
So I am unable to delete the local directories, and finally the application failed.
Please refer to my log below:
14/09/16 21:59:02 ERROR