Hi,
Organization name: Amrita Center for Cyber Security Systems and Networks
URL: https://www.amrita.edu/center/cyber-security
We use Spark for BigData analytics and ML/Data Mining.
Spark Streaming in IoT Platform
--
Regards,
Bilna P
-dev
Guava was not downgraded to 11. That PR was not merged. It was part of a
discussion about, indeed, what to do about potential Guava version
conflicts. Spark uses Guava, but so does Hadoop, and so do user programs.
Spark uses 14.0.1 in fact:
Hi Sean,
My mistake, Guava 11 dependency came from the hadoop-commons indeed.
I'm running the following simple app on a Spark 1.2.0 standalone local
cluster (2 workers) with Hadoop 1.2.1:
public class AvroSparkTest {
public static void main(String[] args) throws Exception {
SparkConf
Oh, are you actually bundling Hadoop in your app? That may be the problem.
If you're using stand-alone mode, why include Hadoop? In any event, Spark
and Hadoop are intended to be 'provided' dependencies in the app you send
to spark-submit.
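For example, a minimal build.sbt sketch of that layout (versions are illustrative; the avro line just stands in for whatever your app actually needs):

// Spark and Hadoop come from the cluster at runtime, so mark them "provided"
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.2.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "1.2.1" % "provided",
  // application-only libraries are bundled as usual
  "org.apache.avro"   %  "avro"          % "1.7.7"
)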
On Tue, Jan 6, 2015 at 10:15 AM, Niranda Perera
Hi,
I have been running a simple Spark app on a local spark cluster and I came
across this error.
Exception in thread "main" java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
at org.apache.spark.util.collection.OpenHashSet.org
I ran your scripts on a 5-node (2 CPUs, 8G mem) cluster and cannot
reproduce your failure. Should I test it with a big-memory node?
On Mon, Jan 5, 2015 at 4:00 PM, Sven Krasser kras...@gmail.com wrote:
Thanks for the input! I've managed to come up with a repro of the error with
test data only
That's an issue with your firewall (more likely a hostnames issue in
/etc/hosts). You may find the following posts helpful:
-
http://stackoverflow.com/questions/27039954/intermittent-timeout-exception-using-spark
-
http://koobehub.wordpress.com/2014/09/29/spark-the-standalone-cluster-deployment/
You can try something like:
val top10 = your_stream.mapPartitions(rdd => rdd.take(10))
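If you want the top 10 across the whole window rather than per partition, a rough sketch along these lines may be closer to what you want (it assumes your_stream is a DStream of Strings; the window and slide durations are illustrative):

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._

// Count each value over a sliding window, then keep only the 10 largest counts.
val counts = your_stream
  .map(value => (value, 1L))
  .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))

val top10PerWindow = counts.transform { rdd =>
  rdd.sparkContext.parallelize(rdd.top(10)(Ordering.by[(String, Long), Long](_._2)))
}
top10PerWindow.print()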
Thanks
Best Regards
On Mon, Jan 5, 2015 at 11:08 PM, Laeeq Ahmed laeeqsp...@yahoo.com.invalid
wrote:
Hi,
I am counting values in each window to find the top values and want to
save only the top 10
Hello! I just had a very similar stack trace. It was caused by an Akka
version mismatch. (From trying to use Play 2.3 with Spark 1.1 by accident
instead of 1.2.)
On Mon, Nov 24, 2014 at 7:15 PM, Blackeye black...@iit.demokritos.gr
wrote:
I created an application in spark. When I run it with
Good luck. Let me know if I can assist you further.
Regards
-Pankaj
Linkedin
https://www.linkedin.com/profile/view?id=171566646
Skype
pankaj.narang
I suggest creating an uber jar instead.
Check my thread for the same issue:
http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration-with-akka-http-akka-stream-td20926.html
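For what it's worth, a minimal sbt-assembly sketch of the uber-jar approach (the plugin version and merge rules below are assumptions, not taken from the thread):

// project/plugins.sbt (assumed plugin and version)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")

// build.sbt: keep Spark "provided" so it is not bundled, and resolve
// duplicate META-INF entries when the fat jar is merged
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}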
Regards
-Pankaj
Linkedin
https://www.linkedin.com/profile/view?id=171566646
Or you can use:
sc.addJar("/path/to/your/datastax.jar")
Thanks
Best Regards
On Tue, Jan 6, 2015 at 5:53 PM, bchazalet bchaza...@companywatch.net
wrote:
I don't know much about spark-jobserver, but you can set jars
programmatically
using the method setJars on SparkConf. Looking at your code it
I don't know much about spark-jobserver, but you can set jars programmatically
using the method setJars on SparkConf. Looking at your code it seems that
you're importing classes from com.datastax.spark.connector._ to load data
from cassandra, so you may need to add that datastax jar to your
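A minimal sketch of that approach (the jar path is a placeholder, not a real location):

import org.apache.spark.{SparkConf, SparkContext}

// Ship the Cassandra connector jar with the application.
val conf = new SparkConf()
  .setAppName("jobserver test demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setJars(Seq("/path/to/spark-cassandra-connector-assembly.jar"))
val sc = new SparkContext(conf)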
Hi,
Just wondering whether this is released yet and, if so, on which version of
Spark?
Many Thanks,
Thomas
Thanks Pankaj for the assembly plugin tip.
Yes there is a version mismatch of akka actor between Spark 1.1.1 and
akka-http/akka-stream (2.2.3 versus 2.3.x).
After some digging, I see 4 options for this problem (in case others
encounter it):
1) Upgrade to Spark 1.2.0, the same code will work (not
Boris,
Thank you for your suggestion. I used the following code and am still facing
the same issue:
val conf = new SparkConf(true).set("spark.cassandra.connection.host",
"127.0.0.1")
.setAppName("jobserver test demo")
Hey,
I have a job that keeps failing if too much data is processed, and I can't
see how to get it working. I've tried repartitioning with more partitions
and increasing the amount of memory for the executors (now about 12G and 400
executors). Here is a snippet of the first part of the code, which
Many thanks Pankaj, I've got it working.
For completeness, here's the whole segment (including the printouts at
different stages):
I am a bit new to Spark, except that I tried simple things like word count, and
the examples given in the spark sql programming guide.
Now, I am investigating the internals of Spark, but I think I am almost lost,
because I could not grasp the whole picture of what Spark does when it executes the
Hi Pro,
One map() operation in my Spark app takes an RDD[A] as input and maps each
element of RDD[A] to an object of type B using a custom mapping function
func(x: A): B.
I received lots of OutOfMemory errors, and after some debugging I found this is
because func() requires significant
Hi all,
I have a Spark streaming application that ingests data from a Kafka topic and
persists received data to Hbase. It works fine with Spark 1.1.1 in YARN cluster
mode. Basically, I use the following code to persist each partition of each RDD
to Hbase:
@Override
void
Hello,
We have hadoop 2.6.0 and Yarn set up on ec2. Trying to get spark 1.1.1 running
on the Yarn cluster.
I have of course googled around and found that this problem is solved for most
people after removing the line containing 127.0.1.1 from /etc/hosts. That hasn't
solved it for me.
Hi,
Has anybody tried to connect to a Spark cluster (on UNIX machines) from a
Windows interactive shell?
-Naveen.
That's great. I did not have access to the developer machine, so I sent you the
pseudo code only.
Happy to see it's working. If you need any more help related to Spark, let me
know anytime.
It does not look like you're supposed to fiddle with the SparkConf and even
SparkContext in a 'job' (again, I don't know much about jobserver), as
you're given a SparkContext as parameter in the build method.
I guess jobserver initialises the SparkConf and SparkContext itself when it
first
I'm submitting a script using spark-submit in local mode for testing, and
I'm having trouble figuring out where the logs are stored. The
documentation indicates that they should be in the work folder in the
directory in which Spark lives on my system, but I see no such folder there.
I've set the
@Sasi
You should be able to create a job something like this:
package io.radtech.spark.jobserver
import java.util.UUID
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.rdd.RDD
import org.joda.time.DateTime
import com.datastax.spark.connector.types.TypeConverter
I have an application where a function needs access to the results of a
select from a parquet database. Creating a JavaSQLContext and from it
a JavaSchemaRDD
as shown below works but the parallelism is not needed - a simple JDBC call
would work -
Are there alternative non-parallel ways to
So you want windows covering the same length of time, some of which will be
fuller than others? You could, for example, simply bucket the data by
minute to get this kind of effect. If you have an RDD[Ticker], where Ticker has
a timestamp in ms, you could:
tickerRDD.groupBy(ticker => (ticker.timestamp /
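For completeness, a fuller sketch of that bucketing idea (the price field and the per-bucket average are assumptions):

import org.apache.spark.SparkContext._   // for mapValues on Spark 1.2 and earlier

case class Ticker(timestamp: Long, price: Double)

// Bucket by minute (timestamps are in ms), then aggregate within each bucket,
// e.g. the average price per minute.
val perMinute = tickerRDD
  .groupBy(ticker => ticker.timestamp / 60000)
  .mapValues(tickers => tickers.map(_.price).sum / tickers.size)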
Hi Manjul,
Each StreamingContext will have its own batch size. If that doesn’t work
for the different sources you have then you would have to create different
streaming apps. You can only create a new StreamingContext in the same
Spark app, once you’ve stopped the previous one.
Spark
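A minimal sketch of that sequencing (batch intervals are illustrative, and sc is assumed to be an existing SparkContext):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// First streaming job, 1-second batches
val ssc1 = new StreamingContext(sc, Seconds(1))
// ... define sources and outputs on ssc1, call ssc1.start(), run as long as needed ...
ssc1.stop(stopSparkContext = false)   // stop streaming, keep the SparkContext alive

// Second streaming job, reusing the same SparkContext, 30-second batches
val ssc2 = new StreamingContext(sc, Seconds(30))
// ... define the second job ...
ssc2.start()
ssc2.awaitTermination()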
I still can not reproduce it with 2 nodes (4 CPUs).
Your repro.py could be faster (10 min) than before (22 min):
inpdata.map(lambda (pc, x): (x, pc=='p' and 2 or
1)).reduceByKey(lambda x, y: x|y).filter(lambda (x, pc):
pc==3).collect()
(also, no cache needed anymore)
Davies
On Tue, Jan 6,
Hi,
We have a requirement to receive live input messages from RabbitMQ and
process them into micro batches. For this we have selected Spark Streaming,
and we have written a connector for a RabbitMQ receiver and Spark Streaming;
it is working fine.
Now the main requirement is to receive different
Hi all.
In order to get Spark to properly release memory during batch processing as a
workaround to issue https://issues.apache.org/jira/browse/SPARK-4927 I tear
down and re-initialize the Spark context with:
context.stop() and
context = new SparkContext()
The problem I run into is that
thanks Xiangrui
I'll try it.
BTW: spark-submit is a standalone program (bin/spark-submit). Therefore,
the JVM has to be executed after the spark-submit script.
Am I correct?
On Mon, Jan 5, 2015 at 10:35 PM, Xiangrui Meng men...@gmail.com wrote:
It might be hard to do that with spark-submit, because
No, most RDDs partition input data appropriately.
On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter franc.car...@rozettatech.com
wrote:
One more question, to clarify: will every node pull in all the data?
thanks
On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger c...@koeninger.org
wrote:
If
Hi,
A while ago, somebody asked about getting a confidence value of a
prediction with MLlib's implementation of Naive Bayes's classification.
I was wondering if there is any plan in the near future for the predict
function to return both a label and a confidence/probability? Or could the
private
Ah, so it's RDD specific - that would make sense. For those systems where
it is possible to extract sensible subsets, the RDDs do so. My use case,
which is probably biasing my thinking, is DynamoDB, which I don't think can
efficiently extract records from M-to-N
cheers
On Wed, Jan 7, 2015 at 6:59
Thanks. Another question. I have event data with timestamps. I want to
create a sliding window using timestamps. Some windows will have a lot of
events in them, others won't. Is there a way to get an RDD made of this kind
of variable-length window?
On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen
Except I want it to be a sliding window. So the same record could be in
multiple buckets.
On Tue, Jan 6, 2015 at 3:43 PM, Sean Owen so...@cloudera.com wrote:
So you want windows covering the same length of time, some of which will
be fuller than others? You could, for example, simply bucket
One problem with this is that we are creating a lot of iterables containing
a lot of repeated data. Is there a way to do this so that we can calculate
a moving average incrementally?
On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen so...@cloudera.com wrote:
Yes, if you break it down to...
Hi,
The ec2 launch script provided by Spark uses
https://github.com/mesos/spark-ec2 to download and configure all the tools
in the cluster (Spark, Hadoop etc.), so you can create your own git repository
to achieve your goal. More precisely:
1. Upload your own version of Spark to S3 at address
Interesting, I am not sure the order in which fold() encounters elements is
guaranteed, although from reading the code, I imagine in practice it is
first-to-last by partition and then folded first-to-last from those results
on the driver. I don't know this would lead to a solution though as the
One approach I was considering was to use mapPartitions. It is
straightforward to compute the moving average over a partition, except for
near the end point. Does anyone see how to fix that?
On Tue, Jan 6, 2015 at 7:20 PM, Sean Owen so...@cloudera.com wrote:
Interesting, I am not sure the order
In my opinion you should use the fold pattern, obviously after a sortBy
transformation.
Paolo
Sent from my Windows Phone
From: Asim Jalis mailto:asimja...@gmail.com
Sent: 06/01/2015 23:11
To: Sean Owen mailto:so...@cloudera.com
Cc:
I work on a user-to-user recommender for a website using
mllib.recommendation.
I have created a file (recommends.txt) which contains the top 5
recommendations for each user id.
The file's format (recommends.txt) is something like this
(user::rec1:rec2:rec3:rec4:rec5):
/**file's snapshot**/
I guess I can use a similar groupBy approach. Map each event to all the
windows that it can belong to. Then do a groupBy, etc. I was wondering if
there was a more elegant approach.
On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis asimja...@gmail.com wrote:
Except I want it to be a sliding window. So
I get this exception(java.lang.UnsatisfiedLinkError) when the driver is
running inside JBoss.
We are running with DataStax 4.6 version, which is using spark 1.1.0. The
driver runs inside a wildfly container. The snappy-java version is 1.0.5.
2015-01-06 20:25:03,771 ERROR
Hi, I am getting this same error. Did you figure out how to solve the
problem? Thanks!
Is there a way to use the ec2 launch script with a locally built version of
spark? I launch and destroy clusters pretty frequently and would like to not
have to wait each time for the master instance to compile the source as happens
when I set the -v tag with the latest git commit. To be clear,
Two billion words is a very large vocabulary… You can try solving this issue by
setting the number of times words must occur in order to be included in the
vocabulary using setMinCount; this will prevent common misspellings,
websites, and other things from being included and may improve
Oops, just kidding, this method is not in the current release. However, it is
included in the latest commit on git if you want to do a build.
On Jan 6, 2015, at 2:56 PM, Ganon Pierce ganon.pie...@me.com wrote:
Two billion words is a very large vocabulary… You can try solving this issue
by
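If you do build from the latest commit, a rough sketch of the suggestion (the threshold and the corpus variable are assumptions):

import org.apache.spark.mllib.feature.Word2Vec

// Drop tokens seen fewer than 25 times before training; the threshold is arbitrary.
// As noted above, setMinCount may only be available on a build from master.
val word2vec = new Word2Vec()
  .setMinCount(25)
  .setVectorSize(100)
val model = word2vec.fit(tokenizedCorpus)   // tokenizedCorpus: RDD[Seq[String]], assumed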
You can also read about locality here in the docs:
http://spark.apache.org/docs/latest/tuning.html#data-locality
On Tue, Jan 6, 2015 at 8:37 AM, Cody Koeninger c...@koeninger.org wrote:
No, not all rdds have location information, and in any case tasks may be
scheduled on non-local nodes if
I’m attempting to build from the latest commit on git and receive the following
error upon attempting to access the application web ui:
HTTP ERROR: 500
Problem accessing /jobs/. Reason:
Server Error
Powered by Jetty://
My driver also prints this error:
Thanks for the pointers. The issue was due to route caching by Spray, which
would always return the same value. Other than that the program is working
fine.
On Mon, Jan 5, 2015 at 12:44 AM, Simon Chan simonc...@gmail.com wrote:
Boromir,
You may like to take a look at how we make Spray and
FWIW I do not see any such error, after a mvn -DskipTests clean package
and ./bin/spark-shell from master. Maybe double-check you have done a
full clean build.
On Tue, Jan 6, 2015 at 9:09 PM, Ganon Pierce ganon.pie...@me.com wrote:
I’m attempting to build from the latest commit on git and
Hi,
I was doing some tests with ALS and I noticed that if I persist the inner
RDDs from a MatrixFactorizationModel the RDD is not replicated; it seems
like the storage level is hardcoded to MEMORY_AND_DISK. Do you think it
makes sense to make that configurable?
Might be due to a conflict between multiple snappy jars.
Can you check the classpath to see if there is more than one snappy jar?
Cheers
On Tue, Jan 6, 2015 at 2:26 PM, Charles charles...@cenx.com wrote:
I get this exception(java.lang.UnsatisfiedLinkError) when the driver is
running inside
Anyone got any further thoughts on this? I saw the _metadata file seems to
store the schema of every single part (i.e. file) in the parquet directory,
so in theory it should be possible.
Effectively, our use case is that we have a stack of JSON that we receive
and we want to encode to Parquet
I do not understand Chinese but the diagrams on that page are very helpful.
On Tue, Jan 6, 2015 at 9:46 PM, eric wong win19...@gmail.com wrote:
A good beginning if you are Chinese.
https://github.com/JerryLead/SparkInternals/tree/master/markdown
2015-01-07 10:13 GMT+08:00 bit1...@163.com
Hi Manoj,
I've noticed that the storage tab only shows RDDs that have been cached.
Did you call .cache() or .persist() on any of the RDDs?
Andrew
On Tue, Jan 6, 2015 at 6:48 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi,
I create a bunch of RDDs, including schema RDDs. When I run the
Hi,
On Tue, Jan 6, 2015 at 11:24 PM, Todd bit1...@163.com wrote:
I am a bit new to Spark, except that I tried simple things like word
count, and the examples given in the spark sql programming guide.
Now, I am investigating the internals of Spark, but I think I am almost
lost, because I
A good beginning if you are Chinese.
https://github.com/JerryLead/SparkInternals/tree/master/markdown
2015-01-07 10:13 GMT+08:00 bit1...@163.com bit1...@163.com:
Thank you, Tobias. I will look into the Spark paper. But it looks like
the paper has been moved,
Hey Davies,
Here are some more details on a configuration that causes this error for
me. Launch an AWS Spark EMR cluster as follows:
aws emr create-cluster --region us-west-1 --no-auto-terminate \
--ec2-attributes KeyName=your-key-here,SubnetId=your-subnet-here \
--bootstrap-actions
I want to support this but we don't yet. Here is the JIRA:
https://issues.apache.org/jira/browse/SPARK-3851
On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore dragoncu...@gmail.com wrote:
Anyone got any further thoughts on this? I saw the _metadata file seems
to store the schema of every single
Thanks Eric. Yes, I am Chinese :-). I will read through the articles, thank
you!
bit1...@163.com
From: eric wong
Date: 2015-01-07 10:46
To: bit1...@163.com
CC: user
Subject: Re: Re: I think I am almost lost in the internals of Spark
A good beginning if you are Chinese.
Hi,
I'm trying to use a combination of SparkSQL and 'normal' Spark/Scala via
rdd.mapPartitions(…). Using the latest release 1.2.0.
Simple example; load up some sample data from parquet on HDFS (about 380m rows,
10 columns) on a 7 node cluster.
val t =
Interestingly Google Chrome translates the materials.
Cheers
k/
On Tue, Jan 6, 2015 at 7:26 PM, Boromir Widas vcsub...@gmail.com wrote:
I do not understand Chinese but the diagrams on that page are very helpful.
On Tue, Jan 6, 2015 at 9:46 PM, eric wong win19...@gmail.com wrote:
A good
Hi,
it looks to me as if you need the whole user database on every node, so
maybe put the id-name information as a Map[Id, String] in a broadcast
variable and then do something like
recommendations.map(line => {
  line.map(uid => usernames(uid))
})
or so?
Tobias
Hi,
On Wed, Jan 7, 2015 at 10:47 AM, Riginos Samaras samarasrigi...@gmail.com
wrote:
Yes something like this. Can you please give me an example to create a Map?
That depends heavily on the shape of your input file. What about something
like:
(for (line <- Source.fromFile(filename).getLines())
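Put together, a rough sketch of the whole approach might look like this (the "id,name" file layout and the variable names are assumptions):

import scala.io.Source

// Build an id -> name map on the driver from a comma-separated "id,name" file.
val usernames: Map[Int, String] =
  Source.fromFile(filename).getLines().map { line =>
    val parts = line.split(",")
    parts(0).toInt -> parts(1)
  }.toMap

// Broadcast it once, then translate ids inside the recommendations RDD.
val usernamesBC = sc.broadcast(usernames)
val withNames = recommendations.map(line => line.map(uid => usernamesBC.value(uid)))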
Hi,
On Wed, Jan 7, 2015 at 11:13 AM, Riginos Samaras samarasrigi...@gmail.com
wrote:
Exactly, that's what I'm looking for; my code is like this:
//code
val users_map = users_file.map{ s =>
  val parts = s.split(",")
  (parts(0).toInt, parts(1))
}.distinct
//code
but I get the error:
If the classes are in the original location then I think it's safe to say
that this makes it impossible for us to build one app that can run against
Spark 1.0.x, 1.1.x and 1.2.x.
That's no big deal, but it does beg the question of what compatibility can
reasonably be expected for Spark 1.x.
From what I can tell, this isn't a firewall issue per se; it's how the
Remoting Service binds to an IP given command-line parameters. So, if I have
a VM (or OpenStack or EC2 instance) running on a private network, let's say,
where the IP address is 192.168.X.Y, I can't tell the Workers to reach me
on
Found the issue in JIRA:
https://issues.apache.org/jira/browse/SPARK-4389?jql=project%20%3D%20SPARK%20AND%20text%20~%20NAT
On Tue, Jan 6, 2015 at 10:45 AM, Aaron aarongm...@gmail.com wrote:
From what I can tell, this isn't a firewall issue per se; it's how the
Remoting Service binds to an IP
spark-submit may not share the same JVM with Spark master and executors.
On Tue, Jan 6, 2015 at 11:40 AM, Tomas Hudik xhu...@gmail.com wrote:
thanks Xiangrui
I'll try it.
BTW: spark-submit is a standalone program (bin/spark-submit). Therefore, JVM
has to be executed after spark-submit
Which Spark version are you using? We made this configurable in 1.1:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L202
-Xiangrui
On Tue, Jan 6, 2015 at 12:57 PM, Fernando O. fot...@gmail.com wrote:
Hi,
I was doing a tests
We are going to estimate the average distance using [HyperAnf](
http://arxiv.org/abs/1011.5599) on a 100 billion edge graph.
2015-01-07 2:18 GMT+08:00 Ankur Dave ankurd...@gmail.com:
[-dev]
What size of graph are you hoping to run this on? For small graphs where
materializing the all-pairs
As per our telephone call, see how we can fetch the count:
val tweetsCount = sql("SELECT COUNT(*) FROM tweets")
println(f"\n\n\nThere are ${tweetsCount.collect.head.getLong(0)} Tweets on
this Dataset\n\n")
Quoting Michael:
Predicate push down into the input format is turned off by default because
there is a bug in the current parquet library that throws null pointer
exceptions when there are full row groups that are null.
https://issues.apache.org/jira/browse/SPARK-4258
You can turn it on if you want:
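For example, either of these should enable it (matching the SET command used later in this thread):

// Via SQL, or directly on the context:
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")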
Just follow this documentation
http://spark.apache.org/docs/1.1.1/running-on-yarn.html
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory
which contains the (client side) configuration files for the Hadoop
cluster. These configs are used to write to the dfs and connect to
This is addressed in https://issues.apache.org/jira/browse/SPARK-4789.
In the new pipeline API, we can simply output two columns, one for the
best predicted class, and the other for probabilities or confidence
scores for each class. -Xiangrui
On Tue, Jan 6, 2015 at 11:43 AM, Jianguo Li
Hi,
I'm testing the parquet file format, and predicate pushdown is a very
useful feature for us.
However, it looks like predicate pushdown doesn't work even after I set
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true"). Here
is my SQL:
An RDD cannot contain elements of type RDD (i.e. you can't nest RDDs within
RDDs; in fact, I don't think it makes any sense).
I suggest rather than having an RDD of file names, collect those file name
strings back on to the driver as a Scala array of file names, and then from
there, make an array
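A minimal sketch of that suggestion (fileNameRDD and the use of textFile are assumptions):

// Collect the file names onto the driver...
val fileNames: Array[String] = fileNameRDD.collect()

// ...then build one RDD per file and union them into a single RDD.
val perFile = fileNames.map(name => sc.textFile(name))
val combined = sc.union(perFile)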
Could you attach the executor log? That may help identify the root
cause. -Xiangrui
On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote:
Hi All,
Word2Vec and TF-IDF algorithms in Spark mllib-1.1.0 are working only in
local mode and not in distributed mode. Null
Thanks for that. Strangely enough I was actually using 1.1.1 where it did
seem to be enabled by default. Since upgrading to 1.2.0 and setting that
flag, I do get the expected result! Looks good!
On Tue, Jan 6, 2015 at 12:17 PM, Michael Armbrust mich...@databricks.com
wrote:
Predicate push
Hi,
I create a bunch of RDDs, including schema RDDs. When I run the program and
go to UI on xxx:4040, the storage tab does not shows any RDDs.
Spark version is 1.1.1 (Hadoop 2.3)
Any thoughts?
Thanks,
Thanks Ted. You are right, hbase-site.xml is in the classpath. But previously I
had it in the classpath too and the app worked fine. I believe I found the
problem. I built Spark 1.2.0 myself and forgot to change the dependency HBase
version to 0.98.8-hadoop2, which is the version I use. When I
Awesome. Thanks again Ted. I remember there is a block in the pom.xml under the
example folder that defaults the HBase version to hadoop1. I figured this out
last time when I built Spark 1.1.1 but forgot this time.
<profile>
  <id>hbase-hadoop1</id>
  <activation>
    <property>
Is there an easy way to do a moving average across a single RDD (in a
non-streaming app). Here is the use case. I have an RDD made up of stock
prices. I want to calculate a moving average using a window size of N.
Thanks.
Asim
Issue resolved after updating the Hbase version to 0.98.8-hadoop2. Thanks Ted
for all the help!
For future reference: This problem has nothing to do with Spark 1.2.0 but
simply because I built Spark 1.2.0 with the wrong Hbase version.
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday,
No, not all rdds have location information, and in any case tasks may be
scheduled on non-local nodes if there is idle capacity.
see spark.locality.wait
http://spark.apache.org/docs/latest/configuration.html
On Tue, Jan 6, 2015 at 10:17 AM, gtinside gtins...@gmail.com wrote:
Does spark
I assume hbase-site.xml is in the classpath.
Can you try the code snippet in standalone program to see if the problem
persists ?
Cheers
On Tue, Jan 6, 2015 at 6:42 AM, Max Xu max...@twosigma.com wrote:
Hi all,
I have a Spark streaming application that ingests data from a Kafka topic
and
I am running into the same problem described here
https://www.mail-archive.com/user%40spark.apache.org/msg17788.html which for
some reason does not appear in the archives.
I have a standalone Scala application, built (using sbt) with Spark
jars from Maven:
"org.apache.spark" %% "spark-core"
The default profile is hbase-hadoop1, so you need to specify
-Dhbase.profile=hadoop2.
See SPARK-1297
Cheers
On Tue, Jan 6, 2015 at 9:11 AM, Max Xu max...@twosigma.com wrote:
Thanks Ted. You are right, hbase-site.xml is in the classpath. But
previously I have it in the classpath too and the app
I doubt anyone would deploy HBase 0.98.x on hadoop-1.
Looks like the hadoop2 profile should be made the default.
Cheers
On Tue, Jan 6, 2015 at 9:49 AM, Max Xu max...@twosigma.com wrote:
Awesome. Thanks again Ted. I remember there is a block in the pom.xml
under the example folder that default
Does Spark guarantee to push the processing to the data? Before creating
tasks, does Spark always check the data location? So for example, if I have 3
Spark nodes (Node1, Node2, Node3) and data is local to just 2 nodes (Node1
and Node2), will Spark always schedule tasks on the node for which the
The issue has been sensitive to the number of executors and input data
size. I'm using 2 executors with 4 cores each, 25GB of memory, 3800MB of
memory overhead for YARN. This will fit onto Amazon r3 instance types.
-Sven
On Tue, Jan 6, 2015 at 12:46 AM, Davies Liu dav...@databricks.com wrote:
I
First you'd need to sort the RDD to give it a meaningful order, but I
assume you have some kind of timestamp in your data you can sort on.
I think you might be after the sliding() function, a developer API in MLlib:
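For example, a rough sketch using it (pricesRDD, its fields, and the window size are assumptions):

import org.apache.spark.mllib.rdd.RDDFunctions._

val n = 5  // window size; illustrative
// Sort so the window order is meaningful, then average each run of n prices.
val movingAvg = pricesRDD
  .sortBy(_.timestamp)
  .map(_.price)
  .sliding(n)
  .map(window => window.sum / n)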
[-dev]
What size of graph are you hoping to run this on? For small graphs where
materializing the all-pairs shortest path is an option, you could simply
find the APSP using https://github.com/apache/spark/pull/3619 and then take
the average distance (apsp.map(_._2.toDouble).mean).
Ankur