How can I configure Mesos allocation policy to share resources between all
current Spark applications? I can't seem to find it in the architecture
docs.
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Tue, Nov 4, 2014 at 9:11 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Yes.
Hi David,
Use something like :
val outputRDD = rdd.flatMap(keyValue => keyValue._2.split(";").map(value =>
(keyValue._1, value)).toArray)
Thanks and Regards,
Suraj Sheth
-Original Message-
From: david [mailto:david...@free.fr]
Sent: Tuesday, November 04, 2014 1:28 PM
To:
This might help
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala
Thanks
Best Regards
On Tue, Nov 4, 2014 at 6:03 AM, Harold Nguyen har...@nexgate.com wrote:
Hi all,
I was just reading this nice documentation
Hi Gwen,
I have changed the Java KafkaWordCount code to use reduceByKeyAndWindow in
Spark.
----- Original message -----
From: Gwen Shapira gshap...@cloudera.com
Sent: 03/11/2014 21:08
To: us...@kafka.apache.org us...@kafka.apache.org
Cc: u...@spark.incubator.apache.org
Hi,
Can someone please suggest the best way to output Spark data as a
JSON file (a file where each line is a JSON object)?
Cheers,
Andrejs
You can look at different modes over here
http://docs.sigmoidanalytics.com/index.php/Spark_On_Mesos#Mesos_Run_Modes
These people have a very good tutorial to get you started
http://mesosphere.com/docs/tutorials/run-spark-on-mesos/#overview
Thanks
Best Regards
On Tue, Nov 4, 2014 at 1:44 PM, Romi
I have a single Spark cluster, not multiple frameworks and not multiple
versions. Is it relevant for my use-case?
Where can I find information about exactly how to make Mesos tell Spark how
many resources of the cluster to use? (instead of the default take-all)
*Romi Kuntsman*, *Big Data
You need to install Mesos on your cluster. Then you run your Spark
applications by specifying the Mesos master (mesos://...) instead of the standalone one (spark://...).
Spark can run over Mesos in two modes: “*fine-grained*” (default) and “
*coarse-grained*”.
In “*fine-grained*” mode (default), each Spark task runs as
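For reference, a minimal sketch of picking the run mode and capping resources from the application side; the master address and resource numbers below are placeholders, not values from this thread:

import org.apache.spark.{SparkConf, SparkContext}

// All values here are illustrative placeholders.
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("mesos://mesos-master.example.com:5050")
  .set("spark.mesos.coarse", "true")   // coarse-grained; omit (or "false") for fine-grained
  .set("spark.cores.max", "6")         // cap the cores this application may claim
  .set("spark.executor.memory", "8g")
val sc = new SparkContext(conf)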
Let's say that I run Spark on Mesos in fine-grained mode, and I have 12
cores and 64GB memory.
I run application A on Spark, and some time after that (but before A
finished) application B.
How many CPUs will each of them get?
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Tue,
Same Issue .. How did you solve it?
Generally this means you included some javax.servlet dependency in
your project deps. You should exclude any of these as they conflict in
this bad way with other copies of the servlet API from Spark.
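If the project is built with sbt, an exclusion along these lines is one way to do that; the dependency coordinates below are placeholders for whichever artifact is pulling in javax.servlet classes:

// build.sbt sketch; "org.example" % "some-hadoop-client" is a placeholder dependency.
libraryDependencies += ("org.example" % "some-hadoop-client" % "1.0")
  .exclude("javax.servlet", "servlet-api")
  .exclude("org.mortbay.jetty", "servlet-api")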
On Tue, Nov 4, 2014 at 7:55 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi all,
I have a
Can you paste the piece of code that you are running?
Thanks
Best Regards
On Tue, Nov 4, 2014 at 3:24 PM, sivarani whitefeathers...@gmail.com wrote:
Same Issue .. How did you solve it?
Thanks
Hi,
I need to query data loaded into the SparkContext (specifically, a SchemaRDD
loaded from JSON file(s)). For this purpose, I created the SchemaRDD, called the
registerTempTable method in a standalone program, and submitted the
application using the spark-submit command.
Then I have
You can add your custom jar to SPARK_CLASSPATH inside the spark-env.sh file
and restart the cluster to get it shipped to all the workers. You can also
use the .setJars option and add the jar while creating the SparkContext.
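A quick sketch of the setJars route; the jar path, master URL, and app name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://master:7077")            // hypothetical standalone master
  .setJars(Seq("/path/to/my-custom-lib.jar"))  // shipped to every executor
val sc = new SparkContext(conf)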
Thanks
Best Regards
On Tue, Nov 4, 2014 at 8:12 AM, Peng Cheng
Tobias,
From http://spark.apache.org/docs/latest/configuration.html it seems
that there is an experimental property:
spark.files.userClassPathFirst
Whether to give user-added jars precedence over Spark's own jars when
loading classes in Executors. This feature can be used to mitigate
Hi Forum,
I am running a simple Spark application with 1 master and 1 worker,
submitting my application through spark-submit as a Java program. I have
System.out.println statements in the program, but I am not finding their output
in the stdout/stderr links in the master's web UI, nor in the SPARK_HOME/work directory.
Michael,
I should probably look closer myself at the design of 1.2 vs 1.1, but I've
been curious why Spark's in-memory data uses the heap instead of putting it
off-heap. Was this the optimization that was done in 1.2 to alleviate GC?
On Mon, Nov 3, 2014 at 8:52 PM, Shailesh Birari
Hi All
I am using SparkStreaming..
public class SparkStreaming{
SparkConf sparkConf = new SparkConf().setAppName("Sales");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new
Duration(5000));
String chkPntDir = ; //get checkpoint dir
jssc.checkpoint(chkPntDir);
JavaSpark jSpark =
Folks,
If I have an RDD persisted in MEMORY_ONLY_SER mode and then it is needed
for a transformation/action later, is the whole partition of the RDD
deserialized into Java objects first before my transform/action code works
on it? Or is it deserialized in a streaming manner as the iterator moves
This is not supported yet. It would be great if you could open a JIRA
(though I think apache JIRA is down ATM).
On Tue, Nov 4, 2014 at 9:40 AM, Terry Siu terry@smartfocus.com wrote:
I’m trying to execute a subquery inside an IN clause and am encountering
an unsupported language feature
Hi,
I just built master today and I was testing the IR metrics (MAP and
prec@k) on the MovieLens data to establish a baseline...
I am getting a weird error which I have not seen before:
MASTER=spark://TUSCA09LMLVT00C.local:7077 ./bin/run-example
mllib.MovieLensALS --kryo --lambda 0.065
I'm using the same code
https://github.com/apache/spark/blob/83b7a1c6503adce1826fc537b4db47e534da5cae/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L687,
though still receive
not enough arguments for method sortBy: (f: String => K, ascending:
Boolean, numPartitions: Int)(implicit ord:
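For reference, a call that compiles against the current RDD.sortBy signature (the data is a placeholder and `sc` is assumed, as in spark-shell); on older Spark releases sortBy may simply not be available on RDD, which can also produce confusing errors:

val data = sc.parallelize(Seq("banana", "apple", "cherry"))
val sorted = data.sortBy(s => s, ascending = true, numPartitions = 2)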
Temporary tables are local to the context that creates them (just like
RDDs). I'd recommend saving the data out as Parquet to share it between
contexts.
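A minimal sketch of that suggestion; the paths and column names are placeholders, and schemaRDD / sqlContext are assumed to exist as in the original program:

// In the application that builds the data:
schemaRDD.saveAsParquetFile("hdfs:///shared/people.parquet")

// In any other application / SQLContext that wants to query it:
val people = sqlContext.parquetFile("hdfs:///shared/people.parquet")
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT * FROM people WHERE age >= 18")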
On Tue, Nov 4, 2014 at 3:18 AM, vdiwakar.malladi vdiwakar.mall...@gmail.com
wrote:
Hi,
There is a need in my application to query the
On Tue, Nov 4, 2014 at 8:02 PM, spr s...@yarcdata.com wrote:
To state this another way, it seems like there's no way to straddle the
streaming world and the non-streaming world; to get input from both a
(vanilla, Linux) file and a stream. Is that true?
If so, it seems I need to turn my
Hi everyone,
We've recently added indexing of all Spark resources to
http://search-hadoop.com/spark .
Everything is nicely searchable:
* user & dev mailing lists
* JIRA issues
* web site
* wiki
* source code
* javadoc.
Maybe it's worth adding to http://spark.apache.org/community.html ?
Enjoy!
I am trying to create a schema which will look like:
root
 |-- ParentInfo: struct (nullable = true)
 |    |-- ID: string (nullable = true)
 |    |-- State: string (nullable = true)
 |    |-- Zip: string (nullable = true)
 |-- ChildInfo: struct (nullable = true)
 |    |-- ID: string (nullable =
How do I create a StructField of StructType? I need to create a nested
schema.
Structs are Rows nested in other rows. This might also be helpful:
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
On Tue, Nov 4, 2014 at 12:21 PM, tridib tridib.sama...@live.com wrote:
How do I create a StructField of StructType? I need to
I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did,
# build (K,V) from A and B to prepare the join
val ja = A.map( r => (K1, Va))
val jb = B.map( r => (K1, Vb))
# join A, B
val jab = ja.join(jb)
# build (K,V) from the joined result of A and B to prepare joining with C
val jc =
It is deserialized in a streaming manner as the iterator moves over the
partition. This is a functionality of core Spark, and Spark Streaming just
uses it as is.
What do you want to customize it to?
On Tue, Nov 4, 2014 at 9:22 AM, Mohit Jaggi mohitja...@gmail.com wrote:
Folks,
If I have an RDD
Got it from a friend - println(model.weights) and println(model.intercept).
Didn't you get any errors in the log4j logs saying that you have to enable
checkpointing?
TD
On Tue, Nov 4, 2014 at 7:20 AM, diogo di...@uken.com wrote:
So, to answer my own n00b question, in case anyone ever needs it. You have
to enable checkpointing (by ssc.checkpoint(hdfsPath)). Windowed
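A minimal sketch of that fix for a windowed word count; the checkpoint path, input source, and window sizes are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("WindowedCounts")
val ssc = new StreamingContext(sparkConf, Seconds(5))
ssc.checkpoint("hdfs:///checkpoints/windowed-counts")   // required before any windowed/stateful op

// Placeholder input; a real job would typically read from Kafka here.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))
counts.print()
ssc.start()
ssc.awaitTermination()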
Hello Tridib,
For your case, you can use StructType(StructField("ParentInfo", parentInfo,
true) :: StructField("ChildInfo", childInfo, true) :: Nil) to create the
StructType representing the schema (parentInfo and childInfo are two
existing StructTypes). You can take a look at our docs (
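For completeness, a sketch of that suggestion applied to the schema shown earlier in the thread; in Spark 1.1 these types come in via import org.apache.spark.sql._ (later releases moved them to org.apache.spark.sql.types):

import org.apache.spark.sql._

val parentInfo = StructType(
  StructField("ID", StringType, true) ::
  StructField("State", StringType, true) ::
  StructField("Zip", StringType, true) :: Nil)

val childInfo = StructType(
  StructField("ID", StringType, true) :: Nil)

val schema = StructType(
  StructField("ParentInfo", parentInfo, true) ::
  StructField("ChildInfo", childInfo, true) :: Nil)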
Thanks for the solution! I did figure out how to create an .egg file to ship
out to the workers. Using ipython seems to be another cool solution.
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general
provides a broader set of capabilities than Redshift because it has APIs in
general-purpose languages (Java, Scala, Python) and libraries for things like
machine learning and graph processing. For example, you might use
BTW while I haven't actually used Redshift, I've seen many companies that use
both, usually using Spark for ETL and advanced analytics and Redshift for SQL
on the cleaned / summarized data. Xiangrui Meng also wrote
https://github.com/mengxr/redshift-input-format to make it easy to read data
This is pretty spot on, though I would also add that the Spark features
that it touts around speed all depend on caching the data into
memory; reading off the disk still takes time, i.e. pulling the data into
an RDD. This is the reason that Spark is great for ML: the data is used
over
There is no one-size-fits-all solution available in the market today. If
somebody tells you they have one, they are simply lying :)
Both solutions cater to different sets of problems. My recommendation is to
put real focus on getting a better understanding of the problems you
are trying to solve
Hi Nan,
Cool. Thanks.
Regards,
Ashic.
Date: Tue, 4 Nov 2014 18:26:48 -0500
From: zhunanmcg...@gmail.com
To: as...@live.com
CC: user@spark.apache.org
Subject: Re: Workers not registering after master restart
Hi, Ashic,
this is expected for the latest released
The latest version of PredictionIO, which is now under the Apache 2 license,
supports the deployment of MLlib models in production.
The engine you build will include a few components, such as:
- Data - includes Data Source and Data Preparator
- Algorithm(s)
- Serving
I believe that you can do the
Hi,
Amazon aws started to provide service for China mainland, the region
name is cn-north-1. But the script spark provides: spark_ec2.py will query
ami id from https://github.com/mesos/spark-ec2/tree/v4/ami-list and there's
no ami information for cn-north-1 region .
Can anybody update the
SchemaRDD supports some of the SQL-like functionality like groupBy(),
distinct(), select(). However, SparkSQL also supports SQL statements which
provide this functionality. In terms of future support and performance, is
it better to use SQL statements or the SchemaRDD methods that provide
Markus,
thanks for your help!
On Tue, Nov 4, 2014 at 8:33 PM, M. Dale medal...@yahoo.com.invalid wrote:
Tobias,
From http://spark.apache.org/docs/latest/configuration.html it seems
that there is an experimental property:
spark.files.userClassPathFirst
Thank you very much, I didn't
I'm fairly new to spark and I'm trying to kick the tires with a few
InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf
instead of a MapReduce Job object. Is there future planned support for the
mapreduce packaging?
Sounds like context would help, I just didn't want to subject people to a
wall of text if it wasn't necessary :)
Currently we use neither Spark SQL (or anything else in the Hadoop stack) nor
Redshift. We service templated queries from the appserver, i.e. the user fills
out some forms, dropdowns: we
They both compile down to the same logical plans so the performance of
running the query should be the same. The Scala DSL uses a lot of Scala
magic and thus is experimental, whereas HiveQL is pretty set in stone.
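A tiny sketch of the two equivalent forms; a SchemaRDD registered as "people" is assumed, along with the implicits from import sqlContext._ for the symbol-based DSL:

// Scala DSL form
val viaDsl = people.where('age > 21).select('name)

// SQL form; both compile down to the same logical plan
val viaSql = sqlContext.sql("SELECT name FROM people WHERE age > 21")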
On Tue, Nov 4, 2014 at 5:22 PM, SK skrishna...@gmail.com wrote:
SchemaRDD
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
cn-north-1 is not a supported region for EC2, as far as I can tell. There
may be other AWS services that can use that region, but spark-ec2 relies on
EC2.
Nick
On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao
You could take a look at sc.newAPIHadoopRDD()
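A rough sketch of using it with a mapreduce-package TextInputFormat under the Hadoop 2 APIs; the input path is a placeholder and `sc` is assumed, as in spark-shell:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

val job = Job.getInstance()                            // carries the Configuration for the new API
FileInputFormat.addInputPath(job, new Path("hdfs:///data/input"))

val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])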
On Nov 5, 2014, at 9:29 AM, Corey Nolet cjno...@gmail.com wrote:
I'm fairly new to spark and I'm trying to kick the tires with a few
InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf
instead of a MapReduce Job object. Is there
I'm afraid not. We have been using EC2 instances in cn-north-1 region for a
while. And the latest version of boto has added the region: cn-north-1
Here's the screenshot:
from boto import ec2
ec2.regions()
[RegionInfo:us-east-1, RegionInfo:cn-north-1, RegionInfo:ap-northeast-1,
Oh, I can see that region via boto as well. Perhaps the doc is indeed out
of date.
Do you mind opening a JIRA issue
https://issues.apache.org/jira/secure/Dashboard.jspa to track this
request? I can do it if you've never opened a JIRA issue before.
Nick
On Tue, Nov 4, 2014 at 9:03 PM, haitao
Got my answer from this thread,
http://apache-spark-user-list.1001560.n3.nabble.com/no-stdout-output-from-worker-td2437.html
The proposed new set of APIs (SPARK-3573, SPARK-3530) will address
this issue. We carry over extra columns with training and prediction
and then leverage Spark SQL's execution plan optimization to decide
which columns are really needed. For the current set of APIs, we can
add `predictOnValues`
Hi,
Can Spark achieve whatever GraphX can?
Keeping aside the performance comparison between Spark and GraphX, if I
want to implement any graph algorithm and I do not want to use GraphX, can
I get the work done with Spark?
Thank You
Thanks Michael for your response.
Just now, I saw the saveAsTable method on the JavaSchemaRDD object (in the Spark 1.1.0
API), but I couldn't find the corresponding documentation. Will that help?
Please let me know.
Thanks in advance.
Hey guys,
I have written a tutorial on deploying MLlib's models in production with
open source PredictionIO: http://docs.prediction.io/0.8.1/templates/
The goal is to add the following features to MLlib, with production
application in mind:
- JSON query to retrieve prediction online
-
Anybody any luck? I am also trying to set None to delete a key from state; will
null help? How do I use Scala's None in Java?
My code goes this way:
public static class ScalaLang {
    public static <T> Option<T> none() {
        return (Option<T>) None$.MODULE$;
    }
}
I am trying to run the Spark Streaming program as given in the Spark Streaming
Programming guide
(https://spark.apache.org/docs/latest/streaming-programming-guide.html)
in the interactive shell. I am getting an error as shown
here (file:///C:\Users\10609685\Desktop\stream-spark.png) as an
I have the following code in my program. I don't get any error, but it's not
consuming the messages either. Shouldn't the following code print the lines
in the 'call' method? What am I missing?
Please help. Thanks.
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new
Hi, would you mind describing your problem a little more specifically?
1. Does the Kafka broker currently have any data feeding in?
2. This code will print the lines, but not on the driver side; the code is
running on the executor side, so you can check the log in the worker dir to see if
there's
The Kafka broker definitely has messages coming in. But your #2 point is
valid. Needless to say I am a newbie to Spark. I can't figure out where
the 'executor' logs would be. How would I find them?
All I see printed on my screen is this:
14/11/04 22:21:23 INFO Slf4jLogger: Slf4jLogger
This code only expresses a transformation and so does not actually
cause any action. I think you intend to use foreachRDD.
On Wed, Nov 5, 2014 at 5:57 AM, Something Something
mailinglist...@gmail.com wrote:
I've following code in my program. I don't get any error, but it's not
consuming the
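A minimal sketch of the fix Sean describes, in Scala for brevity (the Java form via statuses.foreachRDD(...) is analogous); the socket source below is just a stand-in for the Kafka stream built in the original code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("Example"), Seconds(60))
// Placeholder input; the original code builds its DStream from Kafka instead.
val statuses = ssc.socketTextStream("localhost", 9999)

statuses.foreachRDD { rdd =>
  rdd.foreach(println)   // runs on the executors, so output lands in the worker logs
}
statuses.print()         // driver-side alternative for quick debugging

ssc.start()
ssc.awaitTermination()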
From my understanding, the Spark code uses Kryo in a streaming manner for RDD
partitions; the deserialization advances as the iterator moves forward. But whether
Kryo internally deserializes all the objects at once or incrementally is
actually a behavior of Kryo itself; I guess Kryo will not deserialize
It looks more like you have different versions of Spark.
Thanks
Best Regards
On Wed, Nov 5, 2014 at 3:05 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
I set the host and port of the driver and now the error slightly changed
Using Spark's default log4j profile:
If you're running in standalone mode, the logs are under the SPARK_HOME/work/
directory. I'm not sure about yarn or mesos; you can check the Spark documentation
for the details.
Thanks
Jerry
From: Something Something [mailto:mailinglist...@gmail.com]
Sent: Wednesday, November 05, 2014 2:28 PM
To:
Which error are you referring to here? Can you paste the error logs?
Thanks
Best Regards
On Wed, Nov 5, 2014 at 11:04 AM, Suman S Patil suman.pa...@lntinfotech.com
wrote:
I am trying to run the Spark streaming program as given in the Spark
streaming Programming guide
With so many iterations, your RDD lineage is too deep. You should not
need nearly so many iterations. 10 or 20 is usually plenty.
On Tue, Nov 4, 2014 at 11:13 PM, Hongbin Liu hongbin@theice.com wrote:
Hi, can you help with the following? We are new to spark.
Error stack:
14/11/04
Something like this?
val json = myRDD.map(map_obj => new JSONObject(map_obj))
Here map_obj will be a map containing values (eg: Map("name" -> "Akhil",
"mail" -> "xyz@xyz"))
Performance wasn't so good with this one though.
Thanks
Best Regards
On Wed, Nov 5, 2014 at 3:02 AM, Yin Huai
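Building on that suggestion, a minimal sketch that writes one JSON object per line using only saveAsTextFile; the sample records and output path are made up, no escaping is handled, and a real job would normally use a JSON library:

val records = sc.parallelize(Seq(
  Map("name" -> "Akhil", "mail" -> "xyz@xyz"),
  Map("name" -> "Andrejs", "mail" -> "abc@abc")))

// One JSON object per line (per output part file).
val jsonLines = records.map { m =>
  m.map { case (k, v) => "\"" + k + "\": \"" + v + "\"" }.mkString("{", ", ", "}")
}
jsonLines.saveAsTextFile("hdfs:///out/json-lines")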
Hi
I have a simple use case where I have to join two feeds. I have two worker
nodes, each having 96 GB memory and 24 cores. I am running Spark (1.1.0) with
YARN (2.4.0).
I have allocated 80% of the resources to the Spark queue and my Spark config looks like:
spark.executor.cores=18
spark.executor.memory=66g
Added foreach as follows. Still don't see any output on my console. Would
this go to the worker logs as Jerry indicated?
JavaPairReceiverInputDStream<String, String> tweets = KafkaUtils.createStream(ssc,
    "mymachine:2181", "1", map);
JavaDStream<String> statuses = tweets.map(
new
GraphX is built on *top* of Spark, so Spark can achieve whatever GraphX can.
On Wed, Nov 5, 2014 at 9:41 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
Can Spark achieve whatever GraphX can?
Keeping aside the performance comparison between Spark and GraphX, if I
want to implement any
How about using Spark SQL (https://spark.apache.org/sql/)?
Thanks
Best Regards
On Wed, Nov 5, 2014 at 1:53 AM, Benyi Wang bewang.t...@gmail.com wrote:
I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did,
# build (K,V) from A and B to prepare the join
val ja = A.map( r => (K1, Va))
It's not local. My Spark URL is something like this:
String sparkUrl = "spark://<host name>:7077";
On Tue, Nov 4, 2014 at 11:03 PM, Jain Rahul ja...@ivycomptech.com wrote:
I think you are running it locally.
Do you have local[1] here for master url? If yes change it to local[2] or
I think you are running it locally.
Do you have local[1] here for the master URL? If yes, change it to local[2] or
more threads.
It may be due to topic name mismatch also.
sparkConf.setMaster("local[1]");
Regards,
Rahul
From: Something Something
Your code doesn't trigger any action. How about the following?
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new
Duration(60 * 1 * 1000));
JavaPairReceiverInputDStream<String, String> tweets = KafkaUtils.createStream(ssc,
    "machine:2181", "1", map);
JavaDStream<String> statuses =
I'm using spark-1.0.0 in CDH 5.1.0. The big problem is SparkSQL doesn't
support Hash join in this version.
On Tue, Nov 4, 2014 at 10:54 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
How about Using SparkSQL https://spark.apache.org/sql/?
Thanks
Best Regards
On Wed, Nov 5, 2014 at 1:53
Done, JIRA link: https://issues.apache.org/jira/browse/SPARK-4241
Thanks.
2014-11-05 10:58 GMT+08:00 Nicholas Chammas nicholas.cham...@gmail.com:
Oh, I can see that region via boto as well. Perhaps the doc is indeed out
of date.
Do you mind opening a JIRA issue
Oh, in that case, if you want to reduce the GC time, you can specify the
level of parallelism along with your join and reduceByKey operations.
Thanks
Best Regards
On Wed, Nov 5, 2014 at 1:11 PM, Benyi Wang bewang.t...@gmail.com wrote:
I'm using spark-1.0.0 in CDH 5.1.0. The big problem is
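For illustration, a sketch of passing an explicit partition count to the shuffle operations Akhil mentions; the data and the number 200 are placeholders, and `sc` is assumed as in spark-shell:

import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

val ja = sc.parallelize(Seq((1, "a"), (2, "b")))
val jb = sc.parallelize(Seq((1, "x"), (2, "y")))

val joined  = ja.join(jb, 200)                               // 200 partitions for the shuffle
val reduced = ja.mapValues(_ => 1).reduceByKey(_ + _, 200)   // same idea for reduceByKey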
What is the best way to implement a sparse x sparse matrix multiplication
with Spark?
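One common approach (not necessarily the best for every data shape) is to keep only the non-zero entries as (row, column, value) triples, join on the shared inner index, multiply, and sum the partial products per output cell; a sketch with made-up entries:

import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

val A = sc.parallelize(Seq((0, 1, 2.0), (1, 0, 3.0)))   // (i, k, value) non-zero entries
val B = sc.parallelize(Seq((1, 2, 4.0), (0, 0, 5.0)))   // (k, j, value) non-zero entries

val aByK = A.map { case (i, k, v) => (k, (i, v)) }
val bByK = B.map { case (k, j, v) => (k, (j, v)) }

val C = aByK.join(bByK)                                  // pair up entries sharing the inner index k
  .map { case (_, ((i, av), (j, bv))) => ((i, j), av * bv) }
  .reduceByKey(_ + _)                                    // C(i, j) = sum over k of A(i,k) * B(k,j)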