Hi,
I noticed that rdd.cache() does not happen immediately; rather, due to Spark's
lazy evaluation, it happens only at the moment you perform some
map/reduce actions. Is this true?
If this is the case, how can I force Spark to cache immediately at the
cache() statement? I need this to
And one more thing: the given tuples
(1, 1.0)
(2, 1.0)
(3, 2.0)
(4, 2.0)
(5, 0.0)
are part of an RDD; they are not just plain tuples.
graph.vertices returns me the above tuples as part of a VertexRDD.
On Wed, Dec 3, 2014 at 3:43 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
This is just
Set the spark.storage.memoryFraction flag to 1 while creating the SparkContext
to utilize up to 73 GB of your memory; the default is 0.6, which is why you are
getting 33.6 GB. Also set rdd.compression and use StorageLevel
MEMORY_ONLY_SER if your data is larger than your available memory.
(you could try
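A minimal sketch of the settings above, assuming Spark 1.x config key names; the app name and input path are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("MemoryTuning")                    // hypothetical app name
  .set("spark.storage.memoryFraction", "1")      // default is 0.6
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs:///some/input")      // illustrative path
rdd.persist(StorageLevel.MEMORY_ONLY_SER)        // serialized in-memory storage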
Try running it in local mode. Looks like a jar conflict/missing.
SparkConf conf = new SparkConf().setAppName("JavaWordCount");
conf.set("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec");
conf.setMaster("local[2]").setSparkHome(System.getenv("SPARK_HOME"));
JavaSparkContext jsc = new
You could go through these to start with
http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
http://stackoverflow.com/questions/25189527/how-to-process-a-range-of-hbase-rows-using-spark
Thanks
Best Regards
On Wed, Dec 3, 2014 at
Hi,
Yes, as Jerry mentioned, SPARK-3129 (
https://issues.apache.org/jira/browse/SPARK-3129) enabled the WAL feature,
which solves the driver failure problem. The way SPARK-3129 is designed, it
solves the driver failure problem agnostic of the source of the stream
(like Kafka or Flume, etc.). But
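A minimal sketch of what enabling this looks like; the exact config key name and the paths are assumptions on my part (the WAL option arrived around Spark 1.2, and checkpointing is required for recovery):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReliableReceiver")                                  // hypothetical app name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")    // assumed key name
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/my-app")                       // illustrative checkpoint directory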
Using sbt-assembly I'm creating a fat jar that includes Spark and Akka. I've
encountered this error:
[error]
/home/dev/.ivy2/cache/com.typesafe.akka/akka-actor_2.10/jars/akka-actor_2.10-2.3.4.jar:akka/util/ByteIterator$$anonfun$getLongPart$1.class
[error]
You can't append to a file with Spark using the native saveAs* calls; they
will always check whether the output directory already exists and, if it does,
throw an error. People usually use Hadoop's getmerge utility to combine the
output.
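One common workaround, sketched below with an illustrative path scheme, is to write each run to a fresh directory and merge the part files afterwards (e.g. with hadoop fs -getmerge):

val out = s"hdfs:///data/output/run-${System.currentTimeMillis()}"  // unique directory per run
rdd.saveAsTextFile(out)  // would throw if the directory already existed, hence the unique name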
Thanks
Best Regards
On Tue, Dec 2, 2014 at 8:10 PM, Csaba Ragany
Hi Experts!
Is there a way to read the first N messages from a Kafka stream, put them in
some collection, return them to the caller for visualization purposes, and
close Spark Streaming?
I will be glad to hear from you and will be thankful to you.
Currently I have the following code that
def
Hi,
Can a UDF return a list of values that can be used in a WHERE clause?
Something like:
sqlCtx.registerFunction("myudf", { () =>
Array(1, 2, 3)
})
val sql = "select doc_id, doc_value from doc_table where doc_id in
myudf()"
This does not work:
Exception in thread main
Which hbase release are you running ?
If it is 0.98, take a look at:
https://issues.apache.org/jira/browse/SPARK-1297
Thanks
On Dec 2, 2014, at 10:21 PM, Jai jaidishhari...@gmail.com wrote:
I am trying to use Apache Spark with a pseudo-distributed Hadoop HBase
Cluster and I am looking for
Hi,
I have an RDD and a function that should be called on every item in this
RDD once (say it updates an external database). So far, I used
rdd.map(myFunction).count()
or
rdd.mapPartitions(iter => iter.map(myFunction))
but I am wondering if this always triggers the call of myFunction in both
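For purely side-effecting calls like this, an action such as foreach/foreachPartition avoids building a throwaway result; a minimal sketch, with myFunction standing for the caller's function:

rdd.foreachPartition { iter =>
  iter.foreach(myFunction)   // runs the side effect once per element in this partition
}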
This is just an example but if my graph is big, there will be so many
tuples to handle. I cannot manually do
val a: RDD[(Int, Double)] = sc.parallelize(List(
(1, 1.0),
(2, 1.0),
(3, 2.0),
(4, 2.0),
(5, 0.0)))
for all the vertices in the graph.
What should I do in that
Dump your classpath; it looks like you have multiple versions of Guava jars in
the classpath.
Thanks
Best Regards
On Wed, Dec 3, 2014 at 2:30 PM, Rahul Swaminathan
rahul.swaminat...@duke.edu wrote:
I’ve tried that and the same error occurs. Do you have any other
suggestions?
Thanks!
Rahul
From: Akhil Das ak...@sigmoidanalytics.com
Date: Wednesday, December 3, 2014 at 3:55 AM
To: Rahul Swaminathan rahul.swaminat...@duke.edu
Hi Guys,
I am trying to use SparkSQL to convert an RDD to SchemaRDD so that I can
save it in parquet format.
A record in my RDD has the following format:
RDD1
{
field1:5,
field2: 'string',
field3: {'a':1, 'c':2}
}
I am using field3 to represent a sparse vector and it can have keys:
Hey Venkat,
This behavior seems reasonable. According to the table names, I guess
here DAgents should be the fact table and ContactDetails is the dim
table. Below is an explanation of a similar query; you may see src as
DAgents and src1 as ContactDetails.
0:
At 2014-12-03 02:13:49 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote:
We cannot do sc.parallelize(List(VertexRDD)), can we?
There's no need to do this, because every VertexRDD is also a pair RDD:
class VertexRDD[VD] extends RDD[(VertexId, VD)]
You can simply use graph.vertices in
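For the grouping asked about above, a sketch that keys the vertices by their attribute; the import of SparkContext._ is assumed so the pair-RDD functions are in scope:

import org.apache.spark.SparkContext._

val byAttribute = graph.vertices
  .map { case (id, attr) => (attr, id) }   // (attribute, vertexId)
  .groupByKey()                            // one entry per distinct attribute value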
At 2014-12-02 22:01:20 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote:
I have a graph which returns the following on doing graph.vertices
(1, 1.0)
(2, 1.0)
(3, 2.0)
(4, 2.0)
(5, 0.0)
I want to group all the vertices with the same attribute together, like into
one RDD or something. I
Hi,
I am trying to use textFileStream(some_hdfs_location) to pick up new files
from an HDFS location. I am seeing pretty strange behavior though.
textFileStream() is not detecting new files when I move them from a
location within HDFS to the location at which textFileStream() is checking for
new files.
You could do something like:
val stream = kafkaStream.getStream().repartition(1).mapPartitions(x => x.take(10))
Here stream will have 10 elements from the kafkaStream.
Thanks
Best Regards
On Wed, Dec 3, 2014 at 1:05 PM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
Hi Experts!
Is
Hi Akhil!
Thanks for your response. Can you please suggest how to return this
sample from a function to the caller and stop Spark Streaming?
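One way to hand a small sample back to the caller and then shut streaming down; a sketch only, assuming the StreamingContext is named ssc, the DStream is the stream from the earlier snippet, and 10 is the desired sample size:

import scala.collection.mutable.ArrayBuffer

val sample = ArrayBuffer[String]()
stream.foreachRDD { rdd =>
  if (sample.size < 10) sample ++= rdd.take(10 - sample.size)   // runs on the driver each batch
}
ssc.start()
while (sample.size < 10) Thread.sleep(500)   // poll on the driver until enough elements arrive
ssc.stop(stopSparkContext = false)
// 'sample' can now be returned to the caller for visualization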
Thanks
On Wed, Dec 3, 2014 at 10:52 AM, shahab shahab.mok...@gmail.com wrote:
Hi,
I noticed that rdd.cache() does not happen immediately; rather, due to Spark's
lazy evaluation, it happens only at the moment you perform some
map/reduce actions. Is this true?
Yes, this is correct.
If this is
Andrew and developers, thank you for an excellent release!
It fixed almost all of our issues. Now we are migrating to Spark from a zoo of
Python, Java, Hive, and Pig jobs.
Our Scala/Spark jobs often failed on 1.1. Spark 1.1.1 works like a Swiss
watch.
Hi everyone!
I want to convert a DStream[String] into an RDD[String]. I could not find
how to do this.
var data = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder,
DefaultDecoder](ssc, consumerConfig, topicMap,
StorageLevel.MEMORY_ONLY).map(_._2)
val streams =
The following code is failing on the collect. If I don't do the collect and go
with a JavaRDD<Document> it works fine, except I really would like to collect.
At first I was getting an error regarding JDI threads and an index being 0.
Then it just started locking up. I'm running the spark context
I didn't realize I do get a nice stack trace if not running in debug mode.
Basically, I believe Document has to be serializable.
But since the question has already been asked: are there other requirements for
objects within an RDD that I should be aware of? Serializable is very
understandable.
DStream.foreachRDD gives you an RDD[String] for each interval of
course. I don't think it makes sense to say a DStream can be converted
into one RDD since it is a stream. The past elements are inherently
not supposed to stick around for a long time, and future elements
aren't known. You may
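A minimal sketch of working with the per-interval RDD, where data is the DStream[String] from the earlier snippet:

data.foreachRDD { rdd =>
  // rdd is an ordinary RDD[String] containing just this batch interval's elements
  rdd.take(5).foreach(println)
}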
Thanks Dear, It is good to save this data to HDFS and then load back into an
RDD :)
Yes,
otherwise you can try:
rdd.cache().count()
and then run your benchmark
Paolo
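In other words, a minimal sketch of forcing materialization before timing anything:

val cached = rdd.cache()   // lazily marks the RDD for caching
cached.count()             // an action; this is what actually populates the cache
// ... run the benchmark against 'cached' ...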
From: Daniel Darabos daniel.dara...@lynxanalytics.com
Sent: Wednesday, December 3, 2014, 12:28
To: shahab shahab.mok...@gmail.com
Cc: user@spark.apache.org
On
My main complaint about the WAL mechanism in the new reliable Kafka receiver
is that you have to enable checkpointing and for some reason, even if
spark.cleaner.ttl is set to a reasonable value, only the metadata is
cleaned periodically. In my tests, using a folder in my filesystem as the
take(1000) merely takes the first 1000 elements of an RDD. I don't
imagine that's what the OP means. filter() is how you select a subset
of elements to work with. Yes, this requires evaluating the predicate
on all 10M elements, at least once. I don't think you could avoid this
in general, right,
I am using Spark 1.1.1. I am seeing an issue that only appears when I run in
standalone clustered mode with at least 2 workers. The workers are on
separate physical machines.
I am performing a simple join on 2 RDDs. After the join I run first() on
the joined RDD (in Scala) to get the first
Probabilities won't sum to 1 since this expression doesn't incorporate
the probability of the evidence, I imagine? It's constant across
classes, so it is usually excluded. It would appear as a
-log(P(evidence)) term.
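For reference, the standard Bayes decomposition behind this (my own restatement, not from the thread), in log space:

\log P(c \mid x) = \log P(x \mid c) + \log P(c) - \log P(x)

Since \log P(x) is the same for every class c, dropping it leaves the class ranking unchanged, but the remaining scores no longer normalize to probabilities that sum to 1.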
On Tue, Dec 2, 2014 at 10:44 AM, MariusFS marius.fete...@sien.com wrote:
Are we
Hello everybody,
in case you missed it, Databricks and Berkeley have announced a free MOOC on
Spark and another one on scalable machine learning using Spark. Both
courses are free, but if you want to have a verified certificate of
completion you need to donate at least $50. I did it; it's a great
Hi,
I'm trying the Elasticsearch support for Spark (2.1.0.Beta3).
In the following I provide the query (as query dsl):
import org.elasticsearch.spark._
object TryES {
val sparkConf = new SparkConf().setAppName("Campaigns")
sparkConf.set("es.nodes", "es_cluster:9200")
Ideas?
Thanks Xiangrui. I'll try out setting a smaller number of item blocks. And
yes, I've been following the JIRA for the new ALS implementation. I'll try
it out when it's ready for testing.
On Wed, Dec 3, 2014 at 4:24 AM, Xiangrui Meng men...@gmail.com wrote:
Hi Bharath,
You can try setting a
hmm..
33.6 GB is the sum of the memory used by the two RDDs that are cached. You're
right: when I put serialized RDDs in the cache, the memory footprint for
these RDDs becomes a lot smaller.
Serialized memory footprint shown below:
RDD Name   Storage Level   Cached Partitions   Fraction Cached
Hi all,
I am having troubles using Kryo and being new to this kind of
serialization, I am not sure where to look. Can someone please help me? :-)
Here is my custom class:
public class DummyClass implements KryoSerializable {
private static final Logger LOGGER =
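For completeness, a sketch of wiring a custom class into Spark's Kryo serialization via a registrator; the config keys are the standard Spark 1.x ones, DummyClass is the class above, and the registrator class name would need to be fully qualified in practice:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[DummyClass])   // the custom KryoSerializable class above
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")   // use the fully qualified class name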
I am trying to do the same thing and also wondering what the best strategy is.
Thanks
From: ll duy.huynh@gmail.com
Sent: Wednesday, December 3, 2014 10:28 AM
To: u...@spark.incubator.apache.org
Subject: what is the best way to implement mini batches?
Hi folks,
I'm wondering if someone has successfully used wildcards with a parquetFile
call?
I saw this thread and it makes me think no?
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCACA1tWLjcF-NtXj=pqpqm3xk4aj0jitxjhmdqbojj_ojybo...@mail.gmail.com%3E
I have a set
I think the memory calculation is correct; what I didn't account for is the
memory used. I am still puzzled as to how I can successfully process the RDD
in Spark.
I'm trying to implement a graph algorithm that does a form of path
searching. Once a certain criteria is met on any path in the graph, I
wanted to halt the rest of the iterations. But I can't see how to do that
with the Pregel API, since any vertex isn't able to know the state of other
arbitrary
I don't have a great answer for you. For us, we found a common divisor, not
necessarily a whole gigabyte, of the available memory of the different
hardware and used that as the amount of memory per worker and scaled the
number of cores accordingly so that every core in the system has the same
Just wondered if anyone had managed to start spark
jobs on mesos wrapped in a docker container?
At present (i.e. very early testing) I'm able to submit executors
to mesos via spark-submit easily enough, but they fall over
as we don't have a JVM on our slaves out of the box.
I can push one out
hello,
I'm running Spark on a standalone station and I'm trying to view the event log
after the run is finished.
I turned on the event log as the site said (spark.eventLog.enabled set to
true),
but I can't find the log files or get the web UI to work. Any idea on how
to do this?
thanks
Isca
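A sketch of the relevant settings as I understand them; the log directory is illustrative and must already exist, and the history server script ships with Spark under sbin/:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:///tmp/spark-events")   // illustrative directory
// after the application finishes, start the history server pointed at the same
// directory (./sbin/start-history-server.sh) to browse the logs in a web UI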
Thanks for the help.. Let me find more info on how to enable statistics in
parquet.
-Vishnu
Michael Armbrust wrote
There is not a super easy way to do what you are asking since in general
parquet needs to read all the data in a column. As far as I understand it
does not have indexes that
Daniel and Paolo, thanks for the comments.
best,
/Shahab
On Wed, Dec 3, 2014 at 3:12 PM, Paolo Platter paolo.plat...@agilelab.it
wrote:
Yes,
otherwise you can try:
rdd.cache().count()
and then run your benchmark
Paolo
From: Daniel Darabos daniel.dara...@lynxanalytics.com
Hi All, I am using LinearRegressionWithSGD and then I save the model weights
and intercept. The file that contains the weights has this format:
1.204550.13560.000456..
The intercept is 0 since I am using train without setting the intercept, so it
can be ignored for the moment. I would now like to initialize
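A sketch of rebuilding a model from saved weights using MLlib's public LinearRegressionModel constructor; the numeric values here are only illustrative:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

val weights = Vectors.dense(1.20455, 0.1356, 0.000456)   // parsed from the saved weights file
val model   = new LinearRegressionModel(weights, 0.0)    // intercept 0, as above
// model.predict(features) can now be called, or the weights used as initial weights for further training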
inferSchema() will work better than jsonRDD() in your case,
from pyspark.sql import Row
srdd = sqlContext.inferSchema(rdd.map(lambda x: Row(**x)))
srdd.first()
Row(field1=5, field2='string', field3={'a': 1, 'c': 2})
On Wed, Dec 3, 2014 at 12:11 AM, sahanbull sa...@skimlinks.com wrote:
Hi
shahabm wrote
I noticed that rdd.cache() does not happen immediately; rather, due to Spark's
lazy evaluation, it happens only at the moment you perform some
map/reduce actions. Is this true?
Yes, .cache() is a transformation (lazy evaluation)
shahabm wrote
If this is the case, how can I
Hi,
Try using
sc.newAPIHadoopFile(hdfs path to your file,
AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class,
your Configuration)
You will get the Avro related classes by importing org.apache.avro.*
Thanks.
On Tue, Dec 2, 2014 at 9:23 PM, leaviva [via Apache Spark User
For others who may be having a similar problem:
The error below occurs when using Yarn, which uses an earlier version of Guava
compared to Spark 1.1.0. When packaging using Maven, if you put the Yarn
dependency above the Spark dependency, the earlier version of guava is the one
that gets
It won't work until this is merged:
https://github.com/apache/spark/pull/3407
On Wed, Dec 3, 2014 at 9:25 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi folks,
I'm wondering if someone has successfully used wildcards with a
parquetFile call?
I saw this thread and it makes me think no?
Hi,
Add the jars in the external library of your related project.
Right-click on the package or class -> Build Path -> Configure Build Path ->
Java Build Path -> select the Libraries tab -> Add External Library ->
browse to com.xxx.yyy.zzz._ -> OK.
Clean and build your project; most probably you will be able
nsareen wrote
1) Does filter function scan every element saved in RDD? if my RDD
represents 10 Million rows, and if i want to work on only 1000 of them,
is
there an efficient way of filtering the subset without having to scan
every element ?
using .take(1000) may be a biased sample.
you
also available is .sample(), which will randomly sample your RDD with or
without replacement, and returns an RDD.
.sample() takes a fraction, so it doesn't return an exact number of
elements.
eg.
rdd.sample(true, .0001, 1)
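If an exact number of elements is needed rather than a fraction, takeSample is another option; note it returns a local Array, not an RDD (a small sketch):

val exact = rdd.takeSample(withReplacement = false, num = 1000, seed = 1)
// Array of 1000 elements (or fewer if the RDD is smaller), collected to the driver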
I'm not sure about .union(), but at least in the case of .join(), as long as
you have hash partitioned the original RDDs and persisted them, calls to
.join() take advantage of already knowing which partition the keys are on,
and will not repartition rdd1.
val rdd1 = log.partitionBy(new
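A sketch of the pattern being described; the partition count and the second RDD's name ('other') are placeholders:

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(100)            // illustrative partition count
val rdd1 = log.partitionBy(partitioner).persist()
val rdd2 = other.partitionBy(partitioner).persist()   // 'other' is a placeholder pair RDD
val joined = rdd1.join(rdd2)                          // reuses the existing partitioning; no re-shuffle of rdd1/rdd2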
About version compatibility and upgrade path - can the Java application
dependencies and the Spark server be upgraded separately (i.e. will 1.1.0
library work with 1.1.1 server, and vice versa), or do they need to be
upgraded together?
Thanks!
*Romi Kuntsman*, *Big Data Engineer*
By the Spark server do you mean the standalone Master? It is best if they
are upgraded together because there have been changes to the Master in
1.1.1. Although it might just work, it's highly recommended to restart
your cluster manager too.
2014-12-03 13:19 GMT-08:00 Romi Kuntsman
Yeah this is currently broken for 1.1.1. I will submit a fix later today.
2014-12-02 17:17 GMT-08:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu
:
+Andrew
Actually I think this is because we haven't uploaded the Spark binaries to
cloudfront / pushed the change to mesos/spark-ec2.
I'd suggest asking about this on the Mesos list (CCed). As far as I know, there
was actually some ongoing work for this.
Matei
On Dec 3, 2014, at 9:46 AM, Dick Davies d...@hellooperator.net wrote:
Just wondered if anyone had managed to start spark
jobs on mesos wrapped in a docker
Because this was a maintenance release, we should not have introduced any
binary backwards or forwards incompatibilities. Therefore, applications
that were written and compiled against 1.1.0 should still work against a
1.1.1 cluster, and vice versa.
On Wed, Dec 3, 2014 at 1:30 PM, Andrew Or
This should be fixed now. Thanks for bringing this to our attention.
2014-12-03 13:31 GMT-08:00 Andrew Or and...@databricks.com:
Yeah this is currently broken for 1.1.1. I will submit a fix later today.
2014-12-02 17:17 GMT-08:00 Shivaram Venkataraman
shiva...@eecs.berkeley.edu:
+Andrew
Hi All,
My question is about lazy running mode for SchemaRDD, I guess. I know lazy
mode is good, however, I still have this demand.
For example, here is the first SchemaRDD, named results (select * from table
where num > 1 and num < 4):
results: org.apache.spark.sql.SchemaRDD =
SchemaRDD[59] at RDD
I think it would depend on the type and amount of information you're
collecting.
If you're just trying to collect small numbers for each window, and don't
have an overwhelming number of windows, you might consider using
accumulators. Just make one per value per time window, and for each data
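A sketch of that approach; the metric name, the predicate, and the assumption that the stream carries Strings are purely illustrative:

// one accumulator per value per time window
val errorsWindow1 = sc.accumulator(0L, "errors-window-1")
stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    if (record.contains("ERROR")) errorsWindow1 += 1L   // illustrative predicate
  }
}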
We are using Spark job server to submit spark jobs (our spark version is 0.91).
After running the spark job server for a while, we often see the following
errors (executor lost) in the spark job server log. As a consequence, the spark
driver (allocated inside spark job server) gradually loses
You want to look further up the stack (there are almost certainly other errors
before this happens) and those other errors may give your better idea of what
is going on. Also if you are running on yarn you can run yarn logs
-applicationId yourAppId to get the logs from the data nodes.
Sent
Do these requirements boil down to a need for foldLeftByKey with sorting
of the values?
https://issues.apache.org/jira/browse/SPARK-3655
On Wed, Dec 3, 2014 at 6:34 PM, Xuefeng Wu ben...@gmail.com wrote:
I have a similar requirement: take top N by key. Right now I use
groupByKey, but one key
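For top-N per key without groupByKey, one sketch using aggregateByKey; the key/value types and the cutoff n are placeholders:

import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x)
import org.apache.spark.rdd.RDD

def topNPerKey(pairs: RDD[(String, Double)], n: Int): RDD[(String, List[Double])] =
  pairs.aggregateByKey(List.empty[Double])(
    (acc, v) => (v :: acc).sorted(Ordering[Double].reverse).take(n),  // fold a value into one partition's running top-n
    (a, b)   => (a ++ b).sorted(Ordering[Double].reverse).take(n)     // merge two partitions' top-n lists
  )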
Hi,
In the talk A Deeper Understanding of Spark Internals, it was mentioned
that for some operators, spark can spill to disk across keys (in 1.1 -
.groupByKey(), .reduceByKey(), .sortByKey()), but that as a limitation of
the shuffle at that time, each single key-value pair must fit in memory.
I'm wondering how to do this kind of SQL query with PairRDDFunctions.
SELECT zip, COUNT(user), COUNT(DISTINCT user)
FROM users
GROUP BY zip
In the Spark scala API, I can make an RDD (called users) of key-value
pairs where the keys are zip (as in ZIP code) and the values are user id's.
Then I can
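A sketch with plain PairRDDFunctions, assuming both zip and user id are Strings and SparkContext._ is imported for the pair operations:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def zipCounts(users: RDD[(String, String)]): RDD[(String, (Long, Long))] = {
  val total    = users.mapValues(_ => 1L).reduceByKey(_ + _)            // COUNT(user) per zip
  val distinct = users.distinct().mapValues(_ => 1L).reduceByKey(_ + _) // COUNT(DISTINCT user) per zip
  total.join(distinct)                                                  // (zip, (count, distinctCount))
}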
Hi,
On Wed, Dec 3, 2014 at 4:31 PM, Jerry Raj jerry@gmail.com wrote:
Exception in thread main java.lang.RuntimeException: [1.57] failure:
``('' expected but identifier myudf found
I also tried returning a List of Ints, that did not work either. Is there
a way to write a UDF that returns
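A common workaround while that limitation exists, sketched here assuming the values are known on the driver: build the IN list into the SQL string instead of calling a UDF:

val ids = Array(1, 2, 3)
val sql = s"SELECT doc_id, doc_value FROM doc_table WHERE doc_id IN (${ids.mkString(",")})"
val results = sqlCtx.sql(sql)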
bq. to get the logs from the data nodes
Minor correction: the logs are collected from machines where node managers
run.
Cheers
On Wed, Dec 3, 2014 at 3:39 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
You want to look further up the stack (there are almost certainly other
errors
I have been working on balancing work across a number of partitions and
find it would be useful to access information about the current execution
environment, much of which (like the executor ID) would be available if there
were a way to get the current executor or the Hadoop TaskAttempt context -
does any
Hi,
On Wed, Dec 3, 2014 at 5:31 PM, Bahubali Jain bahub...@gmail.com wrote:
I am trying to use textFileStream(some_hdfs_location) to pick up new files
from an HDFS location. I am seeing pretty strange behavior though.
textFileStream() is not detecting new files when I move them from a
location
Looks good.
My concern about foldLeftByKey is that it seems to break consistency with
foldLeft on RDD and aggregateByKey on PairRDD.
Yours, Xuefeng Wu
On December 4, 2014, at 7:47 AM, Koert Kuipers ko...@tresata.com wrote:
foldLeftByKey
Hi,
On Thu, Dec 4, 2014 at 2:59 AM, Ashic Mahtab as...@live.com wrote:
I've been doing this with foreachPartition (i.e. have the parameters for
creating the singleton outside the loop, do a foreachPartition, create the
instance, loop over entries in the partition, close the partition), but
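For reference, a sketch of that foreachPartition pattern; the connection factory, parameters, and write/close calls are hypothetical placeholders:

rdd.foreachPartition { iter =>
  val conn = createConnection(connectionParams)   // hypothetical: one connection per partition
  try {
    iter.foreach(record => conn.write(record))    // hypothetical write call
  } finally {
    conn.close()
  }
}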
I'd like to tag a question onto this; has anybody attempted to deploy spark
under Kubernetes
https://github.com/googlecloudplatform/kubernetes or Kubernetes mesos (
https://github.com/mesosphere/kubernetes-mesos ) .
On Wednesday, December 3, 2014, Matei Zaharia matei.zaha...@gmail.com
wrote:
I'd
Hi,
I try to read data from a DynamoDB table with Spark, but after I run this
code I get an error message like the one below.
I use Spark 1.1.1 and emr-core-1.1.jar, emr-ddb-hive-1.0.jar and
emr-ddb-hadoop-1.0.jar.
val sparkConf = new SparkConf().setAppName("DynamoRdeader").setMaster("local[4]")
val ctx =
Hi,
If I create a SchemaRDD from a file that I know is sorted on a certain
field, is it possible to somehow pass that information on to Spark SQL
so that SQL queries referencing that field are optimized?
Thanks
-Jerry
Guys,
On my local machine it consumes a Kinesis stream with 3 shards, but on EC2
it does not consume from the stream. Later we found that the EC2 machine had
2 cores and my local machine had 4 cores. I am using a single machine, in
Spark standalone mode. And we got a larger machine
You may do this:
table("users").groupBy('zip)('zip, count('user), countDistinct('user))
On 12/4/14 8:47 AM, Arun Luthra wrote:
I'm wondering how to do this kind of SQL query with PairRDDFunctions.
SELECT zip, COUNT(user), COUNT(DISTINCT user)
FROM users
GROUP BY zip
In the Spark scala API,
Hi,
I am using spark-submit to submit my application to YARN in yarn-cluster
mode. I have both the Spark assembly jar file as well as my application jar
file put in HDFS and can see from the logging output that both files are
used from there. However, it still takes about 10 seconds for my
Hi,
I am using spark with version number 1.1.0 on an EC2 cluster. After I
submitted the job, it returned an error saying that a python module cannot
be loaded due to missing files. I am using the same command that used to
work on an private cluster before for submitting jobs and all the source
Yes, I agree, and it may also be semantically ambiguous: a list of objects vs. a
list with a single List object.
I've also tested that; it seems:
a. There is a bug in registerFunction, which doesn't support a UDF
without arguments. (I just created a PR for this:
You can try to write your own Relation with filter push down or use the
ParquetRelation2 for workaround.
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)
Cheng Hao
-Original Message-
From: Jerry Raj
Hi All,
I have a standalone Spark(1.1) cluster on one machine and I have installed
the Scala Eclipse IDE (Scala 2.10) on my desktop. I am trying to execute Spark
code over my standalone cluster but am getting errors.
Please guide me to resolve this.
Code:
val logFile = File Path present
Hi All,
I am doing model training using Spark MLlib inside our Hadoop cluster, but
prediction happens in a different real-time synchronous system (a web
application). I am currently exploring different options to export the
trained MLlib models from Spark.
1. Export model as PMML: I found the
Markus,
On Tue, Nov 11, 2014 at 10:40 AM, M. Dale medal...@yahoo.com wrote:
I never tried to use this property. I was hoping someone else would jump
in. When I saw your original question I remembered that Hadoop has
something similar. So I searched and found the link below. A quick JIRA
To get that function in scope you have to import
org.apache.spark.SparkContext._
Ankur
On Wednesday, December 3, 2014, Deep Pradhan pradhandeep1...@gmail.com
wrote:
But groupByKey() gives me an error saying that it is not a member of
org.apache.spark.rdd.RDD[(Double,
Hi Experts
I am using Spark Streaming to integrate Kafka for real-time data processing.
I am facing some issues related to Spark Streaming,
so I want to know how we can detect:
1) Our connection has been lost
2) Our receiver is down
3) Spark Streaming has no new messages to consume.
how can we deal
You'll have to decide which is more expensive in your heterogenous
environment and optimize for the utilization of that. For example, you may
decide that memory is the only costing factor and you can discount the
number of cores. Then you could have 8GB on each worker each with four
cores. Note
It seems you provided the master URL as spark://10.112.67.80:7077; I think you
should give spark://ubuntu:7077 instead.
Thanks
Best Regards
On Thu, Dec 4, 2014 at 11:35 AM, Stuti Awasthi stutiawas...@hcl.com wrote:
Hi All,
I have a standalone Spark(1.1) cluster on one machine and I have
You can set up Nagios-based monitoring for these; also, setting up a high-
availability environment will be more fault tolerant.
Thanks
Best Regards
On Thu, Dec 4, 2014 at 12:17 PM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
Hi Experts
I am using Spark Streaming to integrate Kafka for real
Oh great, thanks Tim - I'll give that a whirl then.
A Spark rollout isn't in our immediate future, I'm just looking for a good
framework for compute alongside our marathon deployment. So a good
time to experiment!
On 4 December 2014 at 02:36, Tim Chen t...@mesosphere.io wrote:
Hi Dick,
There
Hi,
I am a newbie in Spark and performed the following steps during POC execution:
1. Map a csv file to an object-file after some transformations, once.
2. Serialize the object-file to an RDD for operations, as per need.
In the case of 2 csv/object-files, the first object-file is serialized to an
RDD successfully, but during
Hi,
I was just thinking about the necessity for RDD replication. One category could
be something like a large number of threads requiring the same RDD. Even though
a single RDD can be shared by multiple threads belonging to the same
application, I believe we can extract better parallelism if the RDD is
On Wed, Dec 3, 2014 at 8:17 PM, chocjy jiyanyan...@gmail.com wrote:
Hi,
I am using spark with version number 1.1.0 on an EC2 cluster. After I
submitted the job, it returned an error saying that a python module cannot
be loaded due to missing files. I am using the same command that used to