There are several overloaded versions of both |jsonFile| and |jsonRDD|.
Schema inference is kinda expensive since it requires an extra Spark
job. You can avoid schema inference by storing the inferred schema and
then using it together with the following two methods:
* |def jsonFile(path:
The implementation assumes classes are 0-indexed, not 1-indexed. You
should set numClasses = 3 and change your labels to 0, 1, 2.
On Thu, Dec 11, 2014 at 3:40 AM, Ge, Yao (Y.) y...@ford.com wrote:
I am testing decision tree using iris.scale data set
Hi,
We use the following Spark Streaming code to collect and process Kafka
event :
kafkaStream.foreachRDD(rdd => {
  rdd.collect().foreach(event => {
    process(event._1, event._2)
  })
})
This works fine.
But without the /collect()/ call, the following exception is
Have you tried with kafkaStream.foreachRDD(rdd => { rdd.foreach(...) }) ?
Would that make a difference?
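Spark aside, the difference between collecting everything to the driver and handling records where they already sit can be sketched in plain Python. This is only an illustration: partitions are modeled as lists, and `process` is a hypothetical stand-in for the real per-event handler.

```python
def process(key, value):
    # hypothetical stand-in for the real per-event handler
    return f"{key}={value}"

def collect_then_process(partitions):
    # like rdd.collect().foreach(...): every record is first pulled
    # into one flat list (on the driver) before any processing happens
    all_events = [e for part in partitions for e in part]
    return [process(k, v) for k, v in all_events]

def process_per_partition(partitions):
    # like rdd.foreach / rdd.foreachPartition: each partition's records
    # are handled in place, nothing is gathered centrally first
    out = []
    for part in partitions:
        out.extend(process(k, v) for k, v in part)
    return out

partitions = [[("a", 1), ("b", 2)], [("c", 3)]]
```

The results are the same; the difference is where the work happens, which is why collect() "works" but does not scale past what the driver can hold.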
On Thu, Dec 11, 2014 at 10:24 AM, david david...@free.fr wrote:
Hi,
We use the following Spark Streaming code to collect and process Kafka
event :
kafkaStream.foreachRDD(rdd => {
If the timestamps in the logs are to be trusted, it looks like your driver
is dying with that *java.io.FileNotFoundException*, and therefore the
workers lose their connection and close down.
-kr, Gerard.
On Thu, Dec 11, 2014 at 7:39 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Try to add
Hi,
When my spark program calls JavaSparkContext.stop(), the following errors
occur.
14/12/11 16:24:19 INFO Main: sc.stop {
14/12/11 16:24:20 ERROR ConnectionManager: Corresponding
SendingConnection to ConnectionManagerId(cluster02,38918) not found
Hi,
I am using spark 1.1.0 and setting below properties while creating spark
context.
*spark.executor.logs.rolling.maxRetainedFiles = 10*
*spark.executor.logs.rolling.size.maxBytes = 104857600*
*spark.executor.logs.rolling.strategy = size*
Even though I am setting to rollover after 100 MB,
Imagine a simple Spark job that will store each line of the RDD to a
separate file
val lines = sc.parallelize(1 to 100).map(n => s"this is line $n")
lines.foreach(line => writeToFile(line))

def writeToFile(line: String) = {
  val filePath = "file://..."
  val file = new File(new URI(filePath).getPath)
Can we take this as a performance improvement task in Spark-1.2.1? I can help
contribute for this.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-tp20350p20623.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Following Gerard's thoughts, here are possible things that could be happening.
1. Is there another process in the background that is deleting files
in the directory where you are trying to write? Seems like the
temporary file generated by one of the tasks is getting delete before
it is renamed to
Actually I came to the conclusion that RDDs have to be persisted in Hive in
order to be accessible through Thrift.
I hope I didn't end up with an incorrect conclusion.
Please someone correct me if I am wrong.
On Dec 11, 2014 8:53 AM, Judy Nash judyn...@exchange.microsoft.com
wrote:
Looks like you
What does process do? Maybe when this process function is being run in
the Spark executor, it is causing some static initialization,
which fails, causing this exception. Per the Oracle documentation,
an ExceptionInInitializerError is thrown to indicate that an exception
occurred during evaluation
Hi,
I was wondering if there's any way of having long running session type
behaviour in spark. For example, let's say we're using Spark Streaming to
listen to a stream of events. Upon receiving an event, we process it, and if
certain conditions are met, we wish to send a message to rabbitmq.
--
Code
--
scala> import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext._
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> import org.apache.spark.sql.SchemaRDD
Hi,
I'm trying to use spark-streaming with kafka but I get a strange error
on class that are missing. I would like to ask if my way to build the
fat jar is correct or no. My program is
val kafkaStream = KafkaUtils.createStream(ssc, zookeeperQuorum,
kafkaGroupId, kafkaTopicsWithThreads)
You could create a lazily initialized singleton factory and connection
pool. Whenever an executor starts running the first task that needs to
push out data, it will create the connection pool as a singleton, and
subsequent tasks running on the executor are going to use the
connection pool. You will
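A minimal plain-Python sketch of the lazily initialized singleton described above. Names are illustrative, and the real pool setup (e.g. opening a RabbitMQ connection) is stubbed out; in Scala, a top-level `object` gives you this lazy, thread-safe initialization for free.

```python
class ConnectionPool:
    """Created at most once per executor process; all tasks reuse it."""
    _instance = None

    def __init__(self):
        # stand-in for real pool setup (opening sockets, auth, etc.)
        self.opened = True

    @classmethod
    def get_instance(cls):
        if cls._instance is None:      # first task on this executor pays the cost
            cls._instance = ConnectionPool()
        return cls._instance           # later tasks get the same pool back
```

In multi-threaded executors you would guard the `None` check with a lock; the sketch omits that for brevity.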
That makes sense. I'll try that.
Thanks :)
From: tathagata.das1...@gmail.com
Date: Thu, 11 Dec 2014 04:53:01 -0800
Subject: Re: Session for connections?
To: as...@live.com
CC: user@spark.apache.org
You could create a lazily initialized singleton factory and connection
pool. Whenever an
I am updating the docs right now. Here is a staged copy that you can
have sneak peek of. This will be part of the Spark 1.2.
http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html
The updated fault-tolerance section tries to simplify the explanation
of when and what data
Hi,
To test Spark SQL Vs CQL performance on Cassandra, I did the following:
1) Cassandra standalone server (1 server in a cluster)
2) Spark Master and 1 Worker
Both running in a Thinkpad laptop with 4 cores and 8GB RAM.
3) Written Spark SQL code using Cassandra-Spark Driver from Cassandra
Yes, this is perfectly legal. This is what RDD.foreach() is for! You may
be encountering an IO exception while writing, and maybe using() suppresses
it. (?) I'd try writing the files with java.nio.file.Files.write() -- I'd
expect there is less that can go wrong with that simple call.
On Thu, Dec
Hello.
I'm pretty new with Spark
I am developing a Spark application, conducting tests locally prior to
deploying it on a cluster. I have a problem with a broadcast variable. The
application raises:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task
In this way it works but it's not portable and the idea of having a fat
jar is to avoid exactly this. Is there any system to create a
self-contained portable fatJar?
On 11.12.2014 13:57, Akhil Das wrote:
Add these jars while creating the Context.
val sc = new SparkContext(conf)
Yes. You can do/use *sbt assembly* and create a big fat jar with all
dependencies bundled inside it.
Thanks
Best Regards
On Thu, Dec 11, 2014 at 7:10 PM, Mario Pastorelli
mario.pastore...@teralytics.ch wrote:
In this way it works but it's not portable and the idea of having a fat
jar is to
Thanks akhil for the answer.
I am using sbt assembly and the build.sbt is in the first email. Do you
know why those classes are included in that way?
Thanks,
Mario
On 11.12.2014 14:51, Akhil Das wrote:
Yes. You can do/use *sbt assembly* and create a big fat jar with all
dependencies
Also, this is covered in the streaming programming guide in bits and pieces.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
On Thu, Dec 11, 2014 at 4:55 AM, Ashic Mahtab as...@live.com wrote:
That makes sense. I'll try that.
Thanks :)
I'm doing the same thing for using Cassandra,
For Cassandra, use the Spark-Cassandra connector [1], which does the
Session management, as described by TD, for you.
[1] https://github.com/datastax/spark-cassandra-connector
-kr, Gerard.
On Thu, Dec 11, 2014 at 1:55 PM, Ashic Mahtab
I see -- they are the same in design but the difference comes from
partitioned Hive tables: when the RDD is generated by querying an external
Hive metastore, the partition is appended as part of the row, and shows up
as part of the schema. Can you shed some light on why this is a problem:
Aditya, I think you have the mental model of Spark Streaming a little
off the mark. Unlike traditional streaming systems, where any kind of
state is mutable, Spark Streaming is designed on Spark's immutable RDDs.
Streaming data is received and divided into immutable blocks, which then
form immutable RDDs,
First of all, how long do you want to keep doing this? The data is
going to increase infinitely and without any bounds; it's going to get
too big for any cluster to handle. If all that is within bounds, then
try the following.
- Maintain a global variable having the current RDD storing all the
log
Not that I am aware of. Spark will try to spread the tasks evenly
across executors; it's not aware of the workers at all. So if the
executor-to-worker allocation is uneven, I am not sure what can be
done. Maybe others can get some ideas.
On Tue, Dec 9, 2014 at 6:20 AM, Gerard Maas
Hello guys,
Thank you for your prompt reply.
I followed Akhil suggestion with no success. Then, I tried again replacing
S3 by HDFS and the job seems to work properly.
TD, I'm not using speculative execution.
I think I've just realized what is happening. Due to S3 eventual
consistency, these
Yes, that is correct. A quick reference on this is the post
https://www.linkedin.com/pulse/20141007143323-732459-an-absolutely-unofficial-way-to-connect-tableau-to-sparksql-spark-1-1?_mSplash=1
with the pertinent section being:
It is important to note that when you create Spark tables (for
Hi
It worked for me like this. Just define the case class outside of any class
to write to parquet format successfully. I am using Spark version 1.1.1.
case class person(id: Int, name: String, fathername: String, officeid: Int)
object Program {
def main (args: Array[String]) {
val
Hi
saveAsTextFile is a member of RDD, whereas
fields.map(_.mkString("|")).mkString("\n") is a String. You have to
transform it into an RDD using something like sc.parallelize(...) before
saveAsTextFile.
Thanks
Hi guys,
I'm planning to use spark on a project and I'm facing a problem, I
couldn't find a log that explains what's wrong with what I'm doing.
I have 2 VMs that run a small Hadoop (2.6.0) cluster. I added a file that
has 50 lines of json data
Compiled spark, all tests passed, I run some
Hi,
I've made a simple script in scala that after doing a spark sql query it
sends the result to AWS's cloudwatch.
I've tested both parts individually (the spark sql one and the
cloudwatch one) and they worked fine. The trouble comes when I execute
the script through spark-submit that gives me
Hello all,
I'm using GraphX (1.1.0) to process RDF data. I want to build a graph out
of the data from the Berlin Benchmark ( BSBM
http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/
).
The steps that I'm doing to load the data into a graph are:
*1.* Split the RDF triples
hi. i'm running into this OutOfMemory issue when i'm broadcasting a large
array. what is the best way to handle this?
should i split the array into smaller arrays before broadcasting, and then
combining them locally at each node?
thanks!
any explanation on how aggregate works would be much appreciated. i already
looked at the spark example and still am confused about the seqOp and
combOp... thanks.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-tp20434p20634.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
any advice/comment on this would be much appreciated.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/what-is-the-best-way-to-implement-mini-batches-tp20264p20635.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
hi.. i'm converting some of my machine learning python code into scala +
spark. i haven't been able to run it on a large dataset yet, but on small
datasets (like http://yann.lecun.com/exdb/mnist/), my spark + scala code is
much slower than my python code (5 to 10 times slower than python)
i
Hi,
I'm trying to set a custom spark app name when running a java spark app in
yarn-cluster mode.
SparkConf sparkConf = new SparkConf();
sparkConf.setMaster(System.getProperty("spark.master"));
sparkConf.setAppName("myCustomName");
sparkConf.set("spark.logConf", "true");
JavaSparkContext sc =
On Thu, Dec 11, 2014 at 8:27 PM, Tomer Benyamini tomer@gmail.com
wrote:
Hi,
I'm trying to set a custom spark app name when running a java spark app in
yarn-cluster mode.
SparkConf sparkConf = new SparkConf();
sparkConf.setMaster(System.getProperty("spark.master"));
There's some explanation and an example here:
http://stackoverflow.com/questions/26611471/spark-data-processing-with-grouping/26612246#26612246
-kr, Gerard.
On Thu, Dec 11, 2014 at 7:15 PM, ll duy.huynh@gmail.com wrote:
any explanation on how aggregate works would be much appreciated. i
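For what it's worth, aggregate's two functions can be illustrated outside Spark: seqOp folds individual elements into a per-partition accumulator, and combOp merges the per-partition accumulators on the driver. A sketch computing a mean via a (sum, count) pair, with partitions modeled as plain lists:

```python
from functools import reduce

# zeroValue, seqOp, combOp, as in RDD.aggregate(zeroValue)(seqOp, combOp)
zero = (0, 0)                                       # (running sum, running count)
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)    # fold one element into an accumulator
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two partition accumulators

def aggregate(partitions, zero, seq_op, comb_op):
    # seq_op runs independently within each partition (on executors)...
    per_part = [reduce(seq_op, part, zero) for part in partitions]
    # ...then comb_op merges the partition results (on the driver)
    return reduce(comb_op, per_part, zero)

s, n = aggregate([[1, 2, 3], [4, 5]], zero, seq_op, comb_op)
```

Note the two functions can have different types: seq_op takes (accumulator, element), comb_op takes (accumulator, accumulator), which is why aggregate needs both.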
Hi Mario,
Try including this in your libraryDependencies (in your sbt file):
"org.apache.kafka" % "kafka_2.10" % "0.8.0"
  exclude("javax.jms", "jms")
  exclude("com.sun.jdmk", "jmxtools")
  exclude("com.sun.jmx", "jmxri")
  exclude("org.slf4j", "slf4j-simple")
Regards,
*--Flávio R. Santos*
Chaordic |
Are you using Scala in a distributed environment or in standalone mode?
Natu
On Thu, Dec 11, 2014 at 8:23 PM, ll duy.huynh@gmail.com wrote:
hi.. i'm converting some of my machine learning python code into scala +
spark. i haven't been able to run it on a large dataset yet, but on small
both.
first, the distributed version is so much slower than python. i tried a
few things like broadcasting variables, replacing Seq with Array, and a few
other little things. it helps to improve the performance, but still slower
than the python code.
so, i wrote a local version that's pretty
just to give some reference point. with the same algorithm running on
mnist dataset.
1. python implementation: ~10 milliseconds per iteration (can be faster if
i switch to gpu)
2. local version (scala + breeze): ~2 seconds per iteration
3. distributed version (spark + scala + breeze): 15
You can just do mapPartitions on the whole RDD, and then call sliding() on
the iterator in each one to get a sliding window. One problem is that you will
not be able to slide forward into the next partition at partition boundaries.
If this matters to you, you need to do something more
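The per-partition sliding window described above can be mimicked in plain Python, with partitions modeled as lists. The sketch also makes the stated caveat visible: windows never cross a partition boundary.

```python
def sliding(seq, size):
    # all contiguous windows of `size` within one partition
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]

partitions = [[1, 2, 3, 4], [5, 6, 7]]

# like rdd.mapPartitions(it => it.sliding(2)): sliding runs per partition
windows = [w for part in partitions for w in sliding(part, 2)]
```

The boundary window [4, 5] is absent, since 4 and 5 live in different partitions; recovering it requires extra work at the edges, which is the "something more" the reply alludes to.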
Hi,
Is there a way to check the status of the SparkContext regarding whether
it's alive or not through the code, not through UI or anything else?
Thanks
Edwin
In general, you would not expect a distributed computation framework
to perform nearly as fast as a non-distributed one, when both are run
on one machine. Spark has so much more overhead that doesn't go away
just because it's on one machine. Of course, that's the very reason it
scales past one
the dataset i'm working on has about 100,000 records. the batch that we're
training on has a size around 10. can you repartition(10,000) into 10,000
partitions?
On Thu, Dec 11, 2014 at 2:36 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
You can just do mapPartitions on the whole RDD, and
TD,
While looking at the API Ref(version 1.1.0) for SchemaRDD i did find these
two methods:
def insertInto(tableName: String): Unit
def insertInto(tableName: String, overwrite: Boolean): Unit
Wouldn't these be a nicer way of appending RDDs to a table, or are these not
recommended as of now?
Also, you may want to use .lookup() instead of .filter()
def
lookup(key: K): Seq[V]
Return the list of values in the RDD for key `key`. This operation is done
efficiently if the RDD has a known partitioner by only searching the
partition that the key maps to.
You might want to partition your first
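Why lookup() can beat filter(): with a known partitioner, the key alone determines which partition to search, so only that one partition is scanned instead of all of them. A hash-partitioned sketch in plain Python, with partitions modeled as lists:

```python
NUM_PARTITIONS = 4

def partition_of(key):
    # a hash partitioner: key -> partition index
    return hash(key) % NUM_PARTITIONS

def partition_by(pairs):
    # like rdd.partitionBy(new HashPartitioner(4))
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for k, v in pairs:
        parts[partition_of(k)].append((k, v))
    return parts

def lookup(parts, key):
    # only the single partition the key maps to is searched,
    # unlike filter, which would scan every partition
    return [v for k, v in parts[partition_of(key)] if k == key]

parts = partition_by([("a", 1), ("b", 2), ("a", 3)])
```

This is why partitioning the RDD first pays off when the same RDD is probed repeatedly.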
These seem to be compilation errors. The second one seems to be that
you are using CassandraJavaUtil.javaFunctions wrong. Look at the
documentation and set the parameter list correctly.
TD
On Mon, Dec 8, 2014 at 9:47 AM, m.sar...@accenture.com wrote:
Hi,
I am intending to save the streaming
Aah yes, that makes sense. You could write first to HDFS, and when
that works, copy from HDFS to S3. That should work, as it won't depend
on the temporary files being in S3.
I am not sure how much you can customize just for S3 in Spark code. In
Spark, since we just use the Hadoop API to write, there isn't
Is the OOM happening to the Driver JVM or one of the Executor JVMs? What
memory size is each JVM?
How large is the data you're trying to broadcast? If it's large enough, you
may want to consider just persisting the data to distributed storage (like
HDFS) and read it in through the normal read RDD
I was having similar issues with my persistent RDDs. After some digging
around, I noticed that the partitions were not balanced evenly across the
available nodes. After a repartition, the RDD was spread evenly across
all available memory. Not sure if that is something that would help your
use-case
I am running a Hadoop cluster with Spark on YARN. The cluster runs the
CDH5.2 distribution. When I try to run Spark jobs against Snappy-compressed
files I receive the following error.
java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
We are trying to submit a Spark application from a Tomcat application running
our business logic. The Tomcat app lives in a separate non-Hadoop cluster.
We first were doing this by using the spark-yarn package to directly call
Client#runApp() but found that the API we were using in Spark is being
In PySpark, is there a way to get the status of a job which is currently
running? My use case is that I have a long running job that users may not know
whether or not the job is still running. It would be nice to have an idea of
whether or not the job is progressing even if it isn't very
Gerard,
Are you familiar with spark.deploy.spreadOut
http://spark.apache.org/docs/latest/spark-standalone.html in Standalone
mode? It sounds like you want the same thing in Mesos mode.
On Thu, Dec 11, 2014 at 6:48 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Not that I am aware of.
Hi,
I'm looking for resources and examples for the deployment of spark streaming
in production. Specifically, I would like to know how high availability and
fault tolerance of receivers is typically achieved.
The workers are managed by the spark framework and are therefore fault
tolerant
(Sorry if this mail is duplicate, but it seems that my previous mail could
not reach the mailing list.)
Hi,
When my spark program calls JavaSparkContext.stop(), the following errors
occur.
14/12/11 16:24:19 INFO Main: sc.stop {
14/12/11 16:24:20 ERROR
Hi,
If spark based services are to be exposed as a continuously available
server, what are the options?
* The API exposed to client will be proprietary and fine grained (RPC style
..), not a Job level API
* The client API need not be SQL so the Thrift JDBC server does not seem to
be option ..
Oops, sorry, fat fingers.
We've been playing with something like that inside Hive:
https://github.com/apache/hive/tree/spark/spark-client
That seems to have at least a few of the characteristics you're
looking for; but it's a very young project, and at this moment we're
not developing it as a
Hi Manoj,
I'm not aware of any public projects that do something like that,
except for the Ooyala server which you say doesn't cover your needs.
We've been playing with something like that inside Hive, though:
On Thu, Dec 11, 2014 at 5:33 PM, Manoj Samel manojsamelt...@gmail.com wrote:
Hi,
This is showing a factor of 200 between python and scala and 1400 when
distributed.
Is this really accurate?
If not, what is the real performance difference expected on average between
the 3 cases?
On Thu, Dec 11, 2014 at 11:33 AM, Duy Huynh duy.huynh@gmail.com wrote:
just to give some
Is the class com.dataken.spark.examples.MyRegistrator public? If not, change
it to public and give it a try.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/KryoRegistrator-exception-and-Kryo-class-not-found-while-compiling-tp10396p20646.html
Sent from the
class MyRegistrator implements KryoRegistrator {
  public void registerClasses(Kryo kryo) {
    kryo.register(ImpressionFactsValue.class);
  }
}
Change this class to public and give it a try.
Minor correction: I think you want iterator.grouped(10) for
non-overlapping mini-batches
On Dec 11, 2014 1:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
You can just do mapPartitions on the whole RDD, and then call sliding()
on the iterator in each one to get a sliding window. One
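The grouped(10) correction gives non-overlapping batches, in contrast with sliding(10), where consecutive windows share elements. In plain-Python terms, per partition:

```python
def grouped(seq, size):
    # non-overlapping chunks: each element lands in exactly one batch,
    # matching Scala's Iterator.grouped(size)
    return [seq[i:i + size] for i in range(0, len(seq), size)]

batches = grouped(list(range(7)), 3)
```

The final batch may be shorter than `size` when the partition length is not a multiple of it, exactly as with Scala's grouped.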
Spark Streaming takes care of restarting receivers if they fail.
Regarding the fault-tolerance properties and deployment options, we
made some improvements in the upcoming Spark 1.2. Here is a staged
version of the Spark Streaming programming guide that you can read for
the up-to-date explanation
Hi all. I've been working on a similar problem. One solution that is
straightforward (if suboptimal) is to do the following:
A.zipWithIndex().filter(x => x._2 >= range_start && x._2 < range_end). Lastly
just put that in a for loop. I've found that this approach scales very
well.
As Matei said another
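The zipWithIndex-plus-filter approach can be sketched in plain Python, with enumerate standing in for zipWithIndex; the range is taken as half-open, [range_start, range_end):

```python
def batch(data, range_start, range_end):
    # like A.zipWithIndex().filter(x => x._2 >= range_start && x._2 < range_end)
    indexed = ((v, i) for i, v in enumerate(data))
    return [v for v, i in indexed if range_start <= i < range_end]

data = ["a", "b", "c", "d", "e"]

# "just put that in a for loop": step through the index ranges
batches = [batch(data, s, s + 2) for s in range(0, len(data), 2)]
```

Each loop iteration materializes one mini-batch; the cost is that every iteration re-scans the whole dataset, which is the "suboptimal" part the author concedes.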
(Sorry if this mail is a duplicate, but it seems that my previous mail
could not reach the mailing list.)
Hi,
When my spark program calls JavaSparkContext.stop(), the following errors
occur.
14/12/11 16:24:19 INFO Main: sc.stop {
14/12/11 16:24:20 ERROR
Hi Judy,
SPM monitors Spark. Here are some screenshots:
http://blog.sematext.com/2014/10/07/apache-spark-monitoring/
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr Elasticsearch Support * http://sematext.com/
On Mon, Dec 8, 2014 at 2:35 AM, Judy Nash
I'm currently new to pyspark, thank you for your patience in advance - my
current problem is the following:
I have an RDD composed of the fields A, B, and count:
result1 = rdd.map(lambda x: ((x.A, x.B), 1)).reduceByKey(lambda a, b: a + b)
Then I wanted to group the results based on 'A', so I did
Kindly take a moment to look over this proposal to bring Spark into the
U.S. Treasury:
http://www.systemaccounting.org/sparking_the_data_driven_republic
Hi, there.
I'm trying to understand how to augment data in a SchemaRDD.
I can see how to do it if can express the added values in SQL - just run
SELECT *,valueCalculation AS newColumnName FROM table
I've been searching all over for how to do this if my added value is a
scala function, with no
I'm trying to build a very simple Scala standalone app using MLlib, but I
get the following error when trying to build the program: "Object Mllib is not a
member of package org.apache.spark". Then I realized that I have to add MLlib as
a dependency, as follows: libraryDependencies ++= Seq(
Hi,
Try this: change spark-mllib to spark-mllib_2.10.
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.1.1",
  "org.apache.spark" % "spark-mllib_2.10" % "1.1.1"
)
Thanks Regards,
Meethu M
On Friday, 12 December 2014 12:22 PM, amin mohebbi
aminn_...@yahoo.com.INVALID wrote:
Hi Tomer,
In yarn-cluster mode, the application has already been submitted to YARN by
the time the SparkContext is created, so it's too late to set the app name
there. I believe giving it with the --name property to spark-submit should
work.
-Sandy
On Thu, Dec 11, 2014 at 10:28 AM, Tomer
Put the jar file in HDFS; the URL must be globally visible inside of your
cluster, for instance an hdfs:// path or a file:// path that is present on
all nodes.
-
Software Developer
SigmoidAnalytics, Bangalore