Hi Punya,
This seems like a problem that the recently-announced job-server would
likely have run into at some point. I haven't tested it yet, but I'd be
interested to see what happens when two jobs in the job server have
conflicting classes. Does the server correctly segregate each job's
classes?
Try checking spark-env.sh on the workers as well. Maybe something there is
overriding the spark.executor.memory setting.
Matei
On Mar 18, 2014, at 6:17 PM, Jim Blomo jim.bl...@gmail.com wrote:
Hello, I'm using the GitHub snapshot of PySpark and am having trouble setting
the worker memory
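For what it's worth, a minimal sketch (Scala shown; the master URL and size are placeholders) of setting executor memory in the 0.9-era API, where the property must be set before the SparkContext is created or it is ignored:

  System.setProperty("spark.executor.memory", "4g")
  val sc = new SparkContext("spark://master:7077", "MemoryTest")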
:-) Thanks for the suggestion.
I was actually asking how to run Spark scripts as a standalone app. I am
able to run Java code and Python code as standalone apps.
One more doubt: the documentation says that to read an HDFS file, we need to
add the dependency
<groupId>org.apache.hadoop</groupId>
Check this thread out:
http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p2807.html
-- you have to remove the conflicting akka and protobuf versions.
Thanks
Prasad.
I don't know the Spark issue, but the Hadoop context is clear.
Old API: org.apache.hadoop.mapred
New API: org.apache.hadoop.mapreduce
You might only need to change your import.
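A hedged sketch of the new-API route (paths are placeholders): TextInputFormat from org.apache.hadoop.mapreduce goes through newAPIHadoopFile.

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
    "hdfs://localhost:8020/user/user/input")
  val text = lines.map(_._2.toString)  // keys are byte offsets; values are the lines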
Regards
Bertrand
On Wed, Mar 19, 2014 at 11:29 AM, Pariksheet Barapatre pbarapa...@gmail.com
wrote:
Hi,
It seems like an import issue; I ran it with hadoopFile and it worked. I can't
figure out the import statement for the TextInputFormat class location in the
new API. Can anybody help?
Thanks
Pariksheet
On 19 March 2014 16:05, Bertrand Dechoux decho...@gmail.com wrote:
I don't know the Spark issue but the Hadoop context
Hi all,
I'm trying to build an evaluation platform based on Spark. The idea is to
run a blackbox executable (built with C/C++ or some scripting language).
This blackbox takes a set of data as input and outputs some metrics. Since
I have a huge amount of data, I need to distribute the computation
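A rough sketch of one way to do this (assuming the executable already exists at the same path on every worker; paths are placeholders): RDD.pipe feeds each partition's records to the command's stdin and collects its stdout lines as the resulting RDD.

  val input = sc.textFile("hdfs:///data/input")
  val metrics = input.pipe("/opt/tools/blackbox")  // one external process per partition
  metrics.saveAsTextFile("hdfs:///data/metrics")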
If I don't call iter() and just return treeiterator directly, I get an
error message that the object is not of an iterator type. This is in
Python 2.6... perhaps a bug?
But I also realized my code was wrong: it results in an RDD containing all
the tags in all the files. What I really want is
Do you want to read the file content in the following statement?
val ny_daily = sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_daily"))
If so, you should use textFile, e.g.,
val ny_daily = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_daily")
parallelize is used to distribute an existing local collection, not to read files.
Hi Andrew,
Thanks for pointing me to that example. My understanding of the JobServer
(based on watching a demo of its UI) is that it maintains a set of Spark
contexts and allows people to add jars to them, but doesn't allow unloading
or reloading jars within a Spark context. The code in JobCache
Thanks, it worked.
A very basic question: I have created a custom input format, e.g. stock. How
do I refer to this class as a custom InputFormat? I.e., where do I keep this
class on the Linux filesystem? Do I need to add the jar, and if so, how?
I am running the code through spark-shell.
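If it helps, one 0.9-era approach, sketched with placeholder names (the jar path and the StockInputFormat class are hypothetical, and ADD_JARS is the shell mechanism as I recall it): package the class into a jar, put that jar on the classpath when launching the shell, then name the class as the format parameter.

  ADD_JARS=/path/to/stock-inputformat.jar ./bin/spark-shell

  import org.apache.hadoop.io.{LongWritable, Text}
  // StockInputFormat is the hypothetical custom class, assumed to be a
  // new-API FileInputFormat[LongWritable, Text]
  val stocks = sc.newAPIHadoopFile[LongWritable, Text, StockInputFormat](
    "hdfs:///data/stocks")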
Thanks
Pari
On 19-Mar-2014 7:35
Very cool!
-Original Message-
From: Chanwit Kaewkasi [mailto:chan...@gmail.com]
Sent: Wednesday, March 19, 2014 10:36 AM
To: user@spark.apache.org
Subject: Spark enables us to process Big Data on an ARM cluster !!
Hi all,
We are a small team doing research on low-power (and low-cost)
I don't know anything about ARM clusters, but it looks great. What are
the specs? Do the nodes have no local disk at all?
On Tue, Mar 18, 2014 at 10:36 PM, Chanwit Kaewkasi chan...@gmail.comwrote:
Hi all,
We are a small team doing research on low-power (and low-cost) ARM
clusters. We built
Hi,
I'm getting the following error:
java.lang.NoSuchMethodError:
org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
at
I doubt there is something like this out of the box. The easiest thing is to
package it into a jar and send that jar across.
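A sketch of that (paths are placeholders): pass the jar when creating the context, or add it afterwards, and Spark ships it to the executors.

  val sc = new SparkContext("spark://master:7077", "App", "/opt/spark", Seq("/path/to/app.jar"))
  // or, on an existing context:
  sc.addJar("/path/to/app.jar")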
Regards
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Mar 19, 2014 at 6:57 AM, Jaonary Rabarisoa
Chanwit, that is awesome!
Improvements in shuffle operations should make life even better for
you. Great to see a data point on ARM.
Sent while mobile. Pls excuse typos etc.
On Mar 18, 2014 7:36 PM, Chanwit Kaewkasi chan...@gmail.com wrote:
Hi all,
We are a small team doing research
Actually, thinking more on this question, Matei: I'd definitely say support
for Avro. There's a lot of interest in this!!
On Tue, Mar 18, 2014 at 8:14 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
BTW one other thing -- in your experience, Diana, which non-text
InputFormats would be most
On Spark 1.0.0-SNAPSHOT this seems to work; at least so far I have seen no
issues.
On Thu, Mar 6, 2014 at 8:44 AM, Koert Kuipers ko...@tresata.com wrote:
It's a 0.9 snapshot from January running in standalone mode.
Have these fixes been merged into 0.9?
On Thu, Mar 6, 2014 at 12:45 AM,
The data may be spilled to disk, hence HDFS is a necessity for Spark.
You can run Spark on a single machine and not use HDFS, but in distributed
mode HDFS will be required.
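For the single-machine case, a minimal sketch (paths are placeholders):

  val sc = new SparkContext("local[4]", "LocalApp")  // 4 worker threads, no cluster, no HDFS
  val data = sc.textFile("/tmp/input.txt")           // a plain local path works here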
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On
Another vote on this: support for simple SequenceFiles and/or Avro would be
terrific, as using plain text can be very space-inefficient, especially for
numerical data.
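In the Scala API, reading a simple SequenceFile already looks like this (a sketch; the path and key/value types are placeholders):

  import org.apache.hadoop.io.{IntWritable, Text}
  val pairs = sc.sequenceFile("hdfs:///data/counts", classOf[Text], classOf[IntWritable])
                .map { case (k, v) => (k.toString, v.get) }  // copy out of the reused Writables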
-- Jeremy
On Mar 19, 2014, at 5:24 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
I'd second the request for Avro
Just figured it out. I need to add file:// to the URI. I guess it was not
needed in previous Hadoop versions.
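A sketch of the fix, with a placeholder path; the scheme is spelled out explicitly instead of relying on the default filesystem:

  val data = sc.textFile("file:///home/user/data.txt")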
I am testing the performance of Spark to see how it behaves when the
dataset size exceeds the amount of memory available. I am running
wordcount on a 4-node cluster (Intel Xeon 16 cores (32 threads), 256GB
RAM per node). I limited spark.executor.memory to 64g, so I have 256g
of memory available in
I'm in San Francisco until Friday for a conference (visiting from Boston).
If any of y'all are up for a drink or something, I'd love to meet you in
person and say hi.
Nick
persist and unpersist.
unpersist: Mark the RDD as non-persistent, and remove all blocks for it from
memory and disk.
2014-03-19 16:40 GMT+08:00 林武康 vboylin1...@gmail.com:
Hi, can anyone tell me about the lifecycle of an RDD? I searched through
the official website and still can't figure it out.
Related question:
If I keep creating new RDDs and cache()-ing them, does Spark automatically
unpersist the least recently used RDD when it runs out of memory? Or is an
explicit unpersist the only way to get rid of an RDD (barring the PR
Tathagata mentioned)?
Also, does unpersist()-ing an RDD
Yes, Spark automatically removes old RDDs from the cache when you make new
ones. Unpersist forces it to remove them right away. In both cases though, note
that Java doesn’t garbage-collect the objects released until later.
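A small sketch of the explicit lifecycle (placeholder path):

  val rdd = sc.textFile("hdfs:///data/input").cache()
  rdd.count()       // first action materializes the cached blocks
  rdd.unpersist()   // drop the blocks from memory and disk right away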
Matei
On Mar 19, 2014, at 7:22 PM, Nicholas Chammas
Okie doke, good to know.
On Wed, Mar 19, 2014 at 7:35 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
Yes, Spark automatically removes old RDDs from the cache when you make new
ones. Unpersist forces it to remove them right away. In both cases though,
note that Java doesn’t garbage-collect
I have the exact same error running on a bare-metal cluster with CentOS 6
and Python 2.6.6. Any other thoughts on the problem here? I only get the
error on operations that require communication, like reduceByKey or groupBy.
On Sun, Mar 2, 2014 at 1:29 PM, Nicholas Chammas
So I have the pyspark shell open and after some idle time I sometimes get
this:
PySpark worker failed with exception:
Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File
Hi,
As I understand it, a DStream consists of one or more RDDs, and foreachRDD
will run a given function on each and every RDD inside the DStream.
I created a simple program which reads log files from a folder every hour:
JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(60 * 60 * 1000));
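The same idea in a one-line Scala sketch (the stream variable is a placeholder):

  stream.foreachRDD { rdd => println("records in this batch: " + rdd.count()) }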