Change your conf/spark-env.sh:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
Date: Thu, 28 Aug 2014 16:19:05 -0700
From: ml-node+s1001560n13074...@n3.nabble.com
To: linkpatrick...@live.com
Subject: problem connection to hdfs on localhost from spark-shell
Hi ,
I am working with pyspark and doing simple aggregation
def doSplit(x):
    y = x.split(',')
    if len(y) == 3:
        return y[0], (y[1], y[2])

counts = lines.map(doSplit).groupByKey()
output = counts.collect()
Iterating over the output, I got data in this format: u'1385501280'
Hi, dear,
Thanks for the response. Some comments below. And yes, I am using Spark on
YARN.
1. The Spark release docs say multiple jobs can be submitted in one application
if the jobs (actions) are submitted by different threads. I wrote some Java thread
code in the driver, one action in each thread,
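(For reference, a minimal sketch of that pattern, written in Scala rather than Java;
the RDD and the per-thread work are illustrative, not the original poster's code.)

import org.apache.spark.{SparkConf, SparkContext}

object MultiJobThreads {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MultiJobThreads"))
    val data = sc.parallelize(1 to 1000000).cache()
    // One action per thread; the scheduler accepts the resulting jobs concurrently,
    // subject to available cores and the configured scheduling mode (FIFO by default).
    val threads = (1 to 3).map { i =>
      new Thread(new Runnable {
        def run(): Unit = println(s"job $i counted " + data.filter(_ % i == 0).count())
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    sc.stop()
  }
}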
Hey Michael, Cheng,
Thanks for the replies. Sadly I can't remember the specific error so I'm
going to chalk it up to user error, especially since others on the list
have not had a problem.
@michael By the way, was at the Spark 1.1 meetup yesterday. Great event,
very informative, cheers and keep
OK, but what if I have a long list? Do I need to hard-code every element of my
list like this, or is there a function that translates a list into a tuple?
On Fri, Aug 29, 2014 at 3:24 AM, Michael Armbrust mich...@databricks.com
wrote:
You don't need the Seq, as "in" is a variadic function.
I upped the ulimit to 128k files on all nodes. Job crashed again with
DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275.
Couldn't get the logs because I killed the job, and it looks like YARN
wipes the container logs (not sure why it wipes the logs under
/var/log/hadoop-yarn/container).
I set partitions to 64:
//
kInMsg.repartition(64)
val outdata = kInMsg.map(x => normalizeLog(x._2, configMap))
//
Still see all activity only on the two nodes that seem to be receiving
from Kafka.
On Thu, Aug 28, 2014 at 5:47 PM, Tim Smith secs...@gmail.com wrote:
TD - Apologies, didn't realize
You can use the Thrift server to access Hive tables located in the legacy
Hive warehouse and/or those generated by Spark SQL. Simba provides a Spark
SQL ODBC driver that enables applications like Tableau. But right now I'm
not 100% sure whether the driver has been officially released yet.
On
Hi all
The Spark web UI gives me information about total cores and used cores.
I want to get this information programmatically.
How can I do this?
Thanks
Kevin
To pass a list to a variadic function you can use the type ascription :_*
For example:
val longList = Seq[Expression](a, b, ...)
table("src").where('key in (longList: _*))
Also, note that I had to explicitly specify Expression as the type
parameter of Seq to ensure that the compiler converts a
u'14.0' is a unicode string; you can convert it into a str with
u'14.0'.encode('utf8'), or convert it into a float with
float(u'14.0')
Davies
On Thu, Aug 28, 2014 at 11:22 PM, Oleg Ruchovets oruchov...@gmail.com wrote:
Hi ,
I am working with pyspark and doing simple aggregation
def
Say you have a spark streaming setup such as
JavaReceiverInputDStream... rndLists = jssc.receiverStream(new JavaRandomReceiver(...));
rndLists.map(new NeuralNetMapper(...))
    .foreach(new JavaSyncBarrier(...));
Is there any way of ensuring that, say, a JavaRandomReceiver and
Hi,
My requirement is to run Spark on Yarn without using the script
spark-submit.
I have a servlet and a Tomcat server. As and when a request comes, it creates
a new SC and keeps it alive for further requests. I am setting my
master in sparkConf
as sparkConf.setMaster("yarn-cluster"),
but the
Thank you Yanbo for the reply.
I've another query related to cogroup. I want to iterate over the results of the
cogroup operation.
My code is:
grp = RDD1.cogroup(RDD2)
map((lambda (x, y): (x, list(y[0]), list(y[1]))), list(grp))
My result looks like:
[((u'764', u'20140826'),
Still not working for me. I got a compilation error: "value in is not a
member of Symbol". Any ideas?
On Fri, Aug 29, 2014 at 9:46 AM, Michael Armbrust mich...@databricks.com
wrote:
To pass a list to a variadic function you can use the type ascription :_*
For example:
val longList =
including user@spark.apache.org.
On Fri, Aug 29, 2014 at 2:03 PM, Archit Thakur archit279tha...@gmail.com
wrote:
Hi,
My requirement is to run Spark on Yarn without using the script
spark-submit.
I have a servlet and a tomcat server. As and when request comes, it
creates a new SC and
You can't use an RDD inside a map function.
RDDs can only be operated on from the driver of a Spark program.
In your case, the grouped RDD can't be found on every executor.
I guess you want to implement a subquery-like operation; try
RDD.intersection() or join().
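(A minimal sketch of that suggestion, with illustrative data and an assumed existing
SparkContext named sc:)

// Wrong: referencing `groups` inside another RDD's map would ship an RDD to executors.
// Right: express the lookup as a join, which is planned on the driver.
val groups = sc.parallelize(Seq((1, "groupA"), (2, "groupB")))
val events = sc.parallelize(Seq((1, "click"), (2, "view"), (1, "buy")))
val joined = events.join(groups)   // RDD[(Int, (String, String))]
joined.collect().foreach(println)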
2014-08-29 12:43 GMT+08:00 Gary Zhao
Hi,
I think an example will help illustrate the model better.
/*** SimpleApp.scala ***/
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "$YOUR_SPARK_HOME/README.md"
    val sc = new SparkContext("local",
Hi,
Tried the same thing in HIVE directly without issue:
HIVE:
hive> create table test_datatype2 (testbigint bigint);
OK
Time taken: 0.708 seconds
hive> drop table test_datatype2;
OK
Time taken: 23.272 seconds
Then tried again in Spark:
scala> val hiveContext = new
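(A sketch of the usual continuation in a Spark 1.0.x shell, where sc is the shell's
existing SparkContext; the table statement mirrors the Hive session above.)

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.hql("create table test_datatype2 (testbigint bigint)")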
I see it works well, thank you!!!
But in the following situation, how do I do it?
var a = sc.textFile("/sparktest/1/").map((_, "a"))
var b = sc.textFile("/sparktest/2/").map((_, "b"))
How do I get (3,a) and (4,a)?
On Aug 28, 2014, at 19:54, Matthew Farrellee m...@redhat.com wrote:
On 08/28/2014 07:20 AM, marylucy wrote:
Archit,
We are using yarn-cluster mode, and calling Spark via the Client class
directly from the servlet server. It works fine.
To establish a communication channel to give further requests,
it should be possible with yarn-client, but not with yarn-cluster. In yarn-client
mode, the Spark driver
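(A sketch, not a verified recipe, of creating the context programmatically in
yarn-client mode from a long-running JVM such as a servlet container; it assumes
HADOOP_CONF_DIR/YARN_CONF_DIR point at the cluster configuration and the Spark
assembly is on the classpath. The app name is illustrative.)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ServletDrivenSpark")
  .setMaster("yarn-client")   // driver stays in this JVM, executors run on YARN
val sc = new SparkContext(conf)
// keep sc alive and reuse it for subsequent requests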
Hi Daniel,
Your suggestion is definitely an interesting approach. In fact, I already
have another system to deal with the stream analytical processing part. So
basically, the Spark job that aggregates data just cumulatively computes
aggregations from historical data together with each new batch, which
What version of Spark are you running?
Try calling sc.defaultParallelism. I’ve found that it is typically set to
the number of worker cores in your cluster.
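(A minimal sketch, assuming an existing SparkContext named sc; this is a rough
programmatic proxy for the cores figure shown in the web UI, not an exact match.)

val approxTotalCores = sc.defaultParallelism
println(s"Approximate total cores: $approxTotalCores")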
On Fri, Aug 29, 2014 at 3:39 AM, Kevin Jung itsjb.j...@samsung.com wrote:
Hi all
Spark web ui gives me the information about total
I understand that the DB writes are happening from the workers unless you
collect. My confusion is that you believe workers recompute on recovery (node
computations which get redone upon recovery). My understanding is that
checkpointing dumps the RDD to disk and then cuts the RDD lineage. So I
How did you specify the HDFS path? When I put
spark.eventLog.dir hdfs://crosby.research.intel-research.net:54310/tmp/spark-events
in my spark-defaults.conf file, I receive the following error:
An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext.
:
Hi all,
I would like to ask for some advice about resetting a Spark stateful operation,
so I tried this:
JavaStreamingContext jssc = new JavaStreamingContext(context, new Duration(5000));
jssc.remember(new Duration(5 * 60 * 1000));
jssc.checkpoint(ApplicationConstants.HDFS_STREAM_DIRECTORIES);
Chris,
I did the DStream.repartition mentioned in the documentation on parallelism in
receiving, as well as setting spark.default.parallelism, and it still uses only
2 nodes in my cluster. I notice there is another email thread on the same
topic:
Hi All,
Yesterday I restarted my cluster, which had the effect of clearing /tmp.
When I brought Spark back up and ran my first job, /tmp/spark-events was
re-created and the job ran fine. I later learned that other users were
receiving errors when trying to create a spark context. It turned out
What version are you using?
On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Still not working for me. I got a compilation error: "value in is not a
member of Symbol". Any ideas?
On Fri, Aug 29, 2014 at 9:46 AM, Michael Armbrust mich...@databricks.com
wrote:
codes is a DStream, not an RDD. The remember() method controls how
long Spark Streaming holds on to the RDDs themselves. Can you clarify what you
mean by "reset"? codes provides a stream of RDDs that contain your
computation over a window of time. New RDDs come with the computation
over new data.
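(One way to get counters that effectively reset is to compute them over a window
rather than keeping a running total; a sketch in Scala, assuming a
DStream[(String, Long)] named codeCounts and suitable batch and checkpoint settings.)

import org.apache.spark.streaming.Minutes

val windowedCounts = codeCounts.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,   // combine values inside the window
  Minutes(5),                    // window length: old batches fall out after 5 minutes
  Minutes(1))                    // slide interval
windowedCounts.print()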
On Fri, Aug
'this 2-node replication is mainly for failover in case the receiver dies
while data is in flight. there's still chance for data loss as there's no
write ahead log on the hot path, but this is being addressed.'
Can you comment a little on how this will be addressed, will there be a
durable WAL?
So codes currently holds RDDs containing codes and their respective
counters. I would like to find a way to reset those RDDs after some period of
time.
On Fri, Aug 29, 2014 at 5:55 PM, Sean Owen so...@cloudera.com wrote:
codes is a DStream, not an RDD. The remember() method controls how
long
Hello Sparkies !
Could anyone please answer this? This is not a Hadoop cluster, so which
download option should I use for a standalone cluster?
Also, what are the best practices if you have 1 TB of data and want to use Spark?
Do you have to use Hadoop/CDH or some other option?
Hi everyone! I've been working on Smoke, a web frontend to
interactively launch Spark jobs without compiling them (it only supports
Scala right now, and launches the jobs in yarn-client mode). It works by
executing the Scala script using spark-shell on the Spark server.
It's developed in Python, uses
Spark SQL is based on Hive 12. They must have changed the maximum key size
between 12 and 13.
On Fri, Aug 29, 2014 at 4:38 AM, arthur.hk.c...@gmail.com
arthur.hk.c...@gmail.com wrote:
Hi,
Tried the same thing in HIVE directly without issue:
HIVE:
hive create table test_datatype2
Thanks Michael, that makes total sense.
It works perfectly.
Yadid
On Thu, Aug 28, 2014 at 9:19 PM, Michael Armbrust mich...@databricks.com
wrote:
The comma is just the way the default toString works for Row objects.
Since SchemaRDDs are also RDDs, you can do arbitrary transformations on
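(A small sketch of that point, assuming a SchemaRDD named results returned from a
sqlContext/hiveContext query; since it is also an RDD of Row, an ordinary map can
reformat each row instead of relying on the default toString.)

val formatted = results.map(row => s"${row(0)}\t${row(1)}")   // tab-separated fields
formatted.take(10).foreach(println)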
Hi All,
New to Spark; using Spark 1.0.2 and Hive 0.12.
If the Hive table is created as test_datatypes(testbigint bigint, ss bigint), then
select * from test_datatypes from Spark works fine.
For create table test_datatypes(testbigint bigint, testdec decimal(5,2)):
scala> val
Hi,
I am having the same problem reported by Michael. I am trying to open 30
files. ulimit -n shows the limit is 1024. So I am not sure why the program
is failing with a "Too many open files" error. The total size of all the 30
files is 230 GB.
I am running the job on a cluster with 10 nodes, each
I'm thinking of local mode where multiple virtual executors occupy the same
vm. Can we have the same configuration in spark standalone cluster mode?
Yes, executors run one task per core of your machine by default. You can also
manually launch them with more worker threads than you have cores. What cluster
manager are you on?
Matei
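(For the local-mode case, a sketch of the knob being described; local[16] runs up to
16 concurrent tasks in one JVM regardless of physical cores. In standalone mode the
analogous setting is advertising more cores per worker, e.g. SPARK_WORKER_CORES in
conf/spark-env.sh. The app name is illustrative.)

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wide-local").setMaster("local[16]"))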
On August 29, 2014 at 11:24:33 AM, Victor Tso-Guillen (v...@paxata.com) wrote:
I'm thinking of local mode
I wrote a long post about how I arrived here but in a nutshell I don't see
evidence of re-partitioning and workload distribution across the cluster.
My new fangled way of starting the job is:
run=`date +%m-%d-%YT%T`; \
nohup spark-submit --class logStreamNormalizer \
--master yarn
Standalone. I'd love to tell it that my one executor can simultaneously
serve, say, 16 tasks at once for an arbitrary number of distinct jobs.
On Fri, Aug 29, 2014 at 11:29 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Yes, executors run one task per core of your machine by default. You can
1.0.2
On Friday, August 29, 2014, Michael Armbrust mich...@databricks.com wrote:
What version are you using?
On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa jaon...@gmail.com
javascript:_e(%7B%7D,'cvml','jaon...@gmail.com'); wrote:
Still not working for me. I got a compilation error
Crash again. On the driver, logs say:
14/08/29 19:04:55 INFO BlockManagerMaster: Removed 7 successfully in
removeExecutor
org.apache.spark.SparkException: Job aborted due to stage failure: Task
2.0:0 failed 4 times, most recent failure: TID 6383 on host
node-dn1-2-acme.com failed for unknown
Hi folks,
I am trying to use Kafka with Spark Streaming, and it appears I cannot do
the normal 'sbt package' as I do with other Spark applications, such as
Spark alone or Spark with MLlib. I learned I have to build with the
sbt-assembly plugin.
OK, so here is my build.sbt file for my extremely
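(For context, a minimal sketch of what such a build.sbt commonly looks like, with
illustrative names and versions; the sbt-assembly plugin itself is declared
separately in project/plugins.sbt, and merge-strategy settings are usually needed
as well.)

name := "kafka-streaming-app"

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.0.2" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.0.2" % "provided",
  // Kafka support is not part of the Spark assembly, so it must be bundled into the fat jar:
  "org.apache.spark" %% "spark-streaming-kafka" % "1.0.2"
)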
Can you share more details about your job, cluster properties and
configuration parameters?
Thanks,
Nishkam
On Fri, Aug 29, 2014 at 11:33 AM, Chirag Aggarwal
chirag.aggar...@guavus.com wrote:
When I run SparkSql over yarn, it runs 2-4 times slower as compared to
when its run in local mode.
Hi,
I am facing the same problem.
Did you find any solution or work around?
Thanks and Regards,
Archit Thakur.
On Thu, Jan 16, 2014 at 6:22 AM, Liu, Raymond raymond@intel.com wrote:
Hi
Regarding your question
1) when I run the above script, which jar is being submitted to the yarn
Here’s a repro for PySpark:
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y).take(1)
When I try this on an EC2 cluster with 1.1.0-rc2 and Python 2.7, this is
what I get:
a = sc.parallelize([Nick, John, Bob]) a =
Good to see I am not the only one who cannot get incoming DStreams to
repartition. I tried repartition(512) but still no luck - the app
stubbornly runs only on two nodes. Now this is 1.0.0 but looking at
release notes for 1.0.1 and 1.0.2, I don't see anything that says this
was an issue and has
TD,
can you please comment on this code?
I am really interested in including this code in Spark.
But I am bothered about some points about persistence:
1. When we extend Receiver and call store(),
is it a blocking call? Does it return only when Spark stores the RDD as requested
(i.e. replicated or
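(For reference, a minimal custom receiver sketch showing where store() is called; it
does not answer the blocking/replication question above, which depends on the storage
level and Spark's internals. The class name and fake data are illustrative.)

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class DummySource extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {
  def onStart(): Unit = {
    // onStart() must not block, so receive on a separate thread.
    new Thread("dummy-source") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("record at " + System.currentTimeMillis())  // hand one record to Spark
          Thread.sleep(100)
        }
      }
    }.start()
  }
  def onStop(): Unit = { /* the receive loop exits once isStopped() is true */ }
}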
This feature was not part of that version. It will be in 1.1.
On Fri, Aug 29, 2014 at 12:33 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
1.0.2
On Friday, August 29, 2014, Michael Armbrust mich...@databricks.com
wrote:
What version are you using?
On Fri, Aug 29, 2014 at 2:22 AM,
I need some advice regarding how data is stored in an RDD. I have millions
of records, called Measures. They are bucketed with keys of String type.
I wonder if I need to store them as RDD[(String, Measure)] or RDD[(String,
Iterable[Measure])], and why?
Data in each bucket are not related
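(A sketch of the two layouts being compared, with a stand-in Measure type and an
assumed existing SparkContext named sc; the second layout is what groupByKey
produces from the first.)

case class Measure(timestamp: Long, value: Double)

// Flat layout: one record per measure.
val flat: org.apache.spark.rdd.RDD[(String, Measure)] = sc.parallelize(Seq(
  ("bucketA", Measure(1L, 1.0)),
  ("bucketA", Measure(2L, 2.0)),
  ("bucketB", Measure(3L, 3.0))))

// Pre-bucketed layout: one record per key, holding all of its measures.
val bucketed: org.apache.spark.rdd.RDD[(String, Iterable[Measure])] = flat.groupByKey()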
I create my DStream very simply as:
val kInMsg = KafkaUtils.createStream(ssc, "zkhost1:2181/zk_kafka", "testApp", Map("rawunstruct" -> 8))
.
.
eventually, before I operate on the DStream, I repartition it:
kInMsg.repartition(512)
Are you saying that ^^ repartition doesn't split my dstream into multiple
You can use a tuple associating a timestamp with your running sum, and have
COMPUTE_RUNNING_SUM reset the running sum to zero when the timestamp is
more than 5 minutes old.
You'll still have a leak doing so if your keys keep changing, though.
--Christophe
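(A sketch of that idea using updateStateByKey, assuming a DStream[(String, Long)]
named pairs and a checkpoint directory already set on the streaming context; the
state is a (lastResetTimestamp, runningSum) tuple that is zeroed once it is more
than five minutes old.)

val expiryMs = 5 * 60 * 1000L
val sums = pairs.updateStateByKey[(Long, Long)] { (values: Seq[Long], state: Option[(Long, Long)]) =>
  val now = System.currentTimeMillis()
  val (ts, sum) = state.getOrElse((now, 0L))
  val expired = now - ts > expiryMs
  // Start a new window (reset the sum and the timestamp) once the old one expires.
  Some((if (expired) now else ts, (if (expired) 0L else sum) + values.sum))
}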
2014-08-29 9:00 GMT-07:00 Eko Susilo
Any more thoughts on this? I'm not sure how to do this yet.
On Fri, Aug 29, 2014 at 12:10 PM, Victor Tso-Guillen v...@paxata.com
wrote:
Standalone. I'd love to tell it that my one executor can simultaneously
serve, say, 16 tasks at once for an arbitrary number of distinct jobs.
On Fri, Aug
OK, so I did this:
val kInStreams = (1 to 10).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost1:2181/zk_kafka", "testApp", Map("rawunstruct" -> 1)) }
val kInMsg = ssc.union(kInStreams)
val outdata = kInMsg.map(x => normalizeLog(x._2, configMap))
This has improved parallelism. Earlier I would only get a Stream 0.
Just checked it with 1.0.2
Still same exception.
From: Anton Brazhnyk [mailto:anton.brazh...@genesys.com]
Sent: Wednesday, August 27, 2014 6:46 PM
To: Tathagata Das
Cc: user@spark.apache.org
Subject: RE: [Streaming] Akka-based receiver with messages defined in uploaded
jar
Sorry for the delay
Can you try adding the JAR to the class path of the executors directly, by
setting the config spark.executor.extraClassPath in the SparkConf. See
Configuration page -
http://spark.apache.org/docs/latest/configuration.html#runtime-environment
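(A one-line sketch of that suggestion; the jar path is illustrative.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraClassPath", "/path/to/jar-with-message-classes.jar")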
I think what you guessed is correct. The Akka actor
Oops, the last reply didn't go to the user list. Mail app's fault.
Shuffling happens in the cluster, so you need to change all the nodes in the
cluster.
Sent from my iPhone
On Aug 30, 2014, at 3:10, Sudha Krishna skrishna...@gmail.com wrote:
Hi,
Thanks for your response. Do you know if I