Hi,
I am using newAPIHadoopRDD to load an RDD from HBase (using pyspark running as
yarn-client) - pretty much the standard case demonstrated in
hbase_inputformat.py from the examples... The thing is that when trying the very
same code on Spark 1.2 I am getting the error below, which based on
Hey,
I have a job that keeps failing if too much data is processed, and I can't
see how to get it working. I've tried repartitioning with more partitions
and increasing the amount of memory for the executors (now about 12G and 400
executors). Here is a snippet of the first part of the code, which
You can also use the join function of RDD. This is actually a kind of append
function that adds up all the RDDs and creates one uber RDD.
On Wed, Jan 7, 2015, 14:30 rkgurram rkgur...@gmail.com wrote:
Thank you for the response, sure will try that out.
Currently I changed my code such that the first
As I recall, Oryx (the old version, and I assume the new one too) provides
something like this:
http://cloudera.github.io/oryx/apidocs/com/cloudera/oryx/als/common/OryxRecommender.html#recommendToAnonymous-java.lang.String:A-float:A-int-
though Sean will be more on top of that than me :)
On Mon,
You’re right, Nick! This function does exactly that.
Sean has already helped me greatly.
Thanks for your reply.
Wouter Samaey
Zaakvoerder Storefront BVBA
Tel: +32 472 72 83 07
Web: http://storefront.be
LinkedIn: http://www.linkedin.com/in/woutersamaey
On 07 Jan 2015, at 11:08, Nick
Fantastic - glad to see that it's in the pipeline!
On Wed, Jan 7, 2015 at 11:27 AM, Michael Armbrust mich...@databricks.com
wrote:
I want to support this but we don't yet. Here is the JIRA:
https://issues.apache.org/jira/browse/SPARK-3851
On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore
Thank you for the response, sure will try that out.
Currently I changed my code such that the first map, files.map, became
files.flatMap, which I guess will do something similar to what you are saying. It gives
me a List[] of elements (in this case LabeledPoints, I could also do RDDs),
which I then turned into a
Hi,
Are insert statements supported in Spark? If so, can you please give me an
example?
Rgds
--
Niranda
Hi,
Curiouser and curiouser. I'm puzzled by the Spark SQL cached table.
Theoretically, the cached table should be a columnar table, and only the
columns included in my SQL should be scanned.
However, in my test, I always see the whole table being scanned even though
I only select one column in
It seems that spark-network-yarn compiled for Scala 2.11 depends on
spark-network-shuffle compiled for Scala 2.10. This causes cross-version
dependency conflicts in sbt. Seems like a publishing error?
Edit your conf/log4j.properties file and change the following line:
log4j.rootCategory=INFO, console
to
log4j.rootCategory=ERROR, console
Another approach would be to:
Fire up spark-shell and type in the following:
import org.apache.log4j.Logger
import
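A minimal sketch of where that truncated snippet is presumably heading, assuming the goal is simply to silence the noisy org and akka loggers from the shell:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.ERROR)   // covers org.apache.spark
Logger.getLogger("akka").setLevel(Level.ERROR)  // actor-system chatter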
How about the following code? I'm not quite sure what you were doing inside
the flatMap and foreach.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import
I think you mean union(). Yes, you could also simply make an RDD for each
file, and use SparkContext.union() to put them together.
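For example, a minimal sketch (hypothetical file paths): build one RDD per file, then combine them in a single call.
val files = Seq("/data/part-1.txt", "/data/part-2.txt")  // hypothetical paths
val rdds = files.map(path => sc.textFile(path))          // one RDD per file
val combined = sc.union(rdds)                            // single RDD over all the files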
On Wed, Jan 7, 2015 at 9:51 AM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
You can also use the join function of RDD. This is actually a kind of append
Problems like this are always due to having code compiled for Hadoop 1.x
run against Hadoop 2.x, or vice versa. Here, you compiled for 1.x but at
runtime Hadoop 2.x is used.
A common cause is actually bundling Spark / Hadoop classes with your app,
when the app should just use the Spark / Hadoop
1.2
In run() you have
usersOut.setName("usersOut").persist(StorageLevel.MEMORY_AND_DISK)
productsOut.setName("productsOut").persist(StorageLevel.MEMORY_AND_DISK)
On Wed, Jan 7, 2015, 02:11 Xiangrui Meng men...@gmail.com wrote:
Which Spark version are you using? We made this configurable
Hi Spark/GraphX community,
I'm wondering if you have TinkerPop3/Gremlin on your radar?
(github https://github.com/tinkerpop/tinkerpop3, doc
http://www.tinkerpop.com/docs/3.0.0-SNAPSHOT)
They've done amazing work refactoring their stack recently, and Gremlin
is a very nice DSL to work with
Hi experts!
Is there any way to decide what an effective receiving rate for Kafka
Spark Streaming would be?
Thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/max-receiving-rate-in-spark-streaming-tp21013.html
Sent from the Apache Spark User List mailing
Hi,
I extended org.apache.spark.streaming.TestSuiteBase for some testing, and I
was able to run this test fine:
test("Sliding window join with 3 second window duration") {
val input1 =
Seq(
Seq(req1),
Seq(req2, req3),
Seq(),
Seq(req4, req5, req6),
Seq(req7),
If you are using the Lowlevel consumer
https://github.com/dibbhatt/kafka-spark-consumer then you have an option
to tweak the rate by setting the fetchSizeBytes
https://github.com/dibbhatt/kafka-spark-consumer/blob/master/src/main/java/consumer/kafka/KafkaConfig.java#L37
value. The default is 64kb, you
You need to put some files in the location
*(/user/huser/user/huser/flume)* once
the job starts running; only then will it print. Also note I missed the /
in the above code.
Thanks
Best Regards
On Wed, Jan 7, 2015 at 4:42 PM, Jeniba Johnson
jeniba.john...@lntinfotech.com wrote:
Hi Akhil,
Hi all,
Recently I played a little bit with both the naive and the MLlib Python sample codes
for logistic regression.
In short, I wanted to compare naive and non-naive logistic regression results
using the same input weights and data.
So, I slightly modified both sample codes to use the same initial weights
Hi Guys,
I have a kafka topic with 90 partitions and I am running
Spark Streaming (1.2.0) to read from kafka via KafkaUtils, creating 10
kafka-receivers.
My streaming is running fine and there is no delay in processing, just that
some partitions' data is never getting picked up. From the kafka
Hi,
Could you add the code where you create the Kafka consumer?
-kr, Gerard.
On Wed, Jan 7, 2015 at 3:43 PM, francois.garil...@typesafe.com wrote:
Hi Mukesh,
If my understanding is correct, each Stream only has a single Receiver.
So, if you have each receiver consuming 9 partitions, you
Hi,
I am currently running pyspark jobs against Spark 1.1.0 on YARN. When I run
example Java jobs such as spark-pi, the following files get created:
bash-4.1$ tree spark-pi-1420624364958
spark-pi-1420624364958
├── APPLICATION_COMPLETE
├── EVENT_LOG_1
└── SPARK_VERSION_1.1.0
0 directories, 3
Hi Mukesh,
If my understanding is correct, each Stream only has a single Receiver. So, if
you have each receiver consuming 9 partitions, you need 10 input DStreams to
create 10 concurrent receivers:
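A minimal sketch of that (assuming ssc, zkQuorum and group are already defined as in your code, and a hypothetical topic name):
val kafkaStreams = (1 to 10).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, group, Map("mytopic" -> 9))  // 9 consumer threads per receiver
}
val unified = ssc.union(kafkaStreams)  // process the 10 receivers as one DStream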
Another option is to make a copy of log4j.properties in the current
directory where you start spark-shell from, and modify
log4j.rootCategory=INFO,
console to log4j.rootCategory=ERROR, console. Then start the shell.
On Wed, Jan 7, 2015 at 3:39 AM, Akhil ak...@sigmoidanalytics.com wrote:
Edit
Hi all,
I am trying to run this example on mesos:
https://github.com/jaceklaskowski/spark-activator#master
I have mesos 0.21.0 (instead of 0.18.1, could that be a problem?)
I downloaded the Spark pre-built package spark-1.2.0-bin-hadoop2.4.tgz
We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework
that we've been developing that connects various different RDDs together
based on some predefined business cases. After updating to 1.2.0, some of
the concurrency expectations about how the stages within jobs are executed
But Xuelin already posted in the original message that the code was using
SET spark.sql.parquet.filterPushdown=true
On Wed, Jan 7, 2015 at 12:42 AM, Daniel Haviv danielru...@gmail.com wrote:
Quoting Michael:
Predicate push down into the input format is turned off by default because
there is
Hi,
Can you please suggest some of the best available training/coaching and
professional certifications in Apache Spark?
We are trying to run predictive analysis on our Sales data and come out with
recommendations (leads). We have tried to run CF but we end up getting
absolutely bogus
Hi Patrick: Do you know what the status of this issue is? Is there a JIRA
that is tracking this issue?
Thanks.
Asim
Patrick Wendell writes: Within a partition things will spill - so the
current documentation is correct. This spilling can only occur *across keys*
at the moment. Spilling cannot
Hi,
On Thu, Jan 8, 2015 at 2:19 PM, rekt...@voodoowarez.com wrote:
dstream processing bulk HDFS data- is something I don't feel is super
well socialized yet, fingers crossed that base gets built up a little
more.
Just out of interest (and hoping not to hijack my own thread), why are you
Yes, the problem is, I've turned the flag on.
One possible reason for this is that the parquet file supports predicate
pushdown by storing statistical min/max values of each column in parquet
blocks. If, in my test, groupID=10113000 is scattered across all parquet
blocks, then the predicate pushdown
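For reference, a minimal sketch of the code path being discussed, on the Spark 1.2 SQL API (hypothetical path and table name; the setConf call is equivalent to the SET statement quoted earlier in the thread):
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val data = sqlContext.parquetFile("/path/to/parquet")   // hypothetical path
data.registerTempTable("t")
sqlContext.sql("SELECT groupID FROM t WHERE groupID = 10113000").collect()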
Hi,
Currently, we are building up a mid-scale Spark cluster (100 nodes)
in our company. One thing bothering us is how Spark manages
resources (CPU, memory).
I know there are 3 resource management modes: standalone, Mesos, YARN.
In standalone mode, the cluster
Hi Xuelin,
I can only speak about Mesos mode. There are two modes of management in
Spark's Mesos scheduler: fine-grained mode and coarse-grained mode.
In fine-grained mode, each Spark task launches one or more Spark executors
that only live through the lifetime of the task. So it's
I have not used CDH 5.3.0. But it looks like
spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar contains some
hadoop1 jars (coming from a wrong HBase version).
I don't know the recommended way to build the spark-examples jar because the
official Spark docs do not mention how to build it.
Hi, I am using Eclipse to write Java code.
I am trying to create a Kafka receiver by:
JavaPairReceiverInputDStream<String, kafka.message.Message> a =
KafkaUtils.createStream(jssc, String.class, Message.class,
StringDecoder.class, DefaultDecoder.class, kafkaParams,
Try loading the features as
val userFeatures = sc.objectFile(path1)
val productFeatures = sc.objectFile(path2)
and then call the constructor of MatrixFactorizationModel with those.
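A minimal sketch of that suggestion (note: whether the MatrixFactorizationModel constructor is accessible outside MLlib depends on your Spark version, so treat this as an outline rather than guaranteed-to-compile code):
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs://***:9000/spark/results/userfeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs://***:9000/spark/results/productfeatures")
val model = new MatrixFactorizationModel(rank, userFeatures, productFeatures)  // rank must match the trained model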
Sent with Good (www.good.com)
-Original Message-
From: wanbo [gewa...@163.com]
Hi,
I have a setup where data from an external stream is piped into Kafka and
also written to HDFS periodically for long-term storage. Now I am trying to
build an application that will first process the HDFS files and then switch
to Kafka, continuing with the first item that was not yet in HDFS.
I've started 1 or 2 emails to ask more broadly- what are good practices
for doing DStream computations in a non-realtime fashion? I'd love to have a
good intro article to pass around to people, and some resources for those
others chasing this problem.
Back when I was working with Storm, managing
Hi,
Thanks for the information.
One more thing I want to clarify: when do Mesos or YARN allocate and
release the resources? I.e., what is the resource lifetime?
For example, in standalone mode, the resources are allocated when
the application is launched, resources released
In coarse-grained mode, the Spark executors are launched and kept running
while the scheduler is running. So if you have a spark shell launched and
kept open, the executors are running and won't finish until the shell
is exited.
In fine-grained mode, the overhead time mostly comes from
Hi all:
In my application, I start a SparkContext sc and execute some tasks on sc.
(Each task is a thread, which executes some transformations and actions on RDDs.)
For each task, I use sc.setJobGroup(JOB_GROUPID, JOB_DESCRIPTION) to
set the jobGroup for that task.
But when I call
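Presumably the truncated sentence is about cancelling or clearing the group; a minimal sketch of the job-group API under that assumption (hypothetical group id and description):
// in the thread running the task
sc.setJobGroup("group-1", "my task", interruptOnCancel = true)
// ... transformations and actions ...
// later, possibly from another thread
sc.cancelJobGroup("group-1")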
You can also follow the link below. It works on a standalone Spark cluster.
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
thanks
Somnath
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Thursday, January 08, 2015 2:21 AM
To: jamborta
Cc: user
I save and reload model like this:
val bestModel = ALS.train(training, rank, numIter, lambda)
bestModel.get.userFeatures.saveAsObjectFile(hdfs://***:9000/spark/results/userfeatures)
bestModel.get.productFeatures.saveAsObjectFile(hdfs://***:9000/spark/results/productfeatures)
val bestModel =
On Thu, Jan 08, 2015 at 02:33:30PM +0900, Tobias Pfeiffer wrote:
Hi,
On Thu, Jan 8, 2015 at 2:19 PM, rekt...@voodoowarez.com wrote:
dstream processing bulk HDFS data- is something I don't feel is super
well socialized yet, fingers crossed that base gets built up a little
more.
Hi Xuelin,
Spark 1.2 includes a dynamic allocation feature that allows Spark on YARN
to modulate its YARN resource consumption as the demands of the application
grow and shrink. This is somewhat coarser than what you call task-level
resource management. Elasticity comes through allocating and
Got it, thanks.
On Thu, Jan 8, 2015 at 3:30 PM, Tim Chen t...@mesosphere.io wrote:
In coarse-grained mode, the Spark executors are launched and kept running
while the scheduler is running. So if you have a spark shell launched and
kept open, the executors are running and won't finish until
Hi All,
I wonder if anyone has any experience with building Gradient Boosted Tree
models in MLlib, and can help me out. I'm trying to create a plot of the test
error rate of a Gradient Boosted Tree model as a function of number of trees,
to determine the optimal number of trees in the model.
Hi,
I installed the CDH 5.3.0 core + HBase on a new EC2 cluster. Then I manually
installed Spark 1.1 on it. But when I started the slaves, I got an error as
follows,
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
Error: Could not find or load main class
This could be caused by many things, including wrong configuration. Hard
to tell with just the info you provided.
Is there any reason why you want to use your own Spark instead of the
one shipped with CDH? CDH 5.3 has Spark 1.2, so unless you really need
to run Spark 1.1, you should be better off
Sorry- replace ### with an actual number. What does a skipped stage mean?
I'm running a series of jobs and it seems like after a certain point, the
number of skipped stages is larger than the number of actual completed
stages.
On Wed, Jan 7, 2015 at 3:28 PM, Ted Yu yuzhih...@gmail.com wrote:
Hello,
We are currently running our data pipeline on Spark, which uses Cassandra as
the data source.
We are currently facing an issue with the step where we create an RDD on the data
in a Cassandra table and then try to run flatMapToPair to transform the
data, but we are running into 'Too many open files'. I
Have you looked at Spark SQL
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables?
It supports HiveQL, can read from the hive metastore, and does not require
hadoop.
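A minimal sketch of that route (hypothetical table name), using the HiveContext from the linked guide:
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
hiveContext.sql("SELECT key, value FROM my_table LIMIT 10").collect()  // HiveQL against the metastore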
On Wed, Jan 7, 2015 at 8:27 AM, jamborta jambo...@gmail.com wrote:
Hi all,
We have been building a system
+Josh, who added the Job UI page.
I've seen this as well and was a bit confused about what it meant. Josh, is
there a specific scenario that creates these skipped stages in the Job UI ?
Thanks
Shivaram
On Wed, Jan 7, 2015 at 12:32 PM, Corey Nolet cjno...@gmail.com wrote:
Sorry- replace ###
Did you end up getting it working? By the way this might be a nicer view of
the docs:
https://github.com/apache/spark/blob/60e2d9e2902b132b14191c9791c71e8f0d42ce9d/docs/job-scheduling.md
We will update the latest Spark docs to include this shortly.
-Andrew
2015-01-04 4:44 GMT-08:00 Tsuyoshi
We just upgraded to Spark 1.2.0 and we're seeing this in the UI.
Looks like the number of skipped stages couldn't be formatted.
Cheers
On Wed, Jan 7, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:
We just upgraded to Spark 1.2.0 and we're seeing this in the UI.
I understand that I have to create 10 parallel streams. My code is running
fine when the number of partitions is ~20, but when I increase the number of
partitions I keep running into this issue.
Below is my code to create the kafka streams, along with the configs used.
Map<String, String> kafkaConf = new
- You are launching up to 10 threads/topic per Receiver. Are you sure your
receivers can support 10 threads each ? (i.e. in the default configuration, do
they have 10 cores). If they have 2 cores, that would explain why this works
with 20 partitions or less.
- If you have 90 partitions, why
Yes, the distribution is certainly fine and built for Hadoop 2. It sounds
like you are inadvertently including Spark code compiled for Hadoop 1 when
you run your app. The general idea is to use the cluster's copy at runtime.
Those with more pyspark experience might be able to give more useful
Hi,
I applied it as follows:
eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group, Map(args(a) ->
1), StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
val counts = eegStreams(a).map(x => math.round(x.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4))
val sortedCounts =
The cache command caches the entire table, with each column stored in its
own byte buffer. When querying the data, only the columns that you are
asking for are scanned in memory. I'm not sure what mechanism spark is
using to report the amount of data read.
If you want to read only the data that
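A minimal sketch of the caching behaviour being described (hypothetical table and column names):
sqlContext.cacheTable("my_table")                             // builds the in-memory columnar buffers
sqlContext.sql("SELECT some_column FROM my_table").collect()  // only that column's buffers are scanned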
Ah I see - so it's more like 're-used stages', which is not necessarily a bug
in the program or something like that.
Thanks for the pointer to the comment.
Thanks
Shivaram
On Wed, Jan 7, 2015 at 2:00 PM, Mark Hamstra m...@clearstorydata.com
wrote:
That's what you want to see. The computation of
General ideas regarding too many open files:
Make sure ulimit is actually being set, especially if you're on mesos
(because of https://issues.apache.org/jira/browse/MESOS-123 ). Find the pid
of the executor process, and cat /proc/<pid>/limits
set spark.shuffle.consolidateFiles = true
try
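A minimal sketch of setting that flag programmatically (it can equally go in spark-defaults.conf):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)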
I'm on Spark 1.2.0, with Scala 2.11.2 and SBT 0.13.7.
When running:
case class Test(message: String)
val sc = new SparkContext("local", "shell")
val sqlContext = new SQLContext(sc)
import sqlContext._
val testing = sc.parallelize(List(Test("this"), Test("is"), Test("a"),
Test("test")))
I'm new to Spark Streaming. From the programming guide I saw there is this
JavaStreamingContext.socketTextStream() API that connects to a server and
grabs the content to process. My requirement is slightly different: I have a
listening server that receives (does not go out to grab) messages from
Thank you Cody!!
I am going to try with the two settings you have mentioned.
We are currently running with Spark standalone cluster manager.
Thanks
Ankur
On Wed, Jan 7, 2015 at 1:20 PM, Cody Koeninger c...@koeninger.org wrote:
General ideas regarding too many open files:
Make sure ulimit
That's what you want to see. The computation of a stage is skipped if the
results for that stage are still available from the evaluation of a prior
job run:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L163
On Wed, Jan 7, 2015
Thanks Michael.
2015-01-08 6:04 GMT+08:00 Michael Armbrust mich...@databricks.com:
The cache command caches the entire table, with each column stored in its
own byte buffer. When querying the data, only the columns that you are
asking for are scanned in memory. I'm not sure what mechanism
Hi,
I've heard a lot of complaints about Spark's pull-style shuffle. Is
there any plan to support push-style shuffle in the near future?
Currently, the shuffle phase must be completed before the next stage
starts, while, it is said, in Impala the shuffled data is streamed to
the next
Hi,
I have a stupid question:
Is it possible to use Spark on a Teradata data warehouse, please? I read some
news on the internet which says yes. However, I didn't find any examples about
this issue.
Thanks in advance.
Cheers
Gen
Thanks Andrew, simple fix ☺.
From: Andrew Ash [mailto:and...@andrewash.com]
Sent: 07 January 2015 15:26
To: England, Michael (IT/UK)
Cc: user
Subject: Re: FW: No APPLICATION_COMPLETE file created in history server log
location upon pyspark job success
Hi Michael,
I think you need to
I asked this question too soon. I am caching off a bunch of RDDs in a
TrieMap so that our framework can wire them together, and the locking was
not completely correct - therefore it was creating multiple new RDDs at
times instead of using cached versions - which were creating completely
separate
Hi,
When I run jobs and save the event logs, they are saved with the permissions of
the unix user and group that ran the spark job. The history server is run as a
service account and therefore can’t read the files:
Extract from the History server logs:
2015-01-07 15:37:24,3021 ERROR Client
I guess I can, but it would be nicer if that were made configurable. I can
create the issue, test and PR if you guys think it's appropriate.
On Wed, Jan 7, 2015 at 1:41 PM, Sean Owen so...@cloudera.com wrote:
Ah, Fernando means the usersOut / productsOut RDDs, not the intermediate
links RDDs.
Hi All,
I run a Spark streaming application (Spark 1.2.0) on YARN (Hadoop 2.5.2) with
Spark event log enabled. I set the checkpoint dir in the streaming context and
run the app. It started in YARN with application id 'app_id_1' and created the
Spark event log dir
Hi,
It worked out like this:
val topCounts = sortedCounts.transform(rdd => rdd.zipWithIndex().filter(x => x._2
<= 10))
Regards, Laeeq
On Wednesday, January 7, 2015 5:25 PM, Laeeq Ahmed
laeeqsp...@yahoo.com.INVALID wrote:
Hi Yana,
I also think that val top10 = your_stream.mapPartitions(rdd =>
Oh yeah. In that case you can simply repartition it into 1 and do
mapPartitions.
val top10 = mystream.repartition(1).mapPartitions(rdd => rdd.take(10))
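If repartitioning to a single partition is too heavy, an alternative sketch (untested, assuming the element type has an Ordering) is to let RDD.top do the work inside a transform:
val top10 = mystream.transform(rdd => rdd.context.parallelize(rdd.top(10)))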
On 7 Jan 2015 21:08, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
Hi,
I applied it as follows:
eegStreams(a) = KafkaUtils.createStream(ssc,
Hi Yana,
I also think that val top10 = your_stream.mapPartitions(rdd => rdd.take(10))
will give the top 10 from each partition. I will try your code.
Regards, Laeeq
On Wednesday, January 7, 2015 5:19 PM, Yana Kadiyska
yana.kadiy...@gmail.com wrote:
My understanding is that
val top10 =
Ah, Fernando means the usersOut / productsOut RDDs, not the intermediate
links RDDs.
Can you unpersist() them, and persist() again at the desired level? the
downside is that this might mean recomputing and repersisting the RDDs.
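A minimal sketch of that (assuming model is the MatrixFactorizationModel returned by ALS.train, and that MEMORY_ONLY_SER is the level you want):
import org.apache.spark.storage.StorageLevel
model.userFeatures.unpersist()
model.userFeatures.persist(StorageLevel.MEMORY_ONLY_SER)
model.productFeatures.unpersist()
model.productFeatures.persist(StorageLevel.MEMORY_ONLY_SER)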
On Wed, Jan 7, 2015 at 5:11 AM, Xiangrui Meng men...@gmail.com
AFAIK, there are two levels of parallelism related to the Spark Kafka
consumer:
At the JVM level: for each receiver, one can specify the number of threads for
a given topic, provided as a map [topic -> nthreads]. This will
effectively start n JVM threads consuming partitions of that kafka topic.
At
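A minimal sketch of that JVM-level knob (hypothetical topic name and thread count; ssc, zkQuorum and groupId assumed to be defined):
val stream = KafkaUtils.createStream(ssc, zkQuorum, groupId, Map("mytopic" -> 4))  // 4 consumer threads in this receiver's JVM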
Hi all,
We have been building a system where we heavily rely on hive queries
executed through Spark to load and manipulate data, running on CDH and YARN.
I have been trying to explore lighter setups where we would not have to
maintain a hadoop cluster, just run the system on Spark only.
Is it
thanks, I found the issue. I was including
/usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar in
the classpath - this was breaking it. Now using a custom jar with just the python
convertors, everything works like a charm.
thanks, antony.
On Wednesday, 7 January 2015,
O'Reilly + Databricks does certification:
http://www.oreilly.com/data/sparkcert
Databricks does training: http://databricks.com/spark-training
Cloudera does too:
http://www.cloudera.com/content/cloudera/en/training/courses/spark-training.html
That said, I am not sure you need a certificate to
This particular case shouldn't cause problems since both of those
libraries are java-only (the scala version appended there is just for
helping the build scripts).
But it does look weird, so it would be nice to fix it.
On Wed, Jan 7, 2015 at 12:25 AM, Aniket Bhatnagar
aniket.bhatna...@gmail.com
There is some serialization overhead. You can try
https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
. -Xiangrui
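For reference, a minimal Scala sketch of the same colStats route (vectors is a hypothetical RDD[Vector] of your sparse vectors):
import org.apache.spark.mllib.stat.Statistics
val summary = Statistics.colStats(vectors)
val means = summary.mean  // column means, returned as a dense vector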
On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote:
I have an RDD of SparseVectors and I'd like to calculate the means returning
a dense vector.
Hi,
I am sorry to bother you, but I couldn't find any information about the online
test for the Spark certification managed through Kryterion.
Could you please give me the link for it?
Thanks a lot in advance.
Cheers
Gen
On Wed, Jan 7, 2015 at 6:18 PM, Paco Nathan cet...@gmail.com wrote:
Hi
Could you try it on AWS using EMR? That'd give you an exact replica of the
environment that causes the error.
Sent from my iPhone
On Jan 7, 2015, at 10:53 AM, Davies Liu dav...@databricks.com wrote:
Hey Sven,
I tried with all of your configurations, 2 node with 2 executors each,
but in
For online, use the http://www.oreilly.com/go/sparkcert link to sign up via
O'Reilly. They will send details -- the announcement is being prepared.
On Wed, Jan 7, 2015 at 10:56 AM, gen tang gen.tan...@gmail.com wrote:
Hi,
I am sorry to bother you, but I couldn't find any information about
The Spark code generates the log directory with 770 permissions. On
top of that you need to make sure of two things:
- all directories up to /apps/spark/historyserver/logs/ are readable
by the user running the history server
- the user running the history server belongs to the group that owns
Hey Sven,
I tried with all of your configurations, 2 node with 2 executors each,
but in standalone mode,
it worked fine.
Could you try to narrow down the possible change of configurations?
Davies
On Tue, Jan 6, 2015 at 8:03 PM, Sven Krasser kras...@gmail.com wrote:
Hey Davies,
Here are some
this is the official cloudera-compiled stack, CDH 5.3.0 - nothing has been done by
me, and I presume they are pretty good at building it, so I still suspect it now
gets the classpath resolved in a different way?
thx, Antony.
On Wednesday, 7 January 2015, 18:55, Sean Owen so...@cloudera.com wrote:
Hi Michael,
I think you need to explicitly call sc.stop() on the spark context for it
to close down properly (this doesn't happen automatically). See
https://issues.apache.org/jira/browse/SPARK-2972 for more details
Andrew
On Wed, Jan 7, 2015 at 3:38 AM, michael.engl...@nomura.com wrote:
Hi Saurabh,
In your area, Big Data Partnership provides Spark training:
http://www.bigdatapartnership.com/
As Sean mentioned, there is a certification program via a partnership
between O'Reilly Media and Databricks http://www.oreilly.com/go/sparkcert
That is offered in two ways, in-person at
Hello,
I have a problem with a join of two tables via Spark - I have tried to do it
via Spark SQL and the API but no progress so far. I basically have two tables:
ACCONTS - 16 million records and TRANSACTIONS - 2.5 billion records. When I try to
join the tables (please see code) the job gets stuck in the last
I have an RDD of SparseVectors and I'd like to calculate the means returning
a dense vector. I've tried doing this with the following (using pyspark,
spark v1.2.0):
def aggregate_partition_values(vec1, vec2):
    vec1[vec2.indices] += vec2.values
    return vec1
def