Hi,
After a Spark program completes, three temporary directories remain in
the temp directory.
The file names are like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7
And when the Spark program runs on Windows, a snappy DLL file also remains in
the temp directory.
The file name is like
2*2 cents
1. You can try repartition with a large number to get smaller
partitions (see the sketch after this list).
2. OOM errors can be avoided by increasing executor memory or using off-heap
storage.
3. How are you persisting? You can try persisting with the DISK_ONLY_SER
storage level.
4. You may take a look in the
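A minimal sketch of points 1 and 3 (Scala, Spark 1.3-era API; bigRdd and the
input path are hypothetical). Note that Spark has no DISK_ONLY_SER constant;
on-disk blocks are always stored serialized, so DISK_ONLY is the closest level:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RepartitionPersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-persist-sketch"))
    val bigRdd = sc.textFile("hdfs:///some/large/input") // hypothetical path
    // Point 1: a larger partition count means smaller partitions,
    // so each task holds less data in memory at once.
    val repartitioned = bigRdd.repartition(1000)
    // Point 3: keep blocks on disk only; disk blocks are always serialized.
    repartitioned.persist(StorageLevel.DISK_ONLY)
    println(repartitioned.count())
    sc.stop()
  }
}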
Hi.
The number of partitions is determined by the RDD it uses in the plan it
creates. It uses NewHadoopRDD, which derives its partitions from getSplits of
the input format it is using. Here it uses FilteringParquetRowInputFormat, a
subclass of ParquetInputFormat. To change the number of partitions, write a
new input format and
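The reply above is truncated; before writing a custom input format, a
lighter-weight option may be worth trying first. A hedged sketch, assuming the
input format honours the standard FileInputFormat max-split-size setting
(not every format does), with sc and sqlContext already in scope:

// Ask FileInputFormat-based getSplits for at most 64 MB per split,
// which should yield more (and smaller) partitions.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (64 * 1024 * 1024).toString)
val df = sqlContext.parquetFile("/dataset") // hypothetical path
println(df.rdd.partitions.length)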
How can I stop Spark from triggering a second attempt when the first one
fails?
I do not want to wait for the second attempt to fail again, so that I can
debug faster.
.set("spark.yarn.maxAppAttempts", "0") or .set("spark.yarn.maxAppAttempts",
"1")
is not helping.
--
Deepak
On Thu, May 7, 2015 at 10:18 AM, Iulian Dragoș iulian.dra...@typesafe.com
wrote:
Got it!
I'll open a Jira ticket and PR when I have a working solution.
Scratch that, I found SPARK-5281
(https://issues.apache.org/jira/browse/SPARK-5281).
On Wed, May 6, 2015 at 11:53 PM, Michael Armbrust
Got it!
I'll open a Jira ticket and PR when I have a working solution.
On Wed, May 6, 2015 at 11:53 PM, Michael Armbrust mich...@databricks.com
wrote:
Hi Iulian,
The relevant code is in ScalaReflection
Hi all!
I'm using Spark 1.3.0 and I'm struggling with the definition of a new type
for a project I'm working on. I've created a case class Person(name:
String) and now I'm trying to make Spark able to serialize and
deserialize the defined type. I made a couple of attempts but none of them
did
The multi-threading code in Scala is quite simple and you can google it
pretty easily. We used the Future framework. You can use Akka also.
@Evo My concerns with the filtering solution are: 1. Will rdd2.filter run
before rdd1.filter finishes? 2. We have to traverse the RDD twice. Any comments?
On
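For reference, a minimal sketch of the Future-based pattern under discussion
(rdd and the predicates are hypothetical). filter itself is lazy; it's the two
actions that run as concurrent jobs, so neither filter waits for the other,
and caching avoids traversing the source twice:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

rdd.cache() // avoid recomputing the source for each job
val f1 = Future { rdd.filter(_.contains("a")).saveAsTextFile("out1") }
val f2 = Future { rdd.filter(_.contains("b")).saveAsTextFile("out2") }
// Both jobs are submitted; Spark runs them concurrently if executor
// cores are available.
Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)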
I am trying to launch the Spark 1.3.1 history server on a secure cluster.
I can see in the logs that it successfully logs into Kerberos, and it is
replaying all the logs, but I never see the log message that indicates the
web server is started (I should see something like Successfully started
Sorry for the confusion. SQLContext doesn't have a persistent metastore, so
it's not possible to save data as a table. If anyone wants to contribute,
I'd welcome a new query planner strategy for SQLContext that gave a better
error message.
On Thu, May 7, 2015 at 8:41 AM, Judy Nash
Spark SQL using the Data Source API can also do this with much less code
https://twitter.com/michaelarmbrust/status/579346328636891136.
https://github.com/databricks/spark-avro
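A hedged sketch of the data source route (Spark 1.3-era DataFrame API with the
spark-avro package on the classpath; the paths are hypothetical and the exact
load syntax varies with the spark-avro version):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Read Avro through the Data Source API, write back out as Parquet.
val avroDf = sqlContext.load("/path/to/input.avro", "com.databricks.spark.avro")
avroDf.saveAsParquetFile("/path/to/output.parquet")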
On Thu, May 7, 2015 at 8:41 AM, Jonathan Coveney jcove...@gmail.com wrote:
A helpful example of how to convert:
I would suggest also looking at: https://github.com/databricks/spark-avro
On Wed, May 6, 2015 at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Hello,
This is how I read Avro data.
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericRecord
import
The history server may need several hours to start if you have a lot of
event logs. Is it stuck, or still replaying logs?
Best Regards,
Shixiong Zhu
2015-05-07 11:03 GMT-07:00 Marcelo Vanzin van...@cloudera.com:
Can you get a jstack for the process? Maybe it's stuck somewhere.
On Thu, May 7,
SPARK-5522 is really cool. Didn't notice it.
Best Regards,
Shixiong Zhu
2015-05-07 11:36 GMT-07:00 Marcelo Vanzin van...@cloudera.com:
That shouldn't be true in 1.3 (see SPARK-5522).
On Thu, May 7, 2015 at 11:33 AM, Shixiong Zhu zsxw...@gmail.com wrote:
The history server may need several
Hi Bill,
I just find it weird that one would use parallel threads to 'filter', as
filter is lazy in Spark, and multithreading wouldn't have any effect unless
the action triggering the execution of the lineage containing such a filter
is executed on a separate thread. One must have very specific
Hi Guys,
I think this problem is related to :
http://apache-spark-user-list.1001560.n3.nabble.com/AWS-Credentials-for-private-S3-reads-td8689.html
I am running PySpark 1.2.1 in AWS with my AWS credentials exported to the
master node as environment variables.
Halfway through my application, I
That shouldn't be true in 1.3 (see SPARK-5522).
On Thu, May 7, 2015 at 11:33 AM, Shixiong Zhu zsxw...@gmail.com wrote:
The history server may need several hours to start if you have a lot of
event logs. Is it stuck, or still replaying logs?
Best Regards,
Shixiong Zhu
2015-05-07 11:03
Seems I got one thread spinning at 100% for a while now, in
FsHistoryProvider.initialize(). Maybe something is wrong with my logs on HDFS
that it's reading? Or could it really take 30 minutes to read all the
history on HDFS?
jstack:
Deadlock Detection:
No deadlocks found.
Thread 2272: (state =
I'm also getting the same error.
Any ideas?
got it. thanks!
On Thu, May 7, 2015 at 2:52 PM, Marcelo Vanzin van...@cloudera.com wrote:
Ah, sorry, that's definitely what Shixiong mentioned. The patch I
mentioned did not make it into 1.3...
On Thu, May 7, 2015 at 11:48 AM, Koert Kuipers ko...@tresata.com wrote:
seems I got one thread
Hi,
Sorry, this may be a little off topic, but I tried searching for docs on the
history server and couldn't really find much. Can someone point me to a doc or
give me a point of reference for the use and intent of the history server?
-- Ankur
On 7 May 2015, at 12:06, Koert Kuipers
Is this something I can do? I am using a FileOutputFormat inside of the
foreachRDD call. After the output format runs, I want to do some directory
cleanup, and I want to block while I'm doing that. Is that something I can
do inside of this function? If not, where would I accomplish this on every
Can anyone please explain -
println("Initializing the KMeans model...")
val model = new
KMeansModel(ssc.sparkContext.objectFile[Vector](modelFile.toString).collect())
where modelFile is the *directory used to persist the model while training*
REF-
Ah, sorry, that's definitely what Shixiong mentioned. The patch I mentioned
did not make it into 1.3...
On Thu, May 7, 2015 at 11:48 AM, Koert Kuipers ko...@tresata.com wrote:
seems I got one thread spinning at 100% for a while now, in
FsHistoryProvider.initialize(). maybe something wrong with my
I bet you are running on YARN in cluster mode.
If you are running on YARN in client mode,
.set("spark.yarn.maxAppAttempts", "1") works as you expect,
because YARN doesn't start your app on the cluster until you call
SparkContext().
But if you are running on YARN in cluster mode, the driver
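A minimal sketch of the distinction being described: the yarn-cluster value
must reach YARN at submission time, so pass it to spark-submit; the in-code
set() only helps in yarn-client mode, where the SparkContext is created before
the YARN application:

import org.apache.spark.{SparkConf, SparkContext}

// yarn-client mode: the conf is read before the YARN application starts.
val conf = new SparkConf().set("spark.yarn.maxAppAttempts", "1")
val sc = new SparkContext(conf)

// yarn-cluster mode: set it on the command line instead, e.g.
//   spark-submit --master yarn-cluster --conf spark.yarn.maxAppAttempts=1 ...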
Can you get a jstack for the process? Maybe it's stuck somewhere.
On Thu, May 7, 2015 at 11:00 AM, Koert Kuipers ko...@tresata.com wrote:
I am trying to launch the Spark 1.3.1 history server on a secure cluster.
I can see in the logs that it successfully logs into Kerberos, and it is
One of the best discussions on the mailing list :-) ... Please help me in
concluding --
The whole discussion concludes that:
1. The framework does not support increasing the parallelism of a task just by
any inbuilt function.
2. The user has to manually write logic to filter the output of the upstream node
I am currently using PySpark with a virtualenv.
Unfortunately I don't have access to the nodes' file system and therefore I
cannot manually copy the virtual env over there.
I have been using this technique:
I first add a tar ball with the venv
sc.addFile(virtual_env_tarball_file)
Then in
I checked the data again; there is no skewed data in it, it is just txt files
with several string and int fields. That's it. I also followed the suggestions
on the tuning guide page, referring to
http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning
I will keep on inspecting why those left
Hi James,
If you’re on Spark 1.3 you can use the kill command in spark-submit to shut it
down. You’ll need the driver id from the Spark UI or from when you submitted
the app.
spark-submit --master spark://master:7077 --kill driver-id
Thanks,
Silvio
From: James King
Date: Wednesday, May 6,
Thanks, but it seems that the option is for Spark standalone mode only.
I’ve (lightly) tested the options with local mode and yarn-client mode; the
‘temp’ directories were not deleted.
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Thursday, May 07, 2015 10:47 PM
To: Todd Nist
Cc: Taeyun
A KMeansModel was trained in the previous step, and it was saved to
modelFile as a Java object file. This step is loading the model back and
reconstructing the KMeansModel, which can then be used to classify new
tweets into different clusters.
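A hedged sketch of the round trip being described (Spark 1.3-era MLlib; model
is the trained KMeansModel and modelFile is the directory path used during
training):

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector

// Training side: persist the cluster centers as a Java object file.
sc.parallelize(model.clusterCenters).saveAsObjectFile(modelFile)

// Serving side: load the centers back and rebuild the model,
// exactly as in the snippet being asked about.
val restored = new KMeansModel(sc.objectFile[Vector](modelFile).collect())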
Joseph
On Thu, May 7, 2015 at 12:40 PM, anshu shukla
Hi, I am pretty new to Spark and I am trying to implement a simple Spark
Streaming application using Meetup's RSVP stream: stream.meetup.com/2/rsvps
Any idea how to connect the stream to Spark Streaming?
I am trying out rawSocketStream but am not sure what the parameters are (e.g.,
the port).
Thank you
Hi,
I have a question regarding one of the oddities we encountered while running
mllib's column similarities operation. When we examine the output, we find
duplicate matrix entries (the same i,j). Sometimes the entries have the same
value/similarity score, but they're frequently different too.
MyClass is a basic Scala case class (using Spark 1.3.1):
case class Result(crn: Long, pid: Int, promoWk: Int, windowKey: Int,
    ipi: Double) {
  override def hashCode(): Int = crn.hashCode()
}
On Wed, May 6, 2015 at 8:09 PM, ayan guha guha.a...@gmail.com wrote:
What does your MyClass look like?
I believe this is a regression. It does not work for me either. There is a JIRA
on parquet wildcards which is resolved; I'll see about getting it reopened
Sent on the new Sprint Network from my Samsung Galaxy S®4.
Original message from: Vaxuki
vax...@gmail.com
Hi all,
I use a function to create or return the context for a Spark application; in
this function I load some resources from a text file into a list. My question
is: how do I update this list?
Olivier
Nope. Wildcard extensions don't work. I am debugging the code to figure out
what's wrong. I know I am using 1.3.1 for sure.
Pardon typos...
On May 7, 2015, at 7:06 AM, Olivier Girardot ssab...@gmail.com wrote:
hdfs://some ip:8029/dataset/*/*.parquet doesn't work for you?
On Thu, May 7
I give the executor 14 GB and would like to cut it.
I expect the critical operations to run hundreds of millions of times, which
is why we run on a cluster. I will try DISK_ONLY_SER.
Thanks
Steven Lewis sent from my phone
On May 7, 2015 10:59 AM, ayan guha guha.a...@gmail.com wrote:
2*2 cents
1.
Funny enough, I observe different behaviour on EC2 vs EMR (Spark on EMR
installed with
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark). Both
with Spark 1.3.1/Hadoop 2.
Reading a folder with 12 Parquet gives
This shouldn't be happening, do you have an example to reproduce it?
On Thu, May 7, 2015 at 4:17 PM, rbolkey rbol...@gmail.com wrote:
Hi,
I have a question regarding one of the oddities we encountered while
running
mllib's column similarities operation. When we examine the output, we find
Hi all,
I'd like to monitor akka using Kamon, which needs the
akka.extensions setting to be a list like this, in Typesafe Config format:
akka {
  extensions = ["kamon.system.SystemMetrics", "kamon.statsd.StatsD"]
}
But I cannot find a way to do this. I have tried these:
1.
Hi all,
Thanks for the help on this case!
We finally settled this by adding a jar named parquet-hive-bundle-1.5.0.jar
when submitting jobs through spark-submit;
this jar file does not exist in our CDH 5.3 anyway (we've downloaded it
from
On Thu, May 7, 2015 at 7:39 PM, felicia shsh...@tsmc.com wrote:
we tried to add /usr/lib/parquet/lib and /usr/lib/parquet to SPARK_CLASSPATH
and it doesn't seem to work,
To add the jars to the classpath you need to use /usr/lib/parquet/lib/*,
otherwise you're just adding the directory (and not
You're referring to a comment in the generic utility method, not the
specific calls to it. The comment just says that the generic method
doesn't mark the directory for deletion. Individual uses of it might
need to.
One or more of these might be delete-able on exit, but in any event
it's just a
The Spark documentation shows the following example code:
// Discretize data in 16 equal bins since ChiSqSelector requires categorical
features
val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16
  }))
}
I'm sort of missing why x / 16
Hi guys, I got a PhoenixParserException: ERROR 601
(42P00): Syntax error. Encountered FORMAT at line 21, column 141,
when creating a table using ROW FORMAT DELIMITED FIELDS TERMINATED
BY '\t' LINES TERMINATED BY '\n'. As I remember, a previous
version of Phoenix supported this grammar,
Hi all,
I'm able to run Spark SQL through Python/Java and retrieve data from an
ordinary table,
but when trying to fetch data from a parquet table, the following error shows
up, which is pretty straightforward, indicating that a parquet-related class
was not found;
we tried to add /usr/lib/parquet/lib
Hi all,
Recently in our project, we need to update an RDD using data regularly
received from a DStream. I plan to use the foreachRDD API to achieve this:
var MyRDD = ...
dstream.foreachRDD { rdd =>
  MyRDD = MyRDD.join(rdd)...
  ...
}
Is this usage correct? My concern is, as I am repeatedly and
I think the temporary folders are used to store blocks and shuffle data.
That doesn't depend on the cluster manager.
Ideally they should be removed after the application has
terminated.
Can you check if there are contents under those folders?
From: Taeyun
I tried a small spark.sql.shuffle.partitions = 16, so that every task fetches
a generally equal amount of data; however, every task still runs slowly.
Thanks & best regards!
San.Luo
----- Original Message -----
From: luohui20...@sina.com
To: luohui20001 luohui20...@sina.com,
I think I’ve found the (maybe partial, but major) reason.
It’s between the following lines (newly captured, but essentially the
same place that Zoltán Zvara picked):
15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
15/05/08 11:36:38 INFO YarnClientSchedulerBackend:
Hello,
I am following the tutorial code in the SQL programming guide
(https://spark.apache.org/docs/1.2.1/sql-programming-guide.html#inferring-the-schema-using-reflection)
to try out Python on Spark 1.2.1.
The saveAsTable function works in Scala but fails in Python with Unresolved
plan found.
Broken
Has there been any follow-up on this topic?
Here (http://search-hadoop.com/m/q3RTtm4EtI1hoD8d2) there were suggestions
that someone was going to publish some code, but no news since (TD himself
looked pretty interested).
Did anybody come up with something in the last months?
Hi all!
I'm using Spark 1.3.0 and I'm struggling with the definition of a new type for
a project I'm working on. I've created a case class Person(name: String) and
now I'm trying to make Spark able to serialize and deserialize the
defined type. I made a couple of attempts but none of them did
Default value for spark.worker.cleanup.enabled is false:
private val CLEANUP_ENABLED =
  conf.getBoolean("spark.worker.cleanup.enabled", false)
I wonder if the default should be set as true.
Cheers
On Thu, May 7, 2015 at 6:19 AM, Todd Nist tsind...@gmail.com wrote:
Have you tried to set the
Have you tried to set the following?
spark.worker.cleanup.enabled=true
spark.worker.cleanup.appDataTtl=<seconds>
On Thu, May 7, 2015 at 2:39 AM, Taeyun Kim taeyun@innowireless.com
wrote:
Hi,
After a Spark program completes, three temporary directories remain
in the temp
Where can I find your blog?
Thanks
On Apr 29, 2015, at 7:14 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:
After a day of debugging (actually, more), I can answer my question:
The problem is that the default value 200 of “spark.sql.shuffle.partitions”
is too small for sorting 2B
Hi V,
I am assuming that each of the three .parquet paths you mentioned has
multiple partition files in it.
For example: [/dataset/city=London/data.parquet/part-r-0.parquet,
/dataset/city=London/data.parquet/part-r-1.parquet]
I haven't personally used this with HDFS, but I've worked with a similar
Hi Dan,
In
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala,
you can see Spark uses Utils.nonNegativeMod(term.##, numFeatures) to locate
a term.
It's also mentioned in the doc that it "Maps a sequence of terms to their
term frequencies"
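A small illustration of the hashing trick described above (Spark 1.3-era
MLlib):

import org.apache.spark.mllib.feature.HashingTF

val tf = new HashingTF(numFeatures = 1000)
// Each term t is counted in bucket Utils.nonNegativeMod(t.##, 1000),
// so "spark" contributes 2 to its bucket here.
val vec = tf.transform(Seq("spark", "hashing", "tf", "spark"))
println(vec)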
It does look like the function that's executed is in the driver, so doing an
Await.result() on a thread AFTER I've executed an action should work. Just
updating this here in case anyone has this question in the future.
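A hedged sketch of that pattern (the stream and the cleanup helper are
hypothetical): the function passed to foreachRDD runs on the driver, and the
save action only returns once the job completes, so blocking cleanup right
after it is safe:

import org.apache.spark.streaming.dstream.DStream

def writeThenClean(stream: DStream[String], outDir: String): Unit = {
  stream.foreachRDD { (rdd, time) =>
    val target = s"$outDir/${time.milliseconds}"
    rdd.saveAsTextFile(target) // action: blocks until the job finishes
    // Driver-side, blocking directory cleanup is fine at this point.
    cleanUpOutputDir(target)
  }
}

def cleanUpOutputDir(path: String): Unit = {
  // hypothetical helper, e.g. remove _temporary artifacts
}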
Is this something I can do? I am using a FileOutputFormat inside of the
foreachRDD
SPARK-4825 (https://issues.apache.org/jira/browse/SPARK-4825) looks like the
right bug, but it should've been fixed in 1.2.1.
Is a similar fix needed in Python?
From: Judy Nash
Sent: Thursday, May 7, 2015 7:26 AM
To: user@spark.apache.org
Subject: saveAsTable fails on Python with Unresolved plan
By default you would expect to find the log files for the master and workers
in the relative `logs` directory from the root of the Spark installation on
each of the respective nodes in the cluster.
On Thu, May 7, 2015 at 10:27 AM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
> Can
A helpful example of how to convert:
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
As far as performance, that depends on your data. If you have a lot of
columns and use all of them, parquet deserialization is expensive. If you
have a lot of columns and only need a few
Thanks for the replies. We decided to use concurrency in Scala to do the
two mappings using the same source RDD in parallel. So far, it seems to be
working. Any comments?
On Wednesday, May 6, 2015, Evo Eftimov evo.efti...@isecc.com wrote:
RDD1 = RDD.filter()
RDD2 = RDD.filter()
*From:*
There's an open PR to fix it: https://github.com/apache/spark/pull/5966
On Thu, May 7, 2015 at 6:07 PM, Koert Kuipers ko...@tresata.com wrote:
I am having no luck using the 1.4 branch with Scala 2.11
$ build/mvn -DskipTests -Pyarn -Dscala-2.11 -Pscala-2.11 clean package
[error]
Figured it out. It was because I was using HiveContext instead of SQLContext.
FYI in case others saw the same issue.
From: Judy Nash
Sent: Thursday, May 7, 2015 7:38 AM
To: 'user@spark.apache.org'
Subject: RE: saveAsTable fails on Python with Unresolved plan found
I am having no luck using the 1.4 branch with Scala 2.11
$ build/mvn -DskipTests -Pyarn -Dscala-2.11 -Pscala-2.11 clean package
[error]
/home/koert/src/opensource/spark/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78:
in object RDDOperationScope, multiple overloaded
Hi,
I think you may want to use this setting?:
spark.task.maxFailures (default: 4): Number of individual task failures before
giving up on the job. Should be greater than or equal to 1. Number of allowed
retries = this value - 1.
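For reference, a minimal way to apply that setting so the first task failure
fails the job (allowed retries = 1 - 1 = 0):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.task.maxFailures", "1")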
On Thu, May 7, 2015 at 2:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
How
I should also add I've recently seen this issue as well when using collect.
I believe in my case it was related to heap space on the driver program not
being able to handle the returned collection.
On Thu, May 7, 2015 at 11:05 AM, Richard Marscher rmarsc...@localytics.com
wrote:
By default you
Hi,
is there any best practice, like MLlib's randomSplit, for splitting
training/cross-validation sets with DataFrames and the pipeline API?
Regards
Olivier.
Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3928
Looks like for now you'd have to list the full paths... I don't see a
comment from an official Spark committer, so I'm still not sure if this is a
bug or by design, but it seems to be the current state of affairs (see the
sketch below).
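A hedged sketch of the list-the-full-paths workaround (Spark 1.3 API;
sqlContext is assumed in scope and the partition values are hypothetical):

val cities = Seq("London", "NewYork")
val paths = cities.map(c => s"hdfs://namenode:8020/dataset/city=$c/data.parquet")
// parquetFile accepts multiple paths, so enumerate the partitions explicitly.
val df = sqlContext.parquetFile(paths: _*)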
On Thu, May 7, 2015 at
> Can you check your local and remote logs?
Where are the log files? I see the following in my driver program logs as well
as on the Spark UI failed task page:
java.io.IOException: org.apache.spark.SparkException: Failed to get
broadcast_2_piece0 of broadcast_2
Here is the detailed stack trace.
1) What is the best way to convert data from Avro to Parquet so that it can
be later read and processed ?
2) Will the performance of processing (join, reduceByKey) be better if both
datasets are in Parquet format when compared to Avro + Sequence ?
--
Deepak
Hi Bill,
Could you show a snippet of code to illustrate your choice?
-Gerard.
On Thu, May 7, 2015 at 5:55 PM, Bill Q bill.q@gmail.com wrote:
Thanks for the replies. We decided to use concurrency in Scala to do the
two mappings using the same source RDD in parallel. So far, it seems to be
After thinking about it more, I do think weighting lambda by sum_i cij is
the equivalent of the ALS-WR paper's approach for the implicit case. This
provides scale-invariance for varying products/users and for varying ratings,
and should behave well for all alphas. What do you guys think?
On Wed,
Scala is a language; Spark is an OO/functional, distributed framework
facilitating parallel programming in a distributed environment.
Any “Scala parallelism” occurs within the parallel model imposed by the Spark
framework, i.e. it is limited in terms of what it can achieve in terms of
The answer for Spark SQL “order by” is setting spark.sql.shuffle.partitions to
a bigger number. For RDD.sortBy it works out of the box if the RDD has a
sufficient number of partitions.
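A minimal sketch of both suggestions (Spark 1.3-era APIs; rdd is hypothetical
and the numbers are illustrative):

// Spark SQL "order by": raise the shuffle partition count from the default 200.
sqlContext.setConf("spark.sql.shuffle.partitions", "2048")

// RDD side: sortBy takes an explicit partition count.
val sorted = rdd.sortBy(x => x, ascending = true, numPartitions = 2048)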
From: Night Wolf [mailto:nightwolf...@gmail.com]
Sent: Thursday, May 07, 2015 5:26 AM
To: Ulanov, Alexander
Cc:
avulanov.blogspot.com, though it does not have more on this particular issue
than I already posted.
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Thursday, May 07, 2015 6:25 AM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Sort (order by) of the big dataset
Where can I find
hdfs://some ip:8029/dataset/*/*.parquet doesn't work for you?
On Thu, May 7, 2015 at 03:32, vasuki vax...@gmail.com wrote:
Spark 1.3.1 -
I have a parquet file on HDFS partitioned by some string, looking like this:
/dataset/city=London/data.parquet
/dataset/city=NewYork/data.parquet
I have a small application configured to use 6 CPU cores and I run it on a
standalone cluster. Such a configuration means that only 6 tasks can be active
at one moment, and if all of them are waiting (on IO, for example) then the
CPU is not fully used.
My questions:
1. Is it true that the number of active tasks per