Hello,
my last reduce task in the job always fails with java.io.IOException:
Failed to save output of task when using saveAsTextFile with an S3 endpoint
(all other tasks succeed). Has anyone had similar problems?
https://gist.github.com/gregakespret/813b540faca678413ad4
-
14/05/21
Hi Nick,
Here is an illustrated example which extracts certain fields from Facebook
messages; each message is a JSON object and they are serialised into files with
one complete JSON object per line. Example of one such message:
CandyCrush.json https://gist.github.com/cotdp/131a1c9fc620ab7898c4
You
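For anyone following along, the general shape of such an extraction can be sketched as below; the paths, the field name ("sender_name") and the use of json4s are assumptions for illustration, not the actual code from the gist.

import org.apache.spark.SparkContext
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// minimal sketch; paths and field names are made up, json4s is assumed on the classpath
object ExtractFields {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "extract-fields")

    val senders = sc.textFile("hdfs:///data/facebook-messages/*.json").map { line =>
      implicit val formats = DefaultFormats
      val json = parse(line)                     // one complete JSON object per line
      (json \ "sender_name").extract[String]     // keep only the field we care about
    }
    senders.saveAsTextFile("hdfs:///data/extracted-senders")
  }
}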
Hi,
with the hints from Gerard I was able to get my locally working Spark
code running on Mesos. Thanks!
Basically, on my local dev machine, I use sbt assembly to create a
fat jar (which is actually not so fat since I use ... % "provided"
in my sbt file for the Spark dependencies), upload it to
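For reference, a minimal build.sbt along those lines might look like the following (versions are illustrative, and the sbt-assembly plugin wiring in project/plugins.sbt is omitted):

name := "my-spark-job"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"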
Update: Partly user error. But still getting FS closed error.
Yes, we are running plain vanilla Hadoop 2.3.0, but it probably
doesn't matter.
1. Tried Colin McCabe's suggestion to patch with pull 850
(https://issues.apache.org/jira/browse/SPARK-1898). No
HI!
We are currently using HBase as our primary data store for different event-like
data. On top of that, we use Shark to aggregate this data and keep it
in memory for fast data access. Since we use no specific HBase functionality
whatsoever except putting data into it, a discussion
came up on
Hi
In my opinion, running HBase for immutable data is generally overkill, in
particular if you are using Shark anyway to cache and analyse the data and
provide the speed.
HBase is designed for random-access data patterns and high throughput R/W
activities. If you are only ever writing immutable
Is there a way to query fields by similarity (like Lucene or using a
similarity metric) to be able to query something like WHERE language LIKE
it~0.5 ?
Best,
Flavio
On Thu, May 22, 2014 at 8:56 AM, Michael Cutler mich...@tumra.com wrote:
Hi Nick,
Here is an illustrated example which
Hi,
I am running a Spark Streaming application. I have faced an uncaught
exception after which my worker stops processing any further messages.
I am using *spark 0.9.0*
Could you please let me know what could be the cause of this and how to
overcome this issue?
[ERROR] [05/22/2014
Hi,
We observed strange behaviour of Spark 0.9.0 when using sc.stop().
We have a bunch of applications that perform some jobs and then issue
sc.stop() at the end of main. Most of the time, everything works as
desired, but sometimes the applications get marked as FAILED by the
master and all
Hi,
Another problem we observed that on a very heavily loaded cluster, if the
worker fails to respond to the heartbeat within 60 seconds, it gets
disconnected permanently from the master and never connects back again. It
is very easy to reproduce - just setup a spark standalone cluster on a
You should always call sc.stop(), so it cleans up state and does not fill
up your disk over time. The strange behavior you observe is mostly benign,
as it only occurs after you have supposedly finished all of your work with
the SparkContext. I am not aware of a bug in Spark that causes this
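A minimal sketch of that pattern (the job body is just a placeholder):

val sc = new org.apache.spark.SparkContext("local[2]", "my-app")
try {
  sc.parallelize(1 to 1000).map(_ * 2).count()   // your actual work goes here
} finally {
  sc.stop()   // always stop, so temporary and shuffle state gets cleaned up
}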
No exceptions in any logs. No errors in stdout or stderr.
2014-05-22 11:21 GMT+02:00 Andrew Or and...@databricks.com:
You should always call sc.stop(), so it cleans up state and does not fill
up your disk over time. The strange behavior you observe is mostly benign,
as it only occurs after
I am using Tachyon as the storage system and Shark to query a table
which is a big table. I have 5 machines as a Spark cluster; there are 4
cores on each machine.
My questions are:
1. How do I set the number of tasks on each core?
2. Where can I see how many partitions an RDD has?
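(Not an authoritative answer, but as an illustration of where those knobs live -- the table path and the numbers below are made up:)

import org.apache.spark.{SparkConf, SparkContext}

// spark.default.parallelism controls the default number of tasks (partitions)
val conf = new SparkConf()
  .setAppName("partition-check")
  .set("spark.default.parallelism", "40")        // e.g. 2 tasks per core on 5 x 4 cores
val sc = new SparkContext(conf)

val rdd = sc.textFile("tachyon://master:19998/tables/bigtable")
println(rdd.partitions.length)                   // how many partitions this RDD has
val repartitioned = rdd.repartition(40)          // explicitly change the partition count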
Hi Mayur,
Thanks for your help.
I'm not sure I understand what parameters I must give to
newAPIHadoopFile[K, V, F <: InputFormat[K, V]](path: String,
fClass: Class[F], kClass: Class[K], vClass: Class[V],
conf: Configuration): JavaPairRDD[K, V]
It seems
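(For what it's worth, here is a minimal sketch of how those parameters line up, shown with the Scala API and the plain mapreduce TextInputFormat; the Java version takes the same classes in the same order. The path is made up, and an existing SparkContext `sc` is assumed, e.g. in the shell.)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// fClass = the InputFormat, kClass/vClass = the key and value types it produces
val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/input",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  new Configuration())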
I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40" to
shark-env.sh
2014-05-22 17:50 GMT+08:00 qingyang li liqingyang1...@gmail.com:
I am using Tachyon as the storage system and Shark to query a table
which is a big table. I have 5 machines as a Spark cluster; there are 4
My aim in setting the task number is to increase the query speed, and I have
also found that mapPartitionsWithIndex at
Operator.scala:333 (http://192.168.1.101:4040/stages/stage?id=17)
is costing much time. So my other question is:
how to tune
Hi
Is there any way or command by which we can wipe/drop the whole Shark cache
in one go?
Thanks
Vinay Bajaj
We have an instance of Spark running on top of Mesos and GlusterFS. Due to some
fixes of bugs that we also came across, we installed the latest versions:
1.0.0-rc9 (spark-1.0.0-bin-2.0.5-alpha, java 1.6.0_27), Mesos 0.18.1. Since
then, moderate sized tasks (10-20GB) cannot complete.
I notice
In my situation each slave has 8 GB memory. I want to use the maximum memory
that I can: .set("spark.executor.memory", "?g"). How can I determine the amount
of memory I should set? It fails when I set it to 8GB.
Hi,
We are moving into adopting the full stack of Spark. So far, we have used
Shark to do some ETL work, which is not bad but not perfect either. We
ended up writing UDFs, UDGFs and UDAFs that could have been avoided if we could use Pig.
Do you have any suggestions for an ETL solution in the Spark stack?
And
Hi,
I have a bunch of vectors like
[0.1234,-0.231,0.23131]
and so on.
and I want to compute cosine similarity and pearson correlation using
pyspark..
How do I do this?
Any ideas?
Thanks
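(Not a pyspark answer as such, but the arithmetic is small enough to sketch. Scala is shown since that is what most of this list uses; it ports line-for-line to Python. This assumes two equal-length dense vectors:)

// minimal sketch for two local dense vectors x and y
def cosine(x: Array[Double], y: Array[Double]): Double = {
  val dot = (x, y).zipped.map(_ * _).sum
  dot / (math.sqrt(x.map(v => v * v).sum) * math.sqrt(y.map(v => v * v).sum))
}

def pearson(x: Array[Double], y: Array[Double]): Double = {
  val mx = x.sum / x.length
  val my = y.sum / y.length
  val cov = (x, y).zipped.map((a, b) => (a - mx) * (b - my)).sum
  val sdx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sdy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sdx * sdy)
}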
Is there a simple way to monitor the overall progress of an action using
SparkListener or anything else?
I see that one can name an RDD... Could that be used to determine which
action triggered a stage, ... ?
Thanks
Pierre
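(There is no built-in progress bar in this version as far as I know, but a rough sketch of the listener route, assuming the Spark 1.0 listener API -- event field names differ slightly in 0.9:)

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler._

// counts finished tasks against submitted tasks -- only a coarse progress signal
class ProgressListener extends SparkListener {
  private val tasksDone = new AtomicInteger(0)
  private var tasksTotal = 0

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted) {
    tasksTotal += stageSubmitted.stageInfo.numTasks
  }
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    println("progress: " + tasksDone.incrementAndGet() + " / " + tasksTotal + " tasks")
  }
}

// sc.addSparkListener(new ProgressListener)   // register before running actions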
unsubscribe
From: William Kang weliam.cl...@gmail.com
To: user@spark.apache.org
Date: 05/22/2014 10:50 AM
Subject:ETL and workflow management on Spark
Hi,
We are moving into adopting the full stack of Spark. So far, we have used
Shark to do some ETL work, which is not bad
SparkListener offers good stuff.
But I also complemented it with some metrics machinery of my own that uses Akka
to aggregate metrics from anywhere I'd like to collect them (without any
dependency on Ganglia, only on Codahale).
However, this was useful to gather some custom metrics (from within the
tasks
Hi Andy!
Yes, the Spark UI provides a lot of interesting information for debugging purposes.
Here I’m trying to integrate a simple progress monitoring in my app ui.
I’m typically running a few “jobs” (or rather actions), and I’d like to be able
to display the progress of each of those in my ui.
I
Yeah, actually for that I used Codahale directly with my own code, using
the Akka system from within Spark itself.
So the workers send messages back to a bunch of actors on the driver which
use Codahale metrics.
This way I can collect what an executor is doing or did, but I can also
aggregate
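(Roughly, the shape of such a driver-side aggregator could look like the sketch below; the names are made up and the wiring that lets remote tasks find the actor is left out:)

import akka.actor.{Actor, ActorSystem, Props}
import com.codahale.metrics.MetricRegistry

case class Metric(name: String, value: Long)

// driver-side actor that folds incoming measurements into Codahale histograms
class MetricsAggregator(registry: MetricRegistry) extends Actor {
  def receive = {
    case Metric(name, value) => registry.histogram(name).update(value)
  }
}

object MetricsDriver {
  def main(args: Array[String]) {
    val registry   = new MetricRegistry
    val system     = ActorSystem("metrics")
    val aggregator = system.actorOf(Props(new MetricsAggregator(registry)), "aggregator")

    // executors would send messages like this, via an actorSelection on the driver's address
    aggregator ! Metric("records.processed", 123L)
  }
}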
Hi, I am running into a pretty concerning issue with Shark (granted I'm
running v. 0.8.1).
I have a Spark slave node that has run out of disk space. When I try to
start Shark it attempts to deploy the application to a directory on that
node, fails and eventually gives up (I see a Master Removed
Hi.
I'm writing a pilot project and plan on using Spark Streaming for it.
To start with, I have a dump of some access logs with their own timestamps,
and am using textFileStream and some old files to test it with.
One of the issues I've come across is simulating the windows. I would
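(For reference, a bare-bones textFileStream-plus-window sketch; the directory, batch and window sizes are made up. Note that these windows go by arrival time of the files, not by the timestamps inside the log lines, which is the part that has to be simulated:)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, Minutes, StreamingContext}

object LogWindows {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("log-windows").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // files dropped into the directory are picked up as new batches
    val lines = ssc.textFileStream("file:///tmp/access-logs")
    lines.window(Minutes(5), Seconds(30)).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}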
Hi,
We are in the process of migrating Pig onto Spark. What is your current Spark
setup?
Which version and cluster manager do you use?
Also, what is the data size you are working with right now?
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
I am having trouble adding logging to the class that does serialization and
deserialization. Where is the code for org.apache.spark.Logging located?
And is this serializable?
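(Not sure about your exact class, but mixing it in usually looks like the sketch below; the Logging trait lives in spark-core under org.apache.spark, and the class name here is hypothetical:)

import org.apache.spark.Logging

class MySerializer extends Serializable with Logging {
  def serialize(obj: AnyRef): Array[Byte] = {
    logInfo("serializing " + obj.getClass.getName)
    // ... actual serialization would go here ...
    Array[Byte]()
  }
}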
On Mon, May 12, 2014 at 10:02 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Ah, yes, that is correct. You
Hi all,
On an ARM cluster, I have been testing a wordcount program with JRE 7
and everything is OK. But when changing to the embedded version of
Java SE (Oracle's eJRE), the same program cannot complete all
computing stages.
It fails with many Akka disassociation errors.
- I've been trying to
Hi
After using Spark's TestSuiteBase to run some tests, I've noticed that at the
end, after finishing all tests, the cleaner is still running and outputs the
following periodically:
INFO o.apache.spark.util.MetadataCleaner - Ran metadata cleaner for
SHUFFLE_BLOCK_MANAGER
I use method
I'm forwarding this email along which contains a question from a Spark user
Adrien (CC'd) who can't successfully get any emails through to the Apache
mailing lists.
Please reply-all when responding to include Adrien. See below for his
question.
-- Forwarded message --
From: Adrien
I have since resolved the issue. The problem was that multiple RDDs were
trying to write to the same S3 bucket.
Grega
--
*Grega Kešpret*
Analytics engineer
Celtra — Rich Media Mobile Advertising
celtra.com http://www.celtra.com/ |
I am not 100% sure of the functionality in Catalyst; probably the easiest
way to see what it supports is to look at SqlParser.scala
(https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala)
in Git. Straight away I can see
LIKE, RLIKE and
Hi All
I am using Spark Streaming with Kafka. I receive messages and, after minor
processing, write them to HDFS; as of now I am using the saveAsTextFiles() /
saveAsHadoopFiles() Java methods.
- Is there some default way of writing a stream to Hadoop, like the HDFS sink
concept we have in Flume? I mean
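(For reference, the DStream output operations are the closest equivalent I know of; a minimal sketch of that route, where `stream` stands in for the DStream[String] coming from Kafka and the HDFS path prefix/suffix are placeholders:)

import org.apache.spark.streaming.dstream.DStream

def writeToHdfs(stream: DStream[String]) {
  stream
    .map(_.trim)                                              // minor per-record processing
    .saveAsTextFiles("hdfs://namenode:8020/events/batch", "txt")
}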
Hi,
I'm starting to explore the Spark Job Server contributed by Ooyala [1],
running from the master branch.
I started by developing and submitting a simple job and the JAR check gave
me errors on a seemingly good jar. I disabled the fingerprint checking on
the jar and I could submit it, but
Hi Gerard,
We're using the Spark Job Server in production, from GitHub [master]
running against a recent Spark-1.0 snapshot so it definitely works. I'm
afraid the only time we've seen a similar error was an unfortunate case of
PEBKAC http://en.wikipedia.org/wiki/User_error.
First and foremost,
Hey all,
I'm working through the basic SparkPi example on a YARN cluster, and I'm
wondering why my containers don't pick up the Spark assembly classes.
I built the latest spark code against CDH5.0.0
Then ran the following:
Hi Jon,
Your configuration looks largely correct. I have very recently confirmed
that the way you launch SparkPi also works for me.
I have run into the same problem a bunch of times. My best guess is that
this is a Java version issue. If the Spark assembly jar is built with Java
7, it cannot be
I had this problem too and fixed it by setting the wait timeout to a larger
value: --wait
For example, in standalone mode with default values, a timeout of 480
seconds worked for me:
$ cd spark-0.9.1/ec2
$ ./spark-ec2 --key-pair= --identity-file= --instance-type=r3.large
--wait=480
I want to run the LR, SVM, and NaiveBayes algorithms implemented in the
following directory on my data set. But I did not find the sample command
line to run them. Anybody help? Thanks.
spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification
There is a bin/run-example <example-class> [args] script.
2014-05-22 12:48 GMT-07:00 yxzhao yxz...@ualr.edu:
I want to run the LR, SVM, and NaiveBayes algorithms implemented in the
following directory on my data set. But I did not find the sample command
line to run them. Anybody help? Thanks.
Hi Ibrahim,
If your worker machines only have 8GB of memory, then launching executors
with all the memory will leave no room for system processes. There is no
guideline, but I usually leave around 1GB just to be safe, so
conf.set("spark.executor.memory", "7g")
Andrew
2014-05-22 7:23 GMT-07:00
Hi Michael,
Thanks for the tip on the /tmp dir. I had unzipped all the jars before
uploading to check for the class. The issue is that the jars were not
uploaded correctly.
I was not familiar with the '@' syntax of curl and omitted it, resulting in
a Jar file containing only the jar's name.
curl
Andrew,
Brilliant! I built on Java 7 but was still running our cluster on Java 6.
Upgraded the cluster and it worked (with slight tweaks to the args, I
guess the app args come first then yarn-standalone comes last):
About run-example, I've tried the MapR, Hortonworks and Cloudera distributions with
their Spark packages and none seem to package it.
Am I missing something? Is this only provided with the Spark project pre-built
binaries or from source installs?
Marco
On May 22, 2014, at 5:04 PM, Stephen
Ideally you should use less. 75% would be good, to leave enough
scratch space for shuffle writes and system processes.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, May 23, 2014 at 1:41 AM, Andrew Or
Andrew,
I did not register anything explicitly based on the belief that the class
name is written out in full only once. I also wondered why that problem
would be specific to JodaTime and not show up with java.util.Date... I guess
it is possible based on the internals of Joda-Time.
If I remove DateTime
I think you should be able to drop yarn-standalone altogether. We
recently updated SparkPi to take in 1 argument (num slices, which you set
to 10). Previously, it took in 2 arguments, the master and num slices.
Glad you got it figured out.
2014-05-22 13:41 GMT-07:00 Jon Bender
Spark uses the Twitter Chill library, which registers a bunch of core Scala
and Java classes by default. I'm assuming that java.util.Date is
automatically registered by that, but Joda's DateTime is not. We could
always take a look through the source to confirm too.
As far as the class name, my
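(For reference, explicitly registering a class with Kryo in Spark looks roughly like the sketch below; the registrator class and package names are made up, and Joda's DateTime may additionally need a custom Kryo serializer to round-trip correctly:)

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.joda.time.DateTime

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[DateTime])
  }
}

// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// conf.set("spark.kryo.registrator", "com.example.MyRegistrator")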
Thanks Stephen,
I used the following command line to run the SVM, but it seems that the
path is not correct. What should the right path or command line be? Thanks.
*./bin/run-example org.apache.spark.mllib.classification.SVM
spark://100.1.255.193:7077 train.csv 20*
Thanks.
I used the following command line to run the SVM, but it seems that the path
is not correct. What should the right path or command line be? Thanks.
./bin/run-example org.apache.spark.mllib.classification.SVM
spark://100.1.255.193:7077 train.csv 20
Exception in thread main
Hi All,
I am trying to run the network count example as a separate standalone job
and am running into some issues.
Environment:
1) Mac Mavericks
2) Latest spark repo from Github.
I have a structure like this
Shrikars-MacBook-Pro:SimpleJob shrikar$ find .
.
./simple.sbt
./src
./src/main
Mohit, if you want to end up with (1 .. N), why don't you skip the logic
for finding missing values, and generate it directly?
val max = myCollection.reduce(math.max)
sc.parallelize(1 to max)
In either case, you don't need to call cache, which will force it into
memory - you can do
How are you launching the application? sbt run ? spark-submit? local
mode or Spark standalone cluster? Are you packaging all your code into
a jar?
It looks to me like you have Spark classes in your execution
environment but are missing some of Spark's dependencies.
TD
On Thu, May 22, 2014 at
I am running it with sbt run. I am running it locally.
Thanks,
Shrikar
On Thu, May 22, 2014 at 3:53 PM, Tathagata Das
tathagata.das1...@gmail.comwrote:
How are you launching the application? sbt run ? spark-submit? local
mode or Spark standalone cluster? Are you packaging all your code into
a
Was the error message the same as you posted when you used `root` as
the user id? Could you try this:
1) Do not specify user id. (Default would be `root`.)
2) If it fails in the middle, try `spark-ec2 --resume launch
cluster` to continue launching the cluster.
Best,
Xiangrui
On Thu, May
I couldn't find the classification.SVM class.
- Most probably the command is something of the order of:
- bin/spark-submit --class
org.apache.spark.examples.mllib.BinaryClassification
examples/target/scala-*/spark-examples-*.jar --algorithm SVM train.csv
- For more details
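(Alternatively, a small driver that calls MLlib directly is straightforward; a rough sketch against the Spark 0.9 API, where features are an Array[Double], assuming a "label,feature1,feature2,..." CSV input format:)

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

object TrainSVM {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "train-svm")        // e.g. spark://100.1.255.193:7077

    val data = sc.textFile(args(1)).map { line =>          // e.g. train.csv
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts.head, parts.tail)                 // label first, then the features
    }.cache()

    val model = SVMWithSGD.train(data, 20)                 // 20 iterations
    println("weights: " + model.weights.mkString(","))
  }
}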
The cleaner should remain up while the SparkContext is still active (not
stopped). However, here it seems you are stopping the SparkContext
(ssc.stop(true)), so the cleaner should be stopped as well. That said, there was a bug
earlier where some of the cleaners may not have been stopped when the
context is
Austin,
I made up a mock example; my real use case is more complex. I used
foreach() instead of collect/cache; that forces the accumulable to be
evaluated. On another thread Xiangrui pointed me to a sliding-window RDD in
MLlib that is a great alternative (although I did not switch to using it).
Hi,
I'm confused about the right way to use broadcast variables from Java.
My code looks something like this:
Map val = ...; // build the Map to be broadcast
Broadcast<Map> broadcastVar = sc.broadcast(val);
sc.textFile(...).map(new SomeFunction(broadcastVar)); // inside SomeFunction, read it with broadcastVar.value()
My
Unsubscribe
How are you getting Spark with 1.0.0-SNAPSHOT through maven? Did you
publish Spark locally which allowed you to use it as a dependency?
This is weird indeed. SBT should take care of all the dependencies of
Spark.
In any case, you can try the last released Spark 0.9.1 and see if the
problem
Yes I did a sbt publish-local. Ok I will try with Spark 0.9.1.
Thanks,
Shrikar
On Thu, May 22, 2014 at 8:53 PM, Tathagata Das
tathagata.das1...@gmail.comwrote:
How are you getting Spark with 1.0.0-SNAPSHOT through maven? Did you
publish Spark locally which allowed you to use it as a
Try cleaning your maven (.m2) and ivy cache.
On May 23, 2014, at 12:03 AM, Shrikar archak shrika...@gmail.com wrote:
Yes I did a sbt publish-local. Ok I will try with Spark 0.9.1.
Thanks,
Shrikar
On Thu, May 22, 2014 at 8:53 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Hi,
I am trying to do an inner join in Shark using 64MB and 27MB files. I am
able to run the following queries on Mesos:
- SELECT * FROM geoLocation1
- SELECT * FROM geoLocation1 WHERE country = 'US'
But while trying inner join as
SELECT * FROM geoLocation1 g1 INNER JOIN
This is something we are interested in as well. We are planning to investigate
this more. If someone has suggestions, we would love to hear them.
Chester
Sent from my iPad
On May 22, 2014, at 8:02 AM, Pierre B
pierre.borckm...@realimpactanalytics.com wrote:
Hi Andy!
Yes Spark UI provides a
Hi,
I tried clearing the Maven and Ivy caches and I am a bit confused at this point
in time.
1) Running the example from the Spark directory using
bin/run-example: it works fine and prints the word counts.
2) Trying to run the same code as a separate job.
*) Using the latest