I’m trying to do a simple graph sort in Spark, which I mostly have working.
The one problem I have now is that I need to order them and then assign a
rank position.
So the top item should have rank 0, the next one should have rank 1, etc.
Hive and Pig support this with the RANK operator.
I
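For what it's worth, here is a rough sketch of assigning a 0-based rank after sorting (not from the original post; the items RDD and its score field are illustrative names):

    val ranked = items
      .sortBy(_.score, ascending = false)         // order by the ranking criterion
      .zipWithIndex()                             // pair each element with a 0-based index
      .map { case (item, rank) => (rank, item) }  // rank 0 for the top item, 1 for the next, ...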
Hi Xiangrui,
Thanks for the reply.
Julia code is also using the covariance matrix:
(1/n)*X'*X ;
Thanks,
Upul
On Fri, Jan 9, 2015 at 2:11 AM, Xiangrui Meng men...@gmail.com wrote:
The Julia code is computing the SVD of the Gram matrix. PCA should be
applied to the covariance matrix.
Cui Lin,
The solution largely depends on how you want your services deployed (Java
web container, Spray framework, etc...) and if you are using a cluster
manager like Yarn or Mesos vs. just firing up your own executors and master.
I recently worked on an example for deploying Spark services
Hi, Rishi
You are right. But the ids may number in the tens of thousands, and B is a database
with an index on id, which means querying by id is very fast.
In fact we load A and B as separate SchemaRDDs, as you suggested. But we hope we
can extend the join implementation to achieve this at the parsing stage.
Gents,
I'm building Spark from the current master branch and deploying it to
Google Compute Engine on top of Hadoop 2.4/YARN via bdutil, Google's Hadoop
cluster provisioning tool. bdutil configures Spark with
spark.local.dir=/hadoop/spark/tmp,
but this option is ignored in combination with
Not if broadcast can only be used between stages. To enable this you have
to at least make broadcast asynchronous and non-blocking.
On 9 January 2015 at 18:02, Krishna Sankar ksanka...@gmail.com wrote:
I am also looking at this domain. We could potentially use the broadcast
capability in Spark to
Hi,
I am using Spark 1.0.1. I am trying to debug an OOM exception I saw during a
join step.
Basically, I have an RDD of rows that I am joining with another RDD of
tuples.
Some of the tasks succeed, but a fair number fail with an OOM exception with
the stack below. The stack belongs to the 'reducer'
Hello, All,
What’s the best practice for deploying/publishing Spark-based scientific
applications as a web service? Something similar to Shiny for R.
Thanks!
Best regards,
Cui Lin
How big is your data? Did you see other error messages from executors?
It seems to me like a shuffle communication error. This thread may be
relevant:
http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3ccalrnvjuvtgae_ag1rqey_cod1nmrlfpesxgsb7g8r21h0bm...@mail.gmail.com%3E
Hi,
The userIds and productIds in my data are BigInts. What is the best way to
run collaborative filtering on this data? Should I modify MLlib's
implementation to support more types, or is there an easier way?
Thanks!,
Nishanth
Hi,
We are currently using Spark to join data in Cassandra and then write the
results back into Cassandra. While reads happen without any error, during
the writes we see many exceptions like the one below. Our environment details are:
- Spark v 1.1.0
- spark-cassandra-connector-java_2.10 v 1.1.0
We are
Do you have more than 2 billion users/products? If not, you can pair
each user/product id with an integer (check RDD.zipWithUniqueId), use
them in ALS, and then join the original bigInt IDs back after
training. -Xiangrui
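A rough sketch of that workaround (not from the thread; the ratings RDD and its layout are illustrative, and it assumes fewer than 2 billion distinct ids so the Long indices fit into Ints):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // ratings: RDD[(BigInt, BigInt, Double)] of (userId, productId, rating)
    val userIds: RDD[(BigInt, Long)]    = ratings.map(_._1).distinct().zipWithUniqueId()
    val productIds: RDD[(BigInt, Long)] = ratings.map(_._2).distinct().zipWithUniqueId()

    val indexed = ratings.map { case (u, p, r) => (u, (p, r)) }
      .join(userIds)                                         // attach the integer user index
      .map { case (_, ((p, r), uIdx)) => (p, (uIdx, r)) }
      .join(productIds)                                      // attach the integer product index
      .map { case (_, ((uIdx, r), pIdx)) => Rating(uIdx.toInt, pIdx.toInt, r) }

    val model = ALS.train(indexed, 10, 10)                   // rank = 10, iterations = 10
    // Keep userIds/productIds around to join the original BigInt ids back afterwards.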
On Fri, Jan 9, 2015 at 5:12 PM, nishanthps nishant...@gmail.com wrote:
Hi,
You can also look at Spark Job Server
https://github.com/spark-jobserver/spark-jobserver
- Gaurav
On Jan 9, 2015, at 10:25 PM, Corey Nolet cjno...@gmail.com wrote:
Cui Lin,
The solution largely depends on how you want your services deployed (Java web
container, Spray framework, etc...)
You need to subtract mean values to obtain the covariance matrix
(http://en.wikipedia.org/wiki/Covariance_matrix).
On Fri, Jan 9, 2015 at 6:41 PM, Upul Bandara upulband...@gmail.com wrote:
Hi Xiangrui,
Thanks for the reply.
Julia code is also using the covariance matrix:
(1/n)*X'*X ;
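For reference, a small sketch (not from the thread; input path is illustrative) of letting MLlib do this — as far as I know RowMatrix.computeCovariance() centers the columns, unlike the uncentered Gram matrix (1/n)*X'*X:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // One comma-separated row of features per line.
    val rows = sc.textFile("data.csv").map(l => Vectors.dense(l.split(',').map(_.toDouble)))
    val mat = new RowMatrix(rows)

    val cov = mat.computeCovariance()           // uses (X - mean), not raw X
    val pc  = mat.computePrincipalComponents(2) // PCA based on the covariance, top 2 components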
As Sean said, this definitely sounds like something worth a JIRA issue (and
PR).
On Fri Jan 09 2015 at 8:17:34 AM Sean Owen so...@cloudera.com wrote:
(FWIW yes I think this should certainly be a POST. The link can become
a miniature form to achieve this and then the endpoint just needs to
colStats() computes the mean values along with several other summary
statistics, which makes it slower. How is the performance if you don't
use kryo? -Xiangrui
On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar rokros...@gmail.com wrote:
thanks for the suggestion -- however, looks like this is even
I'm seeing this exception when creating a new SparkContext in YARN:
[ERROR] AssociationError [akka.tcp://sparkdri...@coreys-mbp.home:58243] -
[akka.tcp://driverpropsfetc...@coreys-mbp.home:58453]: Error [Shut down
address: akka.tcp://driverpropsfetc...@coreys-mbp.home:58453] [
sample 2 * n tuples, split them into two parts, balance the sizes of
these parts by filtering some tuples out
How do you guarantee that the two RDDs have the same size?
-Xiangrui
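One way to realize that idea, as a sketch (not from the thread; it assumes the 2 * n sample is small enough to collect to the driver, and the names are illustrative):

    // data: an RDD; n and numPartitions are illustrative values.
    val sample = data.takeSample(withReplacement = false, 2 * n)
    val half   = sample.length / 2
    val left   = sample.take(half)
    val right  = sample.slice(half, 2 * half)   // trim so both halves have exactly `half` elements
    val rdd1 = sc.parallelize(left, numPartitions)
    val rdd2 = sc.parallelize(right, numPartitions)
    val zipped = rdd1.zip(rdd2)                 // same length and partition count, so zip is safe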
On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke
1wil...@informatik.uni-hamburg.de wrote:
Hi Spark community,
I have
Thanks Sean.
I followed the guide and imported the codebase into IntelliJ IDEA as a Maven
project, with the hadoop-2.4 and yarn profiles.
In the Maven project view, I ran Maven Install against the module Spark
Project Parent POM (root). After a pretty long time, all the modules are built
successfully.
Spark uses MapReduce InputFormat implementations to read data from
disk, so in that sense it has access to, and uses, the same locality
info that things like MR do. Yes, tasks go to the data, and you want
to run Spark on top of the HDFS DataNodes. (Locality isn't always the
only priority that
I was looking for related information and found:
http://spark-summit.org/wp-content/uploads/2013/10/Spark-Ops-Final.pptx
See also http://hbase.apache.org/book.html#perf.hdfs.configs.localread for
how short circuit read is enabled.
Cheers
On Fri, Jan 9, 2015 at 3:50 PM, Sean Owen
Note also for short circuit reads that early versions are actually
net-negative in performance. Only after a second hadoop release of the
feature did it turn towards being a positive change. See earlier threads
on this mailing list where short circuit reads are discussed.
On Fri, Jan 9, 2015 at
Hi all,
Deep learning algorithms are popular and achieve state-of-the-art
performance on several real-world machine learning problems. Currently
there is no DL implementation in Spark, and I wonder if there is ongoing
work on this topic.
We can do DL in Spark with Sparkling Water and H2O, but
Pretty vague on details:
http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A227199
On Jan 9, 2015, at 11:39 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi all,
Deep learning algorithms are popular and achieve state-of-the-art
performance on several real-world
Hi,
I would like to use OpenHashSet
(org.apache.spark.util.collection.OpenHashSet) in my standalone program. I
can import it without error:
import org.apache.spark.util.collection.OpenHashSet
However, when I try to access it, I get an error:
object OpenHashSet in package
Thanks, I imagine this will kill any cached RDDs if their files are beyond the
ttl?
Thanks
From: Raghavendra Pandey [mailto:raghavendra.pan...@gmail.com]
Sent: 09 January 2015 15:29
To: England, Michael (IT/UK); user@spark.apache.org
Subject: Re: Cleaning up spark.local.dir automatically
You
You may want to look at the spark.cleaner.ttl configuration, which is infinite
by default. Spark uses that setting to delete temp files from time to time.
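For example, a sketch of setting it programmatically (the value is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("example")
      .set("spark.cleaner.ttl", "3600")   // seconds; old metadata/temp files get cleaned up periodically
    val sc = new SparkContext(conf)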
On Fri Jan 09 2015 at 8:34:10 PM michael.engl...@nomura.com wrote:
Hi,
Is there a way of automatically cleaning up the spark.local.dir after
Boris,
Yes, as you mentioned, we are creating a new SparkContext for our job, the
reason being to define the Apache Cassandra connection using SparkConf. We
hope this should also work.
For uploading the JAR, we followed:
(1) Package the JAR using the *sbt package* command
(2) Use *curl --data-binary
Hi Marcelo,
On MapR, the mapr user can read the files using the NFS mount; however, using
the normal hadoop fs -cat /... command, I get permission denied. As the history
server is pointing to a location on maprfs, not the NFS mount, I'd imagine the
Spark history server is trying to read the
Thanks, but how do I increase the number of tasks per core?
For example, if the application claims 10 cores, is it possible to launch
100 tasks concurrently?
On Fri, Jan 9, 2015 at 2:57 PM, Jörn Franke jornfra...@gmail.com wrote:
Hello,
Based on experiences with other software in virtualized
I am using the pre-built *spark-1.2.0-bin-hadoop2.4* from *[1]* to submit Spark
applications to YARN; I cannot find a pre-built Spark for *CDH-5.x*
versions. So, in my case the org.apache.hadoop.yarn.util.ConverterUtils class
is coming from spark-assembly-1.1.0-hadoop2.4.0.jar, which is part of
Hi,
When I fetch the Spark code base and import it into IntelliJ IDEA as an SBT
project and then build it with SBT, there are compile errors in the examples
module complaining about EventBatch and SparkFlumeProtocol; it looks like they
should be in the
org.apache.spark.streaming.flume.sink package.
Not
Guys,
I have a question regarding the Spark 1.1 broadcast implementation.
In our pipeline, we have a large multi-class LR model, which is about 1GiB
in size.
To take advantage of Spark's parallelism, a natural approach is to
broadcast this model file to the worker nodes.
However, it looks like
I think this was already answered on Stack Overflow:
http://stackoverflow.com/questions/27854919/skipping-header-file-from-each-csv-file-in-spark
where the one additional idea would be:
If there were just one header line, in the first record, then the most
efficient way to filter it out is:
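For instance, a sketch of that idea (not necessarily the exact code from the linked answer; the path is illustrative) is to drop the first line of the first partition only:

    val noHeader = sc.textFile("data.csv").mapPartitionsWithIndex { (i, iter) =>
      if (i == 0) iter.drop(1) else iter   // only partition 0 contains the header line
    }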
Maybe you can use the wholeTextFiles method, which returns the filename and
content of each file as a PairRDD, and then you can remove the first line from each file.
-Original Message-
From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com]
Sent: Friday, January 09, 2015 11:48 AM
To:
I am facing the same exception in saveAsObjectFile. Have you found any solution?
This is worker log, not executor log. The executor log can be found in
folders like /newdisk2/rta/rtauser/workerdir/app-20150109182514-0001/0/
. -Xiangrui
On Fri, Jan 9, 2015 at 5:03 AM, Priya Ch learnings.chitt...@gmail.com wrote:
Please find the attached worker log.
I could see stream closed
In the current implementation of TorrentBroadcast, the blocks are
fetched one by one in a single thread, so it cannot fully utilize the network
bandwidth.
Davies
On Fri, Jan 9, 2015 at 2:11 AM, Jun Yang yangjun...@gmail.com wrote:
Guys,
I have a question regarding to Spark 1.1 broadcast
You are not the first :) and probably not even the fifth to have this question.
A parameter server is not included in the Spark framework, and I've seen all
kinds of hacks to improvise one: REST API, HDFS, Tachyon, etc.
Not sure if an 'official' benchmark implementation will be released soon
On 9 January 2015
I’m resurrecting this thread because I’m interested in doing transpose on a
RowMatrix.
There is this other thread too:
http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-td12562.html
Which presents https://issues.apache.org/jira/browse/SPARK-3434 which is still
This is a little confusing, but that code path is actually going through
Hive, so the Spark SQL configuration does not help.
Perhaps try:
set parquet.compression=GZIP;
On Fri, Jan 9, 2015 at 2:41 AM, Ayoub benali.ayoub.i...@gmail.com wrote:
Hello,
I tried to save a table created via the
Hi Raghavendra,
This makes a lot of sense. Thank you.
The problem is that I'm using Spark SQL right now to generate the Parquet
file.
What I think I need to do is use Spark directly, transform all rows
of the SchemaRDD to Avro objects, and supply them to saveAsNewAPIHadoopFile
(from the
The other thing to note here is that Spark SQL defensively copies rows when
we switch into user code. This probably explains the difference between 1 and
2.
The difference between 1 and 3 is likely the cost of decompressing the column
buffers vs. accessing a bunch of uncompressed primitive objects.
Have you found any resolution for this issue?
No, do you have any idea?
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
From: firemonk9 [via Apache Spark User List]
[mailto:ml-node+s1001560n21067...@n3.nabble.com]
Sent: Friday, January 09, 2015 2:56 PM
To: Wang, Ningjun
I am running the following (connecting to an external Hive Metastore)
/a/shark/spark/bin/spark-shell --master spark://ip:7077 --conf
*spark.sql.parquet.filterPushdown=true*
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
and then ran two queries:
sqlContext.sql(select count(*)
I know that in order to clean up the state for a key I have to return None
when I call updateStateByKey. However, as far as I understand,
updateStateByKey only gets called for new keys (i.e. keys in current
batch), not for all keys in the DStream.
So, how can I clear the state for those keys in
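As a sketch of that mechanism (the pairs DStream and the expiry condition here are illustrative, not from the original code):

    // pairs: DStream[(String, Long)]
    val state = pairs.updateStateByKey[Long] { (newValues: Seq[Long], current: Option[Long]) =>
      if (newValues.isEmpty && current.exists(_ > 100)) {
        None                                       // returning None drops the state for this key
      } else {
        Some(current.getOrElse(0L) + newValues.sum)
      }
    }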
Yes that is the correct JIRA. It should make it to 1.3.
Best,
Reza
On Fri, Jan 9, 2015 at 11:13 AM, Adrian Mocanu amoc...@verticalscope.com
wrote:
I’m resurrecting this thread because I’m interested in doing transpose
on a RowMatrix.
There is this other thread too:
I am running Spark 1.1.1 built against CDH4 and have a few questions
regarding Spark performance related to co-location with HDFS nodes.
I want to know whether (and how efficiently) Spark takes advantage of being
co-located with an HDFS node.
What I mean by this is: if a file is being read by
This problem got resolved. In spark-submit I was using only the
--driver-class-path option; once I added the --jars option so that the workers
are aware of the dependent jar files, the problem went away. Need to check
if there are any worker logs that give better information than the
exception I was
That's a worker setting which cleans up the files left behind by executors,
so spark.cleaner.ttl isn't at the RDD level. After
https://issues.apache.org/jira/browse/SPARK-1860 the cleaner won't clean up
directories left by running executors.
On Fri, Jan 9, 2015 at 7:38 AM,
Does it make sense to use Spark's actor system (e.g. via
SparkContext.env.actorSystem) to create a parameter server?
On Fri, Jan 9, 2015 at 10:09 PM, Peng Cheng rhw...@gmail.com wrote:
You are not the first :) probably not the fifth to have the question.
parameter server is not included in
I am also looking at this domain. We could potentially use the broadcast
capability in Spark to distribute the parameters. Haven't thought it through yet.
Cheers
k/
On Fri, Jan 9, 2015 at 2:56 PM, Andrei faithlessfri...@gmail.com wrote:
Does it make sense to use Spark's actor system (e.g. via
Read this:
http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
In Spark Streaming apps if I enable ssc.checkpoint(dir) does this
checkpoint all RDDs? Or is it just checkpointing windowing and state RDDs?
For example, if in a DStream I am using an iterative algorithm on a
non-state non-window RDD, do I have to checkpoint it explicitly myself, or
can I assume
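If in doubt, an individual RDD can always be checkpointed explicitly; a sketch (names and path are illustrative):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // in a streaming app, ssc.checkpoint(dir) sets this
    val result = someRdd.map(step)                  // e.g. one iteration of the algorithm
    result.checkpoint()                             // mark the RDD for checkpointing
    result.count()                                  // an action forces the checkpoint to be written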
What's up with the IJ questions all of a sudden?
This PR from yesterday contains a summary of the answer to your question:
https://github.com/apache/spark/pull/3952 :
Rebuild Project can fail the first time the project is compiled,
because generate source files are not automatically generated.
You can try the following:
- Increase spark.akka.frameSize (default is 10MB)
- Try using torrentBroadcast
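For example, a sketch of applying both settings (the frame size value is illustrative):

    val conf = new SparkConf()
      .set("spark.akka.frameSize", "64")  // MB
      .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")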
Thanks
Best Regards
On Fri, Jan 9, 2015 at 3:41 PM, Jun Yang yangjun...@gmail.com wrote:
Guys,
I have a question regarding to Spark 1.1 broadcast implementation.
In our pipeline, we
Hi, when I run a query in Spark SQL it gives me the following error. What
possible reason could cause this problem?
java.io.EOFException
at
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:148)
at
Hi,
I am using Spark version 1.1 in standalone mode on the cluster. Sometimes,
during Naive Bayes training, I get an OptionalDataException at the line
map at NaiveBayes.scala:109
I am getting the following exception on the console:
java.io.OptionalDataException:
You can parallelize on the driver side. The way to do it is almost
exactly what you have here, where you're iterating over a local Scala
collection of dates and invoking a Spark operation for each. Simply
write dateList.par.map(...) to make the local map proceed in
parallel. It should invoke the
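Roughly, as a sketch (the paths and the transform function are illustrative):

    dateList.par.map { date =>
      sc.textFile("hdfs:///input/" + date)
        .map(transform)
        .saveAsTextFile("hdfs:///output/" + date)
    }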
Again this is probably not the place for CDH-specific questions, and
this one is already answered at
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/CDH-5-3-0-container-cannot-be-fetched-because-of/m-p/23497#M478
On Fri, Jan 9, 2015 at 9:23 AM, Mukesh Jha me.mukesh@gmail.com
The repartitioning has some overhead. Do your results show that
perhaps this is what is taking the extra time? You can try reading the
source file with more partitions instead of partitioning it after the
fact. See the minPartitions argument to loadLibSVMFile.
But before that, I'd assess whether
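For example (a sketch; the path and partition count are illustrative):

    import org.apache.spark.mllib.util.MLUtils

    // numFeatures = -1 lets MLlib determine the feature count from the data.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm", -1, 64)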
Thanks, I noticed this after posting. I'll try that.
I also think that perhaps Clojure might be creating more classes than the
equivalent Java would, so I'll nudge it a bit higher.
On 9 January 2015 at 11:45, Sean Owen so...@cloudera.com wrote:
It's normal for PermGen to be a bit more of an
Hi Michael,
I have got the directory based column support working at least in a trial. I
have put the trial code here - DirIndexParquet.scala
https://github.com/MickDavies/spark-parquet-dirindex/blob/master/src/main/scala/org/apache/spark/sql/parquet/DirIndexParquet.scala
it has involved me
Hi guys, I am running the following example:
https://github.com/knoldus/Play-Spark-Scala on the same machine as the
Spark master, and the Spark cluster was launched with the ec2 script.
I'm stuck with these errors; any idea how to fix them?
Regards
Eduardo
Calling the Play app prints the following
I am writing my first project in Spark. It is an implementation of the
Space-Saving counting algorithm.
I am trying to understand how tasks are executed in partitions.
As you can see from my code, the algorithm keeps in memory only a small
number of words, for example 100: the top-k ones, not all
So I had a Spark job with various failures, and I decided to kill it and
start again. I clicked the 'kill' link in the web console, restarted the
job on the command line and headed back to the web console and refreshed to
see how my job was doing... the URL at the time was:
(FWIW yes I think this should certainly be a POST. The link can become
a miniature form to achieve this and then the endpoint just needs to
accept POST only. You should propose a pull request.)
On Fri, Jan 9, 2015 at 12:51 PM, Joe Wass jw...@crossref.org wrote:
So I had a Spark job with various
Please find the attached worker log.
I could see stream closed exception
On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng men...@gmail.com wrote:
Could you attach the executor log? That may help identify the root
cause. -Xiangrui
On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch
Hello,
I tried to save a table created via the Hive context as a Parquet file, but
whatever compression codec (uncompressed, snappy, gzip or lzo) I set via
setConf like:
setConf("spark.sql.parquet.compression.codec", "gzip")
the size of the generated files is always the same, so it seems like
We were able to resolve *SparkException: Job aborted due to stage failure: All
masters are unresponsive! Giving up* as well. Spark-jobserver is working fine
now and we need to experiment more.
Thank you guys.
Hi,
As you said, --executor-cores defines the max number of tasks that
an executor can run simultaneously. So, if you claim 10 cores, it is not
possible to launch more than 10 tasks in an executor at the same time.
In my experience, setting more cores than physical CPU cores will
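As an illustration (a sketch for YARN; the numbers, class name and jar are made up), the total number of concurrent tasks is bounded by executors times cores per executor:

    spark-submit --master yarn \
      --num-executors 10 \
      --executor-cores 10 \
      --class com.example.MyApp myapp.jar
    # up to 10 x 10 = 100 tasks can run at the same time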
Hi Spark community,
I have a problem with zipping two RDDs of the same size and same number
of partitions.
The error message says that zipping is only allowed on RDDs which are
partitioned into chunks of exactly the same sizes.
How can I ensure this? My workaround at the moment is to repartition
Awesome, it actually seems to work. Amazing how simple it can be
sometimes...
Thanks Sean!
On Fri, Jan 9, 2015 at 12:42 PM, Sean Owen so...@cloudera.com wrote:
You can parallelize on the driver side. The way to do it is almost
exactly what you have here, where you're iterating over a local
It's normal for PermGen to be a bit more of an issue with Spark than
for other JVM-based applications. You should simply increase the
PermGen size, which I don't see in your command. -XX:MaxPermSize=256m
allows it to grow to 256m for example. The right size depends on your
total heap size and app.
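For example, a sketch of passing the flag to both driver and executors (the jar name is illustrative; adjust the size to your heap and app):

    spark-submit \
      --driver-java-options "-XX:MaxPermSize=256m" \
      --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=256m \
      --class mything.core myapp.jar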
Thanks for the suggestion -- however, it looks like this is even slower. With
the small data set I'm using, my aggregate function takes ~9 seconds and
colStats.mean() takes ~1 minute. However, I can't get it to run with
the Kryo serializer -- I get the error:
I'm running on an AWS cluster of 10 x m1.large (64 bit, 7.5 GiB RAM). FWIW
I'm using the Flambo Clojure wrapper which uses the Java API but I don't
think that should make any difference. I'm running with the following
command:
spark/bin/spark-submit --class mything.core --name "My Thing" --conf
Hey,
Let's say we have multiple independent jobs that each transform some data
and store it in distinct HDFS locations. Is there a nice way to run them in
parallel? See the following pseudo code snippet:
dateList.map(date =>
sc.hdfsFile(date).map(transform).saveAsHadoopFile(date))
It's unfortunate
Hi Tim,
Thanks for your response.
The benchmark I used just reads data in from HDFS and builds the Linear
Regression model using methods from the MLlib.
Unfortunately, for various reasons, I can't open the source code for the
benchmark at this time.
I will try to replicate the problem using
Hey Nathan,
Thanks for sharing, this is a very interesting post :) My comments are
inlined below.
Cheng
On 1/7/15 11:53 AM, Nathan McCarthy wrote:
Hi,
I’m trying to use a combination of SparkSQL and ‘normal' Spark/Scala
via rdd.mapPartitions(…). Using the latest release 1.2.0.
Simple
Hi,
Is there a way of automatically cleaning up the spark.local.dir after a job has
been run? I have noticed a large number of temporary files have been stored
here and are not cleaned up. The only solution I can think of is to run some
sort of cron job to delete files older than a few days. I