Try spark.yarn.user.classpath.first (see
https://issues.apache.org/jira/browse/SPARK-2996 - only works for YARN).
Also thread at
http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html.
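For reference, a minimal sketch of setting that property when building the context (assuming you submit to YARN; the property name is the one from SPARK-2996):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ask YARN to put the user's jars ahead of the cluster classpath,
// so your Guava 15.0 wins over Hadoop's bundled version.
val conf = new SparkConf()
  .setAppName("guava-conflict-workaround") // assumption: any app name works
  .set("spark.yarn.user.classpath.first", "true")

val sc = new SparkContext(conf)
```

The same flag can also be passed on the command line via `--conf spark.yarn.user.classpath.first=true`.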
HTH,
Markus
On 02/03/2015 11:20 PM, Corey Nolet wrote:
Corey,
Which version of Spark do you use? I am using Spark 1.2.0, and guava 15.0.
It seems fine.
Best,
Bo
On Tue, Feb 3, 2015 at 8:56 PM, M. Dale medal...@yahoo.com.invalid wrote:
Try spark.yarn.user.classpath.first (see
https://issues.apache.org/jira/browse/SPARK-2996 - only works for
I'm having a really bad dependency conflict right now with Guava versions
between my Spark application in Yarn and (I believe) Hadoop's version.
The problem is, my driver has the version of Guava which my application is
expecting (15.0) while it appears the Spark executors that are working on
my
Hi,
I've been trying to use HiveContext (instead of SQLContext) in my Spark SQL
application, and when I run the application concurrently, it only works on
the first call; every other call throws the following error:
ERROR Datastore.Schema: Failed initialising database.
Failed to start
I have a cluster running CDH 5.1.0 with the Spark component.
Because the default version of Spark in CDH 5.1.0 is 1.0.0 and I want to
use some features of Spark 1.2.0, I compiled another Spark with Maven.
But when I started spark-shell and created a new SparkContext, I hit the
below error:
Use SparkContext#union[T](rdds: Seq[RDD[T]])
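The suggestion above, building a single union instead of chaining, looks roughly like this (a sketch; `rdds` stands in for whatever sequence of RDDs you have):

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Chaining rdd1.union(rdd2).union(rdd3)... adds one lineage level per call,
// which can overflow the stack for many RDDs. SparkContext.union builds a
// single UnionRDD over the whole sequence instead.
def combine[T: ClassTag](sc: SparkContext, rdds: Seq[RDD[T]]): RDD[T] =
  sc.union(rdds)
```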
On Tue, Feb 3, 2015 at 7:43 PM, Thomas Kwan thomas.k...@manage.com wrote:
I am trying to combine multiple RDDs into 1 RDD, and I am using the union
function. I wonder if anyone has seen StackOverflowError as follows:
Exception in thread "main"
Spark doesn't support it, but this connector is open source; you can get it
from GitHub.
The difference between these two DBs depends on what type of solution
you are looking for. Please refer to this link:
http://blog.nahurst.com/visual-guide-to-nosql-systems
FYI, from the list of NOSQL in
I am trying to combine multiple RDDs into 1 RDD, and I am using the union
function. I wonder if anyone has seen StackOverflowError as follows:
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at
Hi Sean,
I'm interested in trying something similar. How was your performance when you
had many concurrent queries running against spark? I know this will work well
where you have a low volume of queries against a large dataset, but am
concerned about having a high volume of queries against
Hi Ningjun,
I have been working with Spark 1.2 on Windows 7 and Windows 2008 R2 (purely
for development purposes). I had most recently installed them utilizing
Java 1.8, Scala 2.10.4, and Spark 1.2 Precompiled for Hadoop 2.4+. A handy
thread concerning the null\bin\winutils issue is addressed
I have 3 text files in HDFS which I am reading using Spark SQL and
registering as tables. After that I am doing almost 5-6 operations,
including joins, group by, etc., and this whole process takes only 6-7
secs. (Source file size: 3 GB, with almost 20 million rows.)
As a final step of
Hello Akhil,
Thank you for taking your time for a detailed answer. I managed to solve it
in a very similar manner.
Kind regards,
Emre Sevinç
On Mon, Feb 2, 2015 at 8:22 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Hi Emre,
This is how you do that in scala:
val lines =
Already come up several times today:
https://issues.apache.org/jira/browse/SPARK-5557
On Tue, Feb 3, 2015 at 8:04 AM, Night Wolf nightwolf...@gmail.com wrote:
Hi,
I just built Spark 1.3 master using maven via make-distribution.sh;
./make-distribution.sh --name mapr3 --skip-java-test --tgz
Hi Everyone,
Is LogisticRegressionWithSGD in MLlib scalable?
If so, what is the idea behind the scalable implementation?
Thanks in advance,
Peng
-
Peng Zhang
--
View this message in context:
Hi all,
I'm trying to run the master version of spark in order to test some alpha
components in ml package.
I follow the build spark documentation and build it with :
$ mvn clean package
The build is successful, but when I try to run spark-shell I get the
following error:
*Exception in
Hi,
Anyone has implemented the default Pig Loader in Spark? (loading delimited
text files with .pig_schema)
Thanks,
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github Blog: http://huangjs.github.com/
Hi All,
I have a requirement where I need to consume messages from ActiveMQ and do
live stream processing as well as batch processing using Spark. Is there a
spark-plugin or library that can enable this? If not, then do you know any
other way this could be done?
Regards
Mohit
Nitin,
Suing Spark is not going to help. Perhaps you should sue someone else :-) Just
kidding!
Mohammed
-Original Message-
From: nitinkak001 [mailto:nitinkak...@gmail.com]
Sent: Tuesday, February 3, 2015 1:57 PM
To: user@spark.apache.org
Subject: Re: Sort based shuffle not working
Hm, I don't think the sort partitioner is going to cause the result to
be ordered by c1,c2 if you only partitioned on c1. I mean, it's not
even guaranteed that the type of c2 has an ordering, right?
On Tue, Feb 3, 2015 at 3:38 PM, nitinkak001 nitinkak...@gmail.com wrote:
I am trying to implement
Hi Gen
Thanks for your feedback. We do have a business reason to run spark on windows.
We have an existing application that is built on C# .NET running on windows. We
are considering adding spark to the application for parallel processing of
large data. We want spark to run on windows so it
You could also just push the data to Amazon S3, which would un-link the
size of the cluster needed to process the data from the size of the data.
DR
On 02/03/2015 11:43 AM, Joe Wass wrote:
I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need
to store the input in HDFS
Hi,
After some research I have decided that Spark (SQL) would be ideal for
building an OLAP engine. My goal is to push aggregated data (to Cassandra
or other low-latency data storage) and then be able to project the results
on a web page (web service). New data will be added (aggregated) once a
The version I'm using was already pre-built for Hadoop 2.3.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-java-lang-IllegalArgumentException-Invalid-rule-tp21382p21485.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need
to store the input in HDFS somehow.
I currently have a cluster of 5 x m3.xlarge, each of which has 80GB disk.
Each HDFS node reports 73 GB, and the total capacity is ~370 GB.
If I want to process 800 GB of data (assuming
Hi,
I am using Spark 0.9.1 and I am looking for a proper viz tools that
supports that specific version. As far as I have seen all relevant tools
(e.g. spark-notebook, zeppelin-project etc) only support 1.1 or 1.2; no
mentions about older versions of Spark. Any ideas or suggestions?
this is more of a scala question, so probably next time you'd like to
address a Scala forum eg. http://stackoverflow.com/questions/tagged/scala
val optArrStr:Option[Array[String]] = ???
optArrStr.map(arr => arr.mkString(",")).getOrElse("") // empty string or
whatever default value you have for this.
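Putting those pieces together, here is a self-contained sketch (plain Scala, no Spark needed) of turning one record of the `(String, (Array[String], Option[Array[String]]))` element type into a CSV line; the `toCsv` helper and the sample values are illustrative, not from the thread:

```scala
// One record of the poster's RDD element type.
val record: (String, (Array[String], Option[Array[String]])) =
  ("id1", (Array("a", "b"), Some(Array("x", "y"))))

// Flatten a record to a CSV line; a missing Option becomes an empty field.
def toCsv(rec: (String, (Array[String], Option[Array[String]]))): String = {
  val (key, (arr, optArr)) = rec
  val tail = optArr.map(_.mkString(",")).getOrElse("")
  key + "," + arr.mkString(",") + "," + tail
}

// On the real RDD this would be: myrdd.map(toCsv).saveAsTextFile("hdfs://...")
```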
I have a RDD which is of type
org.apache.spark.rdd.RDD[(String, (Array[String], Option[Array[String]]))]
I want to write it as a csv file.
Please suggest how this can be done.
myrdd.map(line => (line._1 + "," + line._2._1.mkString(",") + "," +
line._2._2.mkString(","))).saveAsTextFile("hdfs://...")
Thanks Gerard !!
This is working.
On Tue, Feb 3, 2015 at 6:44 PM, Gerard Maas gerard.m...@gmail.com wrote:
this is more of a scala question, so probably next time you'd like to
address a Scala forum eg. http://stackoverflow.com/questions/tagged/scala
val optArrStr:Option[Array[String]] =
Hello Adamantios,
Thanks for the poke and the interest.
Actually, you're the second person asking about backporting it. Yesterday (late),
I created a branch for it... and the simple local Spark test worked! \o/.
However, it'll be the 'old' UI :-/. Since I didn't port the code using
Play 2.2.6 to the
Hi,
I just built Spark 1.3 master using maven via make-distribution.sh;
./make-distribution.sh --name mapr3 --skip-java-test --tgz -Pmapr3 -Phive
-Phive-thriftserver -Phive-0.12.0
When trying to start the standalone spark master on a cluster I get the
following stack trace;
15/02/04 08:53:56
Yes, I see this too. I think the Jetty shading still needs a tweak.
It's not finding the servlet API classes. Let's converge on SPARK-5557
to discuss.
On Tue, Feb 3, 2015 at 2:04 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi all,
I'm trying to run the master version of spark in order to
I think this is a separate issue with how the EdgeRDDImpl partitions
edges. If you can merge this change in and rebuild, it should work:
https://github.com/apache/spark/pull/4136/files
If you can't, I just called the Graph.partitionBy() method right after
constructing my graph but before
The data is coming from S3 in the first place, and the results will be
uploaded back there. But even in the same availability zone, fetching 170
GB (that's gzipped) is slow. From what I understand of the pipelines,
multiple transforms on the same RDD might involve re-reading the input,
which very
We use S3 as a main storage for all our input data and our generated
(output) data. (10's of terabytes of data daily.) We read gzipped data
directly from S3 in our Hadoop/Spark jobs - it's not crazily slow, as
long as you parallelize the work well by distributing the processing
across enough
Using s3a protocol (introduced in hadoop 2.6.0) would be faster compared to
s3.
The upcoming hadoop 2.7.0 contains some bug fixes for s3a.
FYI
On Tue, Feb 3, 2015 at 9:48 AM, David Rosenstrauch dar...@darose.net
wrote:
We use S3 as a main storage for all our input data and our generated
Hi,
Any thoughts ?
Thanks,
On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Spark 1.2
SchemaRDD has schema with decimal columns created like
x1 = new StructField("a", DecimalType(14,4), true)
x2 = new StructField("b", DecimalType(14,4), true)
Registering as SQL
I have about 500 MB of data and I'm trying to process it on a single
`local` instance. I'm getting an Out of Memory exception. Stack trace at
the end.
Spark 1.1.1
My JVM has -Xmx2g
spark.driver.memory = 1000M
spark.executor.memory = 1000M
spark.kryoserializer.buffer.mb = 256
We have gone down a similar path at Webtrends, Spark has worked amazingly well
for us in this use case. Our solution goes from REST, directly into spark, and
back out to the UI instantly.
Here is the resulting product in case you are curious (and please pardon the
self promotion):
Thanks very much, that's good to know, I'll certainly give it a look.
Can you give me a hint about how you unzip your input files on the fly? I
thought that it wasn't possible to parallelize zipped inputs unless they
were unzipped before passing to Spark?
Joe
On 3 February 2015 at 17:48, David
Hi Folks,
I'm new to GraphX and Scala and my sendMsg function needs to index into an
input list to my algorithm based on the pregel()() iteration number, but I
don't see a way to access that. I see in
Write out the rdd to a cassandra table. The datastax driver provides
saveToCassandra() for this purpose.
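A sketch of what that looks like with the DataStax spark-cassandra-connector (the keyspace `ks`, table `agg`, column names, and host are assumptions; the connector must be on the classpath):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("olap-writer")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local node

val sc = new SparkContext(conf)

// Each tuple maps onto the named columns of the existing table.
val aggregated = sc.parallelize(Seq(("2015-02-03", "pageviews", 1234L)))
aggregated.saveToCassandra("ks", "agg", SomeColumns("day", "metric", "value"))
```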
On Tue Feb 03 2015 at 8:59:15 AM Adamantios Corais
adamantios.cor...@gmail.com wrote:
Hi,
After some research I have decided that Spark (SQL) would be ideal for
building an OLAP engine.
I'll add that I usually just do
println(query.queryExecution)
On Tue, Feb 3, 2015 at 11:34 AM, Michael Armbrust mich...@databricks.com
wrote:
You should be able to do something like:
sbt -Dscala.repl.maxprintstring=64000 hive/console
Here's an overview of catalyst:
Adamantios,
As said, I backported it to 0.9.x and now it's pushed on this branch:
https://github.com/andypetrella/spark-notebook/tree/spark-0.9.x.
I didn't create a dist atm, because I'd prefer to do it only if necessary
:-).
So, if you want to try it out, just clone the repo and check out this
I don't think it's possible to access. What I've done before is send the
current or next iteration index with the message, where the message is a
case class.
HTH
Dan
On Tue, Feb 3, 2015 at 10:20 AM, Matthew Cornell corn...@cs.umass.edu
wrote:
Hi Folks,
I'm new to GraphX and Scala and my
You should be able to do something like:
sbt -Dscala.repl.maxprintstring=64000 hive/console
Here's an overview of catalyst:
https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit#heading=h.vp2tej73rtm2
On Tue, Feb 3, 2015 at 1:37 AM, Mick Davies
Not all of our input files are zipped. The ones that are obviously are
not parallelized - they're just processed by a single task. Not a big
issue for us, though, as those zipped files aren't too big.
DR
On 02/03/2015 01:08 PM, Joe Wass wrote:
Thanks very much, that's good to know,
This is an excerpt from the design document of the implementation of sort-based
shuffle. I am thinking I might be wrong in my perception of sort-based
shuffle; I don't completely understand it, though.
*Motivation*
A sort-based shuffle can be more scalable than Spark's current hash-based
one because
To be clear, there is no distinction between partitions and blocks for RDD
caching (each RDD partition corresponds to 1 cache block). The distinction
is important for shuffling, where by definition N partitions are shuffled
into M partitions, creating N*M intermediate blocks. Each of these blocks
Thank you!
This is very helpful.
-Mike
From: Aaron Davidson ilike...@gmail.com
To: Imran Rashid iras...@cloudera.com
Cc: Michael Albert m_albert...@yahoo.com; Sean Owen so...@cloudera.com;
user@spark.apache.org user@spark.apache.org
Sent: Tuesday, February 3, 2015 6:13 PM
Subject: Re:
Thanks for the explanations, makes sense. For the record, it looks like this
was worked on a while back (and maybe the work is even close to a solution?)
https://issues.apache.org/jira/browse/SPARK-1476
and perhaps an independent solution was worked on here?
I thought that's what sort-based shuffle did: sort the keys going to the
same partition.
I have tried (c1, c2) as (Int, Int) tuple as well. I don't think that
ordering of c2 type is the problem here.
On Tue, Feb 3, 2015 at 5:21 PM, Sean Owen so...@cloudera.com wrote:
Hm, I don't think the sort
cc dev list
How are you saving the data? There are two relevant 2GB limits:
1. Caching
2. Shuffle
For caching, a partition is turned into a single block.
For shuffle, each map partition is partitioned into R blocks, where R =
the number of reduce tasks. It is unlikely a shuffle block exceeds 2 GB,
Michael,
you are right, there is definitely some limit at 2GB. Here is a trivial
example to demonstrate it:
import org.apache.spark.storage.StorageLevel
val d = sc.parallelize(1 to 1e6.toInt, 1).map { i => new
Array[Byte](5e3.toInt) }.persist(StorageLevel.DISK_ONLY)
d.count()
It gives the same
Hi,
I want to increase the maxPrintString of the Spark REPL to look at SQL query
plans, as they are truncated by default at 800 chars, but I don't know how to
set this. You don't seem to be able to do it in the same way as you would
with the Scala REPL.
Anyone know how to set this?
Also anyone
Hi,
To also process older files, you can use fileStream instead of
textFileStream. It has a parameter that tells it to look for files that are
already present.
For deleting the processed files, one way is to get the list of all files in
the dStream. This can be done by using the foreachRDD API of
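A sketch of the fileStream variant described above (the directory path and batch interval are assumptions; passing `newFilesOnly = false` makes it pick up files already in the directory at start):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(30)) // sc: an existing SparkContext

// Unlike textFileStream, fileStream exposes a path filter and a newFilesOnly
// flag; false means files already present are processed too.
val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///incoming", (p: Path) => true, newFilesOnly = false)
  .map(_._2.toString)
```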
Greetings!
Thanks for the response.
Below is an example of the exception I saw. I'd rather not post code at the
moment, so I realize it is completely unreasonable to ask for a
diagnosis. However, I will say that adding a partitionBy() was the last change
before this error was created.
Thanks for
I am not sure that this will help you, but in my situation I could not see
any input in the terminal after some work finished via spark-shell; I ran
the command `stty echo`, and it fixed the problem.
Best,
Amoners
I want to write a whole SchemaRDD to a single file in HDFS but am facing the
following exception:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on /test/data/data1.csv (inode 402042): File does not exist. Holder
DFSClient_NONMAPREDUCE_-564238432_57
In case anyone needs to merge all of their part-n files (small result
set only) into a single *.csv file or needs to generically flatten case
classes, tuples, etc., into comma separated values:
http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/
On Tue Feb 03 2015 at 8:23:59 AM
Greetings!
First, my sincere thanks to all who have given me advice. Following previous
discussion, I've rearranged my code to try to keep the partitions to more
manageable sizes. Thanks to all who commented.
At the moment, the input set I'm trying to work with is about 90GB (avro
parquet
Hey Joe,
With the ephemeral HDFS, you get the instance store of your worker nodes.
For m3.xlarge that will be two 40 GB SSDs local to each instance, which are
very fast.
For the persistent HDFS, you get whatever EBS volumes the launch script
configured. EBS volumes are always network drives, so
I am trying to implement secondary sort in spark as we do in map-reduce.
Here is my data (tab separated, with columns c1, c2, c3):
c1  c2  c3
1   2   4
1   3   6
2   4   7
2   6   8
3   5   5
3   1   8
3   2   0
To do secondary sort, I
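The ordering being asked for here, primary key c1 with c2 sorted within each c1 group, can be illustrated in plain Scala (no Spark; in a real job this ordering would be applied per partition):

```scala
// The sample rows from the message, as (c1, c2, c3) tuples.
val rows = Seq((1,2,4), (1,3,6), (2,4,7), (2,6,8), (3,5,5), (3,1,8), (3,2,0))

// Secondary sort: compare by c1 first, then by c2 within equal c1.
val sorted = rows.sortBy { case (c1, c2, _) => (c1, c2) }
```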
Just to add, I am suing Spark 1.1.0
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Sort-based-shuffle-not-working-properly-tp21487p21488.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
A great presentation by Evan Chan on utilizing Cassandra as Jonathan noted
is at: OLAP with Cassandra and Spark
http://www.slideshare.net/EvanChan2/2014-07olapcassspark.
On Tue Feb 03 2015 at 10:03:34 AM Jonathan Haddad j...@jonhaddad.com wrote:
Write out the rdd to a cassandra table. The
That is fairly out of date (we used to run some of our jobs on it ... But
that is forked off 1.1 actually).
Regards
Mridul
On Tuesday, February 3, 2015, Imran Rashid iras...@cloudera.com wrote:
Thanks for the explanations, makes sense. For the record looks like this
was worked on a while