Hi Gianluca,
I believe your cluster setup wasn't complete. Do check the EC2 script
console output for more details. Also, micro instances have only about 600 MB
of memory.
Thanks
Best Regards
On Tue, Jun 3, 2014 at 1:59 AM, Gianluca Privitera
gianluca.privite...@studio.unibo.it wrote:
Hi everyone,
I am not sure. I have just been using some numerical datasets.
I asked several people; no one seems to believe that we can do this:
$ PYTHONPATH=/path/to/assembly/jar python
import pyspark
That is because people usually don't package Python files into their jars.
For PySpark, however, this will work as long as the jar can be opened and
its contents can
Yes. MLlib 1.0 supports sparse input data for linear methods. -Xiangrui
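As a minimal sketch of what that can look like (feature indices and values below are made up, and sc is the shell's SparkContext):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Sparse vectors of size 10 with a couple of non-zero entries each.
val p1 = LabeledPoint(1.0, Vectors.sparse(10, Array(1, 4), Array(0.5, 2.0)))
val p2 = LabeledPoint(0.0, Vectors.sparse(10, Array(0, 7), Array(1.0, 0.3)))
val model = LogisticRegressionWithSGD.train(sc.parallelize(Seq(p1, p2)), numIterations = 20)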
On Mon, Jun 2, 2014 at 11:36 PM, praveshjain1991
praveshjain1...@gmail.com wrote:
I am not sure. I have just been using some numerical datasets.
- Hi all,
- The running and completed application counts never get updated; they are
always zero. I have run the
- SparkPi application at least 10 times. Please help.
-
- *Workers:* 3
- *Cores:* 24 Total, 0 Used
- *Memory:* 43.7 GB Total, 0.0 B Used
- *Applications:* 0 Running, 0
I need to partition my data into equally weighted partitions: suppose I have
20 GB of data and I want 4 partitions, each holding 5 GB of the data.
Thanks
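One rough way to get there, as a sketch (data stands for the 20 GB RDD; repartition spreads records approximately evenly rather than into exact 5 GB pieces):

// Shuffle the data into 4 roughly equal partitions.
val balanced = data.repartition(4)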
Your applications are probably not connecting to your existing cluster and
instead running in local mode. Are you passing the master URL to the
SparkPi application?
Andrew
On Tue, Jun 3, 2014 at 12:30 AM, MrAsanjar . afsan...@gmail.com wrote:
- HI all,
- Application running and
Thanks for your reply, Andrew. I am running applications directly on the
master node. My cluster also contains three worker nodes, all visible
in the web UI.
Spark Master at spark://sanjar-local-machine-1:7077
- *URL:* spark://sanjar-local-machine-1:7077
- *Workers:* 3
- *Cores:* 24 Total,
As Andrew said, your application is probably running in local mode rather than
on the standalone cluster. You need to pass
MASTER=spark://sanjar-local-machine-1:7077
before running your SparkPi example.
Thanks
Best Regards
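Equivalently, a sketch of setting the master inside the application itself (Spark 1.0 style) instead of via the MASTER environment variable:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://sanjar-local-machine-1:7077")  // the standalone master shown in the web UI
  .setAppName("SparkPi")
val sc = new SparkContext(conf)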
On Tue, Jun 3, 2014 at 1:12 PM, MrAsanjar . afsan...@gmail.com wrote:
Thanks for your reply Andrew. I am
Thanks guys, that fixed my problem. As you might have noticed, I am VERY
new to Spark. Building a Spark cluster using LXC has been a challenge.
On Tue, Jun 3, 2014 at 2:49 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
As Andrew said, your application is running on Standalone mode. You need
Ah, the output directory check was just not executed in the past. I
thought it deleted the files. A third way indeed.
FWIW I also think (B) is best. (A) and (C) both have their risks, but
if they're non-default and everyone's willing to entertain a new arg
to the API method, sure. (A) seems more
Hi All,
I have created a Spark cluster on EC2 using the spark-ec2 script. Whenever there
is more data to be processed I would like to add new slaves to the existing
cluster, and remove slave nodes when the amount of data to be processed is
low.
It seems spark-ec2 currently doesn't have an option to
Hi,
My Spark installations (both 0.9.1 and 1.0.0) start up extremely slowly when
starting a simple Spark Streaming job.
I have to wait 6 (!) minutes at
INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block
manager
stage and another 4 (!) minutes at
INFO util.MetadataCleaner:
I tried to use Kryo as a serializer in Spark Streaming, and did everything
according to the guide posted on the Spark website, i.e. added the following
lines:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.set("spark.kryo.registrator", "MyKryoRegistrator");
I also added
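For reference, a minimal sketch of a registrator matching the spark.kryo.registrator setting above (the class being registered is a placeholder, not from the original post):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

case class MyEvent(id: Long, payload: String)  // placeholder for the classes you actually serialize

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyEvent])
  }
}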
Thanks for that, Matei! I'll look at that once I get a spare moment. :-)
If you like, I'll keep documenting my newbie problems and frustrations...
perhaps it might make things easier for others.
Another issue I seem to have found (now that I can get small clusters up):
some of the examples (the
You might want to look at another great plugin: “sbt-pack”
https://github.com/xerial/sbt-pack.
It collects all the dependency JARs and creates launch scripts for *nix
(including Mac OS) and Windows.
HTH
Pierre
On 02 Jun 2014, at 17:29, Andrei faithlessfri...@gmail.com wrote:
Thanks!
I am using the following code segment:
countPerWindow.foreachRDD(new Function<JavaPairRDD<String, Long>, Void>()
{
@Override
public Void call(JavaPairRDD<String, Long> rdd) throws Exception
{
Comparator<Tuple2<String, Long>> comp = new
Hi all,
Is it possible to run a standalone app that would compute and persist/cache
an RDD and then run other standalone apps that would gain access to that
RDD?
--
Thank you,
Oleg
Sorry if I'm dense, but is OptimisingSort your class? It's saying you
have included something from it in a function that is shipped off to
remote workers, but something in it is not java.io.Serializable.
OptimisingSort$6$1 needs to be Serializable.
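The usual fix, sketched here in Scala even though the trace looks like Java (OptimisingSort's details are unknown, so the names below are hypothetical), is to keep the function in a small serializable object instead of an anonymous inner class that drags its enclosing instance into the closure:

import org.apache.spark.SparkContext._  // pair/ordered RDD operations when outside the shell

object SortKeys extends Serializable {
  def keyFor(s: String): Int = s.length  // stand-in for whatever key OptimisingSort uses
}

// Only SortKeys travels with the closure, not the outer (non-serializable) class.
val lines = sc.textFile("/path/to/input")
val sorted = lines.map(s => (SortKeys.keyFor(s), s)).sortByKey().values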
On Tue, Jun 3, 2014 at 2:23 PM, nilmish
Hi
Is there any way to prepare a Spark executor? Like what we do in MapReduce,
where we implement a setup and a cleanup method.
In my case, I need this prepare method to initialize StaticParser based on the
environment (dev, production). Then I can use this StaticParser directly on the
executor, like this:
object
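One common pattern, sketched below (the Parser class and APP_ENV variable are made up), is to initialize the parser lazily inside a singleton so that each executor JVM sets it up once, on first use:

class Parser(env: String) extends Serializable {
  def parse(line: String): String = "[" + env + "] " + line  // stand-in parsing logic
}

object StaticParser {
  // Evaluated once per executor JVM, the first time a task touches it.
  lazy val instance: Parser = new Parser(sys.env.getOrElse("APP_ENV", "dev"))
}

val records = sc.textFile("/path/to/input")
val parsed = records.map(line => StaticParser.instance.parse(line))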
I don't think that's supported by default: when the standalone context
closes, the related RDDs will be GC'ed.
You should explore the Spark Job Server, which allows you to cache RDDs by name
and reuse them within a context.
https://github.com/ooyala/spark-jobserver
-kr, Gerard.
On Tue, Jun 3,
I set up Spark-0.9.1 to run on mesos-0.13.0 using the steps mentioned here
https://spark.apache.org/docs/0.9.1/running-on-mesos.html . The Mesos UI
is showing two workers registered. I want to run these commands on
Spark-shell
scala val data = 1 to 1 data:
Is it on purpose that when setting SPARK_CONF_DIR, spark-submit still loads
the properties file from SPARK_HOME/conf/spark-defaults.conf?
IMO it would be more natural to override what is defined in SPARK_HOME/conf
with SPARK_CONF_DIR when defined (and SPARK_CONF_DIR being overridden by
command line
1. Make sure the spark-*.tgz that you created with make_distribution.sh is
accessible to all the slave nodes.
2. Check the worker node logs.
Thanks
Best Regards
On Tue, Jun 3, 2014 at 8:13 PM, praveshjain1991 praveshjain1...@gmail.com
wrote:
I set up Spark-0.9.1 to run on mesos-0.13.0
Hi All,
there is information in the Spark 1.0.0 documentation that
there is an option --cores that one can use to set the number of cores
that spark-shell uses on the cluster:
You can also pass an option --cores numCores to control the number of
cores that spark-shell uses on the cluster.
This
I haven't been able to set the cores with that option in Spark 1.0.0 either.
To work around that, setting the environment variable:
SPARK_JAVA_OPTS=-Dspark.cores.max=numCores seems to do the trick.
Matt Kielo
Data Scientist
Oculus Info Inc.
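Another equivalent, as a sketch, is to set the property on the SparkConf before creating the context (master URL and core cap here are made up):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")  // hypothetical standalone master
  .setAppName("capped-cores-app")
  .set("spark.cores.max", "8")       // total cores the application may take on the cluster
val sc = new SparkContext(conf)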
On Tue, Jun 3, 2014 at 11:15 AM, Marek Wiewiorka
That used to work with version 0.9.1 and earlier and does not seem to work
with 1.0.0.
M.
2014-06-03 17:53 GMT+02:00 Mikhail Strebkov streb...@gmail.com:
Try -c numCores instead, works for me, e.g.
bin/spark-shell -c 88
On Tue, Jun 3, 2014 at 8:15 AM, Marek Wiewiorka
Ah, this is a bug that was fixed in 1.0.
I think you should be able to work around it by using a fake class tag:
scala.reflect.ClassTag$.MODULE$.AnyRef()
On Mon, Jun 2, 2014 at 8:22 PM, bluejoe2008 bluejoe2...@gmail.com wrote:
Spark 0.9.1
textInput is a JavaRDD object
I am programming in
Hi Suela,
(Please subscribe to our user mailing list and send your questions there
in the future.) For your case, each file contains a column of numbers.
So you can use `sc.textFile` to read them first, zip them together,
and then create labeled points:
val xx = sc.textFile("/path/to/ex2x.dat").map(x
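Filled out, the recipe might look roughly like this (a sketch; it assumes one number per line and that the two files line up, which zip requires):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val xx = sc.textFile("/path/to/ex2x.dat").map(_.trim.toDouble)
val yy = sc.textFile("/path/to/ex2y.dat").map(_.trim.toDouble)
val points = xx.zip(yy).map { case (x, y) => LabeledPoint(y, Vectors.dense(x)) }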
Hi All,
I'm trying to run my code that used to work with mesos-0.14 and spark-0.9.0
with mesos-0.18.2 and spark-1.0.0, and I'm getting a weird error when I use
coarse-grained mode (see below).
If I use the fine-grained mode everything is ok.
Has anybody of you experienced a similar error?
more stderr
"Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but
interface was expected" is the classic error meaning you compiled
against Hadoop 1 but are running against Hadoop 2.
I think you need to override the hadoop-client artifact that Spark
depends on to be a Hadoop 2.x version.
On Tue,
Hi
I set up a project under Eclipse using Maven:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.0</version>
</dependency>
Simple example fails:
def main(args: Array[String]): Unit = {
Wow! What a quick reply!
adding
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0</version>
</dependency>
solved the problem.
But now I get
14/06/03 19:52:50 ERROR Shell: Failed to locate
Hi,
I have noticed that upon launching a cluster consisting of r3.8xlarge
high-memory instances, the standard /mnt /mnt2 /mnt3 /mnt4 temporary
directories get created and set up for temp usage; however, they point
to the root 8 GB filesystem.
The 2x320 GB SSDs are not mounted and also they are
Hi Andrew,
Thanks for your answer.
The reason for the question: I've been trying to contribute to the community
by helping answer Spark-related questions on Stack Overflow.
(A note on that: given the growing volume on the user list lately, I think it
will need to scale out to other venues, so
You can set an arbitrary properties file by adding the --properties-file
argument to spark-submit. It would be nice to have spark-submit also
look in SPARK_CONF_DIR as well by default. If you opened a JIRA for
that I'm sure someone would pick it up.
On Tue, Jun 3, 2014 at 7:47 AM, Eugen Cepoi
I'd try the internet / SO first -- these are actually generic
Hadoop-related issues. Here I think you don't have HADOOP_HOME or
similar set.
http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
On Tue, Jun 3, 2014 at 5:54 PM, toivoa
Hi,
I've installed Spark 1.0.0 on HDP 2.1.
I moved the hive-site.xml file into the conf directory for Spark in an
attempt to connect Spark with my existing Hive.
Below is the full log from when I start Spark until I get the error. It seems
to be building the assembly with Hive, so that part
It was not intended to be experimental, as this improves general
performance. We have tested the feature since 0.9 and didn't see any problems.
We need to investigate the cause of this. Can you give us the logs showing
this error so that we can analyze it?
TD
On Tue, Jun 3, 2014 at 10:08 AM,
Yeah unfortunately Hadoop 2 requires these binaries on Windows. Hadoop 1 runs
just fine without them.
Matei
On Jun 3, 2014, at 10:33 AM, Sean Owen so...@cloudera.com wrote:
I'd try the internet / SO first -- these are actually generic
Hadoop-related issues. Here I think you don't have
Hi All,
I've been experiencing a very strange error after upgrading from Spark 0.9 to
1.0 - it seems that the saveAsTextFile function is throwing a
java.lang.UnsupportedOperationException that I have never seen before.
Any hints appreciated.
scheduler.TaskSetManager: Loss was due to
Have you tried re-compiling your job against the 1.0 release?
On Tue, Jun 3, 2014 at 8:46 PM, Marek Wiewiorka marek.wiewio...@gmail.com
wrote:
Hi All,
I've been experiencing a very strange error after upgrading from Spark 0.9
to 1.0 - it seems that the saveAsTextFile function is throwing
I think I know what is going on! This is probably a race condition in the
DAGScheduler. I have added a JIRA for this. The fix is not trivial though.
https://issues.apache.org/jira/browse/SPARK-2002
A not-so-good workaround for now would be to not use coalesced RDDs, which
avoids the race condition.
Hello Lars,
Can you check the value of hive.security.authenticator.manager in
hive-site.xml? I guess the value is
org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator. This class was
introduced in Hive 0.13, but Spark SQL is based on Hive 0.12 right now. Can
you change the value of
I'm trying to save an RDD as a Parquet file through the saveAsParquetFile()
API,
With code that looks something like:
val sc = ...
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val someRDD: RDD[SomeCaseClass] = ...
someRDD.saveAsParquetFile("someRDD.parquet")
This thread seems to be about the same issue:
https://www.mail-archive.com/user@spark.apache.org/msg04403.html
On Tue, Jun 3, 2014 at 12:25 PM, k.tham kevins...@gmail.com wrote:
I'm trying to save an RDD as a Parquet file through the
saveAsParquetFile()
API,
With code that looks something
Hmm that sounds like it could be done in a custom OutputFormat, but I'm not
familiar enough with custom OutputFormats to say that's the right thing to
do.
On Tue, Jun 3, 2014 at 10:23 AM, Gerard Maas gerard.m...@gmail.com wrote:
Hi Andrew,
Thanks for your answer.
The reason of the
Oh, I missed that thread. Thanks
So are you using Java 7 or 8?
Java 7 doesn't clean closures properly, so you need to define a static class as
a function and then call that in your operations. Otherwise it'll try to send
the whole class along with the function.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
Thanks a lot,
That worked great!
Thanks,
Lars
On Tue, Jun 3, 2014 at 12:17 PM, Yin Huai huaiyin@gmail.com wrote:
Hello Lars,
Can you check the value of hive.security.authenticator.manager in
hive-site.xml? I guess the value is
You'll have to restart the cluster: create a copy of your existing slave,
add it to the slaves file on the master, and restart the cluster.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Jun 3, 2014 at 4:30 PM, Sirisha Devineni
Did you use Docker or plain LXC specifically?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Jun 3, 2014 at 1:40 PM, MrAsanjar . afsan...@gmail.com wrote:
thanks guys, that fixed my problem. As you might have
Yes, I have - I compiled both Spark and my software from sources. Actually the
whole processing executes fine; just saving the results is failing.
2014-06-03 21:01 GMT+02:00 Gerard Maas gerard.m...@gmail.com:
Have you tried re-compiling your job against the 1.0 release?
On Tue, Jun 3, 2014
I've read through that thread, and it seems that in his case he needed to add
a particular hadoop-client dependency.
However, I don't think I should be required to do that as I'm not reading
from HDFS.
I'm just running a straight up minimal example, in local mode, and out of
the box.
Here's an example
All of that support code uses Hadoop-related classes, like
OutputFormat, to do the writing to Parquet format. There's a Hadoop
code dependency in play here even if the bytes aren't going to HDFS.
On Tue, Jun 3, 2014 at 10:10 PM, k.tham kevins...@gmail.com wrote:
I've read through that thread,
Hey Amit,
You might want to check out PairRDDFunctions
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions.
For your use case in particular, you can load the file as an RDD[(String,
String)] and then use the groupByKey() function in PairRDDFunctions to
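A sketch of that approach (the tab-separated key/value layout is an assumption):

import org.apache.spark.SparkContext._  // brings in groupByKey and other pair-RDD functions

val pairs = sc.textFile("/path/to/input").map { line =>
  val Array(key, value) = line.split("\t", 2)  // assumes one "key<TAB>value" pair per line
  (key, value)                                 // RDD[(String, String)]
}
val grouped = pairs.groupByKey()               // all values collected per key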
On Tue, Jun 3, 2014 at 6:52 AM, sirisha_devineni
sirisha_devin...@persistent.co.in wrote:
Did you open a JIRA ticket for this feature to be implemented in spark-ec2?
If so can you please point me to the ticket?
Just created it: https://issues.apache.org/jira/browse/SPARK-2008
Nick
Hi Mayur, is that closure cleaning a JVM issue or a Spark issue? I'm used
to thinking of closure cleaner as something Spark built. Do you have
somewhere I can read more about this?
On Tue, Jun 3, 2014 at 12:47 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
So are you using Java 7 or 8.
7
Ok, it's a bug in Spark. I've submitted a patch:
https://issues.apache.org/jira/browse/SPARK-2009
On Mon, Jun 2, 2014 at 8:39 PM, Vadim Chekan kot.bege...@gmail.com wrote:
Thanks for looking into this Tathagata.
Are you looking for traces of ReceiveInputDStream.clearMetadata call?
Here is
Hi Tathagata,
Thanks for your help! By not using coalesced RDDs, do you mean not
repartitioning my DStream?
Thanks,
Mike
On Tue, Jun 3, 2014 at 12:03 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
I think I know what is going on! This probably a race condition in the
DAGScheduler. I
I am not sure what DStream operations you are using, but some operation is
internally creating CoalescedRDDs. That is causing the race condition. I
might be able to help if you can tell me what DStream operations you are using.
TD
On Tue, Jun 3, 2014 at 4:54 PM, Michael Chang m...@tellapart.com
Better to build it from parts.
http://www.newegg.com/Product/Product.aspx?Item=N82E16813157497
Passive cooling, and you can fit 16 GB of memory. The one you sent takes
4 GB at most, which won't do.
Pick a small case and be done with it.
On Tue, Jun 3, 2014 at 4:35 PM, Vadim Chekan
Hi all,
I get the following exception when using Spark to run the example k-means
program. I am using Spark 1.0.0 and running the program locally.
java.io.InvalidClassException: scala.Tuple2; invalid descriptor for field _1
at
I have created some extension methods for RDDs in RichRecordRDD and these
are working exceptionally well for me.
However, when looking at the logs, it's impossible to tell what's going on
because all the line number hints point to RichRecordRDD.scala rather than
the code that uses it. For example:
You can use RDD.setName to give it a name. There’s also a creationSite field
that is private[spark] — we may want to add a public setter for that later. If
the name isn’t enough and you’d like this, please open a JIRA issue for it.
Matei
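A quick sketch of that suggestion (the RDD and name are made up):

val records = sc.textFile("/path/to/records")
val enriched = records.map(_.toUpperCase).setName("enriched records")  // this name shows up in the web UI and logs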
On Jun 3, 2014, at 5:22 PM, John Salvatier
A better way seems to be to use ClassTag$.apply(Class).
I'm going by memory since I'm on my phone, but I just did that today.
Gino B.
On Jun 3, 2014, at 11:04 AM, Michael Armbrust mich...@databricks.com wrote:
Ah, this is a bug that was fixed in 1.0.
I think you should be able to
When I run Spark on Cloudera CDH5 with the service spark-master start
command, it turns out that the Spark master is dead and a pid file exists. What
can I do to solve the problem?
Hi,
I guess it should be possible to dig through the scripts
bin/spark-shell, bin/spark-submit etc. and convert them to a long sbt
command that you can run. I just tried
sbt run-main org.apache.spark.deploy.SparkSubmit spark-shell
--class org.apache.spark.repl.Main
but that fails with
Failed
What Java version do you have, and how did you get Spark (did you build it
yourself by any chance or download a pre-built one)? If you build Spark
yourself you need to do it with Java 6 — it’s a known issue because of the way
Java 6 and 7 package JAR files. But I haven’t seen it result in this
Ah, sorry to hear you had more problems. Some thoughts on them:
Thanks for that, Matei! I'll look at that once I get a spare moment. :-)
If you like, I'll keep documenting my newbie problems and frustrations...
perhaps it might make things easier for others.
Another issue I seem to have
You can copy your configuration from the old one. I’d suggest just downloading
it to a different location on each node first for testing, then you can delete
the old one if things work.
On Jun 3, 2014, at 12:38 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
Hi ,
I am currently using
I don't quite get it...
mapPartitionsWithIndex takes a function that maps an integer index and an
iterator to another iterator. How does that help with retrieving the HDFS
file name?
I am obviously missing some context...
Thanks.
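For what it's worth, if the files are small enough to be read whole, sc.wholeTextFiles (available in 1.0) pairs each file name with its contents, which side-steps the question; a sketch:

import org.apache.spark.SparkContext._  // pair-RDD functions when outside the shell

val byFile = sc.wholeTextFiles("hdfs:///path/to/dir")  // RDD[(fileName, fileContents)]
val lineCounts = byFile.mapValues(_.lines.size)        // e.g. number of lines per file, keyed by name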
On May 30, 2014 1:28 AM, Aaron Davidson ilike...@gmail.com wrote:
Hi,
I want to know how I can stop a running SparkContext properly, so that the
next time I start a new SparkContext, the web UI can be launched on the same
port 4040. Right now, when I quit the job using Ctrl+Z, the new SparkContexts
are launched on new ports.
I have the same problem with IPython.
Look in the directory /var/run/spark to see if a spark-master.pid file is
left over from a crashed master, and remove it.
-
--
Theodore Wong <t...@tmwong.org>
www.tmwong.org
When I called KMeans.train(), an error happened:
14/06/04 13:02:29 INFO scheduler.DAGScheduler: Submitting Stage 3
(MappedRDD[12] at map at KMeans.scala:123), which has no missing parents
14/06/04 13:02:29 INFO scheduler.DAGScheduler: Failed to run takeSample at
KMeans.scala:260
Exception in
Did you try sc.stop()?
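A sketch of the clean-shutdown pattern, so that port 4040 is released for the next application:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("my-job").setMaster("local[*]"))
try {
  // ... job logic ...
} finally {
  sc.stop()  // frees the web UI port and executor resources
}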
On Tue, Jun 3, 2014 at 9:54 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
Hi,
I want to know how I can stop a running SparkContext in a proper way so that
next time when I start a new SparkContext, the web UI can be launched on the
same port 4040.Now when i quit the