Can you paste the stderr from the worker logs? (Found in work/
app-20140625133031-0002/ directory)
Most likely you need to set SPARK_MASTER_IP in your spark-env.sh file.
(Not sure why I'm seeing akka.tcp://spark@localhost:56569 instead of
akka.tcp://spark@*serverip*:56569.)
Thanks
Best
Try deleting the .ivy2 directory in your home directory and then doing an sbt clean
assembly; I guess that would solve this issue.
Thanks
Best Regards
On Thu, Jun 26, 2014 at 3:10 AM, Robert James srobertja...@gmail.com
wrote:
In case anyone else is having this problem, deleting all ivy's cache,
then doing a sbt
You cannot read image files with wholeTextFiles because it uses
CombineFileInputFormat, which cannot read gzipped files because they are not
splittable (source proving it:
http://www.bigdataspeak.com/2013_01_01_archive.html):
override def createRecordReader(
split: InputSplit,
That's okay, but Hadoop has ES integration. What happens if I run
saveAsHadoopFile without Hadoop? (Or do I need to pull up Hadoop
programmatically, if I can?)
b0c1
Hi Shannon,
It is likely a configuration issue; check your /etc/hosts and make sure
localhost is not associated with the SPARK_MASTER_IP you provided.
Thanks
Best Regards
On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn squ...@gatech.edu wrote:
Hi all,
I have a 2-machine Spark network
You can just add elasticsearch-hadoop as a dependency to your project to
use the ESInputFormat and ESOutputFormat (
https://github.com/elasticsearch/elasticsearch-hadoop). Some other basics
here:
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html
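For what it's worth, a rough sketch of reading from ES that way (untested; the index name and node address are placeholders, class names as I recall them from the elasticsearch-hadoop docs):

import org.apache.hadoop.io.{MapWritable, Text}
import org.apache.hadoop.mapred.JobConf
import org.elasticsearch.hadoop.mr.EsInputFormat

val jobConf = new JobConf()
jobConf.set("es.resource", "radio/artists")   // hypothetical index/type
jobConf.set("es.nodes", "localhost:9200")

// Each record comes back as (document id, document fields as a MapWritable)
val esRdd = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text], classOf[MapWritable])
println(esRdd.count())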
For testing, yes I
If a window is defined without the slideDuration parameter, how will it
slide?
I guess it will use the context's batchInterval as the slideDuration?
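For reference, I mean the difference between these two forms (lines is a hypothetical DStream, and the durations are just made up):

import org.apache.spark.streaming.Seconds

val w1 = lines.window(Seconds(30))              // no slideDuration given: presumably slides every batchInterval (my question above)
val w2 = lines.window(Seconds(30), Seconds(10)) // explicit slideDuration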
Thanks for any help.
--
View this message in context:
Hi Gerard,
Which versions of Spark, Hadoop, Cassandra and Calliope are you using?
We never built Calliope against Hadoop 2, as we and our clients either don't use
Hadoop in their deployments or use it only as the infra component for Spark, in
which case Hadoop 1 vs. Hadoop 2 doesn't make a difference for them.
I know
Hi,
I cannot argue about other use-cases; however, MLlib doesn't support text
classification out of the box. There was basic support in MLI (thanks
Sean for correcting me that it is MLI, not MLlib), but I don't know why it is
not developed anymore.
For text classification in
Dataflow is a hosted service and tries to abstract an entire pipeline;
Spark maps to some components in that pipeline and is software. My
first reaction was that Dataflow mapped more to Summingbird, as part
of it is a higher-level system for doing a specific thing in
batch/streaming --
You seem to have the binary for Hadoop 2, since it was compiled
expecting that TaskAttemptContext is an interface. So the error
indicates that Spark is also seeing Hadoop 1 classes somewhere.
On Wed, Jun 25, 2014 at 4:41 PM, Robert James srobertja...@gmail.com wrote:
After upgrading to Spark
Hi,
We have a big SMP server (128 GB RAM and 32 CPU cores) to run small-scale
analytical workloads; what is the best practice for deploying standalone Spark on
this server to achieve good performance?
How many worker instances should be configured, and how much RAM and how many
CPU cores should be allocated for
I unfortunately haven't seen this directly. But some typical things I try
when debugging are as follows.
Do you see a corresponding error on the other side of that connection
(alpinenode7.alpinenow.local)? Or is that the same machine?
Also, do the driver logs show any longer stack trace and have
Hi all,
not sure if this is a config issue or it's by design, but when I run the
spark shell, and try to submit another application from elsewhere, the
second application waits for the first to finish and outputs the following:
Initial job has not accepted any resources; check your cluster UI to
Hi Jamborta,
You can use the following options in your application to limit its resource
usage, e.g. (see the sketch below):
- spark.cores.max
- spark.executor.memory
It's better to use Mesos if you want to run multiple applications on the
same cluster smoothly.
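A minimal sketch of setting these in code (the values are just examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("limited-app")
  .set("spark.cores.max", "4")        // cap total cores used across the cluster
  .set("spark.executor.memory", "2g") // memory per executor
val sc = new SparkContext(conf)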
Thanks
Best Regards
On Thu, Jun 26,
Hi all,
I have a newbie question about StorageLevel in Spark. I came across these
sentences in the Spark documentation:
If your RDDs fit comfortably with the default storage level (MEMORY_ONLY),
leave them that way. This is the most CPU-efficient option, allowing operations
on the RDDs to run as
The Apache MRQL team is pleased to announce the release of
Apache MRQL 0.9.2-incubating. This is our second Apache release.
Apache MRQL is a query processing and optimization system for
large-scale distributed data analysis, built on top of
Apache Hadoop, Hama, and Spark.
The release artifacts
Yep, it does.
Thanks
Best Regards
On Thu, Jun 26, 2014 at 6:11 PM, jamborta jambo...@gmail.com wrote:
thanks a lot. I have tried restricting the memory usage before, but it seems
the issue was with the number of cores available.
I am planning to run this on a yarn cluster, I assume
Hello,
as per the subject, when I run the Scala spark-shell on our Mesos (0.19) cluster,
some Spark slaves just hang at the end of the staging phase for any given job.
The cluster has mixed OSes (Ubuntu 14.04 / Debian 7.4), but if I run the
same shell and commands using coarse-grained mode
Yes it does. The idea is to override the dependency if needed. I thought
you mentioned that you had built for Hadoop 2.
On Jun 26, 2014 11:07 AM, Robert James srobertja...@gmail.com wrote:
Yes. As far as I can tell, Spark seems to be including Hadoop 1 via
its transitive dependency:
Hi everyone,
Aaron, thanks for your help so far. I am trying to serialize objects that I
instantiate from a 3rd-party library, namely instances of com.wcohen.ss.Jaccard
and com.wcohen.ss.BasicStringWrapper. However, I am having problems with
serialization. I am trying (at least) to use Kryo.
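Roughly, this is the kind of setup I am attempting (a sketch; the registrator class name is mine, and it would need to be fully qualified in the config):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[com.wcohen.ss.Jaccard])
    kryo.register(classOf[com.wcohen.ss.BasicStringWrapper])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator") // fully-qualified name in a real project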
The programming guide is part of the standard documentation:
http://spark.apache.org/docs/latest/sql-programming-guide.html
Regarding specifics about SQL syntax and functions, I'd recommend using a
HiveContext and the hql() method currently, as that is much more complete
than the basic SQL parser.
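A rough sketch of what that looks like (the table name is hypothetical; assumes the Spark 1.0 HiveContext API):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// hql() runs HiveQL, which covers far more syntax than the basic SQL parser
val results = hiveContext.hql("SELECT key, count(*) FROM my_table GROUP BY key")
results.collect().foreach(println)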
Hi Kang,
You raise a good point. Spark does not automatically cache all your RDDs.
Why? Simply because the application may create many RDDs, and not all of
them are to be reused. After all, there is only so much memory available to
each executor, and caching an RDD adds some overhead especially
The current problem with Spark is the big overhead and cost of bringing up
a cluster. On a good day, it takes AWS spot instances 15 - 20 minutes to
bring up a 30 node cluster. This makes it non-efficient for computations
which may take only 10 - 15 minutes.
Hmm, this is a misleading message. The
On 6/26/14, Sean Owen so...@cloudera.com wrote:
Yes it does. The idea is to override the dependency if needed. I thought
you mentioned that you had built for Hadoop 2.
I'm very confused :-(
I downloaded the Spark distro for Hadoop 2, and installed it on my
machine. But the code doesn't have a
I'm working to set up a calculation that involves calling
mllib's SVMWithSGD.train several thousand times on different permutations
of the data. I'm trying to run the separate jobs using a thread pool to
dispatch the different requests to a Spark context connected to a Mesos
cluster, using coarse
I’ve created a CLI driver for a Spark version of a Mahout job called item
similarity with several tests that all work fine on local[4] Spark standalone.
The code even reads and writes to clustered HDFS. But switching to clustered
Spark has a problem that seems tied to a broadcast and/or
Hello Federico,
is it working with the 1.0 branch? In either branch, make sure that you
have this commit:
https://github.com/apache/spark/commit/1132e472eca1a00c2ce10d2f84e8f0e79a5193d3
I never saw the behavior you are describing, but that commit is important
if you are running in fine-grained
If you want to transform an RDD to a Map, I assume you have an RDD of
pairs. The method collectAsMap() creates a Map from the RDD in this
case.
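For example (a small sketch):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
val m: scala.collection.Map[String, Int] = pairs.collectAsMap()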
Do you mean that you want to update a Map object using data in each
RDD? You would use foreachRDD() in that case. Then you can use
RDD.foreach to do
On Thu, Jun 26, 2014 at 1:44 PM, Robert James srobertja...@gmail.com wrote:
I downloaded the Spark distro for Hadoop 2, and installed it on my
machine. But the code doesn't have a reference to that path - it uses
sbt for dependencies. As far as I can tell, using sbt or maven or ivy
will
Thanks, Sean!
I am currently using foreachRDD to update the global map using data in each
RDD. The reason I want to return the map as an RDD instead of just updating the
map is that an RDD provides many handy methods for output. For example, I want
to save the global map into files in HDFS for each batch.
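Roughly what I have in mind is something like this sketch (the names and the HDFS path are made up):

var globalMap = Map[String, Int]() // global state kept on the driver

pairDStream.foreachRDD { (rdd, time) =>
  // update the global map with this batch's data
  rdd.collect().foreach { case (k, v) => globalMap += (k -> v) }
  // turn the map back into an RDD so the usual output methods are available
  sc.parallelize(globalMap.toSeq)
    .saveAsTextFile(s"hdfs:///output/batch-${time.milliseconds}")
}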
On Thu, Jun 26, 2014 at 2:26 PM, Michael Bach Bui free...@adatao.com
wrote:
The overhead of bringing up AWS Spark spot instances is NOT an
inherent problem of Spark.
That’s technically true, but I’d be surprised if there wasn’t a lot of
room for improvement in spark-ec2 regarding cluster
On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
That’s technically true, but I’d be surprised if there wasn’t a lot of
room for improvement in spark-ec2 regarding cluster launch+config times.
Unfortunately, this is not a Spark issue, but an AWS one.
My setup ---
I have a private cluster running on 4 nodes. I want to use the spark-submit
script to execute spark applications on the cluster. I am using Mesos to
manage the cluster.
This is the command I ran in local mode, which ran successfully ---
./bin/spark-submit --master local --class
Hmm, I remember a discussion on here about how the way in which spark-ec2
rsyncs stuff to the cluster for setup could be improved, and I’m assuming
there are other such improvements to be made. Perhaps those improvements
don’t matter much when compared to EC2 instance launch times, but I’m not
Thanks. Without the local option I can connect to ES remotely; now I only
have one problem. How can I use elasticsearch-hadoop with Spark Streaming?
I mean, DStream doesn't have a saveAsHadoopFiles method, and my second problem
is that the output index depends on the input data.
Thanks
My *best guess* (please correct me if I'm wrong) is that the master
(machine1) is sending the command to the worker (machine2) with the
localhost argument as-is; that is, machine2 isn't doing any weird
address conversion on its end.
Consequently, I've been focusing on the settings of the
Hi b0c1,
I have an example of how to do this in the repo for my talk as well, the
specific example is at
https://github.com/holdenk/elasticsearchspark/blob/master/src/main/scala/com/holdenkarau/esspark/IndexTweetsLive.scala
. Since DStream doesn't have a saveAsHadoopDataset we use foreachRDD and
Hi all:
I am attempting to execute a simple test of the SparkSQL system capability
of persisting to parquet files...
My code is:
val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("test")
implicit val sc = new SparkContext(conf)
val sqlContext = new
Hello all:
I am attempting to persist a parquet file comprised of a SchemaRDD of nested
case classes...
Creating a SchemaRDD object seems to work fine, but an exception is thrown when
I attempt to persist this object to a parquet file...
my code:
case class Trivial(trivial: String = "trivial",
Wow, thanks for your fast answer, it helps a lot...
b0c1
--
Skype: boci13, Hangout: boci.b...@gmail.com
On Thu, Jun 26, 2014 at 11:48 PM, Holden Karau
Just your luck I happened to be working on that very talk today :) Let me
know how your experiences with Elasticsearch Spark go :)
On Thu, Jun 26, 2014 at 3:17 PM, boci boci.b...@gmail.com wrote:
Wow, thanks for your fast answer, it helps a lot...
b0c1
Nested parquet is not supported in 1.0, but is part of the upcoming 1.0.1
release.
On Thu, Jun 26, 2014 at 3:03 PM, anthonyjschu...@gmail.com
anthonyjschu...@gmail.com wrote:
Hello all:
I am attempting to persist a parquet file comprised of a SchemaRDD of
nested
case classes...
Creating
I am pretty happy with using pyspark with ipython notebook. The only issue
is that I need to look at the console output or spark ui to track task
progress. I wonder if anyone has thought of, or better yet written, something to
display some progress bars on the same page when I evaluate a cell in ipynb?
I
https://groups.google.com/forum/#!topic/gcp-hadoop-announce/EfQms8tK5cE
I suspect they are using their own builds... has anybody had a chance to look
at it?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
You can use the SparkListener interface to track the tasks (a rough sketch
follows below); another option is to use the JSON patch
(https://github.com/apache/spark/pull/882) to track tasks via the JSON API.
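A rough sketch of the listener approach (assumes the Spark 1.0 SparkListener developer API):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class ProgressListener extends SparkListener {
  @volatile private var tasksDone = 0
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    tasksDone += 1
    println(s"tasks finished so far: $tasksDone")
  }
}

sc.addSparkListener(new ProgressListener)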
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Jun 27, 2014
Thanks. That might be a good note to add to the official Programming
Guide...
On Thu, Jun 26, 2014 at 5:05 PM, Michael Armbrust [via Apache Spark User
List] ml-node+s1001560n8382...@n3.nabble.com wrote:
Nested parquet is not supported in 1.0, but is part of the upcoming 1.0.1
release.
On
In the interest of completeness, this is how I invoke spark:
[on master]
sbin/start-all.sh
spark-submit --py-files extra.py main.py
iPhone'd
On Jun 26, 2014, at 17:29, Shannon Quinn squ...@gatech.edu wrote:
My *best guess* (please correct me if I'm wrong) is that the master
(machine1)
Hi all,
Instead of installing numpy in each worker node, is it possible to
ship numpy (via --py-files option maybe) while invoking the
spark-submit?
Thanks,
Avishek
I don't have specific solutions for you, but the general things to try are:
- Decrease task size by broadcasting any non-trivial objects (see the sketch after this list).
- Increase duration of tasks by making them less fine-grained.
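For the first point, a small sketch of broadcasting a non-trivial object instead of capturing it in each closure (assumes sc and an RDD named rdd are in scope; the lookup table is hypothetical):

// hypothetical large lookup table that would otherwise be serialized into every task
val lookup = Map("a" -> 1, "b" -> 2)
val lookupBc = sc.broadcast(lookup)

val result = rdd.map { x =>
  // reference the broadcast value inside the task instead of the captured map
  lookupBc.value.getOrElse(x, 0)
}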
How many tasks are you sending? I've seen in the past something like 25
seconds for ~10k total
I think there is a shuffle stage involved, and the future count job will depend
directly on the first job's shuffle stage's output data, as long as it is still
available. Thus it will be much faster.
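A tiny illustration of the effect (assuming the shuffle output from the first count is still around; the input path is made up):

val grouped = sc.textFile("hdfs:///data").map(line => (line, 1)).reduceByKey(_ + _)
grouped.count() // runs the full shuffle
grouped.count() // can reuse the first job's shuffle output, so it finishes much faster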
Best Regards,
Raymond Liu
From: tomsheep...@gmail.com [mailto:tomsheep...@gmail.com]
Sent:
Thanks Raymond!
I was just reading the source code of ShuffledRDD, and found that the
ShuffleFetcher, which wraps BlockManager, does the magic.
The shuffled partitions will be stored on disk(?) just as the cacheManager
does in a persist operation.
Is that to say, whenever there is a shuffle
Hi Baoxu, thanks for sharing.
2014-06-26 22:51 GMT+08:00 Baoxu Shi(Dash) b...@nd.edu:
Hi Pei-Lun,
I have the same problem there. The issue is SPARK-2228; someone also posted a
pull request for it, but it only eliminates this exception, not the side effects.
I think the problem
I think the problem
Hi Shannon,
How about a setting like the following? (just removed the quotes)
export SPARK_MASTER_IP=192.168.1.101
export SPARK_MASTER_PORT=5060
#export SPARK_LOCAL_IP=127.0.0.1
Not sure whats happening in your case, it could be that your system is not
able to bind to 192.168.1.101 address.
Try to explicitly set the spark.driver.host property to the master's
IP.
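For example, a sketch of setting it on the driver (using the master IP mentioned earlier in the thread):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("myApp")
  .set("spark.driver.host", "192.168.1.101") // adjust to the address the workers can reach
val sc = new SparkContext(conf)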
Sujeet
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
Sent from the Apache Spark User List mailing list archive