Forgot to post the solution. I messed up the master URL. In particular, I gave
the host (master), not a URL. My bad. The error message is weird, though. Seems
like the URL regex matches master for mesos://...
No idea about the Java Runtime Environment Error.
On Mar 26, 2014, at 3:52 PM,
Thank you for your help! Let me have a try.
Nan Zhu wrote
If that’s the case, I think mapPartitions is what you need, but it seems
that you have to load the partition into memory as a whole with toArray:
rdd.mapPartitions { D => { val p = D.toArray; ... } }
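A minimal, hedged sketch of that pattern (the element type and the per-partition operation below are placeholders, not anything from the original question):

val rdd = sc.parallelize(1 to 100, 4)

val processed = rdd.mapPartitions { iter =>
  val p = iter.toArray        // load the whole partition into memory
  // ... operate on the array as a whole ...
  p.sorted.toIterator         // mapPartitions must return an Iterator
}
processed.collect()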
--
Nan Zhu
On Tuesday, April
Another thing I didn't mention: the AMI and user used. Naturally, I've
created several of my own AMIs with the following characteristics, none of
which worked.
1) Enabling ssh as root as per this guide (
http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
When doing this, I
Hi
I'm using Spark 0.9.0.
When calling saveAsTextFile on a custom Hadoop InputFormat (loaded with
newAPIHadoopRDD), I get the error below.
If I call count, I get the correct number of records, so the
InputFormat is being read correctly... the issue only appears when trying
to
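For reference, a hedged sketch of the same read-then-save pattern in the shell; the InputFormat, key/value classes and paths below are placeholders (newAPIHadoopFile is the path-based convenience around newAPIHadoopRDD), not the actual custom format in question:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/input",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])

rdd.count()                                                    // reading works
rdd.map(_._2.toString).saveAsTextFile("hdfs:///data/output")   // the step that fails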
I was able to keep the workaround ...around... by overwriting the
generated '/root/.ssh/authorized_keys' file with a known good one, in the
'/etc/rc.local' file
On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini
silvio.costant...@granatads.com wrote:
Another thing I didn't mention. The AMI and
Hi Flavio,
I happened to attend (am actually attending) the 2014 Apache Conf, where I heard about a
project called Apache Phoenix, which fully leverages HBase and is supposed to
be 1000x faster than Hive. And it is not memory-bound, whereas memory does set
up a limit for Spark. It is still in the incubating group and
Flavio, the two are best at two orthogonal use cases: HBase on the
transactional side, and Spark on the analytic side. Spark is not intended
for row-based random-access updates, but it is far more flexible and efficient
for dataset-scale aggregations and general computations.
So yes, you can easily
Thanks for the quick reply Bin. Phoenix is something I'm going to try for
sure, but it seems somehow useless if I can use Spark.
Probably, as you said, since Phoenix uses a dedicated data structure within
each HBase table, it has more effective memory usage; but if I need to
deserialize data stored in a
I tried again with the latest master, which includes the commit below, but the UI page
still shows nothing on the storage tab.
koert
commit ada310a9d3d5419e101b24d9b41398f609da1ad3
Author: Andrew Or andrewo...@gmail.com
Date: Mon Mar 31 23:01:14 2014 -0700
[Hot Fix #42] Persisted RDD disappears on
That commit did work for me. Could you confirm the following:
1) After you called cache(), did you make any actions like count() or
reduce()? If you don't materialize the RDD, it won't show up in the
storage tab.
2) Did you run ./make-distribution.sh after you switched to the current master?
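For reference, a minimal check in the shell (placeholder path): cache alone is lazy, so an action is needed before anything shows up on the storage tab.

val rdd = sc.textFile("hdfs:///data/example.txt").cache()
rdd.count()   // without an action like this, nothing is materialized or cached yet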
I read in the RDD paper
(http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf):
"For example, an RDD representing an HDFS file has a partition for each block
of the file and knows which machines each block is on."
And on http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html:
To minimize
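A quick way to see that correspondence from the spark-shell, with a placeholder path; the partition count roughly tracks the number of HDFS blocks:

val rdd = sc.textFile("hdfs:///data/large-file.txt")
println(rdd.partitions.size)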
When I start spark-shell I now see:
ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or
directory
We do not package a lib_managed with our Spark build (never did). Maybe the
logic in compute-classpath.sh that searches for datanucleus should check
for the existence of
Note that for a cached RDD in the spark shell it all works fine, but
something is going wrong with the spark-shell in our applications that
extensively cache and re-use RDDs.
On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.com wrote:
i tried again with latest master, which includes
Sorry, I meant to say: note that for a cached RDD in the spark shell it all
works fine, but something is going wrong with the SPARK-APPLICATION-UI in
our applications that extensively cache and re-use RDDs.
On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote:
note that for a
That commit fixed the exact problem you described. That is why I want to
confirm that you switched to the master branch. bin/spark-shell doesn't
detect code changes, so you need to run ./make-distribution.sh to
re-compile Spark first. -Xiangrui
On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers
Yes, I call an action after cache, and I can see that the RDDs are fully
cached using context.getRDDStorageInfo, which we expose via our own API.
I did not run make-distribution.sh; we have our own scripts to build a
distribution. However, if your question is whether I correctly deployed the
latest
Just took a quick look at the overview
here (http://phoenix.incubator.apache.org/) and
the quick start guide
here (http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html).
It looks like Apache Phoenix aims to provide flexible SQL access to data,
both for transactional and analytic
Hi Ankit,
Thanx for all the work on Pig.
Finally got it working. A couple of high-level bugs right now:
- Getting it working on Spark 0.9.0
- Getting UDF working
- Getting generate functionality working
- Exhaustive test suite on Spark on Pig
Are you maintaining a Jira somewhere?
I am
Yes, I am definitely using the latest.
On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng men...@gmail.com wrote:
That commit fixed the exact problem you described. That is why I want to
confirm that you switched to the master branch. bin/spark-shell doesn't
detect code changes, so you need to run
I put some println statements in BlockManagerUI.
I have RDDs that are cached in memory. I see this:
*** onStageSubmitted **
rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
Yet at the same time I can see via our own API:
storageInfo: {
diskSize: 0,
memSize: 19944,
numCachedPartitions: 1,
numPartitions: 1
}
On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote:
i put some println statements in BlockManagerUI
Yup, sorry about that. This error message should not produce incorrect
behavior, but it is annoying. Posted a patch to fix it:
https://github.com/apache/spark/pull/361
Thanks for reporting it!
On Tue, Apr 8, 2014 at 9:54 AM, Koert Kuipers ko...@tresata.com wrote:
when i start spark-shell i
I am trying to read a file from HDFS in the Spark shell and am getting the error below.
When I create the first RDD it works fine, but when I try to do a count on that
RDD, it throws me some connection error. I have a single-node HDFS setup and, on
the same machine, I have Spark running. Please help. When I run jps
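For reference, a hedged sketch of the shell steps being described; the NameNode host, port and path are placeholders (the URI should point at the NameNode RPC port, not the web UI port):

val lines = sc.textFile("hdfs://localhost:9000/user/hadoop/input.txt")
lines.count()   // this is the step that actually contacts HDFS, so connection errors tend to surface here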
Thank you -- this actually helped a lot. Strangely it appears that the task
detail view is not accurate in 0.8 -- that view shows 425ms duration for
one of the tasks, but in the driver log I do indeed see Finished TID 125 in
10940ms.
On that slow worker I see the following:
14/04/08 18:06:24
There are a couple of issues here which I was able to find out.
1: We should not use the web port which we use to access the web UI. I was using
that initially, so it was not working.
2: All requests should go to the NameNode and not anything else.
3: By replacing localhost:9000 in the above request, it
Hi All,
I want to measure the total network traffic for a Spark job, but I did
not see related information in the log. Does anybody know how to measure
it? Thanks very much in advance.
Hi All,
I have some spatial data on a Postgres machine. I want to be able
to move that data to Hadoop and do some geo-processing.
I tried using Sqoop to move the data to Hadoop, but it complained about the
position data (which it says it can't recognize).
Does anyone have any idea as to
Hello Manas,
I don't know Sqoop that much, but my best guess is that you're probably
using PostGIS, which has specific structures for Geometry and so on. And if
you need some spatial operators, my gut feeling is that things will be
harder ^^ (but a raw import won't need that...).
So I did a quick
Can Spark be configured to use SSL for all its network communication?
1) At the end of the callback.
2) Yes, we simply expose sc.getRDDStorageInfo to the user via REST.
3) Yes, exactly. We define the RDDs at startup; all of them are cached. From
that point on we only do calculations on these cached RDDs.
I will add some more println statements for storageStatusList.
Our one cached RDD in this run has id 3.
*** onStageSubmitted **
rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B
_rddInfoMap: Map(2 -> RDD 2
Not that I know of, but it would be great if that was supported. The way I
typically handle security now is to put the Spark servers in their own
subnet with strict inbound/outbound firewalls.
On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka ken...@gmail.com wrote:
Can Spark be configured to use
If you set up Spark's metrics reporting to write to the Ganglia backend
that will give you a good idea of how much network/disk/CPU is being used
and on what machines.
https://spark.apache.org/docs/0.9.0/monitoring.html
On Tue, Apr 8, 2014 at 12:57 PM, yxzhao yxz...@ualr.edu wrote:
Hi All,
If you want the machine that hosts the driver to also do work, you can
designate it as a worker too, if I'm not mistaken. I don't think the
driver should do work, logically, but that's not to say that the
machine it's on shouldn't do work.
--
Sean Owen | Director, Data Science | London
On Tue,
One thing you could do is create an RDD of [1,2,3] and set a partitioner
that puts all three values on their own nodes. Then .foreach() over the
RDD and call your function that will run on each node.
Why do you need to run the function on every node? Is it some sort of
setup code that needs to
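A hedged sketch of that trick; setupFunction and the element values are placeholders, and note that Spark does not strictly guarantee one partition per distinct node without a custom partitioner:

def setupFunction(id: Int): Unit = {
  // node-local setup work goes here
}

sc.parallelize(Seq(1, 2, 3), 3).foreach(setupFunction)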
Alright, so I guess I understand now why spark-ec2 allows you to select
different instance types for the driver node and worker nodes. If the
driver node is just driving and not doing any large collect()s or heavy
processing, it can be much smaller than the worker nodes.
With regards to data
May be unrelated to the question itself, but just FYI:
you can run your driver program on a worker node with Spark 0.9:
http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster
Best,
--
Nan Zhu
On Tuesday, April 8, 2014 at 5:11 PM, Nicholas Chammas
Hi guys,
We're going to hold a series of meetups about machine learning with
Spark in San Francisco.
The first one will be on April 24. Xiangrui Meng from Databricks will
talk about Spark, Spark/Python, feature engineering, and MLlib.
See
Dears,
I'm very interested in this. However, the links mentioned above are not
accessible from China. Is there any other way to read the two blog pages?
Thanks a lot.
2014-04-08 12:54 GMT+08:00 prabeesh k prabsma...@gmail.com:
Hi all,
Here I am sharing a blog for beginners, about creating
Thanks Andrew, I will take a look at it.
On Tue, Apr 8, 2014 at 3:35 PM, Andrew Ash [via Apache Spark User List]
ml-node+s1001560n3920...@n3.nabble.com wrote:
If you set up Spark's metrics reporting to write to the Ganglia backend
that will give you a good idea of how much network/disk/CPU is
Anybody, please help with the above query.
It's challenging but will open new horizons for in-memory analysis.
Hi,
I am getting a java.io.NotSerializableException while executing the
following program.
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.AccumulatorParam
object App {
class Vector (val data: Array[Double]) {}
implicit object VectorAP
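For reference, a hedged, hypothetical completion of a program like this. One common cause of java.io.NotSerializableException in this situation is the custom class not being Serializable when it is captured in a closure or used as an accumulator value; the sketch below works around that by marking Vector as Serializable. All names beyond those in the snippet above are made up:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.AccumulatorParam

object App {
  // Marking the class Serializable avoids NotSerializableException when instances
  // are shipped to executors or returned in accumulator updates.
  class Vector(val data: Array[Double]) extends Serializable

  implicit object VectorAP extends AccumulatorParam[Vector] {
    def zero(v: Vector): Vector = new Vector(new Array[Double](v.data.length))
    def addInPlace(a: Vector, b: Vector): Vector = {
      val result = new Array[Double](a.data.length)
      for (i <- a.data.indices) result(i) = a.data(i) + b.data(i)
      new Vector(result)
    }
  }

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "AccumulatorExample")
    val acc = sc.accumulator(new Vector(Array(0.0, 0.0)))
    sc.parallelize(Seq(new Vector(Array(1.0, 2.0)), new Vector(Array(3.0, 4.0))))
      .foreach(v => acc += v)
    println(acc.value.data.mkString(","))
    sc.stop()
  }
}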
Dear list,
SBT compiles fine, but when I do the following:
sbt/sbt gen-idea
import project as SBT project to IDEA 13.1
Make Project
and these errors show up:
Error:(28, 8) object FileContext is not a member of package
org.apache.hadoop.fs
import org.apache.hadoop.fs.{FileContext, FileStatus,
Hi Dong,
This is pretty much what I did. I ran into the same issue you have.
Since I'm not developing YARN-related stuff, I just excluded those two
YARN-related projects from IntelliJ, and it works. PS: you may need to
exclude the java8 project as well now.
Sincerely,
DB Tsai
I let IntelliJ read the Maven build directly and that works fine.
--
Sean Owen | Director, Data Science | London
On Wed, Apr 9, 2014 at 6:14 AM, Dong Mo monted...@gmail.com wrote:
Dear list,
SBT compiles fine, but when I do the following:
sbt/sbt gen-idea
import project as SBT project to
After sbt/sbt gen-idea, do not import as an SBT project but choose
"open project" and point it to the spark folder. -Xiangrui
On Tue, Apr 8, 2014 at 10:45 PM, Sean Owen so...@cloudera.com wrote:
I let IntelliJ read the Maven build directly and that works fine.
--
Sean Owen | Director, Data
On Tue, Apr 8, 2014 at 2:33 PM, Adam Novak ano...@soe.ucsc.edu wrote:
What, exactly, needs to be true about the RDDs that you pass to Graph() to
be sure of constructing a valid graph? (Do they need to have the same
number of partitions? The same number of partitions and no empty
partitions?
Hi All,
I am getting this exception when doing ssc.start to start the streaming
context.
ERROR KafkaReceiver - Error receiving data
akka.actor.InvalidActorNameException: actor name [NetworkReceiver-0] is not
unique!
at
Hi everybody,
I'm new to Spark and I'd like to know whether (and how much) running Spark
on top of YARN or Mesos could affect its performance. Is there any doc about
this?
Best,
Flavio
A JVM can easily be limited in how much memory it uses with the -Xmx
parameter, but Python doesn't have such first-class built-in memory
limits. Maybe the memory limits aren't making it to the Python
executors.
What was your SPARK_MEM setting? The JVM below seems to be using 603201
For 1, persist can be used to save an RDD to disk using the various
persistence levels. When a persistence level is set on an RDD and that
RDD is evaluated, it's saved to memory/disk/elsewhere so that it can be
re-used. It's applied to that RDD, so that subsequent uses of the RDD can
use the
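As a hedged illustration of that (placeholder path and storage level):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/example").persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()                      // first action evaluates the RDD and stores its partitions
rdd.filter(_.nonEmpty).count()   // subsequent uses read the persisted partitions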
Which persistence level are you talking about? MEMORY_AND_DISK ?
Sent from my mobile phone
On Apr 9, 2014 2:28 PM, Surendranauth Hiraman suren.hira...@velos.io
wrote:
Thanks, Andrew. That helps.
For 1, it sounds like the data for the RDD is held in memory and then only
written to disk after
Marco,
If you call spark-ec2 launch without specifying an AMI, it will default to
the Spark-provided AMI.
Nick
On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini
silvio.costant...@granatads.com wrote:
Hi there,
To answer your question: no, there is no reason NOT to use an AMI that
Spark has
And for the record, that AMI is ami-35b1885c. Again, you don't need to
specify it explicitly; spark-ec2 will default to it.
On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Marco,
If you call spark-ec2 launch without specifying an AMI, it will default to
Up until last week we had no problems running a Spark standalone cluster. We
now have a problem registering executors with the driver node in any
application. Although we can start-all and see the worker on 8080, no
executors are registered with the block manager.
The feedback we have is scant, but
Ah, tried that. I believe this is an HVM AMI? We are exploring paravirtual
AMIs.
On Wed, Apr 9, 2014 at 11:17 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
And for the record, that AMI is ami-35b1885c. Again, you don't need to
specify it explicitly; spark-ec2 will default to it.
The AMI should automatically switch between PVM and HVM based on the
instance type you specify on the command line. For reference (note you
don't need to specify this on the command line), the PVM ami id
is ami-5bb18832 in us-east-1.
FWIW we maintain the list of AMI Ids (across regions and pvm,
The typical way to handle that use case would be to join the 3 files
together into one RDD and then do the factorization on that. There will
definitely be network traffic during the initial join to get everything
into one table, and after that there will likely be more network traffic
for various
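A hedged sketch of that join-first approach; the paths, keys and value layouts are placeholders:

import org.apache.spark.SparkContext._   // for join on pair RDDs in a compiled app

val a = sc.textFile("hdfs:///data/file1").map { l => val f = l.split("\t"); (f(0), f(1)) }
val b = sc.textFile("hdfs:///data/file2").map { l => val f = l.split("\t"); (f(0), f(1)) }
val c = sc.textFile("hdfs:///data/file3").map { l => val f = l.split("\t"); (f(0), f(1)) }

// The joins shuffle data across the network once to co-locate matching keys;
// persisting the combined RDD avoids repeating that traffic on later passes.
val combined = a.join(b).join(c).persist()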
The groupByKey would be aware of the subsequent persist -- that's part of
the reason why operations are lazy. As far as whether it's materialized in
memory first and then flushed to disk vs streamed to disk, I'm not sure of the
exact behavior.
What I'd expect to happen would be that the RDD is
Dear list,
A quick question about Spark Streaming:
Say I have this stage set up in my Spark Streaming cluster:
batched TCP stream ==> map(expensive computation) ==> ReduceByKey
I know I can set the number of tasks for ReduceByKey.
But I didn't find a place to specify the parallelism for the
Hi Dave,
This is the HBase solution to the poor scan performance issue:
https://issues.apache.org/jira/browse/HBASE-8369
I encountered the same issue before.
To the best of my knowledge, this is not a MapReduce issue; it is an HBase
issue. If you are planning to swap out MapReduce and replace it with
Also, the driver can run on one of the slave nodes (you will still need a Spark
master though, for resource allocation etc.).
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Apr 8, 2014 at 2:46 PM, Nan Zhu
I am pretty new to Spark and I am trying to run the Spark shell on a YARN
cluster from the CLI (in yarn-client mode). I am able to start the shell with
the following command:
SPARK_JAR=../spark-0.9.0-incubating/jars/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar
\ SPARK_YARN_APP_JAR=emptyfile
Never mind... please return it later with interest.
A Spark Streaming job was running on two worker nodes and then there was an
error on one of the nodes. The Spark job showed as running, but no progress was
being made and no new messages were being processed. Based on the driver log
files I see the following errors.
I would expect the stream reading would
Howdy.
Is it possible to initiate Spark jobs from Oozie (presumably as a java action)?
If so, are there known limitations to this? And would anybody have a pointer
to an example?
Thanks,
Nate
Thanks Prabeesh.
On Wed, Apr 9, 2014 at 12:37 AM, prabeesh k prabsma...@gmail.com wrote:
Please refer
http://prabstechblog.blogspot.in/2014/04/creating-single-jar-for-spark-project.html
Regards,
prabeesh
On Wed, Apr 9, 2014 at 1:04 PM, Pradeep baji
pradeep.chanum...@gmail.comwrote:
Hi everyone,
We have just posted Spark 0.9.1, which is a maintenance release with
bug fixes, performance improvements, better stability with YARN and
improved parity of the Scala and Python API. We recommend that all 0.9.0
users upgrade to this stable release.
This is the first release since Spark
A small additional note: Please use the direct download links in the Spark
Downloads http://spark.apache.org/downloads.html page. The Apache mirrors
take a day or so to sync from the main repo, so may not work immediately.
TD
On Wed, Apr 9, 2014 at 2:54 PM, Tathagata Das
Hi all,
I have been able to run LR in local mode, but I am facing a problem running
it in cluster mode. Below are the source script and the stack trace when
running it in cluster mode. I used sbt package to build the project; not sure
what it is complaining about.
Another question I have is for
Hi,
I want to enable Spark Master HA in Spark. The documentation specifies that we
can do this with the help of ZooKeeper. But what I am worried about is how to
configure one master with the other, and similarly, how do workers know that
they have two masters? Where do you specify the multi-master
Thanks TD for managing this release, and thanks to everyone who contributed!
Matei
On Apr 9, 2014, at 2:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote:
A small additional note: Please use the direct download links in the Spark
Downloads page. The Apache mirrors take a day or so to
Hi Jenny,
How are you packaging your jar?
Can you please confirm whether you have included the MLlib jar inside the fat jar
you have created for your code?
libraryDependencies += "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.1-incubating"
Thanks,
Jagat Singh
On Thu, Apr 10, 2014 at 8:05 AM, Jenny
The only way I know to do this is to use Mesos with ZooKeeper. You specify
a ZooKeeper URL as the Spark URL that contains multiple ZooKeeper hosts. Multiple
Mesos masters are then elected through ZooKeeper leader election until the current
leader dies, at which point Mesos will elect another master (if still
Hey Patrick,
I've created SPARK-1458 https://issues.apache.org/jira/browse/SPARK-1458 to
track this request, in case the team/community wants to implement it in the
future.
Nick
On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
No use case at the moment.
Thanks Dmitriy. But I want multi-master support when running Spark
standalone. Also, I want to know if this multi-master thing works if I use
spark-shell.
On Wed, Apr 9, 2014 at 3:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
The only way i know to do this is to use mesos with zookeepers.
Hi Matei, thanks for working with me to find these issues.
To summarize, the issues I've seen are:
0.9.0:
- https://issues.apache.org/jira/browse/SPARK-1323
SNAPSHOT 2014-03-18:
- When persist() is used and batchSize=1, java.lang.OutOfMemoryError:
Java heap space. To me this indicates a memory
Hi Jagat,
Yes, I did specify mllib in build.sbt:
name := "Spark LogisticRegression"
version := "1.0"
scalaVersion := "2.10.3"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "0.9.0-incubating"
A very nice addition for us PySpark users in 0.9.1 is RDD.repartition(),
which is not mentioned in the release notes
(http://spark.apache.org/releases/spark-release-0-9-1.html)!
This is super helpful for when you create an RDD from a gzipped file and
then need to explicitly shuffle
Okay, thanks. Do you have any info on how large your records and data file are?
I’d like to reproduce and fix this.
Matei
On Apr 9, 2014, at 3:52 PM, Jim Blomo jim.bl...@gmail.com wrote:
Hi Matei, thanks for working with me to find these issues.
To summarize, the issues I've seen are:
I set SPARK_MEM in the driver process by setting
spark.executor.memory to 10G. Each machine had 32G of RAM and a
dedicated 32G spill volume. I believe all of the units are in pages,
and the page size is the standard 4K. There are 15 slave nodes in the
cluster and the sizes of the datasets I'm
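For reference, a hedged sketch of what that configuration typically looks like in a Spark 0.9 driver (the master URL and app name are placeholders):

System.setProperty("spark.executor.memory", "10g")
val sc = new org.apache.spark.SparkContext("spark://master:7077", "MyExperiment")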
Ah, looks good now. It took me a minute to realize that doing a hard
refresh on the docs page was missing the RDD class doc page...
And thanks for updating the release notes.
On Wed, Apr 9, 2014 at 7:21 PM, Tathagata Das
tathagata.das1...@gmail.comwrote:
Thanks Nick for pointing that out! I
Good question. This is something we wanted to fix, but unfortunately I'm
not sure how to do it without changing the API to RDD, which is undesirable
now that the 1.0 branch has been cut. We should figure something out though
for 1.1.
I've created https://issues.apache.org/jira/browse/SPARK-1460
It is as Jagat said. The Masters do not need to know about one another, as
ZooKeeper manages their implicit communication. As for Workers (and
applications, such as spark-shell), once a Worker is registered with
*some* Master,
its metadata is stored in ZooKeeper such that if another Master is
Thank you, it works.
After my operation over p, I return p.toIterator, because mapPartitions has
an iterator return type, is that right?
rdd.mapPartitions { D => { val p = D.toArray; ...; p.toIterator } }
Yeah, should be right
--
Nan Zhu
On Wednesday, April 9, 2014 at 8:54 PM, wxhsdp wrote:
Thank you, it works.
After my operation over p, I return p.toIterator, because mapPartitions has
an iterator return type, is that right?
rdd.mapPartitions { D => { val p = D.toArray; ...; p.toIterator } }
This dataset is uncompressed text at ~54GB. stats() returns (count:
56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min:
343)
On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Okay, thanks. Do you have any info on how large your records and data file
I haven’t seen this but it may be a bug in Typesafe Config, since this is
serializing a Config object. We don’t actually use Typesafe Config ourselves.
Do you have any nulls in the data itself by any chance? And do you know how
that Config object is getting there?
Matei
On Apr 9, 2014, at
OK, I thought it may be closing over the config option. I am using config
for job configuration, but extracting vals from that. So I'm not sure why, as I
thought I'd avoided closing over it. Will go back to the source and see where
it is creeping in.
On Thu, Apr 10, 2014 at 8:42 AM, Matei Zaharia
rdd.foreach(p => {
print(p)
})
The above closure gets executed on the workers; you need to look at the logs of
the workers to see the output.
But if I'm in local mode, where are the logs of the local driver? There are no
/logs and /work dirs in /SPARK_HOME, which are set in standalone mode.
Hi, you can take a look here:
http://www.abcn.net/2014/04/install-shark-on-cdh5-hadoop2-spark.html
Hi Mayur,
I wondered if you could share your findings in some way (github, blog post,
etc.). I guess your experience will be very interesting/useful for many
people.
sent from Lenovo YogaTablet
On Apr 8, 2014 8:48 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Hi Ankit,
Thanx for all the work
Yes, that helps me better understand how Spark works. But that was also what I
was afraid of; I think the network communication will take too much time for my
job.
I will continue to look for a trick in order to avoid network
communication.
I saw on the Hadoop website that: To minimize global
You need to use a proper HDFS URI with saveAsTextFile.
For example:
rdd.saveAsTextFile("hdfs://NameNode:Port/tmp/Iris/output.tmp")
Regards,
Adnan
Asaf Lahav wrote
Hi,
We are using Spark with data files on HDFS. The files are stored as files
for a predefined Hadoop user (hdfs).
The folder is
Then the problem is not on the Spark side. You have three options; choose any one of
them:
1. Change permissions on the /tmp/Iris folder from a shell on the NameNode with the hdfs
dfs -chmod command.
2. Run your Hadoop service as the hdfs user.
3. Disable dfs.permissions in conf/hdfs-site.xml.
Regards,
Adnan
avito
Hello Spark Users,
With the recent graduation of Spark to a top-level project (grats, btw!),
maybe a well-timed question. :)
We are at the very beginning of a large-scale big data project and after
two months of exploration work we'd like to settle on the technologies to
use, roll up our sleeves
When you say Spark is one of the forerunners for our technology choice,
what are the other options you are looking into?
I start cross-validation runs on a 40-core, 160 GB Spark job using a
script... I woke up in the morning and none of the jobs had crashed! And the
project just came out of incubation.
Hi Asaf,
The user who runs SparkContext is decided by the code below in SparkContext.
Normally this user.name is the user who started the JVM; you can start your
application with -Duser.name=xxx to specify the username you want, and this specified
username will be the user used to communicate with HDFS.
val
Spark has been endorsed by Cloudera as the successor to MapReduce. That
says a lot...
On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth
andras.nem...@lynxanalytics.com wrote:
Hello Spark Users,
With the recent graduation of Spark to a top level project (grats, btw!),
maybe a well timed
I am getting a Python ImportError on a Spark standalone cluster. I have set the
PYTHONPATH on both worker and slave, and the package imports properly when I
run the PySpark command line on both machines. This only happens with Master -
Slave communication. Here is the error below:
14/04/10 13:40:19