Hi,
Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to
throw org.apache.hadoop.mapred.FileAlreadyExistsException when file already
exists.
Is there a way I can allow Spark to overwrite the existing file?
Cheers,
Kexin
+1 Same question here...
Message sent from a mobile device - excuse typos and abbreviations
On 2 June 2014, at 10:08, Kexin Xie kexin@bigcommerce.com wrote:
Hi,
Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to throw
The function saveAsTextFile
(https://github.com/apache/spark/blob/7d9cc9214bd06495f6838e355331dd2b5f1f7407/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1066)
is a wrapper around saveAsHadoopFile
Hi Michaël,
Thanks for this. We could indeed do that.
But I guess the question is more about the change of behaviour from 0.9.1 to
1.0.0.
We never had to care about that in previous versions.
Does that mean we have to manually remove existing files, or is there a way
to automatically overwrite them?
Dear all,
Does Spark support sparse matrices/vectors for logistic regression (LR) now?
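For context, the kind of usage in question — a minimal pyspark sketch assuming Spark 1.0's MLlib API, which added sparse vector support, and an existing SparkContext sc:

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

# Two 5-dimensional points; only the nonzero (index, value) pairs are stored.
data = sc.parallelize([
    LabeledPoint(1.0, SparseVector(5, {1: 4.0, 3: 7.0})),
    LabeledPoint(0.0, SparseVector(5, {0: 2.0, 4: 1.0})),
])
model = LogisticRegressionWithSGD.train(data, iterations=10)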
Best,
Wush
On 2014/6/2 at 3:19 PM, praveshjain1991 praveshjain1...@gmail.com wrote:
Thank you for your replies. I've now been using integer datasets but ran
into
another issue.
OK, rebuilding the assembly jar file with cdh5 works now...
Thanks..
-Simon
On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com wrote:
That helped a bit... Now I have a different failure: the startup process
is stuck in an infinite loop, outputting the following message:
Hi folks,
I have a weird problem when using pyspark with yarn. I started ipython as
follows:
IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G
When I create a notebook, I can see workers being created, and indeed I see
the Spark UI running on my
Thanks! This is even closer to what I am looking for. I'm on a trip now, so
I'm going to give it a try when I come back.
On Mon, Jun 2, 2014 at 5:12 AM, Ngoc Dao ngocdaoth...@gmail.com wrote:
Alternative solution:
https://github.com/xitrum-framework/xitrum-package
It collects all dependency
Dear PJ$,
If you are familiar with Puppet, you could try using the Puppet module I wrote
(currently for Spark 0.9.0; I custom-compiled it, since no Debian package was
available at the time I started the project I needed it for).
https://github.com/stefanvanwouw/puppet-spark
---
Kind regards,
Hi Simon,
You shouldn't have to install pyspark on every worker node. In YARN mode,
pyspark is packaged into your assembly jar and shipped to your executors
automatically. This seems like a more general problem. There are a few
things to try:
1) Run a simple pyspark shell with yarn-client, and
1) Yes, sc.parallelize(range(10)).count() has the same error.
2) The files seem to be correct.
3) I have trouble at this step: ImportError: No module named pyspark
but I seem to have files in the jar file:
$ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
import pyspark
So, I did specify SPARK_JAR in my pyspark program. I also checked the workers;
it seems that the jar file is distributed and included in the classpath
correctly.
I think the problem is likely at step 3...
I build my jar file with maven, like this:
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1
Indeed, the behavior has changed, for good or for bad. I mean, I agree with the
danger you mention, but I'm not sure it's happening like that. Isn't there an
overwrite mechanism in Hadoop that automatically removes the part files, then
writes to a _temporary folder and then only the part files along
Hey Mayur,
Thanks for the suggestion, I didn't realize that was configurable. I don't
think I'm running out of memory, though it does seem like these errors go
away when I turn off the spark.streaming.unpersist configuration and use
spark.cleaner.ttl instead. Do you know if there are known
Looks like just worker and master processes are running:
[hivedata@hivecluster2 ~]$ jps
10425 Jps
[hivedata@hivecluster2 ~]$ ps aux|grep spark
hivedata 10424 0.0 0.0 103248 820 pts/3 S+ 10:05 0:00 grep spark
root 10918 0.5 1.4 4752880 230512 ? Sl May27 41:43 java -cp
http://hortonworks.com/blog/ddm/
The receivers are submitted as tasks. They are supposed to be assigned
to the executors in a round-robin manner by
TaskSchedulerImpl.resourceOffers(). However, sometimes not all the
executors are registered when the receivers are submitted. That's why
the receivers fill up the registered executors
If it matters, I have servers running at
http://hivecluster2:4040/stages/ and http://hivecluster2:4041/stages/
When I run rdd.first, I see an item at
http://hivecluster2:4041/stages/ but no tasks are running. Stage ID 1,
first at console:46, Tasks: Succeeded/Total 0/16.
On Mon, Jun 2, 2014 at
I asked several people, no one seems to believe that we can do this:
$ PYTHONPATH=/path/to/assembly/jar python
import pyspark
The following pull request did mention something about generating a zip
file for all Python-related modules:
Okay, I'm guessing that our upstream Hadoop 2 package isn't new
enough to work with CDH5. We should probably clarify this in our
downloads. Thanks for reporting this. What was the exact string you
used when building? Also, which CDH 5 version are you building against?
On Mon, Jun 2, 2014 at 8:11
I built my new package like this:
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean
package
Spark-shell is working now, but pyspark is still broken. I reported the
problem on a different thread. Please take a look if you can... Desperately
need ideas..
Thanks.
-Simon
On
Hey There,
The issue was that the old behavior could cause users to silently
overwrite data, which is pretty bad, so to be conservative we decided
to enforce the same checks that Hadoop does.
This was documented by this JIRA:
https://issues.apache.org/jira/browse/SPARK-1100
OK, my colleague found this:
https://mail.python.org/pipermail/python-list/2014-May/671353.html
And my jar file has 70011 files. Fantastic..
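A quick sanity check for this (the jar name is the one from earlier in this thread; archives with more than 65535 entries require the Zip64 format, which Python 2's zipimport cannot read):

import zipfile
jar = "spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar"
# zipfile can read Zip64 archives even where zipimport cannot,
# so counting the entries tells you whether you are over the limit.
print(len(zipfile.ZipFile(jar).namelist()))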
On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen xche...@gmail.com wrote:
I asked several people, no one seems to believe that we can do this:
$
Hi, Patrick,
I think https://issues.apache.org/jira/browse/SPARK-1677 is talking about the
same thing?
How about assigning it to me?
I think I missed the configuration part in my previous commit, though I
declared that in the PR description….
Best,
--
Nan Zhu
On Monday, June 2,
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
I accidentally assigned myself way back when I created it). This
should be an easy fix.
On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
Hi, Patrick,
I think
Are you building Spark with Java 6 or Java 7? Java 6 uses the extended
Zip format and Java 7 uses Zip64. I think we've tried to add some
build warnings if Java 7 is used, for this reason:
https://github.com/apache/spark/blob/master/make-distribution.sh#L102
Any luck if you use JDK 6 to compile?
The RDD API has functions to join multiple RDDs, such as PairRDD.join
or PairRDD.cogroup, which take another RDD as input, e.g.
firstRDD.join(secondRDD)
I'm looking for ways to do the opposite: split an existing RDD. What is the
right way to create derivative RDDs from an existing RDD?
e.g.
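One straightforward way, sketched in pyspark (names are hypothetical): derive each child RDD with filter, and cache the parent so it isn't recomputed for every child.

nums = sc.parallelize(range(100)).cache()  # cache: both children re-read it
evens = nums.filter(lambda x: x % 2 == 0)
odds = nums.filter(lambda x: x % 2 != 0)
print(evens.count(), odds.count())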
Nope... didn't try Java 6. The standard installation guide didn't say
anything about Java 7, and suggested doing -DskipTests for the build:
http://spark.apache.org/docs/latest/building-with-maven.html
So, I didn't see the warning message...
On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell
+1 please re-add this feature
On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell pwend...@gmail.com wrote:
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
I accidentally assigned myself way back when I created it). This
should be an easy fix.
On Mon, Jun 2, 2014 at
You may have to do sudo jps, because it should definitely list your
processes.
What does hivecluster2:8080 look like? My guess is it says there are 2
applications registered, and one has taken all the executors. There must be
two applications running, as those are the only things that keep open
So in summary:
- As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
- There is an open JIRA issue to add an option to allow clobbering.
- Even when clobbering, part- files may be left over from previous
saves, which is dangerous.
Is this correct?
On Mon, Jun 2,
Hi everyone,
I would like to set up a very simple cluster (specifically, using only 2 micro
instances) of Spark on EC2 and make it run a simple Spark Streaming
application I created.
Has someone actually managed to do that?
Because after launching the scripts from this page:
Yes.
On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
So in summary:
- As of Spark 1.0.0, saveAsTextFile() will no longer clobber by
default.
- There is an open JIRA issue to add an option to allow clobbering.
- Even when clobbering, part-
Nothing appears to be running on hivecluster2:8080.
'sudo jps' does show
[hivedata@hivecluster2 ~]$ sudo jps
9953 PepAgent
13797 JournalNode
7618 NameNode
6574 Jps
12716 Worker
16671 RunJar
18675 Main
18177 JobTracker
10918 Master
18139 TaskTracker
7674 DataNode
I kill all processes listed. I
OK, thanks for confirming. Is there something we can do about that leftover
part- files problem in Spark, or is that for the Hadoop team?
On Monday, June 2, 2014, Aaron Davidson ilike...@gmail.com wrote:
Yes.
On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I'm a bit confused, because the PR mentioned by Patrick seems to address all
these issues:
https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
Was it not accepted? Or is the description of this PR not completely
implemented?
Message sent from a mobile device - excuse typos and abbreviations
I assume the idea is for Spark to rm -r dir/, which would clean out
everything that was there before. It's just doing this instead of the
caller. Hadoop still won't let you write into a location that already
exists regardless, and part of that is for this reason that you might
end up with files
Fair enough. That rationale makes sense.
I would prefer that a Spark clobber option also delete the destination
files, but as long as it's a non-default option, I can see the caller-beware
side of that argument as well.
Nick
On Monday, June 2, 2014, Sean Owen so...@cloudera.com wrote:
I assume the
I made the PR; the problem is... after many rounds of review, that configuration
part was missed... sorry about that.
I will fix it.
Best,
--
Nan Zhu
On Monday, June 2, 2014 at 5:13 PM, Pierre Borckmans wrote:
I'm a bit confused because the PR mentioned by Patrick seems to address all
Hi all,
Seeing a random exception kill my spark streaming job. Here's a stack
trace:
java.util.NoSuchElementException: key not found: 32855
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at
Hello Spark fans,
I am trying to log messages from my Spark application. When the main()
function attempts to log using log.info(), it works great, but when I try
the same command from the code that probably runs on the worker, I
initially got a serialization error. To solve that, I created a
Hi,
I've upgraded to Spark 1.0.0. I'm not able to run any tests. They throw a
*java.lang.SecurityException: class
javax.servlet.FilterRegistration's signer information does not match
signer information of other classes in the same package*
I'm using Hadoop-core 1.0.4 and running this locally.
I
We can just add back a flag to make it backwards compatible - it was
just missed during the original PR.
Adding a *third* set of clobber semantics, I'm slightly -1 on that
for the following reasons:
1. It's scary to have Spark recursively deleting user files, could
easily lead to users deleting
This ultimately means you have a couple copies of the servlet APIs in
the build. What is your build like (SBT? Maven?) and what exactly are
you depending on?
On Tue, Jun 3, 2014 at 12:21 AM, Mohit Nayak wiza...@gmail.com wrote:
Hi,
I've upgraded to Spark 1.0.0. I'm not able to run any tests.
Is there a third way? Unless I miss something, Hadoop's OutputFormat
wants the target dir to not exist no matter what, so it's just a
question of whether Spark deletes it for you or errors.
On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell pwend...@gmail.com wrote:
We can just add back a flag to
Hey,
Thanks for the reply.
I am using SBT. Here is a list of my dependencies:
val sparkCore = "org.apache.spark" % "spark-core_2.10" % V.spark
val hadoopCore = "org.apache.hadoop" % "hadoop-core" % V.hadoop % "provided"
val jodaTime = "com.github.nscala-time" %% "nscala-time"
Hi,
How does one process data sources other than text?
Let's say I have millions of MP3 (or JPEG) files and I want to use Spark to
process them. How does one go about it?
I have never been able to figure this out.
Let's say I have this library in Python which works like the following:
import
Hi Jamal,
If what you want is to process lots of files in parallel, the best
approach is probably to load all file names into an array and
parallelize that. Then each task will take a path as input and can
process it however it wants.
Or you could write the file list to a file, and then use
Currently Spark Streaming does not support addition/deletion/modification
of DStream after the streaming context has been started.
Nor can you restart a stopped streaming context.
Also, multiple spark contexts (and therefore multiple streaming contexts)
cannot be run concurrently in the same JVM.
I asked a question related to Marcelo's answer a few months ago. The
discussion there may be useful:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html
On 06/02/2014 06:09 PM, Marcelo Vanzin wrote:
Hi Jamal,
If what you want is to process lots of files in parallel, the
Hi Marcelo,
Thanks for the response.
I am not sure I understand. Can you elaborate a bit?
So, for example, let's take a look at this example:
http://pythonvision.org/basic-tutorial
from scipy import ndimage
import mahotas
dna = mahotas.imread('dna.jpeg')
dnaf = ndimage.gaussian_filter(dna, 8)
But instead of dna.jpeg, let's
Thanks. Let me go thru it.
On Mon, Jun 2, 2014 at 5:15 PM, Philip Ogren philip.og...@oracle.com
wrote:
I asked a question related to Marcelo's answer a few months ago. The
discussion there may be useful:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html
On
Hi all,
I am getting an error:
14/06/02 17:06:32 INFO WindowedDStream: Time 1401753992000 ms is invalid as
zeroTime is 1401753986000 ms and slideDuration is 4000 ms and difference is
6000 ms
14/06/02 17:06:32 ERROR OneForOneStrategy: key not found: 1401753992000 ms
The idea is simple. If you want to run something on a collection of
files, do (in pseudo-Python):
def processSingleFile(path):
    # Your code to process a file
    pass

files = ["file1", "file2"]
sc.parallelize(files).foreach(processSingleFile)
On Mon, Jun 2, 2014 at 5:16 PM, jamal sasha
Phoofff.. (Mind blown)...
Thank you sir.
This is awesome
On Mon, Jun 2, 2014 at 5:23 PM, Marcelo Vanzin van...@cloudera.com wrote:
The idea is simple. If you want to run something on a collection of
files, do (in pseudo-python):
def processSingleFile(path):
# Your code to process a file
Hey,
Yup that fixed it. Thanks so much!
Is this the only solution, or could this be resolved in future versions of
Spark?
On Mon, Jun 2, 2014 at 5:14 PM, Sean Owen so...@cloudera.com wrote:
If it's the SBT build, I suspect you are hitting
https://issues.apache.org/jira/browse/SPARK-1949
I am assuming that you are referring to the "OneForOneStrategy: key not
found: 1401753992000 ms" error, and not to the previous "Time 1401753992000
ms is invalid" one. Those two seem a little unrelated to me. Can you give
us the stack trace associated with the key-not-found error?
TD
On Mon, Jun 2,
You can just use the Maven build for now, even for Spark 1.0.0.
Matei
On Jun 2, 2014, at 5:30 PM, Mohit Nayak wiza...@gmail.com wrote:
Hey,
Yup that fixed it. Thanks so much!
Is this the only solution, or could this be resolved in future versions of
Spark ?
On Mon, Jun 2, 2014 at
Ok, it seems like "Time ... is invalid" is part of the normal workflow: the
window DStream will ignore RDDs at moments in time that do not match
the window sliding interval. But why I am getting the exception is still
unclear. Here is the full stack:
14/06/02 17:21:48 INFO WindowedDStream: Time
Do you have the info-level logs of the application? Can you grep the
value 32855 to find any references to it? Also, what version of Spark are
you using (so that I can match the stack trace; it does not seem to match
Spark 1.0)?
TD
On Mon, Jun 2, 2014 at 3:27 PM, Michael Chang
Can you give all the logs? Would like to see what is clearing the key
1401754908000 ms
TD
On Mon, Jun 2, 2014 at 5:38 PM, Vadim Chekan kot.bege...@gmail.com wrote:
Ok, it seems like Time ... is invalid is part of normal workflow, when
window DStream will ignore RDDs at moments in time when
(A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's
output format check and overwrite files in the destination directory.
But it won't clobber the directory entirely. I.e. if the directory
already had part1, part2, part3, part4 and you write a new job
outputting only two files (part1,
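To make that concrete, a hypothetical pyspark sequence under those 0.9.x semantics:

# First job writes four partitions: part-00000 .. part-00003.
sc.parallelize(range(100), 4).saveAsTextFile("out")
# Second job overwrites only two of them: part-00000 and part-00001.
sc.parallelize(range(10), 2).saveAsTextFile("out")
# part-00002 and part-00003 from the first job survive, so reading the
# whole "out" directory now mixes output from both jobs.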
Yeah, we need to add a build warning to the Maven build. Would you be
able to try compiling Spark with Java 6? It would be good to narrow
down if you are hitting this problem or something else.
On Mon, Jun 2, 2014 at 1:15 PM, Xu (Simon) Chen xche...@gmail.com wrote:
Nope... didn't try java 6.
I remember that in the earlier version of that PR, I deleted files by calling
the HDFS API.
We discussed and concluded that it's a bit scary to have something directly
deleting users' files in Spark.
Best,
--
Nan Zhu
On Monday, June 2, 2014 at 10:39 PM, Patrick Wendell wrote:
(A) Semantics
The usual way to use Spark with SBT is to package a Spark project using sbt
package (e.g. per the Quick Start) and submit it to Spark using the bin/ scripts
from the Spark distribution. For a plain Scala project, you don't need to download
anything; you can just get a build.sbt file with dependencies and
On Mon, Jun 2, 2014 at 10:39 PM, Patrick Wendell pwend...@gmail.com wrote:
(B) Semantics in Spark 1.0 and earlier:
Do you mean 1.0 and later?
Option (B) with the exception-on-clobber sounds fine to me, btw. My use
pattern is probably common but not universal, and deleting user files is
indeed
Spark 0.9.1.
textInput is a JavaRDD object.
I am programming in Java.
2014-06-03
bluejoe2008
From: Michael Armbrust
Date: 2014-06-03 10:09
To: user
Subject: Re: how to construct a ClassTag object as a method parameter in Java
What version of Spark are you using? Also are you sure the type of
+1 on Option (B) with a flag to allow semantics in (A) for backward compatibility.
Kexin
On Tue, Jun 3, 2014 at 1:18 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
On Mon, Jun 2, 2014 at 10:39 PM, Patrick Wendell pwend...@gmail.com
wrote:
(B) Semantics in Spark 1.0 and earlier:
Do
Thanks for looking into this, Tathagata.
Are you looking for traces of the ReceiverInputDStream.clearMetadata call?
Here is the log: http://wepaste.com/vchekan
Vadim.
On Mon, Jun 2, 2014 at 5:58 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Can you give all the logs? Would like to see what
Good catch! Yes I meant 1.0 and later.
On Mon, Jun 2, 2014 at 8:33 PM, Kexin Xie kexin@bigcommerce.com wrote:
+1 on Option (B) with a flag to allow semantics in (A) for backward compatibility.
Kexin
On Tue, Jun 3, 2014 at 1:18 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
On Mon,
Hi,
I had the same problem with pyspark. Here's how I resolved it:
What I've found in Python (not sure about Scala) is that if the function
being serialized was written in the same Python module as the main
function, then logging fails. If the serialized function is in a separate
module, then it works.
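A minimal sketch of that workaround (module and function names are hypothetical; assumes the module is available on the executors, e.g. via sc.addPyFile):

# worker_funcs.py -- a separate module from the main script
import logging

def process(x):
    logging.getLogger("worker").info("processing %s", x)
    return x * 2

# main script
from worker_funcs import process
sc.parallelize(range(10)).map(process).collect()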