How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Kexin Xie
Hi, Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to throw org.apache.hadoop.mapred.FileAlreadyExistsException when the file already exists. Is there a way I can allow Spark to overwrite the existing file? Cheers, Kexin

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
+1 Same question here... Message sent from a mobile device - excuse typos and abbreviations On 2 June 2014 at 10:08, Kexin Xie kexin@bigcommerce.com wrote: Hi, Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to throw

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Michael Cutler
The function saveAsTextFile https://github.com/apache/spark/blob/7d9cc9214bd06495f6838e355331dd2b5f1f7407/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1066 is a wrapper around saveAsHadoopFile
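A workaround that often comes up for this (a sketch only; the truncated message above may be proposing something along the same lines) is to delete the output directory yourself before calling saveAsTextFile. In PySpark that can be done through the Hadoop FileSystem API via the py4j gateway; note that _jsc and _jvm are private attributes and the output path here is hypothetical:

    from pyspark import SparkContext

    sc = SparkContext(appName="overwrite-example")
    out = "hdfs:///tmp/my-output"  # hypothetical output path

    # reach into the JVM and use the Hadoop FileSystem API to clear the target
    conf = sc._jsc.hadoopConfiguration()
    path = sc._jvm.org.apache.hadoop.fs.Path(out)
    fs = path.getFileSystem(conf)
    if fs.exists(path):
        fs.delete(path, True)  # recursive delete: caller beware

    sc.parallelize(["a", "b", "c"]).saveAsTextFile(out)

The same pattern works from Scala with FileSystem.get(sc.hadoopConfiguration).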

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre B
Hi Michaël, Thanks for this. We could indeed do that. But I guess the question is more about the change of behaviour from 0.9.1 to 1.0.0. We never had to care about that in previous versions. Does that mean we have to manually remove existing files, or is there a way to automatically overwrite

Re: Using String Dataset for Logistic Regression

2014-06-02 Thread Wush Wu
Dear all, Does Spark support sparse matrix/vector for LR now? Best, Wush On 2014/6/2 at 3:19 PM, praveshjain1991 praveshjain1...@gmail.com wrote: Thank you for your replies. I've now been using integer datasets but ran into another issue.

Re: spark 1.0.0 on yarn

2014-06-02 Thread Xu (Simon) Chen
OK, rebuilding the assembly jar file with cdh5 works now... Thanks.. -Simon On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com wrote: That helped a bit... Now I have a different failure: the start up process is stuck in an infinite loop outputting the following message:

pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows: IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G When I create a notebook, I can see workers being created and indeed I see spark UI running on my

Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-06-02 Thread Andrei
Thanks! This is even closer to what I am looking for. I'm in a trip now, so I'm going to give it a try when I come back. On Mon, Jun 2, 2014 at 5:12 AM, Ngoc Dao ngocdaoth...@gmail.com wrote: Alternative solution: https://github.com/xitrum-framework/xitrum-package It collects all dependency

Re: Trouble with EC2

2014-06-02 Thread Stefan van Wouw
Dear PJ$, If you are familiar with Puppet, you could try using the Puppet module I wrote (currently for Spark 0.9.0; I custom-compiled it since no Debian package was available at the time I started the project that required it). https://github.com/stefanvanwouw/puppet-spark --- Kind

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Andrew Or
Hi Simon, You shouldn't have to install pyspark on every worker node. In YARN mode, pyspark is packaged into your assembly jar and shipped to your executors automatically. This seems like a more general problem. There are a few things to try: 1) Run a simple pyspark shell with yarn-client, and
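For reference, the kind of minimal job step 1 refers to could look like this once the shell is up (just a sanity check; assumes the shell was started with --master yarn-client so sc already exists):

    # run inside the pyspark shell
    rdd = sc.parallelize(range(100), 8)        # 8 partitions so several executors get tasks
    print(rdd.count())                         # expect 100
    print(rdd.map(lambda x: x * x).take(5))    # expect [0, 1, 4, 9, 16]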

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
1) yes, that sc.parallelize(range(10)).count() has the same error. 2) the files seem to be correct 3) I have trouble at this step, ImportError: No module named pyspark, but I seem to have files in the jar file:
$ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
import pyspark

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
So I did specify SPARK_JAR in my pyspark program. I also checked the workers; it seems that the jar file is distributed and included in the classpath correctly. I think the problem is likely at step 3. I build my jar file with Maven, like this: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
Indeed, the behavior has changed for good or for bad. I mean, I agree with the danger you mention but I'm not sure it's happening like that. Isn't there a mechanism for overwrite in Hadoop that automatically removes part files, then writes a _temporary folder and then only the part files along

Re: Failed to remove RDD error

2014-06-02 Thread Michael Chang
Hey Mayur, Thanks for the suggestion, I didn't realize that was configurable. I don't think I'm running out of memory, though it does seem like these errors go away when I turn off the spark.streaming.unpersist configuration and use spark.cleaner.ttl instead. Do you know if there are known

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
Looks like just worker and master processes are running:
[hivedata@hivecluster2 ~]$ jps
10425 Jps
[hivedata@hivecluster2 ~]$ ps aux|grep spark
hivedata 10424 0.0 0.0 103248 820 pts/3 S+ 10:05 0:00 grep spark
root 10918 0.5 1.4 4752880 230512 ? Sl May27 41:43 java -cp

Is Hadoop MR now comparable with Spark?

2014-06-02 Thread Ian Ferreira
http://hortonworks.com/blog/ddm/#.U4yn3gJgfts.twitter

Re: [Spark Streaming] Distribute custom receivers evenly across excecutors

2014-06-02 Thread Guang Gao
The receivers are submitted as tasks. They are supposed to be assigned to the executors in a round-robin manner by TaskSchedulerImpl.resourceOffers(). However, sometimes not all the executors are registered when the receivers are submitted. That's why the receivers fill up the registered executors

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
If it matters, I have servers running at http://hivecluster2:4040/stages/ and http://hivecluster2:4041/stages/ When I run rdd.first, I see an item at http://hivecluster2:4041/stages/ but no tasks are running. Stage ID 1, first at console:46, Tasks: Succeeded/Total 0/16. On Mon, Jun 2, 2014 at

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
I asked several people; no one seems to believe that we can do this:
$ PYTHONPATH=/path/to/assembly/jar python
import pyspark
The following pull request did mention something about generating a zip file for all Python-related modules:

Re: spark 1.0.0 on yarn

2014-06-02 Thread Patrick Wendell
Okay, I'm guessing that our upstream Hadoop 2 package isn't new enough to work with CDH5. We should probably clarify this in our downloads. Thanks for reporting this. What was the exact string you used when building? Also, which CDH5 version are you building against? On Mon, Jun 2, 2014 at 8:11

Re: spark 1.0.0 on yarn

2014-06-02 Thread Xu (Simon) Chen
I built my new package like this: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean package Spark-shell is working now, but pyspark is still broken. I reported the problem on a different thread. Please take a look if you can... Desperately need ideas.. Thanks. -Simon On

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
Hey There, The issue was that the old behavior could cause users to silently overwrite data, which is pretty bad, so to be conservative we decided to enforce the same checks that Hadoop does. This was documented by this JIRA: https://issues.apache.org/jira/browse/SPARK-1100

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
OK, my colleague found this: https://mail.python.org/pipermail/python-list/2014-May/671353.html And my jar file has 70011 files. Fantastic.. On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen xche...@gmail.com wrote: I asked several people, no one seems to believe that we can do this: $
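The linked python-list thread is about zipimport failing on archives with more than 65535 entries, the limit of the classic (non-Zip64) ZIP format. A quick way to count the entries in the assembly jar with only the standard library (jar name taken from earlier in this thread):

    import zipfile

    jar = "spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar"
    with zipfile.ZipFile(jar) as z:
        print(len(z.namelist()))  # above 65535 entries, zipimport (used for jars on PYTHONPATH) breaks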

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nan Zhu
Hi, Patrick, I think https://issues.apache.org/jira/browse/SPARK-1677 is talking about the same thing? How about assigning it to me? I think I missed the configuration part in my previous commit, though I declared that in the PR description…. Best, -- Nan Zhu On Monday, June 2,

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think I accidentally assigned myself way back when I created it). This should be an easy fix. On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, Patrick, I think

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Patrick Wendell
Are you building Spark with Java 6 or Java 7? Java 6 uses the extended Zip format and Java 7 uses Zip64. I think we've tried to add some build warnings if Java 7 is used, for this reason: https://github.com/apache/spark/blob/master/make-distribution.sh#L102 Any luck if you use JDK 6 to compile?

How to create RDDs from another RDD?

2014-06-02 Thread Gerard Maas
The RDD API has functions to join multiple RDDs, such as PairRDD.join or PairRDD.cogroup, that take another RDD as input, e.g. firstRDD.join(secondRDD). I'm looking for ways to do the opposite: split an existing RDD. What is the right way to create derivative RDDs from an existing RDD? e.g.
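There is no built-in inverse of join/cogroup; the usual pattern (a sketch, shown in PySpark, though the same calls exist on the Scala API) is to derive each piece with a filter on the parent RDD, caching the parent so it is not recomputed for every child:

    from pyspark import SparkContext

    sc = SparkContext(appName="split-rdd-example")
    rdd = sc.parallelize(range(10)).cache()  # cache the parent; each derived RDD re-reads it otherwise

    evens = rdd.filter(lambda x: x % 2 == 0)
    odds = rdd.filter(lambda x: x % 2 != 0)

    print(evens.collect())  # [0, 2, 4, 6, 8]
    print(odds.collect())   # [1, 3, 5, 7, 9]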

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
Nope... didn't try Java 6. The standard installation guide didn't say anything about Java 7 and suggested doing -DskipTests for the build: http://spark.apache.org/docs/latest/building-with-maven.html So I didn't see the warning message... On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Aaron Davidson
+1 please re-add this feature On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell pwend...@gmail.com wrote: Thanks for pointing that out. I've assigned you to SPARK-1677 (I think I accidentally assigned myself way back when I created it). This should be an easy fix. On Mon, Jun 2, 2014 at

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Aaron Davidson
You may have to do sudo jps, because it should definitely list your processes. What does hivecluster2:8080 look like? My guess is it says there are 2 applications registered, and one has taken all the executors. There must be two applications running, as those are the only things that keep open

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
So in summary:
- As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
- There is an open JIRA issue to add an option to allow clobbering.
- Even when clobbering, part- files may be left over from previous saves, which is dangerous.
Is this correct? On Mon, Jun 2,

EC2 Simple Cluster

2014-06-02 Thread Gianluca Privitera
Hi everyone, I would like to set up a very simple cluster (specifically using 2 micro instances only) of Spark on EC2 and make it run a simple Spark Streaming application I created. Has anyone actually managed to do that? Because after launching the scripts from this page:

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Aaron Davidson
Yes. On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: So in summary: - As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default. - There is an open JIRA issue to add an option to allow clobbering. - Even when clobbering, part-

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Russell Jurney
Nothing appears to be running on hivecluster2:8080. 'sudo jps' does show:
[hivedata@hivecluster2 ~]$ sudo jps
9953 PepAgent
13797 JournalNode
7618 NameNode
6574 Jps
12716 Worker
16671 RunJar
18675 Main
18177 JobTracker
10918 Master
18139 TaskTracker
7674 DataNode
I kill all processes listed. I

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
OK, thanks for confirming. Is there something we can do about that leftover part- files problem in Spark, or is that for the Hadoop team? On Monday, June 2, 2014, Aaron Davidson ilike...@gmail.com wrote: Yes. On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre Borckmans
I'm a bit confused because the PR mentioned by Patrick seems to address all these issues: https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1 Was it not accepted? Or is the description of this PR not completely implemented? Message sent from a mobile device - excuse

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Sean Owen
I assume the idea is for Spark to rm -r dir/, which would clean out everything that was there before. It's just doing this instead of the caller. Hadoop still won't let you write into a location that already exists regardless, and part of the reason is that you might end up with files

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
Fair enough. That rationale makes sense. I would prefer that a Spark clobber option also delete the destination files, but as long as it's a non-default option I can see the caller-beware side of that argument as well. Nick On Monday, June 2, 2014, Sean Owen so...@cloudera.com wrote: I assume the

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nan Zhu
I made the PR; the problem is that after many rounds of review, the configuration part was missed… sorry about that. I will fix it. Best, -- Nan Zhu On Monday, June 2, 2014 at 5:13 PM, Pierre Borckmans wrote: I'm a bit confused because the PR mentioned by Patrick seems to address all

NoSuchElementException: key not found

2014-06-02 Thread Michael Chang
Hi all, Seeing a random exception kill my Spark Streaming job. Here's a stack trace:
java.util.NoSuchElementException: key not found: 32855
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at

using Log4j to log INFO level messages on workers

2014-06-02 Thread Shivani Rao
Hello Spark fans, I am trying to log messages from my Spark application. When the main() function attempts to log using log.info(), it works great, but when I try the same command from the code that probably runs on the worker, I initially got a serialization error. To solve that, I created a

Fwd: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Mohit Nayak
Hi, I've upgraded to Spark 1.0.0. I'm not able to run any tests. They throw a *java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package* I'm using Hadoop-core 1.0.4 and running this locally. I

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
We can just add back a flag to make it backwards compatible - it was just missed during the original PR. As for adding a *third* set of clobber semantics, I'm slightly -1 on that for the following reasons: 1. It's scary to have Spark recursively deleting user files, which could easily lead to users deleting

Re: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Sean Owen
This ultimately means you have a couple copies of the servlet APIs in the build. What is your build like (SBT? Maven?) and what exactly are you depending on? On Tue, Jun 3, 2014 at 12:21 AM, Mohit Nayak wiza...@gmail.com wrote: Hi, I've upgraded to Spark 1.0.0. I'm not able to run any tests.

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Sean Owen
Is there a third way? Unless I miss something. Hadoop's OutputFormat wants the target dir to not exist no matter what, so it's just a question of whether Spark deletes it for you or errors. On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell pwend...@gmail.com wrote: We can just add back a flag to

Re: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Mohit Nayak
Hey, Thanks for the reply. I am using SBT. Here is a list of my dependencies:
val sparkCore = org.apache.spark % spark-core_2.10 % V.spark
val hadoopCore = org.apache.hadoop % hadoop-core % V.hadoop % provided
val jodaTime = com.github.nscala-time %% nscala-time

Processing audio/video/images

2014-06-02 Thread jamal sasha
Hi, How does one process data sources other than text? Let's say I have millions of mp3 (or jpeg) files and I want to use Spark to process them. How does one go about it? I have never been able to figure this out.. Let's say I have this library in Python which works like the following: import

Re: Processing audio/video/images

2014-06-02 Thread Marcelo Vanzin
Hi Jamal, If what you want is to process lots of files in parallel, the best approach is probably to load all file names into an array and parallelize that. Then each task will take a path as input and can process it however it wants. Or you could write the file list to a file, and then use

Re: Interactive modification of DStreams

2014-06-02 Thread Tathagata Das
Currently Spark Streaming does not support addition/deletion/modification of DStream after the streaming context has been started. Nor can you restart a stopped streaming context. Also, multiple spark contexts (and therefore multiple streaming contexts) cannot be run concurrently in the same JVM.

Re: Processing audio/video/images

2014-06-02 Thread Philip Ogren
I asked a question related to Marcelo's answer a few months ago. The discussion there may be useful: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html On 06/02/2014 06:09 PM, Marcelo Vanzin wrote: Hi Jamal, If what you want is to process lots of files in parallel, the

Re: Processing audio/video/images

2014-06-02 Thread jamal sasha
Hi Marcelo, Thanks for the response.. I am not sure I understand. Can you elaborate a bit? So, for example, let's take a look at this example: http://pythonvision.org/basic-tutorial
import mahotas
dna = mahotas.imread('dna.jpeg')
dnaf = ndimage.gaussian_filter(dna, 8)
But except dna.jpeg Let's

Re: Processing audio/video/images

2014-06-02 Thread jamal sasha
Thanks. Let me go thru it. On Mon, Jun 2, 2014 at 5:15 PM, Philip Ogren philip.og...@oracle.com wrote: I asked a question related to Marcelo's answer a few months ago. The discussion there may be useful: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html On

Window slide duration

2014-06-02 Thread Vadim Chekan
Hi all, I am getting an error:
14/06/02 17:06:32 INFO WindowedDStream: Time 1401753992000 ms is invalid as zeroTime is 1401753986000 ms and slideDuration is 4000 ms and difference is 6000 ms
14/06/02 17:06:32 ERROR OneForOneStrategy: key not found: 1401753992000 ms

Re: Processing audio/video/images

2014-06-02 Thread Marcelo Vanzin
The idea is simple. If you want to run something on a collection of files, do (in pseudo-python):
def processSingleFile(path):
    # Your code to process a file
files = [ file1, file2 ]
sc.parallelize(files).foreach(processSingleFile)
On Mon, Jun 2, 2014 at 5:16 PM, jamal sasha
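Fleshed out into runnable PySpark (paths and the per-file body are placeholders; the files must be readable from every worker, e.g. on a shared filesystem), using map/collect instead of foreach so the driver sees the results:

    from pyspark import SparkContext

    sc = SparkContext(appName="process-files")

    def process_single_file(path):
        # placeholder: open the file and do the real work here,
        # e.g. mahotas.imread(path) followed by a gaussian filter
        with open(path, "rb") as f:
            return (path, len(f.read()))  # toy result: file size in bytes

    files = ["/shared/data/dna1.jpeg", "/shared/data/dna2.jpeg"]  # hypothetical paths
    print(sc.parallelize(files).map(process_single_file).collect())

Marcelo's alternative of writing the file list to a file corresponds to building the same RDD with sc.textFile("file-list.txt") instead of parallelize.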

Re: Processing audio/video/images

2014-06-02 Thread jamal sasha
Phoofff.. (Mind blown)... Thank you sir. This is awesome On Mon, Jun 2, 2014 at 5:23 PM, Marcelo Vanzin van...@cloudera.com wrote: The idea is simple. If you want to run something on a collection of files, do (in pseudo-python): def processSingleFile(path): # Your code to process a file

Re: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Mohit Nayak
Hey, Yup that fixed it. Thanks so much! Is this the only solution, or could this be resolved in future versions of Spark? On Mon, Jun 2, 2014 at 5:14 PM, Sean Owen so...@cloudera.com wrote: If it's the SBT build, I suspect you are hitting https://issues.apache.org/jira/browse/SPARK-1949

Re: Window slide duration

2014-06-02 Thread Tathagata Das
I am assuming that you are referring to the "OneForOneStrategy: key not found: 1401753992000 ms" error, and not to the previous "Time 1401753992000 ms is invalid". Those two seem a little unrelated to me. Can you give us the stack trace associated with the key-not-found error? TD On Mon, Jun 2,

Re: SecurityException when running tests with Spark 1.0.0

2014-06-02 Thread Matei Zaharia
You can just use the Maven build for now, even for Spark 1.0.0. Matei On Jun 2, 2014, at 5:30 PM, Mohit Nayak wiza...@gmail.com wrote: Hey, Yup that fixed it. Thanks so much! Is this the only solution, or could this be resolved in future versions of Spark ? On Mon, Jun 2, 2014 at

Re: Window slide duration

2014-06-02 Thread Vadim Chekan
Ok, it seems like "Time ... is invalid" is part of the normal workflow, where the window DStream will ignore RDDs at moments in time that do not match the window sliding interval. But why I am getting the exception is still unclear. Here is the full stack: 14/06/02 17:21:48 INFO WindowedDStream: Time

Re: NoSuchElementException: key not found

2014-06-02 Thread Tathagata Das
Do you have the info-level logs of the application? Can you grep the value 32855 to find any references to it? Also, what version of Spark are you using (so that I can match the stack trace; it does not seem to match Spark 1.0)? TD On Mon, Jun 2, 2014 at 3:27 PM, Michael Chang

Re: Window slide duration

2014-06-02 Thread Tathagata Das
Can you give all the logs? Would like to see what is clearing the key 1401754908000 ms TD On Mon, Jun 2, 2014 at 5:38 PM, Vadim Chekan kot.bege...@gmail.com wrote: Ok, it seems like Time ... is invalid is part of normal workflow, when window DStream will ignore RDDs at moments in time when

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
(A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's output format check and overwrite files in the destination directory. But it won't clobber the directory entirely. I.e. if the directory already had part1, part2, part3, part4 and you write a new job outputting only two files (part1,
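To make the difference concrete, a hypothetical sequence run from a Spark shell where sc exists (directory and file names are illustrative):

    # first job writes four part files to the output directory
    sc.parallelize(range(100), 4).saveAsTextFile("hdfs:///out")  # part-00000 .. part-00003

    # second job writes only two partitions to the same directory
    sc.parallelize(range(10), 2).saveAsTextFile("hdfs:///out")
    # under (A), Spark 0.9: part-00000/part-00001 are overwritten, but the stale
    # part-00002/part-00003 from the first job remain, silently mixing old and new data
    # under (B), Spark 1.0: the second call fails up front with FileAlreadyExistsException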

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Patrick Wendell
Yeah we need to add a build warning to the Maven build. Would you be able to try compiling Spark with Java 6? It would be good to narrow down if you are hitting this problem or something else. On Mon, Jun 2, 2014 at 1:15 PM, Xu (Simon) Chen xche...@gmail.com wrote: Nope... didn't try Java 6.

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nan Zhu
I remember that in the earlier version of that PR, I deleted files by calling the HDFS API. We discussed and concluded that it's a bit scary to have something directly deleting users' files in Spark. Best, -- Nan Zhu On Monday, June 2, 2014 at 10:39 PM, Patrick Wendell wrote: (A) Semantics

A single build.sbt file to start Spark REPL?

2014-06-02 Thread Alexy Khrabrov
The usual way to use Spark with SBT is to package a Spark project using sbt package (e.g. per the Quick Start) and submit it to Spark using the bin/ scripts from the Spark distribution. For a plain Scala project, you don’t need to download anything; you can just get a build.sbt file with dependencies and

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nicholas Chammas
On Mon, Jun 2, 2014 at 10:39 PM, Patrick Wendell pwend...@gmail.com wrote: (B) Semantics in Spark 1.0 and earlier: Do you mean 1.0 and later? Option (B) with the exception-on-clobber sounds fine to me, btw. My use pattern is probably common but not universal, and deleting user files is indeed

Re: Re: how to construct a ClassTag object as a method parameter in Java

2014-06-02 Thread bluejoe2008
Spark 0.9.1. textInput is a JavaRDD object. I am programming in Java. 2014-06-03 bluejoe2008 From: Michael Armbrust Date: 2014-06-03 10:09 To: user Subject: Re: how to construct a ClassTag object as a method parameter in Java What version of Spark are you using? Also are you sure the type of

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Kexin Xie
+1 on Option (B) with a flag to allow the semantics in (A) for backward compatibility. Kexin On Tue, Jun 3, 2014 at 1:18 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Mon, Jun 2, 2014 at 10:39 PM, Patrick Wendell pwend...@gmail.com wrote: (B) Semantics in Spark 1.0 and earlier: Do

Re: Window slide duration

2014-06-02 Thread Vadim Chekan
Thanks for looking into this Tathagata. Are you looking for traces of the ReceiverInputDStream.clearMetadata call? Here is the log: http://wepaste.com/vchekan Vadim. On Mon, Jun 2, 2014 at 5:58 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you give all the logs? Would like to see what

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
Good catch! Yes I meant 1.0 and later. On Mon, Jun 2, 2014 at 8:33 PM, Kexin Xie kexin@bigcommerce.com wrote: +1 on Option (B) with flag to allow semantics in (A) for back compatibility. Kexin On Tue, Jun 3, 2014 at 1:18 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Mon,

Re: using Log4j to log INFO level messages on workers

2014-06-02 Thread Alex Gaudio
Hi, I had the same problem with pyspark. Here's how I resolved it: What I've found in python (not sure about scala) is that if the function being serialized was written in the same python module as the main function, then logging fails. If the serialized function is in a separate module, then
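A minimal sketch of that workaround (module and function names are made up for illustration): keep the function that runs on the workers in its own importable module, so it is pickled by reference rather than along with the whole __main__ module, and configure logging when that module is imported on the worker.

    # worker_logic.py -- a separate module available to the workers (e.g. shipped with --py-files)
    import logging

    logging.basicConfig(level=logging.INFO)  # runs when the module is imported on the worker
    log = logging.getLogger("worker_logic")

    def process(record):
        log.info("processing %s", record)  # ends up in the executor/worker stderr logs
        return record * 2

    # driver script
    from pyspark import SparkContext
    import worker_logic

    sc = SparkContext(appName="worker-logging")
    print(sc.parallelize(range(5)).map(worker_logic.process).collect())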