NullPointerException on reading checkpoint files

2014-06-09 Thread Kanwaldeep
I've been running into the following error when creating the streaming context using checkpoint files. This error occurs quite often when I stop and re-start the job. Any suggestions? 14-06-09 23:47:59 WARN CheckpointReader:91 - Error reading checkpoint from file file:/Users/kanwaldeep.dang/git/c

Re: Writing data to HBase using Spark

2014-06-09 Thread Kanwaldeep
Please see sample code attached at https://issues.apache.org/jira/browse/SPARK-944. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Writing-data-to-HBase-using-Spark-tp7304p7305.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Writing data to HBase using Spark

2014-06-09 Thread Vibhor Banga
Hi, I am reading data from an HBase table into an RDD, and then using foreach on that RDD I am doing some processing on every Result of the HBase table. After this processing I want to store the processed data back to another HBase table. How can I do that? If I use standard Hadoop and HBase classes to wri
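[A hedged sketch of what writing back to HBase from within an RDD operation might look like. The table, column family, and RDD names are hypothetical and this is not the SPARK-944 sample referenced in the reply above; it assumes the classic HBase client API of that era.]

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // assuming processedRdd: RDD[(String, String)] of (rowKey, value) pairs
    processedRdd.foreachPartition { rows =>
      val conf = HBaseConfiguration.create()
      val table = new HTable(conf, "output_table")   // one connection per partition
      rows.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
      table.close()
    }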

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread DB Tsai
Hi Nick, How does reduce work? I thought after reducing in the executor, it will reduce in parallel between multiple executors instead of pulling everything to driver and reducing there. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com Li

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread Nick Pentreath
Can you key your RDD by some key and use reduceByKey? In fact, if you are merging a bunch of maps you can create a set of (k, v) in your mapPartitions and then reduceByKey using some merge function. The reduce will happen in parallel on multiple nodes in this case. You'll end up with just a single
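[A rough Scala sketch of this approach, assuming `data` is an RDD[String]; the key type and merge logic are illustrative only.]

    import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey (Spark 1.0)

    val perPartition = data.mapPartitions { iter =>
      val local = scala.collection.mutable.Map.empty[String, Long]
      iter.foreach { x => local(x) = local.getOrElse(x, 0L) + 1L }  // one local map per partition
      local.iterator                                                // emit its (k, v) entries
    }
    val merged = perPartition.reduceByKey(_ + _)  // merging runs in parallel across executors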

Re: Spark-Streaming window processing

2014-06-09 Thread Yingjun Wu
Hi Sean, Thanks for your reply! So for the first question, any idea for measuring latency of Spark-Streaming? Regards, Yingjun -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-window-processing-tp7234p7300.html Sent from the Apache Spark Use

Re: Spark SQL standalone application compile error

2014-06-09 Thread Michael Armbrust
You need to add the following to your sbt file: libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0" On Mon, Jun 9, 2014 at 9:25 PM, shlee0605 wrote: > I am having some trouble with compiling Spark standalone application that > uses new Spark SQL feature. > I have used the exact
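[A minimal build.sbt sketch along the lines of this reply. The dependency lines match the thread; the surrounding settings are assumed boilerplate, not copied from the original file.]

    name := "spark-sql"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.0.0",
      "org.apache.spark" %% "spark-sql"  % "1.0.0"
    )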

Spark SQL standalone application compile error

2014-06-09 Thread shlee0605
I am having some trouble compiling a Spark standalone application that uses the new Spark SQL feature. I have used the exact same code from "RDDRelation.scala" that is included in Spark's examples. I set my build.sbt file as follows: name := "spark-sql" version := "1.0" scalaVersion := "2.1

performance difference between spark-shell and spark-submit

2014-06-09 Thread Xu (Simon) Chen
Hi all, I implemented a transformation on HDFS files with Spark. I first tested it in spark-shell (with YARN), then implemented essentially the same logic as a Spark program (Scala), built a jar file and used spark-submit to execute it on my YARN cluster. The weird thing is that the spark-submit approach is

Re: FileNotFoundException when using persist(DISK_ONLY)

2014-06-09 Thread Surendranauth Hiraman
Sorry for the stream of consciousness but after thinking about this a bit more, I'm thinking that the FileNotFoundExceptions are due to tasks being cancelled/restarted and the root cause is the OutOfMemoryError. If anyone has any insights on how to debug this more deeply or relevant config setting

Re: FileNotFoundException when using persist(DISK_ONLY)

2014-06-09 Thread Surendranauth Hiraman
I don't know if this is related but a little earlier in stderr, I also have the following stacktrace. But this stacktrace seems to be when the code is grabbing RDD data from a remote node, which is different from the above. 14/06/09 21:33:26 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaug

Re: Setting spark memory limit

2014-06-09 Thread Patrick Wendell
If you run locally then Spark doesn't launch remote executors. However, in this case you can set the memory with the --spark-driver-memory flag to spark-submit. Does that work? - Patrick On Mon, Jun 9, 2014 at 3:24 PM, Henggang Cui wrote: > Hi, > > I'm trying to run the SimpleApp example > (http://sp

FileNotFoundException when using persist(DISK_ONLY)

2014-06-09 Thread Surendranauth Hiraman
I have a dataset of about 10GB. I am using persist(DISK_ONLY) to avoid out of memory issues when running my job. When I run with a dataset of about 1 GB, the job is able to complete. But when I run with the larger dataset of 10 GB, I get the following error/stacktrace, which seems to be happening

Re: Spark 1.0.0 Maven dependencies problems.

2014-06-09 Thread Tobias Pfeiffer
Hi, (Apparently Google Mail is quite eager to send out mails when Ctrl+Enter is hit by accident. Sorry for the previous email.) I remembered I saw this as well and found this ugly comment in my build.sbt file: /* * [...], there is still a problem with some classes * in javax.servlet (from org.e
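[A hedged guess at what such a build.sbt workaround can look like. The thread's actual exclusion comment is truncated above; the organization/artifact names here are my assumption about the javax.servlet conflict, not a quote.]

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" excludeAll(
      ExclusionRule(organization = "org.eclipse.jetty.orbit", name = "javax.servlet")
    )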

Re: Spark 1.0.0 Maven dependencies problems.

2014-06-09 Thread Tobias Pfeiffer
Hi, I remembered I saw this as well and found this ugly comment in my build.sbt file: On Mon, Jun 9, 2014 at 11:37 PM, Sean Owen wrote: > Looks like this crept in again from the shaded Akka dependency. I'll > propose a PR to remove it. I believe that remains the way we have to > deal with the d

Re: How to process multiple classification with SVM in MLlib

2014-06-09 Thread littlebird
Thanks. Now I know how to broadcast the dataset but I still wonder, after broadcasting the dataset, how I can apply my algorithm to train the model on the workers. To describe my question in detail, the following code is used to train an LDA (Latent Dirichlet Allocation) model with JGibbLDA in single

Merging all Spark Streaming RDDs to one RDD

2014-06-09 Thread Henggang Cui
Hi, I'm wondering whether it's possible to continuously merge the RDDs coming from a stream into a single RDD efficiently. One thought is to use the union() method. But using union, I will get a new RDD each time I do a merge. I don't know how I should name these RDDs, because I remember Spark do
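[A sketch of the union-per-batch idea the question describes, assuming `ssc` is a StreamingContext and `stream` a DStream[String]; note it produces a new, ever-growing RDD lineage each batch, which is exactly the concern raised.]

    import org.apache.spark.rdd.RDD

    var merged: RDD[String] = ssc.sparkContext.parallelize(Seq.empty[String])
    stream.foreachRDD { rdd =>
      merged = merged.union(rdd)   // each batch yields a new merged RDD
      merged.cache()
    }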

SPARK on HPC Podcast

2014-06-09 Thread Brock Palen
I am one half of www.rce-cast.com, a podcast about HPC to educate the community. We would like to feature Spark on the show, and would like to invite a developer or two to spend an hour on the phone/skype with us. Feel free to contact me off list! Brock Palen www.umich.edu/~brockp CAEN Advanced

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread Chester Chen
Matei, Thanks for the insight, we have to carefully consider our design. We are in the process of moving our system to Akka, and it would be nice to use Akka all the way. But I understand the limitations. Thanks Chester On Monday, June 9, 2014 5:06 PM, Matei Zaharia wrote: In genera

SaveAsTextfile per day instead of window?

2014-06-09 Thread Shrikar archak
Hi All, Is there a way to store the streamed data as textfiles per day instead of per window? Thanks, Shrikar
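[One possible sketch, my own suggestion rather than a reply from this digest: derive the output path from the batch time inside foreachRDD so each day's data lands in its own directory. `stream` is assumed to be a DStream[String] and the HDFS path is hypothetical.]

    import java.text.SimpleDateFormat
    import java.util.Date

    val dayFormat = new SimpleDateFormat("yyyy-MM-dd")
    stream.foreachRDD { (rdd, time) =>
      val day = dayFormat.format(new Date(time.milliseconds))
      rdd.saveAsTextFile(s"hdfs:///streaming-output/$day/batch-${time.milliseconds}")
    }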

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread Matei Zaharia
In general you probably shouldn’t use actors for processing requests because Spark operations are blocking, and Akka only has a limited thread pool for each ActorSystem. You risk blocking all the threads with ongoing requests and not being able to service new ones. That said though, you can conf

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread Chester Chen
Matei, If we use different Akka actors to process different users' requests (not different threads), is the SparkContext still safe to use for different users? Yes, it would be nice to disable the UI via configuration, especially when we develop locally. We use sbt-web plugin to debug tomcat code

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread Matei Zaharia
No, there’s only one UI per SparkContext. On Jun 9, 2014, at 4:43 PM, DB Tsai wrote: > What if there are multiple threads using the same spark context, will > each of thread have it own UI? In this case, it will quickly run out > of the ports. > > Thanks. > > Sincerely, > > DB Tsai > ---

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread DB Tsai
What if there are multiple threads using the same spark context, will each thread have its own UI? In this case, it will quickly run out of ports. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linke

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread Matei Zaharia
You currently can’t have multiple SparkContext objects in the same JVM, but within a SparkContext, all of the APIs are thread-safe so you can share that context between multiple threads. The other issue you’ll run into is that in each thread where you want to use Spark, you need to use SparkEnv.
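[A minimal sketch of sharing one SparkContext across threads, along the lines of this advice. The file path is hypothetical, and the SparkEnv.set call is my reading of the truncated sentence above, so treat it as an assumption for Spark 1.0.]

    import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

    object SharedContextSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("shared-sc"))
        val env = SparkEnv.get                    // the env created along with the context
        val threads = (1 to 4).map { i =>
          new Thread(new Runnable {
            override def run(): Unit = {
              SparkEnv.set(env)                   // register the env in this request thread
              val preview = sc.textFile("/tmp/sample.txt").take(200)
              println(s"request $i previewed ${preview.length} lines")
            }
          })
        }
        threads.foreach(_.start())
        threads.foreach(_.join())
        sc.stop()
      }
    }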

Re: Errors when building Spark with sbt

2014-06-09 Thread Sean Owen
I see no problems building the assembly right now. "Connection refused" sounds like an error with the remote repo, or your network restrictions. I don't believe this is related to the project itself. On an error retrieving an artifact from one repo, it will try others, some of which don't actually

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread Sung Hwan Chung
I suppose what I want is the memory efficiency of toLocalIterator and the speed of collect. Is there any such thing? On Mon, Jun 9, 2014 at 3:19 PM, Sung Hwan Chung wrote: > Hello, > > I noticed that the final reduce function happens in the driver node with a > code that looks like the followin

Errors when building Spark with sbt

2014-06-09 Thread SK
I tried to use "sbt/sbt assembly" to build spark-1.0.0. I get a lot of "Server access error: Connection refused" errors when it tries to download from repo.eclipse.org and repository.jboss.org. I tried to navigate to these links manually and some of these links are obsolete (Error 404). Conside

Is spark context in local mode thread-safe?

2014-06-09 Thread DB Tsai
Hi guys, We would like to use the Spark Hadoop API to get the first couple hundred lines at design time to quickly show users the file structure/metadata, and the values in those lines, without launching the full Spark job in the cluster. Since we're a web-based application, there will be multiple users us

Setting spark memory limit

2014-06-09 Thread Henggang Cui
Hi, I'm trying to run the SimpleApp example ( http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala) on a larger dataset. The input file is about 1GB, but when I run the Spark program, it says: "java.lang.OutOfMemoryError: GC overhead limit exceeded", the full error output

Spilled shuffle files not being cleared

2014-06-09 Thread Michael Chang
Hi all, I'm seeing exceptions that look like the below in Spark 0.9.1. It looks like I'm running out of inodes on my machines (I have around 300k each in a 12 machine cluster). I took a quick look and I'm seeing some shuffle spill files that are still around even 12 minutes after they are creat

Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread Sung Hwan Chung
Hello, I noticed that the final reduce function happens in the driver node with a code that looks like the following. val outputMap = mapPartition(domsomething).reduce(a: Map, b: Map) { a.merge(b) } although individual outputs from mappers are small. Over time the aggregated result outputMap co

Spark Streaming application not working on EC2 Cluster

2014-06-09 Thread Gianluca Privitera
Hi, I think I may have encountered some kind of bug that at the moment prevents the correct running of my application on an EC2 cluster. I'm saying that because the same exact code works wonderfully locally but has a really strange behaviour on the cluster. val uri = ssc.textFileStream(args(1) +

Spark 0.9.1 - saveAsTextFile() exception: _temporary doesn't exist!

2014-06-09 Thread Oleg Proudnikov
Hi All, After a few simple transformations I am trying to save to a local file system. The code works in local mode but not on a standalone cluster. The directory 1.txt/_temporary does exist after the exception. I would appreciate any suggestions. scala> d3.sample(false,0.01,1).map( pair

Re: Spark SQL JDBC Connectivity and more

2014-06-09 Thread Michael Armbrust
> [Venkat] Are you saying - pull in the SharkServer2 code in my standalone > spark application (as a part of the standalone application process), pass > in > the spark context of the standalone app to SharkServer2 Sparkcontext at > startup and viola we get a SQL/JDBC interfaces for the RDDs of t

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
Hi Matei, Yeah, you are right, this is very niche (my use case is a web crawler), but I'm glad you also like an additional property. Let me open a JIRA. Yours Peng On Mon 09 Jun 2014 03:08:29 PM EDT, Matei Zaharia wrote: If this is a useful feature for local mode, we should open a JIRA to doc

Re: How to enable fault-tolerance?

2014-06-09 Thread Matei Zaharia
If this is a useful feature for local mode, we should open a JIRA to document the setting or improve it (I’d prefer to add a spark.local.retries property instead of a special URL format). We initially disabled it for everything except unit tests because 90% of the time an exception in local mode

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
Oh, and to make things worse, they forgot '\*' in their regex. Am I the first to encounter this problem? On Mon 09 Jun 2014 02:24:43 PM EDT, Peng Cheng wrote: Thanks a lot! That's very responsive, somebody definitely has encountered the same problem before, and added two hidden modes in m

Re: Spark SQL JDBC Connectivity and more

2014-06-09 Thread Venkat Subramanian
1) If I have a standalone spark application that has already built an RDD, how can SharkServer2, or for that matter Shark, access 'that' RDD and do queries on it? In all the examples I have seen for Shark, the RDDs (tables) are created within Shark's spark context and processed. This is not possible out

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
Thanks a lot! That's very responsive, somebody definitely has encountered the same problem before, and added two hidden modes in masterURL: (from SparkContext.scala: line1431) // Regular expression for local[N, maxRetries], used in tests with failing tasks val LOCAL_N_FAILURES_REGEX =
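[A sketch of using that hidden master URL in Spark 1.0; per the follow-up message above the regex apparently lacks '\s*', so the URL is written without a space after the comma.]

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local[4,3]")          // 4 worker threads; the second number is maxRetries per the comment above
      .setAppName("local-with-retries")
    val sc = new SparkContext(conf)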

Re: ArrayIndexOutOfBoundsException when reading bzip2 files

2014-06-09 Thread sam
Any idea when they will release it? Also, I'm uncertain what we will need to do to fix the shell. Will we have to reinstall Spark, or reinstall Hadoop? (I'm not a devops person, so maybe this question sounds silly.) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Arr

Re: How to process multiple classification with SVM in MLlib

2014-06-09 Thread Xiangrui Meng
For broadcast data, please read http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables . For one-vs-all, please read https://en.wikipedia.org/wiki/Multiclass_classification . -Xiangrui On Mon, Jun 9, 2014 at 7:24 AM, littlebird wrote: > Thank you for your reply, I don't q
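[A minimal Scala sketch of the broadcast-variable mechanism that documentation page describes. The data and the per-model work here are stand-ins; the thread's actual training code is Java/JGibbLDA.]

    val dataset = Array.fill(1000, 10)(scala.util.Random.nextDouble())  // stand-in training data
    val bcData = sc.broadcast(dataset)
    val numModels = 4

    val summaries = sc.parallelize(0 until numModels, numModels).map { i =>
      val localData = bcData.value   // each task reads the broadcast copy instead of re-shipping the data
      (i, localData.length)          // placeholder for "train model i on localData"
    }.collect()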

Re: How to enable fault-tolerance?

2014-06-09 Thread Aaron Davidson
Looks like your problem is local mode: https://github.com/apache/spark/blob/640f9a0efefd42cff86aecd4878a3a57f5ae85fa/core/src/main/scala/org/apache/spark/SparkContext.scala#L1430 For some reason, someone decided not to do retries when running in local mode. Not exactly sure why, feel free to submi

Re: Occasional failed tasks

2014-06-09 Thread Peng Cheng
I think these failed tasks must have been retried automatically if you can't see any error in your results. Otherwise the entire application will throw a SparkException and abort. Unfortunately I don't know how to do this, my application always aborts. -- View this message in context: http://apache-s

Re: Failed to remove RDD error

2014-06-09 Thread Michael Chang
Unfortunately I don't have any logs at the moment. I will post them here if they occur again! On Tue, Jun 3, 2014 at 11:04 AM, Tathagata Das wrote: > It was not intended to be experimental as this improves general > performance. We tested the feature since 0.9, and didnt see any problems. > We n

Re: implementing the VectorAccumulatorParam

2014-06-09 Thread Sean Owen
(BCC dev@) The example is out of date with respect to current Vector class. The zeros() method is on "Vectors". There is not currently a += operation for Vector anymore. To be fair the example doesn't claim this illustrates use of the Spark Vector class but it did work with the now-deprecated Vec
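[A sketch, my own rather than from the thread, of an AccumulatorParam written against the current MLlib Vector, summing element-wise through toArray since Vector no longer has a += operator.]

    import org.apache.spark.AccumulatorParam
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    object VectorAccumulatorParam extends AccumulatorParam[Vector] {
      def zero(initialValue: Vector): Vector = Vectors.zeros(initialValue.size)
      def addInPlace(v1: Vector, v2: Vector): Vector = {
        val result = v1.toArray
        val other = v2.toArray
        var i = 0
        while (i < result.length) { result(i) += other(i); i += 1 }
        Vectors.dense(result)
      }
    }

    // usage: val acc = sc.accumulator(Vectors.zeros(3))(VectorAccumulatorParam)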

Re: implementing the VectorAccumulatorParam

2014-06-09 Thread Sean Owen
(The user@ list might be a bit better but I can see why it might look like a dev@ question.) Did you import org.apache.spark.mllib.linalg.Vector ? I think you are picking up Scala's Vector class instead. On Mon, Jun 9, 2014 at 11:57 AM, dataginjaninja wrote: > The programming-guide >

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
I speculate that Spark will only retry on exceptions that are registered with TaskSetScheduler, so a definitely-will-fail task will fail quickly without taking more resources. However I haven't found any documentation or web page on it -- View this message in context: http://apache-spark-user-l

Re: spark1.0 spark sql saveAsParquetFile Error

2014-06-09 Thread Peng Cheng
I wasn't using spark sql before. But by default Spark should retry the failed task 4 times. I'm curious why it aborted after 1 failure -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-spark-sql-saveAsParquetFile-Error-tp7006p7252.html Sent from the

Re: Spark 1.0.0 Maven dependencies problems.

2014-06-09 Thread Sean Owen
Looks like this crept in again from the shaded Akka dependency. I'll propose a PR to remove it. I believe that remains the way we have to deal with the differing Netty/Jetty dependencies floating around. On Mon, Jun 9, 2014 at 9:53 AM, toivoa wrote: > I am using Maven from Eclipse > > dependency:

Re: How to process multiple classification with SVM in MLlib

2014-06-09 Thread littlebird
Thank you for your reply, I don't quite understand how to do one-vs-all manually for multiclass training. And for the second question, my algorithm is implemented in Java and designed for a single machine. How can I broadcast the dataset to each worker and train models on the workers? Thank you very much.

Re: Spark 1.0.0 Maven dependencies problems.

2014-06-09 Thread toivoa
I am using Maven from Eclipse dependency:tree shows [INFO] +- org.apache.spark:spark-core_2.10:jar:1.0.0:compile [INFO] | +- net.java.dev.jets3t:jets3t:jar:0.7.1:runtime [INFO] | +- org.apache.curator:curator-recipes:jar:2.4.0:compile [INFO] | | +- org.apache.curator:curator-framework:jar:2.

Re: Spark 1.0.0 Maven dependencies problems.

2014-06-09 Thread Sean Owen
On the surface it sounds again like https://issues.apache.org/jira/browse/SPARK-1949 but not sure it's exactly the same problem. The build exclusions are intended to remove dependency on the old Netty artifact. I think the Maven build has it sorted out though. Are you using Maven? If so, what do yo

Spark 1.0.0 Maven dependencies problems.

2014-06-09 Thread toivoa
Using org.apache.spark spark-core_2.10 1.0.0 I can create a simple test and run it under Eclipse. But when I try to deploy on a test server I have dependency problems. 1. Spark requires akka-remote_2.10 2.2.3-shaded-

Re: ArrayIndexOutOfBoundsException when reading bzip2 files

2014-06-09 Thread MEETHU MATHEW
Hi Sean, Thank you for the fast response. Thanks & Regards, Meethu M On Monday, 9 June 2014 6:04 PM, Sean Owen wrote: Have a search online / at the Spark JIRA. This was a known upstream bug in Hadoop. https://issues.apache.org/jira/browse/SPARK-1861 On Mon, Jun 9, 2014 at 7:54 AM, MEE

Re: ArrayIndexOutOfBoundsException when reading bzip2 files

2014-06-09 Thread Sean Owen
Have a search online / at the Spark JIRA. This was a known upstream bug in Hadoop. https://issues.apache.org/jira/browse/SPARK-1861 On Mon, Jun 9, 2014 at 7:54 AM, MEETHU MATHEW wrote: > Hi, > I am getting ArrayIndexOutOfBoundsException while reading from bz2 files in > HDFS.I have come across

Re: ArrayIndexOutOfBoundsException when reading bzip2 files

2014-06-09 Thread MEETHU MATHEW
Hi Akhil, Please find the code below. x = sc.textFile("hdfs:///**") x = x.filter(lambda z:z.split(",")[0]!=' ') x = x.filter(lambda z:z.split(",")[3]!=' ') z = x.reduce(add) Thanks & Regards, Meethu M On Monday, 9 June 2014 5:52 PM, Akhil Das wrote: Can you paste the piece of code!?

Re: ArrayIndexOutOfBoundsException when reading bzip2 files

2014-06-09 Thread Akhil Das
Can you paste the piece of code!? Thanks Best Regards On Mon, Jun 9, 2014 at 5:24 PM, MEETHU MATHEW wrote: > Hi, > I am getting ArrayIndexOutOfBoundsException while reading from bz2 files > in HDFS.I have come across the same issue in JIRA at > https://issues.apache.org/jira/browse/SPARK-1861

Re: Spark-Streaming window processing

2014-06-09 Thread Sean Owen
To your second question, that is what the 'invFunc' in reduceByKeyAndWindow() does. If you can supply an "un-reduce" function the windows can be updated rather than recomputed each time. On Mon, Jun 9, 2014 at 5:39 AM, Yingjun Wu wrote: > Dear all, > > I just run the window processing job using S
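[A short sketch of that API, assuming `words` is a DStream[String]; the window and slide durations are illustrative, and incremental windows like this also require a checkpoint directory on the StreamingContext.]

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream functions (Spark 1.0)

    val windowedCounts = words.map(w => (w, 1L)).reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,   // fold in values entering the window
      (a: Long, b: Long) => a - b,   // "un-reduce": subtract values leaving the window
      Seconds(60),                   // window length
      Seconds(10))                   // slide interval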

ArrayIndexOutOfBoundsException when reading bzip2 files

2014-06-09 Thread MEETHU MATHEW
Hi, I am getting ArrayIndexOutOfBoundsException while reading from bz2 files in HDFS. I have come across the same issue in JIRA at https://issues.apache.org/jira/browse/SPARK-1861, but it seems to be resolved. I have tried the workaround suggested (SPARK_WORKER_CORES=1), but it's still showing err

Re: mllib, python and SVD

2014-06-09 Thread Nick Pentreath
Don't think SVD is exposed via MLlib in Python yet, but you can also check out: https://github.com/ogrisel/spylearn where Jeremy Freeman put together a numpy-based SVD algorithm (this is a bit outdated but should still work I assume) (also https://github.com/freeman-lab/thunder has a PCA implement
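[For reference, a sketch, mine rather than from this reply, of the Scala route that Spark 1.0's MLlib does expose: computeSVD on RowMatrix.]

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)))
    val mat = new RowMatrix(rows)
    val svd = mat.computeSVD(2, computeU = true)   // top 2 singular values, with U
    println(svd.s)                                 // singular values as a local Vector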

Spark-Streaming window processing

2014-06-09 Thread Yingjun Wu
Dear all, I just ran the window processing job using Spark-Streaming, and I have two questions. First, how can I measure the latency of Spark-Streaming? Are there any APIs that I can call directly? Second, is it true that the latency of Spark-Streaming grows linearly with the window size? It seems

mllib, python and SVD

2014-06-09 Thread Håvard Wahl Kongsgård
Hi, is it possible to do singular value decomposition (SVD) with Python in Spark (1.0.0)? -Havard WK

Re: Classpath errors with Breeze

2014-06-09 Thread dlaw
Thanks Xiangrui, that did the trick. Dieterich On Jun 8, 2014, at 10:17 PM, Xiangrui Meng [via Apache Spark User List] wrote: > Hi dlaw, > > You are using breeze-0.8.1, but the spark assembly jar depends on > breeze-0.7. If the spark assembly jar comes the first on the classpath > but the

how to improve sharkserver2's parallelism performance?

2014-06-09 Thread qingyang li
If I have to submit a job that takes 3 seconds, how many jobs can sharkserver2 handle at the same time? How can I improve sharkserver2's parallelism performance?

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-09 Thread Aaron Davidson
It is not a very good idea to save the results in the exact same place as the data. Any failures during the job could lead to corrupted data, because recomputing the lost partitions would involve reading the original (now-nonexistent) data. As such, the only "safe" way to do this would be to do as