java.io.IOException: Failed to save output of task

2014-05-22 Thread Grega Kešpret
Hello, my last reduce task in the job always fails with java.io.IOException: Failed to save output of task when using saveAsTextFile with an S3 endpoint (all the other tasks succeed). Has anyone had similar problems? https://gist.github.com/gregakespret/813b540faca678413ad4

Re: Using Spark to analyze complex JSON

2014-05-22 Thread Michael Cutler
Hi Nick, Here is an illustrated example which extracts certain fields from Facebook messages, each one is a JSON object and they are serialised into files with one complete JSON object per line. Example of one such message: CandyCrush.json https://gist.github.com/cotdp/131a1c9fc620ab7898c4 You

Spark Streaming on Mesos, various questions

2014-05-22 Thread Tobias Pfeiffer
Hi, with the hints from Gerard I was able to get my locally working Spark code running on Mesos. Thanks! Basically, on my local dev machine, I use sbt assembly to create a fat jar (which is actually not so fat since I use ... % 'provided' in my sbt file for the Spark dependencies), upload it to

Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-22 Thread Kevin Markey
Update: Partly user error, but still getting the FS closed error. Yes, we are running plain vanilla Hadoop 2.3.0, but it probably doesn't matter. 1. Tried Colin McCabe's suggestion to patch with pull 850 (https://issues.apache.org/jira/browse/SPARK-1898). No

Spark on HBase vs. Spark on HDFS

2014-05-22 Thread Limbeck, Philip
Hi! We are currently using HBase as our primary data store for different event-like data. On top of that, we use Shark to aggregate this data and keep it in memory for fast data access. Since we use no specific HBase functionality whatsoever except putting data into it, a discussion came up on

Re: Spark on HBase vs. Spark on HDFS

2014-05-22 Thread Nick Pentreath
Hi In my opinion, running HBase for immutable data is generally overkill in particular if you are using Shark anyway to cache and analyse the data and provide the speed. HBase is designed for random-access data patterns and high throughput R/W activities. If you are only ever writing immutable

Re: Using Spark to analyze complex JSON

2014-05-22 Thread Flavio Pompermaier
Is there a way to query fields by similarity (like Lucene, or using a similarity metric), to be able to query something like WHERE language LIKE 'it~0.5'? Best, Flavio On Thu, May 22, 2014 at 8:56 AM, Michael Cutler mich...@tumra.com wrote: Hi Nick, Here is an illustrated example which

Spark Streaming Error: SparkException: Error sending message to BlockManagerMaster

2014-05-22 Thread Sourav Chandra
Hi, I am running Spark streaming application. I have faced some uncaught exception after which my worker stops processing any further messages. I am using *spark 0.9.0* Could you please let me know what could be the cause of this and how to overcome this issue? [ERROR] [05/22/2014

SparkContext#stop

2014-05-22 Thread Piotr Kołaczkowski
Hi, We observed strange behaviour of Spark 0.9.0 when using sc.stop(). We have a bunch of applications that perform some jobs and then issue sc.stop() at the end of main. Most of the time everything works as desired, but sometimes the applications get marked as FAILED by the master and all

Workers disconnected from master sometimes and never reconnect back

2014-05-22 Thread Piotr Kołaczkowski
Hi, Another problem we observed: on a very heavily loaded cluster, if a worker fails to respond to the heartbeat within 60 seconds, it gets disconnected permanently from the master and never connects back again. It is very easy to reproduce - just set up a Spark standalone cluster on a

Re: SparkContext#stop

2014-05-22 Thread Andrew Or
You should always call sc.stop(), so it cleans up state and does not fill up your disk over time. The strange behavior you observe is mostly benign, as it only occurs after you have supposedly finished all of your work with the SparkContext. I am not aware of a bug in Spark that causes this
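
In code, a minimal sketch of that pattern, wrapping the work in try/finally so stop() runs even when a job throws (conf and job body are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("my-app"))
    try {
      sc.parallelize(1 to 100).map(_ * 2).count()  // your actual jobs go here
    } finally {
      sc.stop()  // always clean up state, even on failure
    }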

Re: SparkContext#stop

2014-05-22 Thread Piotr Kołaczkowski
No exceptions in any logs. No errors in stdout or stderr. 2014-05-22 11:21 GMT+02:00 Andrew Or and...@databricks.com: You should always call sc.stop(), so it cleans up state and does not fill up your disk over time. The strange behavior you observe is mostly benign, as it only occurs after

how to set task number?

2014-05-22 Thread qingyang li
i am using tachyon as the storage system and shark to query a table which is a big table. i have 5 machines as a spark cluster; there are 4 cores on each machine. My question is: 1. how to set the task number on each core? 2. where to see how many partitions one RDD has?

Re: Ignoring S3 0 files exception

2014-05-22 Thread Laurent Thoulon
Hi Mayur, Thanks for your help. I'm not sure I understand what parameters I must give to newAPIHadoopFile:

    newAPIHadoopFile[K, V, F <: InputFormat[K, V]](
        path: String,
        fClass: Class[F],
        kClass: Class[K],
        vClass: Class[V],
        conf: Configuration): JavaPairRDD[K, V]

It seems
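
For reference, a minimal Scala call matching that signature (the Java API takes the same class arguments in the same order; the path and formats here are illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val rdd = sc.newAPIHadoopFile(
      "s3n://my-bucket/logs/*",         // path (bucket name illustrative)
      classOf[TextInputFormat],         // F: the InputFormat to use
      classOf[LongWritable],            // K: the InputFormat's key type (byte offset)
      classOf[Text],                    // V: the InputFormat's value type (the line)
      new Configuration(sc.hadoopConfiguration))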

Re: how to set task number?

2014-05-22 Thread qingyang li
i have added SPARK_JAVA_OPTS+=-Dspark.default.parallelism=40 in shark-env.sh 2014-05-22 17:50 GMT+08:00 qingyang li liqingyang1...@gmail.com: i am using tachyon as the storage system and shark to query a table which is a big table, i have 5 machines as a spark cluster, there are 4
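
For the second question above, the partition count is visible on the RDD itself, and shuffle operations accept an explicit task count. A minimal Scala sketch (the RDDs and the value 40 are illustrative):

    println(rdd.partitions.size)              // how many partitions this RDD has
    val counts = pairs.reduceByKey(_ + _, 40) // run this reduce with 40 tasks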

Re: how to set task number?

2014-05-22 Thread qingyang li
my aim in setting the task number is to increase query speed, and I have also found that mapPartitionsWithIndex at Operator.scala:333 (http://192.168.1.101:4040/stages/stage?id=17) is costing much time. so, my other question is: how to tune

Drop shark Cache

2014-05-22 Thread vinay Bajaj
Hi, Is there any way or command by which we can wipe/drop the whole Shark cache in one go? Thanks, Vinay Bajaj

reading task failed 4 times, for unknown reason

2014-05-22 Thread Kostakis, Orestis
We have an instance of Spark running on top of Mesos and GlusterFS. Due to fixes for some bugs that we also came across, we installed the latest versions: 1.0.0-rc9 (spark-1.0.0-bin-2.0.5-alpha, java 1.6.0_27) and Mesos 0.18.1. Since then, moderately sized tasks (10-20GB) cannot complete. I notice

spark setting maximum available memory

2014-05-22 Thread İbrahim Rıza HALLAÇ
In my situation each slave has 8 GB of memory. I want to use the maximum memory that I can: .set("spark.executor.memory", "?g") How can I determine the amount of memory I should set? It fails when I set it to 8GB.

ETL and workflow management on Spark

2014-05-22 Thread William Kang
Hi, We are moving towards adopting the full Spark stack. So far we have used Shark to do some ETL work, which is not bad but not perfect either. We ended up writing UDFs, UDGFs and UDAFs that could be avoided if we could use Pig. Do you have any suggestions for an ETL solution on the Spark stack? And

Computing cosine similiarity using pyspark

2014-05-22 Thread jamal sasha
Hi, I have a bunch of vectors like [0.1234, -0.231, 0.23131] and so on, and I want to compute cosine similarity and Pearson correlation using pyspark. How do I do this? Any ideas? Thanks
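
As far as I know, MLlib at this point ships no all-pairs cosine-similarity helper, so the usual approach is to hand-roll it over RDD pairs. A minimal Scala sketch (the same cartesian/map pattern carries over to pyspark; names and types are illustrative):

    def cosine(a: Array[Double], b: Array[Double]): Double = {
      val dot  = a.zip(b).map { case (x, y) => x * y }.sum
      val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
      dot / (norm(a) * norm(b))
    }

    // vectors: RDD[(Long, Array[Double])]; all pairs via cartesian, keeping i < j
    val sims = vectors.cartesian(vectors)
      .filter { case ((i, _), (j, _)) => i < j }
      .map { case ((i, a), (j, b)) => ((i, j), cosine(a, b)) }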

Use SparkListener to get overall progress of an action

2014-05-22 Thread Pierre B
Is there a simple way to monitor the overall progress of an action using SparkListener or anything else? I see that one can name an RDD... Could that be used to determine which action triggered a stage, ...? Thanks Pierre
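
One approach is to register a listener and count task completions against the stage totals. A minimal Scala sketch (method and field names follow the 1.x-era listener API and vary slightly between Spark versions):

    import org.apache.spark.scheduler._

    class ProgressListener extends SparkListener {
      @volatile private var total = 0
      @volatile private var done = 0
      override def onStageSubmitted(s: SparkListenerStageSubmitted) {
        total += s.stageInfo.numTasks          // tasks expected for this stage
      }
      override def onTaskEnd(t: SparkListenerTaskEnd) {
        done += 1
        println(s"progress: $done/$total tasks")
      }
    }

    sc.addSparkListener(new ProgressListener)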

Re: ETL and workflow management on Spark

2014-05-22 Thread Derek Schoettle
unsubscribe From: William Kang weliam.cl...@gmail.com To: user@spark.apache.org Date: 05/22/2014 10:50 AM Subject: ETL and workflow management on Spark Hi, We are moving into adopting the full stack of Spark. So far, we have used Shark to do some ETL work, which is not bad

Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread andy petrella
SparkListener offers good stuff, but I also complemented it with my own metrics machinery that uses Akka to aggregate metrics from anywhere I'd like to collect them (with no dependency on Ganglia, just on Codahale). However, this was useful to gather some custom metrics (from within the tasks

Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread Pierre B
Hi Andy! Yes, the Spark UI provides a lot of interesting information for debugging purposes. Here I’m trying to integrate simple progress monitoring into my app UI. I’m typically running a few “jobs” (or rather actions), and I’d like to be able to display the progress of each of those in my UI. I

Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread andy petrella
Yeah, for that I actually used Codahale directly with my own machinery, using the Akka system from within Spark itself. So the workers send messages back to a bunch of actors on the driver which use Codahale metrics. This way I can collect what/how an executor does/did, but I can also aggregate

Shark resilience to unusable slaves

2014-05-22 Thread Yana Kadiyska
Hi, I am running into a pretty concerning issue with Shark (granted I'm running v. 0.8.1). I have a Spark slave node that has run out of disk space. When I try to start Shark it attempts to deploy the application to a directory on that node, fails and eventually gives up (I see a Master Removed

controlling the time in spark-streaming

2014-05-22 Thread Ian Holsman
Hi. I'm writing a pilot project, and plan on using spark's streaming app for it. To start with I have a dump of some access logs with their own timestamps, and am using the textFileStream and some old files to test it with. One of the issues I've come across is simulating the windows. I would

Re: ETL and workflow management on Spark

2014-05-22 Thread Mayur Rustagi
Hi, We are in the process of migrating Pig onto Spark. What is your current Spark setup? Which version and cluster manager do you use? Also, what is the data size you are working with right now? Regards, Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: logging in pyspark

2014-05-22 Thread Shivani Rao
I am having trouble adding logging to the class that does serialization and deserialization. Where is the code for org.apache.spark.Logging located? And is it serializable? On Mon, May 12, 2014 at 10:02 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, yes, that is correct. You

Akka disassociation on Java SE Embedded

2014-05-22 Thread Chanwit Kaewkasi
Hi all, On an ARM cluster, I have been testing a wordcount program with JRE 7 and everything is OK. But when changing to the embedded version of Java SE (Oracle's eJRE), the same program cannot complete all computing stages; it fails with many Akka disassociations. - I've been trying to

How to turn off MetadataCleaner?

2014-05-22 Thread Adrian Mocanu
Hi, After using Spark's TestSuiteBase to run some tests, I've noticed that at the end, after finishing all tests, the cleaner is still running and periodically outputs the following: INFO o.apache.spark.util.MetadataCleaner - Ran metadata cleaner for SHUFFLE_BLOCK_MANAGER I use method

Fwd: Spark Streaming: Flume stream not found

2014-05-22 Thread Andy Konwinski
I'm forwarding this email along which contains a question from a Spark user Adrien (CC'd) who can't successfully get any emails through to the Apache mailing lists. Please reply-all when responding to include Adrien. See below for his question. -- Forwarded message -- From: Adrien

Re: java.io.IOException: Failed to save output of task

2014-05-22 Thread Grega Kešpret
I have since resolved the issue. The problem was that multiple RDDs were trying to write to the same S3 bucket. Grega -- *Grega Kešpret* Analytics engineer Celtra - Rich Media Mobile Advertising celtra.com http://www.celtra.com/ |

Re: Using Spark to analyze complex JSON

2014-05-22 Thread Michael Cutler
I am not 100% sure of the functionality in Catalyst; probably the easiest way to see what it supports is to look at SqlParser.scala (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala) in Git. Straight away I can see LIKE, RLIKE and

Spark Streaming with Kafka | Check if DStream is Empty | HDFS Write

2014-05-22 Thread Anish Sneh
Hi All, I am using Spark Streaming with Kafka. I receive messages, and after minor processing I write them to HDFS; as of now I am using the saveAsTextFiles() / saveAsHadoopFiles() Java methods. Is there some default way of writing a stream to Hadoop, like the HDFS sink concept in Flume? I mean
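
A common workaround for skipping empty batches is a count guard inside foreachRDD. A minimal Scala sketch (note that count() itself triggers a job, so the check is not free; the output path is illustrative):

    stream.foreachRDD { rdd =>
      if (rdd.count() > 0) {   // only write non-empty batches
        rdd.saveAsTextFile("hdfs://namenode:8020/events/batch-" + System.currentTimeMillis)
      }
    }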

Spark Job Server first steps

2014-05-22 Thread Gerard Maas
Hi, I'm starting to explore the Spark Job Server contributed by Ooyala [1], running from the master branch. I started by developing and submitting a simple job and the JAR check gave me errors on a seemingly good jar. I disabled the fingerprint checking on the jar and I could submit it, but

Re: Spark Job Server first steps

2014-05-22 Thread Michael Cutler
Hi Gerard, We're using the Spark Job Server in production, from GitHub [master], running against a recent Spark 1.0 snapshot, so it definitely works. I'm afraid the only time we've seen a similar error was an unfortunate case of PEBKAC (http://en.wikipedia.org/wiki/User_error). First and foremost,

Spark / YARN classpath issues

2014-05-22 Thread Jon Bender
Hey all, I'm working through the basic SparkPi example on a YARN cluster, and I'm wondering why my containers don't pick up the Spark assembly classes. I built the latest Spark code against CDH 5.0.0, then ran the following:

Re: Spark / YARN classpath issues

2014-05-22 Thread Andrew Or
Hi Jon, Your configuration looks largely correct. I have very recently confirmed that the way you launch SparkPi also works for me. I have run into the same problem a bunch of times. My best guess is that this is a Java version issue. If the Spark assembly jar is built with Java 7, it cannot be

Re: Error while launching ec2 spark cluster with HVM (r3.large)

2014-05-22 Thread adparker
I had this problem too and fixed it by setting the wait timeout to a larger value: --wait. For example, in standalone mode with default values, a timeout of 480 seconds worked for me:

    $ cd spark-0.9.1/ec2
    $ ./spark-ec2 --key-pair= --identity-file= --instance-type=r3.large --wait=480

How to Run Machine Learning Examples

2014-05-22 Thread yxzhao
I want to run the LR, SVM, and NaiveBayes algorithms implemented in the following directory on my data set, but I did not find a sample command line to run them. Can anybody help? Thanks. spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification

Re: How to Run Machine Learning Examples

2014-05-22 Thread Stephen Boesch
There is a script for that: bin/run-example.sh <example-class> [args] 2014-05-22 12:48 GMT-07:00 yxzhao yxz...@ualr.edu: I want to run the LR, SVM, and NaiveBayes algorithms implemented in the following directory on my data set. But I did not find the sample command line to run them. Anybody help? Thanks.

Re: spark setting maximum available memory

2014-05-22 Thread Andrew Or
Hi Ibrahim, If your worker machines only have 8GB of memory, then launching executors with all of it will leave no room for system processes. There is no hard guideline, but I usually leave around 1GB just to be safe, so: conf.set("spark.executor.memory", "7g") Andrew 2014-05-22 7:23 GMT-07:00
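
Putting that advice together, a minimal Scala sketch (master URL and app name illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("MyApp")
      .set("spark.executor.memory", "7g")  // leave ~1GB of the 8GB for the OS and daemons
    val sc = new SparkContext(conf)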

Re: Spark Job Server first steps

2014-05-22 Thread Gerard Maas
Hi Michael, Thanks for the tip on the /tmp dir. I had unzipped all the jars before uploading to check for the class. The issue is that the jars were not uploaded correctly: I was not familiar with the '@' syntax of curl and omitted it, resulting in a jar file containing only the jar's name. curl
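
For anyone hitting the same thing: with curl, a bare string after --data-binary is sent literally, while an '@' prefix tells curl to read the request body from a file. So the jar upload needs the '@', along the lines of (jar path and app name illustrative):

    $ curl --data-binary @target/my-job.jar localhost:8090/jars/myapp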

Re: Spark / YARN classpath issues

2014-05-22 Thread Jon Bender
Andrew, brilliant! I built on Java 7 but was still running our cluster on Java 6. Upgraded the cluster and it worked (with slight tweaks to the args; I guess the app args come first and yarn-standalone comes last):

Re: How to Run Machine Learning Examples

2014-05-22 Thread Marco Shaw
About run-example: I've tried the MapR, Hortonworks and Cloudera distributions with their Spark packages and none seem to include it. Am I missing something? Is this only provided with the Spark project pre-built binaries or from source installs? Marco On May 22, 2014, at 5:04 PM, Stephen

Re: spark setting maximum available memory

2014-05-22 Thread Mayur Rustagi
Ideally you should use less; 75% would be good, leaving enough scratch space for shuffle writes and system processes. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, May 23, 2014 at 1:41 AM, Andrew Or

Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-22 Thread Mohit Jaggi
Andrew, I did not register anything explicitly, based on the belief that the class name is written out in full only once. I also wondered why that problem would be specific to Joda-Time and not show up with java.util.Date... I guess it is possible given the internals of Joda-Time. If I remove DateTime

Re: Spark / YARN classpath issues

2014-05-22 Thread Andrew Or
I think you should be able to drop yarn-standalone altogether. We recently updated SparkPi to take in 1 argument (num slices, which you set to 10). Previously, it took in 2 arguments, the master and num slices. Glad you got it figured out. 2014-05-22 13:41 GMT-07:00 Jon Bender

Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-22 Thread Andrew Ash
Spark uses the Twitter Chill library, which registers a bunch of core Scala and Java classes by default. I'm assuming that java.util.Date is automatically registered by that, but Joda's DateTime is not. We could always take a look through the source to confirm too. As far as the class name, my
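
If DateTime does turn out to need explicit registration, a custom registrator is the usual route. A minimal Scala sketch (class and package names illustrative):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator
    import org.joda.time.DateTime

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[DateTime])  // avoids writing the full class name per record
      }
    }

    // then, on the SparkConf:
    // conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // conf.set("spark.kryo.registrator", "com.example.MyRegistrator")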

Re: How to Run Machine Learning Examples

2014-05-22 Thread yxzhao
Thanks Stephen, I used the following command line to run the SVM, but it seems that the path is not correct. What should the right path or command line be? Thanks. ./bin/run-example org.apache.spark.mllib.classification.SVM spark://100.1.255.193:7077 train.csv 20

Re: How to Run Machine Learning Examples

2014-05-22 Thread yxzhao
Thanks. I used the following command line to run the SVM, but it seems that the path is not correct. What should the right path or command line be? Thanks. ./bin/run-example org.apache.spark.mllib.classification.SVM spark://100.1.255.193:7077 train.csv 20 Exception in thread main

Unable to run a Standalone job

2014-05-22 Thread Shrikar archak
Hi All, I am trying to run the network count example as a separate standalone job and am running into some issues. Environment: 1) Mac Mavericks 2) Latest Spark repo from GitHub. I have a structure like this:

    Shrikars-MacBook-Pro:SimpleJob shrikar$ find .
    .
    ./simple.sbt
    ./src
    ./src/main

Re: accessing partition i+1 from mapper of partition i

2014-05-22 Thread Austin Gibbons
Mohit, if you want to end up with (1 .. N), why don't you skip the logic for finding missing values and generate it directly?

    val max = myCollection.reduce(math.max)
    sc.parallelize(0 until max)

In either case, you don't need to call cache, which will force it into memory - you can do

Re: Unable to run a Standalone job

2014-05-22 Thread Tathagata Das
How are you launching the application? sbt run ? spark-submit? local mode or Spark standalone cluster? Are you packaging all your code into a jar? Looks to me that you seem to have spark classes in your execution environment but missing some of Spark's dependencies. TD On Thu, May 22, 2014 at

Re: Unable to run a Standalone job

2014-05-22 Thread Shrikar archak
I am running it with sbt run, locally. Thanks, Shrikar On Thu, May 22, 2014 at 3:53 PM, Tathagata Das tathagata.das1...@gmail.com wrote: How are you launching the application? sbt run? spark-submit? local mode or Spark standalone cluster? Are you packaging all your code into a

Re: Error while launching ec2 spark cluster with HVM (r3.large)

2014-05-22 Thread Xiangrui Meng
Was the error message the same as you posted when you used `root` as the user id? Could you try this: 1) Do not specify user id. (Default would be `root`.) 2) If it fails in the middle, try `spark-ec2 --resume launch cluster` to continue launching the cluster. Best, Xiangrui On Thu, May

Re: How to Run Machine Learning Examples

2014-05-22 Thread Krishna Sankar
I couldn't find the classification.SVM class.
- Most probably the command is something of the order of:
    bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification examples/target/scala-*/spark-examples-*.jar --algorithm SVM train.csv
- For more details

Re: How to turn off MetadataCleaner?

2014-05-22 Thread Tathagata Das
The cleaner should remain up while the SparkContext is still active (not stopped). Since here you are stopping the SparkContext (ssc.stop(true)), the cleaner should be stopped as well. However, there was a bug earlier where some of the cleaners may not have been stopped when the context is

Re: accessing partition i+1 from mapper of partition i

2014-05-22 Thread Mohit Jaggi
Austin, I made up a mock example; my real use case is more complex. I used foreach() instead of collect/cache, which forces the accumulable to be evaluated. On another thread Xiangrui pointed me to a sliding-window RDD in MLlib that is a great alternative (although I did not switch to using it)
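
For reference, the MLlib helper mentioned is a sliding() operation on RDDs; in recent builds it lives in org.apache.spark.mllib.rdd.RDDFunctions. A minimal sketch, assuming a Spark version that includes it:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    // pairs each element with its successor, even across partition boundaries
    val pairs = sc.parallelize(1 to 10).sliding(2).map { case Array(a, b) => (a, b) }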

Broadcast Variables

2014-05-22 Thread Puneet Lakhina
Hi, I'm confused about the right way to use broadcast variables from Java. My code looks something like this:

    Map<String, String> val = ...;  // build the Map to be broadcast (element types illustrative)
    Broadcast<Map<String, String>> broadcastVar = sc.broadcast(val);
    sc.textFile(...).map(new SomeFunction() {
        // do something here using broadcastVar
    });

My

Unsubscribe

2014-05-22 Thread Donna-M Fernandez
Unsubscribe

Re: Unable to run a Standalone job

2014-05-22 Thread Tathagata Das
How are you getting Spark 1.0.0-SNAPSHOT through Maven? Did you publish Spark locally, which allowed you to use it as a dependency? This is weird indeed; SBT should take care of all of Spark's dependencies. In any case, you can try the last released Spark, 0.9.1, and see if the problem

Re: Unable to run a Standalone job

2014-05-22 Thread Shrikar archak
Yes, I did a sbt publish-local. OK, I will try with Spark 0.9.1. Thanks, Shrikar On Thu, May 22, 2014 at 8:53 PM, Tathagata Das tathagata.das1...@gmail.com wrote: How are you getting Spark with 1.0.0-SNAPSHOT through maven? Did you publish Spark locally which allowed you to use it as a

Re: Unable to run a Standalone job

2014-05-22 Thread Soumya Simanta
Try cleaning your maven (.m2) and ivy cache. On May 23, 2014, at 12:03 AM, Shrikar archak shrika...@gmail.com wrote: Yes I did a sbt publish-local. Ok I will try with Spark 0.9.1. Thanks, Shrikar On Thu, May 22, 2014 at 8:53 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

java.lang.OutOfMemoryError while running Shark on Mesos

2014-05-22 Thread prabeesh k
Hi, I am trying to apply an inner join in Shark using 64MB and 27MB files. I am able to run the following queries on Mesos:
- SELECT * FROM geoLocation1
- SELECT * FROM geoLocation1 WHERE country = 'US'
But while trying an inner join as SELECT * FROM geoLocation1 g1 INNER JOIN

Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread Chester
This is something we are interested in as well, and we are planning to investigate more. If someone has suggestions, we would love to hear them. Chester Sent from my iPad On May 22, 2014, at 8:02 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Hi Andy! Yes Spark UI provides a

Re: Unable to run a Standalone job

2014-05-22 Thread Shrikar archak
Hi, I tried clearing the maven and ivy caches and I am a bit confused at this point in time. 1) Running the example from the spark directory using bin/run-example works fine and prints the word counts. 2) Trying to run the same code as a separate job. *) Using the latest