Re: Using Spark on Data size larger than Memory size

2014-05-31 Thread Vibhor Banga
Some inputs will be really helpful. Thanks, -Vibhor On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com wrote: Hi all, I am planning to use spark with HBase, where I generate RDD by reading data from HBase Table. I want to know that in the case when the size of HBase

Re: Using Spark on Data size larger than Memory size

2014-05-31 Thread Mayur Rustagi
Clearly there will be an impact on performance, but frankly it depends on what you are trying to achieve with the dataset. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga

Re: Failed to remove RDD error

2014-05-31 Thread Mayur Rustagi
You can increase your Akka timeout; that should give you some more life. Are you running out of memory by any chance? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, May 31, 2014 at 6:52 AM, Michael Chang
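For reference, the timeout Mayur mentions maps to SparkConf settings — a sketch with illustrative values (the property names are the Spark 1.x Akka settings; the numbers here are arbitrary examples, not recommendations):

```scala
import org.apache.spark.SparkConf

// Illustrative values in seconds; tune to your workload.
val conf = new SparkConf()
  .set("spark.akka.timeout", "300")    // communication timeout between nodes
  .set("spark.akka.askTimeout", "60")  // timeout for ask/reply operations
```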

Re: pyspark MLlib examples don't work with Spark 1.0.0

2014-05-31 Thread Xiangrui Meng
The documentation you looked at is not official, though it is from @pwendell's website. It was for the Spark SQL release. Please find the official documentation here: http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machine-svm It contains a working example

Re: Create/shutdown objects before/after RDD use (or: Non-serializable classes)

2014-05-31 Thread Xiangrui Meng
Hi Tobias, One hack you can try is: rdd.mapPartitions(iter => { val x = new X(); iter.map(row => x.doSomethingWith(row)) ++ { x.shutdown(); Iterator.empty } }) Best, Xiangrui On Thu, May 29, 2014 at 11:38 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, I want to use an object x in my RDD
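The hack from the reply above, as a formatted sketch (`X`, `doSomethingWith`, and `shutdown` stand in for Tobias's non-serializable class; `X` is constructed once per partition, on the executor, so it never needs to be serialized):

```scala
rdd.mapPartitions { iter =>
  val x = new X()  // created on the executor, one instance per partition
  iter.map(row => x.doSomethingWith(row)) ++ {
    x.shutdown()   // evaluated lazily, after the partition's rows are exhausted
    Iterator.empty
  }
}
```

The trick relies on `++` evaluating its right operand lazily: `shutdown()` only runs once the mapped iterator has been fully drained.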

Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread prabeesh k
Hi, scenario: read data from HDFS, apply a Hive query to it, and write the result back to HDFS. Schema creation, querying, and saveAsTextFile are working fine with the following modes - local mode - mesos cluster with single node - spark cluster with multi node Schema creation and

Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-31 Thread Jeremy Lee
Hi there, Patrick. Thanks for the reply... It wouldn't surprise me that AWS Ubuntu has Python 2.7. Ubuntu is cool like that. :-) Alas, the Amazon Linux AMI (2014.03.1) does not, and it's the very first one on the recommended instance list. (Ubuntu is #4, after Amazon, RedHat, SUSE) So, users

Re: Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread Patrick Wendell
Can you look at the logs from the executor or in the UI? They should give an exception with the reason for the task failure. Also in the future, for this type of e-mail please only e-mail the user@ list and not both lists. - Patrick On Sat, May 31, 2014 at 3:22 AM, prabeesh k

Re: How can I dispose an Accumulator?

2014-05-31 Thread Patrick Wendell
Hey There, You can remove an accumulator by just letting it go out of scope and it will be garbage collected. For broadcast variables we actually store extra information for it, so we provide hooks for users to remove the associated state. There is no such need for accumulators, though. -

Re: Spark hook to create external process

2014-05-31 Thread Patrick Wendell
Currently, an executor is always run in its own JVM, so it should be possible to just use some static initialization to e.g. launch a sub-process and set up a bridge with which to communicate. This would be a fairly advanced use case, however. - Patrick On Thu, May 29, 2014 at 8:39 PM,
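Patrick's static-initialization idea can be sketched with a Scala singleton object; the `my-helper` command and the stdin/stdout bridge are hypothetical placeholders, not anything from the original thread:

```scala
// One external process per executor JVM, started lazily on first use.
// "my-helper" is a hypothetical example command.
object ExternalBridge {
  lazy val process: Process = {
    val p = new ProcessBuilder("my-helper").start()
    sys.addShutdownHook(p.destroy())  // clean up when the executor JVM exits
    p
  }
}

// Inside a task: the object is initialized once per executor JVM, not per task,
// because Scala objects are loaded by the executor's classloader, not serialized.
rdd.map { row =>
  val proc = ExternalBridge.process
  // ... communicate via proc.getOutputStream / proc.getInputStream ...
  row
}
```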

Re: possible typos in spark 1.0 documentation

2014-05-31 Thread Patrick Wendell
1. ctx is an instance of JavaSQLContext but the textFile method is called as a member of ctx. According to the API JavaSQLContext does not have such a member, so I'm guessing this should be sc instead. Yeah, I think you are correct. 2. In that same code example the object sqlCtx is

Re: getPreferredLocations

2014-05-31 Thread Patrick Wendell
1) Is there a guarantee that a partition will only be processed on a node which is in the getPreferredLocations set of nodes returned by the RDD? No, there isn't; by default Spark may schedule in a non-preferred location after `spark.locality.wait` has expired.
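For illustration, the wait can be lengthened via SparkConf — a sketch, assuming Spark 1.x where the value is in milliseconds (the 10000 is an arbitrary example, not a recommendation):

```scala
import org.apache.spark.SparkConf

// Make the scheduler wait longer for a preferred location before
// falling back to a non-preferred node. Value in ms; illustrative only.
val conf = new SparkConf().set("spark.locality.wait", "10000")
```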

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Patrick Wendell
I think there are a few ways to do this... the simplest one might be to manually build a set of comma-separated paths that excludes the bad file, and pass that to textFile(). When you call textFile(), under the hood it is going to pass your filename string to hadoopFile(), which calls
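Patrick's suggestion can be sketched as follows — list the directory with the standard Hadoop FileSystem API, filter out the corrupt file, and join the survivors with commas (the directory and the bad file's name are hypothetical examples; `sc` is an existing SparkContext):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Build a comma-separated path string that excludes the bad gzip file.
val fs = FileSystem.get(sc.hadoopConfiguration)
val goodPaths = fs.listStatus(new Path("/data/logs"))  // example directory
  .map(_.getPath.toString)
  .filterNot(_.endsWith("bad-file.gz"))                // hypothetical bad file

// textFile() accepts a comma-separated list of paths.
val rdd = sc.textFile(goodPaths.mkString(","))
```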

Re: Trouble with EC2

2014-05-31 Thread Matei Zaharia
What instance types did you launch on? Sometimes you also get a bad individual machine from EC2. It might help to remove the node it’s complaining about from the conf/slaves file. Matei On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote: Hey Folks, I'm really having quite a

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Nicholas Chammas
That's a neat idea. I'll try that out. On Sat, May 31, 2014 at 2:45 PM, Patrick Wendell pwend...@gmail.com wrote: I think there are a few ways to do this... the simplest one might be to manually build a set of comma-separated paths that excludes the bad file, and pass that to textFile().

hadoopRDD stalls reading entire directory

2014-05-31 Thread Russell Jurney
I'm running the following code to load an entire directory of Avros using hadoopRDD. val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/*" // Setup the path for the job via a Hadoop JobConf val jobConf = new JobConf(sc.hadoopConfiguration) jobConf.setJobName("Test Scala Job")
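A formatted version of the snippet above, completed as a sketch: the path and job name come from the original message, while the input-path wiring and the Avro mapred classes (`AvroInputFormat`/`AvroWrapper` from `org.apache.avro.mapred`) are an assumed completion, since the rest of the message is truncated:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/*"

// Set up the path for the job via a Hadoop JobConf
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setJobName("Test Scala Job")
FileInputFormat.setInputPaths(jobConf, input)

// Read the Avro files through the old-API hadoopRDD
val rdd = sc.hadoopRDD(jobConf,
  classOf[AvroInputFormat[GenericRecord]],
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable])
```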

can not access app details on ec2

2014-05-31 Thread wxhsdp
Hi all, I launched a spark cluster on EC2 with spark version v1.0.0-rc3. Everything goes well except that I can not access application details on the web UI: I click on the application name, but there's no response. Has anyone met this before? Is this a bug? Thanks! -- View this

spark 1.0.0 on yarn

2014-05-31 Thread Xu (Simon) Chen
Hi all, I tried a couple of ways, but couldn't get it to work. The following seems to be what the online document ( http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
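For context, a Spark 1.0-on-YARN submission along the lines the linked docs describe might look like the sketch below; the assembly path on HDFS is the one from the message, while the application jar and main class are hypothetical placeholders:

```
# Sketch of a yarn-cluster submission (Spark 1.0-era); jar/class names are examples.
SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar \
  ./bin/spark-submit \
  --master yarn-cluster \
  --class org.example.MyApp \
  my-app.jar
```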

Re: possible typos in spark 1.0 documentation

2014-05-31 Thread Yadid Ayzenberg
Yep, I just issued a pull request. Yadid On 5/31/14, 1:25 PM, Patrick Wendell wrote: 1. ctx is an instance of JavaSQLContext but the textFile method is called as a member of ctx. According to the API JavaSQLContext does not have such a member, so I'm guessing this should be sc instead. Yeah,

Spark on EC2

2014-05-31 Thread superback
Hi, I am trying to run an example on AMAZON EC2 and have successfully set up one cluster with two nodes on EC2. However, when I was testing an example using the following command: ./run-example org.apache.spark.examples.GroupByTest spark://`hostname`:7077 I got the following

Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-31 Thread Jeremy Lee
It's been another day of spinning up dead clusters... I thought I'd finally worked out what everyone else knew - don't use the default AMI - but I've now run through all of the official quick-start linux releases and I'm none the wiser: Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)