Spark + Mesos + HDFS resource split

2015-04-27 Thread Ankur Chauhan
Hi, I am building a Mesos cluster for the purpose of running Spark workloads (in addition to other frameworks). I am under the impression that it is preferable/recommended to run the HDFS datanode process and the Spark slave on the same physical node

Re: Cassandra Connection Issue with Spark-jobserver

2015-04-27 Thread Noorul Islam K M
Are you using DSE Spark? If so, are you pointing the Spark jobserver to use DSE Spark? Thanks and Regards, Noorul. Anand anand.vi...@monotype.com writes: I am new to the Spark world and Job Server. My code: package spark.jobserver import java.nio.ByteBuffer import

spark-defaults.conf

2015-04-27 Thread James King
I renamed spark-defaults.conf.template to spark-defaults.conf and invoked spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh, but I still get: failed to launch org.apache.spark.deploy.worker.Worker: --properties-file FILE Path to a custom Spark properties file.

Disable partition discovery

2015-04-27 Thread Cosmin Cătălin Sanda
How can one disable partition discovery in Spark 1.3.0 when using sqlContext.parquetFile? Alternatively, is there a way to load .parquet files without partition discovery? Cosmin

Re: Parquet error reading data that contains array of structs

2015-04-27 Thread Jianshi Huang
FYI, Parquet schema output:

    message pig_schema {
      optional binary cust_id (UTF8);
      optional int32 part_num;
      optional group ip_list (LIST) {
        repeated group ip_t {
          optional binary ip (UTF8);
        }
      }
      optional group vid_list (LIST) {
        repeated group vid_t {
          optional binary

How to distribute Spark computation recipes

2015-04-27 Thread Olivier Girardot
Hi everyone, I know that any RDD is related to its SparkContext and the associated variables (broadcast, accumulators), but I'm looking for a way to serialize/deserialize full RDD computations. @rxin Spark SQL is, in a way, already doing this, but the parsers are private[sql]; is there any way to

Re: Spark streaming action running the same work in parallel

2015-04-27 Thread ColinMc
I was able to get it working. Instead of using customers.flatMap to return alerts, I had to use the following: customers.foreachRDD(new Function<JavaPairRDD<String, Iterable<QueueEvent>>, Void>() { @Override public Void call(final JavaPairRDD<String, Iterable<QueueEvent

Re: spark-defaults.conf

2015-04-27 Thread Zoltán Zvara
You should distribute your configuration file to workers and set the appropriate environment variables, like HADOOP_HOME, SPARK_HOME, HADOOP_CONF_DIR, SPARK_CONF_DIR. On Mon, Apr 27, 2015 at 12:56 PM James King jakwebin...@gmail.com wrote: I renamed spark-defaults.conf.template to

Spark 1.2.1: How to convert SchemaRDD to CassandraRDD?

2015-04-27 Thread Tash Chainar
Hi all, following these imports: import com.datastax.spark.connector.SelectableColumnRef; import com.datastax.spark.connector.japi.CassandraJavaUtil; import org.apache.spark.sql.SchemaRDD; import static com.datastax.spark.connector.util.JavaApiHelper.toScalaSeq; import scala.collection.Seq; SchemaRDD

Re: Cassandra Connection Issue with Spark-jobserver

2015-04-27 Thread Anand
I was able to fix the issues by providing the right versions of the cassandra-all and thrift libraries -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-Connection-Issue-with-Spark-jobserver-tp22587p22664.html Sent from the Apache Spark User List mailing

ReduceByKey and sorting within partitions

2015-04-27 Thread Marco
Hi, I'm trying, after reducing by key, to get data ordered across partitions (like RangePartitioner) and within partitions (like sortByKey or repartitionAndSortWithinPartitions), pushing the sorting down to the shuffle machinery of the reduce phase. I think, but maybe I'm wrong, that the correct
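For reference, a minimal two-shuffle sketch of the desired ordering (not pushed into the reduce shuffle itself); pairs is a hypothetical (K, V) pair RDD whose key type has an Ordering:

    import org.apache.spark.RangePartitioner

    val reduced = pairs.reduceByKey(_ + _)
    // RangePartitioner orders keys across partitions; repartitionAndSortWithinPartitions
    // sorts keys inside each partition during the shuffle.
    val sorted = reduced.repartitionAndSortWithinPartitions(new RangePartitioner(8, reduced))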

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-27 Thread TimMalt
Hi, and what can I do when I am on Windows? It does not allow me to set the hostname to some IP. Thanks, Tim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-to-a-cluster-behind-a-VPN-configuring-different-IP-address-tp9360p22674.html Sent
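One knob sometimes used in this situation (not Windows-specific) is telling Spark explicitly which address the driver should advertise instead of relying on the hostname; a sketch only, with a made-up IP:

    // spark.driver.host controls the address executors use to reach the driver.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("vpn-submit")
      .set("spark.driver.host", "10.8.0.2") // hypothetical VPN-side address of the submitting machine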

Re: hive-thriftserver maven artifact

2015-04-27 Thread Ted Yu
This is available for 1.3.1: http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10 FYI On Mon, Feb 16, 2015 at 7:24 AM, Marco marco@gmail.com wrote: Ok, so will it be only available for the next version (1.3.0)? 2015-02-16 15:24 GMT+01:00 Ted Yu

Re: Group by order by

2015-04-27 Thread Richard Marscher
It's not related to Spark, but the concept of what you are trying to do with the data. Grouping by ID means consolidating data for each ID down to 1 row per ID. You can sort by time after this point yes, but you would need to either take each ID and time value pair OR do some aggregate operation

bug: numClasses is not a valid argument of LogisticRegressionWithSGD

2015-04-27 Thread Pagliari, Roberto
With the Python APIs, the available arguments I got (using inspect module) are the following: ['cls', 'data', 'iterations', 'step', 'miniBatchFraction', 'initialWeights', 'regParam', 'regType', 'intercept'] numClasses is not available. Can someone comment on this? Thanks,

Automatic Cache in SparkSQL

2015-04-27 Thread Wenlei Xie
Hi, I am trying to answer a simple query with SparkSQL over a Parquet file. When executing the query several times, the first run takes about 2s while the later runs take 0.1s. By looking at the log file it seems the later runs don't load the data from disk. However, I didn't enable

Fwd: Change ivy cache for spark on Windows

2015-04-27 Thread Burak Yavuz
+user -- Forwarded message -- From: Burak Yavuz brk...@gmail.com Date: Mon, Apr 27, 2015 at 1:59 PM Subject: Re: Change ivy cache for spark on Windows To: mj jone...@gmail.com Hi, In your conf file (SPARK_HOME\conf\spark-defaults.conf) you can set: `spark.jars.ivy \your\path`

Re: Super slow caching in 1.3?

2015-04-27 Thread Christian Perez
Michael, There is only one schema: both versions have 200 string columns in one file. On Mon, Apr 20, 2015 at 9:08 AM, Evo Eftimov evo.efti...@isecc.com wrote: Now this is very important: “Normal RDDs” refers to “batch RDDs”. However the default in-memory Serialization of RDDs which are

Change ivy cache for spark on Windows

2015-04-27 Thread mj
Hi, I'm having trouble using the --packages option for spark-shell.cmd - I have to use Windows at work and have been issued a username with a space in it, which means that when I use the --packages option it fails with this message: Exception in thread main java.net.URISyntaxException: Illegal

[SQL][1.3.1][JAVA]Use UDF in DataFrame join

2015-04-27 Thread Shuai Zheng
Hi All, I want to ask how to use a UDF with the join function on DataFrame. It always gives me a 'cannot resolve the column name' error. Can anyone give me an example of how to run this in Java? My code is like: edmData.join(yb_lookup,
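For reference, a Scala sketch of one way to sidestep the unresolved-column problem: materialize the UDF result as a real column with withColumn, then join on plain column references. The UDF and the column names below are hypothetical:

    import org.apache.spark.sql.functions.udf

    val normalize = udf((s: String) => s.trim.toLowerCase) // illustrative UDF
    val left  = edmData.withColumn("join_key", normalize(edmData("some_col")))
    val right = yb_lookup.withColumn("join_key", normalize(yb_lookup("lookup_col")))
    val joined = left.join(right, left("join_key") === right("join_key"))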

Re: How to distribute Spark computation recipes

2015-04-27 Thread Reynold Xin
The code itself is the recipe, no? On Mon, Apr 27, 2015 at 2:49 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, I know that any RDD is related to its SparkContext and the associated variables (broadcast, accumulators), but I'm looking for a way to

[SQL][1.3.1][JAVA] UDF in java cause Task not serializable

2015-04-27 Thread Shuai Zheng
Hi All, Basically I try to define a simple UDF and use it in the query, but it gives me 'Task not serializable'. public void test() { RiskGroupModelDefinition model = registeredRiskGroupMap.get(this.modelId); RiskGroupModelDefinition edm =

RE: Group by order by

2015-04-27 Thread Ulanov, Alexander
Thanks, it should be “select id, time, min(x1), min(x2), … from data group by id, time order by time” (“min” or other aggregate function to pick other fields) Forgot to mention that (id, time) is my primary key and I took for granted that it worked in my MySQL example. Best regards, Alexander

Re: Spark timeout issue

2015-04-27 Thread Akhil Das
You need to look more deeply into your worker logs; if you look closely you may find GC errors, IO exceptions, etc. that are triggering the timeout. Thanks Best Regards On Mon, Apr 27, 2015 at 3:18 AM, Deepak Gopalakrishnan dgk...@gmail.com wrote: Hello Patrick, Sure. I've posted this on user as

Re: How to debug Spark on Yarn?

2015-04-27 Thread ๏̯͡๏
Spark 1.3. 1. View stderr/stdout from the executor in the Web UI: when the job is running I identified the executor I am supposed to look at, and those two links show 4 special characters in the browser. 2. Tail the YARN logs: /apache/hadoop/bin/yarn logs -applicationId application_1429087638744_151059 |

Re: Understand the running time of SparkSQL queries

2015-04-27 Thread Akhil Das
Isn't it already available on the driver UI (that runs on 4040)? Thanks Best Regards On Mon, Apr 27, 2015 at 9:55 AM, Wenlei Xie wenlei@gmail.com wrote: Hi, I am wondering how should we understand the running time of SparkSQL queries? For example the physical query plan and the running

Re: Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread ayan guha
The answer is it depends :) The fact that query runtime increases indicates more shuffle. You may want to construct rdds based on keys you use. You may want to specify what kind of node you are using and how many executors you are using. You may also want to play around with executor memory

A problem of using spark streaming to capture network packets

2015-04-27 Thread Hai Shan Wu
Hi Everyone We use pcap4j to capture network packets and then use spark streaming to analyze captured packets. However, we met a strange problem. If we run our application on spark locally (for example, spark-submit --master local[2]), then the program runs successfully. If we run our

Re: How to debug Spark on Yarn?

2015-04-27 Thread ๏̯͡๏
1) Application container logs from the Web RM UI never load in the browser. I eventually have to kill the browser. 2) /apache/hadoop/bin/yarn logs -applicationId application_1429087638744_151059 | less emits logs only after the application has completed. Are there no better ways to see the logs as they

Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread Mani
Hi, I am a graduate student from Virginia Tech (USA) pursuing my Masters in Computer Science. I’ve been researching on parallel and distributed databases and their performance for running some Range queries involving simple joins and group by on large datasets. As part of my research, I tried

Re: How to debug Spark on Yarn?

2015-04-27 Thread Zoltán Zvara
You can check container logs from RM web UI or when log-aggregation is enabled with the yarn command. There are other, but less convenient options. On Mon, Apr 27, 2015 at 8:53 AM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Spark 1.3 1. View stderr/stdout from executor from Web UI: when the job

Re: ReduceByKey and sorting within partitions

2015-04-27 Thread Saisai Shao
Hi Marco, As far as I know, the current combineByKey() does not expose an argument for setting keyOrdering on the ShuffledRDD. Since ShuffledRDD is package private, if you can get at the ShuffledRDD through reflection or some other way, the keyOrdering you set will be pushed down to the shuffle. If you

Re: spark-defaults.conf

2015-04-27 Thread James King
Thanks. I've set SPARK_HOME and SPARK_CONF_DIR appropriately in .bash_profile, but when I start the worker like this: spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh, I still get: failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf.

Bigints in pyspark

2015-04-27 Thread jamborta
Hi all, I have just come across a problem where I have a table with a few bigint columns. It seems that if I read that table into a dataframe and then collect it in pyspark, the bigints are stored as integers in Python. (The problem is that if I write it back to another table, I detect the hive type

Re: SQL UDF returning object of case class; regression from 1.2.0

2015-04-27 Thread Ophir Cohen
A short update: eventually we manually upgraded to 1.3.1 and the problem was fixed. On Apr 26, 2015 2:26 PM, Ophir Cohen oph...@gmail.com wrote: I happened to hit the following issue that prevents me from using UDFs with case classes: https://issues.apache.org/jira/browse/SPARK-6054. The issue

Exception in using updateStateByKey

2015-04-27 Thread Sea
Hi, all: I use the function updateStateByKey in Spark Streaming. I need to store the states for one minute, so I set spark.cleaner.ttl to 120; the duration is 2 seconds, but it throws an exception: Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist:

Re: Exception in using updateStateByKey

2015-04-27 Thread Ted Yu
Which hadoop release are you using? Can you check the HDFS audit log to see who deleted spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443, and when? Cheers On Mon, Apr 27, 2015 at 6:21 AM, Sea 261810...@qq.com wrote: Hi, all: I use function updateStateByKey in Spark Streaming, I

Re: directory loader in windows

2015-04-27 Thread Steve Loughran
This is a Hadoop-side stack trace: it looks like the code is trying to get the filesystem permissions by running %HADOOP_HOME%\bin\WINUTILS.EXE ls -F and something is triggering a null pointer exception. There isn't any HADOOP JIRA with this specific stack trace in it, so it's not a

Re: Exception in using updateStateByKey

2015-04-27 Thread Sea
I set it to 240; it happens again when 240 seconds is reached. ------ Original message ------ From: 261810726 261810...@qq.com Sent: Monday, April 27, 2015, 10:24 To: Ted Yu yuzhih...@gmail.com Subject: Re: Exception in using updateStateByKey Yes, I can make

RE: Slower performance when bigger memory?

2015-04-27 Thread Shuai Zheng
Thanks. So may I know your configuration for more/smaller executors on r3.8xlarge, i.e. how much memory you eventually decided to give each executor without impacting performance (for example, 64g)? From: Sven Krasser [mailto:kras...@gmail.com] Sent: Friday, April 24, 2015 1:59

Re: Super slow caching in 1.3?

2015-04-27 Thread Wenlei Xie
I face a similar issue in Spark 1.2. Caching the schema RDD takes about 50s for 400MB of data. The schema is similar to the TPC-H LineItem. Here is the code I used for caching; I am wondering if there is any setting missing? Thank you so much! lineitemSchemaRDD.registerTempTable("lineitem");

Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread bit1...@163.com
Hi, I am frequently asked why Spark is also much faster than Hadoop MapReduce on disk (without the use of a memory cache). I have no convincing answer for this question; could you guys elaborate on this? Thanks!

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Ilya Ganelin
I believe the typical answer is that Spark is actually a bit slower. On Mon, Apr 27, 2015 at 7:34 PM bit1...@163.com bit1...@163.com wrote: Hi, I am frequently asked why spark is also much faster than Hadoop MapReduce on disk (without the use of memory cache). I have no convencing answer for

New JIRA - [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns

2015-04-27 Thread Don Drake
https://issues.apache.org/jira/browse/SPARK-7182 Can anyone suggest a workaround for the above issue? Thanks. -Don -- Donald Drake Drake Consulting http://www.drakeconsulting.com/ http://www.MailLaunder.com/
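One workaround sometimes used until the JIRA is resolved: rename clashing columns before the join so every column in the result is unique. The DataFrames and column names below are made up:

    val left    = df1.withColumnRenamed("id", "left_id")
    val right   = df2.withColumnRenamed("id", "right_id")
    val joined  = left.join(right, left("left_id") === right("right_id"))
    val trimmed = joined.select("left_id", "value") // keep only what you need afterwards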

RE: Scalability of group by

2015-04-27 Thread Ulanov, Alexander
It works on a smaller dataset of 100 rows. Probably I could find the size when it fails using binary search. However, it would not help me because I need to work with 2B rows. From: ayan guha [mailto:guha.a...@gmail.com] Sent: Monday, April 27, 2015 6:58 PM To: Ulanov, Alexander Cc:

Re: Automatic Cache in SparkSQL

2015-04-27 Thread ayan guha
Spark keeps job data in memory by default for the kind of performance gains you are seeing. Additionally, depending on your query, Spark runs stages, and at any point in time Spark's code behind the scenes may issue an explicit cache. If you hit any such scenario you will find those cached objects in the UI under
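If you want the caching to be explicit rather than incidental, a small sketch (the table name is hypothetical):

    sqlContext.cacheTable("my_table")                          // pin the table in the in-memory columnar cache
    sqlContext.sql("SELECT count(*) FROM my_table").collect()  // first action materializes the cache
    sqlContext.uncacheTable("my_table")                        // release it when done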

Re: Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread bit1...@163.com
Is it? I learned somewhere else that Spark is 5~10 times faster than Hadoop MapReduce. bit1...@163.com From: Ilya Ganelin Date: 2015-04-28 10:55 To: bit1...@163.com; user Subject: Re: Why Spark is much faster than Hadoop MapReduce even on disk I believe the typical answer is that

Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-27 Thread lokeshkumar
Hi Forum I am facing below compile error when using the fileStream method of the JavaStreamingContext class. I have copied the code from JavaAPISuite.java test class of spark test code. Please help me to find a solution for this.

Re: Scalability of group by

2015-04-27 Thread ayan guha
Hi, can you test on a smaller dataset to identify whether it is a cluster issue or a scaling issue in Spark? On 28 Apr 2015 11:30, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I am running a group by on a dataset of 2B rows of RDD[Row [id, time, value]] in Spark 1.3 as follows: “select id,

java.lang.UnsupportedOperationException: empty collection

2015-04-27 Thread xweb
I am running following code on Spark 1.3.0. It is from https://spark.apache.org/docs/1.3.0/ml-guide.html On running val model1 = lr.fit(training.toDF) I get java.lang.UnsupportedOperationException: empty collection what could be the reason? import org.apache.spark.{SparkConf,

1.3.1: Persisting RDD in parquet - Conflicting partition column names

2015-04-27 Thread sranga
Hi, I am getting the following error when persisting an RDD in parquet format to an S3 location. This is code that was working in version 1.2; it fails in 1.3.1. Any help is appreciated. Caused by: java.lang.AssertionError: assertion failed: Conflicting

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Michael Malak
http://www.datascienceassn.org/content/making-sense-making-sense-performance-data-analytics-frameworks   From: bit1...@163.com bit1...@163.com To: user user@spark.apache.org Sent: Monday, April 27, 2015 8:33 PM Subject: Why Spark is much faster than Hadoop MapReduce even on disk

Re: Spark Cluster Setup

2015-04-27 Thread Denny Lee
Similar to what Dean called out, we built Puppet manifests so we could do the automation - it's a bit of work to set up, but well worth the effort. On Fri, Apr 24, 2015 at 11:27 AM Dean Wampler deanwamp...@gmail.com wrote: It's mostly manual. You could try automating with something like Chef, of

Driver ID from spark-submit

2015-04-27 Thread Rares Vernica
Hello, I am trying to use the default Spark cluster manager in a production environment. I will be submitting jobs with spark-submit. I wonder if the following is possible: 1. Get the Driver ID from spark-submit. We will use this ID to keep track of the job and kill it if necessary. 2. Whether

Re: Exception in using updateStateByKey

2015-04-27 Thread Sea
Maybe I found the solution: do not set 'spark.cleaner.ttl', just use the function 'remember' in StreamingContext to set the rememberDuration. ------ Original message ------ From: Ted Yu yuzhih...@gmail.com Sent: Monday, April 27, 2015, 10:20 To: Sea 261810...@qq.com;
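A minimal sketch of that approach, assuming a 2-second batch interval as in the original post (sparkConf is assumed to already exist):

    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // Keep generated RDDs around for two minutes instead of relying on spark.cleaner.ttl.
    ssc.remember(Minutes(2))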

Spark JDBC data source API issue with mysql

2015-04-27 Thread madhu phatak
Hi, I have been trying out the Spark data source API with JDBC. The following is the code to get a DataFrame: Try(hc.load("org.apache.spark.sql.jdbc", Map("url" -> dbUrl, "dbtable" -> s"($query)"))) By looking at the test cases, I found that the query has to be inside brackets, otherwise it's treated as a table name.
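A sketch of the parenthesized-query form described above, with made-up connection and query values (note that MySQL requires an alias on a derived table):

    val options = Map(
      "url"     -> "jdbc:mysql://host:3306/db?user=u&password=p", // hypothetical
      "dbtable" -> "(SELECT id, name FROM people WHERE id > 100) AS t"
    )
    val df = hc.load("org.apache.spark.sql.jdbc", options)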

Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread ๏̯͡๏
I downloaded the 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors and the direct link. Each time I untar it I get the below error: spark-1.3.1-bin-hadoop2.4/lib/spark-assembly-1.3.1-hadoop2.4.0.jar: (Empty error message) tar: Error exit delayed from previous errors Is it broken? -- Deepak

RE: ReduceByKey and sorting within partitions

2015-04-27 Thread Ganelin, Ilya
Marco - why do you want data sorted both within and across partitions? If you need to take an ordered sequence across all your data you need to either aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an ordered index to your data that matches the order it was stored on

data locality in spark

2015-04-27 Thread Grandl Robert
Hi guys, I am running some SQL queries, but all my tasks are reported as either NODE_LOCAL or PROCESS_LOCAL. In the Hadoop world, the reduce tasks are RACK or NON_RACK local because they have to aggregate data from multiple hosts. However, in Spark even the aggregation stages are reported

RE: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread Ganelin, Ilya
What command are you using to untar? Are you running out of disk space? -----Original Message----- From: ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj...@gmail.com] Sent: Monday, April 27, 2015 11:44 AM Eastern Standard Time To: user Subject: Spark 1.3.1 Hadoop

Re: Timeout Error

2015-04-27 Thread Deepak Gopalakrishnan
Hello All, I dug a little deeper and found this error : 15/04/27 16:05:39 WARN TransportChannelHandler: Exception in connection from /10.1.0.90:40590 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at

Re: What is difference btw reduce fold?

2015-04-27 Thread keegan
Hi Q, fold and reduce both aggregate over a collection by applying an operation you specify; the major difference is the starting point of the aggregation. For fold(), you have to specify the starting value, and for reduce() the starting value is the first (or possibly an arbitrary) element in
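A small illustration of that difference on an RDD:

    val nums = sc.parallelize(1 to 4)

    nums.reduce(_ + _)  // 10; the aggregation starts from elements of the RDD itself
    nums.fold(0)(_ + _) // 10; the zero value is applied once per partition and once when
                        // merging partition results, so it must be a neutral element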

Re: Getting error running MLlib example with new cluster

2015-04-27 Thread Xiangrui Meng
How did you run the example app? Did you use spark-submit? -Xiangrui On Thu, Apr 23, 2015 at 2:27 PM, Su She suhsheka...@gmail.com wrote: Sorry, accidentally sent the last email before finishing. I had asked this question before, but wanted to ask again as I think it is now related to my pom

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-27 Thread Xiangrui Meng
Could you try different ranks and see whether the task size changes? We do use YtY in the closure, which should work the same as broadcast. If that is the case, it should be safe to ignore this warning. -Xiangrui On Thu, Apr 23, 2015 at 4:52 AM, Christian S. Perone christian.per...@gmail.com

Re: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread Sean Owen
Works fine for me. Make sure you're not downloading the HTML redirector page and thinking it's the archive. On Mon, Apr 27, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors and direct link. Each time i untar i

Re: StandardScaler failing with OOM errors in PySpark

2015-04-27 Thread Xiangrui Meng
You might need to specify driver memory in spark-submit instead of passing JVM options. spark-submit is designed to handle different deployments correctly. -Xiangrui On Thu, Apr 23, 2015 at 4:58 AM, Rok Roskar rokros...@gmail.com wrote: ok yes, I think I have narrowed it down to being a problem

Re: Powered By Spark

2015-04-27 Thread Justin
Hi, Would you mind adding our company to the Powered By Spark page? organization name: atp URL: https://atp.io a list of which Spark components you are using: SparkSQL, MLLib, Databricks Cloud and a short description of your use case: Predictive models and learning algorithms to improve the

spark sql LEFT OUTER JOIN java.lang.ClassCastException

2015-04-27 Thread kiran mavatoor
Hi there, I am using a Spark SQL left outer join query. The SQL query is: scala> val test = sqlContext.sql("SELECT e.departmentID FROM employee e LEFT OUTER JOIN department d ON d.departmentId = e.departmentId").toDF() In Spark 1.3.1 it's working fine, but the latest pull gives the below error

Streaming app with windowing and persistence

2015-04-27 Thread Alexander Krasheninnikov
Hello, everyone. I am developing a streaming application working with window functions - each window creates a table and performs some SQL operations on the extracted data. I have met the following problem: when using window operations and checkpointing, the application does not start the next time. Here is the code:

does randomSplit return a copy or a reference to the original rdd? [Python]

2015-04-27 Thread Pagliari, Roberto
Suppose I have something like the code below:

    for idx in xrange(0, 10):
        train_test_split = training.randomSplit(weights=[0.75, 0.25])
        train_cv = train_test_split[0]
        test_cv = train_test_split[1]
        # scale train_cv and test_cv by scaling

Re: Getting error running MLlib example with new cluster

2015-04-27 Thread Su She
Hello Xiangrui, I am using this spark-submit command (as I do for all other jobs): /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit --class MLlib --master local[2] --jars $(echo /home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',')

Spark Job fails with 6 executors and succeeds with 8 ?

2015-04-27 Thread ๏̯͡๏
I have this Spark app and it fails when I run it with 6 executors but succeeds with 8. Any suggestions? Command: ./bin/spark-submit -v --master yarn-cluster --driver-class-path

Serialization error

2015-04-27 Thread madhvi
Hi, while connecting to Accumulo through Spark by making a Spark RDD I am getting the following error: object not serializable (class: org.apache.accumulo.core.data.Key) This is due to the Key class of Accumulo, which does not implement the Serializable interface. How can it be solved, and accumulo
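One common way around it (a sketch only, assuming the RDD is an RDD[(Key, Value)] produced by newAPIHadoopRDD over the Accumulo input format) is to convert the non-serializable Key/Value objects into plain serializable types before anything shuffles or collects them:

    val usable = accumuloRdd.map { case (key, value) =>
      // Pull out only what you need as plain serializable types.
      (key.getRow.toString, new String(value.get(), "UTF-8"))
    }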

Re: Unable to work with foreachrdd

2015-04-27 Thread drarse
Can you write the code? Maybe it is the foreachRDD body. :) On Tuesday, April 28, 2015, CH.KMVPRASAD [via Apache Spark User List] ml-node+s1001560n22681...@n3.nabble.com wrote: When I run the Spark Streaming application the print method is printing the result, it is f9, but I used foreachRDD on that

Re: gridsearch - python

2015-04-27 Thread Xiangrui Meng
We will try to make them available in 1.4, which is coming soon. -Xiangrui On Thu, Apr 23, 2015 at 10:18 PM, Pagliari, Roberto rpagli...@appcomsci.com wrote: I know grid search with cross validation is not supported. However, I was wondering if there is something availalable for the time being.

Re: Spark on Mesos

2015-04-27 Thread Stephen Carman
So I installed Spark on each of the slaves: 1.3.1 built with hadoop2.6. I basically just got the pre-built from the Spark website… I placed those compiled Spark installs on each slave at /opt/spark. My Spark properties seem to be getting picked up on my side fine…

Group by order by

2015-04-27 Thread Ulanov, Alexander
Hi, Could you suggest what is the best way to do group by x order by y in Spark? When I try to perform it with Spark SQL I get the following error (Spark 1.3): val results = sqlContext.sql("select * from sample group by id order by time") org.apache.spark.sql.AnalysisException: expression 'time'

Re: Group by order by

2015-04-27 Thread Richard Marscher
Hi, that error seems to indicate the basic query is not properly expressed. If you group by just ID, then that means it would need to aggregate all the time values into one value per ID, so you can't sort by it. Thus it tries to suggest an aggregate function for time so you can have 1 value per

RE: Group by order by

2015-04-27 Thread Ulanov, Alexander
Hi Richard, There are several values of time per id. Is there a way to perform group by id and sort by time in Spark? Best regards, Alexander From: Richard Marscher [mailto:rmarsc...@localytics.com] Sent: Monday, April 27, 2015 12:20 PM To: Ulanov, Alexander Cc: user@spark.apache.org Subject:
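If the ordering only needs to hold within each id group, an RDD-side sketch (with a hypothetical Record case class and records RDD) is:

    case class Record(id: Long, time: Long, value: Double) // illustrative schema

    val grouped = records.groupBy(_.id)           // RDD[(Long, Iterable[Record])]
      .mapValues(_.toSeq.sortBy(_.time))          // rows within each id ordered by time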

Understanding Spark's caching

2015-04-27 Thread Eran Medan
Hi everyone! I'm trying to understand how Spark's cache works. Here is my naive understanding, please let me know if I'm missing something: val rdd1 = sc.textFile("some data") rdd1.cache() // marks rdd1 as cached val rdd2 = rdd1.filter(...) val rdd3 = rdd1.map(...) rdd2.saveAsTextFile(...)
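For what it's worth, a sketch of how that example behaves: cache() is lazy, so nothing is stored until the first action materializes rdd1, and later actions can then reuse the cached blocks (assuming they fit in memory). Paths and filter/map bodies below are illustrative:

    val rdd1 = sc.textFile("some data")
    rdd1.cache()                        // only marks rdd1 as cacheable
    val rdd2 = rdd1.filter(_.nonEmpty)
    val rdd3 = rdd1.map(_.length)
    rdd2.saveAsTextFile("out2")         // first action: reads the file and fills the cache
    rdd3.saveAsTextFile("out3")         // second action: reads rdd1 from the cache, not the file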