how to save matrix result to file

2016-01-19 Thread zhangjp
Hi all, I have got a Matrix-type result with Java, but I don't know how to save the result to a file: "Matrix cov = mat.computeCovariance();" Thanks.
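One way to do this (a minimal sketch, not from the thread, shown in Scala): org.apache.spark.mllib.linalg.Matrix exposes numRows, numCols and apply(i, j), so the values can be written out row by row; the helper name and file path are illustrative.

    import java.io.PrintWriter
    import org.apache.spark.mllib.linalg.Matrix

    // Hypothetical helper: write a local MLlib Matrix to a CSV-style text file.
    def saveMatrix(cov: Matrix, path: String): Unit = {
      val writer = new PrintWriter(path)
      try {
        for (i <- 0 until cov.numRows) {
          // cov(i, j) returns the value at row i, column j
          writer.println((0 until cov.numCols).map(j => cov(i, j)).mkString(","))
        }
      } finally {
        writer.close()
      }
    }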

Re: Calling SparkContext methods in scala Future

2016-01-19 Thread Marco
Thank you guys for the answers. @Ted Yu: You are right, in general the code to fetch stuff externally should be called separately, while Spark should only access the data written by these two services via flume/kafka/whatever. However, before I get there, I would like to have the Spark job ready.

Re: has any one implemented TF_IDF using ML transformers?

2016-01-19 Thread Yanbo Liang
Hi Andy, The equation to calculate IDF is: idf = log((m + 1) / (d(t) + 1)) you can refer here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L150 The equation to calculate TFIDF is: TFIDF=TF * IDF you can refer:
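For reference, a minimal sketch of the full TF-IDF computation with the RDD-based spark.mllib API (the input path and tokenization are illustrative; an existing SparkContext sc is assumed):

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.rdd.RDD

    // Each document is a sequence of terms (naive whitespace tokenization here).
    val documents: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)

    val tf = new HashingTF().transform(documents)   // term frequencies per document
    tf.cache()                                      // IDF makes a second pass over the data

    val idfModel = new IDF().fit(tf)                // idf = log((m + 1) / (d(t) + 1))
    val tfidf = idfModel.transform(tf)              // tfidf = tf * idf, per term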

Different executor memory for different nodes

2016-01-19 Thread hemangshah
How to set different executor memory limits for different worker nodes? I'm using spark 1.5.2 in standalone deployment mode and launching using scripts. The executor memory is set via 'spark.executor.memory' in conf/spark-defaults.conf. This sets the same memory limit for all the worker nodes. I

storing query object

2016-01-19 Thread Gourav Sengupta
Hi, I have a Spark table (created from hiveContext) with a couple of hundred partitions and a few thousand files. When I run a query on the table, Spark spends a lot of time (as seen in the pyspark output) collecting these files from the several partitions. After this the query starts running. Is

Spark Dataset doesn't have api for changing columns

2016-01-19 Thread Milad khajavi
Hi Spark users, when I want to map the result of count on groupBy, I need to convert the result to a DataFrame, then change the column names and map the result to a new case class. Why doesn't the Spark Dataset API have this functionality directly? case class LogRow(id: String, location: String, time: Long)

spark yarn client mode

2016-01-19 Thread Sanjeev Verma
Hi, do I need to install Spark on all the YARN cluster nodes if I want to submit the job in yarn-client mode? Is there any way to spawn Spark job executors on cluster nodes where I have not installed Spark? Thanks Sanjeev

RE: building spark 1.6 throws error Rscript: command not found

2016-01-19 Thread Sun, Rui
Hi, Mich, Building Spark with SparkR profile enabled requires installation of R on your building machine. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, January 19, 2016 5:27 AM To: Mich Talebzadeh Cc: user @spark Subject: Re: building spark 1.6 throws error Rscript: command not found

Re: using spark context in map function Task not serializable error

2016-01-19 Thread Ricardo Paiva
Did you try SparkContext.getOrCreate() ? You don't need to pass the sparkContext to the map function, you can retrieve it from the SparkContext singleton. Regards, Ricardo On Mon, Jan 18, 2016 at 6:29 PM, gpatcham [via Apache Spark User List] < ml-node+s1001560n25998...@n3.nabble.com> wrote:
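A minimal driver-side sketch of that accessor (the app name is illustrative). Note that a SparkContext can only be used on the driver, so closures that run on the executors should not reference it at all:

    import org.apache.spark.{SparkConf, SparkContext}

    // Creates the context the first time; returns the existing singleton afterwards.
    val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example"))

    // Elsewhere in driver-side code the same instance comes back without passing it around.
    val sameSc = SparkContext.getOrCreate()
    assert(sc eq sameSc)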

RDD immutablility

2016-01-19 Thread ddav
Hi, certain APIs (map, mapValues) give the developer access to the data stored in RDDs. Am I correct in saying that these APIs must never modify the data, but should always return a new object with a copy of the data if the data needs to be updated for the returned RDD? Thanks, Dave.

RE: Spark SQL -Hive transactions support

2016-01-19 Thread Hemang Nagar
Do you have any plans for supporting hive transactions in Spark? From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Tuesday, January 19, 2016 3:18 PM To: hnagar Cc: user Subject: Re: Spark SQL -Hive transactions support We

Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-19 Thread Raghvendra Singh
Hi, there is one question: in spark-env.sh, should I specify all masters for the parameter SPARK_MASTER_IP? I've set SPARK_DAEMON_JAVA_OPTS already with the ZooKeeper configuration as specified in the Spark documentation. Thanks & Regards Raghvendra On Wed, Jan 20, 2016 at 1:46 AM, Raghvendra Singh <

Re: Docker/Mesos with Spark

2016-01-19 Thread Sathish Kumaran Vairavelu
Hi Tim Do you have any materials/blog for running Spark in a container in Mesos cluster environment? I have googled it but couldn't find info on it. Spark documentation says it is possible, but no details provided.. Please help Thanks Sathish On Mon, Sep 21, 2015 at 11:54 AM Tim Chen

Re: OOM on yarn-cluster mode

2016-01-19 Thread Saisai Shao
You could try increasing the driver memory with "--driver-memory"; it looks like the OOM came from the driver side, so the simple solution is to increase the driver's memory. On Tue, Jan 19, 2016 at 1:15 PM, Julio Antonio Soto wrote: > Hi, > > I'm having trouble when uploading spark

Re: Split columns in RDD

2016-01-19 Thread Richard Siebeling
thanks Daniel, this will certainly help, regards, Richard On Tue, Jan 19, 2016 at 6:35 PM, Daniel Imberman wrote: > edit 2: filter should be map > > val numColumns = separatedInputStrings.map{ case(id, (stateList, > numStates)) => numStates}.reduce(math.max) > > On

OOM on yarn-cluster mode

2016-01-19 Thread Julio Antonio Soto
Hi, I'm having trouble when uploading Spark jobs in yarn-cluster mode. While the job works and completes in yarn-client mode, I hit the following error when using spark-submit in yarn-cluster (simplified): 16/01/19 21:43:31 INFO hive.metastore: Connected to metastore. 16/01/19 21:43:32 WARN

GraphX: Easy way to build fully connected grid-graph

2016-01-19 Thread benjamin.naujoks
Hello, I was wondering if there is some elegant way to build a fully connected grid-graph. The standard grid-graph function only creates one where a vertex is connected to the vertices at row+1 and column+1. For my algorithm I need every vertex to be connected to the vertices at row-1,

Re: Docker/Mesos with Spark

2016-01-19 Thread Tim Chen
Hi Sathish, Sorry about that, I think that's a good idea and I'll write up a section in the Spark documentation page to explain how it can work. We (Mesosphere) have been doing this for our DCOS Spark for our past releases and it has been working well so far. Thanks! Tim On Tue, Jan 19, 2016 at

Appending filename information to RDD initialized by sc.textFile

2016-01-19 Thread Femi Anthony
I have a set of log files I would like to read into an RDD. These files are all compressed (.gz) and the filenames are date-stamped. The source of these files is the page view statistics data for Wikipedia: http://dumps.wikimedia.org/other/pagecounts-raw/ The file names look like this:

Re: Spark SQL -Hive transactions support

2016-01-19 Thread Elliot West
Hive's ACID feature (which introduces transactions) is not required for inserts, only updates and deletes. Inserts should be supported on a vanilla Hive shell. I'm not sure how Spark interacts with Hive in that regard but perhaps the HiveSQLContext implementation is lacking support. On a separate

dataframe access hive complex type

2016-01-19 Thread pth001
Hi, how can a DataFrame (which API) access Hive complex types (Struct, Array, Map)? Thanks, Patcharee
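A hedged sketch of the usual accessors (the table and column names are hypothetical): struct fields via dot notation, array elements and map values via getItem, and explode to flatten an array into rows.

    import org.apache.spark.sql.functions.{col, explode}

    val df = sqlContext.table("events")               // hypothetical Hive table

    df.select(col("address.city"))                    // struct field
    df.select(col("tags").getItem(0))                 // array element by index
    df.select(col("properties").getItem("color"))     // map value by key
    df.select(explode(col("tags")))                   // one output row per array element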

Spark SQL -Hive transactions support

2016-01-19 Thread hnagar
Hive has had transaction support since version 0.14. I am using Spark 1.6 and Hive 1.2.1; are transactions supported in Spark SQL now? I tried in the Spark shell and it gives the following error: org.apache.spark.sql.AnalysisException: Unsupported language features in query: insert into test

Re: Spark SQL -Hive transactions support

2016-01-19 Thread Michael Armbrust
We don't support Hive-style transactions. On Tue, Jan 19, 2016 at 11:32 AM, hnagar wrote: > Hive has transactions support since version 0.14. > > I am using Spark 1.6, and Hive 1.2.1, are transactions supported in Spark > SQL now. I tried in the Spark-Shell and it

Re: spark 1.6.0 on ec2 doesn't work

2016-01-19 Thread Calvin Jia
Hi Oleg, The Tachyon related issue should be fixed. Hope this helps, Calvin On Mon, Jan 18, 2016 at 2:51 AM, Oleg Ruchovets wrote: > Hi , >I try to follow the spartk 1.6.0 to install spark on EC2. > > It doesn't work properly - got exceptions and at the end

Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-19 Thread Raghvendra Singh
Here's the complete master log on reproducing the error http://pastebin.com/2YJpyBiF Regards Raghvendra On Wed, Jan 20, 2016 at 12:38 AM, Raghvendra Singh < raghvendra.ii...@gmail.com> wrote: > Ok I Will try to reproduce the problem. Also I don't think this is an > uncommon problem I am

Re: Spark Dataset doesn't have api for changing columns

2016-01-19 Thread Michael Armbrust
In Spark 2.0 we are planning to combine DataFrame and Dataset so that all the methods will be available on either class. On Tue, Jan 19, 2016 at 3:42 AM, Milad khajavi wrote: > Hi Spark users, > > when I want to map the result of count on groupBy, I need to convert the >

Re: Serializing DataSets

2016-01-19 Thread Simon Hafner
The occasional type error if the casting goes wrong for whatever reason. 2016-01-19 1:22 GMT+08:00 Michael Armbrust : > What error? > > On Mon, Jan 18, 2016 at 9:01 AM, Simon Hafner wrote: >> >> And for deserializing, >>

Re: when enable kerberos in hdp, the spark does not work

2016-01-19 Thread Steve Loughran
On 18 Jan 2016, at 23:39, 李振 wrote: java.io.IOException: java.net.ConnectException: Connection refused at org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:888) at

Re: SparkR with Hive integration

2016-01-19 Thread Felix Cheung
You might need hive-site.xml. From: Peter Zhang Sent: Monday, January 18, 2016 9:08 PM Subject: Re: SparkR with Hive integration To: Jeff Zhang Cc: Thanks, I will try.

Re: Spark Cassandra Java Connector: records missing despite consistency=ALL

2016-01-19 Thread Femi Anthony
So is the logging to Cassandra being done via Spark? On Wed, Jan 13, 2016 at 7:17 AM, Dennis Birkholz wrote: > Hi together, > > we use Cassandra to log event data and process it every 15 minutes with Spark. > We are using the Cassandra Java Connector for Spark. > > Randomly

Re: is Hbase Scan really need thorough Get (Hbase+solr+spark)

2016-01-19 Thread ayan guha
It is not scanning HBase. What it is doing is looping through your list of row keys and fetching data for each one at a time. Ex: your Solr result has 5 records, with row keys R1...R5. Then the list will be [R1,R2,...R5] and table.get(list) will do something like: res=[] for k in list: v =

Re: is Hbase Scan really need thorough Get (Hbase+solr+spark)

2016-01-19 Thread Ted Yu
get(List gets) will call: Object [] r1 = batch((List)gets); where batch() would do: AsyncRequestFuture ars = multiAp.submitAll(pool, tableName, actions, null, results); ars.waitUntilDone(); multiAp is an AsyncProcess. In short, client would access region server for the results.
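A hedged sketch of the client-side pattern being described (HBase 1.x API; the helper name is illustrative): the list of Gets is sent as a batched request, grouped by region server, rather than as a table scan.

    import org.apache.hadoop.hbase.client.{Get, Result, Table}
    import scala.collection.JavaConverters._

    // Batched multi-get: fetch exactly the requested row keys in bulk.
    def fetchRows(table: Table, rowKeys: Seq[Array[Byte]]): Array[Result] = {
      val gets = rowKeys.map(new Get(_)).asJava
      table.get(gets)
    }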

Re: OOM on yarn-cluster mode

2016-01-19 Thread Julio Antonio Soto de Vicente
Hi, I tried with --driver-memory 16G (more than enough to read a simple parquet table), but the problem still persists. Everything works fine in yarn-client. -- Julio Antonio Soto de Vicente > El 19 ene 2016, a las 22:18, Saisai Shao escribió: > > You could try

Re: Docker/Mesos with Spark

2016-01-19 Thread Sathish Kumaran Vairavelu
Thank you! Looking forward for it.. On Tue, Jan 19, 2016 at 4:03 PM Tim Chen wrote: > Hi Sathish, > > Sorry about that, I think that's a good idea and I'll write up a section > in the Spark documentation page to explain how it can work. We (Mesosphere) > have been doing

Re: Docker/Mesos with Spark

2016-01-19 Thread Darren Govoni
I also would be interested in some best practice for making this work. Where will the writeup be posted? On mesosphere website? Sent from my Verizon Wireless 4G LTE smartphone Original message From: Sathish Kumaran Vairavelu Date: 01/19/2016

spark dataframe jdbc read/write using dbcp connection pool

2016-01-19 Thread fightf...@163.com
Hi, I want to load a really large volume of data from MySQL using the Spark DataFrame API, and then save it as a Parquet or ORC file to work with Hive / Impala. The dataset is about 1 billion records, and when I am using the following naive code to run that, an error occurs and

is Hbase Scan really need thorough Get (Hbase+solr+spark)

2016-01-19 Thread beeshma r
Hi, I am trying to integrate HBase, Solr and Spark. Solr is indexing all the documents from HBase through hbase-indexer. Through Spark I am manipulating all the datasets. The thing is, after getting the SolrDocuments from the Solr query, each has the row key and row values. So directly I got the row keys and

Re: is Hbase Scan really need thorough Get (Hbase+solr+spark)

2016-01-19 Thread beeshma r
Thanks Ted :) If everything gets indexed from HBase into Solr, then there is no need to trace the region servers once again. Thanks Beesh On Wed, Jan 20, 2016 at 5:05 AM, Ted Yu wrote: > get(List gets) will call: > > Object [] r1 = batch((List)gets); > > where batch() would

process of executing a program in a distributed environment without hadoop

2016-01-19 Thread Kamaruddin
I want to execute a program in a distributed environment without using Hadoop, only a Spark cluster. What is the best way to do it?

Re: Appending filename information to RDD initialized by sc.textFile

2016-01-19 Thread Akhil Das
You can use sc.newAPIHadoopFile and pass your own InputFormat and RecordReader, which will read the compressed .gz files according to your use case. For a start, you can look at the wholeTextFiles implementation
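As a simpler alternative sketch (assuming the files are small enough to be read whole and the compression codec is handled transparently), wholeTextFiles already pairs each file's path with its content:

    // (path, fileContent) pairs; the date-stamped filename travels with every record
    val filesWithNames = sc.wholeTextFiles("hdfs:///data/pagecounts/*.gz")

    val linesWithFile = filesWithNames.flatMap { case (path, content) =>
      content.split("\n").map(line => (path, line))
    }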

Re: How to call a custom function from GroupByKey which takes Iterable[Row] as input and returns a Map[Int,String] as output in scala

2016-01-19 Thread Vishal Maru
It seems Spark is not able to serialize your function code to the worker nodes. I have tried to put a solution into a simple set of commands. Maybe you can combine the last four lines into a function. val arr = Array((1,"A","<20","0"), (1,"A",">20 & <40","1"), (1,"B",">20 & <40","0"), (1,"C",">20 & <40","0"),

Re: process of executing a program in a distributed environment without hadoop

2016-01-19 Thread Akhil Das
If you are processing a file, then you can keep the same file in all machines in the same location and everything should work. Thanks Best Regards On Wed, Jan 20, 2016 at 11:15 AM, Kamaruddin wrote: > I want to execute a program in a distributed environment without

Re: Concurrent Spark jobs

2016-01-19 Thread Madabhattula Rajesh Kumar
Hi, just a thought: can we use Spark Job Server and trigger jobs through REST APIs? In this case, all jobs will share the same context and run in parallel. If anyone has other thoughts, please share. Regards, Rajesh On Tue, Jan 19, 2016 at 10:28 PM, emlyn wrote: > We

Re: Re: spark dataframe jdbc read/write using dbcp connection pool

2016-01-19 Thread fightf...@163.com
Hi, Thanks a lot for your suggestion. I then tried the following code : val prop = new java.util.Properties prop.setProperty("user","test") prop.setProperty("password", "test") prop.setProperty("partitionColumn", "added_year") prop.setProperty("lowerBound", "1985")

Re: spark dataframe jdbc read/write using dbcp connection pool

2016-01-19 Thread 刘虓
Hi, I suggest you partition the JDBC read on an indexed column of the MySQL table. 2016-01-20 10:11 GMT+08:00 fightf...@163.com : > Hi , > I want to load really large volumn datasets from mysql using spark > dataframe api. And then save as > parquet file or orc file to
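A hedged sketch of that suggestion, using the DataFrameReader.jdbc overload that takes partitioning bounds (the URL, table, column and bounds are placeholders):

    val props = new java.util.Properties()
    props.setProperty("user", "test")
    props.setProperty("password", "test")

    // Spark issues one query per partition, slicing [lowerBound, upperBound] on the indexed column.
    val df = sqlContext.read.jdbc(
      "jdbc:mysql://dbhost:3306/mydb",  // url
      "big_table",                      // table
      "id",                             // indexed numeric partition column
      1L,                               // lowerBound
      1000000000L,                      // upperBound
      100,                              // numPartitions
      props)

    df.write.parquet("/warehouse/big_table_parquet")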

Redundant common columns of natural full outer join

2016-01-19 Thread Zhong Wang
Hi all, I am joining two tables with common columns using a full outer join. However, the current DataFrame API doesn't support natural joins, so the output contains redundant common columns from both tables. Is there any way to remove these redundant columns for a "natural" full outer join?
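One way this is commonly handled (a sketch, not from the thread): Spark 1.6 added a join overload that takes the shared column names, so the join key appears only once in the output; alternatively the duplicate can be dropped after an expression join.

    // Keeps a single "id" column in the result (the column name is illustrative).
    val joined = df1.join(df2, Seq("id"), "outer")

    // Equivalent with an explicit join expression, dropping the duplicated column afterwards:
    val joined2 = df1.join(df2, df1("id") === df2("id"), "outer").drop(df2("id"))

Depending on the version, the surviving key column in an outer join may be taken from one side only, so explicitly coalescing the two key columns is the safest way to get true natural-join semantics.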

Re: Parquet write optimization by row group size config

2016-01-19 Thread Akhil Das
Did you try re-partitioning the data before doing the write? Thanks Best Regards On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov < pavel.plotni...@team.wrike.com> wrote: > Hello, > I'm using spark on some machines in standalone mode, data storage is > mounted on this machines via nfs. A have
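A minimal sketch of that suggestion (the partition count and output path are illustrative): spreading the hourly batch across more partitions lets more cores take part in the Parquet encoding.

    df.repartition(32)
      .write
      .parquet("/mnt/nfs/output/pageviews/hour=2016011912")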

Re: Docker/Mesos with Spark

2016-01-19 Thread Nagaraj Chandrashekar
Hi John, I recently deployed Redis instances using Kubernetes framework on Apache Mesos. Kubernetes uses POD concept and you can run your requirements (Redis/Spark) as a docker container and also adds up some of the HA concepts to the instances. Cheers Nagaraj C From: Darren Govoni

Re: SparkContext SyntaxError: invalid syntax

2016-01-19 Thread Felix Cheung
I have to run this to install the prerequisites to get the jekyll build to work; you do need the python pygments package (I'm on ubuntu): sudo apt-get install ruby ruby-dev make gcc nodejs; sudo gem install jekyll --no-rdoc --no-ri; sudo gem install jekyll-redirect-from; sudo apt-get install

Re: RDD immutablility

2016-01-19 Thread Marco
It depends on what you mean by "write access". The RDDs are immutable, so you can't really change them. When you apply a mapping/filter/groupBy function, you are creating a new RDD starting from the original one. Kind regards, Marco 2016-01-19 13:27 GMT+01:00 Dave : >

Is there a way to co-locate partitions from two partitioned RDDs?

2016-01-19 Thread nwali
Hi, I am working with Spark in Java on top of an HDFS cluster. In my code two RDDs are partitioned with the same partitioner (HashPartitioner with the same number of partitions), so they are co-partitioned. Thus the same keys are on the same partition numbers, but that does not mean that both RDDs are

Re: RDD immutablility

2016-01-19 Thread Sean Owen
It's a good question. You can easily imagine an RDD of classes that are mutable. Yes, if you modify these objects, the result is pretty undefined, so don't do that. On Tue, Jan 19, 2016 at 12:27 PM, Dave wrote: > Hi Marco, > > Yes, that answers my question. I just wanted

Parquet write optimization by row group size config

2016-01-19 Thread Pavel Plotnikov
Hello, I'm using Spark on some machines in standalone mode; data storage is mounted on these machines via NFS. I have an input data stream, and when I try to store all the data for an hour in Parquet, the job executes mostly on one core and this hourly data is stored in 40-50 minutes. It is very slow!

Re: RDD immutablility

2016-01-19 Thread Dave
Thanks Sean. On 19/01/16 13:36, Sean Owen wrote: It's a good question. You can easily imagine an RDD of classes that are mutable. Yes, if you modify these objects, the result is pretty undefined, so don't do that. On Tue, Jan 19, 2016 at 12:27 PM, Dave wrote: Hi

Re: RDD immutablility

2016-01-19 Thread Marco
Hello, RDDs are immutable by design. The reasons, to quote Sean Owen in this answer (https://www.quora.com/Why-is-a-spark-RDD-immutable), are the following: Immutability rules out a big set of potential problems due to updates from multiple threads at once. Immutable data is definitely safe

Re: Reuse Executor JVM across different JobContext

2016-01-19 Thread praveen S
Can you give me more details on Spark's jobserver. Regards, Praveen On 18 Jan 2016 03:30, "Jia" wrote: > I guess all jobs submitted through JobServer are executed in the same JVM, > so RDDs cached by one job can be visible to all other jobs executed later. > On Jan 17,

Re: RDD immutablility

2016-01-19 Thread Dave
Hi Marco, Yes, that answers my question. I just wanted to be sure, as the API gives me write access to the immutable data, which means it's up to the developer to know not to modify the input parameters for these APIs. Thanks for the response. Dave. On 19/01/16 12:25, Marco wrote: Hello, RDD

Re: sqlContext.cacheTable("tableName") vs dataFrame.cache()

2016-01-19 Thread Jerry Lam
Is cacheTable similar to asTempTable before? Sent from my iPhone > On 19 Jan, 2016, at 4:18 am, George Sigletos wrote: > > Thanks Kevin for your reply. > > I was suspecting the same thing as well, although it still does not make much > sense to me why would you need

Re: Reuse Executor JVM across different JobContext

2016-01-19 Thread Jia
Hi, Praveen, have you checked out this, which might have the details you need: https://spark-summit.org/2014/wp-content/uploads/2014/07/Spark-Job-Server-Easy-Spark-Job-Management-Chan-Chu.pdf Best Regards, Jia On Jan 19, 2016, at 7:28 AM, praveen S wrote: > Can you give

Re: Split columns in RDD

2016-01-19 Thread Sabarish Sasidharan
The most efficient way to determine the number of columns would be to do a take(1) and split in the driver. Regards Sab On 19-Jan-2016 8:48 pm, "Richard Siebeling" wrote: > Hi, > > what is the most efficient way to split columns and know how many columns > are created. > >

Can I configure Spark on multiple nodes using local filesystem on each node?

2016-01-19 Thread Jia Zou
Dear all, Can I configure Spark on multiple nodes without HDFS, so that output data will be written to the local file system on each node? I guess there is no such feature in Spark, but just want to confirm. Best Regards, Jia

can we create dummy variables from categorical variables, using sparkR

2016-01-19 Thread Devesh Raj Singh
Hi, Can we create dummy variables for categorical variables in sparkR like we do using "dummies" package in R -- Warm regards, Devesh.

Split columns in RDD

2016-01-19 Thread Richard Siebeling
Hi, what is the most efficient way to split columns and know how many columns are created. Here is the current RDD:
ID  STATE
1   TX, NY, FL
2   CA, OH
This is the preferred output:
ID  STATE_1  STATE_2

Re: Can I configure Spark on multiple nodes using local filesystem on each node?

2016-01-19 Thread Pavel Plotnikov
Hi, I'm using Spark in standalone mode without HDFS, and shared folder is mounted on nodes via nfs. It looks like each node write data like in local file system. Regards, Pavel On Tue, Jan 19, 2016 at 5:39 PM Jia Zou wrote: > Dear all, > > Can I configure Spark on
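A minimal sketch of that setup (the paths are illustrative): as long as the same path is visible on every node, whether via NFS or as an identical local copy, plain file:// URIs work for both input and output.

    val data = sc.textFile("file:///mnt/shared/input/data.csv")
    // Each executor writes its partitions under this directory on its own mount.
    data.saveAsTextFile("file:///mnt/shared/output/run1")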

Re: spark yarn client mode

2016-01-19 Thread 刘虓
Hi, no, you don't need to. However, when submitting jobs, certain resources will be uploaded to HDFS, which could be a performance issue. Read the log and you will understand: 15/12/29 11:10:06 INFO Client: Uploading resource file:/data/spark/spark152/lib/spark-assembly-1.5.2-hadoop2.6.0.jar -> hdfs

Re: Split columns in RDD

2016-01-19 Thread Daniel Imberman
Hi Richard, If I understand the question correctly it sounds like you could probably do this using mapValues (I'm assuming that you want two pieces of information out of all rows, the states as individual items, and the number of states in the row) val separatedInputStrings = input:RDD[(Int,

Re: Split columns in RDD

2016-01-19 Thread Daniel Imberman
edit: Mistake in the second code example val numColumns = separatedInputStrings.filter{ case(id, (stateList, numStates)) => numStates}.reduce(math.max) On Tue, Jan 19, 2016 at 8:17 AM Daniel Imberman wrote: > Hi Richard, > > If I understand the question correctly it

Re: Reuse Executor JVM across different JobContext

2016-01-19 Thread Gene Pang
Yes, you can share RDDs with Tachyon, while keeping the data in memory. Spark jobs can write to a Tachyon path (tachyon://host:port/path/) and other jobs can read from the same path. Here is a presentation that includes that use case:
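A minimal sketch of that pattern (host, port and path are placeholders; the Tachyon client has to be on the classpath of both applications):

    // Job A persists its output to a Tachyon-managed path.
    rdd.saveAsTextFile("tachyon://tachyon-master:19998/shared/events")

    // Job B, a separate Spark application, later reads the same path.
    val shared = sc.textFile("tachyon://tachyon-master:19998/shared/events")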

Re: Split columns in RDD

2016-01-19 Thread Richard Siebeling
that's true and that's the way we're doing it now, but then we're only using the first row to determine the number of split columns. It could be that in the second (or last) row there are 10 new columns, and we'd like to know that too. Probably a reduceBy operator can be used to do that, but I'm

RangePartitioning

2016-01-19 Thread ddav
Hi, I have the following pair RDD created in java. JavaPairRDD progRef = sc.textFile(programReferenceDataFile, 12).filter( (String s) -> !s.startsWith("#")).mapToPair( (String s) -> {

Concurrent Spark jobs

2016-01-19 Thread emlyn
We have a Spark application that runs a number of ETL jobs, writing the outputs to Redshift (using databricks/spark-redshift). This is triggered by calling DataFrame.write.save on the different DataFrames one after another. I noticed that during the Redshift load while the output of one job is

Re: can we create dummy variables from categorical variables, using sparkR

2016-01-19 Thread Vinayak Agrawal
Yes, you can use the RFormula library. Please see https://databricks.com/blog/2015/10/05/generalized-linear-models-in-sparkr-and-r-formula-support-in-mllib.html On Tue, Jan 19, 2016 at 10:34 AM, Devesh Raj Singh wrote: > Hi, > > Can we create dummy variables for categorical
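For reference, the Scala counterpart of that formula-based encoding (SparkR's glm accepts the same formula syntax); the column names are hypothetical, and the string column is turned into dummy variables automatically:

    import org.apache.spark.ml.feature.RFormula

    val formula = new RFormula()
      .setFormula("label ~ state + age")   // "state" (categorical) becomes one-hot dummy columns
      .setFeaturesCol("features")
      .setLabelCol("label")

    val encoded = formula.fit(df).transform(df)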

Re: Split columns in RDD

2016-01-19 Thread Daniel Imberman
edit 2: filter should be map val numColumns = separatedInputStrings.map{ case(id, (stateList, numStates)) => numStates}.reduce(math.max) On Tue, Jan 19, 2016 at 8:19 AM Daniel Imberman wrote: > edit: Mistake in the second code example > > val numColumns =
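Putting the corrections in this thread together, a consolidated sketch of the discussed approach (the sample data and names are illustrative; an existing SparkContext sc is assumed):

    import org.apache.spark.rdd.RDD

    val input: RDD[(Int, String)] = sc.parallelize(Seq((1, "TX, NY, FL"), (2, "CA, OH")))

    // Split each STATE value and keep the per-row column count alongside it.
    val separatedInputStrings = input.mapValues { states =>
      val stateList = states.split(",").map(_.trim)
      (stateList, stateList.length)
    }

    // The widest row determines how many STATE_n columns are needed.
    val numColumns = separatedInputStrings
      .map { case (_, (_, numStates)) => numStates }
      .reduce(math.max)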