RE: Need to use univariate summary stats

2016-02-04 Thread Lohith Samaga M
Hi Arun, You can do df.agg(max(..), min(..)). Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga From: Arunkumar Pillai [mailto:arunkumar1...@gmail.com] Sent: Thursday, February 04, 2016 14.53 To: user@spark.apache.org Subject: Need to use univariate
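A minimal sketch of that suggestion in the Spark 1.6 Scala API; the DataFrame `df` and the column name "x" are assumptions for illustration, not taken from the thread:

```scala
// Hedged sketch: aggregate several univariate stats in one pass without a groupBy.
import org.apache.spark.sql.functions.{max, min, mean, stddev}

val summary = df.agg(max("x"), min("x"), mean("x"), stddev("x"))
summary.show()
```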

Re: library dependencies to run spark local mode

2016-02-04 Thread Ted Yu
Which Spark release are you using? Are there any other clues in the logs? If so, please pastebin. Cheers On Thu, Feb 4, 2016 at 2:49 AM, Valentin Popov wrote: > Hi all, > > I’m trying to run Spark in local mode using the following code: > > SparkConf conf = new

Re: Spark job does not perform well when some RDD in memory and some on Disk

2016-02-04 Thread Alonso Isidoro Roman
"But learned that it is better not to reduce it to 0." could you explain a bit more this sentence? thanks Alonso Isidoro Roman. Mis citas preferidas (de hoy) : "Si depurar es el proceso de quitar los errores de software, entonces programar debe ser el proceso de introducirlos..." - Edsger

Re: library dependencies to run spark local mode

2016-02-04 Thread Valentin Popov
It is 1.6.0, built from source. I’m trying it in my Eclipse project and want to use Spark in it, so I put the libraries there and have no ClassNotFoundException akka-actor_2.10-2.3.11.jar akka-remote_2.10-2.3.11.jar akka-slf4j_2.10-2.3.11.jar config-1.2.1.jar hadoop-auth-2.7.1.jar

Re: [Spark 1.6] Mismatch in kurtosis values

2016-02-04 Thread Sean Owen
It's returning excess kurtosis and not the fourth moment, strictly speaking. I don't have docs in front of me to check but if that's not documented it should be. On Thu, Feb 4, 2016, 15:22 Arunkumar Pillai wrote: > Hi > > I have observed that kurtosis values coming from
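A short hedged illustration of the point above: Spark's kurtosis() returns excess kurtosis (0 for a normal distribution), so adding 3 should line up with tools that report the plain fourth standardized moment. `df` and the column "a" are illustrative:

```scala
// Excess kurtosis vs. "raw" kurtosis: the two differ by exactly 3.
import org.apache.spark.sql.functions.kurtosis

df.agg(kurtosis("a"), kurtosis("a") + 3.0).show()
```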

library dependencies to run spark local mode

2016-02-04 Thread Valentin Popov
Hi all, I’m trying to run Spark in local mode using the following code: SparkConf conf = new SparkConf().setAppName("JavaWord2VecExample").setMaster("local[*]"); JavaSparkContext sc = new JavaSparkContext(conf); but after a while (10 sec) I get an exception; here is the stack trace:

Re: Spark job does not perform well when some RDD in memory and some on Disk

2016-02-04 Thread Prabhu Joseph
Okay, the reason for the task delay within the executor when some RDD partitions are in memory and some are on Hadoop, i.e., multiple locality levels NODE_LOCAL and ANY: in this case the scheduler waits for *spark.locality.wait* (3 seconds by default). During this period, the scheduler waits to launch a data-local task before giving

[Spark 1.6] Mismatch in kurtosis values

2016-02-04 Thread Arunkumar Pillai
Hi I have observed that the kurtosis values coming from Apache Spark have a difference of 3. The values coming from Excel and R are the same (11.333), but the kurtosis value coming from Spark 1.6 differs by 3 (8.333). Please let me know if I'm doing something wrong. I'm executing via

problem in creating function in sparkR for dummy handling

2016-02-04 Thread Devesh Raj Singh
Hi, I have written code to create dummy variables in SparkR df <- createDataFrame(sqlContext, iris) class(dtypes(df)) cat.column<-vector(mode="character",length=nrow(df)) cat.column<-collect(select(df,df$Species)) lev<-length(levels(as.factor(unlist(cat.column for (j in 1:lev){

Re: library dependencies to run spark local mode

2016-02-04 Thread Valentin Popov
I think this is an answer… HADOOP_HOME or hadoop.home.dir are not set. Sorry 2016-02-04 14:10:08 o.a.h.u.Shell [DEBUG] Failed to detect a valid hadoop home directory java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set. at
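A hedged local-mode workaround, if the missing HADOOP_HOME is actually a problem rather than just a noisy DEBUG message: point hadoop.home.dir at a Hadoop install (or, on Windows, a folder containing bin\winutils.exe) before creating the context. The path below is purely a placeholder:

```scala
// Assumption: hadoop.home.dir (or the HADOOP_HOME env var) must point at a real
// Hadoop layout; "/opt/hadoop-2.7.1" is illustrative only.
System.setProperty("hadoop.home.dir", "/opt/hadoop-2.7.1")

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("JavaWord2VecExample").setMaster("local[*]")
val sc = new SparkContext(conf)
```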

Recommended storage solution for my setup (~5M items, 10KB pr.)

2016-02-04 Thread habitats
Hello, I have ~5 million text documents, each around 10-15KB in size, and split into ~15 columns. I intend to do machine learning, and thus I need to extract all of the data at the same time, and potentially update everything on every run. So far I've just used JSON serialization, or simply cached

Re: Recommended storage solution for my setup (~5M items, 10KB pr.)

2016-02-04 Thread Patrick Skjennum
(Am I doing this mailinglist thing right? Never used this ...) I do not have a cluster. Initially I tried to set up hadoop+hbase+spark, but after spending a week trying to get it to work, I gave up. I had a million problems with mismatching versions, and things working locally on the server, but not

cause of RPC error?

2016-02-04 Thread AlexG
I am simply trying to load an RDD from disk with transposeRowsRDD.avro(baseInputFname).rdd.map( ) and I get this error in my log: 16/02/04 11:44:07 ERROR TaskSchedulerImpl: Lost executor 7 on nid00788: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network

RE: submit spark job with specified file for driver

2016-02-04 Thread Mohammed Guller
Here is the description for the --files option that you can specify to spark-submit: --files FILES Comma-separated list of files to be placed in the working directory of each executor. Mohammed Author: Big Data Analytics with Spark -Original Message- From: alexeyy3

Re: submit spark job with specified file for driver

2016-02-04 Thread Ted Yu
Please take a look at: core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala You would see '--files' argument On Thu, Feb 4, 2016 at 2:17 PM, alexeyy3 wrote: > Is it possible to specify a file (with key-value properties) when > submitting > spark app

submit spark job with specified file for driver

2016-02-04 Thread alexeyy3
Is it possible to specify a file (with key-value properties) when submitting a spark app with spark-submit? Some mails refer to the key --files, but the docs do not mention it. If you can specify a file, how do you access it from the spark job driver? Thank you, Alexey -- View this message in context:
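A hedged sketch of reading such a file once it has been shipped with --files; the file name and property key are illustrative, and whether SparkFiles.get resolves on the driver as well as on the executors depends on the deploy mode:

```scala
// Files passed via `spark-submit --files app.properties` land in each executor's
// working directory; SparkFiles.get resolves the local path.
import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.SparkFiles

val props = new Properties()
props.load(new FileInputStream(SparkFiles.get("app.properties")))
val value = props.getProperty("some.key")   // "some.key" is illustrative
```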

Re: Recommended storage solution for my setup (~5M items, 10KB pr.)

2016-02-04 Thread Ted Yu
bq. had a hard time setting it up Mind sharing your experience in more detail :-) If you already have a hadoop cluster, it should be relatively straightforward to set up. Tuning needs extra effort. On Thu, Feb 4, 2016 at 12:58 PM, habitats wrote: > Hello > > I have ~5

Re: Dataset Encoders for SparseVector

2016-02-04 Thread Michael Armbrust
We are hoping to add better support for UDTs in the next release, but for now you can use kryo to generate an encoder for any class: implicit val vectorEncoder = org.apache.spark.sql.Encoders.kryo[SparseVector] On Thu, Feb 4, 2016 at 12:22 PM, raj.kumar wrote: > Hi, >
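A sketch of that suggestion for Spark 1.6; the conversion from the VectorUDT column via the RDD is one possible route (assuming a SQLContext named sqlContext) and may need adjusting:

```scala
// Kryo-based encoder so a Dataset[SparseVector] can be constructed; the "feature"
// column name comes from the question, the RDD-based conversion is an assumption.
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.sql.Encoders

implicit val vectorEncoder = Encoders.kryo[SparseVector]
val ds = sqlContext.createDataset(df.rdd.map(_.getAs[SparseVector]("feature")))
```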

Re: cause of RPC error?

2016-02-04 Thread AlexG
To clarify, that's the tail of the node stderr log, so the last message shown is at the EOF. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cause-of-RPC-error-tp26151p26152.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: add new column in the schema + Dataframe

2016-02-04 Thread Mohammed Guller
Hi Divya, You can use the withColumn method from the DataFrame API. Here is the method signature: def withColumn(colName: String, col: Column): DataFrame Mohammed Author: Big Data Analytics with
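A minimal sketch of withColumn; the column names and the condition are illustrative only:

```scala
// Derive a new column from existing ones; names and condition are placeholders.
import org.apache.spark.sql.functions.{lit, when}

val withFlag = df.withColumn("flag", when(df("year") === 1997, lit(1)).otherwise(lit(0)))
```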

pass one dataframe column value to another dataframe filter expression + Spark 1.5 + scala

2016-02-04 Thread Divya Gehlot
Hi, I have two input datasets. The first input dataset is as below: year,make,model,comment,blank > "2012","Tesla","S","No comment", > 1997,Ford,E350,"Go get one now they are going fast", > 2015,Chevy,Volt Second input dataset: TagId,condition > 1997_cars,year = 1997 and

different behavior while using createDataFrame and read.df in SparkR

2016-02-04 Thread Devesh Raj Singh
Hi, I am using Spark 1.5.1. When I do this df <- createDataFrame(sqlContext, iris) #creating a new column for category "Setosa" df$Species1<-ifelse((df)[[5]]=="setosa",1,0) head(df) output: new column created Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1

Add Singapore meetup

2016-02-04 Thread Li Ming Tsai
Hi, Realised that Singapore has not been added. Please add http://www.meetup.com/Spark-Singapore/ Thanks!

Unit test with sqlContext

2016-02-04 Thread Steve Annessa
I'm trying to unit test a function that reads in a JSON file, manipulates the DF and then returns a Scala Map. The function has signature: def ingest(dataLocation: String, sc: SparkContext, sqlContext: SQLContext) I've created a bootstrap spec for spark jobs that instantiates the Spark Context

RE: Kafka directsream receiving rate

2016-02-04 Thread Diwakar Dhanuskodi
Adding more info: the batch interval is 2000 ms. I expect all 100 messages to go through one DStream from the direct stream, but it receives them at a rate of 10 messages at a time. Am I missing some configuration here? Any help appreciated. Regards, Diwakar. Sent from Samsung Mobile. Original

Slowness in Kmeans calculating fastSquaredDistance

2016-02-04 Thread Li Ming Tsai
Hi, I'm using INTEL MKL on Spark 1.6.0 which I built myself with the -Pnetlib-lgpl flag. I am using spark local[4] mode and I run it like this: # export LD_LIBRARY_PATH=/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64 # bin/spark-shell ... I have also added the following to

Re: Spark Streaming - 1.6.0: mapWithState Kinesis huge memory usage

2016-02-04 Thread Udo Fholl
Thank you for your response. Unfortunately I cannot share a thread dump. What are you looking for exactly? Here is the list of the 50 biggest objects (retained size order, descending): java.util.concurrent.ArrayBlockingQueue# java.lang.Object[]#

spark.storage.memoryFraction for shuffle-only jobs

2016-02-04 Thread Ruslan Dautkhanov
For a Spark job that only does shuffling (e.g. Spark SQL with joins, group bys, analytical functions, order bys), but no explicit persistent RDDs nor dataframes (there are no .cache()es in the code), what would be the lowest recommended setting for spark.storage.memoryFraction?
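For reference, a hedged sketch of how such a shuffle-heavy, cache-free job could be configured under the legacy memory manager; the 0.1/0.6 values are illustrative, not a recommendation from the thread:

```scala
// With no .cache() calls, storage memory can be shrunk in favour of shuffle memory
// (legacy memory-management settings; Spark 1.6's unified manager rebalances on its own).
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.1")
  .set("spark.shuffle.memoryFraction", "0.6")
```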

rdd cache priority

2016-02-04 Thread charles li
Say I have 2 RDDs, RDD1 and RDD2, both 20 GB in memory, and I cache both of them using RDD1.cache() and RDD2.cache(). In the further steps of my app, I never use RDD1 but use RDD2 a lot of the time. Then here is my question: if there is only 40 GB of memory in my cluster, and here

Driver not able to restart the job automatically after the application of Streaming with Kafka Direct went down

2016-02-04 Thread SRK
Hi, I have the Streaming job running in qa/prod. Due to Kafka issues both the jobs went down. After the Kafka issues got resolved and after the deletion of the checkpoint directory the driver in the qa job restarted the job automatically and the application UI was up. But, in the prod job, the

Re: Spark Streaming - 1.6.0: mapWithState Kinesis huge memory usage

2016-02-04 Thread Shixiong(Ryan) Zhu
I guess it may be some dead-lock in BlockGenerator. Could you check it by yourself? On Thu, Feb 4, 2016 at 4:14 PM, Udo Fholl wrote: > Thank you for your response > > Unfortunately I cannot share a thread dump. What are you looking for > exactly? > > Here is the list of

kafkaDirectStream usage error

2016-02-04 Thread Diwakar Dhanuskodi
I am using the below direct stream to consume messages from Kafka. The topic has 8 partitions.  val topicAndPart =  OffsetRange.create("request5",0, 1,10).topicAndPartition()     val fromOffsets = Map[kafka.common.TopicAndPartition,Long](topicAndPart->0)     val messageHandler = (mmd :
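A hedged sketch of the full direct-stream call with explicit starting offsets (spark-streaming-kafka for Spark 1.x); the broker list and offsets are illustrative, and `ssc` is assumed to be an existing StreamingContext:

```scala
// Direct stream starting from an explicit offset for topic "request5", partition 0.
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")  // illustrative
val fromOffsets = Map(TopicAndPartition("request5", 0) -> 0L)
val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message())

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)
```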

PairDStreamFunctions.mapWithState fails in case timeout is set without updating State[S]

2016-02-04 Thread Yuval.Itzchakov
Hi, I've been playing with the experimental PairDStreamFunctions.mapWithState feature and I seem to have stumbled across a bug, and was wondering if anyone else has been seeing this behavior. I've opened up an issue in the Spark JIRA; I simply want to pass this along in case anyone else is

spark.executor.memory ? is used just for cache RDD or both cache RDD and the runtime of cores on worker?

2016-02-04 Thread charles li
If I set spark.executor.memory = 2G for each worker [10 in total], does it mean I can cache a 20 GB RDD in memory? If so, how about the memory for the code running in each process on each worker? Thanks. -- And is there any material about memory management or resource management in Spark? I want to

Re: SQL Statement on DataFrame

2016-02-04 Thread Nishant Aggarwal
Hi Ted, I am using Spark-Shell to do this. I am using Phoenix's client jar for integrating Spark with HBASE. All the operations will be done on Spark side. Thanks, Nishant Thanks and Regards Nishant Aggarwal, PMP Cell No:- +91 99588 94305 http://in.linkedin.com/pub/nishant-aggarwal/53/698/11b

Re: Unit test with sqlContext

2016-02-04 Thread Silvio Fiorito
Hi Steve, Have you looked at the spark-testing-base package by Holden? It’s really useful for unit testing Spark apps as it handles all the bootstrapping for you. https://github.com/holdenk/spark-testing-base DataFrame examples are here:

Re: rdd cache priority

2016-02-04 Thread Takeshi Yamamuro
Hi, you're right; rdd3 is not totally cached and it is re-computed every time. With MEMORY_AND_DISK, rdd3 is written to disk. Also, the current Spark does not automatically unpersist RDDs depending on frequency of use. On Fri, Feb 5, 2016 at 12:15 PM, charles li wrote: >
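A short sketch following that answer: either free the RDD you no longer need explicitly, or persist with a level that spills to disk instead of recomputing. Names follow the question:

```scala
// Free RDD1's blocks once it is no longer needed, and persist RDD2 with a level
// that spills evicted partitions to disk instead of recomputing them.
import org.apache.spark.storage.StorageLevel

rdd1.unpersist()
rdd2.persist(StorageLevel.MEMORY_AND_DISK)   // use this instead of rdd2.cache()
```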

Re: spark.executor.memory ? is used just for cache RDD or both cache RDD and the runtime of cores on worker?

2016-02-04 Thread Rishi Mishra
You would probably like to see http://spark.apache.org/docs/latest/configuration.html#memory-management. Other config parameters are also explained there. On Fri, Feb 5, 2016 at 10:56 AM, charles li wrote: > if set spark.executor.memory = 2G for each worker [ 10 in
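For context, a hedged sketch of the Spark 1.6 unified memory settings referenced on that page; the values shown are the documented defaults, not tuning advice from this thread:

```scala
// spark.executor.memory covers both execution and cached data; the split between
// them is governed by the two fractions below (defaults shown).
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "2g")
  .set("spark.memory.fraction", "0.75")         // heap share for execution + storage
  .set("spark.memory.storageFraction", "0.5")   // storage portion protected from eviction
```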

SQL Statement on DataFrame

2016-02-04 Thread Nishant Aggarwal
Dear All, I am working on a scenario mentioned below. Need your help: Task: Load the data from HBASE using Phoenix into Spark as a DataFrame, do the operation and store the data back to HBASE using Phoenix. I know this is feasible via writing code. My question is, Is it possible to load the

Re: Please help with external package using --packages option in spark-shell

2016-02-04 Thread Jeff - Data Bean Australia
Thanks Divya for helping me. It does connect to the internet. And I even tried to use a local artifactory repository and it didn't work either. $ ./spark-shell --packages harsha2010:magellan:1.0.3-s_2.10 --repositories http://localhost:8081/artifactory/libs-release Ivy Default Cache set to:

Re: sc.textFile the number of the workers to parallelize

2016-02-04 Thread Takeshi Yamamuro
Hi, ISTM these tasks are just assigned to executors on preferred nodes, so how about repartitioning the rdd? s3File.repartition(9).count On Fri, Feb 5, 2016 at 5:04 AM, Lin, Hao wrote: > Hi, > > > > I have a question on the number of workers that Spark enable to > parallelize

Re: sc.textFile the number of the workers to parallelize

2016-02-04 Thread Koert Kuipers
increase minPartitions: sc.textFile(path, minPartitions = 9) On Thu, Feb 4, 2016 at 11:41 PM, Takeshi Yamamuro wrote: > Hi, > > ISTM these tasks are just assigned with executors in preferred nodes, so > how about repartitioning rdd? > > s3File.repartition(9).count > > On
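The two suggestions side by side, as a hedged sketch; the path and the number 9 are illustrative:

```scala
// Either ask for more partitions when loading, or repartition afterwards.
val s3File = sc.textFile("s3n://my-bucket/path/*", minPartitions = 9)  // path is a placeholder
s3File.count()

s3File.repartition(9).count()   // alternative: reshuffle an already-loaded RDD
```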

Re: Unit test with sqlContext

2016-02-04 Thread Rishi Mishra
Hi Steve, Have you cleaned up your SparkContext (sc.stop()) in an afterAll()? The error suggests you are creating more than one SparkContext. On Fri, Feb 5, 2016 at 10:04 AM, Holden Karau wrote: > Thanks for recommending spark-testing-base :) Just wanted to add if
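A minimal ScalaTest sketch of that pattern (the suite name, file path, and test body are illustrative); spark-testing-base, mentioned above, wraps the same bootstrapping for you:

```scala
// Create the contexts once per suite and stop the SparkContext in afterAll() so the
// next suite can create its own without "multiple SparkContexts" errors.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class IngestSpec extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _
  private var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("ingest-test"))
    sqlContext = new SQLContext(sc)
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
  }

  test("ingest returns a non-empty map") {
    // val result = ingest("src/test/resources/sample.json", sc, sqlContext)  // hypothetical path
    // assert(result.nonEmpty)
  }
}
```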

Re: Recommended storage solution for my setup (~5M items, 10KB pr.)

2016-02-04 Thread Nick Pentreath
If I'm not mistaken, your data seems to be about 50MB of text documents? In which case simple flat text files in JSON or CSV seems ideal, as you are already doing. If you are using Spark then DataFrames can read/write either of these formats. For that size of data you may not require Spark.
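A brief sketch of that round trip in Spark 1.5/1.6; the paths are illustrative, and CSV would additionally need the spark-csv package:

```scala
// Flat JSON documents read into a DataFrame and written back out (here as Parquet).
val docs = sqlContext.read.json("/data/documents.json")
docs.write.parquet("/data/documents.parquet")   // or .json(...) to stay with flat text
```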

Re: SQL Statement on DataFrame

2016-02-04 Thread Ted Yu
Did you mean using bin/sqlline.py to perform the query ? Have you asked on Phoenix mailing list ? Phoenix has phoenix-spark module. Cheers On Thu, Feb 4, 2016 at 7:28 PM, Nishant Aggarwal wrote: > Dear All, > > I am working on a scenario mentioned below. Need your

Re: Unit test with sqlContext

2016-02-04 Thread Holden Karau
Thanks for recommending spark-testing-base :) Just wanted to add if anyone has feature requests for Spark testing please get in touch (or add an issue on the github) :) On Thu, Feb 4, 2016 at 8:25 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Hi Steve, > > Have you looked at the

Need to use univariate summary stats

2016-02-04 Thread Arunkumar Pillai
Hi I'm currently using the query sqlContext.sql("SELECT MAX(variablesArray) FROM " + tableName) to extract mean, max, min. Is there any better optimized way? In the example I saw df.groupBy("key").agg(skewness("a"), kurtosis("a")), but I don't have a key anywhere in the data. How to extract the

add new column in the schema + Dataframe

2016-02-04 Thread Divya Gehlot
Hi, I am a beginner in Spark, using Spark 1.5.2 on YARN (HDP 2.3.4). I have a use case where I have to read two input files and, based on certain conditions in the second input file, add a new column to the first input file and save it. I am using spark-csv to read my input files. Would

[Spark 1.6] Univariate Stats using apache spark

2016-02-04 Thread Arunkumar Pillai
Hi Currently, after creating a dataframe, I'm querying it for max, min, mean to get the result. sqlContext.sql("SELECT MAX(variablesArray) FROM " + tableName) Is this an optimized way? I'm not able to find all the stats like min, max, mean, variance, skewness, kurtosis directly from a dataframe. Please
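A hedged sketch of getting those statistics without SQL strings; the column name "x" is illustrative:

```scala
// describe() covers count/mean/stddev/min/max in one pass; agg() without a groupBy
// adds the higher moments (Spark 1.6 functions).
import org.apache.spark.sql.functions.{variance, skewness, kurtosis}

df.describe("x").show()
df.agg(variance("x"), skewness("x"), kurtosis("x")).show()
```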

Re: PairDStreamFunctions.mapWithState fails in case timeout is set without updating State[S]

2016-02-04 Thread Yuval Itzchakov
Let me know if you do need a pull request for this, I can make that happen (given someone does a vast PR to make sure I'm understanding this problem right). On Thu, Feb 4, 2016 at 8:21 PM, Shixiong(Ryan) Zhu wrote: > Thanks for reporting it. I will take a look. > > On

Re: Spark Streaming - 1.6.0: mapWithState Kinesis huge memory usage

2016-02-04 Thread Shixiong(Ryan) Zhu
Hey Udo, mapWithState usually uses much more memory than updateStateByKey since it caches the states in memory. However, from your description, it looks like BlockGenerator cannot push data into BlockManager; there may be something wrong in BlockGenerator. Could you share the top 50 objects in the heap

Re: PairDStreamFunctions.mapWithState fails in case timeout is set without updating State[S]

2016-02-04 Thread Tathagata Das
Shixiong has already opened the PR - https://github.com/apache/spark/pull/11081 On Thu, Feb 4, 2016 at 11:11 AM, Yuval Itzchakov wrote: > Let me know if you do need a pull request for this, I can make that happen > (given someone does a vast PR to make sure I'm understanding

Re: Reading large set of files in Spark

2016-02-04 Thread Ted Yu
For question #2, see the following method of FileSystem : public abstract boolean delete(Path f, boolean recursive) throws IOException; FYI On Thu, Feb 4, 2016 at 10:58 AM, Akhilesh Pathodia < pathodia.akhil...@gmail.com> wrote: > Hi, > > I am using Spark to read large set of files from
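A short sketch of that call from a Spark job; the path is illustrative:

```scala
// Delete a processed directory through the Hadoop FileSystem API (recursive = true).
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("/data/processed/dir1"), true)
```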

Re: PairDStreamFunctions.mapWithState fails in case timeout is set without updating State[S]

2016-02-04 Thread Yuval Itzchakov
Awesome. Thanks for the super fast reply. On Thu, Feb 4, 2016, 21:16 Tathagata Das wrote: > Shixiong has already opened the PR - > https://github.com/apache/spark/pull/11081 > > On Thu, Feb 4, 2016 at 11:11 AM, Yuval Itzchakov > wrote: > >> Let

Using jar bundled log4j.xml on worker nodes

2016-02-04 Thread Matthias Niehoff
Hello everybody, we’ve bundled our log4j.xml into our jar (in the classpath root). I’ve added the log4j.xml to the spark-defaults.conf with spark.{driver,executor}.extraJavaOptions=-Dlog4j.configuration=log4j.xml There is no log4j.properties or log4j.xml in any of the conf folders on any

Question on RDD caching

2016-02-04 Thread Vishnu Viswanath
Hello, When we call cache() or persist(MEMORY_ONLY), how does the request flow to the nodes? I am assuming this will happen: 1. Driver knows which all nodes hold the partition for the given rdd (where is this info stored?) 2. It sends a cache request to the node's executor 3. The executor will

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-04 Thread Ted Yu
Jay: It would be nice if you can patch Spark with below PR and give it a try. Thanks On Wed, Feb 3, 2016 at 6:03 PM, Ted Yu wrote: > Created a pull request: > https://github.com/apache/spark/pull/11066 > > FYI > > On Wed, Feb 3, 2016 at 1:27 PM, Shipper, Jay [USA]

Spark Cassandra Atomic Inserts

2016-02-04 Thread Flaherty, Frank
Cassandra provides "BEGIN BATCH" and "APPLY BATCH" to perform atomic execution of multiple statements as below: BEGIN BATCH INSERT INTO "user_status_updates" ("username", "id", "body") VALUES( 'dave', 16e2f240-2afa-11e4-8069-5f98e903bf02, 'dave update 4' ); INSERT INTO

Memory tuning in spark sql

2016-02-04 Thread ARUN.BONGALE
Hi Sir/madam, Greetings of the day. I am working on Spark 1.6.0 with AWS EMR (Elastic MapReduce). I'm facing some issues in reading a large (500 MB) file in Spark SQL. Sometimes I get a heap space error and sometimes the executors fail. I have increased the driver memory, executor memory, kryo

Re: spark streaming web ui not showing the events - direct kafka api

2016-02-04 Thread Cody Koeninger
There have been changes to visibility of info in ui between 1.4 and 1.5, I can't say off the top of my head at which point versions they took place. On Thu, Feb 4, 2016 at 12:07 AM, vimal dinakaran wrote: > No I am using DSE 4.8 which has spark 1.4. Is this a known issue ?

Re: Spark job does not perform well when some RDD in memory and some on Disk

2016-02-04 Thread Prabhu Joseph
If spark.locality.wait is 0, then there are two performance issues: 1. The task scheduler won't wait to schedule the tasks as DATA_LOCAL; it will launch them immediately on some node even if it is less local. The probability of tasks running at a less local level will be higher, which affects the overall job
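For reference, a hedged sketch of the settings involved; the values shown are the defaults, not advice from the thread:

```scala
// spark.locality.wait is the fallback delay per locality level; per-level overrides exist.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "3s")
  .set("spark.locality.wait.node", "3s")   // NODE_LOCAL wait before falling back to ANY
```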

Re: Using jar bundled log4j.xml on worker nodes

2016-02-04 Thread Ted Yu
Have you taken a look at SPARK-11105 ? Cheers On Thu, Feb 4, 2016 at 9:06 AM, Matthias Niehoff < matthias.nieh...@codecentric.de> wrote: > Hello everybody, > > we’ve bundle our log4j.xml into our jar (in the classpath root). > > I’ve added the log4j.xml to the spark-defaults.conf with > >

Re: Memory tuning in spark sql

2016-02-04 Thread Ted Yu
Please take a look at SPARK-1867. The discussion was very long. You may want to look for missing classes. Also see https://bugs.openjdk.java.net/browse/JDK-7172206 On Thu, Feb 4, 2016 at 10:31 AM, wrote: > Hi Ted. Thanks for the response. > > i'm just trying to do

Reading large set of files in Spark

2016-02-04 Thread Akhilesh Pathodia
Hi, I am using Spark to read large set of files from HDFS, applying some formatting on each line and then saving each line as a record in hive. Spark is reading directory paths from kafka. Each directory can have large number of files. I am reading one path from kafka and then processing all

Re: Memory tuning in spark sql

2016-02-04 Thread Ted Yu
Can you provide a bit more detail ? values of the parameters you have tuned log snippets from executors snippet of your code Thanks On Thu, Feb 4, 2016 at 9:48 AM, wrote: > Hi Sir/madam, > Greetings of the day. > > I am working on Spark 1.6.0 with AWS EMR(Elastic

Re: PairDStreamFunctions.mapWithState fails in case timeout is set without updating State[S]

2016-02-04 Thread Shixiong(Ryan) Zhu
Thanks for reporting it. I will take a look. On Thu, Feb 4, 2016 at 6:56 AM, Yuval.Itzchakov wrote: > Hi, > I've been playing with the expiramental PairDStreamFunctions.mapWithState > feature and I've seem to have stumbled across a bug, and was wondering if > anyone else has

Re: PairDStreamFunctions.mapWithState fails in case timeout is set without updating State[S]

2016-02-04 Thread Sachin Aggarwal
I think we need to add the port http://serverfault.com/questions/317903/aws-ec2-open-port-8080 Do you remember doing anything like this earlier for AWS? On Fri, Feb 5, 2016 at 1:07 AM, Yuval Itzchakov wrote: > Awesome. Thanks for the super fast reply. > > > On Thu, Feb 4, 2016,

Re: DataFrame First method is resulting different results in each iteration

2016-02-04 Thread Ali Tajeldin EDU
Hi Satish, Take a look at the smvTopNRecs() function in the SMV package. It does exactly what you are looking for. It might be overkill to bring in all of SMV for just one function but you will also get a lot more than just DF helper functions (modular views, higher level graphs, dynamic

Dataset Encoders for SparseVector

2016-02-04 Thread raj.kumar
Hi, I have a DataFrame df with a column "feature" of type SparseVector that results from the ml library's VectorAssembler class. I'd like to get a Dataset of SparseVectors from this column, but when I do a df.as[SparseVector] scala complains that it doesn't know of an encoder for

Re: PairDStreamFunctions.mapWithState fails in case timeout is set without updating State[S]

2016-02-04 Thread Sachin Aggarwal
http://coenraets.org/blog/2011/11/set-up-an-amazon-ec2-instance-with-tomcat-and-mysql-5-minutes-tutorial/ The default Tomcat server uses port 8080. You need to open that port on your instance to make sure your Tomcat server is available on the Web (you could also change the default port). In the

Re: PairDStreamFunctions.mapWithState fails in case timeout is set without updating State[S]

2016-02-04 Thread Sachin Aggarwal
I am sorry for the spam, I replied in the wrong thread, sleepy head :-( On Fri, Feb 5, 2016 at 1:15 AM, Sachin Aggarwal wrote: > > http://coenraets.org/blog/2011/11/set-up-an-amazon-ec2-instance-with-tomcat-and-mysql-5-minutes-tutorial/ > > The default Tomcat server uses port

sc.textFile the number of the workers to parallelize

2016-02-04 Thread Lin, Hao
Hi, I have a question on the number of workers that Spark enable to parallelize the loading of files using sc.textFile. When I used sc.textFile to access multiple files in AWS S3, it seems to only enable 2 workers regardless of how many worker nodes I have in my cluster. So how does Spark

Re: Dynamic sql in Spark 1.5

2016-02-04 Thread Ali Tajeldin EDU
Sorry, I'm not quite following what your intent is. Not sure how TagN column is being derived. Is Dataset1 your input and Dataset 2 your output? I Don't see the relationship between them clearly. Can you describe your input, and the expected output. -- Ali On Feb 2, 2016, at 11:28 PM,

Re: Broadcast join on multiple dataframes

2016-02-04 Thread Srikanth
Hello, Any pointers on what is causing the optimizer to convert broadcast to shuffle join? This join is with a file that is just 4kb in size. Complete plan --> https://www.dropbox.com/s/apuomw1dg0t1jtc/plan_with_select.txt?dl=0 DAG from UI -->
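One hedged workaround while investigating: force the small side with an explicit broadcast hint (available since Spark 1.5); the DataFrame and column names are illustrative:

```scala
// An explicit broadcast hint bypasses the size-estimate threshold for the join strategy.
import org.apache.spark.sql.functions.broadcast

val joined = largeDF.join(broadcast(smallDF), largeDF("key") === smallDF("key"))
```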

Re: Re: About cache table performance in spark sql

2016-02-04 Thread Takeshi Yamamuro
Hi, Parquet data are column-wise and highly compressed, so the size of deserialized rows in Spark could be bigger than that of the Parquet data on disk. That is, I think that 24.59GB of Parquet data becomes (18.1GB + 23.6GB) of data in Spark. Yes, as you know, cached data in Spark are also compressed by

Re: sparkR not able to create /append new columns

2016-02-04 Thread Devesh Raj Singh
Thank you Rui Sun ! It is working now!! On Thu, Feb 4, 2016 at 9:21 AM, Sun, Rui wrote: > Devesh, > > > > Note that DataFrame is immutable. withColumn returns a new DataFrame > instead of adding a column in-pace to the DataFrame being operated. > > > > So, you can modify the

Re: Re: About cache table performance in spark sql

2016-02-04 Thread fightf...@163.com
Oh, thanks. Make sense to me. Best, Sun. fightf...@163.com From: Takeshi Yamamuro Date: 2016-02-04 16:01 To: fightf...@163.com CC: user Subject: Re: Re: About cache table performance in spark sql Hi, Parquet data are column-wise and highly compressed, so the size of deserialized rows in