Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Sun Rui
You can simply save the join result distributedly, for example as an HDFS file, and then copy the HDFS file to a local file. There is an alternative, memory-efficient way to collect distributed data back to the driver other than collect(): toLocalIterator. The iterator will consume as much
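A minimal Scala sketch of that toLocalIterator route, assuming the join result is a DataFrame named joinedDf and the local output path is a placeholder:

    import java.io.PrintWriter

    // pull partitions back one at a time, so the driver only holds roughly
    // one partition's worth of rows in memory, and write them to a local file
    val writer = new PrintWriter("/tmp/joined_output.csv")
    joinedDf.rdd.toLocalIterator.foreach { row =>
      writer.println(row.mkString(","))
    }
    writer.close()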

Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Taotao.Li
Hi, consider converting the dataframe to an RDD and then using rdd.toLocalIterator to collect data on the driver node. On Fri, Jul 15, 2016 at 9:05 AM, Pedro Rodriguez wrote: > Out of curiosity, is there a way to pull all the data back to the driver > to save without collect()?

How to recommend most similar users using Spark ML

2016-07-14 Thread jeremycod
Hi, I need to develop a service that will recommend to a user other similar users that they can connect to. For each user I have data about user preferences for specific items in the form:
user, item, preference
1,75, 0.89
2,168, 0.478
2,99, 0.321
3,31, 0.012
So
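One possible sketch (not from the thread) of how this could be approached with spark.mllib ALS: fit the model on the (user, item, preference) triples, then compare users by cosine similarity of their latent factor vectors. The file path, rank and target user id are placeholders.

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("prefs.csv").map { line =>
      val Array(user, item, pref) = line.split(",")
      Rating(user.toInt, item.toInt, pref.toDouble)
    }

    val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)

    def cosine(a: Array[Double], b: Array[Double]): Double = {
      val dot = a.zip(b).map { case (x, y) => x * y }.sum
      dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
    }

    // similarity of every other user to user 1, most similar first
    val target = model.userFeatures.lookup(1).head
    val similar = model.userFeatures
      .filter { case (id, _) => id != 1 }
      .map { case (id, features) => (id, cosine(target, features)) }
      .sortBy(-_._2)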

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-14 Thread Sunita Arvind
Thank you for your inputs. Will test it out and share my findings. On Thursday, July 14, 2016, CosminC wrote: > Didn't have the time to investigate much further, but the one thing that > popped out is that partitioning was no longer working on 1.6.1. This would > definitely

Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Pedro Rodriguez
Out of curiosity, is there a way to pull all the data back to the driver to save without collect()? That is, stream the data in chunks back to the driver so that the maximum memory used is comparable to a single node’s data, but all the data is saved on one node. — Pedro Rodriguez PhD Student in

How to check if a data frame is cached?

2016-07-14 Thread Cesar
Is there a simpler way to check if a data frame is cached other than: dataframe.registerTempTable("cachedOutput") assert(hc.isCached("cachedOutput"), "The table was not cached") Thanks! -- Cesar Flores
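A minimal sketch of that check, plus (as an assumption about later versions) the Dataset.storageLevel accessor that Spark 2.1+ exposes for a direct test without a temp table:

    import org.apache.spark.storage.StorageLevel

    df.cache()
    df.registerTempTable("cachedOutput")
    assert(sqlContext.isCached("cachedOutput"), "The table was not cached")

    // Assumption: on Spark 2.1+ the DataFrame can be checked directly
    // assert(df.storageLevel != StorageLevel.NONE)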

Re: Maximum Size of Reference Look Up Table in Spark

2016-07-14 Thread Jacek Laskowski
Hi, My understanding is that the maximum size of a broadcast is Long.MAX_VALUE (and then some, since the data is going to be encoded to save space, esp. for catalyst-driven datasets). Ad 2. Before the tasks access the broadcast variable it has to be sent across the network, which may be too
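For context, a minimal sketch of the pattern being discussed: a reference map broadcast once from the driver and read inside tasks (the map and the records RDD are hypothetical).

    // ship the lookup table to each executor once, instead of once per task
    val lookup: Map[String, String] = Map("US" -> "United States", "DE" -> "Germany")
    val lookupBc = sc.broadcast(lookup)

    val enriched = records.map(code => (code, lookupBc.value.getOrElse(code, "unknown")))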

Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Jacek Laskowski
Hi, Please re-consider your wish since it is going to move the entire distributed dataset to the single machine of the driver and may lead to an OOME. It's more pro to save your result to HDFS or S3 or any other distributed filesystem (that is accessible by the driver and executors). If you insist...
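A sketch of that suggestion, assuming the joined result is a DataFrame named joinedDf and the output paths are placeholders:

    // write the result distributedly (one part file per partition) ...
    joinedDf.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("hdfs:///tmp/joined_output")

    // ... then, if a single local file is really needed, merge it outside Spark, e.g.:
    //   hdfs dfs -getmerge /tmp/joined_output /local/path/joined.csv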

Saving data frames on Spark Master/Driver

2016-07-14 Thread vr.n. nachiappan
Hello, I am using data frames to join two Cassandra tables. Currently, when I invoke save on the data frame as shown below, it saves the join results on the executor nodes. joineddataframe.select(, ...).format("com.databricks.spark.csv").option("header", "true").save() I would like to persist the

SparkStreaming multiple output operations failure semantics / error propagation

2016-07-14 Thread Martin Eden
Hi, I have a Spark 1.6.2 streaming job with multiple output operations (jobs) doing idempotent changes in different repositories. The problem is that I want to somehow pass errors from one output operation to another such that in the current output operation I only update previously successful

Maximum Size of Reference Look Up Table in Spark

2016-07-14 Thread Saravanan Subramanian
Hello All, I am in the middle of designing real-time data enhancement services using Spark Streaming. As part of this, I have to look up some reference data while processing the incoming stream. I have the below questions: 1) What is the maximum size of lookup table / variable that can be stored as

Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-14 Thread Tobi Bosede
Hi everyone, I am trying to filter my features based on the spark.mllib ChiSqSelector. filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label, model.transform(lp.features))) However when I do the following I get the error below. Is there any other way to filter my data to avoid
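For reference, a Scala sketch of the same spark.mllib ChiSqSelector flow (trainingData and data are hypothetical RDD[LabeledPoint]s); it does not address the error itself, which is cut off above.

    import org.apache.spark.mllib.feature.ChiSqSelector
    import org.apache.spark.mllib.regression.LabeledPoint

    val selector = new ChiSqSelector(numTopFeatures = 50)   // 50 is a placeholder
    val model = selector.fit(trainingData)                   // fit on RDD[LabeledPoint]
    val filtered = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))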

Re: Standalone cluster node utilization

2016-07-14 Thread Jakub Stransky
I am witnessing really weird behavior when loading the data from an RDBMS. I tried a different approach for loading the data - I provided a partitioning column to get partitioning parallelism: val df_init = sqlContext.read.format("jdbc").options( Map("url" -> Configuration.dbUrl,
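For comparison, a sketch of a partitioned JDBC read; the URL, table, column and bounds are placeholders, not the poster's actual configuration.

    val df_init = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:postgresql://dbhost:5432/mydb",
      "dbtable"         -> "public.events",
      "driver"          -> "org.postgresql.Driver",
      "partitionColumn" -> "id",        // must be a numeric column
      "lowerBound"      -> "1",
      "upperBound"      -> "1000000",
      "numPartitions"   -> "24"         // upper bound on parallel JDBC connections
    )).load()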

Re: Spark Streaming Kinesis Performance Decrease When Cluster Scale Up with More Executors

2016-07-14 Thread Daniel Santana
Are you re-sharding your kinesis stream as well? I had a similar problem and increasing the number of kinesis stream shards solved it. -- *Daniel Santana* Senior Software Engineer EVERY*MUNDO* 25 SE 2nd Ave., Suite 900 Miami, FL 33131 USA main:+1 (305) 375-0045 EveryMundo.com

Re: Call http request from within Spark

2016-07-14 Thread Ted Yu
I second what Pedro said in the second paragraph. Issuing an http request per row would not scale. On Thu, Jul 14, 2016 at 12:26 PM, Pedro Rodriguez wrote: > Hi Amit, > > Have you tried running a subset of the IDs locally on a single thread? It > would be useful to
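One hedged sketch of that idea (not from the thread): group IDs inside each partition so the service is called once per batch rather than once per row; the batch endpoint and custIdRdd are hypothetical.

    import scala.io.Source

    val profiles = custIdRdd.mapPartitions { ids =>            // custIdRdd: RDD[String] of customer ids
      ids.grouped(100).map { batch =>
        val url = s"https://example.com/profiles?ids=${batch.mkString(",")}"
        Source.fromURL(url).mkString                           // one HTTP call per batch of 100 ids
      }
    }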

Re: Call http request from within Spark

2016-07-14 Thread Pedro Rodriguez
Hi Amit, Have you tried running a subset of the IDs locally on a single thread? It would be useful to benchmark your getProfile function for a subset of the data, then estimate how long the full data set would take, then divide by the number of Spark executor cores. This should at least serve as a

HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-07-14 Thread Chang Lim
Hi, I am on the Spark 2.0 preview release. According to the Spark 2.0 docs, to share TempTable/View, I need to: "to run the Thrift server in the old single-session mode, please set option spark.sql.hive.thriftServer.singleSession to true." Question: *When using HiveThriftServer2.startWithContext(),
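A sketch of one way this is commonly wired up (an assumption, not a confirmed answer to the question): set the flag when building the session, then start the Thrift server with that session's SQLContext so JDBC clients see its temp views.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val spark = SparkSession.builder()
      .appName("thrift-single-session")
      .config("spark.sql.hive.thriftServer.singleSession", "true")
      .enableHiveSupport()
      .getOrCreate()

    spark.range(10).createOrReplaceTempView("shared_view")   // temp view to expose over JDBC
    HiveThriftServer2.startWithContext(spark.sqlContext)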

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
I meant to say that first we can sort the individual partitions and then sort them again by merging. Sort of a divide and conquer mechanism. Does sortByKey take care of all this internally? On Fri, Jul 15, 2016 at 12:08 AM, Punit Naik wrote: > Can we increase the sorting

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Can we increase the sorting speed of RDD by doing a secondary sort first? On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik wrote: > Okay. Can't I supply the same partitioner I used for > "repartitionAndSortWithinPartitions" as an argument to "sortByKey"? > > On 14-Jul-2016

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Okay. Can't I supply the same partitioner I used for "repartitionAndSortWithinPartitions" as an argument to "sortByKey"? On 14-Jul-2016 11:38 PM, "Koert Kuipers" wrote: > repartitionAndSortWithinPartitions partitions the rdd and sorts within > each partition. so each

Re: Spark Streaming Kinesis Performance Decrease When Cluster Scale Up with More Executors

2016-07-14 Thread Renxia Wang
Additional information: The batch duration in my app is 1 minute, from Spark UI, for each batch, the difference between Output Op Duration and Job Duration is big. E.g. Output Op Duration is 1min while Job Duration is 19s. 2016-07-14 10:49 GMT-07:00 Renxia Wang : > Hi all,

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Koert Kuipers
repartitionAndSortWithinPartitions partitions the rdd and sorts within each partition. so each partition is fully sorted, but the rdd is not sorted. sortByKey is basically the same as repartitionAndSortWithinPartitions except it uses a range partitioner so that the entire rdd is sorted. however
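A small sketch contrasting the two on toy data:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))

    // each partition is sorted by key, but there is no global order across partitions
    val partitionSorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))

    // a range partitioner assigns contiguous key ranges to partitions, so the whole RDD is sorted
    val globallySorted = pairs.sortByKey()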

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Hi Koert I have already used "repartitionAndSortWithinPartitions" for secondary sorting and it works fine. Just wanted to know whether it will sort the entire RDD or not. On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers wrote: > repartitionAndSortWithinPartit sort by keys,

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Koert Kuipers
repartitionAndSortWithinPartitions sorts by keys, not values per key, so it is not really a secondary sort by itself. for secondary sort also check out: https://github.com/tresata/spark-sorted On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik wrote: > Hi guys > > In my spark/scala code I
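A hedged sketch of the usual composite-key approach (not spark-sorted itself): put (key, value) into the key, partition on the original key only, and let the ordering cover both fields.

    import org.apache.spark.Partitioner

    case class KeyVal(key: Int, value: Int)
    object KeyVal {
      implicit val ordering: Ordering[KeyVal] = Ordering.by(kv => (kv.key, kv.value))
    }

    // partition on the original key only, so all values of a key land in one partition
    class FirstFieldPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(k: Any): Int =
        math.abs(k.asInstanceOf[KeyVal].key.hashCode) % partitions
    }

    val data = sc.parallelize(Seq((1, 5), (1, 2), (2, 9), (2, 1)))
    val keyed = data.map { case (k, v) => (KeyVal(k, v), ()) }
    // within each partition records come out grouped by key and sorted by value
    val sorted = keyed.repartitionAndSortWithinPartitions(new FirstFieldPartitioner(4))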

Spark Streaming Kinesis Performance Decrease When Cluster Scale Up with More Executors

2016-07-14 Thread Renxia Wang
Hi all, I am running a Spark Streaming application with Kinesis on EMR 4.7.1. The application runs on YARN and uses client mode. There are 17 worker nodes (c3.8xlarge) with 100 executors and 100 receivers. This setting works fine. But when I increase the number of worker nodes to 50, and increase

Re: Standalone cluster node utilization

2016-07-14 Thread Jakub Stransky
Hi Talebzadeh, sorry, I forgot to answer the last part of your question: At the O/S level you should see many CoarseGrainedExecutorBackend through jps, each corresponding to one executor. Are they doing anything? There is one worker with one executor busy and the rest are almost idle: PID USER PR

Re: Standalone cluster node utilization

2016-07-14 Thread Jakub Stransky
Hi Talebzadeh, we are using 6 worker machines - running. We are reading the data through sqlContext (data frame) as suggested in the documentation, rather than JdbcRdd; the prop just specifies name, password, and driver class. Right after this data load we register it as a temp table val

repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Punit Naik
Hi guys In my spark/scala code I am implementing secondary sort. I wanted to know: when I call the "repartitionAndSortWithinPartitions" method, will the whole (entire) RDD be sorted or only the individual partitions? If it's the latter case, will applying a "sortByKey" after

Re: Standalone cluster node utilization

2016-07-14 Thread Mich Talebzadeh
Hi Jakub, Sounds like one executor. Can you point out: 1. The number of slaves/workers you are running 2. Are you using JDBC to read data in? 3. Do you register the DF as a temp table, and if so have you cached the temp table? Sounds like only one executor is active and the rest are sitting

Re: Standalone cluster node utilization

2016-07-14 Thread Zhou (Joe) Xing
I have seen similar behavior in my standalone cluster. I tried to increase the number of partitions, and at some point it seems all the executors or worker nodes start to make parallel connections to the remote data store. But it would be nice if someone could point us to some references on how to

Difference JavaReceiverInputDStream and JavaDStream

2016-07-14 Thread Paolo Patierno
Hi all, what is the difference between JavaReceiverInputDStream and JavaDStream? I see that the latter is always used in all custom receivers when the createStream is going to be used from Python. Thanks, Paolo.

Standalone cluster node utilization

2016-07-14 Thread Jakub Stransky
Hello, I have a Spark cluster running in standalone mode, master + 6 executors. My application is reading data from a database via DataFrame.read, then there is a filtering of rows. After that I re-partition the data, and I wonder why on the executors page of the driver UI I see RDD blocks all

Call http request from within Spark

2016-07-14 Thread Amit Dutta
Hi All, I have a requirement to call a rest service url for 300k customer ids. Things I have tried so far: custid_rdd = sc.textFile('file:Users/zzz/CustomerID_SC/Inactive User Hashed LCID List.csv') #getting all the customer ids and building adds profile_rdd = custid_rdd.map(lambda r:

ranks and cubes

2016-07-14 Thread talgr
I have a dataframe with a few dimensions, for example: I want to build a cube on i,j,k, and get a rank based on total per row (per grouping) so that when doing df.filter('i===3 && 'j===1).show I will get so basically, for any grouping combination, I need a separate dense rank list (i,j,k,

Re: ranks and cubes

2016-07-14 Thread talgr
I'm posting again, as the tables are not showing up in the emails.. I have a dataframe with few dimensions, for example:
+---+---+---+-----+
|  i|  j|  k|total|
+---+---+---+-----+
|  3|  1|  1|    3|
|  3|  1|  2|    6|
|  3|  1|  3|    9|
|  3|  1|  4|   12|
|  3|  1|  5|   15|
|  3|  1|  6|
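One possible reading, sketched with spark.sql functions (the window's partitioning is an assumption about what "per grouping" means here):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // cube over the dimensions, aggregate the total, then dense-rank within each grouping
    val cubed = df.cube(col("i"), col("j"), col("k")).agg(sum(col("total")).as("total"))
    val ranked = cubed.withColumn(
      "rank",
      dense_rank().over(Window.partitionBy(col("i"), col("j")).orderBy(col("total").desc)))

    ranked.filter(col("i") === 3 && col("j") === 1).show()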

Re: Dense Vectors outputs in feature engineering

2016-07-14 Thread rachmaninovquartet
or would it be common practice to just retain the original categories in another df?

Re: Dense Vectors outputs in feature engineering

2016-07-14 Thread rachmaninovquartet
Thanks Disha, that worked out well. Can you point me to an example of how to decode my feature vectors in the dataframe, back into their categories?
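A hedged sketch of one way to get back to the labels (assuming the categories were indexed with StringIndexer): IndexToString reverses the mapping using the fitted indexer's labels. Column names are placeholders.

    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)

    // map the numeric indices back to the original category strings
    val converter = new IndexToString()
      .setInputCol("categoryIndex")
      .setOutputCol("originalCategory")
      .setLabels(indexer.labels)
    val decoded = converter.transform(indexed)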

Is it possible to send CSVSink metrics to HDFS

2016-07-14 Thread johnbutcher
Hi (first ever post), I am experimenting with a Cloudera CDH5 cluster with Spark 1.5.0. I have tried enabling the CSVSink metrics, which seems to work for Linux directories such as /tmp. However, I'm getting errors when trying to send to an HDFS directory. Is it possible to use HDFS? Error message

Re: Issue in spark job. Remote rpc client dissociated

2016-07-14 Thread Balachandar R.A.
Hello, The variable argsList is an array defined above the parallel block. This variable is accessed inside the map function. Launcher.main is not thread-safe. Is it not possible to specify to Spark that every folder needs to be processed as a separate process in a separate working directory?

Re: Issue in spark job. Remote rpc client dissociated

2016-07-14 Thread Sun Rui
Where is argsList defined? Is Launcher.main() thread-safe? Note that if multiple folders are processed in a node, multiple threads may concurrently run in the executor, each processing a folder. > On Jul 14, 2016, at 12:28, Balachandar R.A. wrote: > > Hello Ted, >

Re: Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-07-14 Thread CosminC
Didn't have the time to investigate much further, but the one thing that popped out is that partitioning was no longer working on 1.6.1. This would definitely explain the 2x performance loss. Checking 1.5.1 Spark logs for the same application showed that our partitioner was working correctly, and