YARN containers getting killed, exit code 52, multiple joins

2017-04-13 Thread rachmaninovquartet
Hi, I have a Spark 1.6.2 app (previously tested on 2.0.0 as well) that requires a ton of memory (1.5 TB) for a small dataset (~500 MB). The memory usage seems to jump when I loop through and inner join to make the dataset 12 times as wide. The app goes down during or after this loop, when I try …
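
For reference: exit code 52 is Spark's SparkExitCode.OOM, i.e. the executor JVM died with an OutOfMemoryError. A common culprit with iterative self-joins is the logical plan (and lineage) growing with every pass; materializing intermediate results keeps it bounded. A minimal sketch, assuming an `id` join key and illustrative paths (Spark 1.6 has no Dataset.checkpoint, so this round-trips through Parquet):

import org.apache.spark.sql.DataFrame

// Widen a base frame by repeated inner self-joins, truncating lineage
// every few rounds by materializing to Parquet (paths are illustrative).
def widen(base: DataFrame, times: Int): DataFrame = {
  var result = base
  for (i <- 1 until times) {
    // Rename columns so the joined copy doesn't collide with `result`.
    val slice = base.toDF(base.columns.map(c => s"${c}_$i"): _*)
    result = result.join(slice, result("id") === slice(s"id_$i"))
    if (i % 3 == 0) {
      result.write.parquet(s"/tmp/widen_step_$i")
      result = result.sqlContext.read.parquet(s"/tmp/widen_step_$i")
    }
  }
  result
}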

Re: Running Hive and Spark together with Dynamic Resource Allocation

2016-10-31 Thread rachmaninovquartet
It seems like the best solution is to set yarn.nodemanager.aux-services to mapred_shuffle,spark_shuffle.
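
For reference, a yarn-site.xml sketch of running both shuffle services side by side. Note that stock Hadoop 2.x registers the MapReduce service as mapreduce_shuffle (not mapred_shuffle); the class names below are the standard handlers, but check what your distribution expects:

<!-- yarn-site.xml: run both shuffle services side by side -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>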

Running Hive and Spark together with Dynamic Resource Allocation

2016-10-27 Thread rachmaninovquartet
Hi, my team has a cluster running HDP with Hive and Spark. We set up Spark to use dynamic resource allocation, for benefits such as not having to hard-code the number of executors and freeing resources after use. Everything is running on YARN. The problem is that for Spark 1.5.2 with dynamic …
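
For context, a minimal sketch of the client-side settings dynamic allocation needs (the external shuffle service must also be installed on each NodeManager, as the reply above describes); the executor bounds are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// Dynamic allocation needs the external shuffle service so executors can
// be released without losing the shuffle files they wrote.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-demo")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")   // illustrative bounds
  .set("spark.dynamicAllocation.maxExecutors", "20")

val sc = new SparkContext(conf)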

Extra string added to column name? (withColumn & expr)

2016-08-18 Thread rachmaninovquartet
Hi, I'm trying to implement a custom one-hot encoder, since I want the output formatted a specific way, suitable for Theano. Basically, it will create a new column for each distinct member of the original features, set to 1 if the observation contains that specific member of the distinct …
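
A hand-rolled sketch of the idea, using when/otherwise rather than expr; `df` and the `category` column are assumptions, and the generated names are predictable (one 0/1 column per distinct value):

import org.apache.spark.sql.functions.{col, when, lit}

// Collect the distinct category values (assumes modest cardinality).
val categories = df.select("category").distinct()
  .collect().map(_.getString(0))

// One integer 0/1 column per category, with a predictable name.
val encoded = categories.foldLeft(df) { (acc, cat) =>
  acc.withColumn(s"category_$cat",
    when(col("category") === lit(cat), 1).otherwise(0))
}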

Re: Large where clause StackOverflow 1.5.2

2016-08-18 Thread rachmaninovquartet
I solved this by using a Window partitioned by 'id'. I used lead and lag to create columns which contained nulls in the places I needed to delete in each fold. I then removed the rows with the nulls and dropped my additional columns.
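
A sketch of that technique with assumed `id` and `date` columns: lead and lag mark each row's neighbours within the id's date-ordered window, the rows flagged by nulls get filtered (exactly which depends on the fold), and the helper columns are dropped:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lead}

val w = Window.partitionBy("id").orderBy("date")

// Neighbouring dates within each id; the first/last row per id gets null.
val marked = df
  .withColumn("prev_date", lag(col("date"), 1).over(w))
  .withColumn("next_date", lead(col("date"), 1).over(w))

// Drop the rows flagged by nulls, then drop the helper columns.
val result = marked
  .filter(col("prev_date").isNotNull && col("next_date").isNotNull)
  .drop("prev_date").drop("next_date")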

Large where clause StackOverflow 1.5.2

2016-08-16 Thread rachmaninovquartet
Hi, I'm trying to implement a folding function in Spark. It takes an input k and a data frame of ids and dates. k=1 will be just the data frame; k=2 will consist of the min and max date for each id once and the rest twice; k=3 will consist of min and max once, min+1 and max-1 twice, and the rest …
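
For reference, the StackOverflowError from a very large where clause typically comes from Catalyst recursing over a deeply nested expression tree. One standard workaround (distinct from the Window solution above) is to materialize the wanted keys as a small data frame and inner join, instead of building one giant predicate; `pairs`, `df`, and the column names here are assumptions:

import sqlContext.implicits._  // sqlContext assumed in scope

// `pairs` is an assumed local Seq[(Int, String)] of (id, date) rows to keep.
val keep = pairs.toDF("keep_id", "keep_date")

// Inner join replaces one giant (id = ... AND date = ...) OR ... predicate.
val filtered = df
  .join(keep, df("id") === keep("keep_id") && df("date") === keep("keep_date"))
  .drop("keep_id").drop("keep_date")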

Strange behavior including memory leak and NPE

2016-07-19 Thread rachmaninovquartet
Hi, I've been fighting a strange situation today. I'm trying to add two entries for each of the distinct rows of an account, except for the first and last (by date). Here's an example of some of the code; I can't get the subset to continue forward: var acctIdList = …
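
A hypothetical sketch of the stated goal, assuming `acctId` and `date` columns and reading "add two entries" as two extra copies of each interior row: row_number finds the endpoints, and exploding an array replicates the rows:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array, col, count, explode, lit, row_number, when}

val w = Window.partitionBy("acctId").orderBy("date")

// Number the rows per account and count the partition size, so the
// first (rn = 1) and last (rn = n) rows can be told apart.
val counted = df
  .withColumn("rn", row_number().over(w))
  .withColumn("n", count(lit(1)).over(Window.partitionBy("acctId")))

// Endpoints keep one copy; interior rows get three (original plus two).
// The `copy` column just distinguishes the duplicates.
val replicated = counted
  .withColumn("copy", explode(
    when(col("rn") === 1 || col("rn") === col("n"), array(lit(0)))
      .otherwise(array(lit(0), lit(1), lit(2)))))
  .drop("rn").drop("n")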

Re: Dense Vectors outputs in feature engineering

2016-07-14 Thread rachmaninovquartet
Or would it be common practice to just retain the original categories in another df?

Re: Dense Vectors outputs in feature engineering

2016-07-14 Thread rachmaninovquartet
Thanks Disha, that worked out well. Can you point me to an example of how to decode the feature vectors in my dataframe back into their categories?
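
One way back from StringIndexer's numeric output to the original strings is IndexToString (available since Spark 1.5); note it decodes the index column, not the one-hot vector itself. Column names here are assumptions:

import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)

// Map the numeric indices back to the fitted indexer's label strings.
val decoder = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryDecoded")
  .setLabels(indexer.labels)

val decoded = decoder.transform(indexed)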

Dense Vectors outputs in feature engineering

2016-07-13 Thread rachmaninovquartet
Hi, I'm trying to use StringIndexer and OneHotEncoder in order to vectorize some of my features. Unfortunately, OneHotEncoder only returns sparse vectors, and I can't find a way, much less an efficient one, to convert the columns generated by OneHotEncoder into dense vectors. I need this as I …
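
A minimal sketch of the usual workaround: a UDF that forces each vector to its dense form (Spark 1.x pipelines use org.apache.spark.mllib.linalg.Vector); `encoded` and the column names are assumptions:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Force each (possibly sparse) vector to its dense representation.
val toDense = udf { v: Vector => Vectors.dense(v.toArray) }

val densified = encoded.withColumn("features_dense", toDense(col("features_onehot")))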

SparkR interaction with R libraries (currently 1.5.2)

2016-06-07 Thread rachmaninovquartet
Hi, I'm trying to figure out how to work with R libraries in Spark properly. I've googled and done some trial and error. The main error I've been running into is "cannot coerce class "structure("DataFrame", package = "SparkR")" to a data.frame". I'm wondering if there is a way to use the R …
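
For reference, the usual cause is that a SparkR DataFrame is distributed, so plain R functions expecting a local data.frame cannot be applied to it directly; collect() brings it to the driver as a regular data.frame (only sensible for small data). `sparkDF` and the library call are assumptions:

# Bring the distributed SparkR DataFrame to the driver as a local R
# data.frame, then hand it to the R library as usual.
local_df <- collect(sparkDF)
result <- some_r_library_fn(local_df)  # hypothetical library call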