Re: recommendProductsForUser for a subset of user

2016-02-03 Thread Sabarish Sasidharan
You could always construct a new MatrixFactorizationModel with your filtered set of user features and product features. I believe it's just a stateless wrapper around the actual RDDs. Regards Sab On Wed, Feb 3, 2016 at 5:28 AM, Roberto Pagliari wrote: > When using
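A minimal sketch of that approach (model and userSubset are assumed names; the constructor and recommendProductsForUsers come from MLlib's ALS API):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// model: the trained MatrixFactorizationModel; userSubset: Set[Int] of user ids to keep (assumed)
val filteredUserFeatures = model.userFeatures.filter { case (userId, _) => userSubset.contains(userId) }
val subsetModel = new MatrixFactorizationModel(model.rank, filteredUserFeatures, model.productFeatures)
val recommendations = subsetModel.recommendProductsForUsers(10) // top 10 products per remaining user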

Re: DataFrame First method is resulting different results in each iteration

2016-02-03 Thread satish chandra j
Hi Hemant, My dataframe "ordrd_emd_df" consists of data in order, as I have applied orderBy in the first step. I also tried having the "orderBy" method before "groupBy" and am still getting different results in each iteration. Regards, Satish Chandra On Wed, Feb 3, 2016 at 4:28 PM, Hemant Bhanawat

DataFrame First method is resulting different results in each iteration

2016-02-03 Thread satish chandra j
Hi All, I have data in an emp_df (DataFrame) as mentioned below:

EmpId Sal DeptNo
001   100 10
002   120 20
003   130 10
004   140 20
005   150 10

ordrd_emp_df = emp_df.orderBy($"DeptNo",$"Sal".desc), which results as below:

DeptNo Sal EmpId
10 150

Re: Guidelines for writing SPARK packages

2016-02-03 Thread Takeshi Yamamuro
Hi, A package I maintain (https://github.com/maropu/hivemall-spark) extends existing SparkSQL/DataFrame classes for a third-party library. Please use this as a concrete example. Thanks, takeshi On Tue, Feb 2, 2016 at 6:20 PM, Praveen Devarao wrote: > Thanks David. > > I

sparkR not able to create /append new columns

2016-02-03 Thread Devesh Raj Singh
Hi, I am trying to create dummy variables in sparkR by creating new columns for categorical variables, but it is not appending the columns:

df <- createDataFrame(sqlContext, iris)
class(dtypes(df))
cat.column <- vector(mode="character", length=nrow(df))
cat.column <- collect(select(df, df$Species))

spark metrics question

2016-02-03 Thread Matt K
Hi guys, I'm looking to create a custom sink based on Spark's Metrics System: https://github.com/apache/spark/blob/9f603fce78fcc997926e9a72dec44d48cbc396fc/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala If I want to collect metrics from the Driver, Master, and Executor nodes,

Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
Yes, I didn't work out how to solve that - sorry On 3 February 2016 at 22:37, Devesh Raj Singh wrote: > Hi, > > but "withColumn" will only add once, if i want to add columns to the same > dataframe in a loop it will keep overwriting the added column and in the > end the

Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
I had problems doing this as well - I ended up using 'withColumn'; it's not particularly graceful but it worked (1.5.2 on AWS EMR). Cheers On 3 February 2016 at 22:06, Devesh Raj Singh wrote: > Hi, > > i am trying to create dummy variables in sparkR by creating new

Re: sparkR not able to create /append new columns

2016-02-03 Thread Devesh Raj Singh
Hi, but "withColumn" will only add once, if i want to add columns to the same dataframe in a loop it will keep overwriting the added column and in the end the last added column( in the loop) will be the added column. like in my code above. On Wed, Feb 3, 2016 at 5:05 PM, Franc Carter

Spark Streaming: Dealing with downstream services faults

2016-02-03 Thread Udo Fholl
Hi all, I need to send the result of our aggregations to an external service, and I need to make sure that these results are actually sent. My current approach is to send them in an invocation of "foreachRDD". But how is that going to work with failures? Should I instead use "mapWithState" then
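As a hedged sketch of the foreachRDD approach (sendToService is a hypothetical client call; the stream element type is assumed to be String):

import org.apache.spark.streaming.dstream.DStream

def sendToService(record: String): Unit = ??? // hypothetical call to the external service
def shipResults(results: DStream[String]): Unit = {
  results.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // one client/connection per partition; retries or acknowledgements would go here
      records.foreach(sendToService)
    }
  }
}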

Re: DataFrame First method is resulting different results in each iteration

2016-02-03 Thread Hemant Bhanawat
Missing order by? Hemant Bhanawat SnappyData (http://snappydata.io/) On Wed, Feb 3, 2016 at 3:45 PM, satish chandra j wrote: > HI All, > I have data in a emp_df (DataFrame) as mentioned below: > > EmpId Sal DeptNo > 001 100 10 > 002 120 20 > 003

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
What I meant is executor.cores and task.cpus can dictate how many parallel tasks will run on a given executor. Let's take this example setting:

spark.executor.memory = 16GB
spark.executor.cores = 6
spark.task.cpus = 1

So here I think Spark will assign 6 tasks to one executor, each using 1 core and
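As a sketch of how those settings interact (values copied from the example above; concurrent tasks per executor = spark.executor.cores / spark.task.cpus):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "16g")
  .set("spark.executor.cores", "6")
  .set("spark.task.cpus", "1")
// 6 cores / 1 cpu per task = 6 concurrent tasks sharing the 16g executor heap;
// spark.task.cpus = 2 would drop that to 3 concurrent tasks per executor.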

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
It was just renamed recently: https://github.com/apache/spark/pull/10981 As SessionState is entirely managed by Spark’s code, it still seems like this is a bug with Spark 1.6.0, and not with how our application is using HiveContext. But I’d feel more confident filing a bug if someone else

Cassandra BEGIN BATCH

2016-02-03 Thread FrankFlaherty
Cassandra provides "BEGIN BATCH" and "APPLY BATCH" to perform atomic execution of multiple statements as below:

BEGIN BATCH
INSERT INTO "user_status_updates" ("username", "id", "body")
VALUES( 'dave', 16e2f240-2afa-11e4-8069-5f98e903bf02, 'dave update 4' );
INSERT INTO

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
I know it's a strong word, but when I have had a case open for that with MapR and Databricks for a month and their only solution is to change to DataFrame, it frustrates you. I know the DataFrame/SQL Catalyst engine has internal optimizations, but it requires a lot of code change. I think there's something fundamentally

Nearest neighbors in Spark with Annoy

2016-02-03 Thread apu mishra . rr
As mllib doesn't have nearest-neighbors functionality, I'm trying to use Annoy for Approximate Nearest Neighbors. I try to broadcast the Annoy object and pass it to workers; however, it does not operate as expected. Below is complete code for reproducibility.

Re: Spark 1.5.2 memory error

2016-02-03 Thread Jerry Lam
Hi guys, I was processing 300GB of data with a lot of joins today. I have a combination of RDD->DataFrame->RDD due to legacy code. I had memory issues at the beginning. After fine-tuning the configurations that many have already suggested above, it works with 0 tasks failed. I think it is fair to say any

Low latency queries much slower in 1.6.0

2016-02-03 Thread Younes Naguib
Hi all, Since 1.6.0, low latency queries are much slower. This seems to be connected to the multi-user support in the thrift-server: on any newly created session, jobs are added to fill the session cache with information related to the tables it queries. Here are the details for this job: load at

Re: Spark 1.5.2 Yarn Application Master - resiliencey

2016-02-03 Thread Nirav Patel
Awesome! it looks promising. Thanks Rishabh and Marcelo. On Wed, Feb 3, 2016 at 12:09 PM, Rishabh Wadhawan wrote: > Check out this link > http://spark.apache.org/docs/latest/configuration.html and check > spark.shuffle.service. Thanks > > On Feb 3, 2016, at 1:02 PM,

RE: spark-cassandra

2016-02-03 Thread Mohammed Guller
Another thing to check is what version of the Spark-Cassandra-Connector the Spark Job server is passing to the workers. It looks like when you use spark-submit you are sending the correct SCC jar, but the Spark Job server may be using a different one. Mohammed Author: Big Data Analytics with

RE: Spark 1.5.2 memory error

2016-02-03 Thread Mohammed Guller
Nirav, Sorry to hear about your experience with Spark; however, sucks is a very strong word. Many organizations are processing a lot more than 150GB of data with Spark. Mohammed Author: Big Data Analytics with Spark

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-03 Thread Nirav Patel
Awesome! I just read design docs. That is EXACTLY what I was talking about! Looking forward to it! Thanks On Wed, Feb 3, 2016 at 7:40 AM, Koert Kuipers wrote: > yeah there was some discussion about adding them to RDD, but it would > break a lot. so Dataset was born. > > yes

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Ted Yu
In ClientWrapper.scala, the SessionState.get().getConf call might have been executed ahead of SessionState.start(state) at line 194. This was the JIRA: [SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL In master branch, there is no more ClientWrapper.scala FYI On Wed, Feb 3,

Re: Low latency queries much slower in 1.6.0

2016-02-03 Thread Rishabh Wadhawan
Hi Younes. When you have multiple users connected to Hive, or multiple applications trying to access shared memory, my recommendation would be to store it off-heap rather than on disk. Check out this link and check RDD Persistence

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-03 Thread Michael Armbrust
On Wed, Feb 3, 2016 at 1:42 PM, Nirav Patel wrote: > Awesome! I just read design docs. That is EXACTLY what I was talking > about! Looking forward to it! > Great :) Most of the API is there in 1.6. For the next release I would like to unify DataFrame <-> Dataset and do

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
About the OP: how many cores do you assign per executor? Maybe reducing that number will give each task being executed on that executor a larger portion of executor memory. Others, please comment if that makes sense. On Wed, Feb 3, 2016 at 1:52 PM, Nirav Patel wrote: > I know

Spark Streaming: My kafka receivers are not consuming in parallel

2016-02-03 Thread Jorge Rodriguez
Hello Spark users, We are setting up our first batch of Spark Streaming pipelines, and I am running into an issue which I am not sure how to resolve, but which seems like it should be fairly trivial. I am using the receiver-mode Kafka consumer that comes with Spark, and running in standalone mode. I've set up

Parquet StringType column readable as plain-text despite being Gzipped

2016-02-03 Thread Sung Hwan Chung
Hello, We are using the default compression codec for Parquet when we store our dataframes. The dataframe has a StringType column whose values can be up to several MB in size. The funny thing is that once it's stored, we can browse the file content with a plain text editor and see large portions

Re: Spark 1.5.2 memory error

2016-02-03 Thread Rishabh Wadhawan
As far as I know, cores won't give you a larger portion of executor memory, because they are just the CPU cores that you use per executor. Reducing the number of cores, however, would result in less parallel processing power. The executor memory that we specify with spark.executor.memory would be

Re: java.lang.ArrayIndexOutOfBoundsException when attempting broadcastjoin

2016-02-03 Thread Alexandr Dzhagriev
Hi Sebastian, Do you have any updates on the issue? I faced pretty much the same problem, and disabling Kryo plus raising spark.network.timeout to 600s helped. So for my job it takes about 5 minutes to broadcast the variable (~5GB in my case), but then it's fast. I mean much faster than
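A sketch of those two changes (the serializer line simply reverts to Java serialization, which is the default when Kryo is not enabled):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer") // i.e. Kryo disabled
  .set("spark.network.timeout", "600s")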

Re: spark metrics question

2016-02-03 Thread Yiannis Gkoufas
Hi Matt, there is some related work I recently did in IBM Research for visualizing the metrics produced. You can read about it here http://www.spark.tc/sparkoscope-enabling-spark-optimization-through-cross-stack-monitoring-and-visualization-2/ We recently opensourced it if you are interested to

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
Hi Jerry, I agree code + framework go hand in hand. I am totally in for tuning the heck out of the system as well; Spark offers tremendous flexibility in that regard. We have a real-time application that serves data in milliseconds, backed by Spark RDDs. It took a lot of testing and tuning effort before we

RE: Cassandra BEGIN BATCH

2016-02-03 Thread Mohammed Guller
Frank, I don’t think so. Cassandra does not support transactions in the traditional sense. It is not an ACID compliant database. Mohammed Author: Big Data Analytics with Spark From: Ted Yu [mailto:yuzhih...@gmail.com]

Re: spark metrics question

2016-02-03 Thread Yiannis Gkoufas
Hi Matt, does the custom class you want to package report metrics from each Executor? Thanks On 3 February 2016 at 15:56, Matt K wrote: > Thanks for sharing Yiannis, looks very promising! > > Do you know if I can package a custom class with my application, or does > it

Re: Spark Streaming: My kafka receivers are not consuming in parallel

2016-02-03 Thread Jorge Rodriguez
Please ignore this question, as I've figured out what my problem was. In case anyone else runs into something similar, the problem was on the Kafka side. I was using the console producer to generate the messages going into the Kafka logs. This producer will send all of the messages to

SparkOscope: Enabling Spark Optimization through Cross-stack Monitoring and Visualization

2016-02-03 Thread Yiannis Gkoufas
Hi all, I recently sent to the dev mailing list about this contribution, but I thought it might be useful to post it here, since I have seen a lot of people asking about OS-level metrics of Spark. This is the result of the work we have been doing recently in IBM Research around Spark.

How parquet file decide task number?

2016-02-03 Thread Gavin Yue
I am doing a simple count like: sqlContext.read.parquet("path").count I have only 5000 parquet files, but it generates over 2 tasks. Each parquet file is converted from one gz text file. Please give some advice. Thanks

clear cache using spark sql cli

2016-02-03 Thread fightf...@163.com
Hi, How could I clear cache (execute sql query without any cache) using spark sql cli ? Is there any command available ? Best, Sun. fightf...@163.com

RE: sparkR not able to create /append new columns

2016-02-03 Thread Sun, Rui
Devesh, Note that DataFrame is immutable. withColumn returns a new DataFrame instead of adding a column in-place to the DataFrame being operated on. So, you can modify the for loop like:

for (j in 1:lev) {
  dummy.df.new <- withColumn(df, paste0(colnames(cat.column), j),
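The same reassignment idea expressed as a Scala sketch, in case it helps to see it outside SparkR (df, lev and the lit(0) column expression are placeholders):

import org.apache.spark.sql.functions.lit

val result = (1 to lev).foldLeft(df) { (acc, j) =>
  // each iteration builds on the DataFrame returned by the previous withColumn call
  acc.withColumn(s"dummy_$j", lit(0))
}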

Re: How parquet file decide task number?

2016-02-03 Thread Gavin Yue
Found the answer. It is the block size. Thanks. On Wed, Feb 3, 2016 at 5:05 PM, Gavin Yue wrote: > I am doing a simple count like: > > sqlContext.read.parquet("path").count > > I have only 5000 parquet files. But generate over 2 tasks. > > Each parquet file is
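If the goal is simply to cut the task count for a job like this count, one option (an assumption on my part, not from the thread) is to coalesce after reading, so several small splits are handled per task; 500 is an arbitrary target:

val df = sqlContext.read.parquet("path")
val total = df.coalesce(500).count() // fewer, larger partitions instead of one task per tiny split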

Re: clear cache using spark sql cli

2016-02-03 Thread Ted Yu
Have you looked at SPARK-5909 Add a clearCache command to Spark SQL's cache manager On Wed, Feb 3, 2016 at 7:16 PM, fightf...@163.com wrote: > Hi, > How could I clear cache (execute sql query without any cache) using spark > sql cli ? > Is there any command available ? >

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Ted Yu
Created a pull request: https://github.com/apache/spark/pull/11066 FYI On Wed, Feb 3, 2016 at 1:27 PM, Shipper, Jay [USA] wrote: > It was just renamed recently: https://github.com/apache/spark/pull/10981 > > As SessionState is entirely managed by Spark’s code, it still

Re: Re: clear cache using spark sql cli

2016-02-03 Thread fightf...@163.com
Hi, Ted Yes, I had seen that issue, but it seems that in the spark-sql CLI I cannot run a command like sqlContext.clearCache(). Is this right? In the spark-sql CLI I can only run SQL queries, so I want to see if there are any available options to achieve this. Best, Sun. fightf...@163.com

Re: Re: clear cache using spark sql cli

2016-02-03 Thread Ted Yu
In spark-shell, I can do: scala> sqlContext.clearCache() Is that not the case for you ? On Wed, Feb 3, 2016 at 7:35 PM, fightf...@163.com wrote: > Hi, Ted > Yes. I had seen that issue. But it seems that in spark-sql cli cannot do > command like : >

Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m getting a NullPointerException from HiveContext. It’s happening while it tries to load some tables via JDBC from an external database (not Hive), using context.read().jdbc(): — java.lang.NullPointerException at

Re: spark metrics question

2016-02-03 Thread Matt K
Thanks for sharing Yiannis, looks very promising! Do you know if I can package a custom class with my application, or does it have to be pre-deployed on all Executor nodes? On Wed, Feb 3, 2016 at 10:36 AM, Yiannis Gkoufas wrote: > Hi Matt, > > there is some related work I

Spark 1.5 Streaming + Kafka 0.9.0

2016-02-03 Thread Pavel Sýkora
Hi, According to the Spark docs, Spark Streaming 1.5 (and 1.6) is compatible with Kafka 0.8.2.1 (Direct Kafka API). Nevertheless, I need to use Kafka 0.9.0 with Spark 1.5.x streaming. I tried to use Kafka 0.9.0 as both source and output of Spark 1.5 Streaming, and it seems to work well.

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
Right, I could already tell that from the stack trace and looking at Spark’s code. What I’m trying to determine is why that’s coming back as null now, just from upgrading Spark to 1.6.0. From: Ted Yu > Date: Wednesday, February 3, 2016 at 12:04

Re: Spark 1.5.2 memory error

2016-02-03 Thread Ted Yu
There is also (deprecated) spark.storage.unrollFraction to consider On Wed, Feb 3, 2016 at 2:21 PM, Nirav Patel wrote: > What I meant is executor.cores and task.cpus can dictate how many parallel > tasks will run on given executor. > > Let's take this example setting. > >

Re: Cassandra BEGIN BATCH

2016-02-03 Thread Ted Yu
Seems you can find faster response on Cassandra Connector mailing list. On Wed, Feb 3, 2016 at 1:45 PM, FrankFlaherty wrote: > Cassandra provides "BEGIN BATCH" and "APPLY BATCH" to perform atomic > execution of multiple statements as below: > > BEGIN BATCH > INSERT

Re: Re: About cache table performance in spark sql

2016-02-03 Thread fightf...@163.com
Hi, Thanks a lot for your explanation. I know now that the slow processing was mainly caused by GC pressure, and I understand this difference thanks to your advice. I had 6GB of memory per executor and tried to cache the table. I had 3 executors, and finally I can see some info from the Spark job UI

Re: Is there a any plan to develop SPARK with c++??

2016-02-03 Thread Benjamin Kim
Hi DaeJin, The closest thing I can think of is this. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html Cheers, Ben > On Feb 3, 2016, at 9:49 PM, DaeJin Jung wrote: > > hello everyone, > I have a short question. > > I would

Re: Re: clear cache using spark sql cli

2016-02-03 Thread fightf...@163.com
No. That is not my case. Actually I am running spark-sql, which is in spark-sql CLI mode, and executing SQL queries against my Hive tables. In the spark-sql CLI there seems to be no existing sqlContext or sparkContext; I can only run select/create/insert/delete operations. Best, Sun.

Re: DataFrame First method is resulting different results in each iteration

2016-02-03 Thread Hemant Bhanawat
Ahh.. missed that. I see that you have used the "first" function. 'first' returns the first row it has found. On a single executor it may return the right results, but on multiple executors it will return the first row of any of the executors, which may not be the first row when the results are
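If the intent is "first row per DeptNo after ordering by Sal", a deterministic alternative is a window function (a sketch against the emp_df from the original message; DataFrame window functions are available since Spark 1.4):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import sqlContext.implicits._

val w = Window.partitionBy($"DeptNo").orderBy($"Sal".desc)
val topPerDept = emp_df
  .withColumn("rn", row_number().over(w))
  .where($"rn" === 1) // exactly one row per DeptNo, regardless of how the data is distributed
  .drop("rn")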

Re: About cache table performance in spark sql

2016-02-03 Thread Prabhu Joseph
Sun, When an executor doesn't have enough memory and it tries to cache the data, it spends a lot of time on GC and hence the job will be slow. Either: 1. We should allocate enough memory to cache all RDDs, so the job completes fast, or 2. Don't use cache when there is not enough
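A third, middle-ground option (my suggestion, not part of the message above) is a serialized, disk-spilling storage level, so partitions that don't fit in memory spill to disk instead of driving GC:

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // rdd stands for whatever data is being cached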

Is there a any plan to develop SPARK with c++??

2016-02-03 Thread DaeJin Jung
hello everyone, I have a short question. I would like to improve performance of the SPARK framework using Intel native instructions or similar, so I wonder if there are any plans to develop SPARK with C++ or C in the near future. Please let me know if you have any information. Best Regards, Daejin

About cache table performance in spark sql

2016-02-03 Thread fightf...@163.com
Hi, I want to make sure that caching a table would indeed accelerate SQL queries. Here is one of my use cases: Impala table size: 24.59GB, no partitions, about 1 billion+ rows. I use sqlContext.sql to run queries over this table and try the cache and uncache commands to see if there
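A minimal sketch of the cache/uncache cycle being tested ("big_table" is a placeholder for the actual table name):

sqlContext.cacheTable("big_table")
sqlContext.sql("SELECT COUNT(*) FROM big_table").show() // first query materializes the cache
sqlContext.sql("SELECT COUNT(*) FROM big_table").show() // later queries should hit the cache
sqlContext.uncacheTable("big_table")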

Re: spark streaming web ui not showing the events - direct kafka api

2016-02-03 Thread vimal dinakaran
No, I am using DSE 4.8 which has Spark 1.4. Is this a known issue? On Wed, Jan 27, 2016 at 11:52 PM, Cody Koeninger wrote: > Have you tried spark 1.5? > > On Wed, Jan 27, 2016 at 11:14 AM, vimal dinakaran > wrote: > >> Hi , >> I am using spark 1.4 with

Spark streaming archive results

2016-02-03 Thread Udo Fholl
Hi, I want to aggregate on a small window and send downstream every 30 secs, but I would also like to store the outcome in our archive every 20 min. My current approach (simplified version) is:

val stream = //
val statedStream = stream.mapWithState(stateSpec)
val archiveStream =
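One hedged way to layer the 20-minute archival on top of that (names as in the snippet above; the actual write is left abstract):

import org.apache.spark.streaming.Minutes

val archiveStream = statedStream.window(Minutes(20), Minutes(20)) // tumbling 20-minute window
archiveStream.foreachRDD { rdd =>
  // write the windowed outcome to the archive store (implementation specific)
}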

spark-cassandra

2016-02-03 Thread Madabhattula Rajesh Kumar
Hi, I am using Spark Jobserver to submit the jobs. I am using the spark-cassandra connector to connect to Cassandra. I am getting the below exception through Spark Jobserver. If I submit the job through the *spark-submit* command it works fine. Please let me know how to solve this issue. Exception in

Spark Streaming - 1.6.0: mapWithState Kinesis huge memory usage

2016-02-03 Thread Udo Fholl
Hi all, I recently migrated from 'updateStateByKey' to 'mapWithState' and now I see a huge increase in memory. Most of it is a massive "BlockGenerator" (which points to a massive "ArrayBlockingQueue" that in turn points to a huge "Object[]"). I'm pretty sure it has to do with my code, but I

Re: Spark with SAS

2016-02-03 Thread Jörn Franke
This could be done through ODBC. Keep in mind that you can run SAS jobs directly on a Hadoop cluster using the SAS embedded process engine, or dump some data to a SAS LASR cluster, but you had better ask SAS about this. > On 03 Feb 2016, at 18:43, Sourav Mazumder wrote: >

Re: question on spark.streaming.kafka.maxRetries

2016-02-03 Thread Cody Koeninger
KafkaRDD will use the standard kafka configuration parameter refresh.leader.backoff.ms if it is set in the kafkaParams map passed to createDirectStream. On Tue, Feb 2, 2016 at 9:10 PM, Chen Song wrote: > For Kafka direct stream, is there a way to set the time between
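A sketch of where that parameter goes with the Spark 1.x direct-stream API (broker list and topic are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092",
  "refresh.leader.backoff.ms" -> "1000") // picked up by KafkaRDD when a leader is lost
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my_topic"))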

Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Ted Yu
Looks like the NPE came from this line:

def conf: HiveConf = SessionState.get().getConf

Meaning SessionState.get() returned null. On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] wrote: > I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m > getting a

Spark with SAS

2016-02-03 Thread Sourav Mazumder
Hi, Is anyone aware of any work going on for integrating Spark with SAS for executing queries in Spark? For example calling Spark Jobs from SAS using Spark SQL through Spark SQL's JDBC/ODBC library. Regards, Sourav

Re: Product similarity with TF/IDF and Cosine similarity (DIMSUM)

2016-02-03 Thread Karl Higley
Hi Alan, I'm slow responding, so you may have already figured this out. Just in case, though:

val approx = mat.columnSimilarities(0.1)
approxEntries.first()
res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)

The above is returning the cosine similarity between columns 1638

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
Hi Stefan, Welcome to the OOM - heap space club. I have been struggling with similar errors (OOM and YARN executors being killed) that fail the job or send it into retry loops. I bet the same job would run perfectly fine with fewer resources as a Hadoop MapReduce program. I have tested it for my program

Spark 1.5.2 Yarn Application Master - resiliencey

2016-02-03 Thread Nirav Patel
Hi, I have a Spark job running in yarn-client mode. At some point during the join stage, an executor (container) runs out of memory and YARN kills it. Due to this the entire job restarts, and it keeps doing so on every failure. What is the best way to checkpoint? I see there's a checkpoint API and other

Re: Spark 1.5.2 Yarn Application Master - resiliencey

2016-02-03 Thread Marcelo Vanzin
Without the exact error from the driver that caused the job to restart, it's hard to tell. But a simple way to improve things is to install the Spark shuffle service on the YARN nodes, so that even if an executor crashes, its shuffle output is still available to other executors. On Wed, Feb 3,

RE: Spark 1.5.2 memory error

2016-02-03 Thread Stefan Panayotov
I drastically increased the memory:

spark.executor.memory = 50g
spark.driver.memory = 8g
spark.driver.maxResultSize = 8g
spark.yarn.executor.memoryOverhead = 768

I still see executors killed, but this time the memory does not seem to be the issue. The error on the Jupyter notebook is:

Connect to two different HDFS servers with different usernames

2016-02-03 Thread Wayne Song
Is there any way to get data from HDFS (e.g. with sc.textFile) with two separate usernames in the same Spark job? For instance, if I have a file on hdfs-server-1.com and the alice user has permission to view it, and I have a file on hdfs-server-2.com and the bob user has permission to view it,

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
One quick update on this: The NPE is not happening with Spark 1.5.2, so this problem seems specific to Spark 1.6.0. From: Jay Shipper > Date: Wednesday, February 3, 2016 at 12:06 PM To: "user@spark.apache.org"

Re: Spark with SAS

2016-02-03 Thread Benjamin Kim
You can download the Spark ODBC Driver. https://databricks.com/spark/odbc-driver-download > On Feb 3, 2016, at 10:09 AM, Jörn Franke wrote: > > This could be done through odbc. Keep in mind that you can run SaS jobs > directly on a Hadoop cluster using the SaS embedded

Re: Spark 1.5.2 Yarn Application Master - resiliencey

2016-02-03 Thread Nirav Patel
Do you mean this setup? https://spark.apache.org/docs/1.5.2/job-scheduling.html#dynamic-resource-allocation On Wed, Feb 3, 2016 at 11:50 AM, Marcelo Vanzin wrote: > Without the exact error from the driver that caused the job to restart, > it's hard to tell. But a simple

Re: Spark 1.5.2 memory error

2016-02-03 Thread Rishabh Wadhawan
Hi, I suppose you are using --master yarn-client or yarn-cluster. Can you try boosting spark.yarn.driver.memoryOverhead, overriding it to 0.15 * executor memory rather than the default 0.1? Check out this link https://spark.apache.org/docs/1.5.2/running-on-yarn.html
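In Spark 1.5.x the YARN memory overheads are set in megabytes rather than as a fraction, so one hedged reading of that suggestion, using the 50g executor and 8g driver quoted earlier in the thread, would be roughly:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "50g")
  .set("spark.yarn.executor.memoryOverhead", "7680") // ~0.15 * 50g in MB; default is max(384, 0.10 * executor memory)
  .set("spark.yarn.driver.memoryOverhead", "1228")   // ~0.15 * 8g in MB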

Re: Spark 1.5.2 Yarn Application Master - resiliencey

2016-02-03 Thread Marcelo Vanzin
Yes, but you don't necessarily need to use dynamic allocation (just enable the external shuffle service). On Wed, Feb 3, 2016 at 11:53 AM, Nirav Patel wrote: > Do you mean this setup? > > https://spark.apache.org/docs/1.5.2/job-scheduling.html#dynamic-resource-allocation
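A hedged sketch of enabling just the external shuffle service (the YARN NodeManagers also need the spark_shuffle auxiliary service configured, per the running-on-YARN docs):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.shuffle.service.enabled", "true")
// dynamic allocation is optional; the shuffle service alone keeps an executor's map output
// available after a crash, so it does not have to be recomputed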

Re: Spark 1.5.2 Yarn Application Master - resiliencey

2016-02-03 Thread Rishabh Wadhawan
Hi Nirav, There is a difference between dynamic resource allocation and the shuffle service. With dynamic allocation, when you enable the configurations for it, every time you run a task Spark will determine the number of executors required to run that task for you, which means decreasing the

Re: Spark 1.5.2 Yarn Application Master - resiliencey

2016-02-03 Thread Rishabh Wadhawan
Check out this link http://spark.apache.org/docs/latest/configuration.html and check spark.shuffle.service. Thanks > On Feb 3, 2016, at 1:02 PM, Marcelo Vanzin wrote: > > Yes, but you don't necessarily need to use