Re: Using Apache Kylin as data source for Spark

2018-05-25 Thread Li Yang
That is very useful~~ :-) On Fri, May 18, 2018 at 11:56 AM, ShaoFeng Shi wrote: > Hello, Kylin and Spark users, > > A doc was newly added to the Apache Kylin website on how to use Kylin as a > data source in Spark; > This can help the users who want to use Spark to
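For illustration, a minimal sketch of such a read through Kylin's JDBC interface; the host, project name, and table are hypothetical, and it assumes the Kylin JDBC driver jar is on Spark's classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("KylinSource").getOrCreate()

    // Query a Kylin cube over JDBC (hypothetical host, project, and table)
    val kylinDf = spark.read.format("jdbc")
      .option("driver", "org.apache.kylin.jdbc.Driver")
      .option("url", "jdbc:kylin://kylin-host:7070/my_project")
      .option("dbtable", "kylin_sales")
      .load()

    kylinDf.show()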

[Query] Weight of evidence on Spark

2018-05-25 Thread Aakash Basu
Hi guys, What's the best way to create a feature column with Weight of Evidence calculated for categorical columns on a target column (both binary and multi-class)? Any insight? Thanks, Aakash.
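One possible approach (not from the thread) for a binary target, using plain DataFrame aggregations; df, "category", and "label" are hypothetical names, and a small smoothing constant avoids taking the log of zero:

    import org.apache.spark.sql.functions._

    // Per-category event / non-event counts (label assumed to be 0/1)
    val perCat = df.groupBy("category").agg(
      sum(when(col("label") === 1, 1).otherwise(0)).as("events"),
      sum(when(col("label") === 0, 1).otherwise(0)).as("nonEvents"))

    // Global totals, cross-joined so every category row can compute its share
    val totals = df.agg(
      sum(when(col("label") === 1, 1).otherwise(0)).as("totEvents"),
      sum(when(col("label") === 0, 1).otherwise(0)).as("totNonEvents"))

    // One common convention: WoE(c) = ln( P(X=c | event) / P(X=c | non-event) )
    val eps = 0.5 // smoothing against empty cells
    val woe = perCat.crossJoin(totals).withColumn("woe",
      log(((col("events") + eps) / col("totEvents")) /
          ((col("nonEvents") + eps) / col("totNonEvents"))))

For a multi-class target, the same computation can be repeated one-vs-rest for each class.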

Re: Submit many spark applications

2018-05-25 Thread yncxcw
Hi, please try to reduce the default heap size for the machine you use to submit applications. For example: export _JAVA_OPTIONS="-Xmx512M" The submitter, which is also a JVM, does not need to reserve lots of memory. Wei

Databricks 1/2 day certification course at Spark Summit

2018-05-25 Thread Sumona Routh
Hi all, My company just approved some of us to go to Spark Summit in SF this year. Unfortunately, the day-long workshops on Monday are sold out now. We are considering what we might do instead. Have others done the 1/2-day certification course before? Is it worth considering? Does it

Re: Submit many spark applications

2018-05-25 Thread Marcelo Vanzin
I already gave my recommendation in my very first reply to this thread... On Fri, May 25, 2018 at 10:23 AM, raksja wrote: > ok, when to use what? > do you have any recommendation?

Re: Submit many spark applications

2018-05-25 Thread raksja
OK, when to use what? Do you have any recommendation?

Re: Submit many spark applications

2018-05-25 Thread Marcelo Vanzin
On Fri, May 25, 2018 at 10:18 AM, raksja wrote: > InProcessLauncher would just start a subprocess as you mentioned earlier. No. As the name says, it runs things in the same process. -- Marcelo
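For reference, a minimal sketch of the in-process route (available since Spark 2.3); the application jar and main class below are hypothetical:

    import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

    // Submits from the current JVM rather than forking a spark-submit process
    val handle: SparkAppHandle = new InProcessLauncher()
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setAppResource("/path/to/my-app.jar") // hypothetical jar
      .setMainClass("com.example.MyApp")     // hypothetical main class
      .startApplication()

    // The handle can be polled, or given listeners, to track the app's state
    println(handle.getState)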

Re: Submit many spark applications

2018-05-25 Thread raksja
When you say that's what Spark uses, did you mean this: https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala? InProcessLauncher would just start a subprocess as you mentioned earlier. How about this: does this make a REST API call to

Re: Submit many spark applications

2018-05-25 Thread Marcelo Vanzin
That's what Spark uses. On Fri, May 25, 2018 at 10:09 AM, raksja wrote: > Thanks for the reply. > > Have you tried submitting a Spark job directly to YARN using YarnClient? > https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/client/api/YarnClient.html > > Not

Re: Submit many spark applications

2018-05-25 Thread raksja
Thanks for the reply. Have you tried submitting a Spark job directly to YARN using YarnClient? https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/client/api/YarnClient.html Not sure whether it's performant and scalable?
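For context, a hedged sketch of what driving YarnClient directly looks like, assuming the Hadoop configuration on the classpath points at the cluster:

    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration
    import scala.collection.JavaConverters._

    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // For example, inspect what is already running before submitting more apps
    for (app <- yarnClient.getApplications().asScala)
      println(s"${app.getApplicationId} ${app.getYarnApplicationState}")

    yarnClient.stop()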

Re: Why Spark JDBC Writing in a sequential order

2018-05-25 Thread Yong Zhang
I am not sure about Redshift, but I know the target table is not partitioned. But we should be able to just insert into a non-partitioned remote table from 12 clients concurrently, right? Even if, let's say, Redshift doesn't allow concurrent writes, then the Spark driver will detect this and

Re: Why Spark JDBC Writing in a sequential order

2018-05-25 Thread Jörn Franke
Can your database receive the writes concurrently? I.e., do you make sure that each executor writes into a different partition on the database side? > On 25. May 2018, at 16:42, Yong Zhang wrote: > > Spark version 2.2.0 > > > We are trying to write a DataFrame to a remote
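For illustration, the write parallelism follows the DataFrame's partitioning, so something along these lines issues up to 12 concurrent inserts; the connection details are hypothetical, and whether the target database accepts them concurrently is exactly the question above:

    import org.apache.spark.sql.SaveMode

    df.repartition(12) // 12 partitions => up to 12 concurrent JDBC writers
      .write
      .mode(SaveMode.Append)
      .format("jdbc")
      .option("url", "jdbc:redshift://host:5439/db") // hypothetical endpoint
      .option("dbtable", "target_table")             // hypothetical table
      .option("user", "user")
      .option("password", "password")
      .option("batchsize", "10000") // rows per JDBC batch insert
      .save()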

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-25 Thread Chetan Khatri
Ajay, you can use Sqoop if you want to ingest data to HDFS. This is a POC where the customer wants to prove that Spark ETL would be faster than C#-based raw SQL statements. That's all. There are no timestamp-based columns in the source tables to make it an incremental load. On Thu, May 24, 2018 at 1:08 AM, ayan
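A hedged sketch of a partitioned JDBC read from SQL Server, which is usually the first lever for bulk extraction speed; the connection string, table, and partition bounds are hypothetical:

    // Parallel read: Spark opens numPartitions connections, each scanning a
    // range of the (numeric) partitionColumn
    val msDf = spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb") // hypothetical
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("dbtable", "dbo.SourceTable") // hypothetical table
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load()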

Re: help with streaming batch interval question needed

2018-05-25 Thread Peter Liu
Hi Jacek, This is exactly what I'm looking for. Thanks!! Also thanks for the link. I just noticed that I can unfold the trigger link and see the examples in the Java and Scala languages - what a great help for a newcomer :-)

Re: help with streaming batch interval question needed

2018-05-25 Thread Jacek Laskowski
Hi Peter, > Basically I need to find a way to set the batch interval in (b), similar to (a) below. That's the trigger method on DataStreamWriter. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter import
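For a concrete illustration, a minimal sketch with a processing-time trigger, which plays the role of the batch interval in Structured Streaming; df and the console sink are placeholders:

    import org.apache.spark.sql.streaming.Trigger

    val query = df.writeStream
      .format("console")                             // placeholder sink
      .trigger(Trigger.ProcessingTime("10 seconds")) // fire a micro-batch every 10s
      .start()

    query.awaitTermination()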