Re: scalastyle violation on mvn install but not on mvn package

2017-05-04 Thread Mark Hamstra
The check goal of the scalastyle plugin runs during the "verify" phase, which is between "package" and "install"; so running just "package" will not run scalastyle:check. On Thu, May 4, 2017 at 7:45 AM, yiskylee wrote: > ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
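For illustration, the difference shows up directly on the command line (the direct goal invocation uses the scalastyle-maven-plugin's standard scalastyle:check syntax; the profile flags are copied from the thread):

    # "package" stops before the verify phase, so scalastyle never runs:
    ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
    # "verify" (and any later phase, like "install") runs scalastyle:check:
    ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean verify
    # or invoke the goal directly without running the rest of the lifecycle:
    ./build/mvn scalastyle:check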

long running jobs with Spark

2017-05-04 Thread Afshin, Bardia
Starting long-running jobs with upstart on Linux (spark-submit) is super slow. I can see that only a small percentage of the CPU is being utilized, and applying nice -n 20 to the process doesn't seem to do anything. Has anyone dealt with long-running processes/jobs on Spark and has any best practices

Spark Streaming 2.1 - slave parallel recovery

2017-05-04 Thread Dominik Safaric
Hi all, I’m running a cluster consisting of a master and four slaves. The cluster runs a Spark application that reads data from a Kafka topic over a window of time and writes the data back to Kafka. Checkpointing is enabled, backed by HDFS. However, although Spark periodically commits checkpoints
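For context, the standard checkpoint-based recovery pattern for such a job looks roughly like this (a sketch in pyspark; the checkpoint path, app name, and batch interval are placeholders, and the Kafka wiring is elided):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    checkpoint_dir = "hdfs:///spark/checkpoints/my-app"  # placeholder path

    def create_context():
        # Called only on a cold start; on restart the context and its
        # DStream lineage are rebuilt from the checkpoint data instead.
        sc = SparkContext(appName="kafka-window-app")
        ssc = StreamingContext(sc, batchDuration=10)
        ssc.checkpoint(checkpoint_dir)
        # ... Kafka input stream and windowed transformations go here ...
        return ssc

    ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
    ssc.start()
    ssc.awaitTermination()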

Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-04 Thread Gene Pang
As Tim pointed out, Alluxio (renamed from Tachyon) may be able to help you. Here is some documentation on how to run Alluxio and Spark together, and here is a blog post on a Spark Streaming + Alluxio use case

scalastyle violation on mvn install but not on mvn package

2017-05-04 Thread yiskylee
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package works, but ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean install triggers a scalastyle violation error. Is the scalastyle check not run on package but only on install? To install, should I

Re: unable to find how to integrate SparkSession with a Custom Receiver.

2017-05-04 Thread kant kodali
got it! Thank you! On Thu, May 4, 2017 at 12:58 AM, Tathagata Das wrote: > Structured Streaming is not designed to integrate with receivers. The > sources in Structured Streaming are designed for providing stronger > fault-tolerance guarantees by precisely tracking

Re: Kerberos impersonation of a Spark Context at runtime

2017-05-04 Thread Abel Rincón
Hi Mathieu, Stratio is working on it; we have a solution running which accomplishes our use case. Could you share your use case with us? Here are the video and slides of our work on this topic: https://spark-summit.org/east-2017/events/kerberizing-spark/ Regards, Abel. 2017-05-04 15:01

Re: Kerberos impersonation of a Spark Context at runtime

2017-05-04 Thread Saisai Shao
Spark doesn't currently support impersonating different users at runtime. Spark's proxy-user support is application level, which means that when it is set through --proxy-user, the whole application runs as that user. On Thu, May 4, 2017 at 5:13 PM, matd wrote: > Hi folks,

Normalize columns items for Onehotencoder

2017-05-04 Thread issues solution
Hi, I have 3 data frames that do not have the same items in their labeled column. I mean: data frame 1 has collabled values a, b, c; data frame 2 has collabled values a, w, z. When I encode the first data frame, I get:

    collabled  a  b  c
    a          1  0  0
    b          0  1  0
    c
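One common fix (a hedged sketch; df1, df2, and the column name follow the question, and dropLast=False is my assumption so every label keeps its own slot): fit a single StringIndexer on the union of the label columns, so that a, b, c, w, z all get one consistent index, then run the same indexer and encoder over each frame.

    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    # Fit one indexer on all labels from both frames so the mapping is shared.
    all_labels = df1.select("collabled").union(df2.select("collabled"))
    indexer = StringIndexer(inputCol="collabled", outputCol="labelIndex").fit(all_labels)

    # dropLast=False keeps one vector slot per distinct label.
    encoder = OneHotEncoder(inputCol="labelIndex", outputCol="labelVec", dropLast=False)
    encoded1 = encoder.transform(indexer.transform(df1))
    encoded2 = encoder.transform(indexer.transform(df2))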

RE: [Spark Streaming] - Killing application from within code

2017-05-04 Thread Sidney Feiner
Instead of setting up an additional mechanism, would it be "clean" to catch the error back in the driver, and use SparkContext.stop() there? And because the SparkContext can't be serialized, I can't catch the error inside the rdd.foreach function. What I did eventually, and it worked:
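A minimal sketch of the driver-side pattern under discussion (not the poster's actual code; process_record is a hypothetical task function): the exception from the failed action is re-raised on the driver, where it can be caught and the context stopped.

    # Catch the failure on the driver, where sc is available, not in the task.
    try:
        rdd.foreach(process_record)  # process_record may raise on bad input
    except Exception as err:
        print("Fatal error, shutting down: %s" % err)
        sc.stop()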

Kerberos impersonation of a Spark Context at runtime

2017-05-04 Thread matd
Hi folks, I have a Spark application executing various jobs for different users simultaneously, via several Spark sessions on several threads. My customer would like to kerberize his Hadoop cluster. I wonder if there is a way to configure impersonation such that each of these jobs would be run

Re: Create multiple columns in pyspak with one shot

2017-05-04 Thread Rick Moritz
In Scala you can first define your columns, and then use the list-to-vararg expander :_* in a select call, something like this:

    val cols = colnames.map(col).map(column => lit(0))
    dF.select(cols: _*)

I assume something similar should be possible in Java as well; from your snippet it's

Re: unable to find how to integrate SparkSession with a Custom Receiver.

2017-05-04 Thread Tathagata Das
Structured Streaming is not designed to integrate with receivers. The sources in Structured Streaming are designed for providing stronger fault-tolerance guarantees by precisely tracking records by their offsets (e.g. Kafka offsets). This is different from the Receiver APIs which did not require
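For example, the offset-tracked Kafka source replaces a receiver entirely (a sketch; the broker and topic names are placeholders, and it assumes the spark-sql-kafka package is on the classpath):

    # Structured Streaming tracks the Kafka offsets itself; no receiver involved.
    events = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "broker1:9092") \
        .option("subscribe", "mytopic") \
        .load()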

Create multiple columns in pyspak with one shot

2017-05-04 Thread issues solution
Hi, how can we create multiple columns iteratively? I mean, how can you create empty columns inside a loop? Because with

    for i in listl:
        df = df.withColumn(i, F.lit(0))

we get a stack overflow. How can we do that with a list of columns instead, like df.select([F.col(i).lit(0) for i in
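A sketch of the single-select version (listl as in the question): build all the columns first and issue one projection, so the logical plan stays flat instead of growing one nested projection per withColumn call.

    from pyspark.sql import functions as F

    # One select for all new columns instead of a withColumn loop.
    zero_cols = [F.lit(0).alias(name) for name in listl]
    df = df.select("*", *zero_cols)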

unable to find how to integrate SparkSession with a Custom Receiver.

2017-05-04 Thread kant kodali
Hi All, I have a Custom Receiver that implements the onStart() and onStop() methods of the Receiver class, and I am trying to figure out how to integrate it with SparkSession, since I want to do stateful analytics using Structured Streaming. I couldn't find it in the docs. Any idea? When I was doing

Re: Hive on Spark is not populating correct records

2017-05-04 Thread Vikash Pareek
After lots of experiments, I have figured out that it was a potential bug in Cloudera's Hive on Spark: Hive on Spark does not produce consistent output for aggregate functions. Hopefully, it will be fixed in the next release.

Re: What are Analysis Errors With respect to Spark Sql DataFrames and DataSets?

2017-05-04 Thread kant kodali
Thanks a lot! On Wed, May 3, 2017 at 4:36 PM, Michael Armbrust wrote: >> if I do dataset.select("nonExistentColumn") then the Analysis Error is thrown at compile time right? > if you do df.as[MyClass].map(_.badFieldName) you will get a compile error. However,

any support to use Spark UDF in HIVE

2017-05-04 Thread Manohar753
Hi, I have seen many Hive UDFs being used in Spark SQL, so is there any way to do the reverse? I want to write a UDF in Spark and have the same code usable in Hive. Please suggest all possible approaches in Spark with Java. Thanks in advance. Regards, Manoh