Re: run spark job in yarn cluster mode as specified user

2018-01-22 Thread sd wang
Thanks! I finally made this work. Besides the LinuxContainerExecutor parameter and the cache directory permissions, the following parameter also needs to be updated to the specified user: yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user Thanks. 2018-01-22 22:44 GMT+08:00 Margusja
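
A sketch of the yarn-site.xml entries this thread ends up touching (the user name shown is a placeholder, not from the original messages):

```xml
<!-- yarn-site.xml (sketch; "sparkuser" is illustrative) -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <!-- Without Kerberos, containers run as this fixed local user unless changed -->
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name>
  <value>sparkuser</value>
</property>
```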

Spark SQL bucket pruning support

2018-01-22 Thread Joe Wang
Hi, I'm wondering if the current version of Spark still supports bucket pruning? I see the pull request that incorporated the change, but the logic to actually skip reading buckets has since been removed as part of other PRs

Re: Spark Tuning Tool

2018-01-22 Thread lucas.g...@gmail.com
I'd be very interested in anything I can send to my analysts to assist them with their troubleshooting / optimization... Of course our engineers would appreciate it as well. However we'd be way more interested if it was OSS. Thanks! Gary Lucas On 22 January 2018 at 21:16, Holden Karau

Re: Spark Tuning Tool

2018-01-22 Thread Holden Karau
That's very interesting, and might also get some interest on the dev@ list if it was open source. On Tue, Jan 23, 2018 at 4:02 PM, Roger Marin wrote: > I'd be very interested. > > On 23 Jan. 2018 4:01 pm, "Rohit Karlupia" wrote: > >> Hi, >> >> I have

Re: Spark Tuning Tool

2018-01-22 Thread Roger Marin
I'd be very interested. On 23 Jan. 2018 4:01 pm, "Rohit Karlupia" wrote: > Hi, > > I have been working on making the performance tuning of spark applications > bit easier. We have just released the beta version of the tool on Qubole. > >

Spark Tuning Tool

2018-01-22 Thread Rohit Karlupia
Hi, I have been working on making the performance tuning of spark applications a bit easier. We have just released the beta version of the tool on Qubole. https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/ This is not OSS yet but we would like to contribute it to OSS. Fishing

Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-22 Thread vermanurag
Looking at the description of the problem, window functions may solve your issue. They allow operations over a window that can include records before/after the particular record -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-22 Thread naresh Goud
If I understand your requirement correctly: use broadcast variables to replicate, across all nodes, the small amount of data you want to reuse. On Mon, Jan 22, 2018 at 9:24 PM David Rosenstrauch wrote: > This seems like an easy thing to do, but I've been banging my head

How to hold some data in memory while processing rows in a DataFrame?

2018-01-22 Thread David Rosenstrauch
This seems like an easy thing to do, but I've been banging my head against the wall for hours trying to get it to work. I'm processing a spark dataframe (in python). What I want to do is, as I'm processing it I want to hold some data from one record in some local variables in memory, and then

Re: [Spark structured streaming] Use of (flat)mapgroupswithstate takes long time

2018-01-22 Thread Tathagata Das
For computing mapGroupsWithState, can you check the following. - How many tasks are being launched in the reduce stage (that is, the stage after the shuffle, the one computing mapGroupsWithState)? - How long is each task taking? - How many cores does the cluster have? On Thu, Jan 18, 2018 at

Re: Spark vs Snowflake

2018-01-22 Thread Patrick McCarthy
Last I heard of them a year or two ago, they basically repackage AWS services behind their own API/service layer for convenience. There's probably a value-add if you're not familiar with optimizing AWS, but if you already have that expertise I don't expect they would add much extra performance if

Spark vs Snowflake

2018-01-22 Thread Mich Talebzadeh
Hi, Has anyone had experience of using Snowflake, which touts itself as a data warehouse built for the cloud? In reviews one recommendation states "DEFINITELY AN ALTERNATIVE TO

Re: Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-22 Thread Weichen Xu
Hi Stephen, Agree with what Nick said; the ML vs MLLib comparison test seems to be flawed. LR in Spark MLLib uses SGD: in each training iteration, SGD samples only a small fraction of the data and computes the gradient, but in each iteration LBFGS needs to aggregate over the whole input dataset. So
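
The per-iteration cost difference can be sketched in plain Python (illustrative only, not the MLlib implementation): an SGD step computes the gradient over a sampled fraction of the data, while each L-BFGS iteration pays for a gradient over the full dataset.

```python
import math
import random

def avg_gradient(w, data):
    """Average logistic-loss gradient for a 1-D weight over (x, y) pairs, y in {0, 1}."""
    g = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-w * x))
        g += (p - y) * x
    return g / len(data)

def sgd_step(w, data, lr=0.1, fraction=0.1):
    # MLlib-style SGD iteration: gradient over a sampled fraction of the data
    batch = random.sample(data, max(1, int(fraction * len(data))))
    return w - lr * avg_gradient(w, batch)

def full_batch_step(w, data, lr=0.1):
    # What each L-BFGS iteration pays for: a full pass over the dataset
    # (the real L-BFGS update direction is more involved; this shows the cost)
    return w - lr * avg_gradient(w, data)
```

So with the same number of iterations, the full-batch method touches roughly 1/fraction times more data, which by itself explains a large wall-clock gap in such a comparison.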

Re: Spark querying C* in Scala

2018-01-22 Thread Sathish Kumaran Vairavelu
You have to register the Cassandra table in Spark as a DataFrame: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md Thanks Sathish On Mon, Jan 22, 2018 at 7:43 AM Conconscious wrote: > Hi list, > > I have a Cassandra table with two

Re: [EXT] How do I extract a value in foreachRDD operation

2018-01-22 Thread Toy
Thanks Michael, Can you give me an example? I'm new to Spark On Mon, 22 Jan 2018 at 12:25 Michael Mansour wrote: > Toy, > > > > I suggest your partition your data according to date, and use the > forEachPartition function, using the partition as the bucket

Production Critical : Data loss in spark streaming

2018-01-22 Thread KhajaAsmath Mohammed
Hi, I have been using spark streaming with kafka. I have to restart the application daily due to a kms issue, and after a restart the offsets do not match the point where I left off. I am creating the checkpoint directory with val streamingContext = StreamingContext.getOrCreate(checkPointDir, () =>

Re: [EXT] How do I extract a value in foreachRDD operation

2018-01-22 Thread Michael Mansour
Toy, I suggest you partition your data according to date, and use the forEachPartition function, using the partition as the bucket location. This would require you to define a custom hash partitioner function, but that is not too difficult. -- Michael Mansour Data Scientist Symantec From: Toy
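
A minimal plain-Python sketch of the custom-partitioner idea (field names are assumptions, not from the thread); the function would be handed to something like `rdd.partitionBy(num_partitions, date_partitioner)` before `foreachPartition`:

```python
import zlib

def extract_date(record):
    # Assumed record layout: a dict with an ISO-8601 "timestamp" field.
    return record["timestamp"][:10]  # "YYYY-MM-DD"

def date_partitioner(date_key, num_partitions=32):
    # Stable hash of the date string -> partition id, so all records for a
    # day land in one partition (and hence one output bucket). zlib.crc32 is
    # used instead of hash(), which is salted per Python process.
    return zlib.crc32(date_key.encode("utf-8")) % num_partitions
```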

Re: spark 2.0 and spark 2.2

2018-01-22 Thread Xiao Li
Generally, behavior changes in Spark SQL are documented in https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide In the ongoing Spark 2.3 release, all the changes in Spark SQL/DataFrame/Dataset that cause behavior changes are documented in this section.

Re: external shuffle service in mesos

2018-01-22 Thread Susan X. Huynh
Hi Igor, You made a good point about the tradeoffs. I think the main thing you would get with Marathon is the accounting for resources (the memory and cpus specified in the config file). That allows Mesos to manage the resources properly. I don't think the other tools mentioned would reserve

How do I extract a value in foreachRDD operation

2018-01-22 Thread Toy
Hi, We have a spark application to parse log files and save them to S3 in ORC format. However, during the foreachRDD operation we need to extract a date field to be able to determine the bucket location; we partition it by date. Currently, we just hardcode it to the current date, but we have a requirement
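
A sketch of deriving the bucket path from the record itself rather than hardcoding the current date (the field name and bucket are hypothetical):

```python
import json

def bucket_path(log_line, base="s3://my-logs"):
    # Field name and bucket are hypothetical. Deriving the date from the
    # record itself (instead of datetime.now()) keeps late or replayed
    # data in the correct date partition.
    record = json.loads(log_line)
    day = record["timestamp"][:10]  # "YYYY-MM-DD"
    return "{}/date={}/".format(base, day)
```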

Re: [Spark DataFrame]: Passing DataFrame to custom method results in NullPointerException

2018-01-22 Thread Matteo Cossu
Hello, I did not understand your question very well. However, I can tell you that if you do .collect() on an RDD you are collecting all the data in the driver node. For this reason, you should use it only when the RDD is very small. Your function "validate_hostname" depends on a DataFrame. It's not

spark 2.0 and spark 2.2

2018-01-22 Thread Mihai Iacob
Does spark 2.2 have good backwards compatibility? Is there something that works in spark 2.0 but won't work in spark 2.2? Regards, Mihai Iacob, DSX Local

Re: run spark job in yarn cluster mode as specified user

2018-01-22 Thread Margusja
Hi org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor requires user in each node and right permissions set in necessary directories. Br Margus > On 22 Jan 2018, at 13:41, sd wang wrote: > >

Spark querying C* in Scala

2018-01-22 Thread Conconscious
Hi list, I have a Cassandra table with two fields; id bigint, kafka text My goal is to read only the kafka field (which is JSON) and infer the schema I have this skeleton code (not working): sc.stop import org.apache.spark._ import com.datastax.spark._ import

Re: run spark job in yarn cluster mode as specified user

2018-01-22 Thread Jörn Franke
Configure Kerberos > On 22. Jan 2018, at 08:28, sd wang wrote: > > Hi Advisers, > When submit spark job in yarn cluster mode, the job will be executed by > "yarn" user. Any parameters can change the user? I tried setting > HADOOP_USER_NAME but it did not work. I'm

Re: run spark job in yarn cluster mode as specified user

2018-01-22 Thread sd wang
Hi Margus, Appreciate your help! Seems this parameter is related to CGroups functions. I am using CDH without kerberos. I set the parameter: yarn.nodemanager.container-executor.class=org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor Then I ran the spark job again and hit the problem as

Spark and CEP type examples

2018-01-22 Thread Esa Heikkinen
Hi I am looking for simple examples using CEP (Complex Event Processing) with Scala and Python. Does anyone know of good ones? I do not need preprocessing (like in Kafka), but only the analyzing phase of CEP inside Spark. I am also interested in other possibilities to search sequential event

Using window function works extremely slowly

2018-01-22 Thread Anton Puzanov
I am trying to use the Spark SQL built-in window function: https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/functions.html#window(org.apache.spark.sql.Column,%20java.lang.String) I run it with a step of 1 second and a window of 3 minutes (a ratio of 180) and it runs extremely slowly compared to other