spark-submit: Warning: Skip remote jar hdfs

2019-01-23 Thread Neo Chien
Hi Experts, I would like to submit a Spark job with an additional jar configured on HDFS; however, Hadoop gives me a warning about skipping the remote jar. Although I can still get my final results on HDFS, I cannot obtain the effect of the additional remote jar. I would appreciate it if you could give me some

Re: I have trained a ML model, now what?

2019-01-23 Thread Felix Cheung
Please comment in the JIRA/SPIP if you are interested! We can see the community support for a proposal like this. From: Pola Yao Sent: Wednesday, January 23, 2019 8:01 AM To: Riccardo Ferrari Cc: Felix Cheung; User Subject: Re: I have trained a ML model, now

unsubscribe

2019-01-23 Thread Irtiza Ali
unsubscribe

Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-23 Thread Serega Sheypak
Hi Imran, here is my use case: there is a 1K-node cluster, and jobs have performance degradation because of a single node. It's rather hard to convince Cluster Ops to decommission a node because of "performance degradation". Imagine 10 dev teams chasing a single ops team for a valid reason (the node has problems)

Re: How to get all input tables of a SPARK SQL 'select' statement

2019-01-23 Thread Shahab Yunus
This could be a tangential idea, but it might help: why not use the queryExecution and logicalPlan objects that are available when you execute a query using SparkSession and get a DataFrame back? The JSON representation contains almost all the info that you need, and you don't need to go to Hive to get this
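
As an illustration of that idea, here is a minimal Scala sketch, assuming Spark 2.x and made-up table names (db1.orders, db2.customers): the parsed logical plan kept by queryExecution still contains the table references as UnresolvedRelation nodes, so collecting them yields the input tables.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

val spark = SparkSession.builder().appName("input-tables").enableHiveSupport().getOrCreate()

// Running the query through SparkSession gives a DataFrame whose queryExecution
// keeps the parsed (unanalyzed) logical plan around.
val df = spark.sql(
  "SELECT o.id, c.name FROM db1.orders o JOIN db2.customers c ON o.cid = c.cid")

// Table references show up as UnresolvedRelation nodes in that plan (Spark 2.x field names).
// (df.queryExecution.logical.toJSON gives the JSON form mentioned above.)
val inputTables = df.queryExecution.logical.collect {
  case r: UnresolvedRelation => r.tableIdentifier.unquotedString
}.distinct

inputTables.foreach(println)   // db1.orders, db2.customers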

Re: How to get all input tables of a SPARK SQL 'select' statement

2019-01-23 Thread Ramandeep Singh Nanda
EXPLAIN EXTENDED or EXPLAIN would list the plan along with the tables. I'm not aware of any statement that explicitly lists dependencies or tables directly. Regards, Ramandeep Singh On Wed, Jan 23, 2019, 11:05 Tomas Bartalos wrote: This might help: > > show tables; > > On Wed, 23 Jan 2019 at 10:43, wrote:
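
For reference, a tiny sketch of that approach, assuming an existing SparkSession named spark and made-up table names; the input tables appear in the relation nodes of the printed plans.

// EXPLAIN EXTENDED prints the parsed, analyzed, optimized and physical plans.
spark.sql("EXPLAIN EXTENDED SELECT o.id, c.name FROM orders o JOIN customers c ON o.cid = c.cid")
  .show(truncate = false)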

Re: SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-23 Thread Alistair Blair
Hi Xiangrui +1 It would be fantastic to see this functionality. Regards Alistair. On 2019/01/15 16:52:44, Xiangrui Meng wrote: > Hi all, I want to re-send the previous SPIP on introducing a DataFrame-based graph component to collect more feedback. It supports property graphs,

Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-23 Thread Imran Rashid
Serega, can you explain a bit more why you want this ability? If the node is really bad, wouldn't you want to decommission the NM entirely? If you've got heterogeneous resources, then node labels seem like they would be more appropriate -- and I don't feel great about adding workarounds for the

Re: Create all the combinations of a groupBy

2019-01-23 Thread Pierremalliard
Looks like I found the solution, in case anyone ever encounters a similar challenge... df = spark.createDataFrame( [("a", 1, 0), ("a", 2, 42), ("a", 3, 10), ("b", 4, -1), ("b", 5, -2), ("b", 6, 12)], ("key", "consumerID", "feature") ) df.show() schema = StructType([ StructField("ID_1",

Re: SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-23 Thread Alastair Green
Testing response -- there seems to be a problem with replies to this thread. On 2019/01/15 16:52:44, Xiangrui Meng wrote: > Hi all, I want to re-send the previous SPIP on introducing a DataFrame-based graph component to collect more feedback. It supports property graphs, Cypher graph

Re: Create all the combinations of a groupBy

2019-01-23 Thread hemant singh
Check the rollup and cube functions in Spark SQL. On Wed, 23 Jan 2019 at 10:47 PM, Pierremalliard < pierre.de-malli...@capgemini.com> wrote: > Hi, > > I am trying to generate a dataframe of all combinations that have the same > key > using Pyspark. > > example: > > (a,1) > (a,2) > (a,3) > (b,1) >

Create all the combinations of a groupBy

2019-01-23 Thread Pierremalliard
Hi, I am trying to generate a dataframe of all combinations that have the same key using Pyspark. Example: (a,1) (a,2) (a,3) (b,1) (b,2) should return: (a,1,2) (a,1,3) (a,2,3) (b,1,2). I want to do something like df.groupBy('key').combinations().apply(...). Any suggestions are
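
One way to get those pairs without a custom combinations() helper is a keyed self-join. A minimal sketch below, in Scala (the thread uses PySpark, but the same idea translates directly); the column names key and id are just the ones from the example above.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("key-combinations").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2)).toDF("key", "id")

// Self-join on the key; keeping only id_1 < id_2 gives each unordered pair exactly once.
val pairs = df.as("l")
  .join(df.as("r"), $"l.key" === $"r.key" && $"l.id" < $"r.id")
  .select($"l.key".as("key"), $"l.id".as("id_1"), $"r.id".as("id_2"))

pairs.show()
// (a,1,2) (a,1,3) (a,2,3) (b,1,2), matching the expected output above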

Re: Spark UI History server on Kubernetes

2019-01-23 Thread Li Gao
In addition to what Rao mentioned, if you are using cloud blob storage such as AWS S3, you can specify your history location to be an S3 location such as: `s3://mybucket/path/to/history` On Wed, Jan 23, 2019 at 12:55 AM Rao, Abhishek (Nokia - IN/Bangalore) < abhishek@nokia.com> wrote: > Hi

Re: How to get all input tables of a SPARK SQL 'select' statement

2019-01-23 Thread Tomas Bartalos
This might help: show tables; On Wed, 23 Jan 2019 at 10:43, wrote: > Hi, All, > > We need to get all input tables of several SPARK SQL 'select' statements. > > We can get that information for Hive SQL statements by using 'explain > dependency select'. > But I can't find the equivalent

Re: I have trained a ML model, now what?

2019-01-23 Thread Pola Yao
Hi Riccardo, Right now, Spark does not support low-latency predictions in production. MLeap is an alternative, and it's been used in many scenarios. But it's good to see that the Spark community has decided to provide such support. On Wed, Jan 23, 2019 at 7:53 AM Riccardo Ferrari wrote: > Felix,

Re: I have trained a ML model, now what?

2019-01-23 Thread Riccardo Ferrari
Felix, thank you very much for the link. Much appreciated. The attached PDF is very interesting; I found myself evaluating many of the scenarios described in Q3. It's unfortunate the proposal is not being worked on; it would be great to see that become part of the code base. It is cool to see big players

Spark Stateful Streaming - add counter column

2019-01-23 Thread Femi Anthony
I have a Spark Streaming process that consumes records off a Kafka topic, processes them, and sends them to a producer to publish on another topic. I would like to add a sequence number column that can be used to identify records that have the same key and is incremented for each duplicate
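
The message doesn't say whether this uses the DStream or the Structured Streaming API; below is a rough sketch of one way to do it with Structured Streaming's flatMapGroupsWithState, keeping a per-key counter in state. The topic names, broker address, checkpoint path and Record schema are all made up for illustration.

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Record(key: String, value: String)
case class Numbered(key: String, value: String, seq: Long)

val spark = SparkSession.builder().appName("seq-numbers").getOrCreate()
import spark.implicits._

// Read the source topic; Kafka key/value arrive as binary and are cast to strings here.
val records: Dataset[Record] = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input-topic")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .as[Record]

// Keep a running count per key in state; each record with the same key gets the next number.
// Note: within a micro-batch the iterator order is not guaranteed; sort by a timestamp first
// if ordering matters.
val numbered = records
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.NoTimeout())(
    (key: String, rows: Iterator[Record], state: GroupState[Long]) => {
      var seq = state.getOption.getOrElse(0L)
      val out = rows.map { r =>
        seq += 1
        Numbered(r.key, r.value, seq)
      }.toList
      state.update(seq)
      out.iterator
    })

// Console sink for the sketch; publishing back to Kafka would use the "kafka" sink instead.
val query = numbered.writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/seq-checkpoint")
  .start()

query.awaitTermination()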

Please add Singapore Spark meetup to Community page... thank you!

2019-01-23 Thread Arseny Chernov
Hello dear Sir/Madam, Please add https://meetup.com/Spark-Singapore/ to the page https://spark.apache.org/community.html Thanks, Arseny

Re: SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-23 Thread Andrea Santurbano
+1 Graph analytics is now mainstream, and having first-class Cypher support in Spark would allow users to deal with highly connected datasets (fraud detection, epidemiology analysis, genomic analysis, and so on), going beyond the limits of joins when you must traverse a dataset. On 2019/01/15

Re: Increase time for Spark Job to be in Accept mode in Yarn

2019-01-23 Thread Chetan Khatri
Hello Beliefer, I am orchestrating many Spark jobs using Airflow; some of the Spark jobs get started and run, while many others sit in the ACCEPTED state, and sometimes 1-2 jobs go to the FAILED state if YARN cannot create the application container. Thanks On Wed, Jan 23, 2019 at 9:15 AM 大啊

Re: Customizing Spark ThriftServer

2019-01-23 Thread 大啊
The Spark ThriftServer is a Spark application that hosts a Thrift server; your code is a custom Spark application. If you need custom functionality beyond the Spark ThriftServer, you can make your Spark application start HiveThriftServer2 itself. At 2019-01-23 17:53:01, "Soheil Pourbafrani" wrote:
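
Concretely, starting HiveThriftServer2 from your own application usually means calling HiveThriftServer2.startWithContext (the spark-hive-thriftserver module must be on the classpath). A minimal sketch; the port, parquet path and view name are illustrative only.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val spark = SparkSession.builder()
  .appName("embedded-thrift-server")
  .config("hive.server2.thrift.port", "10001")                    // illustrative port
  .config("spark.sql.hive.thriftServer.singleSession", "true")
  .enableHiveSupport()
  .getOrCreate()

// Predefine whatever tables or views JDBC/ODBC clients should see.
spark.read.parquet("/data/events").createOrReplaceTempView("events")

// Start the Thrift server against this session, so clients share its catalog and temp views.
HiveThriftServer2.startWithContext(spark.sqlContext)

// Keep the application alive so the server keeps serving.
Thread.currentThread().join()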

Unsubscribe

2019-01-23 Thread pokemonmaster9505

How to optimize iterative data processing in spark application

2019-01-23 Thread Federico D'Ambrosio
Hello everyone, I have a Spark application processing data iteratively within an RDD until .isEmpty() is true. The loop is roughly as follows: mainRDD = sc.parallelize(...) //initialize mainRDD do { rdd1 = mainRDD.flatMapToPair(advanceState) //advance state of each element rdd2 =
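
The usual levers for a loop like this are caching each iteration's RDD, unpersisting the previous one, and checkpointing periodically so the lineage does not grow with every pass (each isEmpty() check launches a job over the whole accumulated DAG otherwise). A rough Scala sketch of that pattern; advanceState is a toy stand-in for the real transition logic and the checkpoint path is illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("iterative-rdd").getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/iter-checkpoints")        // illustrative path

// Toy stand-in: halve each element until it reaches 1, then emit nothing.
def advanceState(x: Int): Seq[Int] = if (x > 1) Seq(x / 2) else Seq.empty

var current = sc.parallelize(1 to 1000).cache()     // placeholder initial data
var iteration = 0

while (!current.isEmpty()) {
  val next = current.flatMap(advanceState).cache()

  // Truncate the lineage every few iterations so the DAG doesn't grow without bound.
  if (iteration % 10 == 0) next.checkpoint()

  next.count()            // materialize (and checkpoint) before dropping the parent
  current.unpersist()
  current = next
  iteration += 1
}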

Customizing Spark ThriftServer

2019-01-23 Thread Soheil Pourbafrani
Hi, I want to create a Thrift server that has some Hive tables predefined and listens on a port for user queries. Here is my code: val spark = SparkSession.builder() .config("hive.server2.thrift.port", "1") .config("spark.sql.hive.thriftServer.singleSession", "true")

How to get all input tables of a SPARK SQL 'select' statement

2019-01-23 Thread luby
Hi, All, We need to get all input tables of several SPARK SQL 'select' statements. We can get that information for Hive SQL statements by using 'explain dependency select', but I can't find the equivalent command for SPARK SQL. Does anyone know how to get this information for a SPARK SQL

Re: How to query on Cassandra and load results in Spark dataframe

2019-01-23 Thread Riccardo Ferrari
Hi Soheil, You should be able to apply a filter transformation. Spark is lazily evaluated, and the actual loading from Cassandra happens only when an action triggers it. Find more here: https://spark.apache.org/docs/2.3.2/rdd-programming-guide.html#rdd-operations The Spark Cassandra connector supports
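
For the DataFrame route, a minimal sketch with the spark-cassandra-connector; the host, keyspace and table names are made up, and the connector package must be on the classpath. Loading is lazy, and simple predicates like the one below can be pushed down to Cassandra (depending on the table's key layout) so only matching rows are fetched.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-filter")
  .config("spark.cassandra.connection.host", "cassandra-host")     // illustrative host
  .getOrCreate()

// Nothing is read yet: this only defines the source and the filter.
val events = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "events"))  // illustrative names
  .load()
  .filter("event_date >= '2019-01-01'")

// The action below triggers the actual load; pushdown-eligible filters are applied server-side.
events.show()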

RE: Spark UI History server on Kubernetes

2019-01-23 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Lakshman, We’ve set these 2 properties to bring up the Spark history server: spark.history.fs.logDirectory and spark.history.ui.port. We’re writing the logs to HDFS. In order to write the logs, we’re setting the following properties while submitting the Spark job: spark.eventLog.enabled true
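
Putting those properties together, a typical spark-defaults.conf for this kind of setup might look like the following sketch; the HDFS path is illustrative, and 18080 is the default history UI port.

# Set on each job so event logs land in a shared location
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///shared/spark-history

# Read by the history server
spark.history.fs.logDirectory    hdfs:///shared/spark-history
spark.history.ui.port            18080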