Interest in adding the ability to request GPUs to the Spark client?

2018-05-15 Thread Daniel Galvez
Hi all, Is anyone here interested in adding the ability to request GPUs to Spark's client (i.e., spark-submit)? As of now, YARN 3.0's resource manager has the ability to schedule GPUs as resources via cgroups, but the Spark client lacks the ability to request these. The ability to guarantee

[structured-streaming][kafka] Will the Kafka readStream time out after connections.max.idle.ms (540000 ms)?

2018-05-15 Thread karthikjay
Hi all, We are running into a scenario where the structured streaming job is exiting after a while, specifically when the Kafka topic is not getting any data. From the job logs, I see connections.max.idle.ms = 540000. Does that mean the Spark readStream will close when it does not get data
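For reference, 540000 ms (9 minutes) is the Kafka consumer's default for connections.max.idle.ms, and the quoted line is from the consumer's config dump rather than an error. Consumer properties prefixed with "kafka." are passed through to the underlying consumer, so the timeout can be raised explicitly. A minimal sketch, with placeholder broker and topic names:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaIdleTimeout {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("kafka-idle").getOrCreate();

    // Options prefixed with "kafka." are handed to the underlying consumer,
    // so connections.max.idle.ms can be raised above the 9-minute default.
    Dataset<Row> stream = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092")    // placeholder brokers
        .option("subscribe", "events")                      // hypothetical topic
        .option("kafka.connections.max.idle.ms", "3600000") // 1 hour instead of 9 minutes
        .load();

    stream.printSchema();
  }
}
```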

Continuous Processing mode behaves differently from Batch mode

2018-05-15 Thread Yuta Morisawa
Hi all, Now I am using Structured Streaming in Continuous Processing mode and I faced an odd problem. My code is simple and closely follows the sample code in the documentation: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing When
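For context, continuous processing in Spark 2.3 supports only map-like operations (selections, projections, filters) and a limited set of sources and sinks, which is a frequent reason the same query behaves differently than it does under micro-batch execution. A minimal sketch of switching a query to a continuous trigger, with placeholder broker, topic, and checkpoint names:

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class ContinuousEcho {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("continuous").getOrCreate();

    StreamingQuery query = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092") // placeholder
        .option("subscribe", "in")                        // hypothetical topic
        .load()
        .writeStream()
        .format("console")
        .option("checkpointLocation", "/tmp/cp")          // placeholder path
        // Continuous mode: epochs checkpointed every second; only map-like
        // operations (select/filter/where) are supported in 2.3.
        .trigger(Trigger.Continuous("1 second"))
        .start();

    query.awaitTermination();
  }
}
```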

Re: Structured Streaming, Reading and Updating a variable

2018-05-15 Thread Lalwani, Jayesh
Do you have a code sample and a detailed error message/exception to show? From: Martin Engen Date: Tuesday, May 15, 2018 at 9:24 AM To: "user@spark.apache.org" Subject: Structured Streaming, Reading and Updating a variable Hello, I'm working

Re: [Arrow][Dremio]

2018-05-15 Thread Xavier Mehaut
Thanks Bryan for the answer. Sent from my iPhone > On 15 May 2018 at 19:06, Bryan Cutler wrote: > > Hi Xavier, > > Regarding Arrow usage in Spark, using the Arrow format to transfer data between > Python and Java has been the focus so far because this area stood to benefit

java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes

2018-05-15 Thread Mina Aslani
Hi, I am trying to test my Spark app implemented in Java. In my Spark app I load the LogisticRegressionModel that I have already created, trained and tested using a portion of the training data. Now, when I test my Spark app with another set of data and try to predict, I get the below error when
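This error generally means the feature vector handed to the model at prediction time has a different length than the vectors the model was trained on, e.g. because the test data was featurized with a re-fit transformer (producing a different vocabulary) instead of the transformer fit on the training data. A hedged sketch, with a hypothetical model path and feature values, of checking the widths:

```java
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;

public class CheckFeatureWidth {
  public static void main(String[] args) {
    // Hypothetical path; the model was fit on the original training data.
    LogisticRegressionModel model = LogisticRegressionModel.load("/models/lr");

    // A feature vector built from the new test data (illustrative values).
    Vector v = Vectors.dense(0.5, 1.2, 3.4);

    // BLAS.dot fails when these two widths differ, e.g. because the test
    // set was featurized independently of the training pipeline.
    if (v.size() != model.numFeatures()) {
      throw new IllegalStateException(
          "test vector has " + v.size() + " features, model expects "
              + model.numFeatures());
    }
    System.out.println("Feature widths match: " + model.numFeatures());
  }
}
```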

Re: [Arrow][Dremio]

2018-05-15 Thread Bryan Cutler
Hi Xavier, Regarding Arrow usage in Spark, using the Arrow format to transfer data between Python and Java has been the focus so far because this area stood to benefit the most. It's possible that the scope of Arrow could broaden in the future, but there still need to be discussions about this.

Sklearn model in pyspark prediction

2018-05-15 Thread HARSH TAKKAR
Hi, Is there a way to load a model saved using the sklearn library in PySpark / Scala Spark for prediction? Thanks

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-15 Thread Mina Aslani
Hi, So, what is the workaround? Should I create multiple indexers (one for each column), and then create a pipeline and set its stages to all the StringIndexers? I am using 2.2.1 as I cannot move to 2.3.0. It looks like OneHotEncoderEstimator is broken; please see my email sent today with subject:
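That is indeed the usual workaround on 2.2.x: one StringIndexer per column, chained in a Pipeline. A minimal sketch (the output-column suffix and the handleInvalid choice are illustrative):

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;

public class MultiColumnIndexer {
  // Builds one StringIndexer per input column and chains them in a Pipeline,
  // the standard workaround before multi-column support landed.
  public static Pipeline build(String[] columns) {
    PipelineStage[] stages = new PipelineStage[columns.length];
    for (int i = 0; i < columns.length; i++) {
      stages[i] = new StringIndexer()
          .setInputCol(columns[i])
          .setOutputCol(columns[i] + "_indexed")
          .setHandleInvalid("keep"); // keep unseen labels instead of failing
    }
    return new Pipeline().setStages(stages);
  }
}
```

Fitting `build(new String[]{"colA", "colB"})` on a DataFrame then yields a single PipelineModel that indexes all the columns in one transform pass.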

Re: Scala's Seq :_* equivalent in Java

2018-05-15 Thread Koert Kuipers
Isn't _* varargs? So you should be able to use a Java array? On Tue, May 15, 2018, 06:29 onmstester onmstester wrote: > I could not find how to pass a list to isin() filter in java, something > like this could be done with scala: > > val ids = Array(1,2) >
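Right: Column.isin takes varargs, so on the Java side a boxed array can be passed once it is widened to Object[]. A minimal sketch, assuming an id column:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class IsInExample {
  public static Dataset<Row> keepIds(Dataset<Row> df) {
    // Column.isin is a varargs method, so a boxed Java array can be passed
    // directly once cast to Object[]; no Scala Seq conversion is needed.
    Integer[] ids = {1, 2};
    return df.filter(df.col("id").isin((Object[]) ids));
  }
}
```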

Re: Spark Structured Streaming is giving error “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;”

2018-05-15 Thread रविशंकर नायर
Hi Jacek, If we use RDDs instead of DataFrames, can we accomplish the same? I mean, is joining between RDDs allowed in Spark Streaming? Best, Ravi On Sun, May 13, 2018 at 11:18 AM Jacek Laskowski wrote: > Hi, > > The exception message should be self-explanatory and says that
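For the RDD-based DStream API the answer is yes: two pair DStreams can be joined batch-by-batch, and a static RDD can be joined in via transform. A sketch with hypothetical key/value types:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

public class DStreamJoin {
  // Stream-stream join: each batch of `left` is joined with the
  // corresponding batch of `right` on the key.
  public static JavaPairDStream<String, Tuple2<Integer, Integer>> join(
      JavaPairDStream<String, Integer> left,
      JavaPairDStream<String, Integer> right) {
    return left.join(right);
  }

  // Stream-static join via transform, against a precomputed lookup RDD.
  public static JavaPairDStream<String, Tuple2<Integer, String>> joinStatic(
      JavaPairDStream<String, Integer> left,
      JavaPairRDD<String, String> lookup) {
    return left.transformToPair(rdd -> rdd.join(lookup));
  }
}
```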

Re: Structured Streaming, Reading and Updating a variable

2018-05-15 Thread Koert Kuipers
You can use a windowed aggregation for this. On Tue, May 15, 2018, 09:23 Martin Engen wrote: > Hello, > > > > I'm working with Structured Streaming, and I need a method of keeping a > running average based on last 24hours of data. > > To help with this, I can use
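A sketch of that suggestion, assuming the stream carries timestamp and value columns (hypothetical names). A 24-hour window sliding every hour approximates a running daily average, with a watermark to bound state:

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class RunningAverage {
  // Sliding 24-hour average recomputed every hour; "timestamp" and "value"
  // are assumed column names.
  public static Dataset<Row> dailyAverage(Dataset<Row> events) {
    return events
        .withWatermark("timestamp", "24 hours") // bound state for late data
        .groupBy(window(col("timestamp"), "24 hours", "1 hour"))
        .agg(avg(col("value")).as("avg_24h"));
  }
}
```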

OneHotEncoderEstimator - java.lang.NoSuchMethodError: org.apache.spark.sql.Dataset.withColumns

2018-05-15 Thread Mina Aslani
Hi, I get the below error when I try to run the OneHotEncoderEstimator example. https://github.com/apache/spark/blob/b74366481cc87490adf4e69d26389ec737548c15/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java#L67 Which is this line of the code:
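A NoSuchMethodError here usually points at mixed Spark versions on the classpath: both OneHotEncoderEstimator and the Dataset.withColumns method it calls internally first appeared in 2.3.0, so a 2.3 example running against 2.2.x sql jars fails exactly this way. A quick runtime check:

```java
import org.apache.spark.sql.SparkSession;

public class VersionCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("version").getOrCreate();
    // OneHotEncoderEstimator and the withColumns method it depends on both
    // arrived in 2.3.0; anything older on the classpath triggers
    // NoSuchMethodError at runtime.
    System.out.println("Runtime Spark version: " + spark.version());
  }
}
```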

Re: UDTF registration fails for hiveEnabled SQLContext

2018-05-15 Thread Mick Davies
I am trying to register a UDTF, not a UDF, so I don't think this applies. Mick

Structured Streaming, Reading and Updating a variable

2018-05-15 Thread Martin Engen
Hello, I'm working with Structured Streaming, and I need a method of keeping a running average based on the last 24 hours of data. To help with this, I can use Exponential Smoothing, which means I really only need to carry one value from the previous calculation into the new one, and update this variable
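If the requirement is literally to carry one smoothed value forward between triggers, arbitrary stateful processing is one fit. A hedged sketch using mapGroupsWithState (run the query in update output mode), with a hypothetical Measurement type, sensorId key, and smoothing factor:

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.streaming.GroupState;

public class ExponentialSmoothing {
  private static final double ALPHA = 0.1; // hypothetical smoothing factor

  // Keeps one double of state per key and folds each new measurement into
  // it: smoothed = alpha * x + (1 - alpha) * previous.
  public static Dataset<Double> smooth(Dataset<Measurement> events) {
    MapGroupsWithStateFunction<String, Measurement, Double, Double> update =
        (key, values, state) -> {
          double s = state.exists() ? state.get() : Double.NaN;
          while (values.hasNext()) {
            double x = values.next().value;
            s = Double.isNaN(s) ? x : ALPHA * x + (1 - ALPHA) * s;
          }
          state.update(s);
          return s;
        };

    return events
        .groupByKey(
            (MapFunction<Measurement, String>) m -> m.sensorId, Encoders.STRING())
        .mapGroupsWithState(update, Encoders.DOUBLE(), Encoders.DOUBLE());
  }

  public static class Measurement implements java.io.Serializable {
    public String sensorId;
    public double value;
  }
}
```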

Re: UDTF registration fails for hiveEnabled SQLContext

2018-05-15 Thread Ajay
You can register UDFs by using the built-in udf function as well (import org.apache.spark.sql.functions._), something along the lines of val flattenUdf = udf(udfutils.flatten), where udfutils is another object and flatten is a method in it. On Tue, May 15, 2018 at 3:27 AM Mick Davies
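A Java flavour of the same idea, using spark.udf().register. Note this covers plain one-in/one-out UDFs only, which is Mick's point: UDTFs that emit multiple rows cannot be registered this way. The function name is illustrative:

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class RegisterUdf {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("udf").getOrCreate();

    // Registers a scalar (one row in, one value out) UDF usable from SQL.
    spark.udf().register(
        "plusOne", (UDF1<Integer, Integer>) x -> x + 1, DataTypes.IntegerType);

    spark.sql("SELECT plusOne(41)").show();
  }
}
```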

Spark structured streaming aggregation within microbatch

2018-05-15 Thread Koert Kuipers
I have a streaming dataframe where I insert a uuid in every row, then join with a static dataframe (after which the uuid column is no longer unique), then group by uuid and do a simple aggregation. So I know all rows with the same uuid will be in the same micro-batch, guaranteed, correct? How do I express it
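A sketch of the pipeline as described, assuming the uuid() SQL function and a hypothetical join column. Note that a plain groupBy on a streaming DataFrame is a global stateful aggregation across all micro-batches; expressing "aggregate within this micro-batch only" is exactly the open question here:

```java
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.expr;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class PerRowGrouping {
  // Tags every streaming row with a fresh uuid, joins against a static
  // dimension table (fanning each row out), then groups the fan-out back
  // together by that uuid.
  public static Dataset<Row> explodeAndRegroup(
      Dataset<Row> stream, Dataset<Row> dim) {
    return stream
        .withColumn("uuid", expr("uuid()"))
        .join(dim, "key")            // hypothetical join column
        .groupBy("uuid")
        .agg(count("*").as("n"));
  }
}
```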

Scala's Seq :_* equivalent in Java

2018-05-15 Thread onmstester onmstester
I could not find how to pass a list to the isin() filter in Java; something like this could be done with Scala: val ids = Array(1,2) df.filter(df("id").isin(ids:_*)).show But in Java, everything that converts a Java list to a Scala Seq fails with an unsupported literal type exception:

Re: UDTF registration fails for hiveEnabled SQLContext

2018-05-15 Thread Mick Davies
Hi Gourav, I don't think you can register UDTFs via sparkSession.udf.register. Mick

What to consider when implementing a custom streaming sink?

2018-05-15 Thread kant kodali
Hi All, I am trying to implement a custom sink and I have a few questions, mainly on output modes. 1) How does Spark let the sink know that a new row is an update of an existing row? Does it look at all the values of all columns of the new row and an existing row for an equality match, or does it
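On question 1: in update mode the sink just receives the changed result rows; Spark does not mark them as insert vs. update, so the usual approach is to upsert into the target store keyed on the grouping columns. In Spark 2.x the simplest public hook for a custom sink is ForeachWriter; a skeleton with hypothetical column names:

```java
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;

public class UpsertWriter extends ForeachWriter<Row> {
  // Changed result rows arrive as plain rows; the writer decides whether
  // each is an insert or an update, typically by upserting on the key.
  @Override
  public boolean open(long partitionId, long epochId) {
    // open a connection to the external store; return false to skip this batch
    return true;
  }

  @Override
  public void process(Row row) {
    String key = row.getAs("key");   // hypothetical grouping column
    Long value = row.getAs("count"); // hypothetical aggregate column
    // UPSERT (key, value) into the target store here
  }

  @Override
  public void close(Throwable errorOrNull) {
    // release the connection
  }
}
```

Attach it with writeStream().foreach(new UpsertWriter()) on the aggregated stream.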

Re: spark sql StackOverflow

2018-05-15 Thread Jörn Franke
3000 filters don't look like something reasonable. This is very difficult to test and verify, as well as impossible to maintain. Could it be that your filters are another table that you should join with? The example is a little bit artificial, which makes the underlying business case hard to understand. Can you
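A sketch of that suggestion: put the wanted (key1, key2) pairs in their own small DataFrame and replace the giant OR with a broadcast semi-join. Table and column names are illustrative:

```java
import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class FilterViaJoin {
  // Instead of one huge OR of thousands of (key1, key2) predicates, keep
  // the wanted pairs in a table and semi-join on them; broadcast keeps the
  // join shuffle-free when the pair table is small.
  public static Dataset<Row> keepWanted(
      Dataset<Row> records, Dataset<Row> wantedPairs) {
    return records.join(
        broadcast(wantedPairs),
        records.col("key1").equalTo(wantedPairs.col("key1"))
            .and(records.col("key2").equalTo(wantedPairs.col("key2"))),
        "left_semi");
  }
}
```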

Re: spark sql StackOverflow

2018-05-15 Thread Alessandro Solimando
From the information you provided, I would tackle this as a batch problem, because this way you have access to more sophisticated techniques and you have more flexibility (maybe HDFS and a Spark job, but also think about a datastore offering good indexes for the kind of data types and values you

Spark streaming with Kafka input stuck in (Re-)joining group because of group rebalancing

2018-05-15 Thread JF Chen
When I terminate a Spark streaming application and restart it, it always gets stuck at this step: > > Revoking previously assigned partitions [] for group [mygroup] > (Re-)joining group [mygroup] If I use a new group id it works fine, but I may lose the data from the last time I read the
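With the Kafka 0.10 direct stream, the pause is typically the group coordinator waiting for the previous run's consumers, which never left the group cleanly, to be expired (after session.timeout.ms) before completing the rebalance; keeping the same group id is what preserves the committed offsets. A sketch of the setup, with placeholder broker and topic names:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DirectStreamSetup {
  public static void create(JavaStreamingContext jssc) {
    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "host1:9092"); // placeholder
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    // Keep the same group id so committed offsets are picked up on restart;
    // the "(Re-)joining group" pause then lasts until the broker forgets
    // the previous, uncleanly stopped members (session.timeout.ms).
    kafkaParams.put("group.id", "mygroup");
    kafkaParams.put("auto.offset.reset", "earliest"); // used only when no offsets exist

    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(
            Arrays.asList("mytopic"), kafkaParams)); // hypothetical topic
  }
}
```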

Re: spark sql StackOverflow

2018-05-15 Thread Alessandro Solimando
Hi, I am not familiar with ATNConfigSet, but here are some thoughts that might help. How many distinct key1 (resp. key2) values do you have? Are these values reasonably stable over time? Are these records ingested in real-time or are they loaded from a datastore? In the latter case, the DB might be able

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-15 Thread Nick Pentreath
Multi column support for StringIndexer didn't make it into Spark 2.3.0. The PR is still in progress, I think - it should be available in 2.4.0. On Mon, 14 May 2018 at 22:32, Mina Aslani wrote: > Please take a look at the api doc: >

spark sql StackOverflow

2018-05-15 Thread onmstester onmstester
Hi, I need to run some queries on a huge amount of input records. The input rate is 100K records/second. A record is like (key1, key2, value) and the application should report occurrences of key1 == something and key2 == somethingElse. The problem is there are too many filters in my query: more than