Re: How to create SparkSession using SparkConf?

2017-04-27 Thread Yanbo Liang
Could you try the following way? val spark = SparkSession.builder.appName("my-application").config("spark.jars", "a.jar, b.jar").getOrCreate() Thanks Yanbo On Thu, Apr 27, 2017 at 9:21 AM, kant kodali wrote: > I am using Spark 2.1 BTW. > > On Wed, Apr 26, 2017 at 3:22
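
Since the thread asks specifically about SparkConf: SparkSession.builder also accepts a pre-built SparkConf via config(conf), which keeps setJars available. A minimal sketch (the jar paths are the placeholders from the message above):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Build the configuration up front, including setJars, then hand the
    // whole SparkConf to the builder instead of setting keys one by one.
    val conf = new SparkConf()
      .setAppName("my-application")
      .setJars(Seq("a.jar", "b.jar"))

    val spark = SparkSession.builder.config(conf).getOrCreate()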

Re: help/suggestions to setup spark cluster

2017-04-27 Thread Cody Koeninger
You can just cap the cores used per job. http://spark.apache.org/docs/latest/spark-standalone.html http://spark.apache.org/docs/latest/spark-standalone.html#resource-scheduling On Thu, Apr 27, 2017 at 1:07 AM, vincent gromakowski wrote: > Spark standalone is not
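
For reference, the cap being referred to is the spark.cores.max setting from the standalone resource-scheduling docs linked above; a minimal sketch (master URL and core count are placeholders):

    import org.apache.spark.SparkConf

    // Limit this application to 4 cores cluster-wide so other jobs
    // on the same standalone cluster can still get resources.
    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("capped-job")
      .set("spark.cores.max", "4")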

JavaRDD collectAsMap throws java.lang.NegativeArraySizeException

2017-04-27 Thread Manohar753
Hi All, I am getting the below exception while converting my RDD to a Map; my data is hardly a 200MB snappy file, and the code looks like this: @SuppressWarnings("unchecked") public Tuple2<Map<String, String>, String> getMatchData(String location, String key) {

Re: how to create List in pyspark

2017-04-27 Thread Yanbo Liang
You can try with a UDF, like the following code snippet: from pyspark.sql.functions import udf from pyspark.sql.types import ArrayType, StringType df = spark.read.text("./README.md") split_func = udf(lambda text: text.split(" "), ArrayType(StringType())) df.withColumn("split_value", split_func(df.value))

Re: Synonym handling replacement issue with UDF in Apache Spark

2017-04-27 Thread Yanbo Liang
What about JOINing your table with a mapping table? On Thu, Apr 27, 2017 at 9:58 PM, Nishanth wrote: > I am facing a major issue with the replacement of synonyms in my DataSet. > > I am trying to replace the synonyms of the brand names with their equivalent > names. > > I have
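
A sketch of the JOIN approach suggested here, assuming a small mapping table of synonym -> canonical brand name (all table and column names below are invented for illustration):

    import org.apache.spark.sql.functions.coalesce

    // synonyms: DataFrame with columns (synonym, canonical)
    // data:     DataFrame with a `brand` column to normalize
    val normalized = data
      .join(synonyms, data("brand") === synonyms("synonym"), "left_outer")
      // keep the canonical name where a synonym matched, else the original
      .withColumn("brand_clean", coalesce(synonyms("canonical"), data("brand")))
      .drop("synonym", "canonical")

If the mapping table is small, wrapping it in org.apache.spark.sql.functions.broadcast avoids shuffling the large side.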

Data Skew in Dataframe Groupby - Any suggestions?

2017-04-27 Thread KhajaAsmath Mohammed
Hi, I am working on a requirement where I need to perform a groupBy on a set of data and find the max value for each group. The groupBy on the dataframe is skewed and the job runs for quite a long time (actually longer than in Hive and Impala for one day's worth of data). Any suggestions on
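
One standard mitigation for this (not from the thread itself, just a common technique) is two-stage aggregation with a random salt, so a hot key is first reduced across many tasks; a sketch with invented column names:

    import org.apache.spark.sql.functions.{col, floor, max, rand}

    // Stage 1: split each key into up to 100 sub-groups and take
    // per-salt maxima, so one hot key no longer lands in a single task.
    val partial = df
      .withColumn("salt", floor(rand() * 100))
      .groupBy(col("key"), col("salt"))
      .agg(max(col("value")).as("partial_max"))

    // Stage 2: reduce the (at most 100) partial maxima per key.
    val result = partial
      .groupBy(col("key"))
      .agg(max(col("partial_max")).as("max_value"))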

Re: How to create SparkSession using SparkConf?

2017-04-27 Thread kant kodali
Ahhh thanks much! I miss my sparkConf.setJars function, instead of these hacky comma-separated jar names. On Thu, Apr 27, 2017 at 8:01 AM, Yanbo Liang wrote: > Could you try the following way? > > val spark = >

[Pyspark, Python 2.7] Executor hangup caused by Unicode error while logging uncaught exception in worker

2017-04-27 Thread Sebastian Nagel
Hi, I've seen a job (or rather, one of its executors) hang if the message of an uncaught exception contains bytes which cannot be properly decoded as Unicode characters. The last lines in the executor logs were: PySpark worker failed with exception: Traceback (most recent call last): File

Re: Spark Testing Library Discussion

2017-04-27 Thread Sam Elamin
Hi @Lucas, I certainly would love to write an integration testing library for workflows. I have a few ideas I would love to share with others, and they are focused around Airflow since that is what we use. As promised here is

Re: help/suggestions to setup spark cluster

2017-04-27 Thread vincent gromakowski
Spark standalone is not multi-tenant; you need one cluster per job. Maybe you can try fair scheduling and use one cluster, but I doubt it will be prod ready... On 27 Apr 2017 5:28 AM, "anna stax" wrote: > Thanks Cody, > > As I already mentioned I am running spark
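
For context, the fair scheduling mentioned here is per-application: it arbitrates between concurrent jobs inside one SparkContext rather than making standalone multi-tenant. A minimal sketch (the pool name is arbitrary):

    import org.apache.spark.SparkConf

    // Enable the fair scheduler for this application...
    val conf = new SparkConf().set("spark.scheduler.mode", "FAIR")

    // ...then route jobs submitted from a given thread into a named pool.
    sc.setLocalProperty("spark.scheduler.pool", "shared-pool")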

[Spark Core] Why SetAccumulator is buried in org.apache.spark.sql.execution.debug?

2017-04-27 Thread v . chesnokov
Currently SetAccumulator (which extends AccumulatorV2[T, java.util.Set[T]]) is a nested class of org.apache.spark.sql.execution.debug.DebugExec. I wonder why this quite useful class is buried there, in spark-sql, and not exposed in org.apache.spark.util in spark-core.
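
In the meantime, an equivalent can be built in user code on the public AccumulatorV2 API; a minimal sketch (this is not the DebugExec implementation, just the same idea):

    import java.util.Collections
    import org.apache.spark.util.AccumulatorV2

    class SetAccumulator[T] extends AccumulatorV2[T, java.util.Set[T]] {
      private val _set = Collections.synchronizedSet(new java.util.HashSet[T]())
      override def isZero: Boolean = _set.isEmpty
      override def copy(): SetAccumulator[T] = {
        val newAcc = new SetAccumulator[T]
        newAcc._set.addAll(_set)
        newAcc
      }
      override def reset(): Unit = _set.clear()
      override def add(v: T): Unit = _set.add(v)
      override def merge(other: AccumulatorV2[T, java.util.Set[T]]): Unit =
        _set.addAll(other.value)
      override def value: java.util.Set[T] = _set
    }

    // Register it like any other accumulator:
    // val acc = new SetAccumulator[String]; sc.register(acc, "distinct-values")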

RE: Spark-SQL Query Optimization: overlapping ranges

2017-04-27 Thread Lavelle, Shawn
Hi Jacek, I know that it is not currently doing so, but should it be? The algorithm isn’t complicated and could be applied to both OR and AND logical operators with comparison operators as children. My users write programs to generate queries that aren’t checked for this sort of

Re: Spark-SQL Query Optimization: overlapping ranges

2017-04-27 Thread Jacek Laskowski
Hi Shawn, If you're asking me whether Spark SQL should optimize such queries, I don't know. If you're asking me whether it's possible to convince Spark SQL to do so, I'd say: sure, it is. Write your own optimization rule and attach it to the Optimizer (using the extraOptimizations extension point). Regards,
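
A skeleton of what this describes, registered through the experimental hook (the range-merging logic itself is left as a stub, and note these are internal Catalyst APIs):

    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
    import org.apache.spark.sql.catalyst.rules.Rule

    object CollapseOverlappingRanges extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case f @ Filter(condition, child) =>
          // TODO: detect OR/AND of comparisons over the same attribute
          // and collapse overlapping ranges into a single predicate.
          f
      }
    }

    spark.experimental.extraOptimizations ++= Seq(CollapseOverlappingRanges)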

Synonym handling replacement issue with UDF in Apache Spark

2017-04-27 Thread Nishanth
I am facing a major issue with the replacement of synonyms in my DataSet. I am trying to replace the synonyms of the brand names with their equivalent names. I have tried 2 methods to solve this issue. Method 1 (regexp_replace): here I am using the regexp_replace method. Hashtable<String, String> manufacturerNames = new

Re: How to create SparkSession using SparkConf?

2017-04-27 Thread kant kodali
Actually, one more question along the same line. This is about .getOrCreate(): JavaStreamingContext doesn't seem to have a way to accept a SparkSession object, so does that mean a streaming context is not required? If so, how do I pass a lambda to .getOrCreate using SparkSession? The lambda that we
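
One way to square the two APIs, sketched below: a streaming context is still required for DStreams, but it can be built from the SparkContext underlying the session, and the factory lambda goes to StreamingContext.getOrCreate together with a checkpoint directory (the path here is a placeholder; the Java API has the same method on JavaStreamingContext):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "/tmp/checkpoint"  // placeholder

    def createContext(): StreamingContext = {
      val spark = SparkSession.builder.appName("my-application").getOrCreate()
      val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // define DStreams here before returning
      ssc
    }

    // Restores from the checkpoint if present, otherwise invokes the factory.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)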

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
Indeed I have. But even when storing the offsets in Spark and committing offsets upon completion of an output operation within the foreachRDD call (as shown in the example), the only offset that Spark's Kafka implementation commits to Kafka is the offset of the last message. For example, if
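
For context, the commit pattern under discussion (from the spark-streaming-kafka-0-10 integration guide) looks like the sketch below; it commits whole offset ranges once an output operation finishes, which is why only the end offset per partition reaches Kafka (stream is assumed to be a direct Kafka DStream):

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... output operation on rdd ...
      // Commits the end of each range after the work succeeds;
      // there is no per-message commit in this API.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }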

Re: Calculate mode separately for multiple columns in row

2017-04-27 Thread Everett Anderson
For the curious, I played around with a UDAF for this (shown below). On the downside, it assembles a Map of all possible values of the column that'll need to be stored in memory somewhere. I suspect some kind of sorted groupByKey + cogroup could stream values through, though might not support
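
A sketch of a UDAF along those lines, carrying the same caveat: the buffer keeps a count for every distinct value of the column, so everything has to fit in memory:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class ModeUDAF extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
      def bufferSchema: StructType =
        StructType(StructField("counts", MapType(StringType, LongType)) :: Nil)
      def dataType: DataType = StringType
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Map.empty[String, Long]

      // Bump the count of the incoming value.
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) {
          val counts = buffer.getAs[Map[String, Long]](0)
          val v = input.getString(0)
          buffer(0) = counts + (v -> (counts.getOrElse(v, 0L) + 1L))
        }

      // Sum the per-partition count maps.
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        val m1 = buffer1.getAs[Map[String, Long]](0)
        val m2 = buffer2.getAs[Map[String, Long]](0)
        buffer1(0) = m2.foldLeft(m1) { case (acc, (k, n)) =>
          acc + (k -> (acc.getOrElse(k, 0L) + n))
        }
      }

      // The mode is the key with the highest count.
      def evaluate(buffer: Row): Any = {
        val counts = buffer.getAs[Map[String, Long]](0)
        if (counts.isEmpty) null else counts.maxBy(_._2)._1
      }
    }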

Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-04-27 Thread Tim Smith
Hi, I am trying to figure out the API to initialize a Gaussian mixture model using either centroids created by K-means or a previously calculated GMM model (I am aware that you can "save" a model and "load" it later, but I am not interested in saving a model to a filesystem). The Spark MLlib API
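
For comparison while exploring this: the older RDD-based MLlib API does expose an initial-model hook, sketched here (previousModel and vectors are assumed to exist already):

    import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // previousModel: GaussianMixtureModel from an earlier run (or built
    // from k-means centroids); vectors: RDD[Vector] of training data.
    val refit: GaussianMixtureModel = new GaussianMixture()
      .setK(previousModel.k)
      .setInitialModel(previousModel)  // seed EM with the earlier fit
      .run(vectors)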

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
Are you asking for commits for every message? Because that will kill performance. On Thu, Apr 27, 2017 at 11:33 AM, Dominik Safaric wrote: > Indeed I have. But, even when storing the offsets in Spark and committing > offsets upon completion of an output operation

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
Of course I am not asking to commit for every message, but instead to commit the last consumed offset at a given interval. For example, from the 1st until the 5th second, messages up to offset 100,000 of partition 10 were consumed; then from the 6th until the 10th second of

Why "Initial job has not accepted any resources"?

2017-04-27 Thread Yuan Fang
Here is my code. It works locally with setMaster("local[*]"), but it does not work on my remote Spark cluster. I checked all the logs and did not find any error. It shows the following warning. Could you please help? Thank you very very much! 14:45:47.956 [Timer-0] WARN

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
If you're looking for some kind of instrumentation finer than at batch boundaries, you'd have to do something with the individual messages yourself. You have full access to the individual messages including offset. On Thu, Apr 27, 2017 at 1:27 PM, Dominik Safaric
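
In other words, the per-message hook looks like the sketch below: every ConsumerRecord carries its own offset, so finer-grained bookkeeping is written against the records themselves (track() is a hypothetical callback):

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { record =>
          // topic, partition and offset are available on every record
          track(record.topic, record.partition, record.offset)
        }
      }
    }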