Re: 1 Executor per partition

2018-04-04 Thread utkarsh_deep
You are correct.

Re: Best way to Hive to Spark migration

2018-04-04 Thread Jörn Franke
You need to provide more context on what you currently do in Hive and what you expect from the migration. > On 5. Apr 2018, at 05:43, Pralabh Kumar wrote: > > Hi Spark group > > What's the best way to Migrate Hive to Spark > > 1) Use HiveContext of Spark > 2) Use

Best way to Hive to Spark migration

2018-04-04 Thread Pralabh Kumar
Hi Spark group, What's the best way to migrate Hive to Spark?
1) Use HiveContext of Spark
2) Use Hive on Spark (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
3) Migrate Hive to Calcite to Spark SQL
Regards
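A minimal sketch of option 1, assuming Spark 2.x, where SparkSession.builder().enableHiveSupport() takes over the role of the old HiveContext (the table and query below are made-up examples):

import org.apache.spark.sql.SparkSession

// Session wired to the existing Hive metastore so current HQL keeps working
val spark = SparkSession.builder()
  .appName("hive-migration")
  .enableHiveSupport()
  .getOrCreate()

// hypothetical query against an existing Hive table
spark.sql("SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept").show()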

Spark uses more threads than specified in local[n]

2018-04-04 Thread Xiangyu Li
Hi, I am running a job in local mode, configured with local[1] for the sake of the example. The timeline view in the Spark UI is as follows: it shows that there are actually two threads running, though their overlap is very small. To validate this, I also added some changes to Spark's task

how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-04 Thread Andy Davidson
I am having a heck of a time setting up my development environment. I used pip to install pyspark. I also downloaded spark from apache. My eclipse pyDev interpreter is configured as a python3 virtualenv. I have a simple unit test that loads a small dataframe. Df.show() generates the following

Re: trouble with 'pip pyspark' pyspark.sql.functions. "unresolved import" for col() and lit()

2018-04-04 Thread Gourav Sengupta
Hi, the way I manage things is: download spark, set SPARK_HOME, then import findspark and run findspark.init(). And everything else works just fine. I have never tried pip install pyspark though. Regards, Gourav Sengupta On Wed, Apr 4, 2018 at 11:28 PM, Andy Davidson <

trouble with 'pip pyspark' pyspark.sql.functions. "unresolved import" for col() and lit()

2018-04-04 Thread Andy Davidson
I am having trouble setting up my python3 virtualenv. I created a virtualenv 'spark-2.3.0' and installed pyspark using pip, however I am not able to import pyspark.sql.functions. I get "unresolved import" when I try to import col() and lit(): from pyspark.sql.functions import * I found if I

Re: 1 Executor per partition

2018-04-04 Thread Gourav Sengupta
Each partition should be translated into one task which should run in one executor. But one executor can process more than one task. I may be wrong, and will be grateful if someone can correct me. Regards, Gourav On Wed, Apr 4, 2018 at 8:13 PM, Thodoris Zois wrote: > > Hello
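A small sketch of that mapping, assuming a cluster manager where the executor count can be pinned via spark.executor.instances (it has no effect in local mode), so treat it as illustrative only:

import org.apache.spark.sql.SparkSession

// Cap the job at one executor running one task at a time
val spark = SparkSession.builder()
  .appName("one-executor-many-partitions")
  .config("spark.executor.instances", "1")
  .config("spark.executor.cores", "1")
  .getOrCreate()

// 500 partitions -> 500 tasks, all scheduled onto the single executor in turn
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 500)
println(rdd.getNumPartitions)  // 500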

1 Executor per partition

2018-04-04 Thread Thodoris Zois
Hello list! I am trying to familiarize myself with Apache Spark. I would like to ask something about partitioning and executors. Can I have e.g. 500 partitions but launch only one executor that will run operations on only 1 partition of the 500? And then I would like my job to die. Is there any

Issue with nested JSON parsing in to data frame

2018-04-04 Thread Ritesh Shah
Hello, I am using Apache Spark 2.2.1 with Scala. I am trying to load the JSON below from Kafka and to extract "JOBTYPE" and "LOADID" from the nested JSON object. Need help with the extraction logic. Code: val workRequests = new StructType().add("after", new StructType()
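A hedged sketch of one way to do the extraction with from_json, assuming the message arrives from Kafka as a JSON string in a "value" column and that JOBTYPE and LOADID sit under the "after" object (the field types are guesses):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

val schema = new StructType()
  .add("after", new StructType()
    .add("JOBTYPE", StringType)
    .add("LOADID", StringType))

// kafkaDf: DataFrame read from Kafka with a binary/string "value" column
val parsed = kafkaDf
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select(col("data.after.JOBTYPE").as("jobType"),
          col("data.after.LOADID").as("loadId"))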

Re: Scala program to spark-submit on k8 cluster

2018-04-04 Thread purna pradeep
yes, “REST application that submits a Spark job to a k8s cluster by running spark-submit programmatically”, and I would also like to expose it as a Kubernetes service so that clients can access it like any other REST API. On Wed, Apr 4, 2018 at 12:25 PM Yinan Li wrote: > Hi Kittu, > >
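One hedged way to do the "submit programmatically" part is the SparkLauncher API that ships with Spark; the master URL, jar path, main class and image below are placeholders, not values from this thread:

import org.apache.spark.launcher.SparkLauncher

// Launches spark-submit from inside a JVM process (e.g. the REST service)
val handle = new SparkLauncher()
  .setMaster("k8s://https://<k8s-apiserver>:<port>")    // placeholder master URL
  .setDeployMode("cluster")
  .setAppResource("local:///opt/spark/jars/my-app.jar") // placeholder jar inside the image
  .setMainClass("com.example.MyApp")                    // placeholder main class
  .setConf("spark.kubernetes.container.image", "<spark-image>")
  .startApplication()                                   // returns a SparkAppHandle to poll

println(handle.getState)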

Re: Scala program to spark-submit on k8 cluster

2018-04-04 Thread Yinan Li
Hi Kittu, What do you mean by "a Scala program"? Do you mean a program that submits a Spark job to a k8s cluster by running spark-submit programmatically, or some example Scala application that is to run on the cluster? On Wed, Apr 4, 2018 at 4:45 AM, Kittu M wrote: > Hi,

ClassCastException: java.sql.Date cannot be cast to java.lang.String in Scala

2018-04-04 Thread anbu
Could someone please help me fix the error below in spark 2.1.0 / scala-2.11.8? Basically I'm migrating the code from spark 1.6.0 to spark-2.1.0. I'm getting the below exception in spark 2.1.0 Error: java.lang.ClassCastException: java.sql.Date cannot be cast to java.lang.String at
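Without the code it is only a guess, but a common cause after moving to Spark 2.x is reading a DateType column out of a Row as a String; a small sketch of reading it with the matching type (the column name is hypothetical):

import java.sql.Date

// row: org.apache.spark.sql.Row containing a DateType column
val d: Date = row.getAs[Date]("load_date")            // read with the real type
val s: String = row.getAs[Date]("load_date").toString // convert explicitly if a String is needed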

Scala program to spark-submit on k8 cluster

2018-04-04 Thread Kittu M
Hi, I’m looking for a Scala program to spark-submit a Scala application (spark 2.3 job) on a k8s cluster. Any help would be much appreciated. Thanks

Re: NumberFormatException while reading and split the file

2018-04-04 Thread utkarsh_deep
Response to the 1st approach: When you do spark.read.text("/xyz/a/b/filename") it returns a DataFrame, and calling the rdd method on it gives you an RDD[Row], so when you use map, your function gets a Row as the parameter, i.e. ip in your code. Therefore you must use the Row methods to access its
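A minimal sketch of that point, reusing the path from the thread: pull the line out of the Row with getString(0) before splitting it.

val newRdd = spark.read.text("/xyz/a/b/filename").rdd
// Row -> underlying String -> split, instead of calling split on the Row itself
val splitRdd = newRdd.map(row => row.getString(0).split("\\|"))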

Re: How to delete empty columns in df when writing to parquet?

2018-04-04 Thread Junfeng Chen
Our users ask for it. Regards, Junfeng Chen On Wed, Apr 4, 2018 at 5:45 PM, Gourav Sengupta wrote: > Hi Junfeng, > > can I ask why it is important to remove the empty column? > > Regards, > Gourav Sengupta > > On Tue, Apr 3, 2018 at 4:28 AM, Junfeng Chen

Re: How to delete empty columns in df when writing to parquet?

2018-04-04 Thread Gourav Sengupta
Hi Junfeng, can I ask why it is important to remove the empty column? Regards, Gourav Sengupta On Tue, Apr 3, 2018 at 4:28 AM, Junfeng Chen wrote: > I am trying to read data from kafka and writing them in parquet format via > Spark Streaming. > The problem is, the data
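One hedged way to drop columns that are entirely null before writing, assuming df is the batch DataFrame from the stream and the output path is a placeholder:

import org.apache.spark.sql.functions.{col, count}

// count() skips nulls, so a column whose count is 0 is empty in this batch
val nonNullCounts = df.select(df.columns.map(c => count(col(c)).alias(c)): _*).head()
val keep = df.columns.filter(c => nonNullCounts.getAs[Long](c) > 0L)

df.select(keep.map(col): _*).write.parquet("/path/to/output")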

Building Data Warehouse Application in Spark

2018-04-04 Thread Mahender Sarangam
Hi, Does anyone have a good architecture document or design principles for building a warehouse application using Spark? Is it better to create a HiveContext and perform transformations with HQL, or to load files directly into dataframes and perform the data transformations there? We need to implement SCD

Re: run huge number of queries in Spark

2018-04-04 Thread Georg Heiler
See https://gist.github.com/geoHeil/e0799860262ceebf830859716bbf in particular: You will probably want to use Spark's imperative (non-SQL) API: .rdd .reduceByKey { (count1, count2) => count1 + count2 }.map { case ((word, path), n) => (word, (path, n)) }.toDF i.e. it builds an inverted index which
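A hedged expansion of that fragment for the document-frequency case: one reduceByKey pass instead of one query per term. Here docs is assumed to be a DataFrame with (docId: String, text: String) columns, which is not spelled out in the thread.

// spark is the active SparkSession; implicits are needed for toDF on an RDD of tuples
import spark.implicits._

val docFreq = docs.rdd
  .flatMap { row =>
    val docId = row.getString(0)
    // distinct so each document counts at most once per term
    row.getString(1).toLowerCase.split("\\s+").distinct.map(term => (term, docId))
  }
  .map { case (term, _) => (term, 1) }
  .reduceByKey(_ + _)                      // number of documents containing each term
  .toDF("term", "documentFrequency")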

run huge number of queries in Spark

2018-04-04 Thread Donni Khan
Hi all, I want to run a huge number of queries on a DataFrame in Spark. I have a large set of text documents; I loaded all documents into a Spark DataFrame and created a temp table. dataFrame.registerTempTable("table1"); I have more than 50,000 terms, and I want to get the document frequency for each by

NumberFormatException while reading and split the file

2018-04-04 Thread anbu
1st Approach: error: value split is not a member of org.apache.spark.sql.Row?

val newRdd = spark.read.text("/xyz/a/b/filename").rdd
anotherRDD = newRdd.map(ip => ip.split("\\|"))
  .map(ip => Row(if (ip(0).isEmpty()) { null.asInstanceOf[Int] }