Re: Error using .collect()

2019-05-13 Thread Shahab Yunus
Kumar sp, collect() brings all the data represented by the RDD/DataFrame into the memory of the single machine that is acting as the driver. You will run out of memory if the underlying RDD/DataFrame represents a large volume of data distributed across several machines. If your data is huge even
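
A minimal Scala sketch of the point being made, assuming a spark-shell session where `spark` is already defined; the S3 path is made up. collect() pulls everything to the driver, while take/limit/toLocalIterator keep the driver's footprint bounded.

```scala
// Assumes a spark-shell session; the input path is hypothetical.
val df = spark.read.parquet("s3://some-bucket/large-table")

// collect() materializes every row in the driver's heap -- fine for small
// results, an OOM risk when the data is large and spread over many executors.
// val allRows = df.collect()

// Bounded alternatives that avoid pulling the whole dataset to the driver:
val sample = df.take(100)         // brings back only 100 rows
df.limit(100).show()              // bounded preview
val iter = df.toLocalIterator()   // streams rows partition by partition
```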

Spark SQL LIMIT Gets Stuck

2019-05-01 Thread Shahab Yunus
Hi there. I have a Hive external table (storage format is ORC, data stored on S3, partitioned on one bigint-type column) that I am trying to query through the pyspark (or spark-shell) shell. df.count() fails with lower values of the LIMIT clause with the following exception (seen in the Spark UI). df.show()
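
For context, a hedged spark-shell sketch of the query shape described above; the database, table, and LIMIT value are invented, and `spark` is assumed to have Hive support enabled.

```scala
// Hypothetical names; the real table is a Hive external table backed by
// ORC files on S3 and partitioned on a bigint column.
val df = spark.sql("SELECT * FROM my_db.my_orc_table LIMIT 10")

df.count()   // the step reported to get stuck / fail for small LIMIT values
df.show()
```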

Re: How to get all input tables of a SPARK SQL 'select' statement

2019-01-23 Thread Shahab Yunus
This could be a tangential idea, but it might help: why not use the queryExecution and logicalPlan objects that are available when you execute a query using SparkSession and get a DataFrame back? The JSON representation contains almost all the info that you need, and you don't need to go to Hive to get this
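
A sketch of that idea, assuming a spark-shell session; the query and table names are invented, and the exact plan-node classes can differ across Spark versions.

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

// Invented query; only the mechanics matter here.
val df = spark.sql(
  "SELECT a.id, b.city FROM sales.orders a JOIN sales.customers b ON a.id = b.id")

// The parsed (unresolved) logical plan still carries the raw table references.
val plan = df.queryExecution.logical

// JSON dump of the plan tree -- the relation nodes include the table names.
println(plan.toJSON)

// Or walk the tree and collect the input tables directly.
val inputTables = plan.collect { case r: UnresolvedRelation => r.tableName }
```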

Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Shahab Yunus
> Original DF -> Iterate -> Pass every element to a function that takes the element of the original DF and returns a new dataframe including all the matching terms

Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Shahab Yunus
Can you have a dataframe with a column that stores JSON (type string)? Or you could have a column of array type in which you store all the cities matching your query. On Fri, Dec 28, 2018 at 2:48 AM wrote: > Hi community, > As shown in other answers online, Spark does not support the
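
A short Scala sketch of both suggestions, with made-up data; assumes a spark-shell session.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{collect_list, struct, to_json}

// Made-up data standing in for the "matching cities" result.
val matches = Seq(
  ("alice", "Boston"), ("alice", "Berlin"), ("bob", "Tokyo")
).toDF("user", "city")

// Option 1: an array-type column holding all matching cities per key.
val asArray = matches.groupBy("user")
  .agg(collect_list("city").as("matching_cities"))

// Option 2: store a nested structure as a JSON string column.
val asJson = matches.withColumn("payload", to_json(struct($"user", $"city")))

asArray.show(false)
asJson.show(false)
```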

Re: Add column value in the dataset on the basis of a condition

2018-12-18 Thread Shahab Yunus
-conditionally On Tue, Dec 18, 2018 at 9:55 AM Shahab Yunus wrote: > Have you tried using withColumn? You can add a boolean column based on whether the age exists or not and then drop the older age column. You wouldn't need a union of dataframes then. > On Tue, Dec 18, 2018 at 8

Re: Add column value in the dataset on the basis of a condition

2018-12-18 Thread Shahab Yunus
Have you tried using withColumn? You can add a boolean column based on whether the age exists or not and then drop the older age column. You wouldn't need a union of dataframes then. On Tue, Dec 18, 2018 at 8:58 AM Devender Yadav wrote: > Hi All, > useful code: > public class EmployeeBean
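
A minimal sketch of the withColumn suggestion, with an invented schema; assumes a spark-shell session.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{col, lit, when}

// Invented employee data where "age" may be missing.
val employees = Seq(("dev1", Some(30)), ("dev2", None: Option[Int])).toDF("name", "age")

val withFlag = employees
  .withColumn("has_age", when(col("age").isNotNull, lit(true)).otherwise(lit(false)))
  .drop("age")   // drop the old column once the derived one exists

withFlag.show()
```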

Re: Parallel read parquet file, write to postgresql

2018-12-03 Thread Shahab Yunus
Hi James. --num-executors is used to control the number of parallel tasks (one per executor) running for your application. For reading and writing data in parallel, data partitioning is employed. You can look here for a quick intro to how data partitioning works:
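
A hedged sketch of the read-repartition-write pattern being discussed; the path, partition count, and connection details are all made up.

```scala
import java.util.Properties

// Hypothetical input path and Postgres connection details.
val df = spark.read.parquet("s3://bucket/events.parquet")

// The write parallelism follows the number of partitions: one task
// (and one JDBC connection) per partition.
val repartitioned = df.repartition(16)

val props = new Properties()
props.setProperty("user", "postgres")
props.setProperty("password", "secret")
props.setProperty("driver", "org.postgresql.Driver")

repartitioned.write
  .mode("append")
  .jdbc("jdbc:postgresql://db-host:5432/analytics", "public.events", props)
```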

Re: Convert RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]

2018-12-03 Thread Shahab Yunus
Curious why you think this is not smart code? On Mon, Dec 3, 2018 at 8:04 AM James Starks wrote: > By taking your advice to use flatMap, I can now convert the result from RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]. Basically I just perform flatMap at the end before starting to convert the RDD
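
A compact sketch of the conversion being discussed, with an invented case class; assumes a spark-shell session.

```scala
import org.apache.spark.rdd.RDD

case class MyCaseClass(id: Long, value: String)

// Stand-in for the nested RDD described in the thread.
val nested: RDD[Iterable[MyCaseClass]] = spark.sparkContext.parallelize(
  Seq[Iterable[MyCaseClass]](
    Seq(MyCaseClass(1L, "a"), MyCaseClass(2L, "b")),
    Seq(MyCaseClass(3L, "c"))
  ))

// flatMap unwraps each Iterable, yielding a flat RDD of case class instances.
val flat: RDD[MyCaseClass] = nested.flatMap(identity)
```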

Re: Creating spark Row from database values

2018-09-26 Thread Shahab Yunus
Hi there. Have you seen this link? https://medium.com/@mrpowers/manually-creating-spark-dataframes-b14dae906393 It shows you multiple ways to manually create a dataframe. Hope it helps. Regards, Shahab. On Wed, Sep 26, 2018 at 8:02 AM Kuttaiah Robin wrote: > Hello, > Currently I have
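
One of the standard patterns for this, sketched in Scala with made-up values standing in for rows fetched from a database; assumes a spark-shell session.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical values standing in for rows read from a database.
val rows = Seq(Row(1, "alice"), Row(2, "bob"))

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// Build a DataFrame directly from Row objects plus an explicit schema.
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.show()
```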

Unsubscribe

2018-04-23 Thread Shahab Yunus
Unsubscribe

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
> If no, then you may use the value's hash code as a category or combine all columns into a single vector using HashingTF. > Regards, Filipp. > On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
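
A sketch of the HashingTF suggestion, with invented high-cardinality columns; the feature width is an arbitrary choice, and a spark-shell session is assumed.

```scala
import spark.implicits._
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.functions.array

// Invented high-cardinality categorical columns.
val data = Seq(("id_123", "city_987"), ("id_456", "city_654")).toDF("user_id", "city_id")

// Pack the categorical values into one array column, then hash them into a
// fixed-width vector -- no label-to-index map has to be held on the driver.
val packed = data.withColumn("cats", array($"user_id", $"city_id"))

val hasher = new HashingTF()
  .setInputCol("cats")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)   // vector width is fixed regardless of cardinality

hasher.transform(packed).select("features").show(false)
```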

StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
Does the StringIndexer keep all the mapped label-to-index pairs in the memory of the driver machine? It seems so, unless I am missing something. What if our data that needs to be

Warnings on data insert into Hive Table using PySpark

2018-03-19 Thread Shahab Yunus
Hi there. When I try to insert data into Hive tables using the following query, I get the warnings below. The data is inserted fine (the query also works without warnings directly in the Hive CLI). What is the reason for these warnings, and how can we get rid of them? I am using pyspark
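
Since the original query isn't shown, here is only a generic sketch of the insert pattern in question; the table name and values are invented, and the same spark.sql call is what pyspark runs under the hood.

```scala
// Assumes a session built with Hive support; names and values are invented.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("hive-insert")
  .enableHiveSupport()
  .getOrCreate()

// Plain SQL insert into an existing Hive table.
spark.sql("INSERT INTO TABLE my_db.events VALUES (1, 'click'), (2, 'view')")

// Equivalent DataFrame-writer form.
import spark.implicits._
Seq((3, "view")).toDF("id", "action").write.mode("append").insertInto("my_db.events")
```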

Accessing Scala RDD from pyspark

2018-03-15 Thread Shahab Yunus
Hi there. I am calling custom Scala code from pyspark (interpreter). The custom Scala code is simple: it just reads a text file using sparkContext.textFile and returns an RDD[String]. In pyspark, I am using sc._jvm to make the call to the Scala code: s_rdd =
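
A sketch of what the Scala side described above might look like; the package and object names are invented, and the pyspark call path in the closing comment is the usual JVM-gateway pattern rather than code quoted from the thread.

```scala
package com.example.bridge   // invented package name

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// The Scala side described in the thread: read a text file, return RDD[String].
object TextReader {
  def readLines(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)
}

// From the pyspark shell this is typically reached through the JVM gateway,
// roughly: sc._jvm.com.example.bridge.TextReader.readLines(sc._jsc.sc(), path)
// -- the result is a py4j handle to the Scala RDD, not a pyspark RDD.
```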

Spark Usecase

2014-06-04 Thread Shahab Yunus
Hello All. I have a newbie question. We have a use case where a huge amount of data will be coming in streams or micro-batches of streams, and we want to process these streams according to some business logic. We don't have to provide extremely low-latency guarantees, but batch M/R will still be
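
Given the 2014 timeframe, a minimal Spark Streaming (DStream) sketch of the micro-batch shape being described; the source, host, port, and batch interval are placeholders, and the transformations stand in for the unspecified business logic.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder source and interval; the point is the micro-batch processing shape.
val conf = new SparkConf().setAppName("stream-processing")
val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

val lines = ssc.socketTextStream("stream-host", 9999)

// Business logic is applied per micro-batch as ordinary RDD-style transformations.
val processed = lines.filter(_.nonEmpty).map(_.toUpperCase)
processed.print()

ssc.start()
ssc.awaitTermination()
```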