Re: Spark Dataset withColumn issue

2020-11-12 Thread Subash Prabakar
Hi Vikas,

He suggested using the select() function after your withColumn call.

val ds1 = ds.select("Col1", "Col3")
  .withColumn("Col2", lit("sample"))
  .select("Col1", "Col2", "Col3")


Thanks,
Subash

On Thu, Nov 12, 2020 at 9:19 PM Vikas Garg  wrote:

> I am deriving Col2 using withColumn, which is why I can't use it like
> you told me
>
> On Thu, Nov 12, 2020, 20:11 German Schiavon 
> wrote:
>
>> ds.select("Col1", "Col2", "Col3")
>>
>> On Thu, 12 Nov 2020 at 15:28, Vikas Garg  wrote:
>>
>>> In a Spark Dataset, if we add an additional column using withColumn,
>>> then the column is added at the end.
>>>
>>> e.g.
>>> val ds1 = ds.select("Col1", "Col3").withColumn("Col2", lit("sample"))
>>>
>>> then the order of columns is >> Col1  |  Col3  |  Col2
>>>
>>> I want the order to be  >> Col1  |  Col2  |  Col3
>>>
>>> How can I achieve this?
>>>
>>


Re: Going it alone.

2020-04-15 Thread Subash Prabakar
Looks like he had a very bad appraisal this year.. Fun fact: the coming
year will be too :)

On Thu, 16 Apr 2020 at 12:07, Qi Kang  wrote:

> Well man, check your attitude, you’re way over the line
>
>
> On Apr 16, 2020, at 13:26, jane thorpe 
> wrote:
>
> F*U*C*K O*F*F
> C*U*N*T*S
>
>
> --
> On Thursday, 16 April 2020 Kelvin Qin  wrote:
>
> No wonder I said I couldn't understand what that mail expresses; it feels
> like a joke……
>
>
>
>
>
> On 2020-04-16 02:28:49, seemanto.ba...@nomura.com.INVALID wrote:
>
> Have we been tricked by a bot ?
>
>
> From: Matt Smith 
> Sent: Wednesday, April 15, 2020 2:23 PM
> To: jane thorpe
> Cc: dh.lo...@gmail.com; user@spark.apache.org; janethor...@aol.com;
> em...@yeikel.com
> Subject: Re: Going it alone.
>
>
>
> This is so entertaining.
>
>
> 1. Ask for help
>
> 2. Compare those you need help from to a lower order primate.
>
> 3. Claim you provided information you did not
>
> 4. Explain that providing any information would be "too revealing"
>
> 5. ???
>
>
> Can't wait to hear what comes next, but please keep it up.  This is a
> bright spot in my day.
>
>
>
> On Tue, Apr 14, 2020 at 4:47 PM jane thorpe 
> wrote:
>
> I did write a long email in response to you.
> But then I deleted it because I felt it would be too revealing.
>
>
>
>
> --
>
> On Tuesday, 14 April 2020 David Hesson  wrote:
>
> I want to know  if Spark is headed in my direction.
>
> You are implying  Spark could be.
>
>
>
> What direction are you headed in, exactly? I don't feel as if anything
> were implied when you were asked for use cases or what problem you are
> solving. You were asked to identify some use cases, of which you don't
> appear to have any.
>
>
> On Tue, Apr 14, 2020 at 4:49 PM jane thorpe 
> wrote:
>
> That's what  I want to know,  Use Cases.
> I am looking for  direction as I described and I want to know  if Spark is
> headed in my direction.
>
> You are implying  Spark could be.
>
> So tell me about the USE CASES and I'll do the rest.
> --
>
> On Tuesday, 14 April 2020 yeikel valdes  wrote:
>
> It depends on your use case. What are you trying to solve?
>
>
>
> On Tue, 14 Apr 2020 15:36:50 -0400, janethor...@aol.com.INVALID wrote:
>
> Hi,
>
> I consider myself to be quite good at software development, especially
> using frameworks.
>
> I like to get my hands dirty. I have spent the last few months
> understanding modern frameworks and architectures.
>
> I am looking to invest my energy in a product where I don't have to
> rely on the monkeys which occupy this space we call software
> development.
>
> I have found one that meets my requirements.
>
> Would Apache Spark be a good tool for me, or do I need to be a member of a
> team to develop products using Apache Spark?
>
>
>
>
>
>
>
>


Apache Arrow support for Apache Spark

2020-02-16 Thread Subash Prabakar
Hi Team,

I have two questions regarding Arrow and Spark integration,

1. I am joining two huge tables (1 PB each) - will there be a significant
performance gain if I use the Arrow format before shuffling? Will the
serialization/deserialization cost improve significantly?

2. Can we store the final data in Arrow format on HDFS and read it back in
another Spark application? If so, how could I do that?
Note: The dataset is transient - the split into separate applications is only
for easier management. Spark already handles resiliency internally, but the
two applications use different languages (in our case Java and Python).
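For question 2, here is a hedged sketch of the cross-application handoff I
have in mind, using Parquet as the on-disk format (an assumption on my side;
whether Arrow itself can serve as the storage format is exactly what I am
asking). The path is illustrative:

// Writer side (our Java/Scala application)
finalDf.write.mode("overwrite").parquet("hdfs:///exchange/final_dataset")

// Reader side would be the second application (PySpark in our case), e.g.
// spark.read.parquet("hdfs:///exchange/final_dataset")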

Thanks,
Subash


Re: [Spark SQL] failure in query

2019-08-29 Thread Subash Prabakar
What is the number of part files in that big table? And what is the
distribution of request_id? Is the variance of that column low or high?
The PARTITION BY clause of the window function will move all rows with the
same request_id to one executor, so if a single request_id carries a huge
amount of data it can overload that executor.
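A quick way to eyeball that distribution before running the window function
(a rough sketch; the table name is a placeholder for your 1.5TB raw table):

spark.sql("""
  SELECT request_id, COUNT(*) AS cnt
  FROM raw_data_table
  GROUP BY request_id
  ORDER BY cnt DESC
  LIMIT 20
""").show()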

On Sun, 25 Aug 2019 at 16:56, Tzahi File  wrote:

> Hi,
>
> I encountered some issues running a Spark SQL query, and would be happy to
> get some advice.
> I'm trying to run a query on a very big data set (around 1.5TB) and it
> keeps failing on every try. A template of the query is below:
> insert overwrite table partition(part)
> select  /*+ BROADCAST(c) */
>  *, row_number() over (partition by request_id order by economic_value
> DESC) row_number
> from (
> select a,b,c,d,e
> from table (raw data 1.5TB))
> left join small_table
>
> The heavy part of this query is the window function.
> I'm using 65 spot instances of type 5.4x.large.
> The spark settings:
> --conf spark.driver.memory=10g
> --conf spark.sql.shuffle.partitions=1200
> --conf spark.executor.memory=22000M
> --conf spark.shuffle.service.enabled=false
>
>
> You can see below an example of the errors that I get:
> [image: image.png]
>
>
> any suggestions?
>
>
>
> Thanks!
> Tzahi
>


Re: Caching tables in spark

2019-08-28 Thread Subash Prabakar
When you say process, do you mean two separate Spark jobs, or two stages
within the same Spark application?
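If it is the latter (both pipelines in one application), here is a minimal
sketch of sharing a single cached read; the table name and the two downstream
transformations are placeholders. Note that a cached DataFrame cannot be
shared across two separate applications without an external store or a
persisted intermediate table.

import org.apache.spark.storage.StorageLevel

val raw = spark.table("raw_data_table")      // placeholder table name
raw.persist(StorageLevel.MEMORY_AND_DISK)
raw.count()                                  // materialize the cache once

// both downstream pipelines then reuse `raw`, e.g.
val out1 = raw.filter("col_a IS NOT NULL")   // stand-in for process 1
val out2 = raw.groupBy("col_b").count()      // stand-in for process 2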

Thanks
Subash

On Wed, 28 Aug 2019 at 19:06,  wrote:

> Take a look at this article
>
>
>
>
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html
>
>
>
> From: Tzahi File 
> Sent: Wednesday, August 28, 2019 5:18 AM
> To: user 
> Subject: Caching tables in spark
>
>
>
> Hi,
>
>
>
> Looking to tap into your knowledge with a question.
>
> I have 2 different processes that read from the same raw data table
> (around 1.5 TB).
>
> Is there a way to read this data once, cache it somehow, and use it in
> both processes?
>
>
>
>
>
> Thanks
>
> --
>
> *Tzahi File*
> Data Engineer
>
>
>
>
>


Re: Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-12 Thread Subash Prabakar
I had a similar issue reading an external Parquet table. In my case I had a
permission issue on one partition, so I added a filter to exclude that
partition, but Spark still didn't prune it. Then I read that, in order to be
aware of all the partitions, Spark first lists the folders and then updates
its metastore cache, and only then is the SQL applied on top of it, instead
of using the existing Hive SerDe. This property applies only to Parquet
files.

Hive metastore Parquet table conversion


When reading from and writing to Hive metastore Parquet tables, Spark SQL
will try to use its own Parquet support instead of Hive SerDe for better
performance. This behavior is controlled by the
spark.sql.hive.convertMetastoreParquet configuration, and is turned on by
default.

Reference:
https://spark.apache.org/docs/2.3.0/sql-programming-guide.html

Set the above property to false. It should work.
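For example, in spark-shell (a minimal sketch, assuming a SparkSession named
spark):

spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
// or equivalently:
// spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")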

If anyone has a better explanation, please let me know - I have the same
question: why does only Parquet have this problem?

Thanks
Subash

On Fri, 9 Aug 2019 at 16:18, Hao Ren  wrote:

> Hi Mich,
>
> Thank you for your reply.
> I need to be more clear about the environment. I am using spark-shell to
> run the query.
> Actually, the query works even without core-site, hdfs-site being under
> $SPARK_HOME/conf.
> My problem is efficiency: all of the partitions were scanned, instead of
> just the one in question, during the execution of the Spark SQL query.
> This is why this simple query takes so much time.
> I would like to know how to improve this by reading only the specific
> partition in question.
>
> Feel free to ask more questions if I am not clear.
>
> Best regards,
> Hao
>
> On Thu, Aug 8, 2019 at 9:05 PM Mich Talebzadeh 
> wrote:
>
>> You also need the others as well, via soft links (check with ls -l):
>>
>> cd $SPARK_HOME/conf
>>
>> hive-site.xml -> ${HIVE_HOME}/conf/hive-site.xml
>> core-site.xml -> ${HADOOP_HOME}/etc/hadoop/core-site.xml
>> hdfs-site.xml -> ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>>
>>
>> On Thu, 8 Aug 2019 at 15:16, Hao Ren  wrote:
>>
>>>
>>>
>>> -- Forwarded message -
>>> From: Hao Ren 
>>> Date: Thu, Aug 8, 2019 at 4:15 PM
>>> Subject: Re: Spark SQL reads all leaf directories on a partitioned Hive
>>> table
>>> To: Gourav Sengupta 
>>>
>>>
>>> Hi Gourva,
>>>
>>> I am using enableHiveSupport.
>>> The table was not created by Spark; it already exists in Hive.
>>> All I did was read it using a SQL query in Spark.
>>> FYI, I put hive-site.xml in the spark/conf/ directory to make sure that
>>> Spark can access Hive.
>>>
>>> Hao
>>>
>>> On Thu, Aug 8, 2019 at 1:24 PM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
 Hi,

 Just out of curiosity did you start the SPARK session using
 enableHiveSupport() ?

 Or are you creating the table using SPARK?


 Regards,
 Gourav

 On Wed, Aug 7, 2019 at 3:28 PM Hao Ren  wrote:

> Hi,
> I am using Spark SQL 2.3.3 to read a hive table which is partitioned
> by day, hour, platform, request_status and is_sampled. The underlying data
> is in parquet format on HDFS.
> Here is the SQL query to read just *one partition*.
>
> ```
> spark.sql("""
> SELECT rtb_platform_id, SUM(e_cpm)
> FROM raw_logs.fact_request
> WHERE day = '2019-08-01'
> AND hour = '00'
> AND platform = 'US'
> AND request_status = '3'
> AND is_sampled = 1
> GROUP BY rtb_platform_id
> """).show
> ```
>
> However, from the Spark web UI, the stage description shows:
>
> ```
> Listing leaf files and directories for 201616 paths:
> viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0,
> ...
> ```
>
> It seems the job is reading all of the partitions of the table and the
> job takes too long for just one partition. One workaround is using
> `spark.read.parquet` API to read parquet files directly. Spark has
> partition-awareness for partitioned directories.
>
> But still, I would like to know if there is a way to leverage
> partition-awareness via Hive by using `spark.sql` API?
>
> Any help is highly appreciated!
>
> Thank you.
>

Spark Dataframe NTILE function

2019-06-12 Thread Subash Prabakar
Hi,

I am running the Spark DataFrame NTILE window function over a huge dataset -
it spills a lot of data while sorting and eventually fails.

The data size is roughly 80 million records, about 4 GB in size (not sure
whether that is serialized or deserialized) - I am calculating NTILE(10) for
all these records, ordered by one metric.

A few stats are below. I need help finding alternatives - has anyone
benchmarked the highest load this function can handle?

The snapshot below shows the NTILE calculation for two columns separately -
each runs, and that final single partition is where the complete data ends
up. In other words, the window function moves everything into one final
partition to compute NTILE - which is 80M records in my case.

[image: Screen Shot 2019-06-13 at 12.51.56 AM.jpg]

Executor memory is 8G - with shuffle.storageMemory of 0.8 => so it is 5.5G

So the 80M records show up in the stage-level metrics as below,

Shuffle read size / records
[image: Screen Shot 2019-06-13 at 12.51.33 AM.jpg]

Is there any alternative, or is it simply not feasible to perform this
operation with Spark SQL functions?
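One alternative I have been considering (a hedged sketch, not tested at this
scale): compute approximate decile boundaries with approxQuantile and then
bucket the metric with when/otherwise, which avoids pulling all 80M rows into
one window partition. The DataFrame df and the column name "metric" are
placeholders.

import org.apache.spark.sql.functions.{col, lit, when}

val probs = (1 to 9).map(_ / 10.0).toArray
val cuts  = df.stat.approxQuantile("metric", probs, 0.001)   // 9 decile edges

val withDecile = df.withColumn(
  "decile",
  cuts.zipWithIndex.foldLeft(lit(1)) { case (acc, (cut, i)) =>
    // values above the i-th edge belong at least to bucket i+2
    when(col("metric") > cut, lit(i + 2)).otherwise(acc)
  }
)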

Thanks,
Subash


Feature engineering ETL for machine learning

2019-04-20 Thread Subash Prabakar
Hi,

I have a series of queries to extract data from multiple tables in Hive and
then do feature engineering on the extracted final data. I can run the
queries using Spark SQL and use MLlib to perform the feature transformations
I need. The question is: do you use any kind of tool to orchestrate this
workflow of query execution, or is there any tool that, given a
template/JSON, will construct the Spark SQL queries for me?
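To be concrete, the kind of thing I have in mind is a tiny driver like the
sketch below; the query file path and the one-statement-per-line format are
just assumptions for illustration:

// Read a list of SQL statements (one per line) and run them in order.
val queries = spark.read.textFile("hdfs:///configs/feature_queries.sql").collect()
queries.map(_.trim).filter(_.nonEmpty).foreach { q =>
  spark.sql(q)   // each statement builds or overwrites an intermediate table
}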


Thanks
Subash


Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread Subash Prabakar
Hey Rajat,

The documentation page is self-explanatory.

You can refer to this page for more configs:

https://spark.apache.org/docs/2.0.0/configuration.html
(or the corresponding page for any other Spark version)
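As a rough illustration of how the three usually appear on the command line
(jar names and paths are made up): --jars ships the listed jars with the
application and puts them on both the driver and executor classpaths, while
the extraClassPath settings only prepend entries that must already exist on
the target machines.

spark-submit \
  --master yarn --deploy-mode cluster \
  --jars /local/libs/dep1.jar,/local/libs/dep2.jar \
  --conf spark.driver.extraClassPath=/opt/shared/preinstalled.jar \
  --conf spark.executor.extraClassPath=/opt/shared/preinstalled.jar \
  --class com.example.Main my-app.jar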

Thanks.
Subash

On Sat, 20 Apr 2019 at 16:04, rajat kumar 
wrote:

> Hi,
>
> Can anyone pls explain ?
>
>
> On Mon, 15 Apr 2019, 09:31 rajat kumar 
>> Hi All,
>>
>> I came across different parameters in spark submit
>>
>> --jars , --spark.executor.extraClassPath , --spark.driver.extraClassPath
>>
>> What are the differences between them? When to use which one? Will it
>> differ
>> if I use following:
>>
>> --master yarn --deploy-mode client
>> --master yarn --deploy-mode cluster
>>
>>
>> Thanks
>> Rajat
>>
>


Difference between Checkpointing and Persist

2019-04-18 Thread Subash Prabakar
Hi All,

I have a doubt about checkpointing versus persisting/saving.

Say we have one RDD containing huge data:
1. We checkpoint it and perform a join
2. We persist it as StorageLevel.MEMORY_AND_DISK and perform a join
3. We save that intermediate RDD, read it back, and perform a join (saving is
just to persist the intermediate result before joining)


Which of the above is faster, and what is the difference?
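For reference, a minimal sketch of the three variants, assuming bigRdd and
otherRdd are pair RDDs of type RDD[(String, Int)] and the HDFS paths are
illustrative:

import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///tmp/checkpoints")

// 1. Checkpoint (materialized on the next action), then join
bigRdd.checkpoint()
val joined1 = bigRdd.join(otherRdd)

// 2. Persist to memory with spill to disk, then join
bigRdd.persist(StorageLevel.MEMORY_AND_DISK)
val joined2 = bigRdd.join(otherRdd)

// 3. Save the intermediate RDD, read it back, then join
bigRdd.saveAsObjectFile("hdfs:///tmp/intermediate")
val reloaded = sc.objectFile[(String, Int)]("hdfs:///tmp/intermediate")
val joined3 = reloaded.join(otherRdd)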


Thanks,
Subash


Spark2: Deciphering saving text file name

2019-04-08 Thread Subash Prabakar
Hi,
While saving a text file in Spark 2, I see an encoded/hash value attached to
the part file names along with the part number. I am curious to know what
that value is about.

Example:
ds.write.mode(SaveMode.Overwrite).option("compression", "gzip").text(path)

Produces,
part-1-1e4c5369-6694-4012-894a-73b971fe1ab1-c000.txt.gz


1e4c5369-6694-4012-894a-73b971fe1ab1-c000 => what is this value ?

Are there any options available to remove this part, or is it attached for
some reason?
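In case it helps, one possible workaround (a rough sketch, not something I
have settled on): rename the part files after the write using the Hadoop
FileSystem API. The path variable and the target naming are illustrative.

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.globStatus(new Path(path + "/part-*")).zipWithIndex.foreach {
  case (status, i) =>
    fs.rename(status.getPath, new Path(path + f"/part-$i%05d.txt.gz"))
}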

Thanks,
Subash