Re: Going it alone.

2020-04-14 Thread yeikel valdes
w if Spark is headed in my direction. You are implying Spark could be. So tell me about the USE CASES and I'll do the rest. On Tuesday, 14 April 2020 yeikel valdes wrote: It depends on your use case. What are you trying to solve? On Tue, 14 Apr 2020 15:36:50 -0400 jan

Re: Going it alone.

2020-04-14 Thread yeikel valdes
It depends on your use case. What are you trying to solve?  On Tue, 14 Apr 2020 15:36:50 -0400 janethor...@aol.com.INVALID wrote Hi, I consider myself to be quite good in Software Development especially using frameworks. I like to get my hands  dirty. I have spent the last few mo

What is the best way to take the top N entries from a hive table/data source?

2020-04-13 Thread yeikel valdes
When I use .limit(), the returned DataFrame has only 1 partition, which normally makes most jobs fail. val df = spark.sql("select * from table limit n") df.write.parquet() Thanks!

Re: Serialization or internal functions?

2020-04-07 Thread yeikel valdes
Thanks for your input, Soma, but I am actually looking to understand the differences, not only the performance. On Sun, 05 Apr 2020 02:21:07 -0400 somplastic...@gmail.com wrote If you want to measure optimisation in terms of time taken, then here is an idea :) public

Re: IDE suitable for Spark

2020-04-07 Thread yeikel valdes
Zeppelin is not an IDE but a notebook.  It is helpful to experiment but it is missing a lot of the features that we expect from an IDE. Thanks for sharing though.  On Tue, 07 Apr 2020 04:45:33 -0400 zahidr1...@gmail.com wrote When I first logged on I asked if there was a suitable

What options do I have to handle third party classes that are not serializable?

2020-02-25 Thread yeikel valdes
I am currently using a third-party library (Lucene) with Spark that is not serializable. As a result, it generates the following exception: Job aborted due to stage failure: Task 144.0 in stage 25.0 (TID 2122) had a not serializable result: org.apache.lucene.facet.FacetsConfig Serializa
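One commonly used option for this kind of problem, sketched here in plain Java without Spark: keep the non-serializable object in a transient field of a small serializable holder and rebuild it lazily on the receiving side. ThirdPartyConfig, ConfigHolder, and HolderDemo are hypothetical names made up for this illustration (ThirdPartyConfig only stands in for a class such as FacetsConfig), and the round trip through Java serialization only imitates shipping an object between JVMs. Note that this pattern helps when the object is captured by a task; when the object itself is the task result, as the exception above suggests, it would still need to be converted to a serializable form before being returned.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a third-party class that does not implement Serializable.
final class ThirdPartyConfig {
    private final String name;
    ThirdPartyConfig(String name) { this.name = name; }
    String name() { return name; }
}

// Serializable holder: ships only the constructor argument and rebuilds the object on demand.
final class ConfigHolder implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String name;                  // serializable state
    private transient ThirdPartyConfig config;  // skipped by Java serialization

    ConfigHolder(String name) { this.name = name; }

    // Lazily (re)creates the wrapped object; after deserialization the field is null.
    synchronized ThirdPartyConfig get() {
        if (config == null) {
            config = new ThirdPartyConfig(name);
        }
        return config;
    }
}

final class HolderDemo {
    // Round-trips a value through Java serialization, imitating a transfer between JVMs.
    static <T extends Serializable> T roundTrip(T value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();
            return copy;
        }
    }

    public static void main(String[] args) throws Exception {
        ConfigHolder holder = new ConfigHolder("facets");
        holder.get();                            // populate the transient field before serializing
        ConfigHolder copy = roundTrip(holder);   // would throw if the holder were not serializable
        System.out.println(copy.get().name());   // the wrapped object is rebuilt on first access
    }
}
```

The holder serializes only what is needed to recreate the object, so this only works when the third-party object can be rebuilt from serializable inputs.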

Re: [Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

2020-02-25 Thread yeikel valdes
Can you please explain what you mean by that? How do you use a udf to replace a join? Thanks On Mon, 24 Feb 2020 22:06:40 -0500 jianneng...@workday.com wrote Thanks Genie. Unfortunately, the joins I'm doing in this case are large, so UDF likely won't work. Jianneng From: Liu G

Re: union two pyspark dataframes from different SparkSessions

2020-01-29 Thread yeikel valdes
From what I understand, the session is a singleton, so even if you think you are creating new instances you are just reusing it. On Wed, 29 Jan 2020 02:24:05 -1100 icbm0...@gmail.com wrote Dear all I already had a python function which is used to query data from HBase and HDFS w

Re: [External]Re: spark 2.x design docs

2019-09-19 Thread yeikel valdes
I am also interested. Many of the docs/books that I've seen are practical/examples about usage rather than deep internals of Spark. On Wed, 18 Sep 2019 21:12:12 -1100 vipul.s.p...@gmail.com wrote Yes, I realize what you were looking for, I am also looking for the same docs. Haven

Re:Does Spark SQL has match_recognize?

2019-05-26 Thread yeikel valdes
Isn't match_recognize just a filter? df.filter(predicate)? On Sat, 25 May 2019 12:55:47 -0700 kanth...@gmail.com wrote Hi All, Does Spark SQL has match_recognize? I am not sure why CEP seems to be neglected I believe it is one of the most useful concepts in the Financial applications

Re:Load Time from HDFS

2019-04-10 Thread yeikel valdes
What about a simple call to nanoTime? long startTime = System.nanoTime(); /* Spark work here */ long endTime = System.nanoTime(); long duration = endTime - startTime; System.out.println(duration); Count recomputes the df, so it makes sense that it takes longer for you. On Tue, 02 Apr 2019 07:06:30 -0700 kol
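A minimal, self-contained sketch of the nanoTime approach suggested above, in plain Java. The Timing class and its timeNanos helper are hypothetical names introduced for this illustration, and the loop inside main is only a placeholder for whatever Spark action is being measured.

```java
import java.util.concurrent.TimeUnit;

final class Timing {
    // Runs the given work once and returns the elapsed time in nanoseconds.
    static long timeNanos(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long elapsed = timeNanos(() -> {
            // Placeholder workload; in a Spark job this would be the action being timed.
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        });
        System.out.println("Elapsed: " + TimeUnit.NANOSECONDS.toMillis(elapsed) + " ms");
    }
}
```

System.nanoTime() is meant for measuring elapsed intervals rather than wall-clock time, which is why the difference of two readings is used. Because Spark evaluates lazily, the timed block has to contain an action, otherwise only the plan construction is measured.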

Re: Question about relationship between number of files and initial tasks(partitions)

2019-04-10 Thread yeikel valdes
If you need to reduce the number of partitions you could also try df.coalesce On Thu, 04 Apr 2019 06:52:26 -0700 jasonnerot...@gmail.com wrote Have you tried something like this? spark.conf.set("spark.sql.shuffle.partitions", "5" )  On Wed, Apr 3, 2019 at 8:37 PM Arthur Li wrote: H

Re:Parquet file number of columns

2019-01-07 Thread yeikel valdes
Not according to the Parquet dev group: https://groups.google.com/forum/m/#!topic/parquet-dev/jj7TWPIUlYI On Mon, 07 Jan 2019 05:11:51 -0800 gourav.sengu...@gmail.com wrote Hi, Is there any limit to the number of columns that we can have in Parquet file format? Thanks and Regards, Gour

Re: Re:Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
Ideally, we would like to copy, paste, and try it on our end. A screenshot is not enough. If you have private information, just remove it and create a minimal example we can use to replicate the issue. I'd say similar to this: https://stackoverflow.com/help/mcve On Mon, 07 Jan 2019 04:15:16 -080

RE: Re: Spark Kinesis Connector SSL issue

2019-01-07 Thread yeikel valdes
  Shashikant Bangera | DevOps Engineer Payment Services DevOps Engineering Email: shashikantbang...@discover.com Group email: eppdev...@discover.com Tel: +44 (0) Mob: +44 (0) 7440783885     From: yeikel valdes [mailto:em...@yeikel.com] Sent: 07 January 2019 12:15 To: Shashikant Bangera Cc: user

Re: Spark Kinesis Connector SSL issue

2019-01-07 Thread yeikel valdes
Can you call this service with regular code (no Spark)? On Mon, 07 Jan 2019 02:42:48 -0800 shashikantbang...@discover.com wrote Hi team, please help, we are kind of blocked here. Cheers, Shashi

Fwd:Re: Can an UDF return a custom class other than case class?

2019-01-07 Thread yeikel valdes
Forwarded Message From : em...@yeikel.com To : kfehl...@gmail.com Date : Mon, 07 Jan 2019 04:11:22 -0800 Subject : Re: Can an UDF return a custom class other than case class? In this case I am just curious because I'd like to know if it is possible. At the same time

Re:Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
Please share a minimal amount of code to try to reproduce the issue... On Mon, 07 Jan 2019 00:46:42 -0800 fyyleej...@163.com wrote Hi all, In my experiment program, I used Spark GraphX. When running in IDEA on Windows, the result is right, but when running on the Linux distributed clus