Re: Spark 2.4 partitions and tasks

2019-02-23 Thread Yeikel
I am following up on this question because I have a similar issue. When is it that we need to control the parallelism manually? Skewed partitions?

Difference between One map vs multiple maps

2019-03-04 Thread Yeikel
Considering that I have a DataFrame df, I could run df.map(operation1).map(operation2) or run df.map(logic for both operations). In addition, I could also run df.map(operation3) where operation3 would be: return operation2(operation1()) Similarly, with UDFs, I could build a UDF that does
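A minimal sketch of the two forms being compared, assuming a simple Dataset[Int] and placeholder operations (not the original poster's logic):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-chaining").getOrCreate()
import spark.implicits._

val ds = Seq(1, 2, 3).toDS()

def operation1(x: Int): Int = x + 1
def operation2(x: Int): Int = x * 2

// Two chained maps...
val chained = ds.map(operation1).map(operation2)

// ...versus one map that composes both operations
val fused = ds.map(x => operation2(operation1(x)))

chained.show()
fused.show()

Both forms produce the same result, and map is a narrow transformation, so the chained version does not introduce an extra shuffle.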

How can I parse an "unnamed" json array present in a column?

2019-02-22 Thread Yeikel
I have an "unnamed" JSON array stored in a *column*. The format is the following: column name: news Data: [ { "source": "source1", "name": "News site1" }, { "source": "source2", "name": "News site2" } ] Ideally, I'd like to parse it as: news ARRAY<STRUCT<source: STRING, name: STRING>> I've
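A minimal sketch of one way to do this, assuming the column is a string named news holding the JSON array shown above (df is the poster's DataFrame): from_json with an ArrayType schema parses it into an array of structs.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val newsSchema = ArrayType(StructType(Seq(
  StructField("source", StringType),
  StructField("name", StringType)
)))

val parsed = df.withColumn("news", from_json(col("news"), newsSchema))
parsed.printSchema()
// news: array<struct<source:string,name:string>>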

Re: Dataset experimental interfaces

2019-02-12 Thread yeikel
If you tested it end to end with the current version and it works fine, I'd say go ahead unless there is another similar way. If they change the functionality you can always update it. Regarding "non-experimental" functions, they could also be marked as deprecated and then removed in later

Re: Pyspark elementwise matrix multiplication

2019-02-13 Thread Yeikel
Elementwise product is described here: https://spark.apache.org/docs/latest/mllib-feature-extraction.html#elementwiseproduct I don't know if it will work with your input, though.
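For reference, a minimal sketch of that transformer (column names and vectors are illustrative; whether it fits the original input is unclear):

import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._   // assumes an active SparkSession named spark

val data = Seq(
  (0, Vectors.dense(1.0, 2.0, 3.0)),
  (1, Vectors.dense(4.0, 5.0, 6.0))
).toDF("id", "vector")

val transformer = new ElementwiseProduct()
  .setScalingVec(Vectors.dense(0.0, 1.0, 2.0))
  .setInputCol("vector")
  .setOutputCol("transformedVector")

// Each input vector is multiplied element-wise by the scaling vector
transformer.transform(data).show(false)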

Re: "where" clause able to access fields not in its schema

2019-02-13 Thread Yeikel
This is indeed strange. To add to the question, I can see that if I use a filter I get an exception (as expected), so I am not sure what the difference is between the where clause and filter: b.filter(s => { val bar: String = s.getAs("bar") bar.equals("20") }).show

Re: "where" clause able to access fields not in its schema

2019-02-13 Thread Yeikel
It seems that we are using the function incorrectly.
val a = Seq((1,10),(2,20)).toDF("foo","bar")
val b = a.select($"foo")
val c = b.where(b("bar") === 20)
c.show
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "bar" among
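A minimal sketch of a formulation that avoids the issue: apply the predicate before projecting the column away, so the filter never references a column that is missing from the schema.

import spark.implicits._   // assumes an active SparkSession named spark

val a = Seq((1, 10), (2, 20)).toDF("foo", "bar")

// Filter while "bar" is still part of the schema, then project
val c = a.where($"bar" === 20).select($"foo")
c.show()
// +---+
// |foo|
// +---+
// |  2|
// +---+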

Re: Parallelize Join Problem

2019-04-17 Thread Yeikel
It is hard to tell, but your data may be skewed.

Re: Spark job running for long time

2019-04-17 Thread Yeikel
We need more information about your job to be able to help you. Please share some snippets or the overall idea of what you are doing.

Re: Spark job running for long time

2019-04-17 Thread Yeikel
Can you share the output of df.explain()?

Re: Spark SQL API taking longer time than DF API.

2019-04-17 Thread Yeikel
Please share the results of df.explain() [1] for both. That should give us some clues about the differences. [1] https://github.com/apache/spark/blob/e1c90d66bbea5b4cb97226610701b0389b734651/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L499

Re: writing into oracle database is very slow

2019-04-13 Thread Yeikel
Are you sure you only need 10 partitions? Do you get the same performance writing to HDFS with 10 partitions?

Re: Best Practice for Writing data into a Hive table

2019-04-13 Thread Yeikel
Writing to CSV is very slow. From what I've seen, this is the preferred way to write to Hive: myDf.createOrReplaceTempView("mytempTable") sqlContext.sql("create table mytable as select * from mytempTable"); Source:
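A minimal sketch of that approach next to the DataFrameWriter equivalent (table and view names are illustrative, and saveAsTable assumes the session was built with Hive support enabled):

// Temp-view + CTAS, as quoted above
myDf.createOrReplaceTempView("mytempTable")
spark.sql("CREATE TABLE mytable AS SELECT * FROM mytempTable")

// Equivalent with the DataFrameWriter API
myDf.write.mode("overwrite").saveAsTable("mytable")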

Re: High level explanation of dropDuplicates

2019-06-12 Thread Yeikel
Nicholas, thank you for your explanation. I am also interested in the example that Rishi is asking for. I am sure mapPartitions may work, but as Vladimir suggests, it may not be the best option in terms of performance. @Vladimir Prus, are you aware of any example about writing a "custom

Re: What is the compatibility between releases?

2019-06-20 Thread Yeikel
Hi Community, I am still looking for an answer to this question. I am running a cluster using Spark 2.3.1, but I am wondering if it is safe to include Spark 2.4.1 and use new features such as higher-order functions. Thank you.

High level explanation of dropDuplicates

2019-05-20 Thread Yeikel
Hi, I am looking for a high-level explanation (overview) of how dropDuplicates [1] works. [1] https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326 Could someone please explain? Thank you
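A minimal sketch of the usual mental model (not the exact physical plan): dropDuplicates over a set of columns behaves roughly like a groupBy on those columns that keeps one row per key, which implies a shuffle on the key columns.

import org.apache.spark.sql.functions.first
import spark.implicits._   // assumes an active SparkSession named spark

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

// Keeps one (arbitrary) row per distinct "key"
df.dropDuplicates("key").show()

// Roughly comparable formulation with an explicit aggregation
df.groupBy("key").agg(first("value").as("value")).show()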

Execute Spark model without Spark

2019-08-22 Thread Yeikel
Hi, I have a GBTClassificationModel that I generated using Spark. How can I export this model and use it without a Spark cluster? I would like to serve it outside of Spark.

Is there any way to set the location of the history for the spark-shell per session?

2020-04-14 Thread Yeikel
In my team, we get elevated access to our Spark cluster using a common username, which means that we all share the same history. I am not sure if this is common, but unfortunately there is nothing I can do about it. Is there any option to set the location of the history? I am looking for

Re: How does spark sql evaluate case statements?

2020-04-14 Thread Yeikel
I do not know the answer to this question, so I am also looking for it, but @kant, maybe the generated code can help with this.

Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-14 Thread Yeikel
Looking at the results of explain, I can see a CollectLimit step. Does that work the same way as a regular .collect()? (where all records are sent to the driver?)
spark.sql("select * from db.table limit 100").explain(false)
== Physical Plan ==
CollectLimit 100
+- FileScan parquet ...

Re: [Pyspark] - Spark uses all available memory; unrelated to size of dataframe

2020-04-15 Thread Yeikel
The memory that you see in Spark's UI page, under Storage, is not the memory used by your processing but the amount of memory that you persisted from your RDDs and DataFrames. Read more here: https://spark.apache.org/docs/3.0.0-preview/web-ui.html#storage-tab We need more details to be able to

Question about how parquet files are read and processed

2020-04-15 Thread Yeikel
I have a parquet file with millions of records and hundreds of fields that I will be extracting from a cluster with more resources. I need to take that data, derive a set of tables from only some of the fields, and import them using a smaller cluster. The smaller cluster cannot load in memory the
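A sketch of what column pruning buys in this situation, with illustrative paths and field names: because Parquet is columnar, selecting only the needed fields means Spark reads just those columns, so the smaller cluster never has to materialize every field.

import org.apache.spark.sql.functions.col

val slim = spark.read.parquet("/path/to/wide_parquet")    // illustrative path
  .select("field1", "field2", "field3")                   // only the columns needed
  .where(col("field1").isNotNull)                         // filters can be pushed down

slim.write.parquet("/path/to/derived_table")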

Re: wot no toggle ?

2020-04-16 Thread Yeikel
I have been reading the list for the last few days and I have seen a lot of messages with similar grammar and criticism that does not add any value. If you don't like something about Spark, discuss it politely or create a Jira ticket. Please stop attacking the community or you'll be blocked.

Re: Is there any way to set the location of the history for the spark-shell per session?

2020-04-16 Thread Yeikel
Thank you. That's what I was looking for. I only found a PR from Scala when I googled it, so if you remember your source, please share it. Thanks!

Re: Get Size of a column in Bytes for a Pyspark Dataframe

2020-04-16 Thread Yeikel
As far as I know, one option is to persist it and check it in the Spark UI: df.select("field").persist().count() // I'd like to hear other options too.
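Another rough option, as a sketch: collect the single column to the driver (only viable if it fits there) and estimate its in-memory size with SizeEstimator. This estimates JVM object size, not on-disk size.

import org.apache.spark.util.SizeEstimator

val rows = df.select("field").collect()   // assumes the column fits on the driver
println(SizeEstimator.estimate(rows))     // approximate size in bytes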

Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-21 Thread Yeikel
Hi Zhang, thank you for your response. While your answer clears up my confusion about `CollectLimit`, it still does not clarify the recommended way to extract large amounts of data (but not all the records) from a source while maintaining a high level of parallelism. For example, at some

RE: Fast Unit Tests

2018-05-01 Thread Yeikel Santana
Can you share a sample test case? How are you doing the unit tests? Are you creating the session in a beforeAll block or similar? As far as I know, if you use Spark you will end up with light integration tests rather than “real” unit tests (please correct me if I am wrong).

What is the purpose of CoarseGrainedScheduler and how can I disable it?

2018-04-01 Thread Yeikel Santana
rn/*:/usr/hadoop-2.8.3/share/hadoop/mapreduce/lib/*:/usr/hadoop-2.8.3/share/hadoop/mapreduce/*:/usr/hadoop-2.8.3/share/hadoop/tools/lib/*" "-Xmx1024M" "-Dspark.driver.port=59906" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "

Re: Parquet file number of columns

2019-01-07 Thread yeikel valdes
Not according to the Parquet dev group: https://groups.google.com/forum/m/#!topic/parquet-dev/jj7TWPIUlYI On Mon, 07 Jan 2019 05:11:51 -0800 gourav.sengu...@gmail.com wrote: Hi, is there any limit to the number of columns that we can have in the Parquet file format? Thanks and Regards,

Re: Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
Please share a minimal amount of code to try to reproduce the issue... On Mon, 07 Jan 2019 00:46:42 -0800 fyyleej...@163.com wrote: Hi all, in my experiment program I used Spark GraphX. When running in IDEA on Windows the result is right, but when running on the Linux distributed

Re: Re: Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
Ideally... we would like to copy, paste, and try it on our end. A screenshot is not enough. If you have private information, just remove it and create a minimal example we can use to replicate the issue. I'd say similar to this: https://stackoverflow.com/help/mcve On Mon, 07 Jan 2019 04:15:16

RE: Re: Spark Kinesis Connector SSL issue

2019-01-07 Thread yeikel valdes

Fwd: Re: Can a UDF return a custom class other than a case class?

2019-01-07 Thread yeikel valdes
Forwarded Message From: em...@yeikel.com To: kfehl...@gmail.com Date: Mon, 07 Jan 2019 04:11:22 -0800 Subject: Re: Can a UDF return a custom class other than a case class? In this case I am just curious because I'd like to know if it is possible. At the same

Re: Spark Kinesis Connector SSL issue

2019-01-07 Thread yeikel valdes
Can you call this service with regular code (no Spark)? On Mon, 07 Jan 2019 02:42:48 -0800 shashikantbang...@discover.com wrote: Hi team, please help, we are kind of blocked here. Cheers, Shashi

Re: Does Spark SQL have match_recognize?

2019-05-26 Thread yeikel valdes
Isn't match_recognize just a filter? df.filter(predicate)? On Sat, 25 May 2019 12:55:47 -0700 kanth...@gmail.com wrote: Hi all, does Spark SQL have match_recognize? I am not sure why CEP seems to be neglected. I believe it is one of the most useful concepts in the Financial

Re: Question about relationship between number of files and initial tasks(partitions)

2019-04-10 Thread yeikel valdes
If you need to reduce the number of partitions, you could also try df.coalesce. On Thu, 04 Apr 2019 06:52:26 -0700 jasonnerot...@gmail.com wrote: Have you tried something like this? spark.conf.set("spark.sql.shuffle.partitions", "5") On Wed, Apr 3, 2019 at 8:37 PM Arthur Li wrote:
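A minimal sketch of the difference between the two suggestions (df and the partition counts are illustrative):

// coalesce reduces the number of partitions without a full shuffle
val fewer = df.coalesce(5)

// repartition always shuffles, but can both increase and decrease the partition count
val reshuffled = df.repartition(5)

println(fewer.rdd.getNumPartitions)       // 5
println(reshuffled.rdd.getNumPartitions)  // 5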

Re: Load Time from HDFS

2019-04-10 Thread yeikel valdes
What about a simple call to nanoTime?
val startTime = System.nanoTime()
// Spark work here
val endTime = System.nanoTime()
val duration = endTime - startTime
println(duration)
Count recomputes the df, so it makes sense that it takes longer for you. On Tue, 02 Apr 2019 07:06:30 -0700

Re: [External] Re: Spark 2.x design docs

2019-09-19 Thread yeikel valdes
I am also interested. Many of the docs/books that I've seen are practical, usage-focused examples rather than deep internals of Spark. On Wed, 18 Sep 2019 21:12:12 -1100 vipul.s.p...@gmail.com wrote: Yes, I realize what you were looking for; I am also looking for the same docs.

What options do I have to handle third party classes that are not serializable?

2020-02-25 Thread yeikel valdes
I am currently using a third-party library (Lucene) with Spark that is not serializable. For that reason, it generates the following exception: Job aborted due to stage failure: Task 144.0 in stage 25.0 (TID 2122) had a not serializable result: org.apache.lucene.facet.FacetsConfig
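One common workaround, sketched below: construct the non-serializable object on the executors instead of capturing it in a closure, for example once per partition inside mapPartitions (rdd and the record handling are placeholders; FacetsConfig appears only because the stack trace names it).

import org.apache.lucene.facet.FacetsConfig

val result = rdd.mapPartitions { iter =>
  // Created on the executor, so it is never serialized from the driver
  val facetsConfig = new FacetsConfig()
  iter.map { record =>
    // ... use facetsConfig with record here ...
    record
  }
}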

Re: [Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

2020-02-25 Thread yeikel valdes
Can you please explain what you mean by that? How do you use a UDF to replace a join? Thanks. On Mon, 24 Feb 2020 22:06:40 -0500 jianneng...@workday.com wrote: Thanks Genie. Unfortunately, the joins I'm doing in this case are large, so a UDF likely won't work. Jianneng From: Liu

Re: union two pyspark dataframes from different SparkSessions

2020-01-29 Thread yeikel valdes
From what I understand, the session is a singleton, so even if you think you are creating new instances, you are just reusing it. On Wed, 29 Jan 2020 02:24:05 -1100 icbm0...@gmail.com wrote: Dear all, I already had a Python function which is used to query data from HBase and HDFS
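A minimal sketch of that behaviour (Scala shown; PySpark's builder behaves the same way): getOrCreate returns the already-active session rather than building an isolated one.

import org.apache.spark.sql.SparkSession

val s1 = SparkSession.builder().getOrCreate()
val s2 = SparkSession.builder().getOrCreate()

println(s1 eq s2)   // true: the second builder call reuses the active session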

Re: Going it alone.

2020-04-14 Thread yeikel valdes
It depends on your use case. What are you trying to solve? On Tue, 14 Apr 2020 15:36:50 -0400 janethor...@aol.com.INVALID wrote: Hi, I consider myself to be quite good in software development, especially using frameworks. I like to get my hands dirty. I have spent the last few

Re: Going it alone.

2020-04-14 Thread yeikel valdes
is headed in my direction. You are implying Spark could be. So tell me about the USE CASES and I'll do the rest. On Tuesday, 14 April 2020 yeikel valdes wrote: It depends on your use case. What are you trying to solve? On Tue, 14 Apr 2020 15:36:50 -0400 janethor...@aol.com.INV

Re: Serialization or internal functions?

2020-04-07 Thread yeikel valdes
Thanks for your input, Soma, but I am actually looking to understand the differences, not only the performance. On Sun, 05 Apr 2020 02:21:07 -0400 somplastic...@gmail.com wrote: If you want to measure optimisation in terms of time taken, then here is an idea :)

Re: IDE suitable for Spark

2020-04-07 Thread yeikel valdes
Zeppelin is not an IDE but a notebook. It is helpful for experimenting, but it is missing a lot of the features that we expect from an IDE. Thanks for sharing, though. On Tue, 07 Apr 2020 04:45:33 -0400 zahidr1...@gmail.com wrote: When I first logged on, I asked if there was a suitable

What is the best way to take the top N entries from a hive table/data source?

2020-04-14 Thread yeikel valdes
When I use .limit(), the number of partitions for the resulting dataframe is 1, which normally fails most jobs.
val df = spark.sql("select * from table limit n")
df.write.parquet()
Thanks!
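One workaround, as a sketch: take the limit and then repartition the (now small) result so downstream stages and the write get parallelism back. The limit itself is still evaluated through a single-partition step, and the table name, limit value, partition count, and path below are all illustrative.

val limited = spark.sql("select * from some_table limit 1000000")

// Redistribute the limited result before doing further work with it
limited.repartition(200).write.parquet("/path/to/output")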

Do we have any mechanism to control requests per second for a Kafka connect sink?

2023-12-04 Thread Yeikel Santana
Hello everyone, Is there any mechanism to force Kafka Connect to ingest at a given rate per second as opposed to tasks? I am operating in a shared environment where the ingestion rate needs to be as low as possible (for example, 5 requests/second as an upper limit), and as far as I can

Re: Do we have any mechanism to control requests per second for a Kafka connect sink?

2023-12-04 Thread Yeikel Santana
Apologies to everyone. I sent this to the wrong email list. Please discard. On Mon, 04 Dec 2023 10:48:11 -0500 Yeikel Santana wrote: Hello everyone, Is there any mechanism to force Kafka Connect to ingest at a given rate per second as opposed to tasks? I am operating