RE: Watch "Airbus makes more of the sky with Spark - Jesse Anderson & Hassene Ben Salem" on YouTube

2020-04-25 Thread email
Zahid, Starting with Spark 2.3.0, the Spark team introduced an experimental feature called “Continuous Processing”[1][2] to enter that space, but in general, Spark streaming operates using micro-batches while Flink operates using the Continuous Flow Operator model. There are many
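
A minimal sketch of what switching into that experimental continuous mode looks like (spark-shell style, where `spark` is predefined; the Kafka broker, topic, and checkpoint path are placeholders, not details from the thread):

    import org.apache.spark.sql.streaming.Trigger

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
      .option("subscribe", "events")                       // placeholder topic
      .load()

    events.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/continuous-demo")      // placeholder path
      .trigger(Trigger.Continuous("1 second"))   // continuous mode instead of micro-batches
      .start()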

Serialization or internal functions?

2020-04-04 Thread email
Dear Community, Recently, I had to solve the following problem: "for every entry of a Dataset[String], concat a constant value", and to solve it, I used built-in functions: val data = Seq("A","b","c").toDS scala> data.withColumn("valueconcat",concat(col(data.columns.head),lit("
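
A small sketch contrasting the two options the thread title alludes to, built-in (Catalyst) functions versus a typed map that serializes the lambda (spark-shell style; the "_suffix" constant is a placeholder):

    import spark.implicits._
    import org.apache.spark.sql.functions.{col, concat, lit}

    val data = Seq("A", "b", "c").toDS()

    // Built-in functions: stays in the Catalyst/untyped world, no closure serialization.
    val withBuiltins = data.withColumn("valueconcat", concat(col(data.columns.head), lit("_suffix")))

    // Typed map: the closure is serialized and shipped to the executors.
    val withMap = data.map(_ + "_suffix")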

RE: Caching tables in spark

2019-08-28 Thread email
different processes that read from the same raw data table (around 1.5 TB). Is there a way to read this data once, cache it somehow, and use it in both processes? Thanks -- Tzahi File Data Engineer <http://www.ironsrc.com/> email <mailto:tzahi.f...@ironsrc.com
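
If both flows run inside the same Spark application, a rough sketch of the read-once-and-persist pattern might look like this (the path, columns, and storage level are assumptions, not details from the thread):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.storage.StorageLevel

    val raw = spark.read.parquet("/data/raw_table")        // placeholder path
    raw.persist(StorageLevel.MEMORY_AND_DISK_SER)          // ~1.5 TB, so allow spilling to disk

    val processA = raw.filter(col("event_type") === "a")   // hypothetical first consumer
    val processB = raw.groupBy(col("key")).count()         // hypothetical second consumer

Across separate Spark applications the in-memory cache is not shared, so the usual alternative is to write a pre-processed copy of the table out once and have both jobs read that.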

What is the compatibility between releases?

2019-06-11 Thread email
Dear Community, From what I understand, Spark uses a variation of Semantic Versioning[1], but this information is not enough for me to tell whether versions are compatible with one another. For example, if my cluster is running Spark 2.3.1, can I develop using API additions in Spark

RE: Turning off Jetty Http Options Method

2019-04-30 Thread email
If this is correct, “This method exposes what all methods are supported by the end point”, then I really don’t understand how that is a security vulnerability, considering the OSS nature of this project. Are you adding new endpoints to this webserver? More info about info/other methods:

RE: [EXT] handling skewness issues

2019-04-30 Thread email
Please share the links if they are publicly available. Otherwise, please share the names of the talks. Thank you From: Jules Damji Sent: Monday, April 29, 2019 8:04 PM To: Michael Mansour Cc: rajat kumar ; user@spark.apache.org Subject: Re: [EXT] handling skewness issues Yes, indeed! A

RE: How to print DataFrame.show(100) to text file at HDFS

2019-04-14 Thread email
Please note that limit reduces the number of partitions to 1. If it is only 100 records you might be able to fit them in one executor, so limit followed by a write is okay. From: Brandon Geise Sent: Sunday, April 14, 2019 9:54 AM To: Chetan Khatri Cc: Nuthan Reddy ; user Subject: Re: How to
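
A minimal sketch of that limit-then-write approach (spark-shell style; the range DataFrame stands in for the real one and the HDFS path is a placeholder):

    val df = spark.range(0, 10000).toDF("id")          // stand-in for the real DataFrame

    // limit(100) collapses the result into a single partition, so one small file is written.
    df.limit(100)
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv("hdfs:///tmp/df_first_100")                 // placeholder HDFS path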

RE: Question about relationship between number of files and initial tasks(partitions)

2019-04-13 Thread email

RE: How can I parse an "unnamed" json array present in a column?

2019-02-24 Thread email
Individual columns are small, but the table contains millions of rows with this problem. I am probably overthinking this, and I will implement the workaround for now. Thank you for your help @Magnus Nilsson, I will try that for now and wait for the upgrade. From:

RE: How can I parse an "unnamed" json array present in a column?

2019-02-24 Thread email
Unfortunately, I can’t change the source system, so changing the JSON at runtime is the best I can do right now. Is there any preferred way to modify the String other than a UDF or a map on the string? At the moment I am modifying it by returning a generic type “t” so I can use the same

RE: Difference between Typed and untyped transformation in dataset API

2019-02-23 Thread email
From what I understand, if the transformation is untyped it will return a DataFrame, otherwise it will return a Dataset. In the source code you will see that the return type is a DataFrame instead of a Dataset, and they should also be annotated with @group untypedrel. Thus, you could check
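
A tiny spark-shell illustration of that distinction (the data is made up):

    import spark.implicits._

    val ds = Seq(("a", 1), ("b", 2)).toDS()              // Dataset[(String, Int)]

    val typed   = ds.map { case (k, v) => (k, v * 2) }   // typed transformation   -> Dataset[(String, Int)]
    val untyped = ds.select($"_1", $"_2")                // untyped transformation -> DataFrame (Dataset[Row])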

RE: How can I parse an "unnamed" json array present in a column?

2019-02-23 Thread email
What you suggested works in Spark 2.3, but in the version that I am using (2.1) it produces the following exception: found: org.apache.spark.sql.types.ArrayType required: org.apache.spark.sql.types.StructType ds.select(from_json($"news", schema) as "news_parsed").show(false)
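
For reference, a sketch of the Spark 2.3+ call the thread says works, where from_json accepts an ArrayType schema directly (the column and field names here are placeholders):

    import spark.implicits._
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

    val ds = Seq("""[{"title":"a"},{"title":"b"}]""").toDF("news")

    val schema = ArrayType(new StructType().add("title", StringType))

    ds.select(from_json($"news", schema) as "news_parsed").show(false)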

RE: Spark streaming filling the disk with logs

2019-02-14 Thread email
I have a quick question about this configuration. Particularly this line: log4j.appender.rolling.file=/var/log/spark/ Where does that path live? At the driver level, or on each executor individually? Thank you From: Jain, Abhishek 3. (Nokia - IN/Bangalore) Sent: Thursday, February
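
For context, the kind of rolling-appender block that line usually belongs to looks roughly like this (file name and sizes are examples); the path is resolved on whichever JVM loads the properties file, so the driver and each executor write to their own local copy of that directory:

    # example log4j.properties rolling appender (log4j 1.x)
    log4j.rootLogger=INFO, rolling
    log4j.appender.rolling=org.apache.log4j.RollingFileAppender
    log4j.appender.rolling.file=/var/log/spark/spark.log
    log4j.appender.rolling.maxFileSize=50MB
    log4j.appender.rolling.maxBackupIndex=5
    log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
    log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n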

What is the recommended way to store records that don't meet a filter?

2019-01-28 Thread email
Community, Given a dataset ds, what is the recommended way to store the records that don't meet a filter? For example: val ds = Seq(1,2,3,4).toDS val f = (i : Integer) => i < 2 val filtered = ds.filter(f(_)) I understand I can do this: val filterNotMet =
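
A minimal sketch of one way to keep both sides of the predicate (spark-shell style); note that the complement re-evaluates the source unless the dataset is cached first:

    import spark.implicits._

    val ds = Seq(1, 2, 3, 4).toDS()
    val f  = (i: Int) => i < 2

    val met    = ds.filter(f)        // records matching the predicate
    val notMet = ds.filter(!f(_))    // records failing the predicate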

RE: Is it possible to rate limit a UDF?

2019-01-12 Thread email
Thank you for your suggestion, Ramandeep, but the code is not clear to me. Could you please explain it? Particularly this part: Flowable.zip[java.lang.Long, Row, Row](delSt, itF, new BiFunction[java.lang.Long, Row, Row]() { Also, is it possible to achieve this without third party

Is it possible to rate limit a UDF?

2019-01-08 Thread email
I have a data frame to which I apply a UDF that calls a REST web service. This web service is deployed on only a few nodes and it won't be able to handle a massive load from Spark. Is it possible to rate limit this UDF? For example, something like 100 op/s. If not, what are the
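
One hedged sketch of a simple option, throttling inside mapPartitions so each task stays under a per-task budget (callRestService and the rate are placeholders; the effective cluster-wide rate also depends on how many tasks run concurrently):

    import spark.implicits._

    def callRestService(s: String): String = s"result-for-$s"   // stand-in for the real REST call

    val input = Seq("a", "b", "c").toDS()

    val maxPerSecondPerTask = 10L
    val minIntervalMs       = 1000L / maxPerSecondPerTask

    val throttled = input.mapPartitions { rows =>
      rows.map { r =>
        val start = System.currentTimeMillis()
        val out   = callRestService(r)
        val spent = System.currentTimeMillis() - start
        if (spent < minIntervalMs) Thread.sleep(minIntervalMs - spent)   // pace the calls
        out
      }
    }
    throttled.show()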

Can a UDF return a custom class other than a case class?

2019-01-06 Thread email
Hi, Is it possible to return a custom class from a UDF other than a case class? If so, how can we avoid this exception?: java.lang.UnsupportedOperationException: Schema for type {custom type} is not supported Full Example: import spark.implicits._ import
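
A sketch of the usual workaround rather than a way around the restriction: map the plain class onto a case class (or a tuple) that Spark can derive a schema for. MyPlainResult here is just a stand-in for the custom type:

    import spark.implicits._
    import org.apache.spark.sql.functions.udf

    class MyPlainResult(val code: Int, val label: String)   // plain class: no schema can be derived

    case class ResultRow(code: Int, label: String)          // case class: schema derivation works

    val toResult = udf { (s: String) =>
      val raw = new MyPlainResult(s.length, s)              // hypothetical computation
      ResultRow(raw.code, raw.label)
    }

    Seq("a", "bb").toDF("value").select(toResult($"value") as "parsed").show(false)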

Does spark-submit overwrite the Spark session created manually?

2018-12-31 Thread email
Hi Community, When we submit a job using 'spark-submit', passing options like the 'master url', what should be the content of the main class? For example, if I create the session myself: val spark = SparkSession.builder.master("local[*]").appName("Console")
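
A hedged sketch of the pattern usually recommended for spark-submit: leave the master out of the code so getOrCreate() picks up whatever spark-submit passes on the command line (--master, --conf, ...); the app name and job body are examples:

    import org.apache.spark.sql.SparkSession

    object ConsoleApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("Console")
          .getOrCreate()        // master URL comes from spark-submit, not from the code

        spark.range(10).show()  // placeholder job body
        spark.stop()
      }
    }

If the master is hard-coded in the code, that explicit setting takes precedence over the spark-submit option, which is why it is usually left out of the main class.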

RE: What are the alternatives to nested DataFrames?

2018-12-29 Thread email
1 - I am not sure how I can do what you suggest for #1, because I use the entries in the initial df to build the query and then from it I get the second df. Could you explain more? 2 - I also thought about doing what you consider in #2, but if I am not mistaken, if I use regular Scala data

RE: What are the alternatives to nested DataFrames?

2018-12-28 Thread email
I could, but only if I had it beforehand. I do not know what the DataFrame is until I pass the query parameter and receive the resultant DataFrame inside the iteration. The steps are: Original DF -> Iterate -> Pass every element to a function that takes the element of the original

RE: What are the alternatives to nested DataFrames?

2018-12-28 Thread email
Shabad, I am not sure what you are trying to say. Could you please give me an example? The result of the query is a DataFrame that is created after iterating, so I am not sure how I could map that to a column without iterating and getting the values. I have a DataFrame that contains a

What are the alternatives to nested DataFrames?

2018-12-27 Thread email
Hi community, As shown in other answers online, Spark does not support the nesting of DataFrames, but what are the options? I have the following scenario: dataFrame1 = List of Cities dataFrame2 = Created after searching in ElasticSearch for each city in dataFrame1 I've
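
One hedged sketch of the collect-and-union alternative discussed in the replies above: pull the (small) list of cities to the driver, build one DataFrame per city, and union the results. queryElasticSearch and the city values are placeholders:

    import org.apache.spark.sql.DataFrame
    import spark.implicits._

    def queryElasticSearch(city: String): DataFrame =      // stand-in for the real ES lookup
      Seq((city, 1)).toDF("city", "hits")

    val cities = Seq("Toulouse", "Hamburg").toDF("city")

    val perCity: Seq[DataFrame] = cities.as[String].collect().toSeq.map(queryElasticSearch)
    val combined: DataFrame     = perCity.reduce(_ union _)
    combined.show()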