Re: Why Apache Spark doesn't use Calcite?

2020-01-13 Thread Jason Nerothin
The implementation they chose supports predicate pushdown, Datasets, and other features that are not available in Calcite: https://databricks.com/glossary/catalyst-optimizer On Mon, Jan 13, 2020 at 8:24 AM newroyker wrote: > Was there a qualitative or quantitative benchmark done before a
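For illustration, a minimal sketch (not from the thread) of how Catalyst's predicate pushdown shows up in practice; the path and column name are assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

    // Filters written against a Parquet source are pushed into the scan by Catalyst.
    val clicks = spark.read.parquet("/tmp/events")      // assumed path
      .filter("eventType = 'click'")                    // assumed column
    // The physical plan lists the pushed predicate, e.g.
    //   PushedFilters: [IsNotNull(eventType), EqualTo(eventType,click)]
    clicks.explain(true)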

alternatives to shading

2019-12-17 Thread Jason Nerothin
Our build is complex; it uses a large number of third-party jars and generates an uber jar that is shaded before we pass it to spark-submit. We shade to avoid ClassLoader collisions with Spark platform dependencies (e.g. protobuf 3). Managing the dependencies/shading is cumbersome and error-prone.
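For illustration, a minimal sbt-assembly sketch of the kind of shade rule described above, assuming the plugin is already added in project/plugins.sbt; the relocated package prefix is arbitrary:

    // build.sbt, with sbt-assembly enabled
    assemblyShadeRules in assembly := Seq(
      // Relocate our protobuf 3 classes so they cannot collide with Spark's protobuf.
      ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll
    )
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }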

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
Yes. If the job fails repeatedly (4 times in this case), Spark assumes that there is a problem in the Job and notifies the user. In exchange for this, the engine can go on to serve other jobs with its available resources. I would try the following until things improve: 1. Figure out what's
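If the four failures above refer to spark.task.maxFailures (whose default is 4), the retry budget can be raised while debugging; a sketch only, since it buys more retries without fixing the root cause:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("retry-budget-demo")
      // Default is 4; the job is only failed after this many attempts per task.
      .config("spark.task.maxFailures", "8")
      .getOrCreate()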

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
Correction: The Driver manages the Tasks, the resource manager serves up resources to the Driver or Task. On Tue, May 21, 2019 at 9:11 AM Jason Nerothin wrote: > The behavior is a deliberate design decision by the Spark team. > > If Spark were to "fail fast", it would prev

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
The behavior is a deliberate design decision by the Spark team. If Spark were to "fail fast", it would prevent the system from recovering from many classes of errors that are in principle recoverable (for example if two otherwise unrelated jobs cause a garbage collection spike on the same node).

Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-15 Thread Jason Nerothin
I did a quick Google search for "Java/Scala interoperability" and was surprised to find very few recent results on the topic. (Has the world given up?) It's easy to use Java library code from Scala, but the opposite is not true. I would think about the problem this way: Do *YOU* need to provide
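A tiny illustration of the first half of that claim: plain JDK classes drop straight into Scala with no glue. The reverse direction is harder because Scala features such as implicits, default arguments, and Scala collections do not surface cleanly to Java, which is one reason Spark ships Java-friendly wrappers.

    import java.time.{Duration, Instant}
    import java.util.concurrent.ConcurrentHashMap

    // Java library code used directly from Scala, no adapters needed.
    val cache = new ConcurrentHashMap[String, Long]()
    val start = Instant.now()
    cache.put("started", start.toEpochMilli)
    println(s"Elapsed: ${Duration.between(start, Instant.now()).toMillis} ms")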

Re: Streaming job, catch exceptions

2019-05-12 Thread Jason Nerothin
Code would be very helpful, but it *seems like* you are: 1. Writing in Java 2. Wrapping the *entire app* in a try/catch 3. Executing in local mode The code that is throwing the exceptions is not executed locally in the driver process. Spark is executing the failing code on the cluster. On Sun,
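A minimal sketch of that distinction: the lambda below runs on executors, and its exception only reaches the driver (wrapped in a SparkException) when an action fires, so a try/catch must surround the action, not just the transformation. Names are illustrative.

    import org.apache.spark.SparkException
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("exception-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = Seq("1", "2", "oops").toDS()
    try {
      val parsed = ds.map(_.toInt)   // lazy: nothing has failed yet
      parsed.collect()               // the NumberFormatException surfaces here
    } catch {
      case e: SparkException => println(s"Job failed: ${e.getMessage}")
    }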

Re: Handle Null Columns in Spark Structured Streaming Kafka

2019-04-29 Thread Jason Nerothin
See also here: https://stackoverflow.com/questions/44671597/how-to-replace-null-values-with-a-specific-value-in-dataframe-using-spark-in-jav On Mon, Apr 29, 2019 at 5:27 PM Jason Nerothin wrote: > Spark SQL has had an na.fill function on it since at least 2.1. Would that > work f

Re: Handle Null Columns in Spark Structured Streaming Kafka

2019-04-29 Thread Jason Nerothin
Spark SQL has had an na.fill function since at least 2.1. Would that work for you? https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/DataFrameNaFunctions.html On Mon, Apr 29, 2019 at 4:57 PM Shixiong(Ryan) Zhu wrote: > Hey Snehasish, > > Do you have a reproducer for this
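For illustration, a minimal sketch of na.fill (DataFrameNaFunctions, linked above) with made-up column names:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("na-fill-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", Some(1.0)), ("b", None)).toDF("key", "value")

    val filledAll   = df.na.fill(0.0)                    // nulls in every numeric column
    val filledByCol = df.na.fill(Map("value" -> 0.0))    // targeted columns only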

Re: Update / Delete records in Parquet

2019-04-22 Thread Jason Nerothin
Hi Chetan, Do you have to use Parquet? It just feels like it might be the wrong sink for a high-frequency change scenario. What are you trying to accomplish? Thanks, Jason On Mon, Apr 22, 2019 at 2:09 PM Chetan Khatri wrote: > Hello All, > > If I am doing incremental load / delta and would

Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread Jason Nerothin
Hi Rajat, A little more color: The executor classpath will be used by the Spark workers/slaves. For example, all JVMs that are started with $SPARK_HOME/sbin/start-slave.sh. If you run with --deploy-mode cluster, then the driver itself will be run on the cluster (with the executor classpath).

Re: Spark2: Deciphering saving text file name

2019-04-09 Thread Jason Nerothin
Hi Subash, Short answer: It’s effectively random. Longer answer: In general the DataFrameWriter expects to be receiving data from multiple partitions. Let’s say you were writing to ORC instead of text. In this case, even when you specify the output path, the writer creates a directory at the
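A sketch of the behavior described above, with an assumed output path: even when the path is given explicitly, the writer produces a directory of part files whose names embed a partition index and a UUID.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("writer-naming-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "c").toDF("value")

    // Produces a directory such as:
    //   /tmp/out/part-00000-<uuid>-c000.txt   (plus a _SUCCESS marker)
    df.coalesce(1).write.mode("overwrite").text("/tmp/out")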

Re: Structured streaming flatMapGroupWithState results out of order messages when reading from Kafka

2019-04-09 Thread Jason Nerothin
I had that identical problem. Here’s what I came up with: https://github.com/ubiquibit-inc/sensor-failure On Tue, Apr 9, 2019 at 04:37 Akila Wajirasena wrote: > Hi > > I have a Kafka topic which is already loaded with data. I use a stateful > structured streaming pipeline using
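For illustration (this is not the linked repository's code), a sketch of per-key state with flatMapGroupsWithState that tolerates out-of-order arrival by only ever advancing the tracked event time; the Event fields are assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    case class Event(sensorId: String, eventTime: java.sql.Timestamp, value: Double)

    val spark = SparkSession.builder.appName("fmgws-demo").getOrCreate()
    import spark.implicits._

    // Keep the max event time seen per key, regardless of arrival order.
    def latest(key: String, events: Iterator[Event], state: GroupState[Long]): Iterator[(String, Long)] = {
      val newMax = (events.map(_.eventTime.getTime) ++ state.getOption.iterator).max
      state.update(newMax)
      Iterator((key, newMax))
    }

    // events: a streaming Dataset[Event] parsed from Kafka upstream (not shown)
    // val updates = events
    //   .groupByKey(_.sensorId)
    //   .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(latest)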

Re: Checking if cascading graph computation is possible in Spark

2019-04-05 Thread Jason Nerothin
am not sure how that helps here > > On Fri, 5 Apr 2019, 10:13 pm Jason Nerothin, > wrote: > >> Have you looked at Arbitrary Stateful Streaming and Broadcast >> Accumulators? >> >> On Fri, Apr 5, 2019 at 10:55 AM Basavaraj wrote: >> >>> Hi

Re: combineByKey

2019-04-05 Thread Jason Nerothin
(x.Id, x.value))).aggregateByKey(Set[String]())( > (aggr, value) => aggr ++ Set(value._2), > (aggr1, aggr2) => aggr1 ++ aggr2).collect().toMap > > print(result) > > Map(0-d1 -> Set(t1, t2, t3, t4), 0-d2 -> Set(t1, t5, t6, t2), 0-d3 -> > Set(t1, t2

Re: Checking if cascading graph computation is possible in Spark

2019-04-05 Thread Jason Nerothin
Have you looked at Arbitrary Stateful Streaming and Broadcast Accumulators? On Fri, Apr 5, 2019 at 10:55 AM Basavaraj wrote: > Hi > > Have two questions > > #1 > I am trying to process events in realtime, outcome of the processing has > to find a node in the GraphX and update that node as well

Re: combineByKey

2019-04-05 Thread Jason Nerothin
I broke some of your code down into the following lines: import spark.implicits._ val a: RDD[Messages]= sc.parallelize(messages) val b: Dataset[Messages] = a.toDF.as[Messages] val c: Dataset[(String, (String, String))] = b.map{x => (x.timeStamp + "-" + x.Id, (x.Id, x.value))}
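For reference, a self-contained sketch (made-up data and a hypothetical Messages case class) that reproduces the quoted aggregateByKey pipeline on the RDD API:

    import org.apache.spark.sql.SparkSession

    case class Messages(timeStamp: Int, Id: String, value: String)

    val spark = SparkSession.builder.appName("aggregate-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val messages = Seq(Messages(0, "d1", "t1"), Messages(0, "d1", "t2"), Messages(0, "d2", "t1"))

    val result = sc.parallelize(messages)
      .map(x => (x.timeStamp + "-" + x.Id, (x.Id, x.value)))
      .aggregateByKey(Set[String]())(
        (aggr, value) => aggr ++ Set(value._2),   // fold one record's value into the per-key set
        (aggr1, aggr2) => aggr1 ++ aggr2)         // merge partial sets across partitions
      .collect()
      .toMap

    println(result)   // e.g. Map(0-d1 -> Set(t1, t2), 0-d2 -> Set(t1))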

Re: reporting use case

2019-04-04 Thread Jason Nerothin
Hi Prasad, Could you create an Oracle-side view that captures only the relevant records and then use the Spark JDBC connector to load the view into Spark? On Thu, Apr 4, 2019 at 1:48 PM Prasad Bhalerao wrote: > Hi, > > I am exploring spark for my Reporting application. > My use case is as
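A sketch of that suggestion; the view name, host, and credentials are all assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("oracle-view-demo").getOrCreate()

    val reportDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")   // assumed host/service
      .option("dbtable", "REPORTING.RELEVANT_RECORDS")             // the Oracle-side view
      .option("user", "report_user")
      .option("password", sys.env("ORACLE_PW"))
      .option("driver", "oracle.jdbc.OracleDriver")
      .load()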

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Jason Nerothin
My thinking is that if you run everything in one partition - say 12 GB - then you don't experience the partitioning problem - one partition will have all duplicates. If that's not the case, there are other options, but they would probably require a design change. On Thu, Apr 4, 2019 at 8:46 AM Jason

Re: Question about relationship between number of files and initial tasks(partitions)

2019-04-04 Thread Jason Nerothin
Have you tried something like this? spark.conf.set("spark.sql.shuffle.partitions", "5") On Wed, Apr 3, 2019 at 8:37 PM Arthur Li wrote: > Hi Sparkers, > > I noticed that in my spark application, the number of tasks in the first > stage is equal to the number of files read by the

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Jason Nerothin
How much memory do you have per partition? On Thu, Apr 4, 2019 at 7:49 AM Chetan Khatri wrote: > I will get the information and will share with you. > > On Thu, Apr 4, 2019 at 5:03 PM Abdeali Kothari > wrote: > >> How long does it take to do the window solution ? (Also mention how many >>

Re: How to extract data in parallel from RDBMS tables

2019-04-02 Thread Jason Nerothin
er of tables. > > > On Fri, Mar 29, 2019 at 5:04 AM Jason Nerothin > wrote: > >> How many tables? What DB? >> >> On Fri, Mar 29, 2019 at 00:50 Surendra , Manchikanti < >> surendra.manchika...@gmail.com> wrote: >> >>> Hi Jason, >>> &g

Re: Spark SQL API taking longer time than DF API.

2019-03-30 Thread Jason Nerothin
Can you please quantify the difference and provide the query code? On Fri, Mar 29, 2019 at 9:11 AM neeraj bhadani wrote: > Hi Team, >I am executing same spark code using the Spark SQL API and DataFrame > API, however, Spark SQL is taking longer than expected. > > PFB Sudo code. > >

Re: How to extract data in parallel from RDBMS tables

2019-03-29 Thread Jason Nerothin
How many tables? What DB? On Fri, Mar 29, 2019 at 00:50 Surendra , Manchikanti < surendra.manchika...@gmail.com> wrote: > Hi Jason, > > Thanks for your reply, But I am looking for a way to parallelly extract > all the tables in a Database. > > > On Thu, Mar 28, 201

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Jason Nerothin
Meant this one: https://docs.databricks.com/api/latest/jobs.html On Thu, Mar 28, 2019 at 5:06 PM Pat Ferrel wrote: > Thanks, are you referring to > https://github.com/spark-jobserver/spark-jobserver or the undocumented > REST job server included in Spark? > > > From: Jason

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Jason Nerothin
Check out the Spark Jobs API... it sits behind a REST service... On Thu, Mar 28, 2019 at 12:29 Pat Ferrel wrote: > ;-) > > Great idea. Can you suggest a project? > > Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only > launches trivially in test apps since most uses are

Re: How to extract data in parallel from RDBMS tables

2019-03-28 Thread Jason Nerothin
Yes. If you use the numPartitions option, your max parallelism will be that number. See also: partitionColumn, lowerBound, and upperBound https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html On Wed, Mar 27, 2019 at 23:06 Surendra , Manchikanti < surendra.manchika...@gmail.com> wrote:
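For illustration, a sketch of the partitioned JDBC read those options enable; the table, bounds, and connection details are all assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("jdbc-parallel-read").getOrCreate()

    val ordersDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")   // assumed
      .option("dbtable", "public.orders")
      .option("user", "etl_user")
      .option("password", sys.env("DB_PW"))
      .option("partitionColumn", "order_id")   // must be numeric, date, or timestamp
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "8")            // caps the read parallelism
      .load()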

streaming - absolute maximum

2019-03-25 Thread Jason Nerothin
Hello, I wish to calculate the most recent event time from a Stream. Something like this: val timestamped = records.withColumn("ts_long", unix_timestamp($"eventTime")) val lastReport = timestamped .withWatermark("eventTime", "4 hours") .groupBy(col("eventTime"),
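One way to finish that aggregation (a sketch, not the final query from the thread) is to group into watermark-bounded windows and take the max of the epoch column; the rate source and window size below are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("max-event-time").getOrCreate()
    import spark.implicits._

    // Placeholder stream with an eventTime column; swap in the real source.
    val records = spark.readStream.format("rate").load()
      .withColumnRenamed("timestamp", "eventTime")

    val lastReport = records
      .withColumn("ts_long", unix_timestamp($"eventTime"))
      .withWatermark("eventTime", "4 hours")
      .groupBy(window($"eventTime", "1 hour"))
      .agg(max($"ts_long").as("latest_event_seconds"))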

Re: DeepSpark: where to start

2016-05-05 Thread Jason Nerothin
Just so that there is no confusion, there is a Spark user interface project called DeepSense that is actually useful: http://deepsense.io I am not affiliated with them in any way... On Thu, May 5, 2016 at 9:42 AM, Joice Joy wrote: > What the heck, I was already beginning

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
s to restart from the last > commited offset > > I understand that starting up a post crash job would work. > > Question is: how can we detect when DC2 crashes to start a new job ? > > dynamic topic partition (at each kafkaRDD creation for instance) + topic > subscr

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
will allow us to restart from the last > commited offset > > I understand that starting up a post crash job would work. > > Question is: how can we detect when DC2 crashes to start a new job ? > > dynamic topic partition (at each kafkaRDD creation for instance) +

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
Is the main concern uptime or disaster recovery? > On Apr 19, 2016, at 9:12 AM, Cody Koeninger wrote: > > I think the bigger question is what happens to Kafka and your downstream data > store when DC2 crashes. > > From a Spark point of view, starting up a post-crash job in

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Jason Nerothin
Hi Erwan, You might consider InsightEdge: http://insightedge.io . It has the capability of doing WAN replication between data grids and would save you the work of re-inventing the wheel. Additionally, RDDs can be shared between developers in the same DC. Thanks, Jason >

Re: drools on spark, how to reload rule file?

2016-04-18 Thread Jason Nerothin
use stateless session in spark streaming job. > But now my question is when the rule update, how to pass it to RDD? > We generate a ruleExecutor(stateless session) in main method, > Then pass the ruleExectutor in Rdd. > > I am new in drools, I am trying to read the drools doc now.

Re: drools on spark, how to reload rule file?

2016-04-18 Thread Jason Nerothin
The limitation is in the drools implementation. Changing a rule in a stateful KB is not possible, particularly if it leads to logical contradictions with the previous version or any other rule in the KB. When we ran into this, we worked around (part of) it by salting the rule name with a