Re: Needed some best practices to integrate Spark with HBase

2020-07-20 Thread YogeshGovi
I also need good docs on this, especially on integrating PySpark with Hive to read tables backed by HBase.

Re: Insert overwrite using select within same table

2020-07-20 Thread Umesh Bansal
You can use a temp table to insert the data, and then insert from the temp table back into the main table. On Tue, 21 Jul 2020 at 12:01 AM, Utkarsh Jain wrote: > Hello community, > > I am not sure that using 'insert overwrite using select within same table' > via spark SQL is a good approach or is there
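A minimal sketch of that approach via Spark SQL (the table names main_table and tmp_table and the filter are hypothetical):

    // Stage the result of the SELECT in a temporary table, then overwrite the original table from it.
    spark.sql("DROP TABLE IF EXISTS tmp_table")
    spark.sql("CREATE TABLE tmp_table AS SELECT * FROM main_table WHERE some_col IS NOT NULL")
    spark.sql("INSERT OVERWRITE TABLE main_table SELECT * FROM tmp_table")
    spark.sql("DROP TABLE tmp_table")

This avoids reading from and overwriting the same table in a single statement.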

Re: Future timeout

2020-07-20 Thread Terry Kim
"spark.sql.broadcastTimeout" is the config you can use: https://github.com/apache/spark/blob/fe07521c9efd9ce0913eee0d42b0ffd98b1225ec/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L863 Thanks, Terry On Mon, Jul 20, 2020 at 11:20 AM Amit Sharma wrote: > Please help on

Insert overwrite using select within same table

2020-07-20 Thread Utkarsh Jain
Hello community, I am not sure whether using 'insert overwrite using select within same table' via Spark SQL is a good approach, or whether there is a better alternative approach. Thanks Utkarsh

Re: Garbage collection issue

2020-07-20 Thread Russell Spitzer
High GC is relatively hard to debug in general, but I can give you a few pointers. It basically means that the time spent cleaning up unused objects is high, which usually means memory is being used and thrown away rapidly. It can also mean that GC is ineffective, and is being run many times in an
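As a first diagnostic step, a hedged sketch of turning on GC logging for the executors (JDK 8-style flags; shown as submit-time conf because executor JVM options must be set before the executors launch):

    spark-submit \
      --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      ...

The resulting GC lines in the executor logs show how often collections run and how much memory each one reclaims.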

Re: Garbage collection issue

2020-07-20 Thread Amit Sharma
Please help on this. Thanks, Amit. On Fri, Jul 17, 2020 at 2:34 PM Amit Sharma wrote: > Hi All, I am running the same batch job in my two separate Spark clusters. > In one of the clusters it is showing a GC warning on the Spark UI under the > executor tab. Garbage collection is taking a longer time

Re: Garbage collection issue

2020-07-20 Thread Jeff Evans
What is your heap size, and JVM vendor/version? Generally, G1 only outperforms CMS on large heap sizes (ex: 31GB or larger). On Mon, Jul 20, 2020 at 1:22 PM Amit Sharma wrote: > Please help on this. > > > Thanks > Amit > > On Fri, Jul 17, 2020 at 2:34 PM Amit Sharma wrote: > >> Hi All, i am
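As a hedged illustration of acting on that advice, switching the executor JVMs between collectors at submit time (heap sizes and flags are examples, not recommendations):

    # G1, typically only worth it on large heaps (roughly 31GB and up):
    spark-submit --conf spark.executor.memory=31g \
      --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" ...

    # CMS for smaller heaps:
    spark-submit --conf spark.executor.memory=8g \
      --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" ...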

Re: Future timeout

2020-07-20 Thread Amit Sharma
Please help on this. Thanks, Amit. On Fri, Jul 17, 2020 at 9:10 AM Amit Sharma wrote: > Hi, sometimes my Spark Streaming job throws this exception: Futures timed > out after [300 seconds]. > I am not sure where the default timeout is configured. Can I increase > it? Please help.
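Following Terry's pointer earlier in this digest, setting it looks like the following (the value is in seconds; 3600 is just an illustration, the default is 300):

    // Raise the broadcast join timeout, e.g. to one hour.
    spark.conf.set("spark.sql.broadcastTimeout", "3600")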

Re: schema changes of custom data source in persistent tables DataSourceV1

2020-07-20 Thread fansparker
Makes sense, Russell. I am trying to figure out if there is a way to enforce metadata reload at "createRelation" if the provided schema in the new sparkSession is different than the existing metadata schema.

Re: schema changes of custom data source in persistent tables DataSourceV1

2020-07-20 Thread Russell Spitzer
The code you linked to is very old and I don't think that method works anymore (HiveContext no longer exists). My latest attempt at trying this was Spark 2.2, and I ran into the issues I wrote about before. In DSV2 it's done via a catalog implementation, so you basically can write a new
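For orientation, a DSv2 catalog is registered through configuration; the class name below is a placeholder for an implementation of org.apache.spark.sql.connector.catalog.TableCatalog (Spark 3.x):

    import org.apache.spark.sql.SparkSession

    // com.example.MyCatalog is hypothetical; "my_catalog" is just the name it is registered under.
    val spark = SparkSession.builder()
      .config("spark.sql.catalog.my_catalog", "com.example.MyCatalog")
      .getOrCreate()
    // Tables then resolve through that catalog:
    spark.sql("SELECT * FROM my_catalog.my_db.my_table")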

Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread Russell Spitzer
Without seeing the rest (and you can confirm this by looking at the DAG visualization in the Spark UI), I would say your first stage with 6 partitions is:
Stage 1: Read data from Kinesis (or read blocks from the receiver, not sure which method you are using) and write shuffle files for the repartition
Stage

Re: schema changes of custom data source in persistent tables DataSourceV1

2020-07-20 Thread fansparker
Thanks Russell. This shows that the "refreshTable" and "invalidateTable" could be used to reload the metadata but they do not work in our case. I have tried to
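For reference, the calls under discussion in their usual form (the table name is a placeholder; as noted, they refresh Spark's cached metadata for the session but did not help in this case):

    // Catalog API
    spark.catalog.refreshTable("my_table")
    // SQL equivalent
    spark.sql("REFRESH TABLE my_table")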

Re: Spark UI

2020-07-20 Thread ArtemisDev
Thanks Xiao for the info.  I was looking for this, too.  This page wasn't linked from anywhere on the main doc page (Overview) or any of the pull-down menus.  Someone should remind the doc team to update the table of contents on the Overview page. -- ND On 7/19/20 10:30 PM, Xiao Li wrote:

How to monitor the throughput and latency of the combineByKey transformation in Spark 3?

2020-07-20 Thread Felipe Gutierrez
Hi community, I built a simple count-and-sum Spark application which uses the combineByKey transformation [1], and I would like to monitor the throughput into and out of this transformation, as well as the latency combineByKey spends pre-aggregating tuples. Ideally, the latency I would like to take
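For context, a minimal count-and-sum combineByKey sketch; the input pairs are made up, and the accumulator is just one hedged way to get a rough view of how much merge work the combiners do (task-level metrics via a SparkListener would be the other usual route):

    // (key, value) pairs are illustrative.
    val pairs = spark.sparkContext.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)))
    val merges = spark.sparkContext.longAccumulator("combiner-merges")

    val sumCount = pairs.combineByKey(
      (v: Double) => (v, 1L),                                                         // createCombiner
      (acc: (Double, Long), v: Double) => { merges.add(1); (acc._1 + v, acc._2 + 1) }, // mergeValue
      (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2)             // mergeCombiners
    )
    sumCount.collect().foreach { case (k, (sum, n)) => println(s"$k: sum=$sum count=$n") }
    println(s"combiner merges observed: ${merges.value}")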

Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread forece85
Thanks for the reply. Please find pseudo code below. It reads DStreams every 10 seconds from a Kinesis stream and, after transformations, pushes into HBase. Once we get the DStream, we use the code below to repartition and do the processing: dStream = dStream.repartition(javaSparkContext.defaultMinPartitions()

Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread forece85
Thanks for the reply. Please find pseudo code below. We fetch DStreams from a Kinesis stream every 10 seconds, perform transformations, and finally persist to HBase tables using batch insertions. dStream = dStream.repartition(jssc.defaultMinPartitions() * 3); dStream.foreachRDD(javaRDD ->
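A hedged Scala sketch of that write path (the original pseudo code uses the Java API; the table name, column family, record shape, and the ssc StreamingContext variable are assumptions):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    // dStream is assumed to be a DStream[(String, String)] of (rowKey, value) pairs.
    dStream.repartition(ssc.sparkContext.defaultMinPartitions * 3).foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("my_table"))
        try {
          val puts = records.map { case (rowKey, value) =>
            new Put(Bytes.toBytes(rowKey))
              .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
          }.toList
          table.put(puts.asJava)   // one batched insert per partition
        } finally { table.close(); conn.close() }
      }
    }

Opening the HBase connection inside foreachPartition keeps it off the driver and amortizes it over the whole partition.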

Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread Russell Spitzer
Without your code this is hard to determine, but a few notes. The number of partitions is usually dictated by the input source; see if it has any configuration that allows you to increase input splits. I'm not sure why you think some of the code is running on the driver. All methods on

Does Spark support column scan pruning for array of structs?

2020-07-20 Thread Haijia Zhou
I have a data frame in following schema:
household
root
 |-- country_code: string (nullable = true)
 |-- region_code: string (nullable = true)
 |-- individuals: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- individual_id: string (nullable = true)
 |    |
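Nested schema pruning for Parquet/ORC is the relevant feature here; whether it prunes through arrays of structs depends on the Spark version and the expressions used. A hedged way to check, assuming the data frame is available as a val named household, is to select only the nested field and inspect ReadSchema in the physical plan (the config key exists as shown, though its default has varied across versions):

    import org.apache.spark.sql.functions.{col, explode}

    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
    val ids = household.select(explode(col("individuals.individual_id")).as("individual_id"))
    ids.explain(true)   // if pruning applies, ReadSchema should mention only individuals.individual_id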

Re: schema changes of custom data source in persistent tables DataSourceV1

2020-07-20 Thread Russell Spitzer
The last time I looked into this, the answer was no. I believe that since there is a SparkSession-internal relation cache, the only way to update a session's information was a full drop and create. That was my experience with a custom Hive metastore and entries read from it. I could change the entries in the

Re: schema changes of custom data source in persistent tables DataSourceV1

2020-07-20 Thread Piyush Acharya
Do you want to merge the schema when the incoming data changes? spark.conf.set("spark.sql.parquet.mergeSchema", "true") https://kontext.tech/column/spark/381/schema-merging-evolution-with-parquet-in-spark-and-hive On Mon, Jul 20, 2020 at 3:48 PM fansparker wrote: > Does anybody know if there

Re: Spark Structured Streaming keep on consuming usercache

2020-07-20 Thread Piyush Acharya
Can you try calling batchDF.unpersist() once the work is done in the loop? On Mon, Jul 20, 2020 at 3:38 PM Yong Yuan wrote: > It seems the following structured streaming code keeps on consuming > usercache until all disk space is occupied. > > val monitoring_stream = >

Re: Using pyspark with Spark 2.4.3 a MultiLayerPerceptron model givens inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.

2020-07-20 Thread Ben Smith
Thanks for that. I have played with this a bit more after your feedback and found:
- I can only recreate the problem with Python 3.6+. If I switch between Python 2.7, 3.6 and 3.7, the problem occurs in the Python 3.6 and 3.7 cases but not in Python 2.7.
- I have used

Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread forece85
I am new to Spark Streaming and trying to understand the Spark UI and do optimizations.
1. Processing at the executors took less time than at the driver. How can we optimize to make the driver tasks fast?
2. We are using dstream.repartition(defaultParallelism*3) to increase parallelism, which is causing high

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-20 Thread DB Tsai
In Spark 3.0, if you use the `with-hadoop` Spark distribution that has embedded Hadoop 3.2, you can set `spark.yarn.populateHadoopClasspath=false` to not populate the cluster's Hadoop classpath. In this scenario, Spark will use the Hadoop 3.2 client to connect to Hadoop 2.6, which should work fine. In
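A hedged submit-time sketch of that setup (the application class, jar, and master are placeholders):

    spark-submit --master yarn \
      --conf spark.yarn.populateHadoopClasspath=false \
      --class com.example.MyApp my-app.jar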

Re: schema changes of custom data source in persistent tables DataSourceV1

2020-07-20 Thread fansparker
Does anybody know if there is a way to get the persisted table's schema updated when the underlying custom data source schema is changed? Currently, we have to drop and re-create the table.

Spark Deployment Strategy

2020-07-20 Thread codingkapoor
I want to understand how best to deploy Spark close to a data source or sink. Let's say I have a Vertica cluster that I need to run a Spark job against. In that case, how should the Spark cluster be set up?
1. Should we run a Spark worker node on each Vertica cluster node?
2. How about when shuffling

Spark ETL use case

2020-07-20 Thread codingkapoor
Can we use Spark as an ETL service? Suppose we have data written to our Cassandra data stores and we need to transform and load the same to Vertica for analytics purposes. Since Spark is already a very well designed distributed system, wouldn't it make sense to load data from Cassandra into Spark
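As a hedged sketch of such a pipeline, assuming the spark-cassandra-connector and a Vertica JDBC driver are on the classpath (keyspace, table, filter, and connection details are all placeholders):

    import org.apache.spark.sql.functions.col

    // Read from Cassandra via the connector's data source.
    val events = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "events"))
      .load()

    // Transform, then load into Vertica over JDBC.
    events.filter(col("event_type") === "purchase")
      .write
      .format("jdbc")
      .option("url", "jdbc:vertica://vertica-host:5433/analytics_db")
      .option("dbtable", "public.purchases")
      .option("user", "dbadmin")
      .option("password", "***")
      .option("driver", "com.vertica.jdbc.Driver")
      .mode("append")
      .save()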

Spark Structured Streaming keep on consuming usercache

2020-07-20 Thread Yong Yuan
It seems the following structured streaming code keeps on consuming usercache until all disk space is occupied.
val monitoring_stream = monitoring_df.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .foreachBatch { (batchDF: DataFrame,
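Tying this to the unpersist suggestion in the reply above, a hedged completion of that block might look like the following (only the trigger interval is taken from the snippet; the per-batch work is a placeholder):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.streaming.Trigger

    val monitoring_stream = monitoring_df.writeStream
      .trigger(Trigger.ProcessingTime("120 seconds"))
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        batchDF.persist()
        // ... per-batch work goes here, e.g. writing batchDF to a sink ...
        batchDF.unpersist()   // release cached blocks so they don't pile up in the usercache directory
      }
      .start()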

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-20 Thread DB Tsai
If it's standalone mode, it's even easier. You should be able to connect to hadoop 2.6 hdfs using 3.2 client. In your k8s cluster, just don't put hadoop 2.6 into your classpath. On Sun, Jul 19, 2020 at 10:25 PM Ashika Umanga Umagiliya wrote: > > Hello > > "spark.yarn.populateHadoopClasspath" is