Re: Help me learn about JOB, TASK and DAG in Apache Spark

2023-03-31 Thread AN-TRUONG Tran Phan

Re: Spark Push-Based Shuffle causing multiple stage failures

2022-05-25 Thread Han Altae-Tran
…yarn-site.xml for NodeManager configurations: spark.shuffle.push.server.mergedShuffleFileManagerImpl = org.apache.spark.network.shuffle.RemoteBlockPushResolver. On Tue, May 24, 2022 at 3:30 PM Mridul Muralidharan wrote: +CC zhouye...@gmail.com
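
For readability, the NodeManager property quoted above would sit in yarn-site.xml in the standard Hadoop form; the name and value are from the thread, the XML wrapping is a reconstruction:

  <property>
    <name>spark.shuffle.push.server.mergedShuffleFileManagerImpl</name>
    <value>org.apache.spark.network.shuffle.RemoteBlockPushResolver</value>
  </property>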

Spark Push-Based Shuffle causing multiple stage failures

2022-05-23 Thread Han Altae-Tran
Hi, First of all, I am very thankful for all of the amazing work that goes into this project! It has opened up so many doors for me! I am a long-time Spark user and was very excited to start working with the push-based shuffle service for an academic paper we are working on, but I encountered
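
For context, the client side of push-based shuffle (available from Spark 3.2 on YARN) is switched on with roughly the following documented properties; this is a sketch of the toggles, not the poster's actual configuration:

  spark.shuffle.push.enabled      true
  spark.shuffle.service.enabled   true

The server side additionally needs the NodeManager setting quoted in the reply above.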

Listening to ExternalCatalogEvent in Spark 3

2021-11-24 Thread Khai Tran
Hello community, Previously in Spark 2.4, we listened for and captured ExternalCatalogEvent in the "onOtherEvent()" method of a SparkListener, but with Spark 3 we no longer see those events. Just wondering whether there is any behavior change in how ExternalCatalogEvent is emitted in Spark 3, and if so, where should I
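
For reference, the Spark 2.4-era pattern described here looks roughly like this; the listener and registration APIs are real, and whether these events still reach listeners in Spark 3 is exactly the open question:

  import org.apache.spark.scheduler.SparkListener;
  import org.apache.spark.scheduler.SparkListenerEvent;
  import org.apache.spark.sql.catalyst.catalog.ExternalCatalogEvent;

  public class CatalogEventListener extends SparkListener {
    @Override
    public void onOtherEvent(SparkListenerEvent event) {
      // ExternalCatalogEvent covers CreateTableEvent, DropTableEvent, and friends.
      if (event instanceof ExternalCatalogEvent) {
        System.out.println("Catalog event: " + event);
      }
    }
  }

  // Registered via: spark.sparkContext().addSparkListener(new CatalogEventListener());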

unsubscribe

2019-12-09 Thread Calvin Tran
unsubscribe

Re: Spark app writes too many small parquet files

2016-12-08 Thread Kevin Tran
…partitions. This controls the number of files generated. On 28 Nov 2016 8:29 p.m., "Kevin Tran" <kevin...@gmail.com> wrote: Hi Denny, thank you for your inputs. I also use 128 MB, but still too many files are generated by the Spark app, which is only ~14
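
In the Java API of that era, the advice above (control the number of partitions at write time, since each output partition becomes one file) looks roughly like this; the partition count and path are placeholders:

  import org.apache.spark.sql.SaveMode;

  // Coalescing before the write caps the number of parquet files produced.
  df.coalesce(8)
    .write()
    .mode(SaveMode.Append)
    .parquet("hdfs:///data/out");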

Re: Spark app writes too many small parquet files

2016-11-28 Thread Kevin Tran
…Ha's presentation Data Storage Tips for Optimal Spark Performance <https://spark-summit.org/2015/events/data-storage-tips-for-optimal-spark-performance/>. On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran <kevin...@gmail.com> wrote: Hi Everyone,

Spark app writes too many small parquet files

2016-11-27 Thread Kevin Tran
Hi Everyone, Does anyone know the best practice for writing parquet files from Spark? When the Spark app writes data to parquet, the output directory ends up holding heaps of very small parquet files (such as e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only 15KB

Extract timestamp from Kafka message

2016-09-25 Thread Kevin Tran
Hi Everyone, Does anyone know how we could extract the timestamp from a Kafka message in Spark Streaming?

  JavaPairInputDStream<String, String> messagesDStream = KafkaUtils.createDirectStream(
      ssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
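
The direct stream above is the Kafka 0.8-style API, whose messages carry no timestamp at all; record timestamps only exist from Kafka 0.10 onward. A sketch with the spark-streaming-kafka-0-10 integration, where the value is exposed on ConsumerRecord (topics and kafkaParams are assumed to be defined):

  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.spark.streaming.api.java.JavaDStream;
  import org.apache.spark.streaming.api.java.JavaInputDStream;
  import org.apache.spark.streaming.kafka010.ConsumerStrategies;
  import org.apache.spark.streaming.kafka010.KafkaUtils;
  import org.apache.spark.streaming.kafka010.LocationStrategies;

  JavaInputDStream<ConsumerRecord<String, String>> stream =
      KafkaUtils.createDirectStream(
          ssc,
          LocationStrategies.PreferConsistent(),
          ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

  // Each record carries the timestamp assigned by the producer or broker.
  JavaDStream<Long> timestamps = stream.map(record -> record.timestamp());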

Add sqldriver.jar to Spark 1.6.0 executors

2016-09-14 Thread Kevin Tran
Hi Everyone, I tried the following in cluster mode on YARN:
  * spark-submit --jars /path/sqldriver.jar
  * --driver-class-path
  * spark-env.sh: SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/path/*"
  * spark-defaults.conf: spark.driver.extraClassPath and spark.executor.extraClassPath
None of them works for me! Does
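
A combination often suggested for YARN cluster mode (a sketch, not verified against this thread's setup; paths are placeholders): ship the jar with --jars, then reference it by bare file name in the extraClassPath settings, since --jars copies it into each container's working directory:

  spark-submit \
    --master yarn --deploy-mode cluster \
    --jars /path/sqldriver.jar \
    --conf spark.driver.extraClassPath=sqldriver.jar \
    --conf spark.executor.extraClassPath=sqldriver.jar \
    application.jar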

Re: call() function being called 3 times

2016-09-07 Thread Kevin Tran
… INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 12.0 (TID 12). 2518 bytes result sent to driver. Does anyone have any ideas? On Wed, Sep 7, 2016 at 7:30 PM, Kevin Tran <kevin...@gmail.com> wrote: Hi Everyone, Does anyone know why the call() function is bei

call() function being called 3 times

2016-09-07 Thread Kevin Tran
Hi Everyone, Does anyone know why the call() function is being called *3 times* for each message that arrives?

  JavaDStream<String> message = messagesDStream.map(
      new Function<Tuple2<String, String>, String>() {
        @Override
        public String call(Tuple2<String, String> tuple2) {
          return tuple2._2();
        }
      });
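
One plausible cause, offered as an assumption since the rest of the job is not shown: if several output operations consume the mapped stream, each one re-evaluates the map function unless the stream is persisted. A sketch:

  // With cache(), downstream output operations (print, saveAs..., foreachRDD)
  // reuse one evaluation of call() per message instead of recomputing it.
  JavaDStream<String> message = messagesDStream
      .map(tuple2 -> tuple2._2())
      .cache();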

Re: Best ID Generator for ID field in parquet?

2016-09-04 Thread Kevin Tran

Best ID Generator for ID field in parquet?

2016-09-04 Thread Kevin Tran
Hi everyone, Please give me your opinions on the best ID generator for an ID field in parquet:

  UUID.randomUUID();
  AtomicReference<Long> currentTime = new AtomicReference<>(System.currentTimeMillis());
  AtomicLong counter = new AtomicLong(0);

Thanks, Kevin.
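
Alongside the three candidates above, Spark's built-in generator deserves a mention (a sketch; df is an assumed DataFrame). Available as functions.monotonically_increasing_id() since 1.6, it yields ids that are unique but deliberately not consecutive:

  import static org.apache.spark.sql.functions.monotonically_increasing_id;

  // Packs the partition id into the upper bits and a per-partition counter
  // into the lower bits, so rows get unique ids without any coordination.
  DataFrame withId = df.withColumn("id", monotonically_increasing_id());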

Re: Best practices for storing data in Parquet files

2016-08-28 Thread Kevin Tran

Best practices for storing data in Parquet files

2016-08-28 Thread Kevin Tran
Hi, Does anyone know the best practices for storing data in parquet files? Do parquet files have a size limit (1TB)? Should we use SaveMode.Append for a long-running streaming app? How should we lay the data out in HDFS (directory structure, ...)? Thanks, Kevin.
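
On the SaveMode.Append and directory-structure questions, one common layout partitions by a date column so each day lands in its own subdirectory; a sketch with assumed column and path names:

  df.write()
    .mode(SaveMode.Append)
    .partitionBy("event_date")
    .parquet("hdfs:///data/events");

  // Yields hdfs:///data/events/event_date=2016-08-28/part-*.parquet and so on.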

How many characters can Spark StringType hold?

2016-08-28 Thread Kevin Tran
Hi, I wrote to a parquet file the following:

  +--------------------------+
  |word                      |
  +--------------------------+
  |THIS IS MY CHARACTERS ... |
  |// ANOTHER LINE OF CHAC...|
  +--------------------------+

These lines are not the full text; they are being trimmed down. Does anyone know how many characters StringType
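
The "..." in the printout is most likely show()'s display truncation rather than a storage limit; StringType declares no fixed maximum length. Passing truncate=false prints the full values:

  // show(numRows, truncate): with truncate set to false, cells are not trimmed.
  df.show(20, false);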

Write parquet files from Spark Streaming

2016-08-27 Thread Kevin Tran
Hi Everyone, Does anyone know how to write parquet files after parsing data in Spark Streaming? Thanks, Kevin.
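
A sketch of the usual pattern in the Spark 1.6-era Java API: convert each micro-batch to a DataFrame and append it as parquet. MyRecord, its parse method, and the output path are assumptions standing in for the poster's own parsing step:

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.sql.DataFrame;
  import org.apache.spark.sql.SQLContext;
  import org.apache.spark.sql.SaveMode;

  messages.foreachRDD(rdd -> {
    SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
    JavaRDD<MyRecord> parsed = rdd.map(line -> MyRecord.parse(line)); // assumed parser
    DataFrame df = sqlContext.createDataFrame(parsed, MyRecord.class);
    df.write().mode(SaveMode.Append).parquet("hdfs:///data/parsed");
  });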

Doing record linkage using string comparators in Spark

2016-07-13 Thread Linh Tran
Hi guys, I'm hoping that someone can help me make my setup more efficient. I'm trying to do record linkage across 2.5 billion records and have set myself up in Spark to handle the data. Right now, I'm relying on R (with the stringdist and RecordLinkage packages) to do the actual
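
If plain edit distance is enough for the comparator, Spark can compute it without round-tripping through R; a sketch against the DataFrame API (table and column names are assumptions, and a real run at 2.5 billion records would need blocking keys to avoid a full cross join):

  import static org.apache.spark.sql.functions.levenshtein;

  // Join candidate pairs whose names are within edit distance 2.
  DataFrame matches = left.join(
      right,
      levenshtein(left.col("name"), right.col("name")).lt(3));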

Kryo serialization fails when using SparkSQL and HiveContext

2015-12-14 Thread Linh M. Tran
…and submitting the Spark application to YARN in cluster mode. Any help is appreciated. -- Linh M. Tran
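
The snippet is too truncated to recover the failing class, but the usual first step with Kryo failures is explicit registration; a generic sketch, with placeholder class names:

  SparkConf conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Register the application classes Kryo chokes on.
      .registerKryoClasses(new Class<?>[]{ MyRecord.class, MyKey.class });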

Spark Streaming Standalone 1.5 - Stage cancelled because SparkContext was shut down

2015-09-29 Thread An Tran
Hello All, I have several Spark Streaming applications running in standalone mode on Spark 1.5. Spark is currently set up for dynamic resource allocation. The issue I am seeing is that I can have about 12 Spark Streaming jobs running concurrently. Occasionally I would see more than half where

Spark SQL Error

2015-07-27 Thread An Tran
Hello all, I am currently hitting an error with Spark SQL accessing Elasticsearch through the Elasticsearch Spark integration. Below is the series of commands I issued along with the stacktrace. I am unclear on what the error could mean. I can print the schema correctly but error out if I try to display a
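
For context, the elasticsearch-hadoop read path being described is roughly the following (a sketch; the index/type name is a placeholder). Schema printing succeeding while display fails is consistent with the error only surfacing once documents are actually fetched:

  DataFrame es = sqlContext.read()
      .format("org.elasticsearch.spark.sql")
      .load("myindex/mytype");

  es.printSchema();  // works: the schema is derived from the ES mapping
  es.show();         // fails only here, when documents are scanned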