unsubscribe

2023-11-07 Thread Kelvin Qin
unsubscribe

Re: Question about how parquet files are read and processed

2020-04-15 Thread Kelvin Qin
Hi,
Parquet is a columnar storage format, so its advantage is that it scans only
the required columns.
The fewer columns you select, the less memory is required.
Developers do not need to care about the details of loading the data; the
readers are well designed and transparent to users.
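
For example, a minimal sketch in Spark (the paths and column names below are
made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParquetColumnPruning")
  .getOrCreate()

// Selecting a few columns prunes the scan: Parquet's columnar layout lets
// Spark read only those column chunks and skip the rest of the file.
val slim = spark.read
  .parquet("hdfs:///data/events")           // made-up input path
  .select("user_id", "event_time", "label") // made-up field names

// Write the derived table so that the smaller cluster only loads this.
slim.write.parquet("hdfs:///data/derived/events_slim")

The scan cost is driven mostly by the columns you actually select, not by
how many columns the file contains.
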
At 2020-04-16 11:00:32, "Yeikel"  wrote:
>I have a parquet file with millions of records and hundreds of fields that I
>will be extracting from a cluster with more resources. I need to take that
>data, derive a set of tables from only some of the fields, and import them
>using a smaller cluster.
>
>The smaller cluster cannot load the entire parquet file in memory, but it
>can load the derived tables.
>
>If I am reading a parquet file and I only select a few fields, how much
>computing power do I need compared to selecting all the columns? Is it
>different? Do I need more or less computing power depending on the number of
>columns I select, or does it depend more on the raw source itself and the
>number of columns it contains?
>
>One suggestion I received from a colleague was to derive the tables using the
>larger cluster and just import them into the smaller cluster, but I was
>wondering if that's really necessary considering that after the import I
>won't be using the dumps anymore.
>
>I hope my question makes sense.
>
>Thanks for your help!


Re: [Structured Streaming] Checkpoint file compact file grows big

2020-04-15 Thread Kelvin Qin



See: http://spark.apache.org/docs/2.3.1/streaming-programming-guide.html#checkpointing
"Note that checkpointing of RDDs incurs the cost of saving to reliable storage. 
This may cause an increase in the processing time of those batches where RDDs 
get checkpointed."


As far as I know, the official documentation states that the checkpoint data
of a Spark Streaming application will continue to grow over time, and data or
RDD checkpointing is necessary even for basic functioning if stateful
transformations are used.
So, for applications that require long-term aggregation, I use third-party
stores in production, such as Redis. Maybe you can try Alluxio.
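
As a rough, untested sketch of that pattern with foreachBatch (the Kafka
brokers and topic, the Redis address, the Jedis client, and the checkpoint
path are all placeholders; error handling is omitted):

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import redis.clients.jedis.Jedis

val spark = SparkSession.builder().appName("AggToRedis").getOrCreate()

// Hypothetical Kafka source of events keyed by user id.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder brokers
  .option("subscribe", "events")                    // placeholder topic
  .load()
  .selectExpr("CAST(key AS STRING) AS user")

// Aggregate within each micro-batch only, so Spark itself keeps no
// long-term state; Redis accumulates the running totals across batches.
val writeToRedis: (DataFrame, Long) => Unit = (batch, _) =>
  batch.groupBy("user").count()
    .foreachPartition { (rows: Iterator[Row]) =>
      val jedis = new Jedis("redis-host", 6379) // placeholder address
      rows.foreach { r =>
        jedis.incrBy("user_count:" + r.getAs[String]("user"),
          r.getAs[Long]("count"))
      }
      jedis.close()
    }

val query = events.writeStream
  .option("checkpointLocation", "hdfs:///checkpoints/agg-to-redis")
  .foreachBatch(writeToRedis)
  .start()

Since nothing stateful runs inside Spark here, the checkpoint mainly tracks
Kafka offsets and stays small.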




Wishes!

On 2020-04-16 08:19:24, "Ahn, Daniel" wrote:

Are Spark Structured Streaming checkpoint files expected to grow over time 
indefinitely? Is there a recommended way to safely age-off old checkpoint data?

 

Currently we have a Spark Structured Streaming process reading from Kafka and
writing to an HDFS sink, with checkpointing enabled and writing to a location
on HDFS. This streaming application has been running for 4 months, and over
time we have noticed that with every 10th job within the application there is
about a 5-minute delay between when a job finishes and the next job starts,
which we have attributed to the checkpoint compaction process. At this point
the .compact file that is written is about 2GB in size, and the contents of
the file show it keeps track of files it processed at the very origin of the
streaming application.

 

This issue can be reproduced with any Spark Structured Streaming process that 
writes checkpoint files.

 

Is the best approach for handling the growth of these files to simply delete 
the latest .compact file within the checkpoint directory, and are there any 
associated risks with doing so?

 



Re: RE: Going it alone.

2020-04-15 Thread Kelvin Qin
No wonder I couldn't understand what that mail was expressing; it feels like
a joke…

On 2020-04-16 02:28:49, seemanto.ba...@nomura.com.INVALID wrote:

Have we been tricked by a bot?

 

From: Matt Smith 
Sent: Wednesday, April 15, 2020 2:23 PM
To: jane thorpe
Cc: dh.lo...@gmail.com; user@spark.apache.org; janethor...@aol.com; 
em...@yeikel.com
Subject: Re: Going it alone.

 


This is so entertaining.

 

1. Ask for help

2. Compare those you need help from to a lower order primate.

3. Claim you provided information you did not

4. Explain that providing any information would be "too revealing"

5. ???

 

Can't wait to hear what comes next, but please keep it up.  This is a bright 
spot in my day.

 

 

On Tue, Apr 14, 2020 at 4:47 PM jane thorpe  wrote:

I did write a long email in response to you.
But then I deleted it because I felt it would be too revealing.

On Tuesday, 14 April 2020 David Hesson  wrote:

I want to know if Spark is headed in my direction.

You are implying Spark could be.

 

What direction are you headed in, exactly? I don't feel as if anything were 
implied when you were asked for use cases or what problem you are solving. You 
were asked to identify some use cases, of which you don't appear to have any.

 

On Tue, Apr 14, 2020 at 4:49 PM jane thorpe  wrote:

That's what I want to know: use cases.
I am looking for direction, as I described, and I want to know if Spark is
headed in my direction.

You are implying Spark could be.

So tell me about the USE CASES and I'll do the rest.

On Tuesday, 14 April 2020 yeikel valdes  wrote:

It depends on your use case. What are you trying to solve? 

 


On Tue, 14 Apr 2020 15:36:50 -0400, janethor...@aol.com.INVALID wrote:

Hi,

I consider myself to be quite good in Software Development, especially using
frameworks.

I like to get my hands dirty. I have spent the last few months understanding
modern frameworks and architectures.

I am looking to invest my energy in a product where I don't have to rely on
the monkeys which occupy this space we call software development.

I have found one that meets my requirements.

Would Apache Spark be a good tool for me, or do I need to be a member of a
team to develop products using Apache Spark?
