Re: Question about how parquet files are read and processed
Hi,

Parquet is a columnar storage format, so its advantage is that it scans only the required columns. The fewer columns you select, the less memory is required. Developers do not need to care about the details of loading the data; column pruning is built into the format and transparent to users.

At 2020-04-16 11:00:32, "Yeikel" wrote:
>I have a parquet file with millions of records and hundreds of fields that I
>will be extracting from a cluster with more resources. I need to take that
>data, derive a set of tables from only some of the fields, and import them
>using a smaller cluster.
>
>The smaller cluster cannot load the entire parquet file in memory, but it
>can load the derived tables.
>
>If I am reading a parquet file and I only select a few fields, how much
>computing power do I need compared to selecting all the columns? Is it
>different? Do I need more or less computing power depending on the number of
>columns I select, or does it depend more on the raw source itself and the
>number of columns it contains?
>
>One suggestion I received from a colleague was to derive the tables using
>the larger cluster and just import them into the smaller cluster, but I was
>wondering if that's really necessary considering that after the import, I
>won't be using the dumps anymore.
>
>I hope my question makes sense.
>
>Thanks for your help!
>
>--
>Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>-
>To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: [Structured Streaming] Checkpoint compact file grows big
See: http://spark.apache.org/docs/2.3.1/streaming-programming-guide.html#checkpointing

"Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may cause an increase in the processing time of those batches where RDDs get checkpointed."

As far as I know, the official documentation states that the checkpoint data of a Spark streaming application will continue to grow over time, whereas data/RDD checkpointing is necessary even for basic functioning when stateful transformations are used. So, for applications that require long-term aggregation, I choose to use a third-party cache in production, such as Redis. Maybe you can also try Alluxio.

Best wishes!

On 2020-04-16 08:19:24, "Ahn, Daniel" wrote:

Are Spark Structured Streaming checkpoint files expected to grow over time indefinitely? Is there a recommended way to safely age off old checkpoint data?

Currently we have a Spark Structured Streaming process reading from Kafka and writing to an HDFS sink, with checkpointing enabled and writing to a location on HDFS. This streaming application has been running for 4 months, and over time we have noticed that with every 10th job within the application there is about a 5 minute delay between when a job finishes and the next job starts, which we have attributed to the checkpoint compaction process. At this point the .compact file that is written is about 2GB in size, and its contents show it still tracks files processed at the very origin of the streaming application.

This issue can be reproduced with any Spark Structured Streaming process that writes checkpoint files.

Is the best approach for handling the growth of these files to simply delete the latest .compact file within the checkpoint directory, and are there any associated risks with doing so?
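For what it's worth, the every-10th-batch pause described in this thread matches the default compact interval of 10 for the file-sink metadata log. A sketch of the settings involved (these are internal, sparsely documented options; the names and defaults below are taken from Spark's SQLConf, so verify them against your Spark version before relying on them):

```properties
# Compact the file-sink metadata log every N batches (default: 10).
spark.sql.streaming.fileSink.log.compactInterval=10

# Whether expired metadata log entries may be deleted at all (default: true).
spark.sql.streaming.fileSink.log.deletion=true

# How long superseded log files are kept before cleanup (default: 10 minutes).
spark.sql.streaming.fileSink.log.cleanupDelay=10m

# How many batches' worth of checkpoint metadata to retain (default: 100).
spark.sql.streaming.minBatchesToRetain=100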
Re: RE: Going it alone.
No wonder I said I couldn't understand what the mail was expressing; it feels like a joke…

On 2020-04-16 02:28:49, seemanto.ba...@nomura.com.INVALID wrote:

Have we been tricked by a bot?

From: Matt Smith
Sent: Wednesday, April 15, 2020 2:23 PM
To: jane thorpe
Cc: dh.lo...@gmail.com; user@spark.apache.org; janethor...@aol.com; em...@yeikel.com
Subject: Re: Going it alone.

This is so entertaining.

1. Ask for help
2. Compare those you need help from to a lower order primate.
3. Claim you provided information you did not
4. Explain that providing any information would be "too revealing"
5. ???

Can't wait to hear what comes next, but please keep it up. This is a bright spot in my day.

On Tue, Apr 14, 2020 at 4:47 PM jane thorpe wrote:

I did write a long email in response to you. But then I deleted it because I felt it would be too revealing.

On Tuesday, 14 April 2020 David Hesson wrote:

"I want to know if Spark is headed in my direction. You are implying Spark could be."

What direction are you headed in, exactly? I don't feel as if anything were implied when you were asked for use cases or what problem you are solving. You were asked to identify some use cases, of which you don't appear to have any.

On Tue, Apr 14, 2020 at 4:49 PM jane thorpe wrote:

That's what I want to know: use cases. I am looking for direction as I described, and I want to know if Spark is headed in my direction. You are implying Spark could be. So tell me about the USE CASES and I'll do the rest.

On Tuesday, 14 April 2020 yeikel valdes wrote:

It depends on your use case. What are you trying to solve?

On Tue, 14 Apr 2020 15:36:50 -0400 janethor...@aol.com.INVALID wrote:

Hi, I consider myself to be quite good in software development, especially using frameworks. I like to get my hands dirty.
I have spent the last few months understanding modern frameworks and architectures. I am looking to invest my energy in a product where I don't have to rely on the monkeys which occupy this space we call software development. I have found one that meets my requirements.

Would Apache Spark be a good tool for me, or do I need to be a member of a team to develop products using Apache Spark?