Re: A simple comparison for three SQL engines
May I forward this report to the Spark list as well? Thanks.

Wes Peng wrote:
> Hello,
>
> This weekend I ran a test against a big dataset; Spark, Drill, MySQL, and PostgreSQL were involved. This is the final report:
> https://blog.cloudcache.net/handles-the-file-larger-than-memory/
>
> The short conclusion:
> 1. Spark is the fastest for this scale of data and limited memory.
> 2. Drill is close to Spark.
> 3. PostgreSQL shows surprising query-speed behavior.
> 4. MySQL is really slow.
>
> If you have found any issue, please let me know. Thanks.

Wes Peng wrote:
> Sure, I will take time to do it.

Sanel Zukan wrote:
> Any chance you can try with Postgres >= 12, default configuration, with the same indexed columns as with MySQL?

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
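The report itself does not spell out how the timings were taken; a minimal wall-clock harness along the lines below is one common way to do it (the query callable is a placeholder assumption — in practice it would be a DB-API `cursor.execute`, a Drill REST call, or a Spark SQL action for the respective engine):

```python
import time

def time_query(run_query, repeats=3):
    """Run a query callable several times and return the best
    wall-clock time in seconds. Taking the best of N reduces noise
    from caching and OS scheduling."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        best = min(best, time.perf_counter() - start)
    return best

# Dummy "query" standing in for an engine call:
t = time_query(lambda: sum(range(100_000)))
print(f"best of 3: {t:.4f}s")
```

Comparing the best-of-N number per engine, rather than a single run, makes results like the ones in the report less sensitive to cold caches.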
Re: Executorlost failure
I just ran a test: even on a single node (local deployment), Spark can handle data whose size is much larger than the total memory.

My test VM (2 GB RAM, 2 cores):

$ free -m
               total        used        free      shared  buff/cache   available
Mem:            1992        1845          92          19          54          36
Swap:           1023         285         738

The data size:

$ du -h rate.csv
3.2G    rate.csv

Loading this file into Spark and running an aggregation completes without error:

scala> val df = spark.read.format("csv").option("inferSchema", true).load("skydrive/rate.csv")
val df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 2 more fields]

scala> df.printSchema
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: integer (nullable = true)

scala> df.groupBy("_c1").agg(avg("_c2").alias("avg_rating")).orderBy(desc("avg_rating")).show
+----------+----------+
|       _c1|avg_rating|
+----------+----------+
|    000136|       5.0|
|0001711474|       5.0|
|0001360779|       5.0|
|0001006657|       5.0|
|0001361155|       5.0|
|0001018043|       5.0|
|000136118X|       5.0|
|    202010|       5.0|
|0001371037|       5.0|
|    401048|       5.0|
|0001371045|       5.0|
|0001203010|       5.0|
|0001381245|       5.0|
|0001048236|       5.0|
|0001436163|       5.0|
|000104897X|       5.0|
|0001437879|       5.0|
|0001056107|       5.0|
|0001468685|       5.0|
|0001061240|       5.0|
+----------+----------+
only showing top 20 rows

So as you can see, Spark handles a file larger than its memory well. :) Thanks

rajat kumar wrote:
> With autoscaling we can have any number of executors.
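For comparison, the same average-rating aggregation can be done out-of-core in plain Python by streaming the CSV row by row, so memory use depends on the number of distinct products rather than the file size (column positions follow the schema in the session above; the file path would be supplied by the caller):

```python
import csv
from collections import defaultdict

def avg_rating_by_product(path):
    """Stream a ratings CSV (_c0 user, _c1 product, _c2 rating, _c3 timestamp)
    and compute the average rating per product without loading the
    whole file into memory."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            product, rating = row[1], float(row[2])
            sums[product] += rating
            counts[product] += 1
    return {p: sums[p] / counts[p] for p in sums}
```

This is essentially what Spark does per-partition before it shuffles the partial sums and counts between executors; the streaming approach is why neither needs the full file in memory.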
Re: Executorlost failure
I once had a 100+ GB file computed on 3 nodes, each with only 24 GB of memory, and the job completed fine. So from my experience a Spark cluster works correctly on files larger than memory by spilling the excess to disk. Thanks

rajat kumar wrote:
> Tested this with executors of size 5 cores, 17GB memory. Data volume is really high, around 1TB.
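For what it's worth, how much memory Spark uses before spilling, and where the spill files go, is controlled by a few settings; a minimal `spark-defaults.conf` sketch (the values shown are the stock defaults and the scratch directory is an illustrative assumption, not a tuning recommendation):

```
# spark-defaults.conf (illustrative, not a tuning recommendation)
spark.memory.fraction         0.6    # fraction of heap shared by execution and storage
spark.memory.storageFraction  0.5    # portion of that protected from eviction for cached data
spark.local.dir               /tmp/spark-scratch   # where shuffle and spill files are written
```

With a 1 TB input, making sure `spark.local.dir` points at a disk with enough free space matters as much as executor memory, since that is where the spilled data lands.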
Re: Executorlost failure
How many executors do you have?

rajat kumar wrote:
> Tested this with executors of size 5 cores, 17GB memory. Data volume is really high, around 1TB.
query time comparison to several SQL engines
I made a simple test of query time for several SQL engines, including MySQL, Hive, Drill, and Spark. The report:

https://cloudcache.net/data/query-time-mysql-hive-drill-spark.pdf

It may have no special meaning; just for fun. :) Regards.
Re: Profiling spark application
Have a look at this: https://github.com/LucaCanali/sparkMeasure

On 2022/1/20 1:18, Prasad Bhalerao wrote:
> Is there any way we can profile Spark applications that will show the number of invocations of Spark APIs and their execution time, etc., just the way JProfiler shows all the details?
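The invocation-count/execution-time part of the question — what sparkMeasure automates at the Spark stage level — can be sketched generically with a small profiling decorator in plain Python (not Spark-specific; the `example_step` function is a stand-in for whatever you want to measure):

```python
import time
import functools
from collections import defaultdict

# Per-function stats: number of calls and cumulative wall-clock time.
call_stats = defaultdict(lambda: {"calls": 0, "total_s": 0.0})

def profiled(fn):
    """Record invocation count and cumulative wall-clock time per function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            stats = call_stats[fn.__name__]
            stats["calls"] += 1
            stats["total_s"] += time.perf_counter() - start
    return wrapper

@profiled
def example_step():
    return sum(range(1000))

example_step()
example_step()
print(call_stats["example_step"]["calls"])  # 2
```

sparkMeasure goes further by hooking Spark's listener interface, so you get task and stage metrics (shuffle bytes, GC time, etc.) rather than just wall-clock totals.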
Re: [Pyspark] How to download Zip file from SFTP location and put in into Azure Data Lake and unzip it
How large is the file? In my experience, reading an Excel file from the data lake and loading it as a DataFrame works great. Thanks

On 2022-01-18 22:16, Heta Desai wrote:
> Hello,
>
> I have zip files on an SFTP location. I want to download/copy those files and put them into Azure Data Lake. Once the zip files are stored in Azure Data Lake, I want to unzip them and read them using DataFrames. The file format inside the zip is Excel. So, once the files are unzipped, I want to read the Excel files using Spark DataFrames. Please help me with a solution as soon as possible.
>
> Thanks,
> Heta Desai | Data | Sr Associate L1
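The unzip step in the middle of that pipeline is plain Python; a minimal sketch (the SFTP download and the ADLS upload are not shown — those would typically use libraries such as paramiko and azure-storage-file-datalake, which is an assumption about tooling, not something stated in the thread):

```python
import zipfile
from pathlib import Path

def extract_zip(zip_path, out_dir):
    """Extract every member of a zip archive into out_dir and
    return the list of extracted file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out)
        return [out / name for name in zf.namelist()]
```

Once extracted, the `.xlsx` files can be read into DataFrames — e.g. with the third-party spark-excel package, or via `pandas.read_excel` followed by `spark.createDataFrame` for small files.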
Re: ivy unit test case filing for Spark
Are you using Ivy through a VPN that causes this problem? If the VPN software silently rewrites network URLs, you should avoid using it. Regards.

On Wed, Dec 22, 2021 at 1:48 AM Pralabh Kumar wrote:
> Hi Spark Team
>
> I am building Spark inside a VPN, but the unit test below is failing. It points to an Ivy location which cannot be reached from within the VPN. Any help would be appreciated.
>
> test("SPARK-33084: Add jar support Ivy URI -- default transitive = true") {
>   sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local-cluster[3, 1, 1024]"))
>   sc.addJar("ivy://org.apache.hive:hive-storage-api:2.7.0")
>   assert(sc.listJars().exists(_.contains("org.apache.hive_hive-storage-api-2.7.0.jar")))
>   assert(sc.listJars().exists(_.contains("commons-lang_commons-lang-2.6.jar")))
> }
>
> Error:
>
> - SPARK-33084: Add jar support Ivy URI -- default transitive = true *** FAILED ***
>   java.lang.RuntimeException: [unresolved dependency: org.apache.hive#hive-storage-api;2.7.0: not found]
>   at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1447)
>   at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
>   at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:159)
>   at org.apache.spark.SparkContext.addJar(SparkContext.scala:1996)
>   at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928)
>   at org.apache.spark.SparkContextSuite.$anonfun$new$115(SparkContextSuite.scala:1041)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>
> Regards
> Pralabh Kumar
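If the VPN blocks Maven Central, one workaround is pointing Ivy at an internal mirror via Spark's `spark.jars.ivySettings` option; a sketch of the `ivysettings.xml` (the mirror URL is a placeholder assumption — substitute your organization's repository):

```xml
<!-- ivysettings.xml: resolve artifacts from an internal mirror
     instead of Maven Central (the root URL below is a placeholder) -->
<ivysettings>
  <settings defaultResolver="corp-mirror"/>
  <resolvers>
    <ibiblio name="corp-mirror"
             m2compatible="true"
             root="https://repo.example.corp/maven2/"/>
  </resolvers>
</ivysettings>
```

Then run with `--conf spark.jars.ivySettings=/path/to/ivysettings.xml` (or the equivalent system property in the test JVM) so `sc.addJar("ivy://…")` resolves against the reachable mirror.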