Re: A simple comparison for three SQL engines

2022-04-09 Thread Wes Peng

May I forward this report to the spark list as well?

Thanks.

Wes Peng wrote:

Hello,

This weekend I ran a test against a big dataset; spark, drill, mysql, and 
postgresql were involved.


This is the final report:
https://blog.cloudcache.net/handles-the-file-larger-than-memory/

The simple conclusion:
1. spark is the fastest for this scale of data and limited memory
2. drill is close to spark
3. postgresql has surprising behavior in query speed
4. mysql is really slow

If you find any issues, please let me know.

Thanks

Wes Peng wrote:

Sure. I will take time to do it.


Sanel Zukan wrote:

Any chance you can try with Postgres >= 12, default configuration with
the same indexed columns as with MySQL?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Executorlost failure

2022-04-07 Thread Wes Peng
I just did a test: even on a single node (local deployment), spark can 
handle data whose size is much larger than the total memory.


My test VM (2g ram, 2 cores):

$ free -m
               total        used        free      shared  buff/cache   available
Mem:            1992        1845          92          19          54          36
Swap:           1023         285         738


The data size:

$ du -h rate.csv
3.2Grate.csv


Loading this file into spark for calculation can be done without error:

scala> val df = spark.read.format("csv").option("inferSchema", true).load("skydrive/rate.csv")
val df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 2 more fields]


scala> df.printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: integer (nullable = true)


scala> df.groupBy("_c1").agg(avg("_c2").alias("avg_rating")).orderBy(desc("avg_rating")).show
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
+----------+----------+
|       _c1|avg_rating|
+----------+----------+
|    000136|       5.0|
|0001711474|       5.0|
|0001360779|       5.0|
|0001006657|       5.0|
|0001361155|       5.0|
|0001018043|       5.0|
|000136118X|       5.0|
|    202010|       5.0|
|0001371037|       5.0|
|    401048|       5.0|
|0001371045|       5.0|
|0001203010|       5.0|
|0001381245|       5.0|
|0001048236|       5.0|
|0001436163|       5.0|
|000104897X|       5.0|
|0001437879|       5.0|
|0001056107|       5.0|
|0001468685|       5.0|
|0001061240|       5.0|
+----------+----------+
only showing top 20 rows


So as you can see, spark handles a file larger than its memory well. :)

Thanks


rajat kumar wrote:

With autoscaling we can have any number of executors.
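
For context, a rough sketch of the settings that typically control executor autoscaling (dynamic allocation). The values are placeholders, and shuffle tracking is assumed here because no external shuffle service is mentioned in the thread:

import org.apache.spark.SparkConf

// Dynamic allocation lets the executor count grow and shrink with the workload.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  // Needed when no external shuffle service is available (Spark 3.0+).
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")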


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Executorlost failure

2022-04-07 Thread Wes Peng
I once had a 100+GB file computed on 3 nodes, each with only 24GB of memory, 
and the job completed fine. So from my experience a spark cluster seems to 
handle files larger than memory correctly by spilling them to disk.
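
For what it's worth, a minimal sketch of explicitly caching a dataset larger than memory so that partitions which don't fit spill to local disk; the path is a placeholder:

import org.apache.spark.storage.StorageLevel

// With MEMORY_AND_DISK (the default storage level for Dataset.persist()),
// partitions that do not fit in memory are written to local disk instead of
// failing the job.
val df = spark.read.option("header", "true").csv("/path/to/big-file.csv")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  // materializes the cache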


Thanks

rajat kumar wrote:
Tested this with executors of 5 cores and 17GB memory each. Data volume is 
really high, around 1TB.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Executorlost failure

2022-04-07 Thread Wes Peng

How many executors do you have?

rajat kumar wrote:
Tested this with executors of 5 cores and 17GB memory each. Data volume is 
really high, around 1TB.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



query time comparison to several SQL engines

2022-04-07 Thread Wes Peng
I made a simple test of query time for several SQL engines, including 
mysql, hive, drill and spark. The report:


https://cloudcache.net/data/query-time-mysql-hive-drill-spark.pdf

It may have no special meaning, just for fun. :)

regards.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Profiling spark application

2022-01-19 Thread Wes Peng

Have a look at this:
https://github.com/LucaCanali/sparkMeasure
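
A minimal sketch of using it from the spark-shell, assuming the sparkMeasure package has been added to the session (for example via --packages ch.cern.sparkmeasure:spark-measure_2.12:<version>):

// Collect and print stage-level timing and task metrics for the wrapped action.
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.runAndMeasure {
  spark.sql("select count(*) from range(1000)").show()
}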

On 2022/1/20 1:18, Prasad Bhalerao wrote:
Is there any way we can profile spark applications to show the number of 
invocations of spark APIs and their execution times, etc., the way 
jprofiler shows all the details?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Pyspark] How to download Zip file from SFTP location and put in into Azure Data Lake and unzip it

2022-01-18 Thread Wes Peng
How large is the file? From my experience, reading the excel file from the 
data lake and loading it as a dataframe works great.
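
If it helps, a sketch of the read side in Scala (to match the other snippets in this digest). It assumes the third-party spark-excel data source (com.crealytics) is available on the cluster; the path and options are placeholders:

// Read an unzipped .xlsx from the data lake as a DataFrame using the
// third-party spark-excel data source (not bundled with Spark).
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")        // first row holds column names
  .option("inferSchema", "true")   // let the reader guess column types
  .load("abfss://container@account.dfs.core.windows.net/unzipped/report.xlsx")
df.printSchema()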


Thanks

On 2022-01-18 22:16, Heta Desai wrote:

Hello,

I have zip files on an SFTP location. I want to download/copy those
files and put them into Azure Data Lake. Once the zip files are stored in
Azure Data Lake, I want to unzip those files and read them using Data
Frames.

The file format inside the zip is excel. So, once the files are unzipped, I
want to read the excel files using spark DataFrames.

Please help me with the solution as soon as possible.

Thanks,

Heta Desai | Data | Sr Associate L1
e.heta.de...@1rivet.com | t. +91 966.225.4954



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: ivy unit test case filing for Spark

2021-12-21 Thread Wes Peng
Are you using IvyVPN, which may be causing this problem? If the VPN software
silently changes network URLs, you should avoid using it.
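
If the real issue is that the default public repositories are unreachable from inside the VPN, one thing to try (a sketch only; the settings file path is a placeholder) is pointing Spark's Ivy resolution at an internal mirror via spark.jars.ivySettings:

import org.apache.spark.{SparkConf, SparkContext}

// spark.jars.ivySettings supplies an ivysettings.xml whose resolvers point at
// a repository reachable inside the VPN; ivy:// URIs in addJar should then
// resolve through those settings instead of the default public repositories.
val conf = new SparkConf()
  .setAppName("test")
  .setMaster("local-cluster[3, 1, 1024]")
  .set("spark.jars.ivySettings", "/path/to/ivysettings.xml")
val sc = new SparkContext(conf)
sc.addJar("ivy://org.apache.hive:hive-storage-api:2.7.0")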

Regards.

On Wed, Dec 22, 2021 at 1:48 AM Pralabh Kumar wrote:

> Hi Spark Team
>
> I am building spark in a VPN, but the unit test case below is failing.
> It points to an ivy location which cannot be reached from within the VPN.
> Any help would be appreciated.
>
> test("SPARK-33084: Add jar support Ivy URI -- default transitive = true") {
>   sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local-cluster[3, 1, 1024]"))
>   sc.addJar("ivy://org.apache.hive:hive-storage-api:2.7.0")
>   assert(sc.listJars().exists(_.contains("org.apache.hive_hive-storage-api-2.7.0.jar")))
>   assert(sc.listJars().exists(_.contains("commons-lang_commons-lang-2.6.jar")))
> }
>
> Error
>
> - SPARK-33084: Add jar support Ivy URI -- default transitive = true *** FAILED ***
> java.lang.RuntimeException: [unresolved dependency: org.apache.hive#hive-storage-api;2.7.0: not found]
> at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1447)
> at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
> at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:159)
> at org.apache.spark.SparkContext.addJar(SparkContext.scala:1996)
> at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928)
> at org.apache.spark.SparkContextSuite.$anonfun$new$115(SparkContextSuite.scala:1041)
> at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
>
> Regards
> Pralabh Kumar
>
>
>