Profiling memory use and access
I am conducting experimental research into memory use, and into profiling of memory use and allocation by machine learning functions, across a number of popular libraries. Is there a facility within Spark, and MLlib specifically, to track the allocation and use of DataFrames/memory by MLlib? Please advise. I will acknowledge any contributions in a paper, or add you as a co-author if you make a significant contribution (and if interested). Thank you, Edmon
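One built-in facility worth checking (a hedged suggestion, not a dedicated MLlib profiler): Spark's event log, replayed through the History Server, preserves the per-RDD memory figures that the Storage tab of the web UI shows for cached DataFrames. A minimal spark-defaults.conf sketch, assuming a local event-log directory:

```properties
# conf/spark-defaults.conf -- enable event logging so storage/memory
# metrics survive the application and can be replayed in the History Server
spark.eventLog.enabled           true
spark.eventLog.dir               file:///tmp/spark-events
spark.history.fs.logDirectory    file:///tmp/spark-events
```

With this in place, persist the DataFrame of interest (e.g. `df.persist()` followed by an action such as `df.count()`), and the Storage tab reports its size in memory and on disk; the event log lets you inspect the same figures after the job finishes.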
When did Spark start supporting ORC and Parquet?
I need this fact for a research paper I am writing right now. When did Spark start supporting Parquet, and when did it start supporting ORC (i.e., in which releases)? I appreciate any info you can offer. Thank you, Edmon
Small-cluster deployment modes
Hey folks, I want to set up a single machine, or a small cluster, to run our Spark-based exploration lab. Does anyone have suggestions or metrics on the feasibility of running Spark standalone, without a resource manager, on a machine with a good amount of RAM (64 GB) and SSDs? I expect one or two users at a time, mostly running SQL or MLlib jobs. Also, any recommendations for the hardware? Thank you, Edmon
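For a single 64 GB box, Spark's standalone cluster manager (no YARN or Mesos needed) is a common setup. A sketch of conf/spark-env.sh under the assumption that you leave headroom for the OS and page cache; the exact numbers below are illustrative, not tuned:

```bash
# conf/spark-env.sh -- standalone mode on one 64 GB machine
SPARK_WORKER_MEMORY=48g   # leave ~16 GB for OS, page cache, and daemons
SPARK_WORKER_CORES=12     # match the physical cores you want executors to use
SPARK_DAEMON_MEMORY=1g    # heap for the master/worker JVMs themselves
```

You would then start the master with `sbin/start-master.sh` and a worker with `sbin/start-slave.sh spark://<master-host>:7077`; for one or two interactive users, `--master local[*]` on the same box is an even simpler alternative.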
Reducing Spark's logging verbosity
Hi, Does anyone have concrete recommendations on how to reduce Spark's logging verbosity? We have attempted on several occasions to address this by setting various log4j properties, both in configuration property files and in $SPARK_HOME/conf/spark-env.sh; however, all of those attempts have failed. Any suggestions are welcome. Thank you, Edmon
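The file that usually does the trick is not spark-env.sh but conf/log4j.properties (copy it from the shipped log4j.properties.template). A minimal sketch that raises the root threshold from INFO to WARN:

```properties
# conf/log4j.properties -- copied from log4j.properties.template,
# then the root category raised from INFO to WARN to quiet Spark's chatter
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

For a single session, `sc.setLogLevel("WARN")` on the SparkContext changes the level at runtime without touching any files (available in recent Spark releases).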
Spark on HDFS vs. Lustre vs. other file systems - formal research and performance evaluation
All, Does anyone have a reference to a publication, or to informal sources (blogs, notes), showing the performance of Spark on HDFS vs. shared file systems (Lustre, etc.) or other file systems (NFS)? I need this for formal performance research. We are currently researching this on a very specific, boutique machine, and we are seeing some counterintuitive results. For the purposes of a literature survey and general comparison, I would like to see the findings others have had. I know the general wisdom is that Spark on HDFS should work best because of data-locality awareness. Thank you, *Edmon Begoli, PhD* Chief Data Officer, Joint Institute for Computational Sciences (JICS) ebeg...@tennessee.edu https://www.linkedin.com/in/ebegoli
Spark-SQL and Hive - is Hive required?
Does Spark SQL require an installation of Hive in order to run correctly? I could not tell from this section of the documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive Thank you, Edmon
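A hedged note for fellow readers: a Spark build with Hive support bundles the Hive classes it needs, so no separate Hive installation is required to run Spark SQL. A conf/hive-site.xml is only needed if you want Spark SQL to talk to an existing Hive metastore; a sketch (the thrift URI below is a placeholder, not a real host):

```xml
<!-- conf/hive-site.xml: only needed to point Spark SQL at an
     existing Hive metastore; omit the file entirely otherwise -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
```

Without this file, Spark SQL still runs; Hive-backed features simply use a local, embedded metastore.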