Profiling memory use and access

2016-04-24 Thread Edmon Begoli
I am doing experimental research into profiling the memory use and
allocation of machine learning functions across a number of popular
libraries.

Is there a facility within Spark, and MLlib specifically, to track the
allocation and use of DataFrames/memory by MLlib?

Please advise.
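To make the question concrete, here is a minimal sketch of the kind of
instrumentation I have in mind. The toy data, app name, and caching choice
are illustrative assumptions only; SizeEstimator and getExecutorMemoryStatus
are existing Spark utilities, but this is a sketch, not a profiling harness.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.util.SizeEstimator

    object MemoryProbe {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("memory-probe").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Toy DataFrame standing in for MLlib feature data (hypothetical).
        val df = sc.parallelize(1 to 100000).map(i => (i, i.toDouble)).toDF("id", "feature")

        // Rough in-JVM size of the collected result, measured on the driver.
        println(s"Driver-side estimate: ${SizeEstimator.estimate(df.collect())} bytes")

        // Caching makes the per-partition footprint visible in the web UI's
        // Storage tab and, coarsely, via executor memory status.
        df.persist(StorageLevel.MEMORY_ONLY)
        df.count() // force materialization
        sc.getExecutorMemoryStatus.foreach { case (exec, (max, remaining)) =>
          println(s"$exec: max=$max bytes, remaining=$remaining bytes")
        }

        sc.stop()
      }
    }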

I will acknowledge any contributions in the paper, or add you as a co-author
if you make a significant contribution (and are interested).

Thank you,
Edmon


When did Spark start supporting ORC and Parquet?

2016-04-14 Thread Edmon Begoli
I need this fact for a research paper I am writing right now.

When did Spark start supporting Parquet, and when did it start supporting ORC?
(In which releases?)

I appreciate any info you can offer.

Thank you,
Edmon


Small-cluster deployment modes

2015-07-24 Thread Edmon Begoli
Hey folks,

I want to set up a single machine, or a small cluster, to run our
Spark-based exploration lab.

Does anyone have suggestions or metrics on the feasibility of running
Spark standalone on a machine with plenty of RAM (64 GB) and SSDs,
without a resource manager?

I expect one or two users at a time, mostly running SQL or MLlib jobs.

Also - any recommendations for the hardware?
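For concreteness, here is a minimal sketch of the single-box setup I have in
mind. The SSD path is a hypothetical placeholder and the memory figure below
is an illustrative assumption, not a tuned recommendation.

    import org.apache.spark.{SparkConf, SparkContext}

    object ExplorationLab {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("exploration-lab")
          .setMaster("local[*]")                     // single JVM, all cores, no master/worker daemons
          .set("spark.local.dir", "/mnt/ssd/spark")  // hypothetical SSD scratch space for shuffle spill

        val sc = new SparkContext(conf)
        // ... SQL or MLlib jobs here ...
        sc.stop()
      }
    }

Since the JVM heap is fixed at launch, the driver's memory would be given on
the command line rather than in the code, e.g.:
spark-submit --driver-memory 48g --class ExplorationLab lab.jar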

Thank you,
Edmon


Reducing Spark's logging verbosity

2015-03-21 Thread Edmon Begoli
Hi,
Does anyone have concrete recommendations on how to reduce Spark's logging
verbosity?

We have attempted on several occasions to address this by setting various
log4j properties, both in configuration property files and in
$SPARK_HOME/conf/spark-env.sh; however, all of those attempts have failed.
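For reference, one programmatic alternative is a sketch like the one below,
which raises the noisiest loggers from inside the driver. It assumes the
log4j 1.x API that Spark bundles, and WARN is just an illustrative choice.

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.{SparkConf, SparkContext}

    object QuietSpark {
      def main(args: Array[String]): Unit = {
        // Raise the chattiest loggers to WARN before Spark starts emitting INFO lines.
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
        Logger.getLogger("akka").setLevel(Level.WARN)

        val sc = new SparkContext(new SparkConf().setAppName("quiet").setMaster("local[*]"))
        sc.parallelize(1 to 10).count()
        sc.stop()
      }
    }

If the file-based route is preferred, the log4j settings generally belong in
conf/log4j.properties (copied from the shipped log4j.properties.template)
rather than in spark-env.sh, which only exports environment variables.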

Any suggestions are welcome.

Thank you,
Edmon


Spark on HDFS vs. Lustre vs. other file systems - formal research and performance evaluation

2015-03-13 Thread Edmon Begoli
All,

Does anyone have a reference to a publication or other informal sources
(blogs, notes) showing the performance of Spark on HDFS vs. shared file
systems (Lustre, etc.) or other file systems (NFS)?

I need this for formal performance research.

We are currently doing research into this on a very specific, boutique
machine, and we are seeing some conflicting results.

For the purposes of a literature survey and general comparison, I would like
to see the findings that others have had. I know the conventional wisdom is
that Spark and HDFS should work best together because of data locality
awareness.

Thank you,
Edmon Begoli, PhD
Chief Data Officer
Joint Institute for Computational Sciences (JICS)
ebeg...@tennessee.edu
https://www.linkedin.com/in/ebegoli


Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Edmon Begoli
Does Spark SQL require an installation of Hive in order to run correctly, or not?

I could not tell from this statement:
https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
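To illustrate the case I am asking about, here is a minimal sketch that uses
a plain SQLContext (no HiveContext) on a toy DataFrame. The data and names
are made up; the point is only that nothing Hive-related is installed or
configured in this sketch.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object NoHiveSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("no-hive").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)   // plain SQLContext, no Hive metastore involved
        import sqlContext.implicits._

        val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "label")
        df.registerTempTable("t")              // temporary table, lives only in this context
        sqlContext.sql("SELECT COUNT(*) FROM t").show()

        sc.stop()
      }
    }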

Thank you,
Edmon