[Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-28 Thread Zhang, Yuqi
Hello guys,

I am Yuqi from Teradata Tokyo. Sorry to disturb you, but I have a problem 
using the Spark 2.4 client-mode feature on a Kubernetes cluster, and I would 
like to ask whether there is a solution.

When I try to run spark-shell on a Kubernetes v1.11.3 cluster in an AWS 
environment, I cannot successfully run a stateful set using the Docker image 
built from Spark 2.4. The error message is shown below. The version I am 
using is Spark v2.4.0-rc3.

Also, I wonder whether there is more documentation on how to use client mode 
or how to run spark-shell against a Kubernetes cluster. The page at 
https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md 
gives only a brief description. I understand this is not the officially 
released version yet, but if there is any further documentation, could you 
please share it with me?

Thank you very much for your help!


Error msg:
+ env
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf 
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress= --deploy-mode client
Error: Missing application resource.
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
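
For reference, the entrypoint trace above shows spark-submit being invoked 
with an empty spark.driver.bindAddress and no application resource, which 
appears to be what the image's driver entrypoint does when the container is 
given no app jar. A minimal client-mode launch sketch, assuming the shell is 
started directly inside a pod that the executors can reach through a headless 
service (the API server address, image name, and service name are 
placeholders):

$SPARK_HOME/bin/spark-shell \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --conf spark.kubernetes.container.image=<spark-2.4-image> \
  --conf spark.kubernetes.namespace=default \
  --conf spark.driver.host=<headless-service-name> \
  --conf spark.driver.port=7078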


--
Yuqi Zhang
Software Engineer
m: 090-6725-6573



2 Chome-2-23-1 Akasaka
Minato, Tokyo 107-0052
teradata.com






[GraphX] - OOM Java Heap Space

2018-10-28 Thread Thodoris Zois
Hello,

I have the edges of a graph stored as parquet files (about 3GB). I am loading 
the graph and trying to compute the total number of triplets and triangles. 
Here is my code:

import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
import org.apache.spark.rdd.RDD

val edges_parq = sqlContext.read.option("header", "true").parquet(args(0) + "/year=" + year)
val edges: RDD[Edge[Int]] = edges_parq.rdd.map(row =>
  Edge(row(0).asInstanceOf[Int].toLong, row(1).asInstanceOf[Int].toLong))
val graph = Graph.fromEdges(edges, 1).partitionBy(PartitionStrategy.RandomVertexCut)

// The actual computation
val numberOfTriplets = graph.triplets.count
val tmp = graph.triangleCount().vertices.filter { case (vid, count) => count > 0 }
val numberOfTriangles = tmp.map(a => a._2).sum()

Even though it manages to compute the number of triplets, it cannot compute 
the number of triangles: every time, some executors throw an OOM (Java heap 
space) exception and the application fails.
I am using 100 executors (1 core and 6 GB per executor). I have also tried 
setting hdfsConf.set("mapreduce.input.fileinputformat.split.maxsize", "33554432") 
in the code, but still no results.

Here are some of my configurations:
--conf spark.driver.memory=20G 
--conf spark.driver.maxResultSize=20G 
--conf spark.yarn.executor.memoryOverhead=6144 
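
For completeness, a variant of the loading step that keeps the graph 
spillable to disk and spreads the edges over more partitions (a sketch only; 
the partition count of 400 is an arbitrary example). triangleCount 
materializes a neighbor set per vertex, which tends to be memory-hungry on 
large or skewed graphs:

import org.apache.spark.graphx.{Graph, PartitionStrategy}
import org.apache.spark.storage.StorageLevel

// Sketch: persist edge/vertex data as MEMORY_AND_DISK so it can spill,
// and spread the edges over more partitions before counting triangles.
val graph = Graph.fromEdges(
    edges, 1,
    edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
    vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
  .partitionBy(PartitionStrategy.RandomVertexCut, 400)  // 400 is an arbitrary example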

- Thodoris



Number of rows divided by rowsPerBlock cannot exceed maximum integer

2018-10-28 Thread Soheil Pourbafrani
Hi,
While doing a cartesian-style multiplication of a matrix with its transpose, I got this error:

pyspark.sql.utils.IllegalArgumentException: requirement failed: Number of
rows divided by rowsPerBlock cannot exceed maximum integer.

Here is the code:

from pyspark.ml.feature import Normalizer
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

normalizer = Normalizer(inputCol="feature", outputCol="norm")
data = normalizer.transform(tfidf)

mat = IndexedRowMatrix(
    data.select("ID", "norm")
        .rdd.map(lambda row: IndexedRow(row.ID, row.norm.toArray()))
).toBlockMatrix()
dot = mat.multiply(mat.transpose())
dot.toLocalMatrix().toArray()

The error points to this line:

    .rdd.map(lambda row: IndexedRow(row.ID, row.norm.toArray()))

I reduced the data to only 5 sentences, but I still get the error!


Re: Processing Flexibility Between RDD and Dataframe API

2018-10-28 Thread Adrienne Kole
Thanks for bringing this issue to the mailing list.
In addition, I would ask the same questions about the DStreams and
Structured Streaming APIs.
Structured Streaming is high level, and it can be difficult to express all
business logic in it, although Databricks is pushing it and recommending it.
Moreover, there is ongoing work on continuous streaming.
So, what is Spark's future vision: support all of these, or concentrate on
one, given that these paradigms have separate processing semantics?


Cheers,
Adrienne

On Sun, Oct 28, 2018 at 3:50 PM Soheil Pourbafrani wrote:

> Hi,
> There are functions such as map, flatMap, and reduce that form the basic
> data processing operations in big data (and Apache Spark). But Spark, in
> recent versions, introduces the high-level DataFrame API and recommends
> using it, even though the DataFrame API does not expose such functions and
> instead offers only its many built-in functions plus UDFs. This is very
> inflexible (at least to me), and at many points I have to convert
> DataFrames to RDDs and vice versa. My questions are:
> Is the RDD API going to be deprecated? If so, what is the correct roadmap
> for processing with Apache Spark when the DataFrame API does not support
> functions like map and reduce? How do UDFs process the data: are they
> applied to every row, like map functions? Does converting a DataFrame to
> an RDD come with a significant cost?
>


Processing Flexibility Between RDD and Dataframe API

2018-10-28 Thread Soheil Pourbafrani
Hi,
There are functions such as map, flatMap, and reduce that form the basic
data processing operations in big data (and Apache Spark). But Spark, in
recent versions, introduces the high-level DataFrame API and recommends
using it, even though the DataFrame API does not expose such functions and
instead offers only its many built-in functions plus UDFs. This is very
inflexible (at least to me), and at many points I have to convert
DataFrames to RDDs and vice versa. My questions are:
Is the RDD API going to be deprecated? If so, what is the correct roadmap
for processing with Apache Spark when the DataFrame API does not support
functions like map and reduce? How do UDFs process the data: are they
applied to every row, like map functions? Does converting a DataFrame to
an RDD come with a significant cost?
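
A minimal sketch contrasting the three styles in question (typed Dataset
map/reduce, a column UDF, and dropping to the RDD API); the data and column
names below are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.appName("sketch").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

// 1) Typed Dataset operations: map/flatMap/reduce are still available on Dataset[T].
val doubledSum = df.as[(String, Int)]
  .map { case (_, v) => v * 2 }   // Dataset[Int]
  .reduce(_ + _)

// 2) A column UDF is invoked once per row on the selected column, inside the
//    optimized plan.
val double = udf((v: Int) => v * 2)
val withUdf = df.withColumn("doubled", double($"value"))

// 3) Dropping to the RDD API is always possible, but it bypasses the Catalyst
//    optimizer and converts the internal row format to JVM objects.
val rddSum = df.rdd.map(_.getInt(1) * 2).reduce(_ + _)

The typed operations and the RDD path both deserialize rows into JVM objects,
so the usual trade-off is flexibility versus the optimizations that pure
column expressions keep.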