Detecting configuration problems

2015-09-06 Thread Madhu
I'm not sure if this has been discussed already, if so, please point me to
the thread and/or related JIRA.

I have been running with about a 1 TB volume on a 20-node D2 cluster (255
GiB/node).
I have uniformly distributed data, so skew is not a problem.

I found that the default settings (or wrong settings) for driver and executor
memory caused out-of-memory exceptions during shuffle (subtractByKey, to be
exact). This was not easy to track down, for me at least.

Once I bumped the driver up to 12 GB and executors to 10 GB, with 300 executors
and 3000 partitions, shuffle worked quite well (12 mins for subtractByKey). I'm
sure there are more improvements to be made, but it's a lot better than heap
space exceptions!
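
For reference, a minimal sketch of how I pass those settings (the values are
just the ones that worked for me here, not general recommendations; driver
memory has to be set at submit time, before the driver JVM starts):

    spark-submit \
      --driver-memory 12g \
      --executor-memory 10g \
      --num-executors 300 \
      ...

and the shuffle-heavy step with an explicit partition count (the RDD names
are placeholders):

    val result = bigRdd.subtractByKey(otherRdd, 3000)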

From my reading, the shuffle OOM problem is in ExternalAppendOnlyMap or a
similar disk-backed collection.
I have some familiarity with that code based on previous work with external
sorting.

Is it possible to detect the misconfiguration that leads to these OOMs and
produce more meaningful error messages? I think that would really help users
who might not understand all the inner workings and configuration of Spark
(myself included). As it is, heap space issues are a challenge and do not
present Spark in a positive light.
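
To make that concrete, here is a purely hypothetical sketch of the kind of
check I have in mind; checkShuffleMemory and both of its parameters are
illustrative names, not existing Spark APIs:

    // Fail fast with an actionable message instead of an opaque OOM later.
    def checkShuffleMemory(availableShuffleMemory: Long,
                           estimatedTaskBytes: Long): Unit = {
      if (estimatedTaskBytes > availableShuffleMemory) {
        throw new IllegalStateException(
          s"Shuffle needs ~$estimatedTaskBytes bytes per task but only " +
          s"$availableShuffleMemory are available; consider increasing " +
          "spark.executor.memory or the number of shuffle partitions.")
      }
    }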

I can help with that effort if someone is willing to point me to the precise
location of memory pressure during shuffle.

Thanks!



--
Madhu
https://www.linkedin.com/in/msiddalingaiah




Re: Exception in saving MatrixFactorizationModel

2015-09-06 Thread Ranjana Rajendran
It looks like you hit https://issues.apache.org/jira/browse/SPARK-7837 .
As I understand it, this occurs when there is skew in unpartitioned data.

Can you try partitioning the model before saving it?
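
For example, something like this (a sketch only; model, sc, and outPath are
the ones from your code, and 100 partitions is an arbitrary number to tune):

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // Rebuild the model from repartitioned feature RDDs, then save.
    val repartitioned = new MatrixFactorizationModel(
      model.rank,
      model.userFeatures.repartition(100),
      model.productFeatures.repartition(100))
    repartitioned.save(sc, outPath)

    // Loading it back is a quick way to verify the save succeeded.
    val reloaded = MatrixFactorizationModel.load(sc, outPath)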

On Sat, Sep 5, 2015 at 11:16 PM, Madawa Soysa wrote:

> outPath is correct. In the path, there are two directories, data and
> metadata. In the data directory, the following structure is there:
>
> |- data
>    |- user
>       |- _temporary
>          |- 0
>             |- _temporary
>
> But nothing is written inside the folders. I'm using Spark 1.4.1.
>
> On 6 September 2015 at 08:53, Yanbo Liang wrote:
>
>> Please check the "outPath" and verify whether the save succeeded.
>> Which version did you use?
>> You may have hit this issue, which is resolved in version 1.5.
>>
>> 2015-09-05 21:47 GMT+08:00 Madawa Soysa :
>>
>>> Hi All,
>>>
>>> I'm getting an error when trying to save an ALS MatrixFactorizationModel.
>>> I'm using the following method to save the model:
>>>
>>> *model.save(sc, outPath)*
>>>
>>> I'm getting the following exception when saving the model. I have
>>> attached the full stack trace. Any help resolving this issue would be
>>> appreciated.
>>>
>>> org.apache.spark.SparkException: Job aborted.
>>> at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:166)
>>> at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:139)
>>> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>>> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>>> at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
>>> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
>>> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>>> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:950)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:950)
>>> at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:336)
>>> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
>>> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
>>> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281)
>>> at org.apache.spark.mllib.recommendation.MatrixFactorizationModel$SaveLoadV1_0$.save(MatrixFactorizationModel.scala:284)
>>> at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.save(MatrixFactorizationModel.scala:141)
>>>
>>>
>>> Thanks,
>>> Madawa
>>>
>>>
>>>
>>>
>>>


Re: Exception in saving MatrixFactorizationModel

2015-09-06 Thread Madawa Soysa
Hi,

I'll try partitioning.

I have another question: after creating the MatrixFactorizationModel through
Spark, can it be serialized as a plain Java object without any problems?

On 6 September 2015 at 22:39, Ranjana Rajendran wrote:

> It looks like you hit https://issues.apache.org/jira/browse/SPARK-7837 .
> As I understand it, this occurs when there is skew in unpartitioned data.
>
> Can you try partitioning the model before saving it?


RE: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-06 Thread Cheng, Hao
Not sure if it's too late, but we found a critical bug at
https://issues.apache.org/jira/browse/SPARK-10466
UnsafeRow ser/de causes an assert error, particularly for sort-based shuffle
with data spill. This is not acceptable, as spilling is very common in large
table joins.

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Saturday, September 5, 2015 3:30 PM
To: Krishna Sankar
Cc: Davies Liu; Yin Huai; Tom Graves; dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Thanks, Krishna, for the report. We should fix the problem you hit with the
Python UDFs in 1.6 too.

I'm going to close this vote now. Thanks everybody for voting. This vote passes 
with 8 +1 votes (3 binding) and no 0 or -1 votes.

+1:
Reynold Xin*
Tom Graves*
Burak Yavuz
Michael Armbrust*
Davies Liu
Forest Fang
Krishna Sankar
Denny Lee

0:

-1:


I will work on packaging this release in the next few days.



On Fri, Sep 4, 2015 at 8:08 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
Excellent, and thanks, Davies. Yep, it now runs fine and takes half the time!
This was exactly why I had put in the elapsed time calculations.
And thanks for the new pyspark.sql.functions.

+1 from my side for 1.5.0 RC3.
Cheers


On Fri, Sep 4, 2015 at 9:57 PM, Davies Liu <dav...@databricks.com> wrote:
Could you update the notebook to use the built-in SQL functions month and
year instead of the Python UDFs? (They were introduced in 1.5.)

Once those two UDFs are removed, it runs successfully, and much faster.
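
For example (a sketch in Scala; df and the "OrderDate" column are
placeholders for whatever the notebook actually uses, and the same functions
exist in pyspark.sql.functions):

    import org.apache.spark.sql.functions.{month, year}

    // Built-in date functions instead of Python UDFs (available since 1.5).
    val withParts = df
      .withColumn("Year", year(df("OrderDate")))
      .withColumn("Month", month(df("OrderDate")))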

On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
> Yin,
>It is the
> https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
> Cheers
> 
>
> On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai <yh...@databricks.com> wrote:
>>
>> Hi Krishna,
>>
>> Can you share your code to reproduce the memory allocation issue?
>>
>> Thanks,
>>
>> Yin
>>
>> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar <ksanka...@gmail.com>
>> wrote:
>>>
>>> Thanks Tom.  Interestingly it happened between RC2 and RC3.
>>> Now my vote is +1/2 unless the memory error is known and has a
>>> workaround.
>>>
>>> Cheers
>>> 
>>>
>>>
>>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves <tgraves...@yahoo.com> wrote:

 The upper/lower case thing is known:
 https://issues.apache.org/jira/browse/SPARK-9550
 I assume it was decided to be OK and it's going to be in the release
 notes, but Reynold or Josh can probably speak to it more.

 Tom



 On Thursday, September 3, 2015 10:21 PM, Krishna Sankar
 <ksanka...@gmail.com> wrote:


 +?

 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
 2. Tested pyspark, mllib
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Lasso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word
 count)
 2.6. Recommendation (MovieLens medium dataset, ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (MovieLens medium dataset, ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql("SELECT
 OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
 JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
 4.0. Spark SQL from Python OK
 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
 OK
 5.0. Packages
 5.1. com.databricks.spark.csv - read/write OK
 (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
 com.databricks:spark-csv_2.11:1.2.0 worked)
 6.0. DataFrames
 6.1. cast,dtypes OK
 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
 6.3. All joins,sql,set operations,udf OK

 Two Problems:

 1. The synthetic column names are now lowercase (i.e. ‘sum(OrderPrice)’
 where previously it was ‘SUM(OrderPrice)’, and ‘avg(Total)’ where
 previously it was ‘AVG(Total)’). So programs that depend on the case of
 the synthetic column names would fail.
 2. orders_3.groupBy("Year","Month").sum('Total').show() fails with the
 error ‘java.io.IOException: Unable to acquire 4194304 bytes of memory’.
 orders_3.groupBy("CustomerID","Year").sum('Total').show() fails with
 the same error.
 Is this a known bug?
 Cheers
 
 P.S: Sorry for the spam, forgot Reply All

 On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin <r...@databricks.com> wrote:


Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-06 Thread james
I saw that a new "spark.shuffle.manager=tungsten-sort" was implemented in
https://issues.apache.org/jira/browse/SPARK-7081, but its corresponding
description can't be found in
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/configuration.html
(currently there are only the two options 'sort' and 'hash').
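
For anyone who wants to try it before the docs catch up, it can be set like
any other conf value (a sketch; per SPARK-7081, "tungsten-sort" is the new
third value for this key):

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.shuffle.manager", "tungsten-sort")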



