[jira] [Updated] (SPARK-19623) Take rows from DataFrame with empty first partition

Jaeboo Jung (JIRA) Wed, 15 Feb 2017 23:48:02 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-19623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jaeboo Jung updated SPARK-19623:
--------------------------------
    Description: 
I use Spark 1.6.2 with 1 master and 6 workers. Assuming we have partitions 
having a empty first partition, DataFrame and its RDD have different behaviors 
during taking rows from it. If we take only 1000 rows from DataFrame, it causes 
OOME but RDD is OK.
In detail,
DataFrame without a empty first partition => OK
DataFrame with a empty first partition => OOME
RDD of DataFrame with a empty first partition => OK
Codes below reproduce this error.
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(1 to 100000000,1000).map(i => 
Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for(i <- 1 to 100) yield {
StructField("COL"+i,IntegerType, true)
})
val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) 
Iterator[Row]() else iter)
val df1 = sqlContext.createDataFrame(rdd,schema)
df1.take(1000) // OK
val df2 = sqlContext.createDataFrame(rdd2,schema)
df2.rdd.take(1000) // OK
df2.take(1000) // OOME
{code}
I tested it on Spark 1.6.2 with 2gb of driver memory and 5gb of executor memory.

  was:
I use Spark 1.6.2 with 1 master and 6 workers. Assuming we have partitions 
having a empty first partition, DataFrame and its RDD have different behaviors 
during taking rows from it. If we take only 1000 rows from DataFrame, it causes 
OOME but RDD is OK.
In detail,
DataFrame without a empty first partition => OK
DataFrame with a empty first partition => OOME
RDD of DataFrame with a empty first partition => OK
Codes below reproduce this error.
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(1 to 100000000,1000).map(i => 
Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for(i <- 1 to 100) yield {
StructField("COL"+i,IntegerType, true)
})
val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) 
Iterator[Row]() else iter)
val df1 = sqlContext.createDataFrame(rdd,schema)
df1.take(1000) // OK
val df2 = sqlContext.createDataFrame(rdd2,schema)
df2.rdd.take(1000) // OK
df2.take(1000) // OOME
{code}


> Take rows from DataFrame with empty first partition
> ---------------------------------------------------
>
>                 Key: SPARK-19623
>                 URL: https://issues.apache.org/jira/browse/SPARK-19623
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2
>            Reporter: Jaeboo Jung
>            Priority: Minor
>
> I use Spark 1.6.2 with 1 master and 6 workers. Assuming we have partitions 
> having a empty first partition, DataFrame and its RDD have different 
> behaviors during taking rows from it. If we take only 1000 rows from 
> DataFrame, it causes OOME but RDD is OK.
> In detail,
> DataFrame without a empty first partition => OK
> DataFrame with a empty first partition => OOME
> RDD of DataFrame with a empty first partition => OK
> Codes below reproduce this error.
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(1 to 100000000,1000).map(i => 
> Row.fromSeq(Array.fill(100)(i)))
> val schema = StructType(for(i <- 1 to 100) yield {
> StructField("COL"+i,IntegerType, true)
> })
> val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) 
> Iterator[Row]() else iter)
> val df1 = sqlContext.createDataFrame(rdd,schema)
> df1.take(1000) // OK
> val df2 = sqlContext.createDataFrame(rdd2,schema)
> df2.rdd.take(1000) // OK
> df2.take(1000) // OOME
> {code}
> I tested it on Spark 1.6.2 with 2gb of driver memory and 5gb of executor 
> memory.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-19623) Take rows from DataFrame with empty first partition

Reply via email to