[jira] [Created] (SPARK-19206) Update outdated parameter descripions

2017-01-13 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-19206:
-

 Summary: Update outdated parameter descripions
 Key: SPARK-19206
 URL: https://issues.apache.org/jira/browse/SPARK-19206
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.1.0, 2.0.2
Reporter: Genmao Yu
Priority: Minor









[jira] [Updated] (SPARK-19206) Update outdated parameter descriptions in external-kafka module

2017-01-13 Thread Genmao Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Genmao Yu updated SPARK-19206:
--
Summary: Update outdated parameter descriptions in external-kafka module  
(was: Update outdated parameter descripions in external-kafka module)

> Update outdated parameter descriptions in external-kafka module
> ---
>
> Key: SPARK-19206
> URL: https://issues.apache.org/jira/browse/SPARK-19206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>







[jira] [Updated] (SPARK-19206) Update outdated parameter descripions in external-kafka module

2017-01-13 Thread Genmao Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Genmao Yu updated SPARK-19206:
--
Summary: Update outdated parameter descripions in external-kafka module  
(was: Update outdated parameter descripions)

> Update outdated parameter descripions in external-kafka module
> --
>
> Key: SPARK-19206
> URL: https://issues.apache.org/jira/browse/SPARK-19206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>







[jira] [Assigned] (SPARK-19206) Update outdated parameter descriptions in external-kafka module

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19206:


Assignee: (was: Apache Spark)

> Update outdated parameter descriptions in external-kafka module
> ---
>
> Key: SPARK-19206
> URL: https://issues.apache.org/jira/browse/SPARK-19206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>







[jira] [Commented] (SPARK-19206) Update outdated parameter descriptions in external-kafka module

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821410#comment-15821410
 ] 

Apache Spark commented on SPARK-19206:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16569

> Update outdated parameter descriptions in external-kafka module
> ---
>
> Key: SPARK-19206
> URL: https://issues.apache.org/jira/browse/SPARK-19206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>







[jira] [Assigned] (SPARK-19206) Update outdated parameter descriptions in external-kafka module

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19206:


Assignee: Apache Spark

> Update outdated parameter descriptions in external-kafka module
> ---
>
> Key: SPARK-19206
> URL: https://issues.apache.org/jira/browse/SPARK-19206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Created] (SPARK-19207) LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating new object via constructor

2017-01-13 Thread Tsuyoshi Ozawa (JIRA)
Tsuyoshi Ozawa created SPARK-19207:
--

 Summary: LocalSparkSession should use Slf4JLoggerFactory.INSTANCE 
instead of creating new object via constructor
 Key: SPARK-19207
 URL: https://issues.apache.org/jira/browse/SPARK-19207
 Project: Spark
  Issue Type: Improvement
Reporter: Tsuyoshi Ozawa


Creating a Slf4JLoggerFactory instance via its constructor is deprecated. A 
warning is generated:
{code}
[warn] 
/Users/ozawa/workspace/spark/sql/core/src/test/scala/org/apache/spark/sql/LocalSparkSession.scala:32:
 constructor Slf4JLoggerFactory in class Slf4JLoggerFactory is deprecated: see 
corresponding Javadoc for more information.
[warn] InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory())
{code}
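
For illustration, a minimal sketch of the change the summary suggests (assuming Netty's {{io.netty.util.internal.logging}} API; not the actual patch):
{code}
import io.netty.util.internal.logging.{InternalLoggerFactory, Slf4JLoggerFactory}

// Deprecated: constructs a new factory instance
// InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory())

// Preferred: reuse the shared singleton exposed by Netty
InternalLoggerFactory.setDefaultFactory(Slf4JLoggerFactory.INSTANCE)
{code}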






[jira] [Commented] (SPARK-19207) LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating new object via constructor

2017-01-13 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821417#comment-15821417
 ] 

Tsuyoshi Ozawa commented on SPARK-19207:


I will send a PR to fix this problem.

> LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating 
> new object via constructor
> ---
>
> Key: SPARK-19207
> URL: https://issues.apache.org/jira/browse/SPARK-19207
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tsuyoshi Ozawa
>
> Creating a Slf4JLoggerFactory instance via its constructor is deprecated. A 
> warning is generated:
> {code}
> [warn] 
> /Users/ozawa/workspace/spark/sql/core/src/test/scala/org/apache/spark/sql/LocalSparkSession.scala:32:
>  constructor Slf4JLoggerFactory in class Slf4JLoggerFactory is deprecated: 
> see corresponding Javadoc for more information.
> [warn] InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory())
> {code}






[jira] [Commented] (SPARK-19207) LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating new object via constructor

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821467#comment-15821467
 ] 

Apache Spark commented on SPARK-19207:
--

User 'oza' has created a pull request for this issue:
https://github.com/apache/spark/pull/16570

> LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating 
> new object via constructor
> ---
>
> Key: SPARK-19207
> URL: https://issues.apache.org/jira/browse/SPARK-19207
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tsuyoshi Ozawa
>
> Creating a Slf4JLoggerFactory instance via its constructor is deprecated. A 
> warning is generated:
> {code}
> [warn] 
> /Users/ozawa/workspace/spark/sql/core/src/test/scala/org/apache/spark/sql/LocalSparkSession.scala:32:
>  constructor Slf4JLoggerFactory in class Slf4JLoggerFactory is deprecated: 
> see corresponding Javadoc for more information.
> [warn] InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory())
> {code}






[jira] [Assigned] (SPARK-19207) LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating new object via constructor

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19207:


Assignee: (was: Apache Spark)

> LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating 
> new object via constructor
> ---
>
> Key: SPARK-19207
> URL: https://issues.apache.org/jira/browse/SPARK-19207
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tsuyoshi Ozawa
>
> Creating a Slf4JLoggerFactory instance via its constructor is deprecated. A 
> warning is generated:
> {code}
> [warn] 
> /Users/ozawa/workspace/spark/sql/core/src/test/scala/org/apache/spark/sql/LocalSparkSession.scala:32:
>  constructor Slf4JLoggerFactory in class Slf4JLoggerFactory is deprecated: 
> see corresponding Javadoc for more information.
> [warn] InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory())
> {code}






[jira] [Assigned] (SPARK-19207) LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating new object via constructor

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19207:


Assignee: Apache Spark

> LocalSparkSession should use Slf4JLoggerFactory.INSTANCE instead of creating 
> new object via constructor
> ---
>
> Key: SPARK-19207
> URL: https://issues.apache.org/jira/browse/SPARK-19207
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tsuyoshi Ozawa
>Assignee: Apache Spark
>
> Creating a Slf4JLoggerFactory instance via its constructor is deprecated. A 
> warning is generated:
> {code}
> [warn] 
> /Users/ozawa/workspace/spark/sql/core/src/test/scala/org/apache/spark/sql/LocalSparkSession.scala:32:
>  constructor Slf4JLoggerFactory in class Slf4JLoggerFactory is deprecated: 
> see corresponding Javadoc for more information.
> [warn] InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory())
> {code}






[jira] [Commented] (SPARK-18667) input_file_name function does not work with UDF

2017-01-13 Thread Ben (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821506#comment-15821506
 ] 

Ben commented on SPARK-18667:
-

So, I have now created a new example, and here is the code for everything:

a.xml:
{noformat}
<root>
  <x>TEXT</x>
  <y>TEXT2</y>
</root>
{noformat}

b.xml:
{noformat}
<root>
  <file>file:/C:/a.xml</file>
  <other>AAA</other>
</root>
{noformat}

code:
{noformat}
from pyspark.sql.functions import udf,input_file_name
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

def filename(path):
    return path

session = SparkSession.builder.appName('APP').getOrCreate()

session.udf.register('sameText',filename)
sameText = udf(filename, StringType())

df = session.read.format('xml').load('../../res/Other/a.xml', 
rowTag='root').select('*',input_file_name().alias('file'))
df.select('file').show()
df.select(sameText(df['file'])).show()

df2 = session.read.format('xml').load('../../res/Other/b.xml', rowTag='root')
df3 = df.join(df2, 'file')

df.show()
df2.show()
df3.show()
df3.selectExpr('file as FILE','x AS COL1','sameText(y) AS COL2').show()
{noformat}

and this is the console output:
{noformat}
2017-01-13 10:27:55 WARN   org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
+--------------------+
|                file|
+--------------------+
|file:/C:/Users/SS...|
+--------------------+

+--------------+
|filename(file)|
+--------------+
|              |
+--------------+

+----+-----+--------------------+
|   x|    y|                file|
+----+-----+--------------------+
|TEXT|TEXT2|file:/C:/Users/SS...|
+----+-----+--------------------+

+--------------------+-----+
|                file|other|
+--------------------+-----+
|file:/C:/Users/SS...|  AAA|
+--------------------+-----+

+--------------------+----+-----+-----+
|                file|   x|    y|other|
+--------------------+----+-----+-----+
|file:/C:/Users/SS...|TEXT|TEXT2|  AAA|
+--------------------+----+-----+-----+


[Stage 26, 29, and 32 progress-bar output omitted]
{noformat}

[jira] [Comment Edited] (SPARK-18667) input_file_name function does not work with UDF

2017-01-13 Thread Ben (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821506#comment-15821506
 ] 

Ben edited comment on SPARK-18667 at 1/13/17 9:53 AM:
--

So, I have now created a new example, and here is the code for everything:

a.xml:
{noformat}
<root>
  <x>TEXT</x>
  <y>TEXT2</y>
</root>
{noformat}

b.xml:
{noformat}
<root>
  <file>file:/C:/a.xml</file>
  <other>AAA</other>
</root>
{noformat}

code:
{noformat}
from pyspark.sql.functions import udf,input_file_name
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

def filename(path):
    return path

session = SparkSession.builder.appName('APP').getOrCreate()

session.udf.register('sameText',filename)
sameText = udf(filename, StringType())

df = session.read.format('xml').load('../../res/Other/a.xml', 
rowTag='root').select('*',input_file_name().alias('file'))
df.select('file').show()
df.select(sameText(df['file'])).show()

df2 = session.read.format('xml').load('../../res/Other/b.xml', rowTag='root')
df3 = df.join(df2, 'file')

df.show()
df2.show()
df3.show()
df3.selectExpr('file as FILE','x AS COL1','sameText(y) AS COL2').show()
{noformat}

and this is the console output:
{noformat}
+--------------------+
|                file|
+--------------------+
|file:/C:/Users/SS...|
+--------------------+

+--------------+
|filename(file)|
+--------------+
|              |
+--------------+

+----+-----+--------------------+
|   x|    y|                file|
+----+-----+--------------------+
|TEXT|TEXT2|file:/C:/Users/SS...|
+----+-----+--------------------+

+--------------------+-----+
|                file|other|
+--------------------+-----+
|file:/C:/Users/SS...|  AAA|
+--------------------+-----+

+--------------------+----+-----+-----+
|                file|   x|    y|other|
+--------------------+----+-----+-----+
|file:/C:/Users/SS...|TEXT|TEXT2|  AAA|
+--------------------+----+-----+-----+


[Stage 26, 29, and 32 progress-bar output omitted]
{noformat}

[jira] [Created] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-13 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-19208:


 Summary: MaxAbsScaler and MinMaxScaler are very inefficient
 Key: SPARK-19208
 URL: https://issues.apache.org/jira/browse/SPARK-19208
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: zhengruifeng


Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
{{MultivariateOnlineSummarizer}} to compute the min/max.
However {{MultivariateOnlineSummarizer}} will also compute extra unused 
statistics. It slows down the task, moreover it is more prone to cause OOM.

For example:
env : --driver-memory 4G --executor-memory 1G --num-executors 4
data: 748401 instances,   and 3,000,000 features
{{MaxAbsScaler.fit}} fail because OOM

{{MultivariateOnlineSummarizer}} maintain 8 arrays
For {{MaxAbsScaler}}, only one array is needed (max of abs value)
For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
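
As a rough illustration only (a sketch, not the actual pull request; it assumes the input is an {{RDD[Vector]}} with a known feature count), the max-abs statistic can be aggregated with a single array:
{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch: compute only the per-feature max of absolute values,
// the single statistic MaxAbsScaler actually needs.
def maxAbs(data: RDD[Vector], numFeatures: Int): Array[Double] = {
  data.treeAggregate(Array.ofDim[Double](numFeatures))(
    seqOp = (agg, v) => {
      v.foreachActive { (i, value) => agg(i) = math.max(agg(i), math.abs(value)) }
      agg
    },
    combOp = (a, b) => {
      var i = 0
      while (i < numFeatures) { a(i) = math.max(a(i), b(i)); i += 1 }
      a
    })
}
{code}
MinMaxScaler would need the analogous aggregation over three arrays (max, min, nnz).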






[jira] [Updated] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-13 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-19208:
-
Attachment: WechatIMG2621.jpeg

OOM

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 748401 instances,   and 3,000,000 features
> {{MaxAbsScaler.fit}} fail because OOM
> {{MultivariateOnlineSummarizer}} maintain 8 arrays
> For {{MaxAbsScaler}}, only one array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)






[jira] [Updated] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-13 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-19208:
-
Description: 
Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
{{MultivariateOnlineSummarizer}} to compute the min/max.
However {{MultivariateOnlineSummarizer}} will also compute extra unused 
statistics. It slows down the task, moreover it is more prone to cause OOM.

For example:
env : --driver-memory 4G --executor-memory 1G --num-executors 4
data: 
[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
 748401 instances,   and 29,890,095 features
{{MaxAbsScaler.fit}} fail because OOM

{{MultivariateOnlineSummarizer}} maintain 8 arrays
For {{MaxAbsScaler}}, only one array is needed (max of abs value)
For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)

  was:
Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
{{MultivariateOnlineSummarizer}} to compute the min/max.
However {{MultivariateOnlineSummarizer}} will also compute extra unused 
statistics. It slows down the task, moreover it is more prone to cause OOM.

For example:
env : --driver-memory 4G --executor-memory 1G --num-executors 4
data: 748401 instances,   and 3,000,000 features
{{MaxAbsScaler.fit}} fail because OOM

{{MultivariateOnlineSummarizer}} maintain 8 arrays
For {{MaxAbsScaler}}, only one array is needed (max of abs value)
For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)


> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fail because OOM
> {{MultivariateOnlineSummarizer}} maintain 8 arrays
> For {{MaxAbsScaler}}, only one array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)






[jira] [Assigned] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19208:


Assignee: Apache Spark

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fail because OOM
> {{MultivariateOnlineSummarizer}} maintain 8 arrays
> For {{MaxAbsScaler}}, only one array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)






[jira] [Commented] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821549#comment-15821549
 ] 

Apache Spark commented on SPARK-19208:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/16571

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fail because OOM
> {{MultivariateOnlineSummarizer}} maintain 8 arrays
> For {{MaxAbsScaler}}, only one array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)






[jira] [Updated] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-13 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-19208:
-
Description: 
Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
{{MultivariateOnlineSummarizer}} to compute the min/max.
However {{MultivariateOnlineSummarizer}} will also compute extra unused 
statistics. It slows down the task, moreover it is more prone to cause OOM.

For example:
env : --driver-memory 4G --executor-memory 1G --num-executors 4
data: 
[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
 748401 instances,   and 29,890,095 features
{{MaxAbsScaler.fit}} fail because OOM

{{MultivariateOnlineSummarizer}} maintain 8 arrays
For {{MaxAbsScaler}}, only one array is needed (max of abs value)
For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)

After modication in the pr, the above example run successfully.

  was:
Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
{{MultivariateOnlineSummarizer}} to compute the min/max.
However {{MultivariateOnlineSummarizer}} will also compute extra unused 
statistics. It slows down the task, moreover it is more prone to cause OOM.

For example:
env : --driver-memory 4G --executor-memory 1G --num-executors 4
data: 
[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
 748401 instances,   and 29,890,095 features
{{MaxAbsScaler.fit}} fail because OOM

{{MultivariateOnlineSummarizer}} maintain 8 arrays
For {{MaxAbsScaler}}, only one array is needed (max of abs value)
For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)


> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fail because OOM
> {{MultivariateOnlineSummarizer}} maintain 8 arrays
> For {{MaxAbsScaler}}, only one array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.






[jira] [Updated] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-13 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-19208:
-
Description: 
Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
{{MultivariateOnlineSummarizer}} to compute the min/max.
However {{MultivariateOnlineSummarizer}} will also compute extra unused 
statistics. It slows down the task, moreover it is more prone to cause OOM.

For example:
env : --driver-memory 4G --executor-memory 1G --num-executors 4
data: 
[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
 748401 instances,   and 29,890,095 features
{{MaxAbsScaler.fit}} fail because OOM

{{MultivariateOnlineSummarizer}} maintain 8 arrays:
{code}
private var currMean: Array[Double] = _
  private var currM2n: Array[Double] = _
  private var currM2: Array[Double] = _
  private var currL1: Array[Double] = _
  private var totalCnt: Long = 0
  private var totalWeightSum: Double = 0.0
  private var weightSquareSum: Double = 0.0
  private var weightSum: Array[Double] = _
  private var nnz: Array[Long] = _
  private var currMax: Array[Double] = _
  private var currMin: Array[Double] = _
{code}

For {{MaxAbsScaler}}, only one array is needed (max of abs value)
For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)

After modication in the pr, the above example run successfully.

  was:
Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
{{MultivariateOnlineSummarizer}} to compute the min/max.
However {{MultivariateOnlineSummarizer}} will also compute extra unused 
statistics. It slows down the task, moreover it is more prone to cause OOM.

For example:
env : --driver-memory 4G --executor-memory 1G --num-executors 4
data: 
[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
 748401 instances,   and 29,890,095 features
{{MaxAbsScaler.fit}} fail because OOM

{{MultivariateOnlineSummarizer}} maintain 8 arrays
For {{MaxAbsScaler}}, only one array is needed (max of abs value)
For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)

After modication in the pr, the above example run successfully.


> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fail because OOM
> {{MultivariateOnlineSummarizer}} maintain 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only one array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.






[jira] [Updated] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-13 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-19208:
-
Description: 
Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
{{MultivariateOnlineSummarizer}} to compute the min/max.
However {{MultivariateOnlineSummarizer}} will also compute extra unused 
statistics. It slows down the task, moreover it is more prone to cause OOM.

For example:
env : --driver-memory 4G --executor-memory 1G --num-executors 4
data: 
[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
 748401 instances,   and 29,890,095 features
{{MaxAbsScaler.fit}} fails because of OOM

{{MultivariateOnlineSummarizer}} maintains 8 arrays:
{code}
private var currMean: Array[Double] = _
  private var currM2n: Array[Double] = _
  private var currM2: Array[Double] = _
  private var currL1: Array[Double] = _
  private var totalCnt: Long = 0
  private var totalWeightSum: Double = 0.0
  private var weightSquareSum: Double = 0.0
  private var weightSum: Array[Double] = _
  private var nnz: Array[Long] = _
  private var currMax: Array[Double] = _
  private var currMin: Array[Double] = _
{code}

For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)

After modication in the pr, the above example run successfully.

  was:
Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
{{MultivariateOnlineSummarizer}} to compute the min/max.
However {{MultivariateOnlineSummarizer}} will also compute extra unused 
statistics. It slows down the task, moreover it is more prone to cause OOM.

For example:
env : --driver-memory 4G --executor-memory 1G --num-executors 4
data: 
[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
 748401 instances,   and 29,890,095 features
{{MaxAbsScaler.fit}} fail because OOM

{{MultivariateOnlineSummarizer}} maintain 8 arrays:
{code}
private var currMean: Array[Double] = _
  private var currM2n: Array[Double] = _
  private var currM2: Array[Double] = _
  private var currL1: Array[Double] = _
  private var totalCnt: Long = 0
  private var totalWeightSum: Double = 0.0
  private var weightSquareSum: Double = 0.0
  private var weightSum: Array[Double] = _
  private var nnz: Array[Long] = _
  private var currMax: Array[Double] = _
  private var currMin: Array[Double] = _
{code}

For {{MaxAbsScaler}}, only one array is needed (max of abs value)
For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)

After modication in the pr, the above example run successfully.


> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.






[jira] [Created] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-19209:
--

 Summary: "No suitable driver" on first try
 Key: SPARK-19209
 URL: https://issues.apache.org/jira/browse/SPARK-19209
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Daniel Darabos


This is a regression from Spark 2.0.2. Observe!

{code}
$ ~/spark-2.0.2/bin/spark-shell --jars 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

This is the "good" exception. Now with Spark 2.1.0:

{code}
$ ~/spark-2.1.0/bin/spark-shell --jars 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided

scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

Simply re-executing the same command a second time "fixes" the {{No suitable 
driver}} error.

My guess is this is fallout from https://github.com/apache/spark/pull/15292 
which changed the JDBC driver management code. But this code is so hard to 
understand for me, I could be totally wrong.

This is nothing more than a nuisance for {{spark-shell}} usage, but it is more 
painful to work around for applications.
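
For applications, a possible workaround sketch (hypothetical code, based only on the observation above that the second attempt succeeds) is to retry the read once:
{code}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Workaround sketch only: retry the JDBC read once if the first attempt
// fails with the spurious "No suitable driver" error described above.
def loadJdbcWithRetry(spark: SparkSession, url: String, table: String): DataFrame = {
  def attempt(): DataFrame =
    spark.read.format("jdbc").option("url", url).option("dbtable", table).load()
  try attempt() catch {
    case e: java.sql.SQLException if e.getMessage != null && e.getMessage.contains("No suitable driver") =>
      attempt()
  }
}
{code}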






[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821578#comment-15821578
 ] 

Daniel Darabos commented on SPARK-19209:


Puzzlingly, this only happens in the application when the SparkSession is 
created with {{enableHiveSupport}}. I guess in {{spark-shell}} it is enabled by 
default.
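
For reference, a hypothetical sketch of the session construction being described (application code, not from the ticket):
{code}
import org.apache.spark.sql.SparkSession

// Hive support enabled explicitly; this is the configuration under which
// the first-try "No suitable driver" failure was observed.
val spark = SparkSession.builder()
  .appName("my-app")
  .enableHiveSupport()
  .getOrCreate()
{code}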

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.






[jira] [Updated] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-19209:
---
Description: 
This is a regression from Spark 2.0.2. Observe!

{code}
$ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
--driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

This is the "good" exception. Now with Spark 2.1.0:

{code}
$ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
--driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided

scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

Simply re-executing the same command a second time "fixes" the {{No suitable 
driver}} error.

My guess is this is fallout from https://github.com/apache/spark/pull/15292 
which changed the JDBC driver management code. But this code is so hard to 
understand for me, I could be totally wrong.

This is nothing more than a nuisance for {{spark-shell}} usage, but it is more 
painful to work around for applications.

  was:
This is a regression from Spark 2.0.2. Observe!

{code}
$ ~/spark-2.0.2/bin/spark-shell --jars 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

This is the "good" exception. Now with Spark 2.1.0:

{code}
$ ~/spark-2.1.0/bin/spark-shell --jars 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided

scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

Simply re-executing the same command a second time "fixes" the {{No suitable 
driver}} error.

My guess is this is fallout from https://github.com/apache/spark/pull/15292 
which changed the JDBC driver management code. But this code is so hard to 
understand for me, I could be totally wrong.

This is nothing more than a nuisance for {{spark-shell}} usage, but it is more 
painful to work around for applications.


> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>

[jira] [Commented] (SPARK-19065) Bad error when using dropDuplicates in Streaming

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821658#comment-15821658
 ] 

Apache Spark commented on SPARK-19065:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16564

> Bad error when using dropDuplicates in Streaming
> 
>
> Key: SPARK-19065
> URL: https://issues.apache.org/jira/browse/SPARK-19065
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>
> Right now if you use .dropDuplicates in a stream you get a confusing 
> exception.
> Here is an example:
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> accountName#34351,eventSource#34331,resources#34339,eventType#34333,readOnly#34335,date#34350,errorCode#34327,errorMessage#34328,userAgent#34344,eventVersion#34334,eventTime#34332,recipientAccountId#34336,sharedEventID#34341,timing#34349,apiVersion#34325,additionalEventData#34324,requestParameters#34338,sourceIPAddress#34342,serviceEventDetails#34343,timestamp#34323,awsRegion#34326,eventName#34330,responseElements#34340,filename#34347,requestID#34337,vpcEndpointId#34346,line#34348,userIdentity#34345
>  missing from 
> requestID#34119,eventSource#34113,serviceEventDetails#34125,eventVersion#34116,userIdentity#34127,requestParameters#34120,accountName#34133,apiVersion#34107,eventTime#34114,additionalEventData#34106,line#34130,readOnly#34117,sourceIPAddress#34124,eventID#34329,errorCode#34109,resources#34121,timing#34131,userAgent#34126,eventType#34115,recipientAccountId#34118,errorMessage#34110,vpcEndpointId#34128,sharedEventID#34123,filename#34129,awsRegion#34108,responseElements#34122,date#34132,timestamp#34105,eventName#34112
>  in operator !Project [timestamp#34323, additionalEventData#34324, 
> apiVersion#34325, awsRegion#34326, errorCode#34327, errorMessage#34328, 
> eventID#34329, eventName#34330, eventSource#34331, eventTime#34332, 
> eventType#34333, eventVersion#34334, readOnly#34335, 
> recipientAccountId#34336, requestID#34337, requestParameters#34338, 
> resources#34339, responseElements#34340, sharedEventID#34341, 
> sourceIPAddress#34342, serviceEventDetails#34343, userAgent#34344, 
> userIdentity#34345, vpcEndpointId#34346, ... 5 more fields];;
> !Project [timestamp#34323, additionalEventData#34324, apiVersion#34325, 
> awsRegion#34326, errorCode#34327, errorMessage#34328, eventID#34329, 
> eventName#34330, eventSource#34331, eventTime#34332, eventType#34333, 
> eventVersion#34334, readOnly#34335, recipientAccountId#34336, 
> requestID#34337, requestParameters#34338, resources#34339, 
> responseElements#34340, sharedEventID#34341, sourceIPAddress#34342, 
> serviceEventDetails#34343, userAgent#34344, userIdentity#34345, 
> vpcEndpointId#34346, ... 5 more fields]
> +- Aggregate [eventID#34329], [first(timestamp#34323, false) AS 
> timestamp#34105, first(additionalEventData#34324, false) AS 
> additionalEventData#34106, first(apiVersion#34325, false) AS 
> apiVersion#34107, first(awsRegion#34326, false) AS awsRegion#34108, 
> first(errorCode#34327, false) AS errorCode#34109, first(errorMessage#34328, 
> false) AS errorMessage#34110, eventID#34329, first(eventName#34330, false) AS 
> eventName#34112, first(eventSource#34331, false) AS eventSource#34113, 
> first(eventTime#34332, false) AS eventTime#34114, first(eventType#34333, 
> false) AS eventType#34115, first(eventVersion#34334, false) AS 
> eventVersion#34116, first(readOnly#34335, false) AS readOnly#34117, 
> first(recipientAccountId#34336, false) AS recipientAccountId#34118, 
> first(requestID#34337, false) AS requestID#34119, 
> first(requestParameters#34338, false) AS requestParameters#34120, 
> first(resources#34339, false) AS resources#34121, 
> first(responseElements#34340, false) AS responseElements#34122, 
> first(sharedEventID#34341, false) AS sharedEventID#34123, 
> first(sourceIPAddress#34342, false) AS sourceIPAddress#34124, 
> first(serviceEventDetails#34343, false) AS serviceEventDetails#34125, 
> first(userAgent#34344, false) AS userAgent#34126, first(userIdentity#34345, 
> false) AS userIdentity#34127, first(vpcEndpointId#34346, false) AS 
> vpcEndpointId#34128, ... 5 more fields]
>+- Project [timestamp#34323, additionalEventData#34324, apiVersion#34325, 
> awsRegion#34326, errorCode#34327, errorMessage#34328, eventID#34329, 
> eventName#34330, eventSource#34331, eventTime#34332, eventType#34333, 
> eventVersion#34334, readOnly#34335, recipientAccountId#34336, 
> requestID#34337, requestParameters#34338, resources#34339, 
> responseElements#34340, sharedEventID#34341, sourceIPAddress#34342, 
> serviceEventDetails#34343, userAgent#34344, userIdentity#34345, 
> vpcEndpointId#34346, ... 5 more fields]
>   +- 
> Rel

[jira] [Assigned] (SPARK-19065) Bad error when using dropDuplicates in Streaming

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19065:


Assignee: Apache Spark

> Bad error when using dropDuplicates in Streaming
> 
>
> Key: SPARK-19065
> URL: https://issues.apache.org/jira/browse/SPARK-19065
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>
> Right now if you use .dropDuplicates in a stream you get a confusing 
> exception.
> Here is an example:
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> accountName#34351,eventSource#34331,resources#34339,eventType#34333,readOnly#34335,date#34350,errorCode#34327,errorMessage#34328,userAgent#34344,eventVersion#34334,eventTime#34332,recipientAccountId#34336,sharedEventID#34341,timing#34349,apiVersion#34325,additionalEventData#34324,requestParameters#34338,sourceIPAddress#34342,serviceEventDetails#34343,timestamp#34323,awsRegion#34326,eventName#34330,responseElements#34340,filename#34347,requestID#34337,vpcEndpointId#34346,line#34348,userIdentity#34345
>  missing from 
> requestID#34119,eventSource#34113,serviceEventDetails#34125,eventVersion#34116,userIdentity#34127,requestParameters#34120,accountName#34133,apiVersion#34107,eventTime#34114,additionalEventData#34106,line#34130,readOnly#34117,sourceIPAddress#34124,eventID#34329,errorCode#34109,resources#34121,timing#34131,userAgent#34126,eventType#34115,recipientAccountId#34118,errorMessage#34110,vpcEndpointId#34128,sharedEventID#34123,filename#34129,awsRegion#34108,responseElements#34122,date#34132,timestamp#34105,eventName#34112
>  in operator !Project [timestamp#34323, additionalEventData#34324, 
> apiVersion#34325, awsRegion#34326, errorCode#34327, errorMessage#34328, 
> eventID#34329, eventName#34330, eventSource#34331, eventTime#34332, 
> eventType#34333, eventVersion#34334, readOnly#34335, 
> recipientAccountId#34336, requestID#34337, requestParameters#34338, 
> resources#34339, responseElements#34340, sharedEventID#34341, 
> sourceIPAddress#34342, serviceEventDetails#34343, userAgent#34344, 
> userIdentity#34345, vpcEndpointId#34346, ... 5 more fields];;
> !Project [timestamp#34323, additionalEventData#34324, apiVersion#34325, 
> awsRegion#34326, errorCode#34327, errorMessage#34328, eventID#34329, 
> eventName#34330, eventSource#34331, eventTime#34332, eventType#34333, 
> eventVersion#34334, readOnly#34335, recipientAccountId#34336, 
> requestID#34337, requestParameters#34338, resources#34339, 
> responseElements#34340, sharedEventID#34341, sourceIPAddress#34342, 
> serviceEventDetails#34343, userAgent#34344, userIdentity#34345, 
> vpcEndpointId#34346, ... 5 more fields]
> +- Aggregate [eventID#34329], [first(timestamp#34323, false) AS 
> timestamp#34105, first(additionalEventData#34324, false) AS 
> additionalEventData#34106, first(apiVersion#34325, false) AS 
> apiVersion#34107, first(awsRegion#34326, false) AS awsRegion#34108, 
> first(errorCode#34327, false) AS errorCode#34109, first(errorMessage#34328, 
> false) AS errorMessage#34110, eventID#34329, first(eventName#34330, false) AS 
> eventName#34112, first(eventSource#34331, false) AS eventSource#34113, 
> first(eventTime#34332, false) AS eventTime#34114, first(eventType#34333, 
> false) AS eventType#34115, first(eventVersion#34334, false) AS 
> eventVersion#34116, first(readOnly#34335, false) AS readOnly#34117, 
> first(recipientAccountId#34336, false) AS recipientAccountId#34118, 
> first(requestID#34337, false) AS requestID#34119, 
> first(requestParameters#34338, false) AS requestParameters#34120, 
> first(resources#34339, false) AS resources#34121, 
> first(responseElements#34340, false) AS responseElements#34122, 
> first(sharedEventID#34341, false) AS sharedEventID#34123, 
> first(sourceIPAddress#34342, false) AS sourceIPAddress#34124, 
> first(serviceEventDetails#34343, false) AS serviceEventDetails#34125, 
> first(userAgent#34344, false) AS userAgent#34126, first(userIdentity#34345, 
> false) AS userIdentity#34127, first(vpcEndpointId#34346, false) AS 
> vpcEndpointId#34128, ... 5 more fields]
>+- Project [timestamp#34323, additionalEventData#34324, apiVersion#34325, 
> awsRegion#34326, errorCode#34327, errorMessage#34328, eventID#34329, 
> eventName#34330, eventSource#34331, eventTime#34332, eventType#34333, 
> eventVersion#34334, readOnly#34335, recipientAccountId#34336, 
> requestID#34337, requestParameters#34338, resources#34339, 
> responseElements#34340, sharedEventID#34341, sourceIPAddress#34342, 
> serviceEventDetails#34343, userAgent#34344, userIdentity#34345, 
> vpcEndpointId#34346, ... 5 more fields]
>   +- 
> Relation[timestamp#34323,additionalEventData#34324,apiVersion#34325,awsRegion#34326,errorCod

[jira] [Assigned] (SPARK-19065) Bad error when using dropDuplicates in Streaming

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19065:


Assignee: (was: Apache Spark)

> Bad error when using dropDuplicates in Streaming
> 
>
> Key: SPARK-19065
> URL: https://issues.apache.org/jira/browse/SPARK-19065
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>
> Right now if you use .dropDuplicates in a stream you get a confusing 
> exception.
> Here is an example:
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> accountName#34351,eventSource#34331,resources#34339,eventType#34333,readOnly#34335,date#34350,errorCode#34327,errorMessage#34328,userAgent#34344,eventVersion#34334,eventTime#34332,recipientAccountId#34336,sharedEventID#34341,timing#34349,apiVersion#34325,additionalEventData#34324,requestParameters#34338,sourceIPAddress#34342,serviceEventDetails#34343,timestamp#34323,awsRegion#34326,eventName#34330,responseElements#34340,filename#34347,requestID#34337,vpcEndpointId#34346,line#34348,userIdentity#34345
>  missing from 
> requestID#34119,eventSource#34113,serviceEventDetails#34125,eventVersion#34116,userIdentity#34127,requestParameters#34120,accountName#34133,apiVersion#34107,eventTime#34114,additionalEventData#34106,line#34130,readOnly#34117,sourceIPAddress#34124,eventID#34329,errorCode#34109,resources#34121,timing#34131,userAgent#34126,eventType#34115,recipientAccountId#34118,errorMessage#34110,vpcEndpointId#34128,sharedEventID#34123,filename#34129,awsRegion#34108,responseElements#34122,date#34132,timestamp#34105,eventName#34112
>  in operator !Project [timestamp#34323, additionalEventData#34324, 
> apiVersion#34325, awsRegion#34326, errorCode#34327, errorMessage#34328, 
> eventID#34329, eventName#34330, eventSource#34331, eventTime#34332, 
> eventType#34333, eventVersion#34334, readOnly#34335, 
> recipientAccountId#34336, requestID#34337, requestParameters#34338, 
> resources#34339, responseElements#34340, sharedEventID#34341, 
> sourceIPAddress#34342, serviceEventDetails#34343, userAgent#34344, 
> userIdentity#34345, vpcEndpointId#34346, ... 5 more fields];;
> !Project [timestamp#34323, additionalEventData#34324, apiVersion#34325, 
> awsRegion#34326, errorCode#34327, errorMessage#34328, eventID#34329, 
> eventName#34330, eventSource#34331, eventTime#34332, eventType#34333, 
> eventVersion#34334, readOnly#34335, recipientAccountId#34336, 
> requestID#34337, requestParameters#34338, resources#34339, 
> responseElements#34340, sharedEventID#34341, sourceIPAddress#34342, 
> serviceEventDetails#34343, userAgent#34344, userIdentity#34345, 
> vpcEndpointId#34346, ... 5 more fields]
> +- Aggregate [eventID#34329], [first(timestamp#34323, false) AS 
> timestamp#34105, first(additionalEventData#34324, false) AS 
> additionalEventData#34106, first(apiVersion#34325, false) AS 
> apiVersion#34107, first(awsRegion#34326, false) AS awsRegion#34108, 
> first(errorCode#34327, false) AS errorCode#34109, first(errorMessage#34328, 
> false) AS errorMessage#34110, eventID#34329, first(eventName#34330, false) AS 
> eventName#34112, first(eventSource#34331, false) AS eventSource#34113, 
> first(eventTime#34332, false) AS eventTime#34114, first(eventType#34333, 
> false) AS eventType#34115, first(eventVersion#34334, false) AS 
> eventVersion#34116, first(readOnly#34335, false) AS readOnly#34117, 
> first(recipientAccountId#34336, false) AS recipientAccountId#34118, 
> first(requestID#34337, false) AS requestID#34119, 
> first(requestParameters#34338, false) AS requestParameters#34120, 
> first(resources#34339, false) AS resources#34121, 
> first(responseElements#34340, false) AS responseElements#34122, 
> first(sharedEventID#34341, false) AS sharedEventID#34123, 
> first(sourceIPAddress#34342, false) AS sourceIPAddress#34124, 
> first(serviceEventDetails#34343, false) AS serviceEventDetails#34125, 
> first(userAgent#34344, false) AS userAgent#34126, first(userIdentity#34345, 
> false) AS userIdentity#34127, first(vpcEndpointId#34346, false) AS 
> vpcEndpointId#34128, ... 5 more fields]
>+- Project [timestamp#34323, additionalEventData#34324, apiVersion#34325, 
> awsRegion#34326, errorCode#34327, errorMessage#34328, eventID#34329, 
> eventName#34330, eventSource#34331, eventTime#34332, eventType#34333, 
> eventVersion#34334, readOnly#34335, recipientAccountId#34336, 
> requestID#34337, requestParameters#34338, resources#34339, 
> responseElements#34340, sharedEventID#34341, sourceIPAddress#34342, 
> serviceEventDetails#34343, userAgent#34344, userIdentity#34345, 
> vpcEndpointId#34346, ... 5 more fields]
>   +- 
> Relation[timestamp#34323,additionalEventData#34324,apiVersion#34325,awsRegion#34326,errorCode#34327,errorMessage#3432
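
The exception above is truncated in the archive. For context, a minimal streaming pipeline that exercises the same {{dropDuplicates}} path looks roughly like the sketch below; the socket source, host/port, and column name are assumptions rather than details from the report:

{code:scala}
import org.apache.spark.sql.SparkSession

object StreamingDropDuplicatesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-19065-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Streaming source; assumes `nc -lk 9999` is running locally.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Per the report, deduplicating a streaming Dataset surfaces a confusing
    // AnalysisException instead of a clear "not supported on streaming
    // Datasets" message; this sketch only shows where the call sits.
    val deduped = lines.as[String].dropDuplicates("value")

    val query = deduped.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
{code}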

[jira] [Updated] (SPARK-19189) Optimize CartesianRDD to avoid partition re-computation and re-serialization

2017-01-13 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-19189:
---
Issue Type: Improvement  (was: Bug)

> Optimize CartesianRDD to avoid partition re-computation and re-serialization
> 
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly. (A short sketch of the usual persist-before-cartesian workaround 
> follows this message.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
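
The behavior described in the report above can be illustrated with a short sketch. The sizes, the local master, and {{expensiveTransform}} below are made up, and the explicit {{persist}} is today's usual workaround rather than the optimization this ticket asks for:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CartesianPersistSketch {

  // Placeholder for a costly map function.
  private def expensiveTransform(i: Int): Int = i * 2

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-19189-sketch")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-ins for RDDA and RDDB from the description.
    val rddA = sc.parallelize(1 to 1000, 4).map(expensiveTransform)
    val rddB = sc.parallelize(1 to 1000, 4).map(expensiveTransform)

    // Without persisting, each partition of rddA is recomputed (and, when it
    // lives on another executor, re-serialized) once for every partition of
    // rddB it is paired with, and vice versa.
    val slow = rddA.cartesian(rddB).count()

    // Usual workaround until CartesianRDD itself is optimized: persist both
    // parents so each partition is computed once and afterwards only fetched.
    rddA.persist(StorageLevel.MEMORY_AND_DISK)
    rddB.persist(StorageLevel.MEMORY_AND_DISK)
    val faster = rddA.cartesian(rddB).count()

    println(s"counts: $slow / $faster")
    spark.stop()
  }
}
{code}

Persisting trades recomputation for storage; this ticket is about avoiding both inside CartesianRDD itself.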



[jira] [Updated] (SPARK-19203) Optimize CartesianRDD to avoid partition re-computation and re-serialization

2017-01-13 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-19203:
---
Priority: Minor  (was: Major)

> Optimize CartesianRDD to avoid partition re-computation and re-serialization
> 
>
> Key: SPARK-19203
> URL: https://issues.apache.org/jira/browse/SPARK-19203
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19190) Optimize CartesianRDD to avoid partition re-computation and re-serialization

2017-01-13 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-19190:
---
Priority: Minor  (was: Major)

> Optimize CartesianRDD to avoid partition re-computation and re-serialization
> 
>
> Key: SPARK-19190
> URL: https://issues.apache.org/jira/browse/SPARK-19190
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19190) Optimize CartesianRDD to avoid partition re-computation and re-serialization

2017-01-13 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-19190:
---
Issue Type: Improvement  (was: Bug)

> Optimize CartesianRDD to avoid partition re-computation and re-serialization
> 
>
> Key: SPARK-19190
> URL: https://issues.apache.org/jira/browse/SPARK-19190
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19189) Optimize CartesianRDD to avoid partition re-computation and re-serialization

2017-01-13 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-19189:
---
Priority: Minor  (was: Major)

> Optimize CartesianRDD to avoid partition re-computation and re-serialization
> 
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18863) Output non-aggregate expressions without GROUP BY in a subquery does not yield an error

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821679#comment-15821679
 ] 

Apache Spark commented on SPARK-18863:
--

User 'nsyca' has created a pull request for this issue:
https://github.com/apache/spark/pull/16572

> Output non-aggregate expressions without GROUP BY in a subquery does not 
> yield an error 
> 
>
> Key: SPARK-18863
> URL: https://issues.apache.org/jira/browse/SPARK-18863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> [~smilegator] has found that the following query does not raise a syntax 
> error (note the GROUP BY clause is commented out):
> {code:sql}
> SELECT pk, cv
> FROM   p, c
> WHERE  p.pk = c.ck
> AND    c.cv = (SELECT max(avg)
>                FROM  (SELECT c1.cv, avg(c1.cv) avg
>                       FROM   c c1
>                       WHERE  c1.ck = p.pk
> --                    GROUP BY c1.cv
>                      ))
> {code}
> There could be multiple values of {{c1.cv}} for each value of {{avg(c1.cv)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
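
The query from the report can be tried end to end with a small sketch like the one below; the two tiny temp views standing in for {{p}} and {{c}} are made up for illustration:

{code:scala}
import org.apache.spark.sql.SparkSession

object NonAggSubquerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-18863-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical tiny tables standing in for p and c from the report.
    Seq(1, 2).toDF("pk").createOrReplaceTempView("p")
    Seq((1, 10.0), (1, 20.0), (2, 30.0)).toDF("ck", "cv").createOrReplaceTempView("c")

    // With the GROUP BY commented out, c1.cv is a non-aggregate expression in
    // an aggregating subquery; the analyzer should reject this, but per the
    // report the query is accepted.
    spark.sql(
      """SELECT pk, cv
        |FROM   p, c
        |WHERE  p.pk = c.ck
        |AND    c.cv = (SELECT max(avg)
        |               FROM  (SELECT c1.cv, avg(c1.cv) avg
        |                      FROM   c c1
        |                      WHERE  c1.ck = p.pk
        |                      -- GROUP BY c1.cv
        |                     ))
        |""".stripMargin).show()

    spark.stop()
  }
}
{code}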



[jira] [Assigned] (SPARK-18863) Output non-aggregate expressions without GROUP BY in a subquery does not yield an error

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18863:


Assignee: Apache Spark

> Output non-aggregate expressions without GROUP BY in a subquery does not 
> yield an error 
> 
>
> Key: SPARK-18863
> URL: https://issues.apache.org/jira/browse/SPARK-18863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Apache Spark
>
> [~smilegator] has found that the following query does not raise a syntax 
> error (note the GROUP BY clause is commented out):
> {code:sql}
> SELECT pk, cv
> FROM   p, c
> WHERE  p.pk = c.ck
> AND    c.cv = (SELECT max(avg)
>                FROM  (SELECT c1.cv, avg(c1.cv) avg
>                       FROM   c c1
>                       WHERE  c1.ck = p.pk
> --                    GROUP BY c1.cv
>                      ))
> {code}
> There could be multiple values of {{c1.cv}} for each value of {{avg(c1.cv)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18863) Output non-aggregate expressions without GROUP BY in a subquery does not yield an error

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18863:


Assignee: (was: Apache Spark)

> Output non-aggregate expressions without GROUP BY in a subquery does not 
> yield an error 
> 
>
> Key: SPARK-18863
> URL: https://issues.apache.org/jira/browse/SPARK-18863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> [~smilegator] has found that the following query does not raise a syntax 
> error (note the GROUP BY clause is commented out):
> {code:sql}
> SELECT pk, cv
> FROM   p, c
> WHERE  p.pk = c.ck
> AND    c.cv = (SELECT max(avg)
>                FROM  (SELECT c1.cv, avg(c1.cv) avg
>                       FROM   c c1
>                       WHERE  c1.ck = p.pk
> --                    GROUP BY c1.cv
>                      ))
> {code}
> There could be multiple values of {{c1.cv}} for each value of {{avg(c1.cv)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19210) Add log level info into streaming checkpoint

2017-01-13 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-19210:
-

 Summary: Add log level info into streaming checkpoint
 Key: SPARK-19210
 URL: https://issues.apache.org/jira/browse/SPARK-19210
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.1.0, 2.0.2
Reporter: Genmao Yu


If we set the log level with {{SparkContext.setLogLevel}}, the setting is lost 
after the streaming job is restarted from checkpoint data. (A short workaround 
sketch follows this message.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
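
A minimal sketch of the behavior and the usual workaround is below; the checkpoint path, batch interval, and socket source are placeholders, and the second {{setLogLevel}} call is the workaround, not the fix proposed here:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointLogLevelSketch {
  def main(args: Array[String]): Unit = {
    val checkpointDir = "/tmp/spark-19210-checkpoint"  // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("spark-19210-sketch").setMaster("local[2]")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.sparkContext.setLogLevel("WARN")  // not written into checkpoint data today
      ssc.checkpoint(checkpointDir)
      // Minimal output operation so the context is valid; assumes `nc -lk 9999`.
      ssc.socketTextStream("localhost", 9999).count().print()
      ssc
    }

    // On restart this recovers the context from the checkpoint, and the WARN
    // setting made above is not restored with it.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)

    // Workaround until the level is checkpointed: set it again after recovery.
    ssc.sparkContext.setLogLevel("WARN")

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}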



[jira] [Assigned] (SPARK-19210) Add log level info into streaming checkpoint

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19210:


Assignee: (was: Apache Spark)

> Add log level info into streaming checkpoint
> 
>
> Key: SPARK-19210
> URL: https://issues.apache.org/jira/browse/SPARK-19210
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>
> If we set the log level with {{SparkContext.setLogLevel}}, the setting is lost 
> after the streaming job is restarted from checkpoint data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19210) Add log level info into streaming checkpoint

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821689#comment-15821689
 ] 

Apache Spark commented on SPARK-19210:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16573

> Add log level info into streaming checkpoint
> 
>
> Key: SPARK-19210
> URL: https://issues.apache.org/jira/browse/SPARK-19210
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>
> If we set the log level with {{SparkContext.setLogLevel}}, the setting is lost 
> after the streaming job is restarted from checkpoint data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19210) Add log level info into streaming checkpoint

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19210:


Assignee: Apache Spark

> Add log level info into streaming checkpoint
> 
>
> Key: SPARK-19210
> URL: https://issues.apache.org/jira/browse/SPARK-19210
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Apache Spark
>
> If we set the log level with {{SparkContext.setLogLevel}}, the setting is lost 
> after the streaming job is restarted from checkpoint data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert

2017-01-13 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-19211:


 Summary: Explicitly prevent Insert into View or Create View As 
Insert
 Key: SPARK-19211
 URL: https://issues.apache.org/jira/browse/SPARK-19211
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Jiang Xingbo


Currently we do not explicitly reject the following statements; they only fail 
late, with confusing errors:
1. CREATE VIEW ... AS INSERT INTO ... throws the following exception from 
SQLBuilder:
`java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
MetastoreRelation default, tbl, false, false`;
2. INSERT INTO <view> VALUES ... throws the following exception from 
checkAnalysis:
`Error in query: Inserting into an RDD-based table is not allowed.;;`

We should check for these cases earlier and explicitly prevent them. (A short 
reproduction sketch follows this message.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
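
For reference, the two statements can be reproduced along the following lines; the table and view names are placeholders and a Hive-enabled local session is assumed:

{code:scala}
import scala.util.Try

import org.apache.spark.sql.SparkSession

object InsertIntoViewSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-19211-sketch")
      .master("local[2]")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE TABLE IF NOT EXISTS tbl (id INT)")
    spark.sql("CREATE OR REPLACE VIEW v AS SELECT id FROM tbl")

    // 1. Currently fails late, in SQLBuilder, with an UnsupportedOperationException.
    println(Try(spark.sql("CREATE VIEW v_bad AS INSERT INTO tbl VALUES (1)")))

    // 2. Currently fails in checkAnalysis with the misleading
    //    "Inserting into an RDD-based table is not allowed" error.
    println(Try(spark.sql("INSERT INTO v VALUES (1)")))

    spark.stop()
  }
}
{code}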



[jira] [Created] (SPARK-19212) Parse the view query in HiveSessionCatalog

2017-01-13 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-19212:


 Summary: Parse the view query in HiveSessionCatalog
 Key: SPARK-19212
 URL: https://issues.apache.org/jira/browse/SPARK-19212
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Jiang Xingbo


Currently we parse the view query and generate the parsed plan in 
HiveMetastoreCatalog; we should really do that in HiveSessionCatalog instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19210) Add log level info into streaming checkpoint

2017-01-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19210:
--
Target Version/s:   (was: 2.0.3, 2.1.1)
Priority: Minor  (was: Major)
  Issue Type: Improvement  (was: Bug)

> Add log level info into streaming checkpoint
> 
>
> Key: SPARK-19210
> URL: https://issues.apache.org/jira/browse/SPARK-19210
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>
> If we set the log level with {{SparkContext.setLogLevel}}, the setting is lost 
> after the streaming job is restarted from checkpoint data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18971:


Assignee: (was: Apache Spark)

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> Check https://github.com/netty/netty/issues/6153 for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821707#comment-15821707
 ] 

Apache Spark commented on SPARK-18971:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16568

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> Check https://github.com/netty/netty/issues/6153 for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18971:


Assignee: Apache Spark

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Check https://github.com/netty/netty/issues/6153 for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid partition re-computation and re-serialization
> 
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid partition re-computation and re-serialization
> 
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19189) Optimize CartesianRDD to avoid partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821711#comment-15821711
 ] 

Apache Spark commented on SPARK-19189:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/16574

> Optimize CartesianRDD to avoid partition re-computation and re-serialization
> 
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-19189:
---
Summary: Optimize CartesianRDD to avoid parent RDD partition re-computation 
and re-serialization  (was: Optimize CartesianRDD to avoid partition 
re-computation and re-serialization)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19203) Optimize CartesianRDD to avoid partition re-computation and re-serialization

2017-01-13 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang deleted SPARK-19203:



> Optimize CartesianRDD to avoid partition re-computation and re-serialization
> 
>
> Key: SPARK-19203
> URL: https://issues.apache.org/jira/browse/SPARK-19203
> Project: Spark
>  Issue Type: Improvement
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19190) Optimize CartesianRDD to avoid partition re-computation and re-serialization

2017-01-13 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang deleted SPARK-19190:



> Optimize CartesianRDD to avoid partition re-computation and re-serialization
> 
>
> Key: SPARK-19190
> URL: https://issues.apache.org/jira/browse/SPARK-19190
> Project: Spark
>  Issue Type: Improvement
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA.cartesian(RDDB) 
> generates RDDC:
> each partition of RDDA is read by multiple partitions of RDDC, and RDDB has 
> the same problem.
> As a result, while RDDC partitions are being computed, each partition's data 
> in RDDA or RDDB is repeatedly serialized (and transferred over the network), 
> and if RDDA or RDDB has not been persisted, its partitions are also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821728#comment-15821728
 ] 

Sean Owen commented on SPARK-19208:
---

You have 29,890,095 features. At extremes of scale this might make a 
difference, but this isn't at all a normal use case. The difference between 3 
and 8 eight-byte values per feature is, at any normal scale, trivial. I am not 
sure it's worth duplicating this code to optimize for it. You can just write 
your own custom scaling for this extreme case and make it even more efficient 
(a rough sketch follows this message).

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: WechatIMG2621.jpeg
>
>
> Right now, {{MaxAbsScaler}} and {{MinMaxScaler}} use 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} also computes extra, unused 
> statistics. This slows down the task and makes it more prone to OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
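
A rough sketch of the hand-rolled pass suggested above, tracking only the per-feature max |value| instead of the summarizer's full set of arrays; the names and structure are illustrative, not the change made in the PR:

{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

object MaxAbsSketch {
  // One pass over the data, keeping a single Array[Double] of max |value| per
  // feature instead of the eight arrays held by MultivariateOnlineSummarizer.
  def maxAbsByFeature(data: RDD[Vector], numFeatures: Int): Array[Double] = {
    data.treeAggregate(Array.fill(numFeatures)(0.0))(
      seqOp = (maxAbs, v) => {
        v.foreachActive { (i, value) =>
          val a = math.abs(value)
          if (a > maxAbs(i)) maxAbs(i) = a
        }
        maxAbs
      },
      combOp = (left, right) => {
        var i = 0
        while (i < numFeatures) {
          if (right(i) > left(i)) left(i) = right(i)
          i += 1
        }
        left
      })
  }
}
{code}

For the dataset in the report this cuts the per-task state from eight arrays of roughly 29.9 million doubles down to one.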



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is crossed with RDDB to 
> produce RDDC: each RDDA partition is read by multiple RDDC partitions, and RDDB 
> has the same problem.
> As a result, while an RDDC partition is being computed, the corresponding 
> partition data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, it is also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is crossed with RDDB to 
> produce RDDC: each RDDA partition is read by multiple RDDC partitions, and RDDB 
> has the same problem.
> As a result, while an RDDC partition is being computed, the corresponding 
> partition data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, it is also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is crossed with RDDB to 
> produce RDDC: each RDDA partition is read by multiple RDDC partitions, and RDDB 
> has the same problem.
> As a result, while an RDDC partition is being computed, the corresponding 
> partition data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, it is also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is crossed with RDDB to 
> produce RDDC: each RDDA partition is read by multiple RDDC partitions, and RDDB 
> has the same problem.
> As a result, while an RDDC partition is being computed, the corresponding 
> partition data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, it is also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19213) FileSourceScanExec uses SparkSession from HadoopFsRelation creation time instead of the one active at time of execution

2017-01-13 Thread Robert Kruszewski (JIRA)
Robert Kruszewski created SPARK-19213:
-

 Summary: FileSourceScanExec uses SparkSession from 
HadoopFsRelation creation time instead of the one active at time of execution
 Key: SPARK-19213
 URL: https://issues.apache.org/jira/browse/SPARK-19213
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Robert Kruszewski


If you look at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
you'll notice that the SparkSession used for execution is the one that was 
captured from the logical plan. In other places, however, you have 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
and SparkPlan captures the active session upon execution in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52

From my understanding of the IO code, it would be beneficial to use the active 
session so that the Hadoop configuration can be modified without recreating the 
Dataset. It would also be interesting not to lock the Spark session into the 
physical plan for IO, and to let you share Datasets across Spark sessions. Is 
that supposed to work? Otherwise you would have to obtain a new query execution 
bound to the new SparkSession, which would only let you share logical plans.

I am sending a PR for the latter.
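For context, a minimal sketch of the multi-session scenario being asked about 
(assuming the standard Spark 2.1 shell; the config key and path below are only 
examples):

{code}
// A second session shares the SparkContext but keeps its own SQL conf and state.
val spark2 = spark.newSession()
spark2.conf.set("spark.sql.files.maxPartitionBytes", (32 * 1024 * 1024).toString)

// A Dataset built from the first session currently keeps a reference to that
// session in its file scan, so the per-session setting above is not what the
// scan sees if the plan is later executed while spark2 is the active session.
val df = spark.read.parquet("/path/to/data")   // hypothetical path
{code}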



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19213) FileSourceScanExec uses SparkSession from HadoopFsRelation creation time instead of the one active at time of execution

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19213:


Assignee: (was: Apache Spark)

> FileSourceScanExec uses SparkSession from HadoopFsRelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
> you'll notice that the SparkSession used for execution is the one that was 
> captured from the logical plan. In other places, however, you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
> and SparkPlan captures the active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the IO code, it would be beneficial to use the active 
> session so that the Hadoop configuration can be modified without recreating the 
> Dataset. It would also be interesting not to lock the Spark session into the 
> physical plan for IO, and to let you share Datasets across Spark sessions. Is 
> that supposed to work? Otherwise you would have to obtain a new query execution 
> bound to the new SparkSession, which would only let you share logical plans.
> I am sending a PR for the latter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19213) FileSourceScanExec uses SparkSession from HadoopFsRelation creation time instead of the one active at time of execution

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19213:


Assignee: Apache Spark

> FileSourceScanExec uses SparkSession from HadoopFsRelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>Assignee: Apache Spark
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
> you'll notice that the SparkSession used for execution is the one that was 
> captured from the logical plan. In other places, however, you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
> and SparkPlan captures the active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the IO code, it would be beneficial to use the active 
> session so that the Hadoop configuration can be modified without recreating the 
> Dataset. It would also be interesting not to lock the Spark session into the 
> physical plan for IO, and to let you share Datasets across Spark sessions. Is 
> that supposed to work? Otherwise you would have to obtain a new query execution 
> bound to the new SparkSession, which would only let you share logical plans.
> I am sending a PR for the latter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19213) FileSourceScanExec uses SparkSession from HadoopFsRelation creation time instead of the one active at time of execution

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821736#comment-15821736
 ] 

Apache Spark commented on SPARK-19213:
--

User 'robert3005' has created a pull request for this issue:
https://github.com/apache/spark/pull/16575

> FileSourceScanExec uses SparkSession from HadoopFsRelation creation time 
> instead of the one active at time of execution
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
> you'll notice that the SparkSession used for execution is the one that was 
> captured from the logical plan. In other places, however, you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
> and SparkPlan captures the active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the IO code, it would be beneficial to use the active 
> session so that the Hadoop configuration can be modified without recreating the 
> Dataset. It would also be interesting not to lock the Spark session into the 
> physical plan for IO, and to let you share Datasets across Spark sessions. Is 
> that supposed to work? Otherwise you would have to obtain a new query execution 
> bound to the new SparkSession, which would only let you share logical plans.
> I am sending a PR for the latter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19177) SparkR Data Frame operation between columns elements

2017-01-13 Thread Vicente Masip (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821758#comment-15821758
 ] 

Vicente Masip commented on SPARK-19177:
---

If I want to specify a schema with gapply, or I NEED to specify it with dapply, I 
run into a problem. The documentation example is beautiful:

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)

The data.frame returned inside the function is 3 columns wide. I have 50 columns, 
and I want to return all of them again plus a new computed column.

Imagine that in function(x) { x <- cbind(x, x$waiting * 60) }, x has many 
columns, and the new column has to be handled with a schema passed to the 
enclosing dapply call. How would you define that schema? You cannot append a 
structField to a structType.

Finally, I am going to solve it with a dummy new column created with lit(), 
taking its new schema, and then deleting the dummy column. Not elegant, but it 
lets me keep working.

> SparkR Data Frame operation between columns elements
> 
>
> Key: SPARK-19177
> URL: https://issues.apache.org/jira/browse/SPARK-19177
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
>Reporter: Vicente Masip
>Priority: Minor
>  Labels: schema, sparkR, struct
>
> I have commented on this in another thread, but I think it is important to 
> clarify:
> What happens when you are working with 50 columns and gapply? Do I rewrite the 
> 50-column schema by hand just to add the new column from the gapply operation? 
> I think there is no alternative, because structFields cannot be appended to a 
> structType. Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19177) SparkR Data Frame operation between columns elements

2017-01-13 Thread Vicente Masip (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821758#comment-15821758
 ] 

Vicente Masip edited comment on SPARK-19177 at 1/13/17 1:19 PM:


If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is beautiful: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function( x ) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  function( x ){ x <- cbind(x, x$waiting * 60) , in some way, x 
has many columns, and the new column has to be handled with an schema at the 
outside function dapply. How would yo define schema? You cannot append an 
structField to the structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.


was (Author: masip85):
If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is beautiful: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function( x ) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  function( x ) { x <- cbind(x, x$waiting * 60) , in some way, x 
has many columns, and the new column has to be handled with an schema at the 
outside function dapply. How would yo define schema? You cannot append an 
structField to the structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.

> SparkR Data Frame operation between columns elements
> 
>
> Key: SPARK-19177
> URL: https://issues.apache.org/jira/browse/SPARK-19177
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
>Reporter: Vicente Masip
>Priority: Minor
>  Labels: schema, sparkR, struct
>
> I have commented on this in another thread, but I think it is important to 
> clarify:
> What happens when you are working with 50 columns and gapply? Do I rewrite the 
> 50-column schema by hand just to add the new column from the gapply operation? 
> I think there is no alternative, because structFields cannot be appended to a 
> structType. Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19177) SparkR Data Frame operation between columns elements

2017-01-13 Thread Vicente Masip (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821758#comment-15821758
 ] 

Vicente Masip edited comment on SPARK-19177 at 1/13/17 1:19 PM:


If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is beautiful: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function( x ) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  function( x ) { x <- cbind(x, x$waiting * 60) , in some way, x 
has many columns, and the new column has to be handled with an schema at the 
outside function dapply. How would yo define schema? You cannot append an 
structField to the structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.


was (Author: masip85):
If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is beautiful: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  function( x ) { x <- cbind(x, x$waiting * 60) , in some way, x 
has many columns, and the new column has to be handled with an schema at the 
outside function dapply. How would yo define schema? You cannot append an 
structField to the structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.

> SparkR Data Frame operation between columns elements
> 
>
> Key: SPARK-19177
> URL: https://issues.apache.org/jira/browse/SPARK-19177
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
>Reporter: Vicente Masip
>Priority: Minor
>  Labels: schema, sparkR, struct
>
> I have commented on this in another thread, but I think it is important to 
> clarify:
> What happens when you are working with 50 columns and gapply? Do I rewrite the 
> 50-column schema by hand just to add the new column from the gapply operation? 
> I think there is no alternative, because structFields cannot be appended to a 
> structType. Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19177) SparkR Data Frame operation between columns elements

2017-01-13 Thread Vicente Masip (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821758#comment-15821758
 ] 

Vicente Masip edited comment on SPARK-19177 at 1/13/17 1:18 PM:


If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is beautiful: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  function( x ) { x <- cbind(x, x$waiting * 60) , in some way, x 
has many columns, and the new column has to be handled with an schema at the 
outside function dapply. How would yo define schema? You cannot append an 
structField to the structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.


was (Author: masip85):
If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is beautiful: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  function(x) { x <- cbind(x, x$waiting * 60) , in some way, x has 
many columns, and the new column has to be handled with an schema at the 
outside function dapply. How would yo define schema? You cannot append an 
structField to the structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.

> SparkR Data Frame operation between columns elements
> 
>
> Key: SPARK-19177
> URL: https://issues.apache.org/jira/browse/SPARK-19177
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
>Reporter: Vicente Masip
>Priority: Minor
>  Labels: schema, sparkR, struct
>
> I have commented on this in another thread, but I think it is important to 
> clarify:
> What happens when you are working with 50 columns and gapply? Do I rewrite the 
> 50-column schema by hand just to add the new column from the gapply operation? 
> I think there is no alternative, because structFields cannot be appended to a 
> structType. Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19177) SparkR Data Frame operation between columns elements

2017-01-13 Thread Vicente Masip (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821758#comment-15821758
 ] 

Vicente Masip edited comment on SPARK-19177 at 1/13/17 1:19 PM:


If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is beautiful: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function( x ) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  
function( x ) {x <- cbind(x, x$waiting * 60)} , in some way, x has many 
columns, and the new column has to be handled with an schema at the outside 
function dapply. How would yo define schema? You cannot append an structField 
to the structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.


was (Author: masip85):
If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is beautiful: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function( x ) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  function( x ){ x <- cbind(x, x$waiting * 60) , in some way, x 
has many columns, and the new column has to be handled with an schema at the 
outside function dapply. How would yo define schema? You cannot append an 
structField to the structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.

> SparkR Data Frame operation between columns elements
> 
>
> Key: SPARK-19177
> URL: https://issues.apache.org/jira/browse/SPARK-19177
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
>Reporter: Vicente Masip
>Priority: Minor
>  Labels: schema, sparkR, struct
>
> I have commented on this in another thread, but I think it is important to 
> clarify:
> What happens when you are working with 50 columns and gapply? Do I rewrite the 
> 50-column schema by hand just to add the new column from the gapply operation? 
> I think there is no alternative, because structFields cannot be appended to a 
> structType. Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19177) SparkR Data Frame operation between columns elements

2017-01-13 Thread Vicente Masip (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821758#comment-15821758
 ] 

Vicente Masip edited comment on SPARK-19177 at 1/13/17 1:20 PM:


If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is simple: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function( x ) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  
function( x ){x <- cbind(x, x$waiting * 60)} , in some way, x has many columns, 
and the new column has to be handled with an schema at the outside function 
dapply. How would yo define schema? You cannot append an structField to the 
structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.


was (Author: masip85):
If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is simple: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function( x ) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  
function( x ){x <- cbind(x, x$waiting * 60)} , in some way, x has many columns, 
and the new column has to be handled with an schema at the outside function 
dapply. How would yo define schema? You cannot append an structField to the 
structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.

> SparkR Data Frame operation between columns elements
> 
>
> Key: SPARK-19177
> URL: https://issues.apache.org/jira/browse/SPARK-19177
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
>Reporter: Vicente Masip
>Priority: Minor
>  Labels: schema, sparkR, struct
>
> I have commented on this in another thread, but I think it is important to 
> clarify:
> What happens when you are working with 50 columns and gapply? Do I rewrite the 
> 50-column schema by hand just to add the new column from the gapply operation? 
> I think there is no alternative, because structFields cannot be appended to a 
> structType. Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19177) SparkR Data Frame operation between columns elements

2017-01-13 Thread Vicente Masip (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821758#comment-15821758
 ] 

Vicente Masip edited comment on SPARK-19177 at 1/13/17 1:20 PM:


If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is simple: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function( x ) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  
function( x ){x <- cbind(x, x$waiting * 60)} , in some way, x has many columns, 
and the new column has to be handled with an schema at the outside function 
dapply. How would yo define schema? You cannot append an structField to the 
structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.


was (Author: masip85):
If I want to specify schema with gapply or I NEED to specify it at dapply, I 
have had a problem.  The documentation example is beautiful: 

schema <- structType(structField("eruptions", "double"), structField("waiting", 
"double"),
 structField("waiting_secs", "double"))
df1 <- dapply(df, function( x ) { x <- cbind(x, x$waiting * 60) }, schema)

your returning data.frame inside function is 3 columns size. I have 50 columns, 
and I want to return them all again a new computed column. 

Imagine that:  
function( x ) {x <- cbind(x, x$waiting * 60)} , in some way, x has many 
columns, and the new column has to be handled with an schema at the outside 
function dapply. How would yo define schema? You cannot append an structField 
to the structType.

Finally I'm going to solve it with a dummy new column specified with a lit, 
getting it's new schema and deleting the new column. Not elegant, but I keep on 
my work.

> SparkR Data Frame operation between columns elements
> 
>
> Key: SPARK-19177
> URL: https://issues.apache.org/jira/browse/SPARK-19177
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
>Reporter: Vicente Masip
>Priority: Minor
>  Labels: schema, sparkR, struct
>
> I have commented on this in another thread, but I think it is important to 
> clarify:
> What happens when you are working with 50 columns and gapply? Do I rewrite the 
> 50-column schema by hand just to add the new column from the gapply operation? 
> I think there is no alternative, because structFields cannot be appended to a 
> structType. Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821773#comment-15821773
 ] 

Steve Loughran commented on SPARK-19111:


Just realised one more thing.

If the allocated threads for writing data are all used up in PUT calls, the 
thread calling write() will be blocked. And the tuning I mentioned was "keep 
the number of publisher threads down to minimise heap use".

This means that if the log creation rate exceeds the upload bandwidth for some 
period of time, whichever thread is converting history events into write calls 
will block. Hadoop 2.8 s3a reduces the risk here, as it buffers to local disk 
by default, but even there, if there is a mismatch between the data generation 
and upload rates, you'll be in trouble.
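For reference, a hedged sketch of the Hadoop 2.8+ s3a tuning mentioned above, 
applied through the SparkContext's Hadoop configuration (the key names come from 
s3a; the values are only examples and may need adjusting for a given deployment):

{code}
// Buffer pending blocks on local disk rather than on-heap, and allow more
// upload threads, so write() is less likely to block behind slow PUTs.
sc.hadoopConfiguration.set("fs.s3a.fast.upload.buffer", "disk")
sc.hadoopConfiguration.set("fs.s3a.threads.max", "16")
sc.hadoopConfiguration.set("fs.s3a.multipart.size", "67108864")   // 64 MB parts
{code}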

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running spark on mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yield an upload are as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs

[jira] [Created] (SPARK-19214) Inconsistencies between DataFrame and Dataset APIs

2017-01-13 Thread Alexander Alexandrov (JIRA)
Alexander Alexandrov created SPARK-19214:


 Summary: Inconsistencies between DataFrame and Dataset APIs
 Key: SPARK-19214
 URL: https://issues.apache.org/jira/browse/SPARK-19214
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: Alexander Alexandrov
Priority: Trivial


I am not sure whether this has been reported already, but there are some 
confusing & annoying inconsistencies when programming the same expression in 
the Dataset and the DataFrame APIs.

Consider the following minimal example executed in a Spark Shell:


{code}
case class Point(x: Int, y: Int, z: Int)

val ps = spark.createDataset(for {
  x <- 1 to 10; 
  y <- 1 to 10; 
  z <- 1 to 10
} yield Point(x, y, z))

// Problem 1:
// count produces different fields in the Dataset / DataFrame variants

// count() on grouped DataFrame: field name is `count`
ps.groupBy($"x").count().printSchema
// root
//  |-- x: integer (nullable = false)
//  |-- count: long (nullable = false)

// count() on grouped Dataset: field name is `count(1)`
ps.groupByKey(_.x).count().printSchema
// root
//  |-- value: integer (nullable = true)
//  |-- count(1): long (nullable = false)

// Problem 2:
// groupByKey produces different `key` field name depending
// on the result type
// this is especially confusing in the first case below (simple key types)
// where the key field is actually named `value`

// simple key types
ps.groupByKey(p => p.x).count().printSchema
// root
//  |-- value: integer (nullable = true)
//  |-- count(1): long (nullable = false)

// complex key types
ps.groupByKey(p => (p.x, p.y)).count().printSchema
// root
//  |-- key: struct (nullable = false)
//  ||-- _1: integer (nullable = true)
//  ||-- _2: integer (nullable = true)
//  |-- count(1): long (nullable = false)
{code}
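As a possible workaround for the first inconsistency (a sketch only, not a fix for 
the API itself), the grouped-Dataset result can be renamed so that both APIs expose 
the same column names:

{code}
// Rename the grouped-Dataset result columns to match the DataFrame variant.
ps.groupByKey(_.x).count().toDF("x", "count").printSchema()
// root
//  |-- x: integer (nullable = true)
//  |-- count: long (nullable = false)
{code}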



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18667) input_file_name function does not work with UDF

2017-01-13 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821852#comment-15821852
 ] 

Liang-Chi Hsieh commented on SPARK-18667:
-

Hi [~someonehere15],

Thanks for providing the info. I can reproduce the issue with the spark-xml 
package: when a UDF is applied to the input_file_name column, the result is 
empty.

I will take a look into this issue.



> input_file_name function does not work with UDF
> ---
>
> Key: SPARK-18667
> URL: https://issues.apache.org/jira/browse/SPARK-18667
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> {{input_file_name()}} does not return the file name but empty string instead 
> when it is used as input for UDF in PySpark as below: 
> with the data as below:
> {code}
> {"a": 1}
> {code}
> with the codes below:
> {code}
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def filename(path):
> return path
> sourceFile = udf(filename, StringType())
> spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
> {code}
> prints as below:
> {code}
> +---+
> |filename(input_file_name())|
> +---+
> |   |
> +---+
> {code}
> but the codes below:
> {code}
> spark.read.json("tmp.json").select(input_file_name()).show()
> {code}
> prints correctly as below:
> {code}
> ++
> |   input_file_name()|
> ++
> |file:///Users/hyu...|
> ++
> {code}
> This seems PySpark specific issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17568) Add spark-submit option for user to override ivy settings used to resolve packages/artifacts

2017-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821862#comment-15821862
 ] 

Apache Spark commented on SPARK-17568:
--

User 'themodernlife' has created a pull request for this issue:
https://github.com/apache/spark/pull/16563

> Add spark-submit option for user to override ivy settings used to resolve 
> packages/artifacts
> 
>
> Key: SPARK-17568
> URL: https://issues.apache.org/jira/browse/SPARK-17568
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.2.0
>
>
> The {{--packages}} option to {{spark-submit}} uses Ivy to map Maven 
> coordinates to package jars. Currently, the IvySettings are hard-coded with 
> Maven Central as the last repository in the chain of resolvers. 
> At IBM, we have heard from several enterprise clients that are frustrated 
> with lack of control over their local Spark installations. These clients want 
> to ensure that certain artifacts can be excluded or patched due to security 
> or license issues. For example, a package may use a vulnerable SSL protocol; 
> or a package may link against an AGPL library written by a litigious 
> competitor.
> While additional repositories and exclusions can be added on the spark-submit 
> command line, this falls short of what is needed. With Maven Central always 
> as a fall-back repository, it is difficult to ensure only approved artifacts 
> are used and it is often the exclusions that site admins are not aware of 
> that can cause problems. Also, known exclusions are better handled through a 
> centralized managed repository rather than as command line arguments.
> To resolve these issues, we propose the following change: allow the user to 
> specify an Ivy Settings XML file to pass in as an optional argument to 
> {{spark-submit}} (or specify in a config file) to define alternate 
> repositories used to resolve artifacts instead of the hard-coded defaults. 
> The use case for this would be to define a managed repository (such as Nexus) 
> in the settings file so that all requests for artifacts go through one 
> location only.
> Example usage:
> {noformat}
> $SPARK_HOME/bin/spark-submit --conf 
> spark.ivy.settings=/path/to/ivysettings.xml  myapp.jar
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19215) Add necessary check for `RDD.checkpoint` to avoid potential mistakes

2017-01-13 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-19215:
--

 Summary: Add necessary check for `RDD.checkpoint` to avoid 
potential mistakes
 Key: SPARK-19215
 URL: https://issues.apache.org/jira/browse/SPARK-19215
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Weichen Xu


Currently, RDD.checkpoint must be called before any job is executed on the RDD; 
otherwise `doCheckpoint` will never be called. This is a pitfall: we should check 
for it and throw an exception (or at least log a warning) in such cases.
Also, if the RDD has not been persisted, checkpointing will cause it to be 
recomputed, because the current implementation runs a separate job for 
checkpointing. In that case we should also print a warning message reminding the 
user to check whether they forgot to persist the RDD.
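For illustration, a minimal sketch of the ordering being discussed (assuming a 
spark-shell with {{sc}}; the checkpoint directory is only an example):

{code}
sc.setCheckpointDir("/tmp/checkpoints")            // example path
val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.persist()      // without this, the checkpoint job recomputes the lineage
rdd.checkpoint()   // must be called before the first action on this RDD
rdd.count()        // runs the job; doCheckpoint() fires afterwards
{code}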



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19136) Aggregator with case class as output type fails with ClassCastException

2017-01-13 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821868#comment-15821868
 ] 

Andrew Ray commented on SPARK-19136:


I forgot you can also just do:
{code}
ds.select(MinMaxAgg().toColumn)
{code}

> Aggregator with case class as output type fails with ClassCastException
> ---
>
> Key: SPARK-19136
> URL: https://issues.apache.org/jira/browse/SPARK-19136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Mathieu D
>Priority: Minor
>
> An {{Aggregator}} with a case class as its output type returns a Row that 
> cannot be cast back to that type; it fails with {{ClassCastException}}.
> Here is a dummy example that reproduces the problem:
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
> import org.apache.spark.sql.expressions.Aggregator
> import spark.implicits._
> case class MinMax(min: Int, max: Int)
> case class MinMaxAgg() extends Aggregator[Row, (Int, Int), MinMax] with 
> Serializable {
>   def zero: (Int, Int) = (Int.MaxValue, Int.MinValue)
>   def reduce(b: (Int, Int), a: Row): (Int, Int) = (Math.min(b._1, 
> a.getAs[Int](0)), Math.max(b._2, a.getAs[Int](0)))
>   def finish(r: (Int, Int)): MinMax = MinMax(r._1, r._2)
>   def merge(b1: (Int, Int), b2: (Int, Int)): (Int, Int) = (Math.min(b1._1, 
> b2._1), Math.max(b1._2, b2._2))
>   def bufferEncoder: Encoder[(Int, Int)] = ExpressionEncoder()
>   def outputEncoder: Encoder[MinMax] = ExpressionEncoder()
> }
> val ds = Seq(1, 2, 3, 4).toDF("col1")
> val agg = ds.agg(MinMaxAgg().toColumn.alias("minmax"))
> {code}
> bq. {code}
> ds: org.apache.spark.sql.DataFrame = [col1: int]
> agg: org.apache.spark.sql.DataFrame = [minmax: struct]
> {code}
> {code}agg.printSchema(){code}
> bq. {code}
> root
>  |-- minmax: struct (nullable = true)
>  ||-- min: integer (nullable = false)
>  ||-- max: integer (nullable = false)
> {code}
> {code}agg.head(){code}
> bq. {code}
> res1: org.apache.spark.sql.Row = [[1,4]]
> {code}
> {code}agg.head().getAs[MinMax](0){code}
> bq. {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to line4c81e18af34342cda654c381ee91139525.$read$$iw$$iw$$iw$$iw$MinMax
> [...]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19215) Add necessary check for `RDD.checkpoint` to avoid potential mistakes

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19215:


Assignee: Apache Spark

> Add necessary check for `RDD.checkpoint` to avoid potential mistakes
> 
>
> Key: SPARK-19215
> URL: https://issues.apache.org/jira/browse/SPARK-19215
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, RDD.checkpoint must be called before any job is executed on the RDD; 
> otherwise `doCheckpoint` will never be called. This is a pitfall: we should 
> check for it and throw an exception (or at least log a warning) in such cases.
> Also, if the RDD has not been persisted, checkpointing will cause it to be 
> recomputed, because the current implementation runs a separate job for 
> checkpointing. In that case we should also print a warning message reminding 
> the user to check whether they forgot to persist the RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is crossed with RDDB to 
> produce RDDC: each RDDA partition is read by multiple RDDC partitions, and RDDB 
> has the same problem.
> As a result, while an RDDC partition is being computed, the corresponding 
> partition data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, it is also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is crossed with RDDB to 
> produce RDDC: each RDDA partition is read by multiple RDDC partitions, and RDDB 
> has the same problem.
> As a result, while an RDDC partition is being computed, the corresponding 
> partition data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, it is also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is crossed with RDDB to 
> produce RDDC: each RDDA partition is read by multiple RDDC partitions, and RDDB 
> has the same problem.
> As a result, while an RDDC partition is being computed, the corresponding 
> partition data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, it is also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is crossed with RDDB to 
> produce RDDC: each RDDA partition is read by multiple RDDC partitions, and RDDB 
> has the same problem.
> As a result, while an RDDC partition is being computed, the corresponding 
> partition data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, it is also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is crossed with RDDB to 
> produce RDDC: each RDDA partition is read by multiple RDDC partitions, and RDDB 
> has the same problem.
> As a result, while an RDDC partition is being computed, the corresponding 
> partition data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, it is also recomputed 
> repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is combined with 
> RDDB via cartesian to generate RDDC: each RDDA partition is read by multiple 
> RDDC partitions, and RDDB has the same problem.
> As a result, when an RDDC partition is computed, the corresponding partition's 
> data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, the parent partitions 
> are also recomputed repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: (was: Apache Spark)

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is combined with 
> RDDB via cartesian to generate RDDC: each RDDA partition is read by multiple 
> RDDC partitions, and RDDB has the same problem.
> As a result, when an RDDC partition is computed, the corresponding partition's 
> data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, the parent partitions 
> are also recomputed repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19189) Optimize CartesianRDD to avoid parent RDD partition re-computation and re-serialization

2017-01-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19189:


Assignee: Apache Spark

> Optimize CartesianRDD to avoid parent RDD partition re-computation and 
> re-serialization
> ---
>
> Key: SPARK-19189
> URL: https://issues.apache.org/jira/browse/SPARK-19189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the current CartesianRDD implementation, suppose RDDA is combined with 
> RDDB via cartesian to generate RDDC: each RDDA partition is read by multiple 
> RDDC partitions, and RDDB has the same problem.
> As a result, when an RDDC partition is computed, the corresponding partition's 
> data in RDDA or RDDB is serialized (and transferred over the network) 
> repeatedly, and if RDDA or RDDB has not been persisted, the parent partitions 
> are also recomputed repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


