[jira] [Created] (SPARK-9114) The returned value is not converted into internal type in Python UDF
Davies Liu created SPARK-9114: - Summary: The returned value is not converted into internal type in Python UDF Key: SPARK-9114 URL: https://issues.apache.org/jira/browse/SPARK-9114 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker The returned value is not converted into internal type in Python UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6941) Provide a better error message to explain that tables created from RDDs are immutable
[ https://issues.apache.org/jira/browse/SPARK-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6941: Shepherd: Yin Huai Provide a better error message to explain that tables created from RDDs are immutable - Key: SPARK-6941 URL: https://issues.apache.org/jira/browse/SPARK-6941 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yijie Shen Priority: Blocker We should explicitly let users know that tables created from RDDs are immutable and new rows cannot be inserted into them. We can add a better error message and also explain it in the programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9082) Filter using non-deterministic expressions should not be pushed down
[ https://issues.apache.org/jira/browse/SPARK-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9082: Shepherd: Yin Huai Filter using non-deterministic expressions should not be pushed down Key: SPARK-9082 URL: https://issues.apache.org/jira/browse/SPARK-9082 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Wenchen Fan For example,
{code}
val df = sqlContext.range(1, 10).select($"id", rand(0).as('r))
df.as("a").join(df.filter($"r" < 0.5).as("b"), $"a.id" === $"b.id").explain(true)
{code}
The plan is
{code}
== Physical Plan ==
ShuffledHashJoin [id#55323L], [id#55327L], BuildRight
 Exchange (HashPartitioning 200)
  Project [id#55323L,Rand 0 AS r#55324]
   PhysicalRDD [id#55323L], MapPartitionsRDD[42268] at range at <console>:37
 Exchange (HashPartitioning 200)
  Project [id#55327L,Rand 0 AS r#55325]
   Filter (LessThan)
    PhysicalRDD [id#55327L], MapPartitionsRDD[42268] at range at <console>:37
{code}
The rand gets evaluated twice instead of once. This happens because, when we push down predicates, we replace the attribute reference in the predicate with the actual expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
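For illustration, a minimal, self-contained Scala sketch of the check this sub-task implies (hypothetical names throughout; this is not the actual Catalyst rule): a filter may only be pushed below a projection if, after substituting the projected expressions into its condition, the condition is still deterministic.
{code}
// Toy expression model; Expr, Attr, Rand, LessThan and canPushDown are all hypothetical names.
sealed trait Expr { def deterministic: Boolean }
case class Attr(name: String) extends Expr { val deterministic = true }
case class Literal(value: Double) extends Expr { val deterministic = true }
case class Rand(seed: Long) extends Expr { val deterministic = false }
case class LessThan(left: Expr, right: Expr) extends Expr {
  def deterministic: Boolean = left.deterministic && right.deterministic
}

// A projection maps output attribute names to the expressions that produce them.
// Pushing a filter below the projection substitutes those expressions into the
// filter condition; that is only safe if the result stays deterministic,
// otherwise the non-deterministic expression is evaluated once per use.
def canPushDown(cond: Expr, projection: Map[String, Expr]): Boolean = {
  def substitute(e: Expr): Expr = e match {
    case Attr(n)        => projection.getOrElse(n, e)
    case LessThan(l, r) => LessThan(substitute(l), substitute(r))
    case other          => other
  }
  substitute(cond).deterministic
}

// Mirrors the example above: r comes from rand(0), so "r < 0.5" must not be pushed down.
val projection = Map("id" -> Attr("id"), "r" -> Rand(0))
assert(!canPushDown(LessThan(Attr("r"), Literal(0.5)), projection))
assert(canPushDown(LessThan(Attr("id"), Literal(5.0)), projection))
{code}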
[jira] [Commented] (SPARK-9112) Implement LogisticRegressionSummary similar to LinearRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630053#comment-14630053 ] Manoj Kumar commented on SPARK-9112: Yes, that is the idea. Also, we need not port it to ML right now; we could convert the transformed dataframe to the required input type in mllib. Also it might be useful to return the probability for the predicted class (as done by predict_proba in scikit-learn). How does that sound? Implement LogisticRegressionSummary similar to LinearRegressionSummary -- Key: SPARK-9112 URL: https://issues.apache.org/jira/browse/SPARK-9112 Project: Spark Issue Type: New Feature Components: ML Reporter: Manoj Kumar Priority: Minor Since the API for LinearRegressionSummary has been merged, other models should follow suit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9102) Improve project collapse with nondeterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-9102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9102: Shepherd: Yin Huai Improve project collapse with nondeterministic expressions -- Key: SPARK-9102 URL: https://issues.apache.org/jira/browse/SPARK-9102 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9102) Improve project collapse with nondeterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-9102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9102: Assignee: Wenchen Fan Improve project collapse with nondeterministic expressions -- Key: SPARK-9102 URL: https://issues.apache.org/jira/browse/SPARK-9102 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9015) Maven cleanup / Clean Project Import in scala-ide
[ https://issues.apache.org/jira/browse/SPARK-9015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9015. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7375 [https://github.com/apache/spark/pull/7375] Maven cleanup / Clean Project Import in scala-ide - Key: SPARK-9015 URL: https://issues.apache.org/jira/browse/SPARK-9015 Project: Spark Issue Type: Improvement Components: Build Reporter: Jan Prach Priority: Minor Fix For: 1.5.0 Clean up Maven for a clean import in scala-ide / eclipse. The outstanding PR contains things like removal of the groovy plugin, and some more Maven cleanup goes here. In order to make it a seamless experience, two more things have to be merged upstream: 1) the IDE automatically generates Java sources from IDL - https://issues.apache.org/jira/browse/AVRO-1671 2) set the Scala version in the IDE based on the Maven config - https://github.com/sonatype/m2eclipse-scala/issues/30 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9015) Maven cleanup / Clean Project Import in scala-ide
[ https://issues.apache.org/jira/browse/SPARK-9015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9015: - Assignee: Jan Prach Maven cleanup / Clean Project Import in scala-ide - Key: SPARK-9015 URL: https://issues.apache.org/jira/browse/SPARK-9015 Project: Spark Issue Type: Improvement Components: Build Reporter: Jan Prach Assignee: Jan Prach Priority: Minor Fix For: 1.5.0 Clean up Maven for a clean import in scala-ide / eclipse. The outstanding PR contains things like removal of the groovy plugin, and some more Maven cleanup goes here. In order to make it a seamless experience, two more things have to be merged upstream: 1) the IDE automatically generates Java sources from IDL - https://issues.apache.org/jira/browse/AVRO-1671 2) set the Scala version in the IDE based on the Maven config - https://github.com/sonatype/m2eclipse-scala/issues/30 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6217) insertInto doesn't work in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-6217. -- Assignee: Wenchen Fan insertInto doesn't work in PySpark -- Key: SPARK-6217 URL: https://issues.apache.org/jira/browse/SPARK-6217 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Environment: Mac OS X Yosemite 10.10.2 Python 2.7.9 Spark 1.3.0 Reporter: Charles Cloud Assignee: Wenchen Fan The following code, running in an IPython shell throws an error: {code:none} In [1]: from pyspark import SparkContext, HiveContext In [2]: sc = SparkContext('local[*]', 'test') Spark assembly has been built with Hive, including Datanucleus jars on classpath In [3]: sql = HiveContext(sc) In [4]: import pandas as pd In [5]: df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [1, 2, 3], 'c': list('abc')}) In [6]: df2 = pd.DataFrame({'a': [2.0, 3.0, 4.0], 'b': [4, 5, 6], 'c': list('def')}) In [7]: sdf = sql.createDataFrame(df) In [8]: sdf2 = sql.createDataFrame(df2) In [9]: sql.registerDataFrameAsTable(sdf, 'sdf') In [10]: sql.registerDataFrameAsTable(sdf2, 'sdf2') In [11]: sql.cacheTable('sdf') In [12]: sql.cacheTable('sdf2') In [13]: sdf2.insertInto('sdf') # throws an error {code} Here's the Java traceback: {code:none} Py4JJavaError: An error occurred while calling o270.insertInto. : java.lang.AssertionError: assertion failed: No plan for InsertIntoTable (LogicalRDD [a#0,b#1L,c#2], MapPartitionsRDD[13] at mapPartitions at SQLContext.scala:1167), Map(), false InMemoryRelation [a#6,b#7L,c#8], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [a#6,b#7L,c#8], MapPartitionsRDD[41] at mapPartitions at SQLContext.scala:1167), Some(sdf2) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:1085) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:1083) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:1089) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at org.apache.spark.sql.DataFrame.insertInto(DataFrame.scala:1134) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} I'd be ecstatic if this was my own fault, and I'm somehow using it incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6941) Provide a better error message to explain that tables created from RDDs are immutable
[ https://issues.apache.org/jira/browse/SPARK-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-6941. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7342 [https://github.com/apache/spark/pull/7342] Provide a better error message to explain that tables created from RDDs are immutable - Key: SPARK-6941 URL: https://issues.apache.org/jira/browse/SPARK-6941 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yijie Shen Priority: Blocker Fix For: 1.5.0 We should explicitly let users know that tables created from RDDs are immutable and new rows cannot be inserted into them. We can add a better error message and also explain it in the programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8682) Range Join for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-8682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630449#comment-14630449 ] Herman van Hovell edited comment on SPARK-8682 at 7/16/15 10:31 PM: I have attached some performance testing code. In this setup RangeJoin is 13-50 times faster than the Cartesian/Filter combination. However the performance profile is a bit unexpected. The fewer records in the broadcasted side, the faster it is. This is the opposite of my expectation, because RangeJoin should have a bigger advantage when the number of broadcasted rows is larger. I am looking into this. was (Author: hvanhovell): Some Performance Testing code. Range Join for Spark SQL Key: SPARK-8682 URL: https://issues.apache.org/jira/browse/SPARK-8682 Project: Spark Issue Type: Improvement Components: SQL Reporter: Herman van Hovell Attachments: perf_testing.scala Currently Spark SQL uses a Broadcast Nested Loop join (or a filtered Cartesian Join) when it has to execute the following range query:
{noformat}
SELECT A.*, B.*
FROM tableA A
JOIN tableB B
  ON A.start <= B.end AND A.end >= B.start
{noformat}
This is horribly inefficient. The performance of this query can be greatly improved, when one of the tables can be broadcasted, by creating a range index. A range index is basically a sorted map containing the rows of the smaller table, indexed by both the high and low keys. Using this structure, the complexity of the query would go from O(N * M) to O(N * 2 * LOG(M)), N = number of records in the larger table, M = number of records in the smaller (indexed) table. I have created a pull request for this. According to the [Spark SQL: Relational Data Processing in Spark|http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf] paper, similar work (page 11, section 7.2) has already been done by the ADAM project (cannot locate the code though). Any comments and/or feedback are greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
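For readers unfamiliar with the proposal, here is a small, self-contained Scala sketch of a range index over the broadcastable side (hypothetical types; this is not the attached perf_testing.scala, and for brevity it only prunes on the low key rather than indexing both keys as the ticket describes):
{code}
// Hypothetical interval row: [low, high] plus a payload.
case class Interval[T](low: Long, high: Long, payload: T)

// Rows of the small (broadcastable) table, kept sorted by their low endpoint.
class RangeIndex[T](rows: Seq[Interval[T]]) {
  private val sortedByLow: Vector[Interval[T]] = rows.sortBy(_.low).toVector

  // All indexed intervals that overlap [qLow, qHigh]: low <= qHigh and high >= qLow.
  def overlapping(qLow: Long, qHigh: Long): Iterator[Interval[T]] =
    sortedByLow.iterator
      .takeWhile(_.low <= qHigh)   // sorted order lets us stop early on the low key
      .filter(_.high >= qLow)      // the full proposal also indexes the high key
}

val index = new RangeIndex(Seq(Interval(0L, 10L, "a"), Interval(20L, 30L, "b"), Interval(40L, 50L, "c")))
index.overlapping(5L, 25L).foreach(println)   // prints the "a" and "b" intervals only
{code}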
[jira] [Commented] (SPARK-9119) In some cases, we may save wrong decimal values to parquet
[ https://issues.apache.org/jira/browse/SPARK-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630556#comment-14630556 ] Yin Huai commented on SPARK-9119: - Actually, the impact of this issue is that whenever we store decimal values using their unscaled values and the type's scale information, we may store the wrong value. In some cases, we may save wrong decimal values to parquet -- Key: SPARK-9119 URL: https://issues.apache.org/jira/browse/SPARK-9119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType,DecimalType}
import org.apache.spark.sql.types.Decimal

val schema = StructType(Array(StructField("name", DecimalType(10, 5), false)))
val rowRDD = sc.parallelize(Array(Row(Decimal(67123.45))))
val df = sqlContext.createDataFrame(rowRDD, schema)
df.registerTempTable("test")
df.show()
// +--------+
// |    name|
// +--------+
// |67123.45|
// +--------+

sqlContext.sql("create table testDecimal as select * from test")
sqlContext.table("testDecimal").show()
// +--------+
// |    name|
// +--------+
// |67.12345|
// +--------+
{code}
The problem is that when we do conversions, we do not use the precision/scale info in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
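To see why the scale matters, here is a hedged illustration using plain java.math.BigDecimal (not Spark's internal conversion code): the same unscaled long denotes very different values when the writer and reader disagree on the scale, which is exactly the 67123.45 vs 67.12345 mismatch shown above.
{code}
import java.math.BigDecimal

// A decimal is stored as an unscaled integer plus a scale: value = unscaled * 10^(-scale).
val unscaled = 6712345L

// Interpreted with the value's own scale (2), the unscaled long means 67123.45.
val writtenAs = BigDecimal.valueOf(unscaled, 2)
println(writtenAs)   // 67123.45

// Interpreted with the column's declared scale (5), the same long means 67.12345.
val readBackAs = BigDecimal.valueOf(unscaled, 5)
println(readBackAs)  // 67.12345
{code}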
[jira] [Updated] (SPARK-4134) Dynamic allocation: tone down scary executor lost messages when killing on purpose
[ https://issues.apache.org/jira/browse/SPARK-4134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4134: - Summary: Dynamic allocation: tone down scary executor lost messages when killing on purpose (was: Tone down scary executor lost messages when killing on purpose) Dynamic allocation: tone down scary executor lost messages when killing on purpose -- Key: SPARK-4134 URL: https://issues.apache.org/jira/browse/SPARK-4134 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or After SPARK-3822 goes in, we are now able to dynamically kill executors after an application has started. However, when we do that we get a ton of scary error messages telling us that we've done something wrong somehow. It would be good to detect when this is the case and prevent these messages from surfacing. This may be difficult, however, because the connection manager tends to be quite verbose in unconditionally logging disconnection messages. This is a very nice-to-have for 1.2 but certainly not a blocker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9116) Class in __main__ cannot be serialized by PySpark
Davies Liu created SPARK-9116: - Summary: Class in __main__ cannot be serialized by PySpark Key: SPARK-9116 URL: https://issues.apache.org/jira/browse/SPARK-9116 Project: Spark Issue Type: Bug Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Priority: Critical It's bad that we could not support classes defined in __main__. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9113) remove unnecessary analysis check code for self join
[ https://issues.apache.org/jira/browse/SPARK-9113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9113: Shepherd: Michael Armbrust Assignee: Wenchen Fan remove unnecessary analysis check code for self join Key: SPARK-9113 URL: https://issues.apache.org/jira/browse/SPARK-9113 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent
[ https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630307#comment-14630307 ] Joseph K. Bradley commented on SPARK-9073: -- Yes, thanks! I'll look at the PR as soon as I can. spark.ml Models copy() should call setParent when there is a parent --- Key: SPARK-9073 URL: https://issues.apache.org/jira/browse/SPARK-9073 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Priority: Minor Examples with this mistake include: * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119] * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220] Whomever writes a PR for this JIRA should check all spark.ml Model's copy() methods and set copy's {{Model.parent}} when available. Also verify in unit tests (possibly in a standard method checking Models to share code). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
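The pattern being asked for, as a simplified, self-contained Scala sketch (stand-in classes, not the real spark.ml Model/Estimator API): whatever else copy() does, it must carry the parent over when one has been set.
{code}
// Simplified stand-ins for the spark.ml classes; only the parent-handling pattern matters here.
class Estimator(val name: String)

class Model(val uid: String) {
  private var parentOpt: Option[Estimator] = None
  def setParent(p: Estimator): this.type = { parentOpt = Some(p); this }
  def hasParent: Boolean = parentOpt.isDefined
  def parent: Estimator = parentOpt.get

  // The point of the JIRA: the copy must preserve the parent when one is set.
  def copy(): Model = {
    val copied = new Model(uid)
    if (hasParent) copied.setParent(parent)
    copied
  }
}

val estimator = new Estimator("dtc")
val model = new Model("model_1").setParent(estimator)
assert(model.copy().hasParent)   // the kind of check a shared unit-test helper could perform
{code}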
[jira] [Updated] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent
[ https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9073: - Shepherd: Joseph K. Bradley spark.ml Models copy() should call setParent when there is a parent --- Key: SPARK-9073 URL: https://issues.apache.org/jira/browse/SPARK-9073 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Assignee: Kai Sasaki Priority: Minor Examples with this mistake include: * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119] * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220] Whomever writes a PR for this JIRA should check all spark.ml Model's copy() methods and set copy's {{Model.parent}} when available. Also verify in unit tests (possibly in a standard method checking Models to share code). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8807) Add between operator in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8807. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7356 [https://github.com/apache/spark/pull/7356] Add between operator in SparkR -- Key: SPARK-8807 URL: https://issues.apache.org/jira/browse/SPARK-8807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Yu Ishikawa Fix For: 1.5.0 Add between operator in SparkR ``` df$age between c(1, 2) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
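For reference, the Scala DataFrame API exposes a similar operator as Column.between; a hedged sketch of the equivalent call (assuming a DataFrame df with an age column):
{code}
// Scala analogue of the proposed SparkR `df$age between c(1, 2)`.
val inRange = df.filter(df("age").between(1, 2))
inRange.show()
{code}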
[jira] [Updated] (SPARK-8807) Add between operator in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8807: - Assignee: Liang-Chi Hsieh Add between operator in SparkR -- Key: SPARK-8807 URL: https://issues.apache.org/jira/browse/SPARK-8807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Yu Ishikawa Assignee: Liang-Chi Hsieh Fix For: 1.5.0 Add between operator in SparkR ``` df$age between c(1, 2) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8972) Incorrect result for rollup
[ https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8972: Assignee: Cheng Hao Incorrect result for rollup --- Key: SPARK-8972 URL: https://issues.apache.org/jira/browse/SPARK-8972 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Critical Fix For: 1.5.0
{code:java}
import sqlContext.implicits._
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("foo")
sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key%100 with rollup").show(100)
// output
+---+---+------------+
|cnt|_c1|GROUPING__ID|
+---+---+------------+
|  1|  4|           0|
|  1|  4|           1|
|  1|  5|           0|
|  1|  5|           1|
|  1|  1|           0|
|  1|  1|           1|
|  1|  2|           0|
|  1|  2|           1|
|  1|  3|           0|
|  1|  3|           1|
+---+---+------------+
{code}
After checking the code, it seems we don't support complex expressions (not just simple column names) as GROUP BY keys for rollup, nor for cube. And it will not even report an error if we have a complex expression in the rollup keys, hence we get the very confusing result shown in the example above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8972) Incorrect result for rollup
[ https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-8972. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7343 [https://github.com/apache/spark/pull/7343] Incorrect result for rollup --- Key: SPARK-8972 URL: https://issues.apache.org/jira/browse/SPARK-8972 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Critical Fix For: 1.5.0
{code:java}
import sqlContext.implicits._
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("foo")
sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key%100 with rollup").show(100)
// output
+---+---+------------+
|cnt|_c1|GROUPING__ID|
+---+---+------------+
|  1|  4|           0|
|  1|  4|           1|
|  1|  5|           0|
|  1|  5|           1|
|  1|  1|           0|
|  1|  1|           1|
|  1|  2|           0|
|  1|  2|           1|
|  1|  3|           0|
|  1|  3|           1|
+---+---+------------+
{code}
After checking the code, it seems we don't support complex expressions (not just simple column names) as GROUP BY keys for rollup, nor for cube. And it will not even report an error if we have a complex expression in the rollup keys, hence we get the very confusing result shown in the example above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
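Until complex grouping expressions are supported for rollup, one workaround is to project the expression to a named column first and roll up on that simple name. A hedged sketch, reusing the df and sqlContext from the report (GROUPING__ID assumes a HiveContext-backed sqlContext; foo_mod and key_mod are hypothetical names):
{code}
// Pre-compute the grouping expression as a plain column, then roll up on the simple name.
val modDf = df.selectExpr("key % 100 as key_mod", "value")
modDf.registerTempTable("foo_mod")
sqlContext.sql(
  "select count(*) as cnt, key_mod, GROUPING__ID from foo_mod group by key_mod with rollup"
).show(100)
{code}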
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409 ] Lianhui Wang commented on SPARK-8646: - Yes, when I use this command: ./bin/spark-submit ./pi.py yarn-client 10, the YARN client does not upload pyspark.zip, so it cannot work. I have submitted a PR that resolves this problem based on the master branch. PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs results in a "no module named pyspark" error when run in yarn-client mode in Spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9091) Add the codec interface to DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9091. -- Resolution: Invalid [~carlmartin] I think you're familiar with https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -- please don't open a JIRA until you can fill it out correctly. This isn't a Bug, Major, can't affect 1.5, has no detail. Add the codec interface to DStream. --- Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Add description later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9093) Fix single-quotes strings in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629477#comment-14629477 ] Apache Spark commented on SPARK-9093: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/7439 Fix single-quotes strings in SparkR --- Key: SPARK-9093 URL: https://issues.apache.org/jira/browse/SPARK-9093 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa We should get rid of the warnings about using single-quotes like that. {noformat} inst/tests/test_sparkSQL.R:60:28: style: Only use double-quotes. list(type = 'array', elementType = integer, containsNull = TRUE)) ^~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629483#comment-14629483 ] SaintBacchus commented on SPARK-9091: - [~srowen] Sorry for forgetting to change the type and add the description immediately; it's a small improvement in DStream. I will reopen it after I have the PR. Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile*, which can use a *CompressionCodec* to compress the data, it's better to add a similar interface in DStream -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gisle Ytrestøl updated SPARK-9096: -- Description: When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. I've reproduced this bug in the attached Java file, which I submit with spark-submit. The logs for 1.3.1 and 1.4.1 are attached. In 1.4.1, we see that a few tasks in the count job takes a lot of time: 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID 4659) in 708 ms on 148.251.190.217 (1597/1600) 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID 4786) in 772 ms on 148.251.190.217 (1598/1600) 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID 4582) in 275019 ms on 148.251.190.217 (1599/1600) 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID 4430) in 407020 ms on 148.251.190.217 (1600/1600) 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at ReproduceBug.java:56) finished in 420.024 s 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at ReproduceBug.java:56, took 442.941395 s In comparison, all tasks are more or less equal in size when running the same application in Spark 1.3.1. In overall, this attached application (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1, and completes in roughly 30 seconds in Spark 1.3.1. Spark 1.4.0 behaves similar to Spark 1.4.1 wrt this issue. was: When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Attachments: ReproduceBug.java, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. I've reproduced this bug in the attached Java file, which I submit with spark-submit. The logs for 1.3.1 and 1.4.1 are attached. 
In 1.4.1, we see that a few tasks in the count job take a lot of time: 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID 4659) in 708 ms on 148.251.190.217 (1597/1600) 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID 4786) in 772 ms on 148.251.190.217 (1598/1600) 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID 4582) in 275019 ms on 148.251.190.217 (1599/1600) 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID 4430) in 407020 ms on 148.251.190.217 (1600/1600) 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at ReproduceBug.java:56) finished in 420.024 s 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at ReproduceBug.java:56, took 442.941395 s In comparison, all tasks are more or less equal in size when running the same application in Spark 1.3.1. Overall, the attached application (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1, and completes in roughly 30 seconds in Spark 1.3.1. Spark 1.4.0 behaves similarly to Spark 1.4.1 with respect to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9097) Tasks are not completed but the number of executor is zero
KaiXinXIaoLei created SPARK-9097: Summary: Tasks are not completed but the number of executor is zero Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Fix For: 1.5.0 I set the value of spark.dynamicAllocation.enabled to true and submit tasks to run. The tasks are not completed, but the number of executors is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9096: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) I am not sure it is a bug, yet. It's worth explaining the difference though, but we need to rule out environment factors, and know more about the cause. Can you say more about why the data is not evenly distributed? it looks like it should be in your sample. Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Priority: Minor Attachments: ReproduceBug.java, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. I've reproduced this bug in the attached Java file, which I submit with spark-submit. The logs for 1.3.1 and 1.4.1 are attached. In 1.4.1, we see that a few tasks in the count job takes a lot of time: 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID 4659) in 708 ms on 148.251.190.217 (1597/1600) 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID 4786) in 772 ms on 148.251.190.217 (1598/1600) 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID 4582) in 275019 ms on 148.251.190.217 (1599/1600) 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID 4430) in 407020 ms on 148.251.190.217 (1600/1600) 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at ReproduceBug.java:56) finished in 420.024 s 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at ReproduceBug.java:56, took 442.941395 s In comparison, all tasks are more or less equal in size when running the same application in Spark 1.3.1. In overall, this attached application (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1, and completes in roughly 30 seconds in Spark 1.3.1. Spark 1.4.0 behaves similar to Spark 1.4.1 wrt this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
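While the cause is being investigated, a possible mitigation (shown in Scala, although the report uses the Java API; bigRdd and toRemove are hypothetical) is to repartition the result of subtract() so the skewed partitions are rebalanced before the expensive downstream job:
{code}
// subtract() can leave a few partitions holding most of the data (as reported above);
// an explicit repartition spreads the records out again before the next stage.
val diff = bigRdd.subtract(toRemove)
val rebalanced = diff.repartition(diff.partitions.length)
println(rebalanced.count())
{code}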
[jira] [Commented] (SPARK-9093) Fix single-quotes strings in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629474#comment-14629474 ] Yu Ishikawa commented on SPARK-9093: I'm working this issue. Fix single-quotes strings in SparkR --- Key: SPARK-9093 URL: https://issues.apache.org/jira/browse/SPARK-9093 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa We should get rid of the warnings about using single-quotes like that. {noformat} inst/tests/test_sparkSQL.R:60:28: style: Only use double-quotes. list(type = 'array', elementType = integer, containsNull = TRUE)) ^~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9091: - Affects Version/s: (was: 1.5.0) Much better, though it can't affect 1.5.0 as this version does not exist yet. I think TD's opinion is that DStream's methods are mostly redundant, since they are just operations you can access with DStream.foreachRDD and then calling any RDD methods you like. So I think this will not be worth adding to DStream's API. Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile*, which can use a *CompressionCodec* to compress the data, it's better to add a similar interface in DStream. In some IO-bottleneck scenarios, it's very useful for users to have this interface in DStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
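As a concrete version of that suggestion, a hedged sketch (assuming a DStream[String] named dstream and an arbitrary output path): the existing RDD API already takes a compression codec, so foreachRDD covers the use case without a new DStream method.
{code}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// Write each batch with the existing RDD.saveAsTextFile(path, codec) overload.
dstream.foreachRDD { (rdd: RDD[String], time: Time) =>
  rdd.saveAsTextFile(s"/output/batch-${time.milliseconds}", classOf[GzipCodec])
}
{code}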
[jira] [Commented] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent
[ https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629527#comment-14629527 ] Kai Sasaki commented on SPARK-9073: --- [~josephkb] Hi, if possible, can I work on this JIRA? Thank you. spark.ml Models copy() should call setParent when there is a parent --- Key: SPARK-9073 URL: https://issues.apache.org/jira/browse/SPARK-9073 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Priority: Minor Examples with this mistake include: * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119] * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220] Whomever writes a PR for this JIRA should check all spark.ml Model's copy() methods and set copy's {{Model.parent}} when available. Also verify in unit tests (possibly in a standard method checking Models to share code). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9095) Removes old Parquet support code
Cheng Lian created SPARK-9095: - Summary: Removes old Parquet support code Key: SPARK-9095 URL: https://issues.apache.org/jira/browse/SPARK-9095 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian As the new Parquet external data source matures, we should remove the old Parquet support now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gisle Ytrestøl updated SPARK-9096: -- Description: When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. was:When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Attachments: ReproduceBug.java, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gisle Ytrestøl updated SPARK-9096: -- Attachment: reproduce.1.4.1.log.gz reproduce.1.3.1.log.gz ReproduceBug.java Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Attachments: ReproduceBug.java, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
Gisle Ytrestøl created SPARK-9096: - Summary: Unevenly distributed task loads after using JavaRDD.subtract() Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Attachments: ReproduceBug.java, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the following operations on the new JavaRDD which is created by subtract. The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks will take a long time to finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9095) Removes old Parquet support code
[ https://issues.apache.org/jira/browse/SPARK-9095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9095: --- Assignee: Cheng Lian (was: Apache Spark) Removes old Parquet support code Key: SPARK-9095 URL: https://issues.apache.org/jira/browse/SPARK-9095 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian As the new Parquet external data source matures, we should remove the old Parquet support now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9093) Fix single-quotes strings in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9093: --- Assignee: Apache Spark Fix single-quotes strings in SparkR --- Key: SPARK-9093 URL: https://issues.apache.org/jira/browse/SPARK-9093 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Apache Spark We should get rid of the warnings about using single-quotes like that. {noformat} inst/tests/test_sparkSQL.R:60:28: style: Only use double-quotes. list(type = 'array', elementType = integer, containsNull = TRUE)) ^~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Description: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream (was: Add description later.) Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9093) Fix single-quotes strings in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9093: --- Assignee: (was: Apache Spark) Fix single-quotes strings in SparkR --- Key: SPARK-9093 URL: https://issues.apache.org/jira/browse/SPARK-9093 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa We should get rid of the warnings about using single-quotes like that. {noformat} inst/tests/test_sparkSQL.R:60:28: style: Only use double-quotes. list(type = 'array', elementType = integer, containsNull = TRUE)) ^~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9067) Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
[ https://issues.apache.org/jira/browse/SPARK-9067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629482#comment-14629482 ] Liang-Chi Hsieh commented on SPARK-9067: Thanks for reporting that. I updated the PR. Besides calling close(), I also release the reader now. Can you check if it can solve this problem? Thanks. Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD - Key: SPARK-9067 URL: https://issues.apache.org/jira/browse/SPARK-9067 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.3.0, 1.4.0 Environment: Target system: Linux, 16 cores, 400Gb RAM Spark is started locally using the following command: {{ spark-submit --master local[16] --driver-memory 64G --executor-cores 16 --num-executors 1 --executor-memory 64G }} Reporter: konstantin knizhnik If a coalesce transformation with a small number of output partitions (in my case 16) is applied to a large Parquet file (mine has about 150Gb with 215k partitions), then it causes OutOfMemory exceptions (250Gb is not enough) and open file limit exhaustion (with the limit set to 8k). The source of the problem is in the SqlNewHadoopRDD.compute method:
{quote}
val reader = format.createRecordReader( split.serializableHadoopSplit.value, hadoopAttemptContext)
reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext)
// Register an on-task-completion callback to close the input stream.
context.addTaskCompletionListener(context => close())
{quote}
The created Parquet file reader is intended to be closed at task completion time. This reader contains a lot of references to parquet.bytes.BytesInput objects, which in turn contain references to large byte arrays (some of them are several megabytes). Since, in the case of CoalescedRDD, a task is completed only after processing a large number of parquet files, this causes file handle exhaustion and memory overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
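The shape of the fix being discussed, as a minimal, self-contained Scala sketch (hypothetical names; not the actual SqlNewHadoopRDD change): wrap each split's iterator so its reader is released as soon as that split is exhausted, instead of holding every reader open until task completion.
{code}
// Closes the underlying resource as soon as the wrapped iterator is exhausted,
// rather than waiting for the task-completion callback.
class ClosingIterator[A](underlying: Iterator[A], close: () => Unit) extends Iterator[A] {
  private var closed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !closed) { close(); closed = true }
    more
  }

  override def next(): A = underlying.next()
}

// Hypothetical usage: one reader per split, released immediately when drained.
def readSplit(path: String): Iterator[String] = {
  val source = scala.io.Source.fromFile(path)                  // stand-in for a Parquet reader
  new ClosingIterator(source.getLines(), () => source.close())
}
{code}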
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Description: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to have this interface in DStream. was: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to had this interface. Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to have this interface in DStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9095) Removes old Parquet support code
[ https://issues.apache.org/jira/browse/SPARK-9095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629546#comment-14629546 ] Apache Spark commented on SPARK-9095: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/7441 Removes old Parquet support code Key: SPARK-9095 URL: https://issues.apache.org/jira/browse/SPARK-9095 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian As the new Parquet external data source matures, we should remove the old Parquet support now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9095) Removes old Parquet support code
[ https://issues.apache.org/jira/browse/SPARK-9095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9095: --- Assignee: Apache Spark (was: Cheng Lian) Removes old Parquet support code Key: SPARK-9095 URL: https://issues.apache.org/jira/browse/SPARK-9095 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Apache Spark As the new Parquet external data source matures, we should remove the old Parquet support now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9093) Fix single-quotes strings in SparkR
Yu Ishikawa created SPARK-9093: -- Summary: Fix single-quotes strings in SparkR Key: SPARK-9093 URL: https://issues.apache.org/jira/browse/SPARK-9093 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa We should get rid of the warnings about using single-quotes like that. {noformat} inst/tests/test_sparkSQL.R:60:28: style: Only use double-quotes. list(type = 'array', elementType = integer, containsNull = TRUE)) ^~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Summary: Add the codec interface to Text DStream. (was: Add the codec interface to DStream.) Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Add description later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Priority: Minor (was: Major) Add the codec interface to DStream. --- Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Add description later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Issue Type: Improvement (was: Bug) Add the codec interface to DStream. --- Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Add description later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Description: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to had this interface. was: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to had this interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-9091: Description: Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very was:Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.0 Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9094) Increase io.dropwizard.metrics dependency to 3.1.2
Carl Anders Düvel created SPARK-9094: Summary: Increase io.dropwizard.metrics dependency to 3.1.2 Key: SPARK-9094 URL: https://issues.apache.org/jira/browse/SPARK-9094 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Carl Anders Düvel Priority: Minor This change is described in pull request: https://github.com/apache/spark/pull/7422 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9052) Fix comments after curly braces
[ https://issues.apache.org/jira/browse/SPARK-9052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629510#comment-14629510 ] Apache Spark commented on SPARK-9052: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/7440 Fix comments after curly braces --- Key: SPARK-9052 URL: https://issues.apache.org/jira/browse/SPARK-9052 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Right now we have a number of style check errors of the form {code} Opening curly braces should never go on their own line and should always and be followed by a new line. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9052) Fix comments after curly braces
[ https://issues.apache.org/jira/browse/SPARK-9052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9052: --- Assignee: (was: Apache Spark) Fix comments after curly braces --- Key: SPARK-9052 URL: https://issues.apache.org/jira/browse/SPARK-9052 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Right now we have a number of style check errors of the form {code} Opening curly braces should never go on their own line and should always and be followed by a new line. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9052) Fix comments after curly braces
[ https://issues.apache.org/jira/browse/SPARK-9052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9052: --- Assignee: Apache Spark Fix comments after curly braces --- Key: SPARK-9052 URL: https://issues.apache.org/jira/browse/SPARK-9052 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Apache Spark Right now we have a number of style check errors of the form {code} Opening curly braces should never go on their own line and should always and be followed by a new line. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9097: - Fix Version/s: (was: 1.5.0) [~KaiXinXIaoLei] Again please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Don't set Fix Version. You need to provide more information. It's normal for a short time for tasks to be waiting, before an executor spins up. You need to clarify exactly how you are running this and what you observe or else this is not a helpful JIRA. Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei I set the value of spark.dynamicAllocation.enabled is true. I submit tasks to run. Tasks are not completed, but the number of executor is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9094) Increase io.dropwizard.metrics dependency to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-9094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629558#comment-14629558 ] Sean Owen commented on SPARK-9094: -- Yes, you also need to update your PR title. The request was as well to describe any potentially incompatible changes (if any) and why the update is needed. Increase io.dropwizard.metrics dependency to 3.1.2 -- Key: SPARK-9094 URL: https://issues.apache.org/jira/browse/SPARK-9094 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Carl Anders Düvel Priority: Minor This change is described in pull request: https://github.com/apache/spark/pull/7422 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8807) Add between operator in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629391#comment-14629391 ] Yu Ishikawa commented on SPARK-8807: [~yalamart] Sorry for the delay of my reply. And great work! Thanks! Add between operator in SparkR -- Key: SPARK-8807 URL: https://issues.apache.org/jira/browse/SPARK-8807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Yu Ishikawa Assignee: Liang-Chi Hsieh Fix For: 1.5.0 Add between operator in SparkR ``` df$age between c(1, 2) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
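For reference, the Scala/Python DataFrame API already exposes an equivalent predicate; the SparkR work in this ticket mirrors it. A minimal PySpark sketch, assuming {{Column.between}} (available in recent releases) and an existing SparkContext {{sc}}; the column and data are illustrative:
{code}
# Illustrative PySpark equivalent of the SparkR `between` being added here.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`
df = sqlContext.createDataFrame([(1,), (2,), (5,)], ['age'])

# Explicit form, works on any DataFrame-era release:
df.filter((df.age >= 1) & (df.age <= 2)).show()

# Shortcut form, assuming Column.between is available:
df.filter(df.age.between(1, 2)).show()
{code}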
[jira] [Updated] (SPARK-9092) Make --num-executors compatible with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9092: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Make --num-executors compatible with dynamic allocation --- Key: SPARK-9092 URL: https://issues.apache.org/jira/browse/SPARK-9092 Project: Spark Issue Type: Improvement Components: YARN Reporter: Niranjan Padmanabhan Priority: Minor Currently when you enable dynamic allocation, you can't use --num-executors or the property spark.executor.instances. If we are to enable dynamic allocation by default, we should make these work so that existing workloads don't fail -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
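For context, a minimal sketch of the configuration combination this ticket is about; the property names are the standard ones, and per the ticket the fixed executor count currently cannot be combined with dynamic allocation:
{code}
# Sketch of the conflicting settings described above (values are illustrative).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("dynamic-allocation-example")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")  # dynamic allocation needs the external shuffle service
        # Per this ticket, the following currently cannot be used together
        # with dynamic allocation; the proposal is to make it work:
        .set("spark.executor.instances", "4"))

sc = SparkContext(conf=conf)
{code}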
[jira] [Commented] (SPARK-6442) MLlib Local Linear Algebra Package
[ https://issues.apache.org/jira/browse/SPARK-6442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629320#comment-14629320 ] Sean Owen commented on SPARK-6442: -- [~mengxr] for Commons Math, and point #2: actually they decided to un-deprecate the sparse implementations in 3.3 onwards, and keep supporting them: http://commons.apache.org/proper/commons-math/changes-report.html I think it's a good option. But I also am not sure why *Spark* has to decide this for users. Spark can do whatever it likes internally; apps can do whatever they like externally; both can and should use a library. From an API perspective, all that's needed is a representation of the data that thunks easily into other libraries, rather than provide a library of functions again. MLlib Local Linear Algebra Package -- Key: SPARK-6442 URL: https://issues.apache.org/jira/browse/SPARK-6442 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Burak Yavuz Priority: Critical MLlib's local linear algebra package doesn't have any support for any type of matrix operations. With 1.5, we wish to add support to a complete package of optimized linear algebra operations for Scala/Java users. The main goal is to support lazy operations so that element-wise can be implemented in a single for-loop, and complex operations can be interfaced through BLAS. The design doc: http://goo.gl/sf5LCE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8646: --- Assignee: Apache Spark PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Assignee: Apache Spark Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs result in a no module named pyspark when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (ie comands the use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8646: --- Assignee: (was: Apache Spark) PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs result in a no module named pyspark when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (ie comands the use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629407#comment-14629407 ] Apache Spark commented on SPARK-8646: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/7438 PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs result in a no module named pyspark when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (ie comands the use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9019) spark-submit fails on yarn with kerberos enabled
[ https://issues.apache.org/jira/browse/SPARK-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629432#comment-14629432 ] Bolke de Bruin edited comment on SPARK-9019 at 7/16/15 8:39 AM: I tried running this on an updated environment where YARN-3103 was fixed, however it still fails although behavior is a bit different now. The task is now being accepted but stays in the running state forever without executing anything. Please note that the trace below is without key tab usage, but with an authorized user (kinit admin/admin) 15/07/16 04:27:34 DEBUG Client: getting client out of cache: org.apache.hadoop.ipc.Client@53abb73 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AkkaMessage(ReviveOffers,false) from Actor[akka://sparkDriver/deadLetters] 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(ReviveOffers,false) 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (1.632126 ms) AkkaMessage(ReviveOffers,false) from Actor[akka://sparkDriver/deadLetters] 15/07/16 04:27:34 DEBUG AbstractService: Service org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl is started 15/07/16 04:27:34 DEBUG AbstractService: Service org.apache.hadoop.yarn.client.api.impl.YarnClientImpl is started 15/07/16 04:27:34 DEBUG Client: The ping interval is 6 ms. 15/07/16 04:27:34 DEBUG Client: Connecting to node6.local/10.79.10.6:8050 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedAction as:admin (auth:SIMPLE) from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717) 15/07/16 04:27:34 DEBUG SaslRpcClient: Sending sasl message state: NEGOTIATE 15/07/16 04:27:34 DEBUG SaslRpcClient: Received SASL message state: NEGOTIATE auths { method: TOKEN mechanism: DIGEST-MD5 protocol: serverId: default challenge: realm=\default\,nonce=\wjgFp9L22uDJt41FNtY9M8CP/T+dswfBoF48r9+s\,qop=\auth\,charset=utf-8,algorithm=md5-sess } auths { method: KERBEROS mechanism: GSSAPI protocol: rm serverId: node6.local } 15/07/16 04:27:34 DEBUG SaslRpcClient: Get token info proto:interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB info:org.apache.hadoop.yarn.security.client.ClientRMSecurityInfo$2@69990fa7 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Looking for a token with service 10.79.10.6:8050 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is YARN_AM_RM_TOKEN and the token's service name is 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is HIVE_DELEGATION_TOKEN and the token's service name is 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is TIMELINE_DELEGATION_TOKEN and the token's service name is 10.79.10.6:8188 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is HDFS_DELEGATION_TOKEN and the token's service name is 10.79.10.4:8020 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedActionException as:admin (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedAction as:admin (auth:SIMPLE) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643) 15/07/16 04:27:34 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG UserGroupInformation: 
PrivilegedActionException as:admin (auth:SIMPLE) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG Client: closing ipc connection to node6.local/10.79.10.6:8050: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at
[jira] [Updated] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8893: - Assignee: Daniel Darabos Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Daniel Darabos Assignee: Daniel Darabos Priority: Trivial Fix For: 1.5.0 What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8893. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7285 [https://github.com/apache/spark/pull/7285] Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial Fix For: 1.5.0 What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
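The fix (pull request 7285) adds the validation inside Spark; callers on earlier releases can guard against the reporter's scenario themselves. A minimal sketch, where {{a}} and {{b}} are hypothetical inputs whose integer division can round down to zero:
{code}
# Hypothetical caller-side guard: integer division can silently produce a
# non-positive partition count, which today yields an empty result.
def safe_repartition(rdd, a, b):
    p = a // b  # integer division; may round down to 0
    if p < 1:
        raise ValueError("partition count must be positive, got %d" % p)
    return rdd.repartition(p)

# rdd = sc.parallelize(range(1, 4))
# safe_repartition(rdd, 3, 10)  # raises instead of silently returning []
{code}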
[jira] [Commented] (SPARK-9067) Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
[ https://issues.apache.org/jira/browse/SPARK-9067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629399#comment-14629399 ] konstantin knizhnik commented on SPARK-9067: I have found workaround for the problem: substitute task context with fake context. I have implemented CombineRDD as replacement of CoalescedRDD and create separate task context for processing each partition: {quote} class CombineRDD[T: ClassTag](prev: RDD[T], maxPartitions: Int) extends RDD[T](prev) { val inputPartitions = prev.partitions class CombineIterator(partitions: Array[Partition], index: Int, context: TaskContext) extends Iterator[T] { var iter : Iterator[T] = null var i = index def hasNext() : Boolean = { while ((iter == null || !iter.hasNext) i partitions.length) { val ctx = new CombineTaskContext(context.stageId, context.partitionId, context.taskAttemptId, context.attemptNumber, null/*context.taskMemoryManager*/, context.isRunningLocally, context.taskMetrics) iter = firstParent[T].compute(partitions(i), ctx) //ctx.complete() partitions(i) = null i = i + maxPartitions } iter != null iter.hasNext } def next() = { iter.next } } class CombineTaskContext(val stageId: Int, val partitionId: Int, override val taskAttemptId: Long, override val attemptNumber: Int, override val taskMemoryManager: TaskMemoryManager, val runningLocally: Boolean = true, val taskMetrics: TaskMetrics = null) extends TaskContext { @transient private val onCompleteCallbacks = new ArrayBuffer[TaskCompletionListener] override def attemptId(): Long = taskAttemptId override def addTaskCompletionListener(listener: TaskCompletionListener): this.type = { onCompleteCallbacks += listener this } def complete(): Unit = { // Process complete callbacks in the reverse order of registration onCompleteCallbacks.reverse.foreach { listener = listener.onTaskCompletion(this) } } override def addTaskCompletionListener(f: TaskContext = Unit): this.type = { onCompleteCallbacks += new TaskCompletionListener { override def onTaskCompletion(context: TaskContext): Unit = f(context) } this } override def addOnCompleteCallback(f: () = Unit) { onCompleteCallbacks += new TaskCompletionListener { override def onTaskCompletion(context: TaskContext): Unit = f() } } override def isCompleted(): Boolean = false override def isRunningLocally(): Boolean = true override def isInterrupted(): Boolean = false } case class CombinePartition(index : Int) extends Partition protected def getPartitions: Array[Partition] = Array.tabulate(maxPartitions){i = CombinePartition(i)} override def compute(partition: Partition, context: TaskContext): Iterator[T] = { new CombineIterator(inputPartitions, partition.index, context) } } {quote} I works: no memory overflow or file limit exhaustion. But certainly it can not be considered as solution of the problem. Also please notice that I have to comment call of ctx.complete(), otherwise I got exception caused by access to closed stream. It is strange because I think that partition corresponds to single parquet file and so it can be proceeded independently. But looks like GCfinalization do their work. 
Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD - Key: SPARK-9067 URL: https://issues.apache.org/jira/browse/SPARK-9067 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.3.0, 1.4.0 Environment: Target system: Linux, 16 cores, 400Gb RAM Spark is started locally using the following command: {{ spark-submit --master local[16] --driver-memory 64G --executor-cores 16 --num-executors 1 --executor-memory 64G }} Reporter: konstantin knizhnik If the coalesce transformation with a small number of output partitions (in my case 16) is applied to a large Parquet file (in my case about 150Gb with 215k partitions), then it causes OutOfMemory exceptions (250Gb is not enough) and open file limit exhaustion (with the limit set to 8k). The source of the problem is in the SqlNewHadoopRDD.compute method: {quote} val reader = format.createRecordReader( split.serializableHadoopSplit.value, hadoopAttemptContext) reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext) // Register an
[jira] [Commented] (SPARK-9067) Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
[ https://issues.apache.org/jira/browse/SPARK-9067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629358#comment-14629358 ] konstantin knizhnik commented on SPARK-9067: Sorry, but this patch doesn't help. Looks like close is not closing everything... For example there is reference *parquet.io.RecordReaderT recordReader* in the class *parquet.hadoop.InternalParquetRecordReader* and according to hprof dump at OutOfMemory exception it contains references to array of *parquet.column.impl.ColumnReaderImpl* and after few indirections we reach *parquet.column.values.bitpacking.ByteBitPackingValuesReader* which field +encoded+ references 9Mb array. And InternalParquetRecordReader.close method doesn't close recordReader: {quote} public void close() throws IOException { if (reader != null) { reader.close(); } } {quote} Unfortunately I am not sure that it is the single place where close is not releasing all resources. Moreover I am not sure that even if close if close is done, it clears references to all used buffers. Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD - Key: SPARK-9067 URL: https://issues.apache.org/jira/browse/SPARK-9067 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 1.3.0, 1.4.0 Environment: Target system: Linux, 16 cores, 400Gb RAM Spark is started locally using the following command: {{ spark-submit --master local[16] --driver-memory 64G --executor-cores 16 --num-executors 1 --executor-memory 64G }} Reporter: konstantin knizhnik If coalesce transformation with small number of output partitions (in my case 16) is applied to large Parquet file (in my has about 150Gb with 215k partitions), then it case OutOfMemory exceptions 250Gb is not enough) and open file limit exhaustion (with limit set to 8k). The source of the problem is in SqlNewHad\oopRDD.compute method: {quote} val reader = format.createRecordReader( split.serializableHadoopSplit.value, hadoopAttemptContext) reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext) // Register an on-task-completion callback to close the input stream. context.addTaskCompletionListener(context = close()) {quote} Created Parquet file reader is intended to be closed at task completion time. This reader contains a lot of references to parquet.bytes.BytesInput object which in turn contains reference sot large byte arrays (some of them are several megabytes). As far as in case of CoalescedRDD task is completed only after processing larger number of parquet files, it cause file handles exhaustion and memory overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
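Given the mechanism described in the report (readers are only closed when a task completes, and a coalesced task reads many files), one possible mitigation until the reader lifecycle is fixed is to reduce the partition count with a shuffle, so each upstream task still reads and closes a single split. A rough PySpark sketch, at the cost of shuffling the data; path and partition count are illustrative:
{code}
# Possible mitigation sketch: a shuffle keeps one read task per input split,
# so each task's completion callback closes its Parquet reader before the
# merged 16-partition stage runs.
df = sqlContext.read.parquet("hdfs:///data/big_parquet_table")

reduced = df.repartition(16)              # DataFrame-level, always shuffles
# or, at the RDD level:
reduced_rdd = df.rdd.coalesce(16, shuffle=True)
{code}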
[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409 ] Lianhui Wang edited comment on SPARK-8646 at 7/16/15 8:25 AM: -- yes, when i use this command: ./bin/spark-submit ./pi.py yarn-client 10, yarn' client do not upload pyspark.zip, so that can not be worked. i submit a PR that resolve this problem based on master branch. there is some problems on spark-1.4.0 branch because it finds pyspark libraries in sparkSubmit, not in Client. was (Author: lianhuiwang): yes, when i use this command: ./bin/spark-submit ./pi.py yarn-client 10, yarn' client do not upload pyspark.zip, so that can not be worked. i submit a PR that resolve this problem based on master branch. PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs result in a no module named pyspark when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (ie comands the use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN if master not provided in command line
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629409#comment-14629409 ] Lianhui Wang edited comment on SPARK-8646 at 7/16/15 8:26 AM: -- yes, when i use this command: ./bin/spark-submit ./pi.py yarn-client 10, yarn' client do not upload pyspark.zip, so that can not be worked. i submit a PR that resolve this problem based on master branch. there is some problems on spark-1.4.0 branch because it finds pyspark libraries in sparkSubmit, not in Client. if this must be needed in spark-1.4.0, latter i will take a look at it. was (Author: lianhuiwang): yes, when i use this command: ./bin/spark-submit ./pi.py yarn-client 10, yarn' client do not upload pyspark.zip, so that can not be worked. i submit a PR that resolve this problem based on master branch. there is some problems on spark-1.4.0 branch because it finds pyspark libraries in sparkSubmit, not in Client. PySpark does not run on YARN if master not provided in command line --- Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs result in a no module named pyspark when run in yarn-client mode in spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to spark. Scripts that worked on previous spark versions (ie comands the use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9019) spark-submit fails on yarn with kerberos enabled
[ https://issues.apache.org/jira/browse/SPARK-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629432#comment-14629432 ] Bolke de Bruin commented on SPARK-9019: --- I tried running this on an update environment, however it still fails although behavior is a bit different now. The task is now being accepted but stays in the running state forever without executing anything. Please note that the trace below is without key tab usage, but with an authorized user (kinit admin/admin) 15/07/16 04:27:34 DEBUG Client: getting client out of cache: org.apache.hadoop.ipc.Client@53abb73 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AkkaMessage(ReviveOffers,false) from Actor[akka://sparkDriver/deadLetters] 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(ReviveOffers,false) 15/07/16 04:27:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (1.632126 ms) AkkaMessage(ReviveOffers,false) from Actor[akka://sparkDriver/deadLetters] 15/07/16 04:27:34 DEBUG AbstractService: Service org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl is started 15/07/16 04:27:34 DEBUG AbstractService: Service org.apache.hadoop.yarn.client.api.impl.YarnClientImpl is started 15/07/16 04:27:34 DEBUG Client: The ping interval is 6 ms. 15/07/16 04:27:34 DEBUG Client: Connecting to node6.local/10.79.10.6:8050 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedAction as:admin (auth:SIMPLE) from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717) 15/07/16 04:27:34 DEBUG SaslRpcClient: Sending sasl message state: NEGOTIATE 15/07/16 04:27:34 DEBUG SaslRpcClient: Received SASL message state: NEGOTIATE auths { method: TOKEN mechanism: DIGEST-MD5 protocol: serverId: default challenge: realm=\default\,nonce=\wjgFp9L22uDJt41FNtY9M8CP/T+dswfBoF48r9+s\,qop=\auth\,charset=utf-8,algorithm=md5-sess } auths { method: KERBEROS mechanism: GSSAPI protocol: rm serverId: node6.local } 15/07/16 04:27:34 DEBUG SaslRpcClient: Get token info proto:interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB info:org.apache.hadoop.yarn.security.client.ClientRMSecurityInfo$2@69990fa7 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Looking for a token with service 10.79.10.6:8050 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is YARN_AM_RM_TOKEN and the token's service name is 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is HIVE_DELEGATION_TOKEN and the token's service name is 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is TIMELINE_DELEGATION_TOKEN and the token's service name is 10.79.10.6:8188 15/07/16 04:27:34 DEBUG RMDelegationTokenSelector: Token kind is HDFS_DELEGATION_TOKEN and the token's service name is 10.79.10.4:8020 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedActionException as:admin (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedAction as:admin (auth:SIMPLE) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643) 15/07/16 04:27:34 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG UserGroupInformation: PrivilegedActionException as:admin (auth:SIMPLE) 
cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 15/07/16 04:27:34 DEBUG Client: closing ipc connection to node6.local/10.79.10.6:8050: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at
[jira] [Created] (SPARK-9098) Inconsistent Dense Vectors hashing between PySpark and Scala
Maciej Szymkiewicz created SPARK-9098: - Summary: Inconsistent Dense Vectors hashing between PySpark and Scala Key: SPARK-9098 URL: https://issues.apache.org/jira/browse/SPARK-9098 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.4.0, 1.3.1 Reporter: Maciej Szymkiewicz Priority: Minor When using Scala it is possible to group an RDD using a DenseVector as a key: {code} import org.apache.spark.mllib.linalg.Vectors val rdd = sc.parallelize( (Vectors.dense(1, 2, 3), 10) :: (Vectors.dense(1, 2, 3), 20) :: Nil) rdd.groupByKey.count {code} returns 1 as expected. In PySpark {{DenseVector}} {{__hash__}} seems to be inherited from {{object}} and based on the memory address: {code} from pyspark.mllib.linalg import DenseVector rdd = sc.parallelize( [(DenseVector([1, 2, 3]), 10), (DenseVector([1, 2, 3]), 20)]) rdd.groupByKey().count() {code} returns 2. Since the underlying `numpy.ndarray` can be used to mutate a DenseVector, hashing doesn't look meaningful at all: {code} dv = DenseVector([1, 2, 3]) hdv1 = hash(dv) dv.array[0] = 3.0 hdv2 = hash(dv) hdv1 == hdv2 True dv == DenseVector([1, 2, 3]) False {code} In my opinion the best approach would be to enforce immutability and provide meaningful hashing. An alternative is to make {{DenseVector}} unhashable, the same as {{numpy.ndarray}}. Source: http://stackoverflow.com/questions/31449412/how-to-groupbykey-a-rdd-with-densevector-as-key-in-spark/31451752 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
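Until the hashing semantics are settled one way or the other, a caller-side workaround is to key by an immutable, hashable representation of the vector. A small sketch; using a tuple of the values is just one illustrative choice:
{code}
# Workaround sketch: group by a tuple of the vector's values rather than by
# the DenseVector object, so equal vectors hash equally.
from pyspark.mllib.linalg import DenseVector

rdd = sc.parallelize([(DenseVector([1, 2, 3]), 10),
                      (DenseVector([1, 2, 3]), 20)])

keyed = rdd.map(lambda kv: (tuple(kv[0].toArray()), kv[1]))
print(keyed.groupByKey().count())  # 1, matching the Scala behaviour
{code}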
[jira] [Created] (SPARK-9099) spark-ec2 does not add important ports to security group
Brian Sung-jin Hong created SPARK-9099: -- Summary: spark-ec2 does not add important ports to security group Key: SPARK-9099 URL: https://issues.apache.org/jira/browse/SPARK-9099 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0, 1.4.1 Reporter: Brian Sung-jin Hong The spark-ec2 script fails to add a few important ports to the security group, including: Master 6066: Needed to submit jobs outside of the cluster Slave 4040: Needed to view worker state Slave 8082: Needed to view some worker logs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
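Until the script change lands, the missing rules can be added by hand. A rough boto 2 sketch under stated assumptions: the region, cluster name and CIDR are placeholders, and spark-ec2 is assumed to have created groups named {{<cluster>-master}} and {{<cluster>-slaves}}:
{code}
# Rough sketch of opening the ports listed above by hand (boto 2).
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")          # placeholder region
groups = dict((g.name, g) for g in conn.get_all_security_groups())

groups["my-cluster-master"].authorize("tcp", 6066, 6066, "0.0.0.0/0")  # job submission
groups["my-cluster-slaves"].authorize("tcp", 4040, 4040, "0.0.0.0/0")  # worker/app UI
groups["my-cluster-slaves"].authorize("tcp", 8082, 8082, "0.0.0.0/0")  # worker logs
{code}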
[jira] [Issue Comment Deleted] (SPARK-9044) Updated RDD name does not reflect under Storage tab
[ https://issues.apache.org/jira/browse/SPARK-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhang, Liye updated SPARK-9044: --- Comment: was deleted (was: Well, I think the component is correct, still it's business of Web UI) Updated RDD name does not reflect under Storage tab - Key: SPARK-9044 URL: https://issues.apache.org/jira/browse/SPARK-9044 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1, 1.4.0 Environment: Mac OSX Reporter: Wenjie Zhang Priority: Minor I was playing the spark-shell in my macbook, here is what I did: scala val textFile = sc.textFile(/Users/jackzhang/Downloads/ProdPart.txt); scala textFile.cache scala textFile.setName(test1) scala textFile.collect scala textFile.name res10: String = test1 After this four commands, I can see the test1 RDD listed in the Storage tab. However, if I continually run following commands, nothing will happen from the Storage tab: scala textFile.setName(test2) scala textFile.cache scala textFile.collect scala textFile.name res10: String = test2 I am expecting the name of the RDD shows in Storage tab should be test2, is this a bug? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629632#comment-14629632 ] Apache Spark commented on SPARK-9091: - User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/7442 Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to have this interface in DStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9100) DataFrame reader/writer shortcut methods for ORC
Cheng Lian created SPARK-9100: - Summary: DataFrame reader/writer shortcut methods for ORC Key: SPARK-9100 URL: https://issues.apache.org/jira/browse/SPARK-9100 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
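The ticket carries no description, but judging from the title it presumably mirrors the existing {{parquet}}/{{json}} shortcuts on {{DataFrameReader}}/{{DataFrameWriter}}. A hedged sketch of the long form that already works (the ORC source requires a Hive-enabled context) next to the kind of shortcut being proposed; paths are illustrative:
{code}
# Long form that works today (ORC needs HiveContext), plus the shortcut this
# ticket presumably adds, shown commented out since it does not exist yet.
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)

df = sqlContext.read.format("orc").load("/data/events_orc")
df.write.format("orc").save("/data/events_orc_copy")

# Proposed shortcut, analogous to .parquet()/.json():
# df = sqlContext.read.orc("/data/events_orc")
# df.write.orc("/data/events_orc_copy")
{code}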
[jira] [Commented] (SPARK-9098) Inconsistent Dense Vectors hashing between PySpark and Scala
[ https://issues.apache.org/jira/browse/SPARK-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629685#comment-14629685 ] Abou Haydar Elias commented on SPARK-9098: -- This issue creates an inconsistency in the API. So I totally agree with [~zero323] on enforcing immutability and providing meaningful hashing. That can be a good approach. Inconsistent Dense Vectors hashing between PySpark and Scala Key: SPARK-9098 URL: https://issues.apache.org/jira/browse/SPARK-9098 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.3.1, 1.4.0 Reporter: Maciej Szymkiewicz Priority: Minor When using Scala it is possible to group an RDD using a DenseVector as a key: {code} import org.apache.spark.mllib.linalg.Vectors val rdd = sc.parallelize( (Vectors.dense(1, 2, 3), 10) :: (Vectors.dense(1, 2, 3), 20) :: Nil) rdd.groupByKey.count {code} returns 1 as expected. In PySpark {{DenseVector}} {{__hash__}} seems to be inherited from {{object}} and based on the memory address: {code} from pyspark.mllib.linalg import DenseVector rdd = sc.parallelize( [(DenseVector([1, 2, 3]), 10), (DenseVector([1, 2, 3]), 20)]) rdd.groupByKey().count() {code} returns 2. Since the underlying `numpy.ndarray` can be used to mutate a DenseVector, hashing doesn't look meaningful at all: {code} dv = DenseVector([1, 2, 3]) hdv1 = hash(dv) dv.array[0] = 3.0 hdv2 = hash(dv) hdv1 == hdv2 True dv == DenseVector([1, 2, 3]) False {code} In my opinion the best approach would be to enforce immutability and provide meaningful hashing. An alternative is to make {{DenseVector}} unhashable, the same as {{numpy.ndarray}}. Source: http://stackoverflow.com/questions/31449412/how-to-groupbykey-a-rdd-with-densevector-as-key-in-spark/31451752 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9101) Can't use null in selectExpr
Mateusz Buśkiewicz created SPARK-9101: - Summary: Can't use null in selectExpr Key: SPARK-9101 URL: https://issues.apache.org/jira/browse/SPARK-9101 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Mateusz Buśkiewicz In 1.3.1 this worked: {code:python} df = sqlContext.createDataFrame([[1]], schema=['col']) df.selectExpr('null as newCol').collect() {code} In 1.4.0 it fails with the following stacktrace: {code} Traceback (most recent call last): File input, line 1, in module File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/dataframe.py, line 316, in collect cls = _create_cls(self.schema) File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/dataframe.py, line 229, in schema self._schema = _parse_datatype_json_string(self._jdf.schema().json()) File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/types.py, line 519, in _parse_datatype_json_string return _parse_datatype_json_value(json.loads(json_string)) File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/types.py, line 539, in _parse_datatype_json_value return _all_complex_types[tpe].fromJson(json_value) File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/types.py, line 386, in fromJson return StructType([StructField.fromJson(f) for f in json[fields]]) File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/types.py, line 347, in fromJson _parse_datatype_json_value(json[type]), File /opt/boxen/homebrew/opt/apache-spark/libexec/python/pyspark/sql/types.py, line 535, in _parse_datatype_json_value raise ValueError(Could not parse datatype: %s % json_value) ValueError: Could not parse datatype: null {code} https://github.com/apache/spark/blob/v1.4.0/python/pyspark/sql/types.py#L461 The cause:_atomic_types doesn't contain NullType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
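Until NullType is handled on the Python side, one workaround that avoids putting a NullType into the schema is to cast the null to a concrete type in the expression. A minimal sketch, assuming the same {{sqlContext}} and data as above:
{code}
# Workaround sketch: give the null column a concrete type so the returned
# schema contains no NullType and parses cleanly in Python.
df = sqlContext.createDataFrame([[1]], schema=['col'])
df.selectExpr('cast(null as string) as newCol').collect()
{code}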
[jira] [Assigned] (SPARK-9100) DataFrame reader/writer shortcut methods for ORC
[ https://issues.apache.org/jira/browse/SPARK-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9100: --- Assignee: Cheng Lian (was: Apache Spark) DataFrame reader/writer shortcut methods for ORC Key: SPARK-9100 URL: https://issues.apache.org/jira/browse/SPARK-9100 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9100) DataFrame reader/writer shortcut methods for ORC
[ https://issues.apache.org/jira/browse/SPARK-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9100: --- Assignee: Apache Spark (was: Cheng Lian) DataFrame reader/writer shortcut methods for ORC Key: SPARK-9100 URL: https://issues.apache.org/jira/browse/SPARK-9100 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9091) Add the codec interface to Text DStream.
[ https://issues.apache.org/jira/browse/SPARK-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629639#comment-14629639 ] SaintBacchus commented on SPARK-9091: - [~sowen] I agree user can design the output by DStream.foreachRDD, I purpose it for convenience to use. In my case, I had copy a bit code from Spark to adapt this function and I guess others may also have this scenario, so I open this Jira to push it into Spark. Add the codec interface to Text DStream. Key: SPARK-9091 URL: https://issues.apache.org/jira/browse/SPARK-9091 Project: Spark Issue Type: Improvement Components: Streaming Reporter: SaintBacchus Priority: Minor Since the RDD has the function *saveAsTextFile* which can use *CompressionCodec* to compress the data, so it's better to add a similar interface in DStream. In some IO-bottleneck scenario, it's very useful for user to have this interface in DStream. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
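The foreachRDD route mentioned in the comment already allows compressed text output; the proposal is essentially to fold this boilerplate into DStream. A sketch of that workaround in the Python API, where the DStream {{lines}}, the output prefix and the codec class are illustrative:
{code}
# Sketch of the existing foreachRDD workaround: per-batch compressed text
# output without a dedicated DStream method.
def save_compressed(time, rdd):
    # `time` is the batch time; turn it into a filesystem-friendly suffix
    if not rdd.isEmpty():
        rdd.saveAsTextFile(
            "hdfs:///tmp/stream-out-" + time.strftime("%Y%m%d-%H%M%S"),
            compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

lines.foreachRDD(save_compressed)  # `lines` is an assumed existing DStream
{code}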
[jira] [Updated] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-9097: - Attachment: number of executor is zero.png Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Attachments: number of executor is zero.png I set the value of spark.dynamicAllocation.enabled is true. I submit tasks to run. Tasks are not completed, but the number of executor is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9099) spark-ec2 does not add important ports to security group
[ https://issues.apache.org/jira/browse/SPARK-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629649#comment-14629649 ] Apache Spark commented on SPARK-9099: - User 'serialx' has created a pull request for this issue: https://github.com/apache/spark/pull/7443 spark-ec2 does not add important ports to security group Key: SPARK-9099 URL: https://issues.apache.org/jira/browse/SPARK-9099 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0, 1.4.1 Reporter: Brian Sung-jin Hong The spark-ec2 script fails to add a few important ports to the security group, including: Master 6066: Needed to submit jobs outside of the cluster Slave 4040: Needed to view worker state Slave 8082: Needed to view some worker logs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629673#comment-14629673 ] KaiXinXIaoLei commented on SPARK-9097: -- I run a big job. During running tasks, five tasks failed. Then executors are killed. But there are many tasks to run. The log info: 2015-07-08 15:03:30,583 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1568.0 in stage 167.0 (TID 25557, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1549.0 in stage 167.0 (TID 25538, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1552.0 in stage 167.0 (TID 25541, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1569.0 in stage 167.0 (TID 25558, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1548.0 in stage 167.0 (TID 25537, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | INFO | [dag-scheduler-event-loop] | Executor lost: 52 (epoch 29) 2015-07-08 15:03:30,584 | INFO | [kill-executor-thread] | Requesting to kill executor(s) 52 2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Trying to remove executor 52 from BlockManagerMaster. 2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Removing block manager BlockManagerId(52, 9.91.8.174, 23424) 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Removed 52 successfully in removeExecutor 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Host added was in lost list earlier: hostname Then I can't find executors to add, and not find failed task to re-submit. Thanks. Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Attachments: number of executor is zero.png, tasks are not completed.png I set the value of spark.dynamicAllocation.enabled is true. I submit tasks to run. Tasks are not completed, but the number of executor is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629673#comment-14629673 ] KaiXinXIaoLei edited comment on SPARK-9097 at 7/16/15 12:57 PM: I run a big job. During running tasks, five tasks failed. Then executors are killed. But there are many tasks to run. The log info: 2015-07-08 15:03:30,583 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1568.0 in stage 167.0 (TID 25557, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1549.0 in stage 167.0 (TID 25538, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1552.0 in stage 167.0 (TID 25541, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1569.0 in stage 167.0 (TID 25558, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1548.0 in stage 167.0 (TID 25537, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | INFO | [dag-scheduler-event-loop] | Executor lost: 52 (epoch 29) 2015-07-08 15:03:30,584 | INFO | [kill-executor-thread] | Requesting to kill executor(s) 52 2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Trying to remove executor 52 from BlockManagerMaster. 2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Removing block manager BlockManagerId(52, 9.91.8.174, 23424) 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Removed 52 successfully in removeExecutor 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Host added was in lost list earlier: hostname Then I can't find executors to add, and not find failed task to re-submit in log. Thanks. was (Author: kaixinxiaolei): I run a big job. During running tasks, five tasks failed. Then executors are killed. But there are many tasks to run. The log info: 2015-07-08 15:03:30,583 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1568.0 in stage 167.0 (TID 25557, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1549.0 in stage 167.0 (TID 25538, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1552.0 in stage 167.0 (TID 25541, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1569.0 in stage 167.0 (TID 25558, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | WARN | [sparkDriver-akka.actor.default-dispatcher-43] | Lost task 1548.0 in stage 167.0 (TID 25537, linux-174): ExecutorLostFailure (executor 52 lost) 2015-07-08 15:03:30,584 | INFO | [dag-scheduler-event-loop] | Executor lost: 52 (epoch 29) 2015-07-08 15:03:30,584 | INFO | [kill-executor-thread] | Requesting to kill executor(s) 52 2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Trying to remove executor 52 from BlockManagerMaster. 
2015-07-08 15:03:30,585 | INFO | [sparkDriver-akka.actor.default-dispatcher-30] | Removing block manager BlockManagerId(52, 9.91.8.174, 23424) 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Removed 52 successfully in removeExecutor 2015-07-08 15:03:30,585 | INFO | [dag-scheduler-event-loop] | Host added was in lost list earlier: hostname Then I can't find executors to add, and not find failed task to re-submit. Thanks. Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Attachments: number of executor is zero.png, tasks are not completed.png I set the value of spark.dynamicAllocation.enabled is true. I submit tasks to run. Tasks are not completed, but the number of executor is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gisle Ytrestøl updated SPARK-9096: -- Attachment: hanging-one-task.jpg Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Priority: Minor Attachments: ReproduceBug.java, hanging-one-task.jpg, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the following operations on the new JavaRDD created by subtract(). The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks take a long time to finish. I've reproduced this bug in the attached Java file, which I submit with spark-submit. The logs for 1.3.1 and 1.4.1 are attached. In 1.4.1, we see that a few tasks in the count job take a lot of time: 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID 4659) in 708 ms on 148.251.190.217 (1597/1600) 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID 4786) in 772 ms on 148.251.190.217 (1598/1600) 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID 4582) in 275019 ms on 148.251.190.217 (1599/1600) 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID 4430) in 407020 ms on 148.251.190.217 (1600/1600) 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at ReproduceBug.java:56) finished in 420.024 s 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at ReproduceBug.java:56, took 442.941395 s In comparison, all tasks are more or less equal in size when running the same application in Spark 1.3.1. Overall, the attached application (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1 and completes in roughly 30 seconds on Spark 1.3.1. Spark 1.4.0 behaves similarly to Spark 1.4.1 with respect to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-9097: - Attachment: tasks are not completed.png Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Attachments: number of executor is zero.png, tasks are not completed.png I set spark.dynamicAllocation.enabled to true and submitted tasks to run. The tasks are not completed, but the number of executors is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9097) Tasks are not completed but the number of executor is zero
[ https://issues.apache.org/jira/browse/SPARK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-9097: - Target Version/s: 1.5.0 Tasks are not completed but the number of executor is zero -- Key: SPARK-9097 URL: https://issues.apache.org/jira/browse/SPARK-9097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Attachments: number of executor is zero.png, tasks are not completed.png I set spark.dynamicAllocation.enabled to true and submitted tasks to run. The tasks are not completed, but the number of executors is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9044) Updated RDD name does not reflect under Storage tab
[ https://issues.apache.org/jira/browse/SPARK-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629704#comment-14629704 ] Zhang, Liye commented on SPARK-9044: Well, I think the component is correct; it's still the business of the Web UI. Updated RDD name does not reflect under Storage tab - Key: SPARK-9044 URL: https://issues.apache.org/jira/browse/SPARK-9044 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1, 1.4.0 Environment: Mac OSX Reporter: Wenjie Zhang Priority: Minor I was playing with the spark-shell on my MacBook; here is what I did: scala> val textFile = sc.textFile("/Users/jackzhang/Downloads/ProdPart.txt"); scala> textFile.cache scala> textFile.setName("test1") scala> textFile.collect scala> textFile.name res10: String = test1 After these four commands, I can see the test1 RDD listed in the Storage tab. However, if I then run the following commands, nothing changes in the Storage tab: scala> textFile.setName("test2") scala> textFile.cache scala> textFile.collect scala> textFile.name res10: String = test2 I expect the name of the RDD shown in the Storage tab to be test2; is this a bug? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9096) Unevenly distributed task loads after using JavaRDD.subtract()
[ https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629584#comment-14629584 ] Gisle Ytrestøl commented on SPARK-9096: --- Hi, thanks for responding. I've added a screenshot (hanging-one-task.jpg) in which one of the tasks processes a lot of data (nearly all of it), while the other tasks are assigned only tiny amounts of data. This screenshot is from 1.4.0. If I run the same application in Spark 1.3.1, all the tasks get roughly the same amount of data and spend about the same amount of time to finish. Unevenly distributed task loads after using JavaRDD.subtract() -- Key: SPARK-9096 URL: https://issues.apache.org/jira/browse/SPARK-9096 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.4.0, 1.4.1 Reporter: Gisle Ytrestøl Priority: Minor Attachments: ReproduceBug.java, hanging-one-task.jpg, reproduce.1.3.1.log.gz, reproduce.1.4.1.log.gz When using JavaRDD.subtract(), it seems that the tasks are unevenly distributed in the following operations on the new JavaRDD created by subtract(). The result is that in the following operation on the new JavaRDD, a few tasks process almost all the data, and these tasks take a long time to finish. I've reproduced this bug in the attached Java file, which I submit with spark-submit. The logs for 1.3.1 and 1.4.1 are attached. In 1.4.1, we see that a few tasks in the count job take a lot of time: 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID 4659) in 708 ms on 148.251.190.217 (1597/1600) 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID 4786) in 772 ms on 148.251.190.217 (1598/1600) 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID 4582) in 275019 ms on 148.251.190.217 (1599/1600) 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID 4430) in 407020 ms on 148.251.190.217 (1600/1600) 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at ReproduceBug.java:56) finished in 420.024 s 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at ReproduceBug.java:56, took 442.941395 s In comparison, all tasks are more or less equal in size when running the same application in Spark 1.3.1. Overall, the attached application (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1 and completes in roughly 30 seconds on Spark 1.3.1. Spark 1.4.0 behaves similarly to Spark 1.4.1 with respect to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
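A minimal Scala sketch of the pattern being reported (an RDD produced by subtract() followed by a count), with an extra per-partition size check to make any skew visible; the data sizes and partition counts are assumptions for illustration and are not taken from ReproduceBug.java:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("subtract-skew-sketch"))

// Two overlapping datasets; sizes and partition counts are illustrative only.
val all    = sc.parallelize(1 to 1000000, 200)
val remove = sc.parallelize(1 to 900000, 200)

// The RDD produced by subtract is where the uneven task sizes are observed.
val remaining = all.subtract(remove)

// Count, as in ReproduceBug.java: the report says a few tasks end up with nearly all the data on 1.4.x.
println(remaining.count())

// Illustrative diagnostic: record count per partition of the subtracted RDD.
remaining
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
  .collect()
  .foreach { case (i, n) => println(s"partition $i holds $n records") }
{code}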
[jira] [Commented] (SPARK-6001) K-Means clusterer should return the assignments of input points to clusters
[ https://issues.apache.org/jira/browse/SPARK-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629808#comment-14629808 ] Manoj Kumar commented on SPARK-6001: I just started working on this. K-Means clusterer should return the assignments of input points to clusters --- Key: SPARK-6001 URL: https://issues.apache.org/jira/browse/SPARK-6001 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: Derrick Burns Priority: Minor The K-Means clusterer returns a KMeansModel that contains the cluster centers. However, I suggest that, when available, the K-Means clusterer also return an RDD of the assignments of the input data to the clusters. While the assignments can be computed from the KMeansModel, returning them when they are already available would save re-computation costs. The K-means implementation at https://github.com/derrickburns/generalized-kmeans-clustering returns the assignments when available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
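For context, a minimal sketch of how the assignments currently have to be recomputed from the returned KMeansModel; the input data, k, and iteration count are illustrative, and sc is assumed to be an existing SparkContext:
{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Illustrative input: a tiny RDD of feature vectors.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Train with k = 2 clusters and 20 iterations (illustrative values).
val model = KMeans.train(data, 2, 20)

// What this issue asks the clusterer to return directly: the cluster index of each
// input point. Today it is recomputed with predict(), an extra pass over the data.
val assignments = model.predict(data)   // RDD[Int]
data.zip(assignments).collect().foreach { case (point, cluster) =>
  println(s"$point -> cluster $cluster")
}
{code}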
[jira] [Created] (SPARK-9104) expose network layer memory usage in shuffle part
Zhang, Liye created SPARK-9104: -- Summary: expose network layer memory usage in shuffle part Key: SPARK-9104 URL: https://issues.apache.org/jira/browse/SPARK-9104 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Zhang, Liye The default network transport is Netty, and when transferring blocks for shuffle, the network layer consumes a fair amount of memory; we should collect the memory usage of this part and expose it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9105) Add an additional WebUi Tab for Memory Usage
Zhang, Liye created SPARK-9105: -- Summary: Add an additional WebUi Tab for Memory Usage Key: SPARK-9105 URL: https://issues.apache.org/jira/browse/SPARK-9105 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Zhang, Liye Add a Spark WebUI tab for memory usage. The tab should expose memory usage status in different Spark components. It should show a summary for each executor and possibly also the details for each task. This tab may duplicate some information from the Storage tab, but in a different presentation: take RDD cache for example, the RDD cache size shown on the Storage tab is indexed by RDD name, while on the memory usage tab the RDDs can be indexed by executor or task. Also, the two tabs can share some of the same web pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent
[ https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9073: --- Assignee: (was: Apache Spark) spark.ml Models copy() should call setParent when there is a parent --- Key: SPARK-9073 URL: https://issues.apache.org/jira/browse/SPARK-9073 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Priority: Minor Examples with this mistake include: * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119] * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220] Whoever writes a PR for this JIRA should check all spark.ml Models' copy() methods and set the copy's {{Model.parent}} when available. Also verify this in unit tests (possibly in a shared standard method for checking Models). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9073) spark.ml Models copy() should call setParent when there is a parent
[ https://issues.apache.org/jira/browse/SPARK-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9073: --- Assignee: Apache Spark spark.ml Models copy() should call setParent when there is a parent --- Key: SPARK-9073 URL: https://issues.apache.org/jira/browse/SPARK-9073 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Minor Examples with this mistake include: * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala#L119] * [https://github.com/apache/spark/blob/9716a727fb2d11380794549039e12e53c771e120/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L220] Whoever writes a PR for this JIRA should check all spark.ml Models' copy() methods and set the copy's {{Model.parent}} when available. Also verify this in unit tests (possibly in a shared standard method for checking Models). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
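For reference, the fix amounts to having each Model's copy() propagate the parent, roughly along the following lines. This is a schematic fragment inside a model class, not standalone code or the actual patch; MyModel and its constructor are hypothetical stand-ins for concrete classes such as the linked DecisionTreeClassificationModel and ALSModel:
{code}
import org.apache.spark.ml.param.ParamMap

// Schematic only: MyModel stands in for any concrete spark.ml Model subclass.
override def copy(extra: ParamMap): MyModel = {
  val copied = copyValues(new MyModel(uid), extra)
  // The step this JIRA is about: keep the link to the parent Estimator,
  // which is set when the model is produced by Estimator.fit().
  if (parent != null) copied.setParent(parent) else copied
}
{code}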
[jira] [Created] (SPARK-9102) Improve project collapse with nondeterministic expressions
Wenchen Fan created SPARK-9102: -- Summary: Improve project collapse with nondeterministic expressions Key: SPARK-9102 URL: https://issues.apache.org/jira/browse/SPARK-9102 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9103) Tracking spark's memory usage
Zhang, Liye created SPARK-9103: -- Summary: Tracking spark's memory usage Key: SPARK-9103 URL: https://issues.apache.org/jira/browse/SPARK-9103 Project: Spark Issue Type: Umbrella Components: Spark Core, Web UI Reporter: Zhang, Liye Currently Spark provides only a little memory usage information (RDD cache on the web UI) for the executors. Users have no idea what the memory consumption is when they are running Spark applications that use a lot of memory in the executors. Especially when they encounter an OOM, it's really hard to know what the cause of the problem is. So it would be helpful to give out detailed memory consumption information for each part of Spark, so that users can have a clear picture of where exactly the memory is used. The memory usage info to expose should include, but not be limited to, shuffle, cache, network, serializer, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9082) Filter using non-deterministic expressions should not be pushed down
[ https://issues.apache.org/jira/browse/SPARK-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9082: --- Assignee: Apache Spark (was: Wenchen Fan) Filter using non-deterministic expressions should not be pushed down Key: SPARK-9082 URL: https://issues.apache.org/jira/browse/SPARK-9082 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Apache Spark For example,
{code}
val df = sqlContext.range(1, 10).select($"id", rand(0).as('r))
df.as("a").join(df.filter($"r" < 0.5).as("b"), $"a.id" === $"b.id").explain(true)
{code}
The plan is
{code}
== Physical Plan ==
ShuffledHashJoin [id#55323L], [id#55327L], BuildRight
 Exchange (HashPartitioning 200)
  Project [id#55323L,Rand 0 AS r#55324]
   PhysicalRDD [id#55323L], MapPartitionsRDD[42268] at range at <console>:37
 Exchange (HashPartitioning 200)
  Project [id#55327L,Rand 0 AS r#55325]
   Filter (LessThan)
    PhysicalRDD [id#55327L], MapPartitionsRDD[42268] at range at <console>:37
{code}
The rand gets evaluated twice instead of once. This is caused by the fact that when we push down predicates, we replace the attribute reference in the predicate with the actual expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9082) Filter using non-deterministic expressions should not be pushed down
[ https://issues.apache.org/jira/browse/SPARK-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629816#comment-14629816 ] Apache Spark commented on SPARK-9082: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7446 Filter using non-deterministic expressions should not be pushed down Key: SPARK-9082 URL: https://issues.apache.org/jira/browse/SPARK-9082 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Wenchen Fan For example,
{code}
val df = sqlContext.range(1, 10).select($"id", rand(0).as('r))
df.as("a").join(df.filter($"r" < 0.5).as("b"), $"a.id" === $"b.id").explain(true)
{code}
The plan is
{code}
== Physical Plan ==
ShuffledHashJoin [id#55323L], [id#55327L], BuildRight
 Exchange (HashPartitioning 200)
  Project [id#55323L,Rand 0 AS r#55324]
   PhysicalRDD [id#55323L], MapPartitionsRDD[42268] at range at <console>:37
 Exchange (HashPartitioning 200)
  Project [id#55327L,Rand 0 AS r#55325]
   Filter (LessThan)
    PhysicalRDD [id#55327L], MapPartitionsRDD[42268] at range at <console>:37
{code}
The rand gets evaluated twice instead of once. This is caused by the fact that when we push down predicates, we replace the attribute reference in the predicate with the actual expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9100) DataFrame reader/writer shortcut methods for ORC
[ https://issues.apache.org/jira/browse/SPARK-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629752#comment-14629752 ] Apache Spark commented on SPARK-9100: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/7444 DataFrame reader/writer shortcut methods for ORC Key: SPARK-9100 URL: https://issues.apache.org/jira/browse/SPARK-9100 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
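Presumably this covers shortcut methods analogous to the existing parquet()/json() helpers on DataFrameReader/DataFrameWriter. A sketch of the intended usage under that assumption; the paths are illustrative, and sqlContext is assumed to be a HiveContext since ORC support lives in the Hive module:
{code}
// Today ORC goes through the generic data source API:
val people = sqlContext.read.format("orc").load("/tmp/people.orc")
people.write.format("orc").save("/tmp/people_backup.orc")

// With the proposed shortcuts, mirroring read.parquet()/write.parquet():
val people2 = sqlContext.read.orc("/tmp/people.orc")
people2.write.orc("/tmp/people_backup.orc")
{code}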
[jira] [Commented] (SPARK-9059) Update Direct Kafka Word count examples to show the use of HasOffsetRanges
[ https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629770#comment-14629770 ] Benjamin Fradet commented on SPARK-9059: I've started working on this. Update Direct Kafka Word count examples to show the use of HasOffsetRanges -- Key: SPARK-9059 URL: https://issues.apache.org/jira/browse/SPARK-9059 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Labels: starter Update the Scala, Java and Python examples of Direct Kafka word count to access the offset ranges using HasOffsetRanges and print them. For example, in Scala:
{code}
var offsetRanges: Array[OffsetRange] = _
...
directKafkaDStream.foreachRDD { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
}
...
transformedDStream.foreachRDD { rdd =>
  // some operation
  println("Processed ranges: " + offsetRanges)
}
{code}
See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for more info, and the master source code for more up-to-date information on Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org