[jira] [Updated] (SPARK-9611) UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter will add an empty entry if the map is empty.

2015-08-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9611:
--
Shepherd: Josh Rosen

 UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter will add an 
 empty entry if the map is empty.
 --

 Key: SPARK-9611
 URL: https://issues.apache.org/jira/browse/SPARK-9611
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.5.0


 There are two corner cases related to the destructAndCreateExternalSorter 
 (class UnsafeKVExternalSorter) returned by UnsafeFixedWidthAggregationMap.
 1. The constructor of UnsafeKVExternalSorter first tries to create an 
 UnsafeInMemorySorter based on the BytesToBytesMap of 
 UnsafeFixedWidthAggregationMap. However, when there is no entry in the map, 
 UnsafeInMemorySorter throws an AssertionError, because we use the size of the 
 map (0 here) as the initialSize of UnsafeInMemorySorter, which is not allowed.
 2. Once the first problem is fixed, when we use UnsafeKVExternalSorter's 
 KVSorterIterator to load the data back, there is one extra record, which is an 
 empty record.
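
 A minimal way to reach this empty-map path from user code is an aggregation over 
 empty input (a hedged sketch; column names are illustrative, and whether the 
 unsafe aggregation map is used depends on the planner and configuration):
 {code}
 // Hypothetical repro sketch: a group-by over an empty DataFrame leaves the
 // aggregation map with zero entries, which is the corner case described above.
 val empty = sqlContext.createDataFrame(Seq.empty[(Int, Int)]).toDF("k", "v")
 empty.groupBy("k").sum("v").collect()  // must return no rows, not one empty record
 {code}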






[jira] [Resolved] (SPARK-9119) In some cases, we may save wrong decimal values to parquet

2015-08-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9119.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7925
[https://github.com/apache/spark/pull/7925]

 In some cases, we may save wrong decimal values to parquet
 --

 Key: SPARK-9119
 URL: https://issues.apache.org/jira/browse/SPARK-9119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.5.0


 {code}
 import org.apache.spark.sql.Row
 import org.apache.spark.sql.types.{StructType, StructField, StringType, DecimalType}
 import org.apache.spark.sql.types.Decimal

 val schema = StructType(Array(StructField("name", DecimalType(10, 5), false)))
 val rowRDD = sc.parallelize(Array(Row(Decimal(67123.45))))
 val df = sqlContext.createDataFrame(rowRDD, schema)
 df.registerTempTable("test")
 df.show()

 // +--------+
 // |    name|
 // +--------+
 // |67123.45|
 // +--------+

 sqlContext.sql("create table testDecimal as select * from test")
 sqlContext.table("testDecimal").show()

 // +--------+
 // |    name|
 // +--------+
 // |67.12345|
 // +--------+
 {code}
 The problem is that when we do conversions, we do not use the precision/scale 
 info from the schema.
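
 The corrupted value is consistent with writing the unscaled digits while ignoring 
 the value's own scale and reading them back with the schema's scale. A small 
 illustration with plain java.math.BigDecimal (a hedged sketch, not Spark's actual 
 writer code):
 {code}
 import java.math.BigDecimal

 val v = new BigDecimal("67123.45")      // unscaled value 6712345, scale 2
 // Reinterpreting the same unscaled digits with the schema's scale of 5
 // yields exactly the bad value read back from the table above.
 val reinterpreted = new BigDecimal(v.unscaledValue, 5)
 println(reinterpreted)                  // 67.12345
 {code}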






[jira] [Resolved] (SPARK-8359) Spark SQL Decimal type precision loss on multiplication

2015-08-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8359.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7925
[https://github.com/apache/spark/pull/7925]

 Spark SQL Decimal type precision loss on multiplication
 ---

 Key: SPARK-8359
 URL: https://issues.apache.org/jira/browse/SPARK-8359
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Rene Treffer
Assignee: Davies Liu
 Fix For: 1.5.0


 It looks like the precision of a decimal cannot be raised beyond ~2^112 
 without the full value being truncated.
 The following code computes powers of two up to a given exponent:
 {code}
 import org.apache.spark.sql.types.Decimal
 val one = Decimal(1)
 val two = Decimal(2)
 def pow(n: Int): Decimal = if (n <= 0) { one } else {
   val a = pow(n - 1)
   a.changePrecision(n, 0)
   two.changePrecision(n, 0)
   a * two
 }
 (109 to 120).foreach(n =>
   println(pow(n).toJavaBigDecimal.unscaledValue.toString))
 649037107316853453566312041152512
 1298074214633706907132624082305024
 2596148429267413814265248164610048
 5192296858534827628530496329220096
 1038459371706965525706099265844019
 2076918743413931051412198531688038
 4153837486827862102824397063376076
 8307674973655724205648794126752152
 1661534994731144841129758825350430
 3323069989462289682259517650700860
 6646139978924579364519035301401720
 1329227995784915872903807060280344
 {code}
 Beyond ~2^112 the value is truncated, even though the precision was set to n 
 and should therefore be able to hold 10^n without problems.
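
 For comparison, the same doubling carried out with plain java.math.BigDecimal 
 keeps every digit well past 2^112, which is the behaviour one would expect once 
 the precision has been raised to n (a hedged sketch for illustration, not a 
 proposed fix):
 {code}
 import java.math.BigDecimal

 var acc = BigDecimal.ONE
 val two = new BigDecimal(2)
 (1 to 120).foreach(_ => acc = acc.multiply(two))
 // Prints 1329227995784915872903807060280344576 (2^120, 37 digits) in full,
 // whereas the Decimal-based loop above truncates it to 34 digits.
 println(acc.toBigInteger)
 {code}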






[jira] [Created] (SPARK-9627) SQL job failed if the dataframe is cached

2015-08-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-9627:
-

 Summary: SQL job failed if the dataframe is cached
 Key: SPARK-9627
 URL: https://issues.apache.org/jira/browse/SPARK-9627
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Davies Liu
Priority: Critical


{code}
import decimal
import random
from datetime import date, timedelta

from pyspark.sql.types import DateType, DecimalType, ShortType, StringType, StructType
from pyspark.sql.functions import sum

r = random.Random()

def gen(i):
    d = date.today() - timedelta(r.randint(0, 5000))
    cat = str(r.randint(0, 20)) * 5
    c = r.randint(0, 1000)
    price = decimal.Decimal(r.randint(0, 10)) / 100
    return (d, cat, c, price)

schema = StructType().add('date', DateType()).add('cat', StringType()) \
    .add('count', ShortType()).add('price', DecimalType(5, 2))

#df = sqlContext.createDataFrame(sc.range(124).map(gen), schema)
#df.show()
#df.write.parquet('sales4')


df = sqlContext.read.parquet('sales4')
df.cache()
df.count()
df.show()
print df.schema
raw_input()
r = df.groupBy(df.date, df.cat).agg(sum(df['count'] * df.price))
print r.explain(True)
r.show()
{code}

{code}
StructType(List(StructField(date,DateType,true),StructField(cat,StringType,true),StructField(count,ShortType,true),StructField(price,DecimalType(5,2),true)))


== Parsed Logical Plan ==
'Aggregate [date#0,cat#1], [date#0,cat#1,sum((count#2 * price#3)) AS sum((count 
* price))#70]
 Relation[date#0,cat#1,count#2,price#3] 
org.apache.spark.sql.parquet.ParquetRelation@5ec8f315

== Analyzed Logical Plan ==
date: date, cat: string, sum((count * price)): decimal(21,2)
Aggregate [date#0,cat#1], 
[date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, 
DecimalType(11,2) AS sum((count * price))#70]
 Relation[date#0,cat#1,count#2,price#3] 
org.apache.spark.sql.parquet.ParquetRelation@5ec8f315

== Optimized Logical Plan ==
Aggregate [date#0,cat#1], 
[date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, 
DecimalType(11,2) AS sum((count * price))#70]
 InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
StorageLevel(true, true, false, true, 1), (PhysicalRDD 
[date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None

== Physical Plan ==
NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, 
DecimalType(11,2)2,mode=Final,isDistinct=false))
 TungstenSort [date#0 ASC,cat#1 ASC], false, 0
  ConvertToUnsafe
   Exchange hashpartitioning(date#0,cat#1)
NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
DecimalType(5,0)), DecimalType(11,2))) * change_decimal_precision(CAST(price#3, 
DecimalType(11,2)2,mode=Partial,isDistinct=false))
 TungstenSort [date#0 ASC,cat#1 ASC], false, 0
  ConvertToUnsafe
   InMemoryColumnarTableScan [date#0,cat#1,count#2,price#3], 
(InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
StorageLevel(true, true, false, true, 1), (PhysicalRDD 
[date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None)

Code Generation: true
== RDD ==
None

15/08/04 23:21:53 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; 
aborting job
Traceback (most recent call last):
  File "t.py", line 34, in <module>
    r.show()
  File "/Users/davies/work/spark/python/pyspark/sql/dataframe.py", line 258, in show
    print(self._jdf.showString(n, truncate))
  File "/Users/davies/work/spark/python/lib/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/Users/davies/work/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/Users/davies/work/spark/python/lib/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 
10, localhost): java.lang.UnsupportedOperationException: tail of empty list
at scala.collection.immutable.Nil$.tail(List.scala:339)
at scala.collection.immutable.Nil$.tail(List.scala:334)
at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
at scala.reflect.internal.Symbols$Symbol.typeParams(Symbols.scala:1491)
at 
scala.reflect.internal.Types$NoArgsTypeRef.typeParams(Types.scala:2144)
at 
scala.reflect.internal.Types$TypeRef.initializedTypeParams(Types.scala:2408)
at 
scala.reflect.internal.Types$TypeRef.typeParamsMatchArgs(Types.scala:2409)
at 
scala.reflect.internal.Types$AliasTypeRef$class.dealias(Types.scala:2232)
at 
scala.reflect.internal.Types$TypeRef$$anon$3.dealias(Types.scala:2539)
 

[jira] [Resolved] (SPARK-9046) Decimal type support improvement and bug fix

2015-08-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9046.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Decimal type support improvement and bug fix
 

 Key: SPARK-9046
 URL: https://issues.apache.org/jira/browse/SPARK-9046
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Davies Liu
Priority: Critical
 Fix For: 1.5.0









[jira] [Created] (SPARK-9628) Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils

2015-08-05 Thread Yijie Shen (JIRA)
Yijie Shen created SPARK-9628:
-

 Summary: Rename Int and Long to SQLDate SQLTimestamp In 
DateTimeUtils
 Key: SPARK-9628
 URL: https://issues.apache.org/jira/browse/SPARK-9628
 Project: Spark
  Issue Type: Improvement
Reporter: Yijie Shen
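
A hedged sketch of what the rename could look like (illustrative only, not the 
actual patch): type aliases in DateTimeUtils that name the internal representation 
without changing it.
{code}
object DateTimeUtils {
  // Dates are represented as days since the epoch (Int), timestamps as
  // microseconds since the epoch (Long); the aliases only give these names.
  type SQLDate = Int
  type SQLTimestamp = Long

  def dateAddDays(date: SQLDate, days: Int): SQLDate = date + days
}
{code}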









[jira] [Assigned] (SPARK-9628) Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9628:
---

Assignee: Apache Spark

 Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils
 

 Key: SPARK-9628
 URL: https://issues.apache.org/jira/browse/SPARK-9628
 Project: Spark
  Issue Type: Improvement
Reporter: Yijie Shen
Assignee: Apache Spark








[jira] [Commented] (SPARK-9628) Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654900#comment-14654900
 ] 

Apache Spark commented on SPARK-9628:
-

User 'yjshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7953

 Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils
 

 Key: SPARK-9628
 URL: https://issues.apache.org/jira/browse/SPARK-9628
 Project: Spark
  Issue Type: Improvement
Reporter: Yijie Shen








[jira] [Assigned] (SPARK-9628) Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9628:
---

Assignee: (was: Apache Spark)

 Rename Int and Long to SQLDate SQLTimestamp In DateTimeUtils
 

 Key: SPARK-9628
 URL: https://issues.apache.org/jira/browse/SPARK-9628
 Project: Spark
  Issue Type: Improvement
Reporter: Yijie Shen








[jira] [Updated] (SPARK-6212) The EXPLAIN output of CTAS only shows the analyzed plan

2015-08-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6212:
---
Assignee: Yijie Shen

 The EXPLAIN output of CTAS only shows the analyzed plan
 ---

 Key: SPARK-6212
 URL: https://issues.apache.org/jira/browse/SPARK-6212
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yin Huai
Assignee: Yijie Shen

 When you try
 {code}
 sql("explain extended create table parquet2 as select * from parquet1").collect.foreach(println)
 {code}
 The output will be 
 {code}
 [== Parsed Logical Plan ==]
 ['CreateTableAsSelect None, parquet2, false, Some(TOK_CREATETABLE)]
 [ 'Project [*]]
 [  'UnresolvedRelation [parquet1], None]
 []
 [== Analyzed Logical Plan ==]
 [CreateTableAsSelect [Database:default, TableName: parquet2, 
 InsertIntoHiveTable]]
 [Project [str#44]]
 [ Subquery parquet1]
 [  Relation[str#44] 
 ParquetRelation2(List(file:/user/hive/warehouse/parquet1),Map(serialization.format
  -> 1, path -> 
 file:/user/hive/warehouse/parquet1),Some(StructType(StructField(str,StringType,true))),None)]
 []
 []
 [== Optimized Logical Plan ==]
 [CreateTableAsSelect [Database:default, TableName: parquet2, 
 InsertIntoHiveTable]]
 [Project [str#44]]
 [ Subquery parquet1]
 [  Relation[str#44] 
 ParquetRelation2(List(file:/user/hive/warehouse/parquet1),Map(serialization.format
  -> 1, path -> 
 file:/user/hive/warehouse/parquet1),Some(StructType(StructField(str,StringType,true))),None)]
 []
 []
 [== Physical Plan ==]
 [ExecutedCommand (CreateTableAsSelect [Database:default, TableName: parquet2, 
 InsertIntoHiveTable]]
 [Project [str#44]]
 [ Subquery parquet1]
 [  Relation[str#44] 
 ParquetRelation2(List(file:/user/hive/warehouse/parquet1),Map(serialization.format
  -> 1, path -> 
 file:/user/hive/warehouse/parquet1),Some(StructType(StructField(str,StringType,true))),None)]
 [)]
 []
 [Code Generation: false]
 [== RDD ==]
 {code}
 The query plans of the SELECT clause shown in the Optimized Logical Plan and the 
 Physical Plan are actually the analyzed plan.
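
 A hedged way to inspect the plans that should have been printed (illustrative; it 
 covers only the inner SELECT, not the CTAS wrapper):
 {code}
 val selectPart = sql("select * from parquet1")
 // queryExecution exposes the analyzed, optimized and physical plans separately,
 // which is what EXPLAIN EXTENDED is expected to show for the inner query.
 println(selectPart.queryExecution.analyzed)
 println(selectPart.queryExecution.optimizedPlan)
 println(selectPart.queryExecution.executedPlan)
 {code}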






[jira] [Created] (SPARK-9629) Client session timed out, have not heard from server in

2015-08-05 Thread zengqiuyang (JIRA)
zengqiuyang created SPARK-9629:
--

 Summary:  Client session timed out, have not heard from server in
 Key: SPARK-9629
 URL: https://issues.apache.org/jira/browse/SPARK-9629
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.4.1, 1.4.0
 Environment: spark1.4.1./make-distribution.sh --tgz 
-Dhadoop.version=2.5.2 -Dyarn.version=2.5.2 -Phive -Phive-thriftserver  -Pyarn  
zookeeper-3.4.6.tar.gz 
Reporter: zengqiuyang
Priority: Critical


The Spark standalone HA setup runs for a few days, and then "Client session timed 
out" appears. The log shows a reconnect attempt, but the reconnect does not 
succeed and the master shuts down.
Logs:
 15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Client session timed out, have 
not heard from server in 37753ms for sessionid 0x34ee39684b70005, closing 
socket connection and attempting reconnect
15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: SUSPENDED
15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no 
ConnectionStateListeners registered.
15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Opening socket connection to 
server h5/192.168.0.18:2181. Will not attempt to authenticate using SASL 
(unknown error)
15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Socket connection established to 
h5/192.168.0.18:2181, initiating session
15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Session establishment complete on 
server h5/192.168.0.18:2181, sessionid = 0x34ee39684b70005, negotiated timeout 
= 4
15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: RECONNECTED
15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no 
ConnectionStateListeners registered.
15/08/05 05:32:58 INFO zookeeper.ClientCnxn: Client session timed out, have not 
heard from server in 37753ms for sessionid 0x34ee39684b70006, closing socket 
connection and attempting reconnect
15/08/05 05:32:58 INFO state.ConnectionStateManager: State change: SUSPENDED
15/08/05 05:32:58 INFO master.ZooKeeperLeaderElectionAgent: We have lost 
leadership
15/08/05 05:32:58 ERROR master.Master: Leadership has been revoked -- master 
shutting down.
15/08/05 05:32:58 INFO util.Utils: Shutdown hook called






[jira] [Updated] (SPARK-9617) Implement json_tuple

2015-08-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9617:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-9571

 Implement json_tuple
 

 Key: SPARK-9617
 URL: https://issues.apache.org/jira/browse/SPARK-9617
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Nathan Howell
Priority: Minor

 Provide a native Spark implementation for {{json_tuple}}
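
 For reference, a hedged usage sketch of the Hive UDTF being reimplemented (table 
 and column names are illustrative, and it assumes a Hive-enabled SQLContext):
 {code}
 // json_tuple extracts several top-level fields from a JSON string in one pass,
 // avoiding repeated get_json_object calls.
 sqlContext.sql("""
   SELECT t.a, t.b
   FROM logs
   LATERAL VIEW json_tuple(logs.json_col, 'a', 'b') t AS a, b
 """).show()
 {code}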






[jira] [Updated] (SPARK-9618) SQLContext.read.schema().parquet() ignores the supplied schema

2015-08-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9618:
---
Assignee: Nathan Howell
Target Version/s: 1.5.0

 SQLContext.read.schema().parquet() ignores the supplied schema
 --

 Key: SPARK-9618
 URL: https://issues.apache.org/jira/browse/SPARK-9618
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Nathan Howell
Assignee: Nathan Howell
Priority: Minor

 If a user supplies a schema when loading a Parquet file, it is ignored and the 
 schema is read off disk instead.
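
 A hedged sketch of the call in question (the path and field name are illustrative):
 {code}
 import org.apache.spark.sql.types.{StringType, StructField, StructType}

 val userSchema = StructType(Seq(StructField("name", StringType, nullable = true)))
 val df = sqlContext.read.schema(userSchema).parquet("/path/to/data.parquet")
 // Expected: df.schema == userSchema; observed: the schema comes from the
 // Parquet footer on disk and the supplied schema is silently dropped.
 df.printSchema()
 {code}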






[jira] [Commented] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix

2015-08-05 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654906#comment-14654906
 ] 

Mike Dusenberry commented on SPARK-6488:


I'd like to work on this one as well.

 Support addition/multiplication in PySpark's BlockMatrix
 

 Key: SPARK-6488
 URL: https://issues.apache.org/jira/browse/SPARK-6488
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng

 This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We 
 should reuse the Scala implementation instead of having a separate 
 implementation in Python.
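
 A hedged sketch of the existing Scala API that the Python wrapper would delegate 
 to through Py4J (the wrapper itself is not shown):
 {code}
 import org.apache.spark.mllib.linalg.distributed.BlockMatrix

 // Both operations already exist on the Scala side; PySpark only needs thin
 // wrappers that invoke them on the underlying JVM objects.
 def addAndMultiply(a: BlockMatrix, b: BlockMatrix): (BlockMatrix, BlockMatrix) =
   (a.add(b), a.multiply(b))
 {code}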






[jira] [Updated] (SPARK-8930) Throw a AnalysisException with meaningful messages when DataFrame#explode takes a star in expressions

2015-08-05 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-8930:

Summary: Throw a AnalysisException with meaningful messages when 
DataFrame#explode takes a star in expressions  (was: Support a star '*' in 
generator function arguments)

 Throw a AnalysisException with meaningful messages when DataFrame#explode 
 takes a star in expressions
 -

 Key: SPARK-8930
 URL: https://issues.apache.org/jira/browse/SPARK-8930
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Takeshi Yamamuro

 The current implementation throws an exception if generators contain a star 
 '*', like the code below:
 {code}
 val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
 checkAnswer(
   df.explode($"*") { case Row(prefix: String, csv: String) =>
     csv.split(",").map(v => Tuple1(prefix + ":" + v))
   },
   Row(1, "1,2", "1:1") :: Row(1, "1,2", "1:2")
     :: Row(2, "4", "2:4")
     :: Row(3, "7,8,9", "3:7") :: Row(3, "7,8,9", "3:8") :: Row(3, "7,8,9", "3:9")
     :: Nil
 )
 {code}
 {code}
 [info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
 [info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
 input columns prefix, csv;
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
 [info]   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
 [info]   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
 [info]   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
 21)
 [info]   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 [info]   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 [info]   at scala.collection.immutable.List.foreach(List.scala:318)
 [info]   at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 [info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 {code}






[jira] [Updated] (SPARK-8930) Throw a AnalysisException with meaningful messages when DataFrame#explode takes a star in expressions

2015-08-05 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-8930:

Description: 
The current implementation throws an exception with meaningless messages if 
DataFrame#explode contains a star '*', like the code below:

{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
    csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row(1, "1,2", "1:1") :: Row(1, "1,2", "1:2")
    :: Row(2, "4", "2:4")
    :: Row(3, "7,8,9", "3:7") :: Row(3, "7,8,9", "3:8") :: Row(3, "7,8,9", "3:9")
    :: Nil
)
{code}

{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
input columns prefix, csv;
[info]   at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
21)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}

  was:
The current implementation throws an exception if generators contain a star '*', 
like the code below:

{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
    csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row(1, "1,2", "1:1") :: Row(1, "1,2", "1:2")
    :: Row(2, "4", "2:4")
    :: Row(3, "7,8,9", "3:7") :: Row(3, "7,8,9", "3:8") :: Row(3, "7,8,9", "3:9")
    :: Nil
)
{code}

{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
input columns prefix, csv;
[info]   at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
21)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}


 Throw a AnalysisException with meaningful messages when DataFrame#explode 
 takes a star in expressions
 

[jira] [Updated] (SPARK-8930) Throw a AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions

2015-08-05 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-8930:

Summary: Throw a AnalysisException with meaningful messages if 
DataFrame#explode takes a star in expressions  (was: Throw a AnalysisException 
with meaningful messages when DataFrame#explode takes a star in expressions)

 Throw a AnalysisException with meaningful messages if DataFrame#explode takes 
 a star in expressions
 ---

 Key: SPARK-8930
 URL: https://issues.apache.org/jira/browse/SPARK-8930
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Takeshi Yamamuro

 The current implementation throws an exception with meaningless messages if 
 DataFrame#explode contains a star '*', like the code below:
 {code}
 val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
 checkAnswer(
   df.explode($"*") { case Row(prefix: String, csv: String) =>
     csv.split(",").map(v => Tuple1(prefix + ":" + v))
   },
   Row(1, "1,2", "1:1") :: Row(1, "1,2", "1:2")
     :: Row(2, "4", "2:4")
     :: Row(3, "7,8,9", "3:7") :: Row(3, "7,8,9", "3:8") :: Row(3, "7,8,9", "3:9")
     :: Nil
 )
 {code}
 {code}
 [info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
 [info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
 input columns prefix, csv;
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
 [info]   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
 [info]   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
 [info]   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
 21)
 [info]   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 [info]   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 [info]   at scala.collection.immutable.List.foreach(List.scala:318)
 [info]   at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 [info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 {code}






[jira] [Updated] (SPARK-8930) Throw a AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions

2015-08-05 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-8930:

Description: 
The current implementation throws an exception with meaningless messages if 
DataFrame#explode contains a star '*' (ISTM that explode cannot take a star in 
its expressions), like the code below:

{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
df.explode($"*") { case Row(prefix: String, csv: String) =>
  csv.split(",").map(v => Tuple1(prefix + ":" + v))
}
{code}

{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
input columns prefix, csv;
[info]   at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
{code}

  was:
The current implementation throws an exception with meaningless messages if 
DataFrame#explode contains a star '*', like the code below:

{code}
val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
    csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row(1, "1,2", "1:1") :: Row(1, "1,2", "1:2")
    :: Row(2, "4", "2:4")
    :: Row(3, "7,8,9", "3:7") :: Row(3, "7,8,9", "3:8") :: Row(3, "7,8,9", "3:9")
    :: Nil
)
{code}

{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
input columns prefix, csv;
[info]   at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
21)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}


 Throw a AnalysisException with meaningful messages if DataFrame#explode takes 
 a star in expressions
 ---

 Key: SPARK-8930
 URL: https://issues.apache.org/jira/browse/SPARK-8930
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Takeshi Yamamuro

 The current implementation throws an exception with meaningless messages if 
 DataFrame#explode contains a star '*' (ISTM that explode cannot take a star in 
 its expressions), like the code below:
 {code}
 val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
 df.explode($"*") { case Row(prefix: String, csv: String) =>
   csv.split(",").map(v => Tuple1(prefix + ":" + v))
 }
 {code}
 {code}
 [info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
 [info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
 input columns prefix, csv;
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
 {code}




[jira] [Updated] (SPARK-8930) Throw a AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions

2015-08-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8930:
---
Target Version/s: 1.5.0

 Throw a AnalysisException with meaningful messages if DataFrame#explode takes 
 a star in expressions
 ---

 Key: SPARK-8930
 URL: https://issues.apache.org/jira/browse/SPARK-8930
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Takeshi Yamamuro

 The current implementation throws an exception with meaningless messages if 
 DataFrame#explode contains a star '*' (ISTM that explode cannot take a star in 
 its expressions), like the code below:
 {code}
 val df = Seq((1, "1,2"), (2, "4"), (3, "7,8,9")).toDF("prefix", "csv")
 df.explode($"*") { case Row(prefix: String, csv: String) =>
   csv.split(",").map(v => Tuple1(prefix + ":" + v))
 }
 {code}
 {code}
 [info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
 [info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
 input columns prefix, csv;
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
 {code}
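
 A hedged sketch of the requested behaviour (illustrative only, not the actual 
 patch): detect the star up front and fail with an explicit message instead of the 
 confusing "cannot resolve '_1'" error above.
 {code}
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.analysis.UnresolvedStar
 import org.apache.spark.sql.catalyst.expressions.Expression

 // Hypothetical guard that DataFrame#explode could apply to its input expressions
 // (assumed to live inside the org.apache.spark.sql package, where the
 // AnalysisException constructor is accessible).
 def failOnStar(exprs: Seq[Expression]): Unit =
   if (exprs.exists(_.isInstanceOf[UnresolvedStar])) {
     throw new AnalysisException(
       "DataFrame#explode does not support a star ('*') in its input expressions")
   }
 {code}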






[jira] [Updated] (SPARK-9629) Client session timed out, have not heard from server in

2015-08-05 Thread zengqiuyang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zengqiuyang updated SPARK-9629:
---
Environment: 
spark1.4.1./make-distribution.sh --tgz -Dhadoop.version=2.5.2 
-Dyarn.version=2.5.2 -Phive -Phive-thriftserver  -Pyarn  
zookeeper-3.4.6.tar.gz 
standalone HA
Linux version 2.6.32-358.el6.x86_64 (mockbu...@c6b8.bsys.dev.centos.org) (gcc 
version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri Feb 22 00:31:26 UTC 
2013

  was:
spark1.4.1./make-distribution.sh --tgz -Dhadoop.version=2.5.2 
-Dyarn.version=2.5.2 -Phive -Phive-thriftserver  -Pyarn  
zookeeper-3.4.6.tar.gz 
standalone HA


  Client session timed out, have not heard from server in
 

 Key: SPARK-9629
 URL: https://issues.apache.org/jira/browse/SPARK-9629
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.4.0, 1.4.1
 Environment: spark1.4.1./make-distribution.sh --tgz 
 -Dhadoop.version=2.5.2 -Dyarn.version=2.5.2 -Phive -Phive-thriftserver  
 -Pyarn  
 zookeeper-3.4.6.tar.gz 
 standalone HA
 Linux version 2.6.32-358.el6.x86_64 (mockbu...@c6b8.bsys.dev.centos.org) (gcc 
 version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri Feb 22 00:31:26 
 UTC 2013
Reporter: zengqiuyang
Priority: Critical

 The Spark standalone HA setup runs for a few days, and then "Client session 
 timed out" appears. The log shows a reconnect attempt, but the reconnect does 
 not succeed and the master shuts down.
 Logs:
  15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Client session timed out, have 
 not heard from server in 37753ms for sessionid 0x34ee39684b70005, closing 
 socket connection and attempting reconnect
 15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: SUSPENDED
 15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no 
 ConnectionStateListeners registered.
 15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Opening socket connection to 
 server h5/192.168.0.18:2181. Will not attempt to authenticate using SASL 
 (unknown error)
 15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Socket connection established to 
 h5/192.168.0.18:2181, initiating session
 15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Session establishment complete 
 on server h5/192.168.0.18:2181, sessionid = 0x34ee39684b70005, negotiated 
 timeout = 4
 15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: RECONNECTED
 15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no 
 ConnectionStateListeners registered.
 15/08/05 05:32:58 INFO zookeeper.ClientCnxn: Client session timed out, have 
 not heard from server in 37753ms for sessionid 0x34ee39684b70006, closing 
 socket connection and attempting reconnect
 15/08/05 05:32:58 INFO state.ConnectionStateManager: State change: SUSPENDED
 15/08/05 05:32:58 INFO master.ZooKeeperLeaderElectionAgent: We have lost 
 leadership
 15/08/05 05:32:58 ERROR master.Master: Leadership has been revoked -- master 
 shutting down.
 15/08/05 05:32:58 INFO util.Utils: Shutdown hook called






[jira] [Created] (SPARK-9630) Cleanup Hybrid Aggregate Operator.

2015-08-05 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9630:
---

 Summary: Cleanup Hybrid Aggregate Operator.
 Key: SPARK-9630
 URL: https://issues.apache.org/jira/browse/SPARK-9630
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker


This is the follow-up of SPARK-9240 to address review comments and clean up 
code.






[jira] [Assigned] (SPARK-9630) Cleanup Hybrid Aggregate Operator.

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9630:
---

Assignee: Apache Spark  (was: Yin Huai)

 Cleanup Hybrid Aggregate Operator.
 --

 Key: SPARK-9630
 URL: https://issues.apache.org/jira/browse/SPARK-9630
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Apache Spark
Priority: Blocker

 This is the follow-up of SPARK-9240 to address review comments and clean up 
 code.






[jira] [Assigned] (SPARK-9630) Cleanup Hybrid Aggregate Operator.

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9630:
---

Assignee: Yin Huai  (was: Apache Spark)

 Cleanup Hybrid Aggregate Operator.
 --

 Key: SPARK-9630
 URL: https://issues.apache.org/jira/browse/SPARK-9630
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker

 This is the follow-up of SPARK-9240 to address review comments and clean up 
 code.






[jira] [Commented] (SPARK-9630) Cleanup Hybrid Aggregate Operator.

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654924#comment-14654924
 ] 

Apache Spark commented on SPARK-9630:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/7954

 Cleanup Hybrid Aggregate Operator.
 --

 Key: SPARK-9630
 URL: https://issues.apache.org/jira/browse/SPARK-9630
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker

 This is the follow-up of SPARK-9240 to address review comments and clean up 
 code.






[jira] [Commented] (SPARK-9240) Hybrid aggregate operator using unsafe row

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654925#comment-14654925
 ] 

Apache Spark commented on SPARK-9240:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/7954

 Hybrid aggregate operator using unsafe row
 --

 Key: SPARK-9240
 URL: https://issues.apache.org/jira/browse/SPARK-9240
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.5.0


 We need a hybrid aggregate operator, which first tries hash-based 
 aggregation and gracefully switches to sort-based aggregation if the hash 
 map's memory footprint exceeds a given threshold (how do we track the memory 
 footprint, and how do we set the threshold?).
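
 A hedged sketch of the control flow (plain-Scala pseudocode for illustration, not 
 the operator's implementation; the entry-count threshold stands in for whatever 
 memory-tracking mechanism is chosen):
 {code}
 // Hash-aggregate until the map grows past maxEntries, then fall back to
 // sort-based aggregation over the partial results plus the remaining input.
 def hybridAggregate[K: Ordering, V](rows: Iterator[(K, V)],
                                     merge: (V, V) => V,
                                     maxEntries: Int): Iterator[(K, V)] = {
   val hashMap = scala.collection.mutable.HashMap.empty[K, V]
   while (rows.hasNext && hashMap.size <= maxEntries) {
     val (k, v) = rows.next()
     hashMap(k) = hashMap.get(k).map(merge(_, v)).getOrElse(v)
   }
   if (!rows.hasNext) {
     hashMap.iterator                     // everything fit: pure hash aggregation
   } else {
     // Fallback: sort partial results and the rest of the input by key, then
     // merge adjacent entries; a real operator would spill to disk here.
     val sorted = (hashMap.iterator ++ rows).toSeq.sortBy(_._1)
     sorted.foldLeft(List.empty[(K, V)]) {
       case ((k0, v0) :: tail, (k, v)) if k0 == k => (k0, merge(v0, v)) :: tail
       case (acc, kv)                             => kv :: acc
     }.reverseIterator
   }
 }
 {code}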






[jira] [Created] (SPARK-9631) Giant pile of parquet log when trying to read local data

2015-08-05 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9631:
--

 Summary: Giant pile of parquet log when trying to read local data
 Key: SPARK-9631
 URL: https://issues.apache.org/jira/browse/SPARK-9631
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Lian


When I read a Parquet file, I got the following

{code}
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized 
will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
block
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 
0 ms. row count = 2097152
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized 
will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
block
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 
0 ms. row count = 2097152
Aug 5, 2015 12:13:36 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: 
Can not initialize counter due to context is not a instance of 
TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:36 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: 
Can not initialize counter due to context is not a instance of 
TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized 
will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized 
will read a total of 2097152 records.
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
block
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
block
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 
0 ms. row count = 2097152
Aug 5, 2015 12:13:36 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 
0 ms. row count = 2097152
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: 
Can not initialize counter due to context is not a instance of 
TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: 
Can not initialize counter due to context is not a instance of 
TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized 
will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized 
will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
block
Aug 5, 2015 12:13:53 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
block
Aug 5, 2015 12:13:53 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 
0 ms. row count = 2097152
Aug 5, 2015 12:13:53 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 
0 ms. row count = 2097152
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: 
Can not initialize counter due to context is not a instance of 
TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: 
Can not initialize counter due to context is not a instance of 
TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Aug 5, 2015 12:13:53 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized 
will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized 
will read a total of 2097152 records.
Aug 5, 2015 12:13:53 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
block
Aug 5, 2015 12:13:53 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
block
Aug 5, 2015 12:13:53 AM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 
0 ms. row count = 2097152
Aug 5, 2015 12:13:53 AM INFO: 

[jira] [Commented] (SPARK-9631) Giant pile of parquet log when trying to read local data

2015-08-05 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654933#comment-14654933
 ] 

Reynold Xin commented on SPARK-9631:


FYI this was running PySpark.


 Giant pile of parquet log when trying to read local data
 

 Key: SPARK-9631
 URL: https://issues.apache.org/jira/browse/SPARK-9631
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Lian

 When I read a Parquet file, I got the following
 {code}
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:36 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:36 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at 

[jira] [Resolved] (SPARK-9215) Implement WAL-free Kinesis receiver that give at-least once guarantee

2015-08-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9215.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Implement WAL-free Kinesis receiver that gives an at-least-once guarantee
 -

 Key: SPARK-9215
 URL: https://issues.apache.org/jira/browse/SPARK-9215
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.4.1
Reporter: Tathagata Das
Assignee: Tathagata Das
 Fix For: 1.5.0


 Currently, the KinesisReceiver can lose some data in the case of certain 
 failures (receiver and driver failures). Using the write ahead logs can 
 mitigate part of the problem, but it is not ideal because WALs don't work well with 
 S3 (eventual consistency, etc.), which is the most likely file system to be 
 used in the EC2 environment. Hence, we have to take a different approach to 
 improving reliability for Kinesis.
 Detailed design doc - 
 https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit?usp=sharing
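 A rough sketch of the idea shared with SPARK-9217 (names and fields below are assumptions 
 for illustration only, not Spark's actual classes; see the design doc above for the real 
 details): each generated block records the Kinesis sequence-number ranges it was built from, 
 so a lost block can be re-read from Kinesis itself instead of from a write ahead log.
 {code}
 // Hypothetical illustration only -- these case classes are assumptions, not Spark's API.
 case class SequenceNumberRange(
     streamName: String,
     shardId: String,
     fromSeqNumber: String,
     toSeqNumber: String)

 // Stored alongside each block instead of writing the payload to a WAL; on failure the
 // block is rebuilt by replaying these ranges from Kinesis (shard iterator positioned
 // at fromSeqNumber, reading until toSeqNumber).
 case class KinesisBlockMetadata(blockId: String, ranges: Seq[SequenceNumberRange])
 {code}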



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9217) Update Kinesis Receiver to record sequence numbers

2015-08-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9217.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Update Kinesis Receiver to record sequence numbers
 --

 Key: SPARK-9217
 URL: https://issues.apache.org/jira/browse/SPARK-9217
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6599) Improve usability and reliability of Kinesis stream

2015-08-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-6599:
-
Issue Type: Epic  (was: Improvement)

 Improve usability and reliability of Kinesis stream
 ---

 Key: SPARK-6599
 URL: https://issues.apache.org/jira/browse/SPARK-6599
 Project: Spark
  Issue Type: Epic
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das

 Usability improvements: 
 API improvements, AWS SDK upgrades, etc.
 Reliability improvements:
 Currently, the KinesisReceiver can lose some data in the case of certain 
 failures (receiver and driver failures). Using the write ahead logs can 
 mitigate part of the problem, but it is not ideal because WALs don't work well with 
 S3 (eventual consistency, etc.), which is the most likely file system to be 
 used in the EC2 environment. Hence, we have to take a different approach to 
 improving reliability for Kinesis. See 
 https://issues.apache.org/jira/browse/SPARK-9215 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6599) Improve usability and reliability of Kinesis stream

2015-08-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-6599:
-
Issue Type: Umbrella  (was: Epic)

 Improve usability and reliability of Kinesis stream
 ---

 Key: SPARK-6599
 URL: https://issues.apache.org/jira/browse/SPARK-6599
 Project: Spark
  Issue Type: Umbrella
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das

 Usability improvements: 
 API improvements, AWS SDK upgrades, etc.
 Reliability improvements:
 Currently, the KinesisReceiver can lose some data in the case of certain 
 failures (receiver and driver failures). Using the write ahead logs can 
 mitigate part of the problem, but it is not ideal because WALs don't work well with 
 S3 (eventual consistency, etc.), which is the most likely file system to be 
 used in the EC2 environment. Hence, we have to take a different approach to 
 improving reliability for Kinesis. See 
 https://issues.apache.org/jira/browse/SPARK-9215 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9632) update InternalRow.toSeq to make it accept data type info

2015-08-05 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-9632:
--

 Summary: update InternalRow.toSeq to make it accept data type info
 Key: SPARK-9632
 URL: https://issues.apache.org/jira/browse/SPARK-9632
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9632) update InternalRow.toSeq to make it accept data type info

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9632:
---

Assignee: (was: Apache Spark)

 update InternalRow.toSeq to make it accept data type info
 -

 Key: SPARK-9632
 URL: https://issues.apache.org/jira/browse/SPARK-9632
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9632) update InternalRow.toSeq to make it accept data type info

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654942#comment-14654942
 ] 

Apache Spark commented on SPARK-9632:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7955

 update InternalRow.toSeq to make it accept data type info
 -

 Key: SPARK-9632
 URL: https://issues.apache.org/jira/browse/SPARK-9632
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9632) update InternalRow.toSeq to make it accept data type info

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9632:
---

Assignee: Apache Spark

 update InternalRow.toSeq to make it accept data type info
 -

 Key: SPARK-9632
 URL: https://issues.apache.org/jira/browse/SPARK-9632
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9631) Giant pile of parquet log when trying to read local data

2015-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654953#comment-14654953
 ] 

Sean Owen commented on SPARK-9631:
--

Is this fixed by https://issues.apache.org/jira/browse/SPARK-8118, or supposed 
to be? It might be the same report either way.
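
A hedged workaround sketch in the meantime (not the fix tracked here): the noise comes from 
Parquet's use of java.util.logging, so raising the level of the "org.apache.parquet" JUL 
logger in each JVM should hide the INFO/WARNING records quoted below. The logger name is 
inferred from the messages themselves.
{code}
import java.util.logging.{Level, Logger => JulLogger}

// Keep a strong reference: the LogManager holds JUL loggers only weakly, so an
// unreferenced logger can be garbage-collected and silently lose its level setting.
val parquetJulLogger = JulLogger.getLogger("org.apache.parquet")
parquetJulLogger.setLevel(Level.SEVERE)       // drop INFO and WARNING records
parquetJulLogger.setUseParentHandlers(false)  // don't forward records to the root console handler
{code}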

 Giant pile of parquet log when trying to read local data
 

 Key: SPARK-9631
 URL: https://issues.apache.org/jira/browse/SPARK-9631
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Lian

 When I read a Parquet file, I got the following
 {code}
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:36 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:36 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 

[jira] [Resolved] (SPARK-9581) Add test for JSON UDTs

2015-08-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9581.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Add test for JSON UDTs
 --

 Key: SPARK-9581
 URL: https://issues.apache.org/jira/browse/SPARK-9581
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9629) Client session timed out, have not heard from server in

2015-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654959#comment-14654959
 ] 

Sean Owen commented on SPARK-9629:
--

This points to a problem with your ZK broker. Have you investigated that first?

  Client session timed out, have not heard from server in
 

 Key: SPARK-9629
 URL: https://issues.apache.org/jira/browse/SPARK-9629
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.4.0, 1.4.1
 Environment: spark1.4.1./make-distribution.sh --tgz 
 -Dhadoop.version=2.5.2 -Dyarn.version=2.5.2 -Phive -Phive-thriftserver  
 -Pyarn  
 zookeeper-3.4.6.tar.gz 
 standalone HA
 Linux version 2.6.32-358.el6.x86_64 (mockbu...@c6b8.bsys.dev.centos.org) (gcc 
 version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri Feb 22 00:31:26 
 UTC 2013
Reporter: zengqiuyang
Priority: Critical

 The Spark standalone HA setup runs for a few days, then "Client session timed out" 
 messages appear.
 The log shows it attempting to reconnect, but it does not, and the master shuts down.
 logs:
  15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Client session timed out, have 
 not heard from server in 37753ms for sessionid 0x34ee39684b70005, closing 
 socket connection and attempting reconnect
 15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: SUSPENDED
 15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no 
 ConnectionStateListeners registered.
 15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Opening socket connection to 
 server h5/192.168.0.18:2181. Will not attempt to authenticate using SASL 
 (unknown error)
 15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Socket connection established to 
 h5/192.168.0.18:2181, initiating session
 15/08/05 05:32:57 INFO zookeeper.ClientCnxn: Session establishment complete 
 on server h5/192.168.0.18:2181, sessionid = 0x34ee39684b70005, negotiated 
 timeout = 4
 15/08/05 05:32:57 INFO state.ConnectionStateManager: State change: RECONNECTED
 15/08/05 05:32:57 WARN state.ConnectionStateManager: There are no 
 ConnectionStateListeners registered.
 15/08/05 05:32:58 INFO zookeeper.ClientCnxn: Client session timed out, have 
 not heard from server in 37753ms for sessionid 0x34ee39684b70006, closing 
 socket connection and attempting reconnect
 15/08/05 05:32:58 INFO state.ConnectionStateManager: State change: SUSPENDED
 15/08/05 05:32:58 INFO master.ZooKeeperLeaderElectionAgent: We have lost 
 leadership
 15/08/05 05:32:58 ERROR master.Master: Leadership has been revoked -- master 
 shutting down.
 15/08/05 05:32:58 INFO util.Utils: Shutdown hook called



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9625) SparkILoop creates sql context continuously, thousands of times

2015-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9625:
-
Component/s: SQL

Did you say this is reproducible -- how do you do it?

 SparkILoop creates sql context continuously, thousands of times
 ---

 Key: SPARK-9625
 URL: https://issues.apache.org/jira/browse/SPARK-9625
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql

 Occasionally but repeatably, based on the Spark SQL operations being run, 
 {{spark-shell}} gets into a funk where it attempts to create a sql context 
 over and over again as it is doing its work. Example output below:
 {code}
 15/08/05 03:04:12 INFO DAGScheduler: looking for newly runnable stages
 15/08/05 03:04:12 INFO DAGScheduler: running: Set()
 15/08/05 03:04:12 INFO DAGScheduler: waiting: Set(ShuffleMapStage 7, 
 ResultStage 8)
 15/08/05 03:04:12 INFO DAGScheduler: failed: Set()
 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ShuffleMapStage 7: 
 List()
 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ResultStage 8: 
 List(ShuffleMapStage 7)
 15/08/05 03:04:12 INFO DAGScheduler: Submitting ShuffleMapStage 7 
 (MapPartitionsRDD[49] at map at console:474), which is now runnable
 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(47840) called with 
 curMem=685306, maxMem=26671746908
 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12 stored as values in 
 memory (estimated size 46.7 KB, free 24.8 GB)
 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(15053) called with 
 curMem=733146, maxMem=26671746908
 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes 
 in memory (estimated size 14.7 KB, free 24.8 GB)
 15/08/05 03:04:12 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory 
 on localhost:39451 (size: 14.7 KB, free: 24.8 GB)
 15/08/05 03:04:12 INFO SparkContext: Created broadcast 12 from broadcast at 
 DAGScheduler.scala:874
 15/08/05 03:04:12 INFO DAGScheduler: Submitting 1 missing tasks from 
 ShuffleMapStage 7 (MapPartitionsRDD[49] at map at console:474)
 15/08/05 03:04:12 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
 15/08/05 03:04:12 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 
 684, localhost, PROCESS_LOCAL, 1461 bytes)
 15/08/05 03:04:12 INFO Executor: Running task 0.0 in stage 7.0 (TID 684)
 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Getting 214 non-empty 
 blocks out of 214 blocks
 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches 
 in 1 ms
 15/08/05 03:04:12 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO HiveMetaStore: No user is added in admin role, since 
 config is empty
 15/08/05 03:04:13 INFO SessionState: No Tez session required at this point. 
 hive.execution.engine=mr.
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL 

[jira] [Closed] (SPARK-9631) Giant pile of parquet log when trying to read local data

2015-08-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-9631.
--
Resolution: Duplicate

 Giant pile of parquet log when trying to read local data
 

 Key: SPARK-9631
 URL: https://issues.apache.org/jira/browse/SPARK-9631
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Lian

 When I read a Parquet file, I got the following
 {code}
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:36 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:36 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:36 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory 
 in 0 ms. row count = 2097152
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM WARNING: 
 org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due 
 to context is not a instance of TaskInputOutputContext, but is 
 org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader 
 initialized will read a total of 2097152 records.
 Aug 5, 2015 12:13:53 AM INFO: 
 org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next 
 block
 Aug 5, 2015 12:13:53 AM INFO: 
 

[jira] [Updated] (SPARK-9627) SQL job failed if the dataframe is cached

2015-08-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9627:
---
Target Version/s: 1.5.0

 SQL job failed if the dataframe is cached
 -

 Key: SPARK-9627
 URL: https://issues.apache.org/jira/browse/SPARK-9627
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Davies Liu
Priority: Critical

 {code}
 import decimal
 import random
 from datetime import date, timedelta
 from pyspark.sql.functions import sum
 from pyspark.sql.types import (StructType, StringType, ShortType, DateType,
                                DecimalType)

 r = random.Random()
 def gen(i):
     d = date.today() - timedelta(r.randint(0, 5000))
     cat = str(r.randint(0, 20)) * 5
     c = r.randint(0, 1000)
     price = decimal.Decimal(r.randint(0, 10)) / 100
     return (d, cat, c, price)
 schema = StructType().add('date', DateType()).add('cat', StringType()) \
     .add('count', ShortType()).add('price', DecimalType(5, 2))
 #df = sqlContext.createDataFrame(sc.range(124).map(gen), schema)
 #df.show()
 #df.write.parquet('sales4')
 df = sqlContext.read.parquet('sales4')
 df.cache()
 df.count()
 df.show()
 print df.schema
 raw_input()
 r = df.groupBy(df.date, df.cat).agg(sum(df['count'] * df.price))
 print r.explain(True)
 r.show()
 {code}
 {code}
 StructType(List(StructField(date,DateType,true),StructField(cat,StringType,true),StructField(count,ShortType,true),StructField(price,DecimalType(5,2),true)))
 == Parsed Logical Plan ==
 'Aggregate [date#0,cat#1], [date#0,cat#1,sum((count#2 * price#3)) AS 
 sum((count * price))#70]
  Relation[date#0,cat#1,count#2,price#3] 
 org.apache.spark.sql.parquet.ParquetRelation@5ec8f315
 == Analyzed Logical Plan ==
 date: date, cat: string, sum((count * price)): decimal(21,2)
 Aggregate [date#0,cat#1], 
 [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
 DecimalType(5,0)), DecimalType(11,2))) * 
 change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * 
 price))#70]
  Relation[date#0,cat#1,count#2,price#3] 
 org.apache.spark.sql.parquet.ParquetRelation@5ec8f315
 == Optimized Logical Plan ==
 Aggregate [date#0,cat#1], 
 [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
 DecimalType(5,0)), DecimalType(11,2))) * 
 change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * 
 price))#70]
  InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
 StorageLevel(true, true, false, true, 1), (PhysicalRDD 
 [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None
 == Physical Plan ==
 NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
 ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
 DecimalType(5,0)), DecimalType(11,2))) * 
 change_decimal_precision(CAST(price#3, 
 DecimalType(11,2)2,mode=Final,isDistinct=false))
  TungstenSort [date#0 ASC,cat#1 ASC], false, 0
   ConvertToUnsafe
Exchange hashpartitioning(date#0,cat#1)
 NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
 ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
 DecimalType(5,0)), DecimalType(11,2))) * 
 change_decimal_precision(CAST(price#3, 
 DecimalType(11,2)2,mode=Partial,isDistinct=false))
  TungstenSort [date#0 ASC,cat#1 ASC], false, 0
   ConvertToUnsafe
InMemoryColumnarTableScan [date#0,cat#1,count#2,price#3], 
 (InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
 StorageLevel(true, true, false, true, 1), (PhysicalRDD 
 [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None)
 Code Generation: true
 == RDD ==
 None
 15/08/04 23:21:53 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; 
 aborting job
 Traceback (most recent call last):
   File t.py, line 34, in module
 r.show()
   File /Users/davies/work/spark/python/pyspark/sql/dataframe.py, line 258, 
 in show
 print(self._jdf.showString(n, truncate))
   File /Users/davies/work/spark/python/lib/py4j/java_gateway.py, line 538, 
 in __call__
 self.target_id, self.name)
   File /Users/davies/work/spark/python/pyspark/sql/utils.py, line 36, in 
 deco
 return f(*a, **kw)
   File /Users/davies/work/spark/python/lib/py4j/protocol.py, line 300, in 
 get_return_value
 format(target_id, '.', name), value)
 py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 
 (TID 10, localhost): java.lang.UnsupportedOperationException: tail of empty 
 list
   at scala.collection.immutable.Nil$.tail(List.scala:339)
   at scala.collection.immutable.Nil$.tail(List.scala:334)
   at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
   at scala.reflect.internal.Symbols$Symbol.typeParams(Symbols.scala:1491)
   at 
 scala.reflect.internal.Types$NoArgsTypeRef.typeParams(Types.scala:2144)
   at 
 scala.reflect.internal.Types$TypeRef.initializedTypeParams(Types.scala:2408)
 

[jira] [Resolved] (SPARK-9621) Closure inside RDD doesn't properly close over environment

2015-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9621.
--
Resolution: Duplicate

Pretty sure this is a subset of the general problem of using case classes in 
the shell. They don't end up being the same class when used this way. I don't 
know if it's a Scala shell thing or not, and I am not aware of a solution other 
than "don't use case classes in the shell".

 Closure inside RDD doesn't properly close over environment
 --

 Key: SPARK-9621
 URL: https://issues.apache.org/jira/browse/SPARK-9621
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.1
 Environment: Ubuntu 15.04, spark-1.4.1-bin-hadoop2.6 package
Reporter: Joe Near

 I expect the following:
 case class MyTest(i: Int)
 val tv = MyTest(1)
 val res = sc.parallelize(Array((t: MyTest) => t == tv)).first()(tv)
 to be true. It is false when I type this into spark-shell. It seems the 
 closure is changed somehow when it is serialized and deserialized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9601) Join example fix in streaming-programming-guide.md

2015-08-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9601.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Join example fix in streaming-programming-guide.md
 --

 Key: SPARK-9601
 URL: https://issues.apache.org/jira/browse/SPARK-9601
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.1
Reporter: Jayant Shekhar
Priority: Trivial
 Fix For: 1.5.0


 Stream-Stream Join has the following signature for Java in the guide:
 JavaPairDStream<String, String> joinedStream = stream1.join(stream2);
 It should be:
 JavaPairDStream<String, Tuple2<String, String>> joinedStream = stream1.join(stream2);
 Same for windowed stream join. It should be:
 JavaPairDStream<String, Tuple2<String, String>> joinedStream = windowedStream1.join(windowedStream2);
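 For reference, a minimal Scala sketch of the same point (not part of the guide fix itself):
 joining two key-value DStreams pairs up the values per key, which is why the Java value
 type above has to be Tuple2<String, String>.
 {code}
 import org.apache.spark.streaming.dstream.DStream

 // The joined stream carries, for each key, a pair of the two input values.
 def joinStreams(
     stream1: DStream[(String, String)],
     stream2: DStream[(String, String)]): DStream[(String, (String, String))] =
   stream1.join(stream2)
 {code}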



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9601) Join example fix in streaming-programming-guide.md

2015-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9601:
-
Assignee: Namit Katariya

 Join example fix in streaming-programming-guide.md
 --

 Key: SPARK-9601
 URL: https://issues.apache.org/jira/browse/SPARK-9601
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.1
Reporter: Jayant Shekhar
Assignee: Namit Katariya
Priority: Trivial
 Fix For: 1.5.0


 Stream-Stream Join has the following signature for Java in the guide:
 JavaPairDStream<String, String> joinedStream = stream1.join(stream2);
 It should be:
 JavaPairDStream<String, Tuple2<String, String>> joinedStream = stream1.join(stream2);
 Same for windowed stream join. It should be:
 JavaPairDStream<String, Tuple2<String, String>> joinedStream = windowedStream1.join(windowedStream2);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6107) event log file ends with .inprogress should be able to display on webUI for standalone mode

2015-08-05 Thread kumar deepak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655002#comment-14655002
 ] 

kumar deepak commented on SPARK-6107:
-

Is there a plan to fix it in 1.3.1?

 event log file ends with .inprogress should be able to display on webUI for 
 standalone mode
 ---

 Key: SPARK-6107
 URL: https://issues.apache.org/jira/browse/SPARK-6107
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.2.1
Reporter: Zhang, Liye
Assignee: Zhang, Liye
 Fix For: 1.4.0


 When an application finishes running abnormally (Ctrl + C, for example), the 
 history event log file still ends with the *.inprogress* suffix. The 
 application state then cannot be shown on the web UI; users just see *Application 
 history not found, Application xxx is still in progress*.  
 Users should also be able to see the status of abnormally finished applications.
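 For context, these history files only exist when event logging is turned on; a minimal
 sketch of typical settings (the directory below is illustrative, not from this report):
 {code}
 import org.apache.spark.SparkConf

 // Enables writing the event log that the standalone web UI / history server reads back.
 val conf = new SparkConf()
   .set("spark.eventLog.enabled", "true")
   .set("spark.eventLog.dir", "hdfs:///spark-events")
 {code}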



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9633) SBT download locations outdated; need an update

2015-08-05 Thread Sean Owen (JIRA)
Sean Owen created SPARK-9633:


 Summary: SBT download locations outdated; need an update
 Key: SPARK-9633
 URL: https://issues.apache.org/jira/browse/SPARK-9633
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.4.1, 1.3.1, 1.5.0
Reporter: Sean Owen
Priority: Minor


The SBT download script tries to download from two locations, 
typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; 
the latter redirects to dl.bintray.com now. In fact, bintray seems like the 
only place to download SBT at this point. We should update to reference bintray 
directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9633) SBT download locations outdated; need an update

2015-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9633:
-
Description: 
The SBT download script tries to download from two locations, 
typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; 
the latter redirects to dl.bintray.com now. In fact, bintray seems like the 
only place to download SBT at this point. We should update to reference bintray 
directly.

PS: we should download SBT over HTTPS too, not HTTP

  was:The SBT download script tries to download from two locations, 
typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; 
the latter redirects to dl.bintray.com now. In fact, bintray seems like the 
only place to download SBT at this point. We should update to reference bintray 
directly.


 SBT download locations outdated; need an update
 ---

 Key: SPARK-9633
 URL: https://issues.apache.org/jira/browse/SPARK-9633
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Sean Owen
Priority: Minor

 The SBT download script tries to download from two locations, 
 typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; 
 the latter redirects to dl.bintray.com now. In fact, bintray seems like the 
 only place to download SBT at this point. We should update to reference 
 bintray directly.
 PS: we should download SBT over HTTPS too, not HTTP



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8862) Add a web UI page that visualizes physical plans (SparkPlan)

2015-08-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8862.

   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 1.5.0

 Add a web UI page that visualizes physical plans (SparkPlan)
 

 Key: SPARK-8862
 URL: https://issues.apache.org/jira/browse/SPARK-8862
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Shixiong Zhu
 Fix For: 1.5.0


 We currently have the ability to visualize part of the query plan using the 
 Spark DAG viz. However, that does NOT work for one of the most important 
 operators: broadcast join. The reason is that broadcast join launches 
 multiple Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8861) Add basic instrumentation to each SparkPlan operator

2015-08-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8861.

   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 1.5.0

 Add basic instrumentation to each SparkPlan operator
 

 Key: SPARK-8861
 URL: https://issues.apache.org/jira/browse/SPARK-8861
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Shixiong Zhu
 Fix For: 1.5.0


 The basic metric can be the number of tuples flowing through. We can 
 add more metrics later.
 In order for this to work, we can add a new {{accumulators}} method to 
 SparkPlan that defines the list of accumulators, e.g.
 {code}
   def accumulators: Map[String, Accumulator]
 {code}
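 A hedged sketch of how an operator could feed such a map (illustrative plain-RDD code under
 the proposal above, not Spark's internal SparkPlan API; the names are assumptions):
 {code}
 import scala.reflect.ClassTag

 import org.apache.spark.{Accumulator, SparkContext}
 import org.apache.spark.rdd.RDD

 // Counts the tuples an operator emits via a named accumulator and exposes it
 // through an accumulators map, as suggested above.
 class CountingOperator[T: ClassTag](sc: SparkContext, name: String) {
   private val numTuples: Accumulator[Long] = sc.accumulator(0L, s"$name.numTuples")

   def accumulators: Map[String, Accumulator[Long]] = Map("numTuples" -> numTuples)

   // Wraps the operator's output so every emitted tuple bumps the counter.
   def instrument(output: RDD[T]): RDD[T] = {
     val counter = numTuples // local copy so the closure does not capture the whole operator
     output.map { row => counter += 1L; row }
   }
 }
 {code}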



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9633) SBT download locations outdated; need an update

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9633:
---

Assignee: (was: Apache Spark)

 SBT download locations outdated; need an update
 ---

 Key: SPARK-9633
 URL: https://issues.apache.org/jira/browse/SPARK-9633
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Sean Owen
Priority: Minor

 The SBT download script tries to download from two locations, 
 typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; 
 the latter redirects to dl.bintray.com now. In fact, bintray seems like the 
 only place to download SBT at this point. We should update to reference 
 bintray directly.
 PS: we should download SBT over HTTPS too, not HTTP



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9633) SBT download locations outdated; need an update

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655027#comment-14655027
 ] 

Apache Spark commented on SPARK-9633:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7956

 SBT download locations outdated; need an update
 ---

 Key: SPARK-9633
 URL: https://issues.apache.org/jira/browse/SPARK-9633
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Sean Owen
Priority: Minor

 The SBT download script tries to download from two locations, 
 typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; 
 the latter redirects to dl.bintray.com now. In fact, bintray seems like the 
 only place to download SBT at this point. We should update to reference 
 bintray directly.
 PS: we should download SBT over HTTPS too, not HTTP



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9633) SBT download locations outdated; need an update

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9633:
---

Assignee: Apache Spark

 SBT download locations outdated; need an update
 ---

 Key: SPARK-9633
 URL: https://issues.apache.org/jira/browse/SPARK-9633
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Sean Owen
Assignee: Apache Spark
Priority: Minor

 The SBT download script tries to download from two locations, 
 typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; 
 the latter redirects to dl.bintray.com now. In fact, bintray seems like the 
 only place to download SBT at this point. We should update to reference 
 bintray directly.
 PS: we should download SBT over HTTPS too, not HTTP



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9633) SBT download locations outdated; need an update

2015-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9633:
-
Assignee: Sean Owen

(Assigning to me as I don't yet see that nraychaudhuri has a JIRA username)

 SBT download locations outdated; need an update
 ---

 Key: SPARK-9633
 URL: https://issues.apache.org/jira/browse/SPARK-9633
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor

 The SBT download script tries to download from two locations, 
 typesafe.artifactoryonline.com and repo.typesafe.com. The former is offline; 
 the latter redirects to dl.bintray.com now. In fact, bintray seems like the 
 only place to download SBT at this point. We should update to reference 
 bintray directly.
 PS: we should download SBT over HTTPS too, not HTTP



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9634) resolve UnresolvedAlias in DataFrame.resolve

2015-08-05 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-9634:
--

 Summary: resolve UnresolvedAlias in DataFrame.resolve
 Key: SPARK-9634
 URL: https://issues.apache.org/jira/browse/SPARK-9634
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9634) resolve UnresolvedAlias in DataFrame.resolve

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655044#comment-14655044
 ] 

Apache Spark commented on SPARK-9634:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7957

 resolve UnresolvedAlias in DataFrame.resolve
 

 Key: SPARK-9634
 URL: https://issues.apache.org/jira/browse/SPARK-9634
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9634) resolve UnresolvedAlias in DataFrame.resolve

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9634:
---

Assignee: Apache Spark

 resolve UnresolvedAlias in DataFrame.resolve
 

 Key: SPARK-9634
 URL: https://issues.apache.org/jira/browse/SPARK-9634
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9634) resolve UnresolvedAlias in DataFrame.resolve

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9634:
---

Assignee: (was: Apache Spark)

 resolve UnresolvedAlias in DataFrame.resolve
 

 Key: SPARK-9634
 URL: https://issues.apache.org/jira/browse/SPARK-9634
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9323) DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9323:
---

Assignee: Apache Spark

 DataFrame.orderBy gives confusing analysis errors when ordering based on 
 nested columns
 ---

 Key: SPARK-9323
 URL: https://issues.apache.org/jira/browse/SPARK-9323
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Josh Rosen
Assignee: Apache Spark

 The following two queries should be equivalent, but the second crashes:
 {code}
 sqlContext.read.json(sqlContext.sparkContext.makeRDD(
   """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil))
   .registerTempTable("nestedOrder")
 checkAnswer(sql("SELECT a.b FROM nestedOrder ORDER BY a.b"), Row(1))
 checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("a.b"), Row(1))
 {code}
 Here's the stacktrace:
 {code}
 Cannot resolve column name a.b among (b);
 org.apache.spark.sql.AnalysisException: Cannot resolve column name a.b 
 among (b);
   at 
 org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:651)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:640)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.sql.DataFrame.sort(DataFrame.scala:593)
   at org.apache.spark.sql.DataFrame.orderBy(DataFrame.scala:624)
   at 
 org.apache.spark.sql.SQLQuerySuite$$anonfun$96.apply$mcV$sp(SQLQuerySuite.scala:1389)
 {code}
 Per [~marmbrus], the problem may be that {{DataFrame.resolve}} calls 
 {{resolveQuoted}}, causing the nested field to be treated as a single field 
 named {{a.b}}.
 UPDATE: here's a shorter one-liner reproduction:
 {code}
 val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD(
   """{"a": {"b": 1}}""" :: Nil))
 checkAnswer(df.select("a.b").filter("a.b = a.b"), Row(1))
 {code}
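 Until this is fixed, one workaround sketch (not the planned fix) is to alias the nested
 field to a flat name once, so later orderBy/filter calls resolve a plain top-level column
 instead of the dotted name "a.b":
 {code}
 import org.apache.spark.sql.functions.col

 // Alias the nested field; subsequent calls then look up an ordinary column named "b".
 val flat = sqlContext.sql("select * from nestedOrder").select(col("a.b").as("b"))
 flat.orderBy("b").show()
 {code}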



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9323) DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9323:
---

Assignee: (was: Apache Spark)

 DataFrame.orderBy gives confusing analysis errors when ordering based on 
 nested columns
 ---

 Key: SPARK-9323
 URL: https://issues.apache.org/jira/browse/SPARK-9323
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Josh Rosen

 The following two queries should be equivalent, but the second crashes:
 {code}
 sqlContext.read.json(sqlContext.sparkContext.makeRDD(
   """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil))
   .registerTempTable("nestedOrder")
 checkAnswer(sql("SELECT a.b FROM nestedOrder ORDER BY a.b"), Row(1))
 checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("a.b"), Row(1))
 {code}
 Here's the stacktrace:
 {code}
 Cannot resolve column name a.b among (b);
 org.apache.spark.sql.AnalysisException: Cannot resolve column name a.b 
 among (b);
   at 
 org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:651)
   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:640)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.sql.DataFrame.sort(DataFrame.scala:593)
   at org.apache.spark.sql.DataFrame.orderBy(DataFrame.scala:624)
   at 
 org.apache.spark.sql.SQLQuerySuite$$anonfun$96.apply$mcV$sp(SQLQuerySuite.scala:1389)
 {code}
 Per [~marmbrus], the problem may be that {{DataFrame.resolve}} calls 
 {{resolveQuoted}}, causing the nested field to be treated as a single field 
 named {{a.b}}.
 UPDATE: here's a shorter one-liner reproduction:
 {code}
 val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD(
   """{"a": {"b": 1}}""" :: Nil))
 checkAnswer(df.select("a.b").filter("a.b = a.b"), Row(1))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9607) Incorrect zinc check in build/mvn

2015-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9607:
-
Assignee: Ryan Williams

 Incorrect zinc check in build/mvn
 -

 Key: SPARK-9607
 URL: https://issues.apache.org/jira/browse/SPARK-9607
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.1
Reporter: Ryan Williams
Assignee: Ryan Williams
Priority: Minor

 [This 
 check|https://github.com/apache/spark/blob/5a23213c148bfe362514f9c71f5273ebda0a848a/build/mvn#L84-L85]
  in {{build/mvn}} attempts to determine whether {{zinc}} has been installed, 
 but it fails to add the prefix {{build/}} to the path, so it always thinks 
 that {{zinc}} is not installed, sets {{ZINC_INSTALL_FLAG}} to {{1}}, and 
 attempts to install {{zinc}}.
 This error manifests later because [the {{zinc -shutdown}} and {{zinc 
 -start}} 
 commands|https://github.com/apache/spark/blob/5a23213c148bfe362514f9c71f5273ebda0a848a/build/mvn#L140-L143]
  are always run, even if zinc was not installed and is running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9607) Incorrect zinc check in build/mvn

2015-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9607.
--
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.3.2
   1.4.2

Issue resolved by pull request 7944
[https://github.com/apache/spark/pull/7944]

 Incorrect zinc check in build/mvn
 -

 Key: SPARK-9607
 URL: https://issues.apache.org/jira/browse/SPARK-9607
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.1
Reporter: Ryan Williams
Assignee: Ryan Williams
Priority: Minor
 Fix For: 1.4.2, 1.3.2, 1.5.0


 [This 
 check|https://github.com/apache/spark/blob/5a23213c148bfe362514f9c71f5273ebda0a848a/build/mvn#L84-L85]
  in {{build/mvn}} attempts to determine whether {{zinc}} has been installed, 
 but it fails to add the prefix {{build/}} to the path, so it always thinks 
 that {{zinc}} is not installed, sets {{ZINC_INSTALL_FLAG}} to {{1}}, and 
 attempts to install {{zinc}}.
 This error manifests later because [the {{zinc -shutdown}} and {{zinc 
 -start}} 
 commands|https://github.com/apache/spark/blob/5a23213c148bfe362514f9c71f5273ebda0a848a/build/mvn#L140-L143]
  are always run, even if zinc was not installed and is running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9608) Incorrect zinc -status check in build/mvn

2015-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9608.
--
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.3.2
   1.4.2

Issue resolved by pull request 7944
[https://github.com/apache/spark/pull/7944]

 Incorrect zinc -status check in build/mvn
 -

 Key: SPARK-9608
 URL: https://issues.apache.org/jira/browse/SPARK-9608
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.1
Reporter: Ryan Williams
Assignee: Ryan Williams
Priority: Minor
 Fix For: 1.4.2, 1.3.2, 1.5.0


 {{build/mvn}} [uses a {{-z `zinc -status`}} 
 test|https://github.com/apache/spark/blob/5a23213c148bfe362514f9c71f5273ebda0a848a/build/mvn#L138]
  to determine whether a {{zinc}} process is running.
 However, {{zinc -status}} checks port {{3030}} by default.
 This means that if a {{$ZINC_PORT}} env var is set to some value besides 
 {{3030}}, and an existing {{zinc}} process is running on port {{3030}}, 
 {{build/mvn}} will skip starting a {{zinc}} process, thinking that a suitable 
 one is running.
 Subsequent compilations will look for a {{zinc}} at port {{$ZINC_PORT}} and 
 not find one.
 The {{zinc -status}} call should get the flag {{-port $ZINC_PORT}} added to 
 it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9636) Treat $SPARK_HOME as write-only

2015-08-05 Thread Philipp Angerer (JIRA)
Philipp Angerer created SPARK-9636:
--

 Summary: Treat $SPARK_HOME as write-only
 Key: SPARK-9636
 URL: https://issues.apache.org/jira/browse/SPARK-9636
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.1
 Environment: Linux
Reporter: Philipp Angerer


when starting spark scripts as user and it is installed in a directory the user 
has no write permissions on, many things work fine, except for the logs (e.g. 
for {{start-master.sh}})

logs are per default written to {{$SPARK_LOG_DIR}} or (if unset) to 
{{$SPARK_HOME/logs}}.

if installed in this way, it should, instead of throwing an error, write logs 
to {{/var/log/spark/}}. that’s easy to fix by simply testing a few log dirs in 
sequence for writability before trying to use one. i suggest using 
{{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} → 
{{$SPARK_HOME/logs/}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9636) Treat $SPARK_HOME as write-only

2015-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9636:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

 Treat $SPARK_HOME as write-only
 ---

 Key: SPARK-9636
 URL: https://issues.apache.org/jira/browse/SPARK-9636
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 1.4.1
 Environment: Linux
Reporter: Philipp Angerer
Priority: Minor
  Labels: easyfix

 when starting spark scripts as user and it is installed in a directory the 
 user has no write permissions on, many things work fine, except for the logs 
 (e.g. for {{start-master.sh}})
 logs are per default written to {{$SPARK_LOG_DIR}} or (if unset) to 
 {{$SPARK_HOME/logs}}.
 if installed in this way, it should, instead of throwing an error, write logs 
 to {{/var/log/spark/}}. that’s easy to fix by simply testing a few log dirs 
 in sequence for writability before trying to use one. i suggest using 
 {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} 
 → {{$SPARK_HOME/logs/}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9636) Treat $SPARK_HOME as write-only

2015-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655146#comment-14655146
 ] 

Sean Owen commented on SPARK-9636:
--

I'm not sure those are as obvious as defaults, or necessarily have write 
permission either. Isn't the solution that {{SPARK_LOG_DIR}} should be set if 
needed?

 Treat $SPARK_HOME as write-only
 ---

 Key: SPARK-9636
 URL: https://issues.apache.org/jira/browse/SPARK-9636
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.1
 Environment: Linux
Reporter: Philipp Angerer
  Labels: easyfix

 when starting spark scripts as user and it is installed in a directory the 
 user has no write permissions on, many things work fine, except for the logs 
 (e.g. for {{start-master.sh}})
 logs are per default written to {{$SPARK_LOG_DIR}} or (if unset) to 
 {{$SPARK_HOME/logs}}.
 if installed in this way, it should, instead of throwing an error, write logs 
 to {{/var/log/spark/}}. that’s easy to fix by simply testing a few log dirs 
 in sequence for writability before trying to use one. i suggest using 
 {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} 
 → {{$SPARK_HOME/logs/}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9637) Add interface for implementing scheduling algorithm for standalone deployment

2015-08-05 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-9637:
--

 Summary: Add interface for implementing scheduling algorithm for 
standalone deployment
 Key: SPARK-9637
 URL: https://issues.apache.org/jira/browse/SPARK-9637
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Liang-Chi Hsieh


We want to abstract the scheduling algorithm interface for standalone deployment 
mode. This would make it easier to implement and plug in different scheduling 
algorithms.
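
A rough sketch of the kind of abstraction being proposed (all names and signatures below are hypothetical illustrations, not the API in the pull request):
{code}
// Hypothetical interface: the standalone Master would delegate placement decisions to it.
trait SchedulingAlgorithm {
  // Decide how many cores to allocate on each usable worker for one waiting application.
  def scheduleExecutorsOnWorkers(
      coresPerExecutor: Int,
      coresWanted: Int,
      usableWorkerCores: Array[Int]): Array[Int]
}

// Example strategy: fill workers one at a time ("consolidate") instead of spreading out.
class FillFirstScheduling extends SchedulingAlgorithm {
  def scheduleExecutorsOnWorkers(
      coresPerExecutor: Int,
      coresWanted: Int,
      usableWorkerCores: Array[Int]): Array[Int] = {
    val assigned = Array.fill(usableWorkerCores.length)(0)
    var left = coresWanted
    for (i <- usableWorkerCores.indices) {
      // Grant as many whole executors as this worker and the remaining budget allow.
      val executors = math.min(usableWorkerCores(i) / coresPerExecutor, left / coresPerExecutor)
      assigned(i) = executors * coresPerExecutor
      left -= assigned(i)
    }
    assigned
  }
}
{code}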



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9637) Add interface for implementing scheduling algorithm for standalone deployment

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655166#comment-14655166
 ] 

Apache Spark commented on SPARK-9637:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/7958

 Add interface for implementing scheduling algorithm for standalone deployment
 -

 Key: SPARK-9637
 URL: https://issues.apache.org/jira/browse/SPARK-9637
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Liang-Chi Hsieh

 We want to abstract the scheduling algorithm interface for standalone deployment 
 mode. This would make it easier to implement and plug in different scheduling 
 algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9637) Add interface for implementing scheduling algorithm for standalone deployment

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9637:
---

Assignee: (was: Apache Spark)

 Add interface for implementing scheduling algorithm for standalone deployment
 -

 Key: SPARK-9637
 URL: https://issues.apache.org/jira/browse/SPARK-9637
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Liang-Chi Hsieh

 We want to abstract the scheduling algorithm interface for standalone deployment 
 mode. This would make it easier to implement and plug in different scheduling 
 algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9637) Add interface for implementing scheduling algorithm for standalone deployment

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9637:
---

Assignee: Apache Spark

 Add interface for implementing scheduling algorithm for standalone deployment
 -

 Key: SPARK-9637
 URL: https://issues.apache.org/jira/browse/SPARK-9637
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Liang-Chi Hsieh
Assignee: Apache Spark

 We want to abstract the scheduling algorithm interface for standalone deployment 
 mode. This would make it easier to implement and plug in different scheduling 
 algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9563) Remove repartition operators when they are the child of Exchange and shuffle=True

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655180#comment-14655180
 ] 

Apache Spark commented on SPARK-9563:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/7959

 Remove repartition operators when they are the child of Exchange and 
 shuffle=True
 -

 Key: SPARK-9563
 URL: https://issues.apache.org/jira/browse/SPARK-9563
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen

 Consider the following query:
 {code}
 val df1 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x)))
 val df2 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x)))
 df1.repartition(1000).join(df2, "_1").explain(true)
 {code}
 Here's the plan for this query as of Spark 1.4.1:
 {code}
 == Parsed Logical Plan ==
 Project [_1#68991,_2#68992,_2#68994]
  Join Inner, Some((_1#68991 = _1#68993))
   Repartition 1000, true
LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame 
 at console:29
   LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame 
 at console:30
 == Analyzed Logical Plan ==
 _1: int, _2: int, _2: int
 Project [_1#68991,_2#68992,_2#68994]
  Join Inner, Some((_1#68991 = _1#68993))
   Repartition 1000, true
LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame 
 at console:29
   LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame 
 at console:30
 == Optimized Logical Plan ==
 Project [_1#68991,_2#68992,_2#68994]
  Join Inner, Some((_1#68991 = _1#68993))
   Repartition 1000, true
LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame 
 at console:29
   LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame 
 at console:30
 == Physical Plan ==
 Project [_1#68991,_2#68992,_2#68994]
  ShuffledHashJoin [_1#68991], [_1#68993], BuildRight
   Exchange (HashPartitioning 200)
Repartition 1000, true
 PhysicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at 
 createDataFrame at console:29
   Exchange (HashPartitioning 200)
PhysicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at 
 createDataFrame at console:30
 {code}
 In this plan, we end up repartitioning {{df1}} to have 1000 partitions, which 
 involves a shuffle, only to turn around and shuffle again as part of the 
 exchange.
 To avoid this extra shuffle, I think that we should remove the Repartition 
 when the following condition holds:
 - Exchange's child is a repartition operator where shuffle=True.
 We should not perform this collapsing when shuffle=False, since there might 
 be a legitimate reason to coalesce before shuffling (reducing the number of 
 map outputs that need to be tracked, for instance).
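 A rough sketch of what such a collapsing rule could look like over the physical plan (the rule object is hypothetical and the {{Repartition}} constructor order is assumed from the plan output above; the actual change may be implemented differently, e.g. in the planner):
 {code}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{Exchange, Repartition, SparkPlan}

// Hypothetical rule: drop a shuffling Repartition that feeds directly into an Exchange,
// since the Exchange re-partitions (and re-shuffles) the data anyway.
object CollapseRepartitionBeforeExchange extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan transformUp {
    // Repartition(numPartitions, shuffle, child): only collapse when shuffle = true.
    case exchange @ Exchange(_, Repartition(_, true, child)) =>
      exchange.withNewChildren(Seq(child))
  }
}
 {code}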



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9563) Remove repartition operators when they are the child of Exchange and shuffle=True

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9563:
---

Assignee: Apache Spark

 Remove repartition operators when they are the child of Exchange and 
 shuffle=True
 -

 Key: SPARK-9563
 URL: https://issues.apache.org/jira/browse/SPARK-9563
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen
Assignee: Apache Spark

 Consider the following query:
 {code}
 val df1 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x)))
 val df2 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x)))
 df1.repartition(1000).join(df2, "_1").explain(true)
 {code}
 Here's the plan for this query as of Spark 1.4.1:
 {code}
 == Parsed Logical Plan ==
 Project [_1#68991,_2#68992,_2#68994]
  Join Inner, Some((_1#68991 = _1#68993))
   Repartition 1000, true
LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame 
 at console:29
   LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame 
 at console:30
 == Analyzed Logical Plan ==
 _1: int, _2: int, _2: int
 Project [_1#68991,_2#68992,_2#68994]
  Join Inner, Some((_1#68991 = _1#68993))
   Repartition 1000, true
LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame 
 at console:29
   LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame 
 at console:30
 == Optimized Logical Plan ==
 Project [_1#68991,_2#68992,_2#68994]
  Join Inner, Some((_1#68991 = _1#68993))
   Repartition 1000, true
LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame 
 at console:29
   LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame 
 at console:30
 == Physical Plan ==
 Project [_1#68991,_2#68992,_2#68994]
  ShuffledHashJoin [_1#68991], [_1#68993], BuildRight
   Exchange (HashPartitioning 200)
Repartition 1000, true
 PhysicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at 
 createDataFrame at console:29
   Exchange (HashPartitioning 200)
PhysicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at 
 createDataFrame at console:30
 {code}
 In this plan, we end up repartitioning {{df1}} to have 1000 partitions, which 
 involves a shuffle, only to turn around and shuffle again as part of the 
 exchange.
 To avoid this extra shuffle, I think that we should remove the Repartition 
 when the following condition holds:
 - Exchange's child is a repartition operator where shuffle=True.
 We should not perform this collapsing when shuffle=False, since there might 
 be a legitimate reason to coalesce before shuffling (reducing the number of 
 map outputs that need to be tracked, for instance).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9563) Remove repartition operators when they are the child of Exchange and shuffle=True

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9563:
---

Assignee: (was: Apache Spark)

 Remove repartition operators when they are the child of Exchange and 
 shuffle=True
 -

 Key: SPARK-9563
 URL: https://issues.apache.org/jira/browse/SPARK-9563
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen

 Consider the following query:
 {code}
 val df1 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x)))
 val df2 = sqlContext.createDataFrame(sc.parallelize(1 to 100, 100).map(x => (x, x)))
 df1.repartition(1000).join(df2, "_1").explain(true)
 {code}
 Here's the plan for this query as of Spark 1.4.1:
 {code}
 == Parsed Logical Plan ==
 Project [_1#68991,_2#68992,_2#68994]
  Join Inner, Some((_1#68991 = _1#68993))
   Repartition 1000, true
LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame 
 at console:29
   LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame 
 at console:30
 == Analyzed Logical Plan ==
 _1: int, _2: int, _2: int
 Project [_1#68991,_2#68992,_2#68994]
  Join Inner, Some((_1#68991 = _1#68993))
   Repartition 1000, true
LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame 
 at console:29
   LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame 
 at console:30
 == Optimized Logical Plan ==
 Project [_1#68991,_2#68992,_2#68994]
  Join Inner, Some((_1#68991 = _1#68993))
   Repartition 1000, true
LogicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at createDataFrame 
 at console:29
   LogicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at createDataFrame 
 at console:30
 == Physical Plan ==
 Project [_1#68991,_2#68992,_2#68994]
  ShuffledHashJoin [_1#68991], [_1#68993], BuildRight
   Exchange (HashPartitioning 200)
Repartition 1000, true
 PhysicalRDD [_1#68991,_2#68992], MapPartitionsRDD[82530] at 
 createDataFrame at console:29
   Exchange (HashPartitioning 200)
PhysicalRDD [_1#68993,_2#68994], MapPartitionsRDD[82533] at 
 createDataFrame at console:30
 {code}
 In this plan, we end up repartitioning {{df1}} to have 1000 partitions, which 
 involves a shuffle, only to turn around and shuffle again as part of the 
 exchange.
 To avoid this extra shuffle, I think that we should remove the Repartition 
 when the following condition holds:
 - Exchange's child is a repartition operator where shuffle=True.
 We should not perform this collapsing when shuffle=False, since there might 
 be a legitimate reason to coalesce before shuffling (reducing the number of 
 map outputs that need to be tracked, for instance).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9638) .save() Procedure fails

2015-08-05 Thread Stijn Geuens (JIRA)
Stijn Geuens created SPARK-9638:
---

 Summary: .save() Procedure fails
 Key: SPARK-9638
 URL: https://issues.apache.org/jira/browse/SPARK-9638
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.4.1
Reporter: Stijn Geuens


I am not able to save a MatrixFactorizationModel I created. 
Path ./Models exists.

Working with pyspark in IPython notebook (spark version = 1.4.1, hadoop version 
= 2.6)

Error message:

---
Py4JJavaError Traceback (most recent call last)
<ipython-input-14-28d4a0d852bb> in <module>()
----> 1 CFMFModel11.save(sc, "./Models/CFMFModel11")

C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\pyspark\mllib\util.pyc 
in save(self, sc, path)
202 
203 def save(self, sc, path):
--> 204 self._java_model.save(sc._jsc.sc(), path)
205 
206 

C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py
 in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling o334.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1823.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1823.0 
(TID 489, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:656)
at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:490)
at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:462)
at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at 

[jira] [Commented] (SPARK-9638) .save() Procedure fails

2015-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655214#comment-14655214
 ] 

Sean Owen commented on SPARK-9638:
--

I think this is because you are on Windows and you may not have Hadoop 
installed and/or HADOOP_HOME set. Spark needs some support binaries on Windows to 
interact with the FS. That is, I think this is the same issue as SPARK-2356 underneath.

 .save() Procedure fails
 ---

 Key: SPARK-9638
 URL: https://issues.apache.org/jira/browse/SPARK-9638
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.4.1
Reporter: Stijn Geuens

 I am not able to save a MatrixFactorizationModel I created. 
 Path ./Models exists.
 Working with pyspark in IPython notebook (spark version = 1.4.1, hadoop 
 version = 2.6)
 Error message:
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-14-28d4a0d852bb> in <module>()
 ----> 1 CFMFModel11.save(sc, "./Models/CFMFModel11")
 C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\pyspark\mllib\util.pyc
  in save(self, sc, path)
 202 
 203 def save(self, sc, path):
 --> 204 self._java_model.save(sc._jsc.sc(), path)
 205 
 206 
 C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py
  in __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 --> 538 self.target_id, self.name)
 539 
 540 for temp_arg in temp_args:
 C:\Users\s.geuens\Spark\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o334.save.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 1823.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
 1823.0 (TID 489, localhost): java.lang.NullPointerException
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
   at org.apache.hadoop.util.Shell.run(Shell.java:455)
   at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
   at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:656)
   at 
 org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:490)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:462)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
   at 
 org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
   at 
 

[jira] [Resolved] (SPARK-9593) Hive ShimLoader loads wrong Hadoop shims when Spark is compiled against Hadoop 2.0.0-mr1-cdh4.1.1

2015-08-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-9593.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7929
[https://github.com/apache/spark/pull/7929]

 Hive ShimLoader loads wrong Hadoop shims when Spark is compiled against 
 Hadoop 2.0.0-mr1-cdh4.1.1
 -

 Key: SPARK-9593
 URL: https://issues.apache.org/jira/browse/SPARK-9593
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.5.0


 Internally, Hive {{ShimLoader}} tries to load different versions of Hadoop 
 shims by checking version information gathered from Hadoop jar files.  If the 
 major version number is 1, {{Hadoop20SShims}} will be loaded.  Otherwise, if 
 the major version number is 2, {{Hadoop23Shims}} will be chosen.  However, 
 CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1 have 2 as the major version number 
 but contain Hadoop 1 code.  This confuses Hive {{ShimLoader}}, which then loads 
 the wrong version of the shims.
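 A small illustration of why that version string trips up a major-version check (simplified sketch, not Hive's actual ShimLoader code):
 {code}
// Simplified: choosing shims off the leading major version number alone is wrong for
// CDH4 MR1 builds, which report major version 2 but ship Hadoop 1 code.
def shimsFor(hadoopVersion: String): String =
  hadoopVersion.takeWhile(_ != '.').toInt match {
    case 1 => "Hadoop20SShims"
    case 2 => "Hadoop23Shims"
    case v => sys.error(s"Unrecognized Hadoop major version: $v")
  }

shimsFor("1.2.1")               // Hadoop20SShims
shimsFor("2.0.0-mr1-cdh4.1.1")  // Hadoop23Shims, even though this build contains Hadoop 1 code
 {code}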



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9636) Treat $SPARK_HOME as write-only

2015-08-05 Thread Philipp Angerer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655310#comment-14655310
 ] 

Philipp Angerer commented on SPARK-9636:


everything is more obvious than picing a location relative to the binary ;)

and the location is reported anyway since the {{start-master.sh}} script 
outputs {{starting org.apache.spark.deploy.master.Master, logging to 
/home/user/.cache/spark-logs/spark-user-org.apache.spark.deploy.master.Master-1-hostname.out}}

about write permissions, mind that i suggest testing them sequentially until 
one is found that can be written to. that’s IMHO a more sensible default than 
failing, and having to {{grep -i 'log' $SPARK_HOME/sbin/*.sh}} to find that an 
environment variable exists, and then retrying with that variable set.

 Treat $SPARK_HOME as write-only
 ---

 Key: SPARK-9636
 URL: https://issues.apache.org/jira/browse/SPARK-9636
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 1.4.1
 Environment: Linux
Reporter: Philipp Angerer
Priority: Minor
  Labels: easyfix

 when starting spark scripts as user and it is installed in a directory the 
 user has no write permissions on, many things work fine, except for the logs 
 (e.g. for {{start-master.sh}})
 logs are per default written to {{$SPARK_LOG_DIR}} or (if unset) to 
 {{$SPARK_HOME/logs}}.
 if installed in this way, it should, instead of throwing an error, write logs 
 to {{/var/log/spark/}}. that’s easy to fix by simply testing a few log dirs 
 in sequence for writability before trying to use one. i suggest using 
 {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} 
 → {{$SPARK_HOME/logs/}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9636) Treat $SPARK_HOME as write-only

2015-08-05 Thread Philipp Angerer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14655310#comment-14655310
 ] 

Philipp Angerer edited comment on SPARK-9636 at 8/5/15 1:04 PM:


everything is more obvious than picking a location relative to the binary ;)

and the location is reported anyway since the {{start-master.sh}} script 
outputs {{starting org.apache.spark.deploy.master.Master, logging to 
/home/user/.cache/spark-logs/spark-user-org.apache.spark.deploy.master.Master-1-hostname.out}}

about write permissions, mind that i suggest testing them sequentially until 
one is found that can be written to. that’s IMHO a more sensible default than 
failing, and having to {{grep -i 'log' $SPARK_HOME/sbin/*.sh}} to find that an 
environment variable exists, and then retrying with that variable set.


was (Author: angerer):
everything is more obvious than picing a location relative to the binary ;)

and the location is reported anyway since the {{start-master.sh}} script 
outputs {{starting org.apache.spark.deploy.master.Master, logging to 
/home/user/.cache/spark-logs/spark-user-org.apache.spark.deploy.master.Master-1-hostname.out}}

about write permissions, mind that i suggest testing them sequentially until 
one is found that can be written to. that’s IMHO a more sensible default than 
failing, and having to {{grep -i 'log' $SPARK_HOME/sbin/*.sh}} to find that an 
environment variable exists, and then retrying with that variable set.

 Treat $SPARK_HOME as write-only
 ---

 Key: SPARK-9636
 URL: https://issues.apache.org/jira/browse/SPARK-9636
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 1.4.1
 Environment: Linux
Reporter: Philipp Angerer
Priority: Minor
  Labels: easyfix

 when starting spark scripts as user and it is installed in a directory the 
 user has no write permissions on, many things work fine, except for the logs 
 (e.g. for {{start-master.sh}})
 logs are per default written to {{$SPARK_LOG_DIR}} or (if unset) to 
 {{$SPARK_HOME/logs}}.
 if installed in this way, it should, instead of throwing an error, write logs 
 to {{/var/log/spark/}}. that’s easy to fix by simply testing a few log dirs 
 in sequence for writability before trying to use one. i suggest using 
 {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} 
 → {{$SPARK_HOME/logs/}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9639) JobHandler may throw NPE if JobScheduler has been stopped

2015-08-05 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-9639:
---

 Summary: JobHandler may throw NPE if JobScheduler has been stopped
 Key: SPARK-9639
 URL: https://issues.apache.org/jira/browse/SPARK-9639
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu


Because `JobScheduler.stop(false)` may set `eventLoop` to null while a `JobHandler` 
is still running, it's possible that `eventLoop` is already null by the time 
`post` is called.
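
A minimal sketch of the race and a defensive fix (illustrative class, not the actual JobScheduler code):
{code}
// Sketch only: stop() may null out eventLoop concurrently with a running handler.
class EventLoop { def post(msg: String): Unit = println(s"posted: $msg") }

class Scheduler {
  @volatile private var eventLoop: EventLoop = new EventLoop

  def stop(): Unit = { eventLoop = null }

  // Unsafe: eventLoop can become null between the check and the call.
  def postUnsafe(msg: String): Unit = if (eventLoop != null) eventLoop.post(msg)

  // Safer: capture the reference once and check the local copy; drop the event if stopped.
  def postSafe(msg: String): Unit = {
    val loop = eventLoop
    if (loop != null) loop.post(msg)
  }
}
{code}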



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9639) JobHandler may throw NPE if JobScheduler has been stopped

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9639:
---

Assignee: Apache Spark

 JobHandler may throw NPE if JobScheduler has been stopped
 -

 Key: SPARK-9639
 URL: https://issues.apache.org/jira/browse/SPARK-9639
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu
Assignee: Apache Spark

 Because `JobScheduler.stop(false)` may set `eventLoop` to null while a `JobHandler` 
 is still running, it's possible that `eventLoop` is already null by the time 
 `post` is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9639) JobHandler may throw NPE if JobScheduler has been stopped

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658195#comment-14658195
 ] 

Apache Spark commented on SPARK-9639:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/7960

 JobHandler may throw NPE if JobScheduler has been stopped
 -

 Key: SPARK-9639
 URL: https://issues.apache.org/jira/browse/SPARK-9639
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu

 Because `JobScheduler.stop(false)` may set `eventLoop` to null while a `JobHandler` 
 is still running, it's possible that `eventLoop` is already null by the time 
 `post` is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9639) JobHandler may throw NPE if JobScheduler has been stopped

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9639:
---

Assignee: (was: Apache Spark)

 JobHandler may throw NPE if JobScheduler has been stopped
 -

 Key: SPARK-9639
 URL: https://issues.apache.org/jira/browse/SPARK-9639
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu

 Because `JobScheduler.stop(false)` may set `eventLoop` to null while a `JobHandler` 
 is still running, it's possible that `eventLoop` is already null by the time 
 `post` is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9640) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated

2015-08-05 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-9640:


 Summary: Do not run Python Kinesis tests when the Kinesis assembly 
JAR has not been generated
 Key: SPARK-9640
 URL: https://issues.apache.org/jira/browse/SPARK-9640
 Project: Spark
  Issue Type: Test
  Components: Streaming, Tests
Reporter: Tathagata Das
Assignee: Tathagata Das






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9640) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9640:
---

Assignee: Apache Spark  (was: Tathagata Das)

 Do not run Python Kinesis tests when the Kinesis assembly JAR has not been 
 generated
 

 Key: SPARK-9640
 URL: https://issues.apache.org/jira/browse/SPARK-9640
 Project: Spark
  Issue Type: Test
  Components: Streaming, Tests
Reporter: Tathagata Das
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9640) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9640:
---

Assignee: Tathagata Das  (was: Apache Spark)

 Do not run Python Kinesis tests when the Kinesis assembly JAR has not been 
 generated
 

 Key: SPARK-9640
 URL: https://issues.apache.org/jira/browse/SPARK-9640
 Project: Spark
  Issue Type: Test
  Components: Streaming, Tests
Reporter: Tathagata Das
Assignee: Tathagata Das





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9640) Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658243#comment-14658243
 ] 

Apache Spark commented on SPARK-9640:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/7961

 Do not run Python Kinesis tests when the Kinesis assembly JAR has not been 
 generated
 

 Key: SPARK-9640
 URL: https://issues.apache.org/jira/browse/SPARK-9640
 Project: Spark
  Issue Type: Test
  Components: Streaming, Tests
Reporter: Tathagata Das
Assignee: Tathagata Das





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9641) spark.shuffle.service.port is not documented

2015-08-05 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-9641:


 Summary: spark.shuffle.service.port is not documented
 Key: SPARK-9641
 URL: https://issues.apache.org/jira/browse/SPARK-9641
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Reporter: Thomas Graves


Looking at the code I see spark.shuffle.service.port being used but I can't 
find any documentation on it.   I don't see a reason for this to be an internal 
config so we should document it.
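
For reference, a sketch of how the two settings are used together (7337 is, as far as I can tell, the built-in default port, so the explicit value below is only an example):
{code}
// Enable the external shuffle service and pin its port explicitly.
// Both keys already exist in the code; only their documentation is missing.
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.port", "7337")
{code}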



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9618) SQLContext.read.schema().parquet() ignores the supplied schema

2015-08-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-9618.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7947
[https://github.com/apache/spark/pull/7947]

 SQLContext.read.schema().parquet() ignores the supplied schema
 --

 Key: SPARK-9618
 URL: https://issues.apache.org/jira/browse/SPARK-9618
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Nathan Howell
Assignee: Nathan Howell
Priority: Minor
 Fix For: 1.5.0


 If a user supplies a schema when loading a Parquet file it is ignored and the 
 schema is read off disk instead.
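 For clarity, the call pattern in question, where the user-supplied schema should win over the schema stored in the Parquet footers (the path and field names below are illustrative):
 {code}
import org.apache.spark.sql.types._

// The caller supplies a schema up front...
val schema = StructType(Seq(StructField("id", LongType, nullable = false)))

// ...but before this fix the schema read from the files was used instead.
val df = sqlContext.read.schema(schema).parquet("/path/to/data")
 {code}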



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8862) Add a web UI page that visualizes physical plans (SparkPlan)

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658302#comment-14658302
 ] 

Apache Spark commented on SPARK-8862:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/7962

 Add a web UI page that visualizes physical plans (SparkPlan)
 

 Key: SPARK-8862
 URL: https://issues.apache.org/jira/browse/SPARK-8862
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Shixiong Zhu
 Fix For: 1.5.0


 We currently have the ability to visualize part of the query plan using the 
 Spark DAG viz. However, that does NOT work for one of the most important 
 operators: broadcast join. The reason is that broadcast join launches 
 multiple Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6486) Add BlockMatrix in PySpark

2015-08-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6486:
-
Fix Version/s: (was: 1.6.0)
   1.5.0

 Add BlockMatrix in PySpark
 --

 Key: SPARK-6486
 URL: https://issues.apache.org/jira/browse/SPARK-6486
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Mike Dusenberry
 Fix For: 1.5.0


 We should add BlockMatrix to PySpark. Internally, we can use DataFrames and 
 MatrixUDT for serialization. This JIRA should contain conversions from 
 IndexedRowMatrix/CoordinateMatrix to block matrices. But this does NOT cover 
 linear algebra operations of block matrices.
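 For reference, the Scala-side conversions this would mirror (a minimal sketch; block sizes default to 1024 x 1024 unless specified):
 {code}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRowMatrix, MatrixEntry}

// Conversions that already exist in Scala and would be exposed from PySpark.
val entries = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0)))
val coordMat = new CoordinateMatrix(entries)

val blockMat = coordMat.toBlockMatrix()                           // CoordinateMatrix -> BlockMatrix
val indexedMat: IndexedRowMatrix = coordMat.toIndexedRowMatrix()
val blockMat2 = indexedMat.toBlockMatrix()                        // IndexedRowMatrix -> BlockMatrix
 {code}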



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9381) Migrate JSON data source to the new partitioning data source

2015-08-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-9381.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7696
[https://github.com/apache/spark/pull/7696]

 Migrate JSON data source to the new partitioning data source
 

 Key: SPARK-9381
 URL: https://issues.apache.org/jira/browse/SPARK-9381
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9381) Migrate JSON data source to the new partitioning data source

2015-08-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-9381:
--
Assignee: Cheng Hao

 Migrate JSON data source to the new partitioning data source
 

 Key: SPARK-9381
 URL: https://issues.apache.org/jira/browse/SPARK-9381
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9641) spark.shuffle.service.port is not documented

2015-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658328#comment-14658328
 ] 

Sean Owen commented on SPARK-9641:
--

Agree. The .enabled flag is mentioned but not documented either. Want to make a 
PR, or should I?

 spark.shuffle.service.port is not documented
 

 Key: SPARK-9641
 URL: https://issues.apache.org/jira/browse/SPARK-9641
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Shuffle
Reporter: Thomas Graves

 Looking at the code I see spark.shuffle.service.port being used but I can't 
 find any documentation on it.   I don't see a reason for this to be an 
 internal config so we should document it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9641) spark.shuffle.service.port is not documented

2015-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9641:
-
   Priority: Minor  (was: Major)
Component/s: Documentation
 Issue Type: Improvement  (was: Bug)

 spark.shuffle.service.port is not documented
 

 Key: SPARK-9641
 URL: https://issues.apache.org/jira/browse/SPARK-9641
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Shuffle
Reporter: Thomas Graves
Priority: Minor

 Looking at the code I see spark.shuffle.service.port being used but I can't 
 find any documentation on it.   I don't see a reason for this to be an 
 internal config so we should document it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5544) wholeTextFiles should recognize multiple input paths delimited by ,

2015-08-05 Thread Perinkulam I Ganesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658368#comment-14658368
 ] 

Perinkulam I Ganesh commented on SPARK-5544:


It seems like this JIRA got resolved by SPARK-7155... please double check.

thanks

- P. I. 

 wholeTextFiles should recognize multiple input paths delimited by ,
 ---

 Key: SPARK-5544
 URL: https://issues.apache.org/jira/browse/SPARK-5544
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Xiangrui Meng

 textFile takes delimited paths in a single path string. wholeTextFiles should 
 behave the same.
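 In other words, both calls below should accept the same comma-delimited path string (paths are illustrative):
 {code}
// textFile already splits a comma-delimited path string into multiple inputs.
val lines = sc.textFile("/data/part1,/data/part2")

// wholeTextFiles should do the same, rather than treating it as one literal path.
val files = sc.wholeTextFiles("/data/part1,/data/part2")
 {code}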



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8486) SIFT Feature Transformer

2015-08-05 Thread K S Sreenivasa Raghavan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658377#comment-14658377
 ] 

K S Sreenivasa Raghavan commented on SPARK-8486:


How can I take up this issue?
Should I use Scala or Python?

 SIFT Feature Transformer
 

 Key: SPARK-8486
 URL: https://issues.apache.org/jira/browse/SPARK-8486
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Feynman Liang
Priority: Minor

 Scale invariant feature transform (SIFT) is a scale and rotation invariant 
 method to transform images into matrices describing local features. (Lowe, 
 IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf)
 We can implement SIFT in Spark ML pipelines as an org.apache.spark.ml.Transformer. 
 Given an image Array[Array[Numeric]], the SIFT transformer should output an 
 Array[Array[Numeric]] of the SIFT features for the provided image.
 The implementation should support computation of SIFT at predefined interest 
 points, every kth pixel, and densely (over all pixels). Furthermore, the 
 implementation should support various approximations for approximating the 
 Laplacian of Gaussian using Difference of Gaussian (as described by Lowe).
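 A bare skeleton of how such a transformer could slot into an ML pipeline (the class name and the identity placeholder below are hypothetical; a real implementation would emit actual SIFT descriptors):
 {code}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, DoubleType}

// Hypothetical skeleton only: input and output are both arrays of arrays of doubles.
class SIFTTransformer(override val uid: String)
  extends UnaryTransformer[Seq[Seq[Double]], Seq[Seq[Double]], SIFTTransformer] {

  def this() = this(Identifiable.randomUID("sift"))

  // Placeholder: identity. A real version would build the scale-space pyramid,
  // approximate the Laplacian of Gaussian with differences of Gaussians, and
  // compute descriptors at the chosen interest points (or densely / every kth pixel).
  override protected def createTransformFunc: Seq[Seq[Double]] => Seq[Seq[Double]] =
    image => image

  override protected def outputDataType: DataType =
    ArrayType(ArrayType(DoubleType, containsNull = false), containsNull = false)

  override def copy(extra: ParamMap): SIFTTransformer = defaultCopy(extra)
}
 {code}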



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets

2015-08-05 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658376#comment-14658376
 ] 

Seth Hendrickson commented on SPARK-8971:
-

[~mengxr] You mentioned that the solution should call {{sampleByKeyExact}}, 
which is a function that takes a stratified subsample of m < N elements from a 
dataset. One problem is that for things like train/test split and k-fold 
creation (which are fundamentally the same as far as sampling goes) we 
actually need to take random splits of the dataset. That is, we need not only 
the subsample, but its complement. For k-fold sampling, we need to split the 
dataset into k unique, non-overlapping subsamples, which isn't possible with 
{{sampleByKeyExact}} in its current state.

I have a pretty coarse prototype which essentially uses the [efficient, 
parallel sampling routine|http://jmlr.org/proceedings/papers/v28/meng13a.html] 
to find the exact k thresholds needed to split the dataset into k subsamples. I 
had to modify the sampling function in 
{{org.apache.spark.util.random.StratifiedSamplingUtils}} to compare the random 
keys to a range (e.g. lb < x <= ub), rather than simply comparing to one 
number (x < threshold), which only allows for a bisection of the data. Once you 
know the exact k-1 thresholds that provide even splits for each stratum, and 
you have a sampling function that can compare the random key to a range, you 
have what you need for stratified k-fold and train/test split. Is there a 
way to implement this without touching the {{org.apache.spark.util.random}} 
package that I'm missing?
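
To make the range-comparison idea concrete, here is a rough local sketch of the mechanics (approximate version only, using the ideal boundaries j/k; the exact-split prototype described above would substitute per-stratum thresholds so that every label is split evenly):
{code}
import scala.util.Random

// Each record gets a random key x in [0, 1); fold j's validation set is the records with
// j/k <= x < (j+1)/k, and its training set is the complement of that range.
def kFoldSplits[T](data: Seq[T], k: Int, seed: Long): Seq[(Seq[T], Seq[T])] = {
  val rng = new Random(seed)
  val keyed = data.map(r => (rng.nextDouble(), r))
  (0 until k).map { j =>
    val (lb, ub) = (j.toDouble / k, (j + 1).toDouble / k)
    val (validation, training) = keyed.partition { case (x, _) => x >= lb && x < ub }
    (training.map(_._2), validation.map(_._2))
  }
}
{code}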

 Support balanced class labels when splitting train/cross validation sets
 

 Key: SPARK-8971
 URL: https://issues.apache.org/jira/browse/SPARK-8971
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Feynman Liang
Assignee: Seth Hendrickson

 {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are 
 Spark classes which partition data into training and evaluation sets for 
 performing hyperparameter selection via cross validation.
 Both methods currently perform the split by randomly sampling the datasets. 
 However, when class probabilities are highly imbalanced (e.g. detection of 
 extremely low-frequency events), random sampling may result in cross 
 validation sets not representative of actual out-of-training performance 
 (e.g. no positive training examples could be included).
 Mainstream R packages like 
 [caret|http://topepo.github.io/caret/splitting.html] already support splitting the 
 data based upon the class labels.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4412) Parquet logger cannot be configured

2015-08-05 Thread Stephen Carman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658378#comment-14658378
 ] 

Stephen Carman commented on SPARK-4412:
---

This also happens to me a lot in Spark 1.4.0; perhaps this could be tested on 
the 1.4 branch as well?

 Parquet logger cannot be configured
 ---

 Key: SPARK-4412
 URL: https://issues.apache.org/jira/browse/SPARK-4412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.1
Reporter: Jim Carroll

 The Spark ParquetRelation.scala code makes the assumption that the 
 parquet.Log class has already been loaded. If 
 ParquetRelation.enableLogForwarding executes prior to the parquet.Log class 
 being loaded, then the code in enableLogForwarding has no effect.
 ParquetRelation.scala attempts to override the parquet logger but, at least 
 currently (and if your application simply reads a parquet file before it does 
 anything else with Parquet), the parquet.Log class hasn't been loaded yet. 
 Therefore the code in ParquetRelation.enableLogForwarding has no effect. If 
 you look at the code in parquet.Log there's a static initializer that needs 
 to be called prior to enableLogForwarding, or whatever enableLogForwarding 
 does gets undone by this static initializer.
 The fix would be to force the static initializer to get called in 
 parquet.Log as part of enableLogForwarding. 
 PR will be forthcoming.
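 A sketch of the shape of the fix: force {{parquet.Log}} (and hence its static initializer) to load before reconfiguring its JUL loggers (illustrative only, not the actual patch):
 {code}
object ParquetLogFix {
  // Make sure parquet.Log's static initializer has run before we adjust its loggers;
  // otherwise that initializer can later undo the handler changes made here.
  def enableLogForwarding(): Unit = {
    Class.forName("parquet.Log")  // force the static initializer to run now

    val parquetLogger = java.util.logging.Logger.getLogger("parquet")
    parquetLogger.getHandlers.foreach(h => parquetLogger.removeHandler(h))
    parquetLogger.setUseParentHandlers(true)  // hand logging off to the parent handlers
  }
}
 {code}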



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5544) wholeTextFiles should recognize multiple input paths delimited by ,

2015-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5544.
--
  Resolution: Duplicate
Target Version/s:   (was: 1.5.0)

 wholeTextFiles should recognize multiple input paths delimited by ,
 ---

 Key: SPARK-5544
 URL: https://issues.apache.org/jira/browse/SPARK-5544
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Xiangrui Meng

 textFile takes delimited paths in a single path string. wholeTextFiles should 
 behave the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6227) PCA and SVD for PySpark

2015-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6227:
---

Assignee: Apache Spark

 PCA and SVD for PySpark
 ---

 Key: SPARK-6227
 URL: https://issues.apache.org/jira/browse/SPARK-6227
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.2.1
Reporter: Julien Amelot
Assignee: Apache Spark

 The Dimensionality Reduction techniques are not available via Python (Scala + 
 Java only).
 * Principal component analysis (PCA)
 * Singular value decomposition (SVD)
 Doc:
 http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html
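 For reference, the existing Scala API that the Python wrappers would expose (minimal sketch):
 {code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a small RowMatrix and run the two reductions that PySpark currently lacks.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)))
val mat = new RowMatrix(rows)

val pc = mat.computePrincipalComponents(2)    // PCA projection matrix (3 x 2)
val svd = mat.computeSVD(2, computeU = true)  // truncated SVD: U, s, V with k = 2
 {code}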



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark

2015-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14658403#comment-14658403
 ] 

Apache Spark commented on SPARK-6227:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/7963

 PCA and SVD for PySpark
 ---

 Key: SPARK-6227
 URL: https://issues.apache.org/jira/browse/SPARK-6227
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.2.1
Reporter: Julien Amelot

 The Dimensionality Reduction techniques are not available via Python (Scala + 
 Java only).
 * Principal component analysis (PCA)
 * Singular value decomposition (SVD)
 Doc:
 http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


