[jira] [Created] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params

2015-07-07 Thread guowei (JIRA)
guowei created SPARK-8865:
-

 Summary: Fix bug:  init SimpleConsumerConfig with kafka params
 Key: SPARK-8865
 URL: https://issues.apache.org/jira/browse/SPARK-8865
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: guowei
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params

2015-07-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616245#comment-14616245
 ] 

Apache Spark commented on SPARK-8865:
-

User 'guowei2' has created a pull request for this issue:
https://github.com/apache/spark/pull/7254

 Fix bug:  init SimpleConsumerConfig with kafka params
 -

 Key: SPARK-8865
 URL: https://issues.apache.org/jira/browse/SPARK-8865
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: guowei
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8865:
---

Assignee: (was: Apache Spark)

 Fix bug:  init SimpleConsumerConfig with kafka params
 -

 Key: SPARK-8865
 URL: https://issues.apache.org/jira/browse/SPARK-8865
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: guowei
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3164) Store DecisionTree Split.categories as Set

2015-07-07 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616246#comment-14616246
 ] 

Rekha Joshi commented on SPARK-3164:


It was not assigned, nor closed/commented on, and the improvement suggestion seemed 
needed in the latest snapshot, but ack on the decision made for API stability. Thanks 
for letting me know [~josephkb] 
- - sad, as I updated 5 files for a trivial pursuit, with the scala style 
checks :-) :-) 


 Store DecisionTree Split.categories as Set
 --

 Key: SPARK-3164
 URL: https://issues.apache.org/jira/browse/SPARK-3164
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Trivial
 Fix For: 1.4.0


 Improvement: computation
 For categorical features with many categories, it could be more efficient to 
 store Split.categories as a Set, not a List.  (It is currently a List.)  A 
 Set might be more scalable (for log n lookups), though tests would need to be 
 done to ensure that Sets do not incur too much more overhead than Lists.
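 As a rough sketch of the lookup difference described above (hypothetical code, not 
 the DecisionTree implementation; all names are made up):
 {code}
 // Membership tests: List.contains scans linearly, Set.contains is a hash lookup.
 val categories = (0 until 1000).map(_.toDouble)
 val asList: List[Double] = categories.toList
 val asSet: Set[Double] = categories.toSet

 asList.contains(999.0)  // O(n): may walk the whole list
 asSet.contains(999.0)   // ~constant/log-time lookup
 {code}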



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8865:
---

Assignee: Apache Spark

 Fix bug:  init SimpleConsumerConfig with kafka params
 -

 Key: SPARK-8865
 URL: https://issues.apache.org/jira/browse/SPARK-8865
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: guowei
Assignee: Apache Spark
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3164) Store DecisionTree Split.categories as Set

2015-07-07 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616246#comment-14616246
 ] 

Rekha Joshi edited comment on SPARK-3164 at 7/7/15 6:15 AM:


it was not assigned, nor closed/commented, and improvement suggestion needed in 
latest snapshot.but ack on decision made on api stability. thanks for letting 
me know [~josephkb] 
sad, as i updated 5 files for a trivial pursuit, with the scala style checks 
:-) :-) 



was (Author: rekhajoshm):
it was not assigned, nor closed/commented, and improvement suggestion needed in 
latest snapshot.but ack on decision made on api stability. thanks for letting 
me know [~josephkb] 
# sad, as i updated 5 files for a trivial pursuit, with the scala style checks 
:-) :-) 


 Store DecisionTree Split.categories as Set
 --

 Key: SPARK-3164
 URL: https://issues.apache.org/jira/browse/SPARK-3164
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Trivial
 Fix For: 1.4.0


 Improvement: computation
 For categorical features with many categories, it could be more efficient to 
 store Split.categories as a Set, not a List.  (It is currently a List.)  A 
 Set might be more scalable (for log n lookups), though tests would need to be 
 done to ensure that Sets do not incur too much more overhead than Lists.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3164) Store DecisionTree Split.categories as Set

2015-07-07 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616246#comment-14616246
 ] 

Rekha Joshi edited comment on SPARK-3164 at 7/7/15 6:15 AM:


it was not assigned, nor closed/commented, and improvement suggestion needed in 
latest snapshot.but ack on decision made on api stability. thanks for letting 
me know [~josephkb] 
# sad, as i updated 5 files for a trivial pursuit, with the scala style checks 
:-) :-) 



was (Author: rekhajoshm):
it was not assigned, nor closed/commented, and improvement suggestion needed in 
latest snapshot.but ack on decision made on api stability. thanks for letting 
me know [~josephkb] 
- - sad, as i updated 5 files for a trivial pursuit, with the scala style 
checks :-) :-) 


 Store DecisionTree Split.categories as Set
 --

 Key: SPARK-3164
 URL: https://issues.apache.org/jira/browse/SPARK-3164
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Trivial
 Fix For: 1.4.0


 Improvement: computation
 For categorical features with many categories, it could be more efficient to 
 store Split.categories as a Set, not a List.  (It is currently a List.)  A 
 Set might be more scalable (for log n lookups), though tests would need to be 
 done to ensure that Sets do not incur too much more overhead than Lists.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns

2015-07-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-8868:
---

Assignee: Yin Huai

 SqlSerializer2 can go into infinite loop when row consists only of NullType 
 columns
 ---

 Key: SPARK-8868
 URL: https://issues.apache.org/jira/browse/SPARK-8868
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.5.0
Reporter: Josh Rosen
Assignee: Yin Huai
Priority: Minor

 The following SQL query will cause an infinite loop in SqlSerializer2 code:
 {code}
 val df = sqlContext.sql("select null where 1 = 1")
 df.unionAll(df).sort("_c0").collect()
 {code}
 The same problem occurs if we add more null-literals, but does not occur as 
 long as there is a column of any other type (e.g. {{select 1, null where 1 == 
 1}}).
 I think that what's happening here is that if you have a row that consists 
 only of columns of NullType (not columns of other types which happen to only 
 contain null values, but only columns of null literals), SqlSerializer will 
 end up writing / reading no data for rows of this type.  Since the 
 deserialization stream will never try to read any data but nevertheless will 
 be able to return an empty row, DeserializationStream.asIterator will go into 
 an infinite loop since there will never be a read to trigger an EOF exception.
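 For illustration only, a minimal sketch (not Spark's actual DeserializationStream 
 code; the names are assumptions) of why a format that reads zero bytes per row 
 defeats an iterator that relies on EOF to terminate:
 {code}
 import java.io.{ByteArrayInputStream, DataInputStream, EOFException}

 // Hypothetical EOF-driven row iterator: it only stops when readRow throws EOFException.
 def rowIterator(in: DataInputStream, readRow: DataInputStream => Seq[Any]): Iterator[Seq[Any]] =
   new Iterator[Seq[Any]] {
     private var nextRow: Option[Seq[Any]] = fetch()
     private def fetch(): Option[Seq[Any]] =
       try Some(readRow(in)) catch { case _: EOFException => None }
     def hasNext: Boolean = nextRow.isDefined
     def next(): Seq[Any] = { val row = nextRow.get; nextRow = fetch(); row }
   }

 // A NullType-only row needs no bytes, so readRow never touches the stream,
 // EOFException is never thrown, and the iterator yields empty rows forever.
 val emptyStream = new DataInputStream(new ByteArrayInputStream(Array[Byte]()))
 val rows = rowIterator(emptyStream, _ => Seq.empty)  // rows.hasNext is always true
 {code}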



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8840) Float type coercion with hiveContext

2015-07-07 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616918#comment-14616918
 ] 

Shivaram Venkataraman commented on SPARK-8840:
--

Sorry, my question is: can you reproduce this in the Scala shell or the PySpark shell?
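
For reference, a hedged sketch of what the equivalent check could look like in the 
Scala shell (the table and column names are simply taken from the R snippet below, 
so treat them as assumptions):

{code}
// Assumes a HiveContext, e.g. val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hiveContext.sql("SELECT offset, percentage FROM data LIMIT 100")
result.printSchema()   // should report offset/percentage as float
result.head()          // does the float column come back, or does it fail here as well?
{code}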

 Float type coercion with hiveContext
 

 Key: SPARK-8840
 URL: https://issues.apache.org/jira/browse/SPARK-8840
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Evgeny SInelnikov

 Problem with +float+ type coercion on SparkR with hiveContext.
 {code}
  result <- sql(hiveContext, "SELECT offset, percentage from data limit 100")
  show(result)
 DataFrame[offset:float, percentage:float]
  head(result)
 Error in as.data.frame.default(x[[i]], optional = TRUE) :
 cannot coerce class "jobj" to a data.frame
 {code}
 This trouble looks like an already existing one (SPARK-2863 - Emulate Hive type
 coercion in native reimplementations of Hive functions), with the same root
 cause: incomplete native reimplementations of Hive, and not only of its
 functions.
 I used spark 1.4.0 binaries from official site:
 http://spark.apache.org/downloads.html
 And running it on:
 * Hortonworks HDP 2.2.0.0-2041
 * with Hive 0.14
 * with disabled hooks for Application Timeline Servers (ATSHook) in 
 hive-site.xml, commented:
 ** hive.exec.failure.hooks,
 ** hive.exec.post.hooks,
 ** hive.exec.pre.hooks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8821) The ec2 script doesn't run on python 3 with an utf8 env

2015-07-07 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-8821:
-
Assignee: Simon Hafner

 The ec2 script doesn't run on python 3 with an utf8 env
 ---

 Key: SPARK-8821
 URL: https://issues.apache.org/jira/browse/SPARK-8821
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
 Environment: Archlinux, UTF8 LANG env
Reporter: Simon Hafner
Assignee: Simon Hafner
 Fix For: 1.4.2, 1.5.0


 Otherwise, in case of a UTF-8 env setting, the script will crash with:
  - Downloading boto...
 Traceback (most recent call last):
   File "ec2/spark_ec2.py", line 148, in <module>
     setup_external_libs(external_libs)
   File "ec2/spark_ec2.py", line 128, in setup_external_libs
     if hashlib.md5(tar.read()).hexdigest() != lib["md5"]:
   File "/usr/lib/python3.4/codecs.py", line 319, in decode
     (result, consumed) = self._buffer_decode(data, self.errors, final)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: 
 invalid start byte



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler (ML and PySpark)

2015-07-07 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-8704:
---
Summary: Add missing methods in StandardScaler (ML and PySpark)  (was: Add 
missing methods in StandardScaler)

 Add missing methods in StandardScaler (ML and PySpark)
 --

 Key: SPARK-8704
 URL: https://issues.apache.org/jira/browse/SPARK-8704
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar

 std, mean to StandardScalerModel
 getVectors, findSynonyms to Word2Vec Model
 setFeatures and getFeatures to hashingTF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler

2015-07-07 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-8704:
---
Summary: Add missing methods in StandardScaler  (was: Add additional 
methods to wrappers in ml.pyspark.feature)

 Add missing methods in StandardScaler
 -

 Key: SPARK-8704
 URL: https://issues.apache.org/jira/browse/SPARK-8704
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar

 std, mean to StandardScalerModel
 getVectors, findSynonyms to Word2Vec Model
 setFeatures and getFeatures to hashingTF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8823) Optimizations for sparse vector products in pyspark.mllib.linalg

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-8823.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7222
[https://github.com/apache/spark/pull/7222]

 Optimizations for sparse vector products in pyspark.mllib.linalg
 

 Key: SPARK-8823
 URL: https://issues.apache.org/jira/browse/SPARK-8823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Assignee: Manoj Kumar
Priority: Minor
 Fix For: 1.5.0


 Currently we iterate over indices and values of both the sparse vectors that 
 can be vectorized in NumPy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8821) The ec2 script doesn't run on python 3 with an utf8 env

2015-07-07 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-8821.
--
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.2

Issue resolved by pull request 7215
[https://github.com/apache/spark/pull/7215]

 The ec2 script doesn't run on python 3 with an utf8 env
 ---

 Key: SPARK-8821
 URL: https://issues.apache.org/jira/browse/SPARK-8821
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
 Environment: Archlinux, UTF8 LANG env
Reporter: Simon Hafner
 Fix For: 1.4.2, 1.5.0


 Otherwise, in case of a UTF-8 env setting, the script will crash with:
  - Downloading boto...
 Traceback (most recent call last):
   File "ec2/spark_ec2.py", line 148, in <module>
     setup_external_libs(external_libs)
   File "ec2/spark_ec2.py", line 128, in setup_external_libs
     if hashlib.md5(tar.read()).hexdigest() != lib["md5"]:
   File "/usr/lib/python3.4/codecs.py", line 319, in decode
     (result, consumed) = self._buffer_decode(data, self.errors, final)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: 
 invalid start byte



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8873) Support cleaning up shuffle files for drivers launched with Mesos

2015-07-07 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-8873:
---

 Summary: Support cleaning up shuffle files for drivers launched 
with Mesos
 Key: SPARK-8873
 URL: https://issues.apache.org/jira/browse/SPARK-8873
 Project: Spark
  Issue Type: Improvement
Reporter: Timothy Chen


With dynamic allocation enabled with Mesos, drivers can launch with shuffle 
data cached in the external shuffle service.
However, there is no reliable way to let the shuffle service clean up the 
shuffle data when the driver exits, since it may crash before it notifies the 
shuffle service and shuffle data will be cached forever.
We need to implement a reliable way to detect driver termination and clean up 
shuffle data accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8871) Add maximal frequent itemsets filter in Spark MLib FPGrowth

2015-07-07 Thread Jonathan Svirsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Svirsky updated SPARK-8871:

Description: Maximal frequent itemsets can be extracted as all 
root-to-leaf paths (sets) from FP-Trees.  (was: Maximal frequent itemsets can be 
extracted as all root-to-leaf paths from FP-Trees)

 Add maximal frequent itemsets filter in Spark MLib FPGrowth
 ---

 Key: SPARK-8871
 URL: https://issues.apache.org/jira/browse/SPARK-8871
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Jonathan Svirsky

 Maximal frequent itemsets can be extracted as all root-to-leaf paths (sets) 
 from FP-Trees.
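 A toy, hedged sketch of collecting root-to-leaf paths from a tree (illustrative 
 only; this is not MLlib's FP-Tree code, and the Node type is made up):
 {code}
 case class Node(item: String, children: List[Node] = Nil)

 // Every root-to-leaf path, each listed from the root down.
 def rootToLeafPaths(node: Node, prefix: List[String] = Nil): List[List[String]] = {
   val path = prefix :+ node.item
   if (node.children.isEmpty) List(path)
   else node.children.flatMap(child => rootToLeafPaths(child, path))
 }

 val tree = Node("a", List(Node("b", List(Node("c"))), Node("d")))
 rootToLeafPaths(tree)  // List(List(a, b, c), List(a, d))
 {code}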



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8872) Improve FPGrowthSuite with equivalent R code

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8872:
-
Description: In `FPGrowthSuite`, we only tested output with minSupport 0.5, 
where the expected output is hard-coded. We can add equivalent R code using the 
arules package to generate the expected output for validation purposes, similar to 
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98
 and the test code in https://github.com/apache/spark/pull/7005.  (was: In 
`FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected 
output is hard-coded. We can add equivalent R code to generate the expected 
output for validation purposes, similar to 
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98.)

 Improve FPGrowthSuite with equivalent R code
 

 Key: SPARK-8872
 URL: https://issues.apache.org/jira/browse/SPARK-8872
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
   Original Estimate: 3h
  Remaining Estimate: 3h

 In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the 
 expected output is hard-coded. We can add equivalent R code using the arules 
 package to generate the expected output for validation purposes, similar to 
 https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98
  and the test code in https://github.com/apache/spark/pull/7005.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns

2015-07-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8868:

Target Version/s: 1.5.0

 SqlSerializer2 can go into infinite loop when row consists only of NullType 
 columns
 ---

 Key: SPARK-8868
 URL: https://issues.apache.org/jira/browse/SPARK-8868
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.5.0
Reporter: Josh Rosen
Assignee: Yin Huai
Priority: Minor

 The following SQL query will cause an infinite loop in SqlSerializer2 code:
 {code}
 val df = sqlContext.sql("select null where 1 = 1")
 df.unionAll(df).sort("_c0").collect()
 {code}
 The same problem occurs if we add more null-literals, but does not occur as 
 long as there is a column of any other type (e.g. {{select 1, null where 1 == 
 1}}).
 I think that what's happening here is that if you have a row that consists 
 only of columns of NullType (not columns of other types which happen to only 
 contain null values, but only columns of null literals), SqlSerializer will 
 end up writing / reading no data for rows of this type.  Since the 
 deserialization stream will never try to read any data but nevertheless will 
 be able to return an empty row, DeserializationStream.asIterator will go into 
 an infinite loop since there will never be a read to trigger an EOF exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8868:
---

Assignee: Yin Huai  (was: Apache Spark)

 SqlSerializer2 can go into infinite loop when row consists only of NullType 
 columns
 ---

 Key: SPARK-8868
 URL: https://issues.apache.org/jira/browse/SPARK-8868
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.5.0
Reporter: Josh Rosen
Assignee: Yin Huai
Priority: Minor

 The following SQL query will cause an infinite loop in SqlSerializer2 code:
 {code}
 val df = sqlContext.sql("select null where 1 = 1")
 df.unionAll(df).sort("_c0").collect()
 {code}
 The same problem occurs if we add more null-literals, but does not occur as 
 long as there is a column of any other type (e.g. {{select 1, null where 1 == 
 1}}).
 I think that what's happening here is that if you have a row that consists 
 only of columns of NullType (not columns of other types which happen to only 
 contain null values, but only columns of null literals), SqlSerializer will 
 end up writing / reading no data for rows of this type.  Since the 
 deserialization stream will never try to read any data but nevertheless will 
 be able to return an empty row, DeserializationStream.asIterator will go into 
 an infinite loop since there will never be a read to trigger an EOF exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8868:
---

Assignee: Apache Spark  (was: Yin Huai)

 SqlSerializer2 can go into infinite loop when row consists only of NullType 
 columns
 ---

 Key: SPARK-8868
 URL: https://issues.apache.org/jira/browse/SPARK-8868
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.5.0
Reporter: Josh Rosen
Assignee: Apache Spark
Priority: Minor

 The following SQL query will cause an infinite loop in SqlSerializer2 code:
 {code}
 val df = sqlContext.sql("select null where 1 = 1")
 df.unionAll(df).sort("_c0").collect()
 {code}
 The same problem occurs if we add more null-literals, but does not occur as 
 long as there is a column of any other type (e.g. {{select 1, null where 1 == 
 1}}).
 I think that what's happening here is that if you have a row that consists 
 only of columns of NullType (not columns of other types which happen to only 
 contain null values, but only columns of null literals), SqlSerializer will 
 end up writing / reading no data for rows of this type.  Since the 
 deserialization stream will never try to read any data but nevertheless will 
 be able to return an empty row, DeserializationStream.asIterator will go into 
 an infinite loop since there will never be a read to trigger an EOF exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns

2015-07-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8868:

Shepherd: Josh Rosen

 SqlSerializer2 can go into infinite loop when row consists only of NullType 
 columns
 ---

 Key: SPARK-8868
 URL: https://issues.apache.org/jira/browse/SPARK-8868
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.5.0
Reporter: Josh Rosen
Assignee: Yin Huai
Priority: Minor

 The following SQL query will cause an infinite loop in SqlSerializer2 code:
 {code}
 val df = sqlContext.sql("select null where 1 = 1")
 df.unionAll(df).sort("_c0").collect()
 {code}
 The same problem occurs if we add more null-literals, but does not occur as 
 long as there is a column of any other type (e.g. {{select 1, null where 1 == 
 1}}).
 I think that what's happening here is that if you have a row that consists 
 only of columns of NullType (not columns of other types which happen to only 
 contain null values, but only columns of null literals), SqlSerializer will 
 end up writing / reading no data for rows of this type.  Since the 
 deserialization stream will never try to read any data but nevertheless will 
 be able to return an empty row, DeserializationStream.asIterator will go into 
 an infinite loop since there will never be a read to trigger an EOF exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark

2015-07-07 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617015#comment-14617015
 ] 

Xiangrui Meng edited comment on SPARK-6485 at 7/7/15 5:20 PM:
--

[~mwdus...@us.ibm.com] Thanks for working on this! Any ETA? Please keep the 
first PR minimal and re-use code from Scala. We can split this JIRA into small 
ones if necessary.


was (Author: mengxr):
[~mwdus...@us.ibm.com] Thanks for working on this! Any ETA?

 Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
 --

 Key: SPARK-6485
 URL: https://issues.apache.org/jira/browse/SPARK-6485
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng

 We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in 
 PySpark. Internally, we can use DataFrames for serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8744) StringIndexerModel should have public constructor

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8744:
-
Target Version/s: 1.5.0

 StringIndexerModel should have public constructor
 -

 Key: SPARK-8744
 URL: https://issues.apache.org/jira/browse/SPARK-8744
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter
   Original Estimate: 48h
  Remaining Estimate: 48h

 It would be helpful to allow users to pass a pre-computed index to create an 
 indexer, rather than always going through StringIndexer to create the model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8870) Use SQLContext.getOrCreate in model save/load

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8870:
-
Remaining Estimate: 2h
 Original Estimate: 2h

 Use SQLContext.getOrCreate in model save/load
 -

 Key: SPARK-8870
 URL: https://issues.apache.org/jira/browse/SPARK-8870
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
   Original Estimate: 2h
  Remaining Estimate: 2h

 In much of the model save/load code, we use `new SQLContext(sc)` to create a 
 SQLContext. This could be replaced by `SQLContext.getOrCreate(sc)` to reuse an 
 existing context.
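 A minimal sketch of the suggested change (call sites vary; the surrounding method 
 is hypothetical, only the getOrCreate pattern is the point):
 {code}
 import org.apache.spark.SparkContext
 import org.apache.spark.sql.SQLContext

 def saveModel(sc: SparkContext /*, model, path, ... */): Unit = {
   // Before: val sqlContext = new SQLContext(sc)   // a fresh context on every call
   val sqlContext = SQLContext.getOrCreate(sc)      // reuses an already-active context
   // ... build the metadata/data DataFrames with sqlContext and write them out ...
 }
 {code}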



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns

2015-07-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616933#comment-14616933
 ] 

Apache Spark commented on SPARK-8868:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/7262

 SqlSerializer2 can go into infinite loop when row consists only of NullType 
 columns
 ---

 Key: SPARK-8868
 URL: https://issues.apache.org/jira/browse/SPARK-8868
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.5.0
Reporter: Josh Rosen
Assignee: Yin Huai
Priority: Minor

 The following SQL query will cause an infinite loop in SqlSerializer2 code:
 {code}
 val df = sqlContext.sql("select null where 1 = 1")
 df.unionAll(df).sort("_c0").collect()
 {code}
 The same problem occurs if we add more null-literals, but does not occur as 
 long as there is a column of any other type (e.g. {{select 1, null where 1 == 
 1}}).
 I think that what's happening here is that if you have a row that consists 
 only of columns of NullType (not columns of other types which happen to only 
 contain null values, but only columns of null literals), SqlSerializer will 
 end up writing / reading no data for rows of this type.  Since the 
 deserialization stream will never try to read any data but nevertheless will 
 be able to return an empty row, DeserializationStream.asIterator will go into 
 an infinite loop since there will never be a read to trigger an EOF exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6485:
-
Assignee: (was: Manoj Kumar)

 Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
 --

 Key: SPARK-6485
 URL: https://issues.apache.org/jira/browse/SPARK-6485
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng

 We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in 
 PySpark. Internally, we can use DataFrames for serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8445:
-
Description: 
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter 
task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0]
 rather than a medium/big feature. Based on our experience, mixing the 
development process with a big feature usually causes long delay in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add starter label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment LGTM. For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC].
 We only include umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* LDA improvements (SPARK-5572)
* Log-linear model for survival analysis (SPARK-8518)
* Improve GLM's scalability on number of features (SPARK-8520)
* Tree and ensembles: Move + cleanup code (SPARK-7131), provide class 
probabilities (SPARK-3727), feature importance (SPARK-5133)
* Improve GMM scalability and stability (SPARK-7206)
* Frequent pattern mining improvements (SPARK-6487)
* R-like stats for ML models (SPARK-7674)
* Generalize classification threshold to multiclass (SPARK-8069)
* A/B testing (SPARK-3147)

h2. Pipeline API

* more feature transformers (SPARK-8521)
* k-means (SPARK-7879)
* naive Bayes (SPARK-8600)
* TrainValidationSplit for tuning (SPARK-8484)
* Isotonic regression (SPARK-8671)

h2. Model persistence

* more PMML export (SPARK-8545)
* model save/load (SPARK-4587)
* pipeline persistence (SPARK-6725)

h2. Python API for ML

* List of issues identified during Spark 1.4 QA: (SPARK-7536)
* Python API for streaming ML algorithms (SPARK-3258)
* Add missing model methods (SPARK-8633)

h2. SparkR API for ML

* ML Pipeline API in SparkR (SPARK-6805)
* model.matrix for DataFrames (SPARK-6823)

h2. Documentation

* [Search for documentation improvements | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]

  was:
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list 

[jira] [Created] (SPARK-8871) Add maximal frequent itemsets filter in Spark MLib FPGrowth

2015-07-07 Thread Jonathan Svirsky (JIRA)
Jonathan Svirsky created SPARK-8871:
---

 Summary: Add maximal frequent itemsets filter in Spark MLib 
FPGrowth
 Key: SPARK-8871
 URL: https://issues.apache.org/jira/browse/SPARK-8871
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Jonathan Svirsky


Maximal frequent itemsets can be extracted as all root-to-leaf paths from 
FP-Trees.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8711) Add additional methods to JavaModel wrappers in trees

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-8711.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Add additional methods to JavaModel wrappers in trees
 -

 Key: SPARK-8711
 URL: https://issues.apache.org/jira/browse/SPARK-8711
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar
Assignee: Manoj Kumar
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark

2015-07-07 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617015#comment-14617015
 ] 

Xiangrui Meng commented on SPARK-6485:
--

[~mwdus...@us.ibm.com] Thanks for working on this! Any ETA?

 Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
 --

 Key: SPARK-6485
 URL: https://issues.apache.org/jira/browse/SPARK-6485
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng

 We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in 
 PySpark. Internally, we can use DataFrames for serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8445:
-
Description: 
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter 
task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0]
 rather than a medium/big feature. Based on our experience, mixing the 
development process with a big feature usually causes long delay in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add starter label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment LGTM. For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC].
 We only include umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* LDA improvements (SPARK-5572)
* Log-linear model for survival analysis (SPARK-8518)
* Improve GLM's scalability on number of features (SPARK-8520)
* Tree and ensembles: Move + cleanup code (SPARK-7131), provide class 
probabilities (SPARK-3727), feature importance (SPARK-5133)
* Improve GMM scalability and stability (SPARK-7206)
* Frequent pattern mining improvements (SPARK-7211)
* R-like stats for ML models (SPARK-7674)
* Generalize classification threshold to multiclass (SPARK-8069)
* A/B testing (SPARK-3147)

h2. Pipeline API

* more feature transformers (SPARK-8521)
* k-means (SPARK-7879)
* naive Bayes (SPARK-8600)
* TrainValidationSplit for tuning (SPARK-8484)
* Isotonic regression (SPARK-8671)

h2. Model persistence

* more PMML export (SPARK-8545)
* model save/load (SPARK-4587)
* pipeline persistence (SPARK-6725)

h2. Python API for ML

* List of issues identified during Spark 1.4 QA: (SPARK-7536)
* Python API for streaming ML algorithms (SPARK-3258)
* Add missing model methods (SPARK-8633)

h2. SparkR API for ML

* ML Pipeline API in SparkR (SPARK-6805)
* model.matrix for DataFrames (SPARK-6823)

h2. Documentation

* [Search for documentation improvements | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]

  was:
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list 

[jira] [Created] (SPARK-8872) Improve FPGrowthSuite with equivalent R code

2015-07-07 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8872:


 Summary: Improve FPGrowthSuite with equivalent R code
 Key: SPARK-8872
 URL: https://issues.apache.org/jira/browse/SPARK-8872
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Priority: Minor


In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the 
expected output is hard-coded. We can add equivalent R code to generate the 
expected output for validation purposes, similar to 
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2

2015-07-07 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616924#comment-14616924
 ] 

Shivaram Venkataraman commented on SPARK-8596:
--

You can test this by launching a new cluster with a command that looks like 

{code}
./spark-ec2 -s 2 -t r3.xlarge -i <pem> -k <key> --spark-ec2-git-repo 
https://github.com/koaning/spark-ec2 --spark-ec2-git-branch rstudio-install 
launch rstudio-test
{code}

This cluster setup will now use the spark-ec2 scripts from your repo while 
setting things up. Once you think it's good, you can open a PR on 
github.com/mesos/spark-ec2

 Install and configure RStudio server on Spark EC2
 -

 Key: SPARK-8596
 URL: https://issues.apache.org/jira/browse/SPARK-8596
 Project: Spark
  Issue Type: Improvement
  Components: EC2, SparkR
Reporter: Shivaram Venkataraman

 This will make it convenient for R users to use SparkR from their browsers 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType

2015-07-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8866:
--

 Summary: Use 1 microsecond (us) precision for TimestampType
 Key: SPARK-8866
 URL: https://issues.apache.org/jira/browse/SPARK-8866
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


100ns is slightly weird to compute. Let's use 1us to be more consistent with 
other systems (e.g. Postgres) and less error prone.
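
For illustration, a hedged sketch (assumed helper names, positive timestamps only; 
not necessarily how TimestampType will store values) of keeping a timestamp as a 
Long count of microseconds since the epoch:

{code}
import java.sql.Timestamp

def toMicros(t: Timestamp): Long =
  t.getTime * 1000L + (t.getNanos % 1000000) / 1000   // millis -> micros, plus the sub-millisecond part

def toTimestamp(us: Long): Timestamp = {
  val ts = new Timestamp(us / 1000L)                  // integral milliseconds
  ts.setNanos(((us % 1000000L) * 1000L).toInt)        // full sub-second fraction, in nanos
  ts
}
{code}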






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8726) Wrong spark.executor.memory when using different EC2 master and worker machine types

2015-07-07 Thread Stefano Parmesan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefano Parmesan closed SPARK-8726.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

 Wrong spark.executor.memory when using different EC2 master and worker 
 machine types
 

 Key: SPARK-8726
 URL: https://issues.apache.org/jira/browse/SPARK-8726
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Stefano Parmesan
 Fix For: 1.4.0


 _(this is a mirror of 
 [MESOS-2985|https://issues.apache.org/jira/browse/MESOS-2985])_
 By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, 
 master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32];
  when using the same instance type for master and workers you will not 
 notice, but when using different ones (which makes sense, as the master 
 cannot be a spot instance, and using a big machine for the master would be a 
 waste of resources) the default amount of memory given to each worker is 
 capped to the amount of RAM available on the master (ex: if you create a 
 cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB 
 RAM), spark.executor.memory will be set to 512MB).
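 A hedged sketch of the capping behaviour described above, using the numbers from 
 the example (this is not the deploy_templates.py code, and it ignores the further 
 overhead scaling the script applies):
 {code}
 val masterRamMb = 1700   // m1.small master
 val workerRamMb = 7500   // m1.large worker
 val defaultExecutorMemMb = math.min(workerRamMb, masterRamMb)
 // 1700: each worker's executor memory ends up capped by the master's RAM.
 {code}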



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType

2015-07-07 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616290#comment-14616290
 ] 

Yijie Shen commented on SPARK-8866:
---

I'll take this one.

 Use 1 microsecond (us) precision for TimestampType
 --

 Key: SPARK-8866
 URL: https://issues.apache.org/jira/browse/SPARK-8866
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin

 100ns is slightly weird to compute. Let's use 1us to be more consistent with 
 other systems (e.g. Postgres) and less error prone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth

2015-07-07 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616264#comment-14616264
 ] 

hujiayin commented on SPARK-6724:
-

Can I take a look at this issue?

 Model import/export for FPGrowth
 

 Key: SPARK-6724
 URL: https://issues.apache.org/jira/browse/SPARK-6724
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8864) Date/time function and data type design

2015-07-07 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616301#comment-14616301
 ] 

Adrian Wang commented on SPARK-8864:


Just providing the precision of the current design, for your information.

 Date/time function and data type design
 ---

 Key: SPARK-8864
 URL: https://issues.apache.org/jira/browse/SPARK-8864
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0

 Attachments: SparkSQLdatetimeudfs (1).pdf


 Please see the attached design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8864) Date/time function and data type design

2015-07-07 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616299#comment-14616299
 ] 

Adrian Wang commented on SPARK-8864:


no, that's not enough.

 Date/time function and data type design
 ---

 Key: SPARK-8864
 URL: https://issues.apache.org/jira/browse/SPARK-8864
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0

 Attachments: SparkSQLdatetimeudfs (1).pdf


 Please see the attached design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6912) Throw an AnalysisException when unsupported Java Map&lt;K,V&gt; types used in Hive UDF

2015-07-07 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-6912:

Summary: Throw an AnalysisException when unsupported Java Map&lt;K,V&gt; types 
used in Hive UDF  (was: Support Map&lt;K,V&gt; as a return type in Hive UDF)

 Throw an AnalysisException when unsupported Java Map&lt;K,V&gt; types used in Hive 
 UDF
 

 Key: SPARK-6912
 URL: https://issues.apache.org/jira/browse/SPARK-6912
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Takeshi Yamamuro

 The current implementation can't handle Map&lt;K,V&gt; as a return type in Hive 
 UDF. 
 We assume a UDF like the one below:
 public class UDFToIntIntMap extends UDF {
   public Map&lt;Integer, Integer&gt; evaluate(Object o);
 }
 Hive supports this type; see the links below for details:
 https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163
 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib

2015-07-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616354#comment-14616354
 ] 

Apache Spark commented on SPARK-6487:
-

User 'zhangjiajin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7258

 Add sequential pattern mining algorithm to Spark MLlib
 --

 Key: SPARK-6487
 URL: https://issues.apache.org/jira/browse/SPARK-6487
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Zhang JiaJin
Assignee: Zhang JiaJin

 [~mengxr] [~zhangyouhua]
 Sequential pattern mining is an important branch of pattern mining. In past 
 work we used sequence mining (mainly the PrefixSpan algorithm) to find 
 telecommunication signaling sequence patterns and achieved good results. But 
 once the data gets too large, the running time becomes too long and can no 
 longer meet the service requirements. We are ready to implement the PrefixSpan 
 algorithm in Spark and apply it to our subsequent work. 
 The related Paper: 
 PrefixSpan: 
 Pei, Jian, et al. Mining sequential patterns by pattern-growth: The 
 prefixspan approach. Knowledge and Data Engineering, IEEE Transactions on 
 16.11 (2004): 1424-1440.
 Parallel Algorithm: 
 Cong, Shengnan, Jiawei Han, and David Padua. Parallel mining of closed 
 sequential patterns. Proceedings of the eleventh ACM SIGKDD international 
 conference on Knowledge discovery in data mining. ACM, 2005.
 Distributed Algorithm: 
 Wei, Yong-qing, Dong Liu, and Lin-shan Duan. Distributed PrefixSpan 
 algorithm based on MapReduce. Information Technology in Medicine and 
 Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012.
 Pattern mining and sequential mining Knowledge: 
 Han, Jiawei, et al. Frequent pattern mining: current status and future 
 directions. Data Mining and Knowledge Discovery 15.1 (2007): 55-86.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns

2015-07-07 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8868:
-

 Summary: SqlSerializer2 can go into infinite loop when row 
consists only of NullType columns
 Key: SPARK-8868
 URL: https://issues.apache.org/jira/browse/SPARK-8868
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.5.0
Reporter: Josh Rosen
Priority: Minor


The following SQL query will cause an infinite loop in SqlSerializer2 code:

{code}
val df = sqlContext.sql("select null where 1 = 1")
df.unionAll(df).sort("_c0").collect()
{code}

The same problem occurs if we add more null-literals, but does not occur as 
long as there is a column of any other type (e.g. {{select 1, null where 1 == 
1}}).

I think that what's happening here is that if you have a row that consists only 
of columns of NullType (not columns of other types which happen to only contain 
null values, but only columns of null literals), SqlSerializer will end up 
writing / reading no data for rows of this type.  Since the deserialization 
stream will never try to read any data but nevertheless will be able to return 
an empty row, DeserializationStream.asIterator will go into an infinite loop 
since there will never be a read to trigger an EOF exception.
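
A minimal sketch of the failure mode described above, assuming a simplified row iterator (this is not Spark's actual SqlSerializer2 code): if deserializing a row consumes zero bytes, the EOF that normally ends the iterator is never raised.

{code}
// Hedged sketch, not the real implementation: an iterator that ends only on EOF.
import java.io.EOFException

def rowIterator(readRow: () => Array[Any]): Iterator[Array[Any]] =
  new Iterator[Array[Any]] {
    private var nextRow: Array[Any] = null
    private var eof = false
    override def hasNext: Boolean = {
      if (nextRow == null && !eof) {
        try nextRow = readRow()            // reads zero bytes for NullType-only rows
        catch { case _: EOFException => eof = true }
      }
      !eof
    }
    override def next(): Array[Any] = { val r = nextRow; nextRow = null; r }
  }
// If readRow never touches the underlying stream, EOFException is never thrown,
// so hasNext stays true forever, matching the infinite loop described above.
{code}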



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8864) Date/time function and data type design

2015-07-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616270#comment-14616270
 ] 

Reynold Xin commented on SPARK-8864:


1. Yes - not 100ns. I forgot to remove that paragraph. The table indicates 
using 12 bytes to store interval.

2. That makes sense.
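
A minimal sketch of what a 12-byte interval value could look like, assuming the layout implied by point 1 above (a 4-byte month count plus an 8-byte microsecond count); the field names are illustrative.

{code}
// Hedged sketch: 4 bytes (Int) for months + 8 bytes (Long) for microseconds = 12 bytes.
case class Interval(months: Int, microseconds: Long)

// e.g. 1 year, 2 months, 3 days and 4 hours:
val iv = Interval(14, (3L * 24 + 4) * 60 * 60 * 1000 * 1000)
{code}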


 Date/time function and data type design
 ---

 Key: SPARK-8864
 URL: https://issues.apache.org/jira/browse/SPARK-8864
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0

 Attachments: SparkSQLdatetimeudfs.pdf


 Please see the attached design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6912) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF

2015-07-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616346#comment-14616346
 ] 

Apache Spark commented on SPARK-6912:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/7257

 Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive 
 UDF
 

 Key: SPARK-6912
 URL: https://issues.apache.org/jira/browse/SPARK-6912
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Takeshi Yamamuro

 The current implementation can't handle Map<K,V> as a return type in Hive 
 UDF. 
 Assume a UDF like the one below:
 public class UDFToIntIntMap extends UDF {
 public Map<Integer, Integer> evaluate(Object o);
 }
 Hive supports this type; see the links below for details:
 https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163
 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8867) Show the UDF usage for user.

2015-07-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616357#comment-14616357
 ] 

Apache Spark commented on SPARK-8867:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/7259

 Show the UDF usage for user.
 

 Key: SPARK-8867
 URL: https://issues.apache.org/jira/browse/SPARK-8867
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Cheng Hao

 As Hive does, we should provide a feature for users to show the usage of a 
 UDF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8864) Date/time function and data type design

2015-07-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8864:
---
Attachment: SparkSQLdatetimeudfs (1).pdf

 Date/time function and data type design
 ---

 Key: SPARK-8864
 URL: https://issues.apache.org/jira/browse/SPARK-8864
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0

 Attachments: SparkSQLdatetimeudfs (1).pdf


 Please see the attached design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8864) Date/time function and data type design

2015-07-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8864:
---
Attachment: (was: SparkSQLdatetimeudfs.pdf)

 Date/time function and data type design
 ---

 Key: SPARK-8864
 URL: https://issues.apache.org/jira/browse/SPARK-8864
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0

 Attachments: SparkSQLdatetimeudfs (1).pdf


 Please see the attached design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth

2015-07-07 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616275#comment-14616275
 ] 

hujiayin commented on SPARK-6724:
-

ok, : )

 Model import/export for FPGrowth
 

 Key: SPARK-6724
 URL: https://issues.apache.org/jira/browse/SPARK-6724
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6724) Model import/export for FPGrowth

2015-07-07 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-6724:

Comment: was deleted

(was: Can I take a look at this issue?)

 Model import/export for FPGrowth
 

 Key: SPARK-6724
 URL: https://issues.apache.org/jira/browse/SPARK-6724
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8864) Date/time function and data type design

2015-07-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616272#comment-14616272
 ] 

Reynold Xin commented on SPARK-8864:


I filed https://issues.apache.org/jira/browse/SPARK-8866 to change 
TimestampType itself to 1us precision instead of 100ns.


 Date/time function and data type design
 ---

 Key: SPARK-8864
 URL: https://issues.apache.org/jira/browse/SPARK-8864
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0

 Attachments: SparkSQLdatetimeudfs.pdf


 Please see the attached design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8864) Date/time function and data type design

2015-07-07 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616293#comment-14616293
 ] 

Adrian Wang edited comment on SPARK-8864 at 7/7/15 7:34 AM:


Then we are using a Long for microseconds (us). A Long can be up to 9.2E18, which is more than 
1E8 days. Hive is using a Long for seconds and an int for nanoseconds, but I 
think a single Long here for day-time interval is fine.


was (Author: adrian-wang):
Then we are using a Long for us. Long can be up to 9.2E18, which is more than 
1E11 days. Hive is using a Long for seconds and an int for nanoseconds, but I 
think a single Long here for day-time interval is fine.

 Date/time function and data type design
 ---

 Key: SPARK-8864
 URL: https://issues.apache.org/jira/browse/SPARK-8864
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0

 Attachments: SparkSQLdatetimeudfs (1).pdf


 Please see the attached design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6912) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6912:
---

Assignee: (was: Apache Spark)

 Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive 
 UDF
 

 Key: SPARK-6912
 URL: https://issues.apache.org/jira/browse/SPARK-6912
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Takeshi Yamamuro

 The current implementation can't handle Map<K,V> as a return type in Hive 
 UDF. 
 Assume a UDF like the one below:
 public class UDFToIntIntMap extends UDF {
 public Map<Integer, Integer> evaluate(Object o);
 }
 Hive supports this type; see the links below for details:
 https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163
 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6912) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6912:
---

Assignee: Apache Spark

 Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive 
 UDF
 

 Key: SPARK-6912
 URL: https://issues.apache.org/jira/browse/SPARK-6912
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Takeshi Yamamuro
Assignee: Apache Spark

 The current implementation can't handle Map<K,V> as a return type in Hive 
 UDF. 
 Assume a UDF like the one below:
 public class UDFToIntIntMap extends UDF {
 public Map<Integer, Integer> evaluate(Object o);
 }
 Hive supports this type; see the links below for details:
 https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163
 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6487:
---

Assignee: Zhang JiaJin  (was: Apache Spark)

 Add sequential pattern mining algorithm to Spark MLlib
 --

 Key: SPARK-6487
 URL: https://issues.apache.org/jira/browse/SPARK-6487
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Zhang JiaJin
Assignee: Zhang JiaJin

 [~mengxr] [~zhangyouhua]
 Sequential pattern mining is an important branch of pattern mining. In past 
 work, we used sequence mining (mainly the PrefixSpan algorithm) to find 
 telecommunication signaling sequence patterns and achieved good results. But 
 once the data grows too large, the running time becomes too long and can no 
 longer meet the service requirements. We plan to implement the PrefixSpan 
 algorithm in Spark and apply it to our subsequent work. 
 The related Paper: 
 PrefixSpan: 
 Pei, Jian, et al. Mining sequential patterns by pattern-growth: The 
 prefixspan approach. Knowledge and Data Engineering, IEEE Transactions on 
 16.11 (2004): 1424-1440.
 Parallel Algorithm: 
 Cong, Shengnan, Jiawei Han, and David Padua. Parallel mining of closed 
 sequential patterns. Proceedings of the eleventh ACM SIGKDD international 
 conference on Knowledge discovery in data mining. ACM, 2005.
 Distributed Algorithm: 
 Wei, Yong-qing, Dong Liu, and Lin-shan Duan. Distributed PrefixSpan 
 algorithm based on MapReduce. Information Technology in Medicine and 
 Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012.
 Pattern mining and sequential mining Knowledge: 
 Han, Jiawei, et al. Frequent pattern mining: current status and future 
 directions. Data Mining and Knowledge Discovery 15.1 (2007): 55-86.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8867) Show the UDF usage for user.

2015-07-07 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-8867:


 Summary: Show the UDF usage for user.
 Key: SPARK-8867
 URL: https://issues.apache.org/jira/browse/SPARK-8867
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Cheng Hao


As Hive does, we should provide a feature for users to show the usage of a 
UDF.
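
For illustration, a sketch of the intended user experience, assuming Spark SQL accepts a Hive-style DESCRIBE FUNCTION statement; the exact statement form in Spark is an assumption here.

{code}
// Hedged sketch of the desired behavior, mirroring Hive's DESCRIBE FUNCTION.
sqlContext.sql("DESCRIBE FUNCTION upper").show()
sqlContext.sql("DESCRIBE FUNCTION EXTENDED upper").show()
{code}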



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6487:
---

Assignee: Apache Spark  (was: Zhang JiaJin)

 Add sequential pattern mining algorithm to Spark MLlib
 --

 Key: SPARK-6487
 URL: https://issues.apache.org/jira/browse/SPARK-6487
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Zhang JiaJin
Assignee: Apache Spark

 [~mengxr] [~zhangyouhua]
 Sequential pattern mining is an important branch of pattern mining. In past 
 work, we used sequence mining (mainly the PrefixSpan algorithm) to find 
 telecommunication signaling sequence patterns and achieved good results. But 
 once the data grows too large, the running time becomes too long and can no 
 longer meet the service requirements. We plan to implement the PrefixSpan 
 algorithm in Spark and apply it to our subsequent work. 
 The related Paper: 
 PrefixSpan: 
 Pei, Jian, et al. Mining sequential patterns by pattern-growth: The 
 prefixspan approach. Knowledge and Data Engineering, IEEE Transactions on 
 16.11 (2004): 1424-1440.
 Parallel Algorithm: 
 Cong, Shengnan, Jiawei Han, and David Padua. Parallel mining of closed 
 sequential patterns. Proceedings of the eleventh ACM SIGKDD international 
 conference on Knowledge discovery in data mining. ACM, 2005.
 Distributed Algorithm: 
 Wei, Yong-qing, Dong Liu, and Lin-shan Duan. Distributed PrefixSpan 
 algorithm based on MapReduce. Information Technology in Medicine and 
 Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012.
 Pattern mining and sequential mining Knowledge: 
 Han, Jiawei, et al. Frequent pattern mining: current status and future 
 directions. Data Mining and Knowledge Discovery 15.1 (2007): 55-86.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7884) Move block deserialization from BlockStoreShuffleFetcher to ShuffleReader

2015-07-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7884:
-
Assignee: Matt Massie

 Move block deserialization from BlockStoreShuffleFetcher to ShuffleReader
 -

 Key: SPARK-7884
 URL: https://issues.apache.org/jira/browse/SPARK-7884
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matt Massie
Assignee: Matt Massie
 Fix For: 1.5.0


 The current Spark shuffle has some hard-coded assumptions about how shuffle 
 managers will read and write data.
 The BlockStoreShuffleFetcher.fetch method relies on the 
 ShuffleBlockFetcherIterator that assumes shuffle data is written using the 
 BlockManager.getDiskWriter method and doesn't allow for customization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth

2015-07-07 Thread Hrishikesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616267#comment-14616267
 ] 

Hrishikesh commented on SPARK-6724:
---

[~hujiayin] sure! 

 Model import/export for FPGrowth
 

 Key: SPARK-6724
 URL: https://issues.apache.org/jira/browse/SPARK-6724
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8864) Date/time function and data type design

2015-07-07 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616268#comment-14616268
 ] 

Adrian Wang commented on SPARK-8864:


Thanks for the design.

Two comments:
1. If an IntervalType value is in year-month format, we cannot use 100ns to 
represent it. Hive uses two internal types to handle year-month and day-time 
separately.
2. When casting TimestampType into StringType, or casting from StringType 
(unless it is an ISO8601 time string that contains timezone info), we should 
also consider the timezone.

 Date/time function and data type design
 ---

 Key: SPARK-8864
 URL: https://issues.apache.org/jira/browse/SPARK-8864
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0

 Attachments: SparkSQLdatetimeudfs.pdf


 Please see the attached design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8867) Show the UDF usage for user.

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8867:
---

Assignee: (was: Apache Spark)

 Show the UDF usage for user.
 

 Key: SPARK-8867
 URL: https://issues.apache.org/jira/browse/SPARK-8867
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Cheng Hao

 As Hive does, we should provide a feature for users to show the usage of a 
 UDF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8867) Show the UDF usage for user.

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8867:
---

Assignee: Apache Spark

 Show the UDF usage for user.
 

 Key: SPARK-8867
 URL: https://issues.apache.org/jira/browse/SPARK-8867
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Cheng Hao
Assignee: Apache Spark

 As Hive does, we should provide a feature for users to show the usage of a 
 UDF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8851) in Yarn client mode, Client.scala does not login even when credentials are specified

2015-07-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616257#comment-14616257
 ] 

Apache Spark commented on SPARK-8851:
-

User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7255

 in Yarn client mode, Client.scala does not login even when credentials are 
 specified
 

 Key: SPARK-8851
 URL: https://issues.apache.org/jira/browse/SPARK-8851
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan

 [#6051|https://github.com/apache/spark/pull/6051] added support for passing 
 the credentials configuration from SparkConf, so the client mode works fine. 
 This, however, created an issue where the Client.scala class does not log in 
 to the KDC, thus requiring a kinit before running in client mode.
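
 A hedged sketch of the kind of login the Client could perform itself so that a 
 prior kinit is not required; the SparkConf key names (spark.yarn.principal, 
 spark.yarn.keytab) are assumptions for illustration, and this is not the actual patch.
{code}
// Hedged sketch only: log in from a keytab using credentials carried in SparkConf.
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.SparkConf

def loginFromConf(conf: SparkConf): Unit = {
  for (principal <- conf.getOption("spark.yarn.principal");
       keytab <- conf.getOption("spark.yarn.keytab")) {
    UserGroupInformation.loginUserFromKeytab(principal, keytab)  // avoids external kinit
  }
}
{code}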



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8851) in Yarn client mode, Client.scala does not login even when credentials are specified

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8851:
---

Assignee: (was: Apache Spark)

 in Yarn client mode, Client.scala does not login even when credentials are 
 specified
 

 Key: SPARK-8851
 URL: https://issues.apache.org/jira/browse/SPARK-8851
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan

 [#6051|https://github.com/apache/spark/pull/6051] added support for passing 
 the credentials configuration from SparkConf, so the client mode works fine. 
 This, however, created an issue where the Client.scala class does not log in 
 to the KDC, thus requiring a kinit before running in client mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8851) in Yarn client mode, Client.scala does not login even when credentials are specified

2015-07-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8851:
---

Assignee: Apache Spark

 in Yarn client mode, Client.scala does not login even when credentials are 
 specified
 

 Key: SPARK-8851
 URL: https://issues.apache.org/jira/browse/SPARK-8851
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan
Assignee: Apache Spark

 [#6051|https://github.com/apache/spark/pull/6051] added support for passing 
 the credentials configuration from SparkConf, so the client mode works fine. 
 This, however, created an issue where the Client.scala class does not log in 
 to the KDC, thus requiring a kinit before running in client mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType

2015-07-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8866:
---
Assignee: Yijie Shen

 Use 1 microsecond (us) precision for TimestampType
 --

 Key: SPARK-8866
 URL: https://issues.apache.org/jira/browse/SPARK-8866
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Yijie Shen

 100ns is slightly weird to compute. Let's use 1us to be more consistent with 
 other systems (e.g. Postgres) and less error prone.
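
 A small sketch of why 1us is the easier unit, assuming a conversion from 
 java.sql.Timestamp to a Long internal value; the helper names are illustrative.
{code}
// Hedged sketch: converting java.sql.Timestamp to a Long in 1us units is a
// straightforward millis-to-micros step plus the sub-millisecond nanos.
import java.sql.Timestamp

def toMicros(t: Timestamp): Long =
  t.getTime * 1000L + (t.getNanos % 1000000) / 1000

// The 100ns variant needs an extra, easy-to-get-wrong factor of 10:
def to100Nanos(t: Timestamp): Long =
  t.getTime * 10000L + (t.getNanos % 1000000) / 100
{code}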



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8864) Date/time function and data type design

2015-07-07 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616293#comment-14616293
 ] 

Adrian Wang commented on SPARK-8864:


Then we are using a Long for microseconds (us). A Long can be up to 9.2E18, which is more than 
1E11 days. Hive is using a Long for seconds and an int for nanoseconds, but I 
think a single Long here for day-time interval is fine.
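
A quick check of the arithmetic (the edited version of this comment later corrects the figure to more than 1E8 days):

{code}
val maxMicros = Long.MaxValue            // ~ 9.2E18 microseconds
val microsPerDay = 86400L * 1000000L     // 8.64E10 microseconds per day
val days = maxMicros / microsPerDay      // ~ 1.07E8 days, i.e. more than 1E8
{code}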

 Date/time function and data type design
 ---

 Key: SPARK-8864
 URL: https://issues.apache.org/jira/browse/SPARK-8864
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0

 Attachments: SparkSQLdatetimeudfs (1).pdf


 Please see the attached design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8685) dataframe left joins are not working as expected in pyspark

2015-07-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616340#comment-14616340
 ] 

Apache Spark commented on SPARK-8685:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7256

 dataframe left joins are not working as expected in pyspark
 ---

 Key: SPARK-8685
 URL: https://issues.apache.org/jira/browse/SPARK-8685
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 1.4.0
 Environment: ubuntu 14.04
Reporter: axel dahl
Assignee: Davies Liu

 I have the following code:
 {code}
 from pyspark import SQLContext
 d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
 {'name':'alice', 'country': 'jpn', 'age': 2}, 
 {'name':'carol', 'country': 'ire', 'age': 3}]
 d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
 {'name':'carol', 'country': 'ire', 'colour':'green'}]
 r1 = sc.parallelize(d1)
 r2 = sc.parallelize(d2)
 sqlContext = SQLContext(sc)
 df1 = sqlContext.createDataFrame(d1)
 df2 = sqlContext.createDataFrame(d2)
 df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
 'left_outer').collect()
 {code}
 When I run it I get the following (notice that in the first row, all join keys 
 are taken from the right side and so are blanked out):
 {code}
 [Row(age=2, country=None, name=None, colour=None, country=None, name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
 name=u'alice')]
 {code}
 I would expect to get (though ideally without duplicate columns):
 {code}
 [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
 name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
 name=u'alice')]
 {code}
 The workaround for now is this rather clunky piece of code:
 {code}
 df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
 'name2').withColumnRenamed('country', 'country2')
 df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 
 'left_outer').collect()
 {code}
 Also, {{.show()}} works
 {code}
 sqlContext = SQLContext(sc)
 df1 = sqlContext.createDataFrame(d1)
 df2 = sqlContext.createDataFrame(d2)
 df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
 'left_outer').show()
 +---+-------+-----+------+-------+-----+
 |age|country| name|colour|country| name|
 +---+-------+-----+------+-------+-----+
 |  3|    ire|carol| green|    ire|carol|
 |  2|    jpn|alice|  null|   null| null|
 |  1|    usa|  bob|   red|    usa|  bob|
 +---+-------+-----+------+-------+-----+
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8864) Date/time function and data type design

2015-07-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616296#comment-14616296
 ] 

Reynold Xin commented on SPARK-8864:


Are you suggesting we use a single 8 byte long to store both the number of 
months and the number of microseconds?

 Date/time function and data type design
 ---

 Key: SPARK-8864
 URL: https://issues.apache.org/jira/browse/SPARK-8864
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0

 Attachments: SparkSQLdatetimeudfs (1).pdf


 Please see the attached design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8223) math function: shiftleft

2015-07-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8223:
-
Assignee: Tarek Auel  (was: zhichao-li)

 math function: shiftleft
 

 Key: SPARK-8223
 URL: https://issues.apache.org/jira/browse/SPARK-8223
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Tarek Auel
 Fix For: 1.5.0


 shiftleft(INT a)
 shiftleft(BIGINT a)
 Bitwise left shift (as of Hive 1.2.0). Returns int for tinyint, smallint and 
 int a. Returns bigint for bigint a.
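
 For illustration, a plain-Scala sketch of the return-type rule described above 
 (the Hive function also takes the shift distance as a second argument):
{code}
// Hedged illustration only: int input stays int, bigint (Long) input stays bigint.
def shiftLeft(a: Int, b: Int): Int = a << b
def shiftLeft(a: Long, b: Int): Long = a << b

shiftLeft(2, 3)     // 16
shiftLeft(2L, 40)   // 2199023255552L
{code}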



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8224) math function: shiftright

2015-07-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8224:
-
Assignee: Tarek Auel  (was: zhichao-li)

 math function: shiftright
 -

 Key: SPARK-8224
 URL: https://issues.apache.org/jira/browse/SPARK-8224
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Tarek Auel
 Fix For: 1.5.0


 shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a)   
 Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, 
 smallint and int a. Returns bigint for bigint a.
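
 For illustration, a plain-Scala sketch of an unsigned (logical) right shift, 
 which fills with zero bits instead of the sign bit:
{code}
// Hedged illustration only.
def shiftRightUnsigned(a: Int, b: Int): Int = a >>> b
def shiftRightUnsigned(a: Long, b: Int): Long = a >>> b

shiftRightUnsigned(-8, 1)   // 2147483644 (zero-filled); compare -8 >> 1 == -4
{code}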



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params

2015-07-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8865:
-
Description: (was: [~guowei2] Again, please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. 
You're not setting the fields in your JIRAs as requested.)

[~guowei2] Again, please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. 
You're not setting the fields in your JIRAs as requested.

 Fix bug:  init SimpleConsumerConfig with kafka params
 -

 Key: SPARK-8865
 URL: https://issues.apache.org/jira/browse/SPARK-8865
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: guowei
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params

2015-07-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8865:
-
 Priority: Minor  (was: Major)
Fix Version/s: (was: 1.4.0)
  Description: [~guowei2] Again, please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. 
You're not setting the fields in your JIRAs as requested.

 Fix bug:  init SimpleConsumerConfig with kafka params
 -

 Key: SPARK-8865
 URL: https://issues.apache.org/jira/browse/SPARK-8865
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: guowei
Priority: Minor

 [~guowei2] Again, please read 
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
 first. You're not setting the fields in your JIRAs as requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8674) [WIP] 2-sample, 2-sided Kolmogorov Smirnov Test Implementation

2015-07-07 Thread Jose Cambronero (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Cambronero updated SPARK-8674:
---
Summary: [WIP] 2-sample, 2-sided Kolmogorov Smirnov Test Implementation  
(was: 2-sample, 2-sided Kolmogorov Smirnov Test Implementation)

 [WIP] 2-sample, 2-sided Kolmogorov Smirnov Test Implementation
 --

 Key: SPARK-8674
 URL: https://issues.apache.org/jira/browse/SPARK-8674
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Jose Cambronero
Priority: Minor

 We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov 
 test for 2 RDD[Double]. The calculation provides a test for the null 
 hypothesis that both samples come from the same probability distribution. The 
 implementation seeks to minimize the shuffles necessary. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7422) Add argmax to Vector, SparseVector

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7422:
-
Shepherd: Xiangrui Meng

 Add argmax to Vector, SparseVector
 --

 Key: SPARK-7422
 URL: https://issues.apache.org/jira/browse/SPARK-7422
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 DenseVector has an argmax method which is currently private to Spark.  It 
 would be nice to add that method to Vector and SparseVector.  Adding it to 
 SparseVector would require being careful about handling the inactive elements 
 correctly and efficiently.
 We should make argmax public and add unit tests.
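
 A hedged sketch of the inactive-element handling mentioned above (this is not 
 Spark's implementation): the implicit zeros of a SparseVector can themselves be 
 the maximum when every stored value is negative.
{code}
// Hedged sketch: argmax over a sparse vector of `size` elements where only
// (indices, values) are stored and missing entries are implicit 0.0.
def sparseArgmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
  if (size == 0) return -1
  var bestIdx = -1
  var bestVal = Double.NegativeInfinity
  var i = 0
  while (i < values.length) {                  // scan the stored (active) entries
    if (values(i) > bestVal) { bestVal = values(i); bestIdx = indices(i) }
    i += 1
  }
  if (values.length < size && 0.0 > bestVal) { // an inactive (implicit 0.0) entry wins
    val stored = indices.toSet
    bestIdx = (0 until size).find(j => !stored.contains(j)).get
  }
  bestIdx
}
{code}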



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8484) Add TrainValidationSplit to ml.tuning

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8484:
-
Priority: Critical  (was: Major)

 Add TrainValidationSplit to ml.tuning
 -

 Key: SPARK-8484
 URL: https://issues.apache.org/jira/browse/SPARK-8484
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: Martin Zapletal
Priority: Critical

 Add TrainValidationSplit for hyper-parameter tuning. It randomly splits the 
 input dataset into train and validation sets and uses an evaluation metric on 
 the validation set to select the best model. It should be similar to 
 CrossValidator, but simpler and less expensive.
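
 A hedged sketch of the selection loop described above; the Estimator and 
 Evaluator details are elided, and the refit-on-the-full-dataset step is an 
 assumption, not a spec of the final API.
{code}
// Hedged sketch only: split once, fit each candidate on train, score on validation,
// then refit the best candidate on all of the data.
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

def trainValidationSplit[M](
    data: DataFrame,
    trainRatio: Double,
    grid: Seq[ParamMap],
    fit: (DataFrame, ParamMap) => M,
    evaluate: (M, DataFrame) => Double): M = {
  val Array(train, validation) = data.randomSplit(Array(trainRatio, 1 - trainRatio))
  val scored = grid.map(params => (params, evaluate(fit(train, params), validation)))
  val (bestParams, _) = scored.maxBy(_._2)
  fit(data, bestParams)
}
{code}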



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8627) ALS model predict error

2015-07-07 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617284#comment-14617284
 ] 

Xiangrui Meng commented on SPARK-8627:
--

The code looks okay to me. Which Spark version did you use, and Scala version?

 ALS model predict error
 ---

 Key: SPARK-8627
 URL: https://issues.apache.org/jira/browse/SPARK-8627
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade

 /**
  * Created by subhod lagade on 25/06/15.
  */
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.StreamingContext._
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming._;
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import java.io.BufferedReader;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.io.PrintStream;
 import java.net.ServerSocket;
 import java.net.Socket;
 import java.util.Properties;
 import org.apache.spark.mllib.recommendation.ALS
 import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
 import org.apache.spark.mllib.recommendation.Rating
 object SparkStreamKafka {
   def main(args: Array[String]) {
   
 val conf = new SparkConf().setAppName("Simple Application");
 val sc = new SparkContext(conf);
   val data = sc.textFile("/home/appadmin/Disney/data.csv");
   val ratings = data.map(_.split(',') match { case Array(user, product, 
 rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
   
   
   val rank = 3;
   val numIterations = 2;
   val model = ALS.train(ratings, rank, numIterations, 0.01);
   
   val usersProducts = ratings.map { case Rating(user, product, rate) => 
 (user, product) }
   // Build the recommendation model using ALS
   usersProducts.foreach(println)
   val predictions = model.predict(usersProducts)
   }
 }
 /*
 ERROR Message
 [ERROR] 
 /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: 
 error: not enough arguments for method predict: (user: Int, product: 
 Int)Double.
 [INFO] Unspecified value parameter product.
 [INFO]  val predictions = model.predict(usersProducts)
 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8559) Support association rule generation in FPGrowth

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-8559.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7005
[https://github.com/apache/spark/pull/7005]

 Support association rule generation in FPGrowth
 ---

 Key: SPARK-8559
 URL: https://issues.apache.org/jira/browse/SPARK-8559
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guangwen Liu
Assignee: Feynman Liang
 Fix For: 1.5.0


 It will be more useful and practical to include the association rule 
 generation part for real applications, though it is not hard for a user to 
 derive association rules from the frequent itemsets (with their frequencies) 
 output by FP-growth.
 However, how to generate association rules in an efficient way is not widely 
 reported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8872) Improve FPGrowthSuite with equivalent R code

2015-07-07 Thread Kashif Rasul (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617123#comment-14617123
 ] 

Kashif Rasul commented on SPARK-8872:
-

I would like to work on this.

 Improve FPGrowthSuite with equivalent R code
 

 Key: SPARK-8872
 URL: https://issues.apache.org/jira/browse/SPARK-8872
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
   Original Estimate: 3h
  Remaining Estimate: 3h

 In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the 
 expected output is hard-coded. We can add equivalent R code using the arules 
 package to generate the expected output for validation purposes, similar to 
 https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98
  and the test code in https://github.com/apache/spark/pull/7005.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)

2015-07-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617159#comment-14617159
 ] 

Sean Owen commented on SPARK-7917:
--

I'm thinking of the two I mentioned above, in particular maybe SPARK-7503?

 Spark doesn't clean up Application Directories (local dirs) 
 

 Key: SPARK-7917
 URL: https://issues.apache.org/jira/browse/SPARK-7917
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zach Fry
Priority: Minor

 Similar to SPARK-4834. 
 Spark does clean up the cache and lock files in the local dirs, however, it 
 doesn't clean up the actual directories. 
 We have to write custom scripts to go back through the local dirs and find 
 directories that don't contain any files and clear those out. 
 It's a pretty simple repro: 
 Run a job that does some shuffling, wait for the shuffle files to get cleaned 
 up, go and look on disk at spark.local.dir and notice that the directory(s) 
 are still there, but there are no files in them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8840) Float type coercion with hiveContext

2015-07-07 Thread Evgeny SInelnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617267#comment-14617267
 ] 

Evgeny SInelnikov commented on SPARK-8840:
--

I tested it with Spark SQL - the problem does not reproduce there.
I looked into the SparkR sources and found that deserialization for the _float_ 
type is not implemented, but it is for double, and that works:
{code}
 sql(hiveContext, "CREATE TABLE float_table (fl float, db double) row format 
 delimited fields terminated by ','")

 result <- sql(hiveContext, "SELECT * from float_table")

 head(result)
Error in readTypedObject(con, type) :
  Unsupported type for deserialization

 result <- sql(hiveContext, "LOAD DATA INPATH 'data.csv' INTO TABLE 
 float_table")

 result <- sql(hiveContext, "SELECT * from float_table")

 head(result)
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class jobj to a data.frame

 result <- sql(hiveContext, "SELECT db from float_table")

 head(result)
   db
1 1.1
2 2.0
{code}

 Float type coercion with hiveContext
 

 Key: SPARK-8840
 URL: https://issues.apache.org/jira/browse/SPARK-8840
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Evgeny SInelnikov

 Problem with +float+ type coercion on SparkR with hiveContext.
 {code}
  result <- sql(hiveContext, "SELECT offset, percentage from data limit 100")
  show(result)
 DataFrame[offset:float, percentage:float]
  head(result)
 Error in as.data.frame.default(x[[i]], optional = TRUE) :
 cannot coerce class jobj to a data.frame
 {code}
 This issue looks like it already exists (SPARK-2863 - Emulate Hive type 
 coercion in native reimplementations of Hive functions), with the same 
 root cause: the native reimplementations of Hive are incomplete, and not 
 only for functions.
 I used spark 1.4.0 binaries from official site:
 http://spark.apache.org/downloads.html
 And running it on:
 * Hortonworks HDP 2.2.0.0-2041
 * with Hive 0.14
 * with disabled hooks for Application Timeline Servers (ATSHook) in 
 hive-site.xml, commented:
 ** hive.exec.failure.hooks,
 ** hive.exec.post.hooks,
 ** hive.exec.pre.hooks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)

2015-07-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617278#comment-14617278
 ] 

Sean Owen commented on SPARK-7917:
--

Oops I mean executor. At least, I'm looking at Utils.getLocalFile, which 
ultimately calls getOrCreateLocalRootDirsImpl. You can see that on YARN, this 
uses YARN's dir and doesn't delete it on exit (YARN manages it). In the case 
that spark.local.dir config takes hold, you can also see it creates the dir if 
it doesn't exist and will delete it on shutdown in that case.

However I suppose a few possible cases jump out where the dir is not deleted:
- SPARK_EXECUTOR_DIRS is set
- spark.local.dir is set but it already exists

That is it seems to not delete dirs that were managed or set up externally. 
Does that explain this maybe?

 Spark doesn't clean up Application Directories (local dirs) 
 

 Key: SPARK-7917
 URL: https://issues.apache.org/jira/browse/SPARK-7917
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zach Fry
Priority: Minor

 Similar to SPARK-4834. 
 Spark does clean up the cache and lock files in the local dirs, however, it 
 doesn't clean up the actual directories. 
 We have to write custom scripts to go back through the local dirs and find 
 directories that don't contain any files and clear those out. 
 It's a pretty simple repro: 
 Run a job that does some shuffling, wait for the shuffle files to get cleaned 
 up, go and look on disk at spark.local.dir and notice that the directory(s) 
 are still there, but there are no files in them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8874) Add missing methods in Word2Vec ML

2015-07-07 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-8874:
---
Component/s: PySpark
 ML

 Add missing methods in Word2Vec ML
 --

 Key: SPARK-8874
 URL: https://issues.apache.org/jira/browse/SPARK-8874
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar

 Add getVectors and findSynonyms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8839) Thrift Server will throw `java.util.NoSuchElementException: key not found` exception when many clients connect to it

2015-07-07 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8839:
--
Shepherd: Yi Tian

 Thrift Server will throw `java.util.NoSuchElementException: key not found` 
 exception when many clients connect to it
 -

 Key: SPARK-8839
 URL: https://issues.apache.org/jira/browse/SPARK-8839
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: SaintBacchus

 If there are about 150+ JDBC clients connecting to the Thrift Server, some 
 clients will throw an exception like this:
 {code:title=Exception message|borderStyle=solid}
 java.sql.SQLException: java.util.NoSuchElementException: key not found: 
 90d93e56-7f6d-45bf-b340-e3ee09dd60fc
  at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:155)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7555) User guide update for ElasticNet

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7555:
-
Shepherd: Joseph K. Bradley

 User guide update for ElasticNet
 

 Key: SPARK-7555
 URL: https://issues.apache.org/jira/browse/SPARK-7555
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Shuo Xiang

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler (ML and PySpark)

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8704:
-
Description: 
std, mean to StandardScalerModel
~~getVectors, findSynonyms to Word2Vec Model~~
~~setFeatures and getFeatures to hashingTF~~

  was:
std, mean to StandardScalerModel
getVectors, findSynonyms to Word2Vec Model
setFeatures and getFeatures to hashingTF


 Add missing methods in StandardScaler (ML and PySpark)
 --

 Key: SPARK-8704
 URL: https://issues.apache.org/jira/browse/SPARK-8704
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar
Assignee: Manoj Kumar
 Fix For: 1.5.0


 std, mean to StandardScalerModel
 ~~getVectors, findSynonyms to Word2Vec Model~~
 ~~setFeatures and getFeatures to hashingTF~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7879) KMeans API for spark.ml Pipelines

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7879:
-
Priority: Critical  (was: Major)

 KMeans API for spark.ml Pipelines
 -

 Key: SPARK-7879
 URL: https://issues.apache.org/jira/browse/SPARK-7879
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Yu Ishikawa
Priority: Critical

 Create a K-Means API for the spark.ml Pipelines API.  This should wrap the 
 existing KMeans implementation in spark.mllib.
 This should be the first clustering method added to Pipelines, and it will be 
 important to consider [SPARK-7610] and think about designing the clustering 
 API.  We do not have to have abstractions from the beginning (and probably 
 should not) but should think far enough ahead so we can add abstractions 
 later on.
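
 A hedged sketch of the wrapping approach described above; the class and column 
 names are illustrative assumptions, not the final spark.ml API.
{code}
// Hedged sketch only: an Estimator-like wrapper that delegates to spark.mllib's KMeans.
import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.DataFrame

class KMeansEstimator(k: Int, maxIter: Int, featuresCol: String = "features") {
  def fit(dataset: DataFrame): KMeansModel = {
    val input = dataset.select(featuresCol).rdd.map(_.getAs[Vector](0))
    new MLlibKMeans().setK(k).setMaxIterations(maxIter).run(input)
  }
}
{code}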



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8874) Add missing methods in Word2Vec ML

2015-07-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617282#comment-14617282
 ] 

Apache Spark commented on SPARK-8874:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/7263

 Add missing methods in Word2Vec ML
 --

 Key: SPARK-8874
 URL: https://issues.apache.org/jira/browse/SPARK-8874
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar

 Add getVectors and findSynonyms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7337) FPGrowth algo throwing OutOfMemoryError

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7337:
-
Shepherd: Xiangrui Meng

 FPGrowth algo throwing OutOfMemoryError
 ---

 Key: SPARK-7337
 URL: https://issues.apache.org/jira/browse/SPARK-7337
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
 Environment: Ubuntu
Reporter: Amit Gupta
 Attachments: FPGrowthBug.png


 When running the FPGrowth algorithm on several GB of data with numPartitions=500, 
 it throws an OutOfMemoryError after some time.
 The algorithm runs correctly up to the collect at FPGrowth.scala:131, which creates 
 500 tasks. It fails at the next stage, the flatMap at FPGrowth.scala:150, which 
 creates only 17 internally calculated tasks instead of 500.
 Please refer to the attached screenshot.
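 
 For illustration only, a minimal sketch of the kind of spark.mllib FPGrowth job 
 described above; the input path, minSupport value, and overall wiring are 
 assumptions made for the example, not details taken from this report.
 
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object FPGrowthSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fpgrowth-sketch"))

    // Each line is a space-separated transaction; the path is a placeholder.
    val transactions = sc.textFile("hdfs:///data/transactions.txt")
      .map(_.trim.split(' '))

    // numPartitions = 500 mirrors the setting in the report; minSupport is assumed.
    val model = new FPGrowth()
      .setMinSupport(0.01)
      .setNumPartitions(500)
      .run(transactions)

    // Materializing the frequent itemsets triggers the stages described above.
    model.freqItemsets.take(10).foreach { itemset =>
      println(s"${itemset.items.mkString("[", ",", "]")} -> ${itemset.freq}")
    }
    sc.stop()
  }
}
{code}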



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8400) ml.ALS doesn't handle -1 block size

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8400:
-
Shepherd: Xiangrui Meng

 ml.ALS doesn't handle -1 block size
 ---

 Key: SPARK-8400
 URL: https://issues.apache.org/jira/browse/SPARK-8400
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.3.1
Reporter: Xiangrui Meng

 Under spark.mllib, if the number of blocks is set to -1, the block size is set 
 automatically based on the input partition size. However, this behavior is not 
 preserved in the spark.ml API. If a user sets -1 in Spark 1.3, it will not 
 work, and no error message is shown.
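 
 For reference, a short sketch of the spark.mllib behavior described above, where 
 blocks = -1 asks ALS to pick the number of blocks from the input partitioning; 
 the rank, iteration count, and lambda values are illustrative assumptions.
 
{code:scala}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// `ratings` is assumed to be an RDD[Rating] prepared elsewhere.
def trainWithAutoBlocks(ratings: RDD[Rating]) = {
  // blocks = -1 lets spark.mllib choose the number of blocks automatically;
  // this is the behavior the spark.ml API does not currently reproduce.
  ALS.train(ratings, 10, 10, 0.01, -1)
}
{code}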



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)

2015-07-07 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617202#comment-14617202
 ] 

Matt Cheah commented on SPARK-7917:
---

Just wanted to clarify: Worker shutdown, or executor shutdown? We have 
long-running workers, so we would want directories to be cleaned up on executor 
shutdown, not worker shutdown.

 Spark doesn't clean up Application Directories (local dirs) 
 

 Key: SPARK-7917
 URL: https://issues.apache.org/jira/browse/SPARK-7917
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zach Fry
Priority: Minor

 Similar to SPARK-4834. 
 Spark does clean up the cache and lock files in the local dirs; however, it 
 doesn't clean up the actual directories. 
 We have to write custom scripts to go back through the local dirs, find 
 directories that don't contain any files, and clear those out. 
 It's a pretty simple repro: 
 Run a job that does some shuffling, wait for the shuffle files to get cleaned 
 up, then look on disk at spark.local.dir and notice that the directories 
 are still there, but there are no files in them.
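 
 As a rough illustration of the "custom scripts" workaround mentioned above, a 
 small Scala snippet that deletes empty application subdirectories under a local 
 dir; the root path is a placeholder and the snippet is not part of Spark itself.
 
{code:scala}
import java.io.File

// Recursively remove directories under `root` that no longer contain any files.
def cleanEmptyDirs(root: File): Unit = {
  Option(root.listFiles()).getOrElse(Array.empty[File]).foreach { child =>
    if (child.isDirectory) {
      cleanEmptyDirs(child)
      // After recursing, drop the directory if nothing is left inside it.
      if (Option(child.listFiles()).exists(_.isEmpty)) child.delete()
    }
  }
}

// Placeholder path; point this at whatever spark.local.dir is set to.
cleanEmptyDirs(new File("/tmp/spark-local"))
{code}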



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5016:
-
Shepherd: Xiangrui Meng

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Feynman Liang
  Labels: clustering

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7337) FPGrowth algo throwing OutOfMemoryError

2015-07-07 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617290#comment-14617290
 ] 

Xiangrui Meng commented on SPARK-7337:
--

[~amit.gupta.niit-tech] Any updates?

 FPGrowth algo throwing OutOfMemoryError
 ---

 Key: SPARK-7337
 URL: https://issues.apache.org/jira/browse/SPARK-7337
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
 Environment: Ubuntu
Reporter: Amit Gupta
 Attachments: FPGrowthBug.png


 When running the FPGrowth algorithm on several GB of data with numPartitions=500, 
 it throws an OutOfMemoryError after some time.
 The algorithm runs correctly up to the collect at FPGrowth.scala:131, which creates 
 500 tasks. It fails at the next stage, the flatMap at FPGrowth.scala:150, which 
 creates only 17 internally calculated tasks instead of 500.
 Please refer to the attached screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)

2015-07-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617177#comment-14617177
 ] 

Sean Owen commented on SPARK-7917:
--

Right, this is about standalone. There's 
https://github.com/apache/spark/pull/3705 but that was in 1.3. IIRC this dir gets 
cleaned up pretty reliably on worker shutdown if the JVM can exit normally, so I 
think the questions are: does it still happen on master, and what causes the 
normal code path not to run?

 Spark doesn't clean up Application Directories (local dirs) 
 

 Key: SPARK-7917
 URL: https://issues.apache.org/jira/browse/SPARK-7917
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zach Fry
Priority: Minor

 Similar to SPARK-4834. 
 Spark does clean up the cache and lock files in the local dirs; however, it 
 doesn't clean up the actual directories. 
 We have to write custom scripts to go back through the local dirs, find 
 directories that don't contain any files, and clear those out. 
 It's a pretty simple repro: 
 Run a job that does some shuffling, wait for the shuffle files to get cleaned 
 up, then look on disk at spark.local.dir and notice that the directories 
 are still there, but there are no files in them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8874) Add missing methods in Word2Vec ML

2015-07-07 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-8874:
--

 Summary: Add missing methods in Word2Vec ML
 Key: SPARK-8874
 URL: https://issues.apache.org/jira/browse/SPARK-8874
 Project: Spark
  Issue Type: New Feature
Reporter: Manoj Kumar


Add getVectors and findSynonyms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8400) ml.ALS doesn't handle -1 block size

2015-07-07 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617292#comment-14617292
 ] 

Xiangrui Meng commented on SPARK-8400:
--

[~bryanc] Are you still working on this issue?

 ml.ALS doesn't handle -1 block size
 ---

 Key: SPARK-8400
 URL: https://issues.apache.org/jira/browse/SPARK-8400
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.3.1
Reporter: Xiangrui Meng

 Under spark.mllib, if the number of blocks is set to -1, the block size is set 
 automatically based on the input partition size. However, this behavior is not 
 preserved in the spark.ml API. If a user sets -1 in Spark 1.3, it will not 
 work, and no error message is shown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)

2015-07-07 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617115#comment-14617115
 ] 

Matt Cheah commented on SPARK-7917:
---

[~sowen] was there a patch specifically written in master or 1.4.x that fixed 
this? Can you link a specific PR?

 Spark doesn't clean up Application Directories (local dirs) 
 

 Key: SPARK-7917
 URL: https://issues.apache.org/jira/browse/SPARK-7917
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zach Fry
Priority: Minor

 Similar to SPARK-4834. 
 Spark does clean up the cache and lock files in the local dirs; however, it 
 doesn't clean up the actual directories. 
 We have to write custom scripts to go back through the local dirs, find 
 directories that don't contain any files, and clear those out. 
 It's a pretty simple repro: 
 Run a job that does some shuffling, wait for the shuffle files to get cleaned 
 up, then look on disk at spark.local.dir and notice that the directories 
 are still there, but there are no files in them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)

2015-07-07 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617166#comment-14617166
 ] 

Matt Cheah commented on SPARK-7917:
---

Definitely not 7503 - the PR there only did things for YARN mode: 
https://github.com/apache/spark/pull/6026

 Spark doesn't clean up Application Directories (local dirs) 
 

 Key: SPARK-7917
 URL: https://issues.apache.org/jira/browse/SPARK-7917
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zach Fry
Priority: Minor

 Similar to SPARK-4834. 
 Spark does clean up the cache and lock files in the local dirs; however, it 
 doesn't clean up the actual directories. 
 We have to write custom scripts to go back through the local dirs, find 
 directories that don't contain any files, and clear those out. 
 It's a pretty simple repro: 
 Run a job that does some shuffling, wait for the shuffle files to get cleaned 
 up, then look on disk at spark.local.dir and notice that the directories 
 are still there, but there are no files in them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)

2015-07-07 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617166#comment-14617166
 ] 

Matt Cheah edited comment on SPARK-7917 at 7/7/15 6:45 PM:
---

Definitely not SPARK-7503 - the PR there only did things for YARN mode: 
https://github.com/apache/spark/pull/6026


was (Author: mcheah):
Definitely not 7503 - the PR there only did things for YARN mode: 
https://github.com/apache/spark/pull/6026

 Spark doesn't clean up Application Directories (local dirs) 
 

 Key: SPARK-7917
 URL: https://issues.apache.org/jira/browse/SPARK-7917
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zach Fry
Priority: Minor

 Similar to SPARK-4834. 
 Spark does clean up the cache and lock files in the local dirs; however, it 
 doesn't clean up the actual directories. 
 We have to write custom scripts to go back through the local dirs, find 
 directories that don't contain any files, and clear those out. 
 It's a pretty simple repro: 
 Run a job that does some shuffling, wait for the shuffle files to get cleaned 
 up, then look on disk at spark.local.dir and notice that the directories 
 are still there, but there are no files in them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed

2015-07-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617184#comment-14617184
 ] 

Juan Rodríguez Hortalá commented on SPARK-8743:
---

Hi, 

I guess you already have a good test for this, but just in case, here is a 
minimal example for this issue: 
https://gist.github.com/juanrh/464155a3aabbf2c3afa8



 Deregister Codahale metrics for streaming when StreamingContext is closed 
 --

 Key: SPARK-8743
 URL: https://issues.apache.org/jira/browse/SPARK-8743
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.4.1
Reporter: Tathagata Das
Assignee: Neelesh Srinivas Salian
  Labels: starter

 Currently, when the StreamingContext is closed, the registered metrics are 
 not deregistered. If another StreamingContext is started, it throws a 
 warning saying that the metrics are already registered. 
 The solution is to deregister the metrics when the StreamingContext is stopped.
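 
 A minimal sketch of the deregistration approach, assuming direct access to the 
 Codahale MetricRegistry and a name prefix for the streaming source; the prefix 
 convention and wiring are illustrative, not the actual Spark internals.
 
{code:scala}
import com.codahale.metrics.{Metric, MetricFilter, MetricRegistry}

// Remove every metric whose name starts with the given prefix, e.g. the source
// name that belonged to the StreamingContext that was just stopped.
def deregisterStreamingMetrics(registry: MetricRegistry, prefix: String): Unit = {
  registry.removeMatching(new MetricFilter {
    override def matches(name: String, metric: Metric): Boolean =
      name.startsWith(prefix)
  })
}
{code}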



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler (ML and PySpark)

2015-07-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8704:
-
Description: Add std, mean to StandardScalerModel  (was: std, mean to 
StandardScalerModel
~~getVectors, findSynonyms to Word2Vec Model~~
~~setFeatures and getFeatures to hashingTF~~)

 Add missing methods in StandardScaler (ML and PySpark)
 --

 Key: SPARK-8704
 URL: https://issues.apache.org/jira/browse/SPARK-8704
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar
Assignee: Manoj Kumar
 Fix For: 1.5.0


 Add std, mean to StandardScalerModel
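 
 For context, the Scala spark.mllib model already carries these values; a short 
 sketch of the accessors the ML/PySpark wrappers would mirror, using a toy 
 dataset made up for the example (`sc` is an existing SparkContext):
 
{code:scala}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)
))

val model = new StandardScaler(withMean = true, withStd = true).fit(data)

// The two accessors this ticket asks to surface in spark.ml / PySpark.
println(model.mean) // per-feature mean of the training data
println(model.std)  // per-feature standard deviation
{code}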



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


