[jira] [Commented] (SPARK-10276) Add @since annotation to pyspark.mllib.recommendation

2015-09-09 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736346#comment-14736346
 ] 

Yu Ishikawa commented on SPARK-10276:
-

[~mengxr] should we add `@since` to the class methods decorated with `@classmethod` in 
PySpark? When I tried to do that, I got the following error. It seems that we 
can't rewrite {{__doc__}} of a `classmethod`.

{noformat}
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 175, in MatrixFactorizationModel
@classmethod
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 62, in deco
f.__doc__ = f.__doc__.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, 
version)
AttributeError: 'classmethod' object attribute '__doc__' is read-only
{noformat}
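
A minimal standalone sketch (a toy decorator, not Spark's actual {{since}} helper) of why this happens and the usual workaround: the docstring-rewriting decorator has to sit below {{@classmethod}} so it receives the plain function rather than the read-only {{classmethod}} wrapper.

{code}
def versionadded(version):
    # Toy decorator: appends a ``.. versionadded::`` note to the docstring.
    def deco(f):
        f.__doc__ = (f.__doc__ or "").rstrip() + "\n\n.. versionadded:: %s" % version
        return f
    return deco


class Model(object):
    @classmethod
    @versionadded("1.3.1")   # applied first, to the plain function: works
    def load(cls):
        """Load a model."""
        return cls()

# Swapping the two decorators hands `versionadded` a `classmethod` object instead
# of a function; assigning to its __doc__ under Python 2 then raises the
# "AttributeError: 'classmethod' object attribute '__doc__' is read-only" shown above.
{code}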

> Add @since annotation to pyspark.mllib.recommendation
> -
>
> Key: SPARK-10276
> URL: https://issues.apache.org/jira/browse/SPARK-10276
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>







[jira] [Created] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-10512:
---

 Summary: Fix @since when a function doesn't have doc
 Key: SPARK-10512
 URL: https://issues.apache.org/jira/browse/SPARK-10512
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.6.0
Reporter: Yu Ishikawa


When I tried to add @since to a function which doesn't have a docstring, @since 
didn't work. It seems that {{__doc__}} is {{None}} under the {{since}} decorator.

```
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 166, in MatrixFactorizationModel
@since("1.3.1")
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 63, in deco
indents = indent_p.findall(f.__doc__)
TypeError: expected string or buffer
```
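
A minimal sketch of one possible guard (not necessarily the actual patch; the regex below only approximates the indent handling visible in the traceback): treat a missing docstring as an empty string so the decorator never calls {{findall}} on {{None}}.

{code}
import re

_indent_p = re.compile(r'\n( +)')   # approximation of the indent pattern in the traceback

def since(version):
    def deco(f):
        doc = f.__doc__ or ""       # None when the function has no docstring
        indents = _indent_p.findall(doc)
        indent = ' ' * (min(len(m) for m in indents) if indents else 0)
        f.__doc__ = doc.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, version)
        return f
    return deco

@since("1.3.1")
def predict(user, product):         # no docstring: previously raised TypeError
    pass
{code}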






[jira] [Commented] (SPARK-10507) timestamp - timestamp

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736385#comment-14736385
 ] 

Sean Owen commented on SPARK-10507:
---

(Can you improve the title and description please?)

> timestamp - timestamp 
> --
>
> Key: SPARK-10507
> URL: https://issues.apache.org/jira/browse/SPARK-10507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
>
> TIMESTAMP - TIMESTAMP in ISO-SQL is an interval type. Hive 0.13 fails with 
> Error: Could not create ResultSet: Required field 'type' is unset! 
> Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges".
> select cts - cts from tts 
> Operation: execute
> Errors:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 6214.0 (TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type 
> TimestampType does not support numeric operations
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> create table  if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY 
> '\n' 
>  STORED AS orc  ;
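
A common workaround, sketched in PySpark (assumes a {{HiveContext}} named {{sqlContext}} and the {{tts}} table above; purely illustrative, not a fix for this issue): compute the difference in seconds explicitly instead of subtracting {{TimestampType}} columns.

{code}
# unix_timestamp() turns each timestamp into seconds since the epoch, so the
# subtraction is numeric and avoids the unsupported TIMESTAMP - TIMESTAMP expression.
diff_df = sqlContext.sql(
    "SELECT rnum, unix_timestamp(cts) - unix_timestamp(cts) AS diff_seconds FROM tts")
diff_df.show()
{code}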






[jira] [Updated] (SPARK-10507) timestamp - timestamp

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10507:
--
Priority: Minor  (was: Major)

> timestamp - timestamp 
> --
>
> Key: SPARK-10507
> URL: https://issues.apache.org/jira/browse/SPARK-10507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
>Priority: Minor
>
> TIMESTAMP - TIMESTAMP in ISO-SQL is an interval type. Hive 0.13 fails with 
> Error: Could not create ResultSet: Required field 'type' is unset! 
> Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges".
> select cts - cts from tts 
> Operation: execute
> Errors:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 6214.0 (TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type 
> TimestampType does not support numeric operations
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> create table  if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY 
> '\n' 
>  STORED AS orc  ;






[jira] [Updated] (SPARK-10502) tidy up the exception message text to be less verbose/"User friendly"

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10502:
--
Issue Type: Improvement  (was: Bug)

> tidy up the exception message text to be less verbose/"User friendly"
> -
>
> Key: SPARK-10502
> URL: https://issues.apache.org/jira/browse/SPARK-10502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
>Priority: Minor
>
> When a statement is parsed, it would be preferred if the exception text were 
> more aligned with other vendors in indicating the syntax error, without 
> including the verbose parse tree.
>  select tbint.rnum,tbint.cbint, nth_value( tbint.cbint, '4' ) over ( order by 
> tbint.rnum) from certstring.tbint 
> Errors:
> org.apache.spark.sql.AnalysisException: 
> Unsupported language features in query: select tbint.rnum,tbint.cbint, 
> nth_value( tbint.cbint, '4' ) over ( order by tbint.rnum) from 
> certstring.tbint
> TOK_QUERY 1, 0,40, 94
>   TOK_FROM 1, 36,40, 94
> TOK_TABREF 1, 38,40, 94
>   TOK_TABNAME 1, 38,40, 94
> certstring 1, 38,38, 94
> tbint 1, 40,40, 105
>   TOK_INSERT 0, -1,34, 0
> TOK_DESTINATION 0, -1,-1, 0
>   TOK_DIR 0, -1,-1, 0
> TOK_TMP_FILE 0, -1,-1, 0
> TOK_SELECT 1, 0,34, 12
>   TOK_SELEXPR 1, 2,4, 12
> . 1, 2,4, 12
>   TOK_TABLE_OR_COL 1, 2,2, 7
> tbint 1, 2,2, 7
>   rnum 1, 4,4, 13
>   TOK_SELEXPR 1, 6,8, 23
> . 1, 6,8, 23
>   TOK_TABLE_OR_COL 1, 6,6, 18
> tbint 1, 6,6, 18
>   cbint 1, 8,8, 24
>   TOK_SELEXPR 1, 11,34, 31
> TOK_FUNCTION 1, 11,34, 31
>   nth_value 1, 11,11, 31
>   . 1, 14,16, 47
> TOK_TABLE_OR_COL 1, 14,14, 42
>   tbint 1, 14,14, 42
> cbint 1, 16,16, 48
>   '4' 1, 19,19, 55
>   TOK_WINDOWSPEC 1, 25,34, 82
> TOK_PARTITIONINGSPEC 1, 27,33, 82
>   TOK_ORDERBY 1, 27,33, 82
> TOK_TABSORTCOLNAMEASC 1, 31,33, 82
>   . 1, 31,33, 82
> TOK_TABLE_OR_COL 1, 31,31, 77
>   tbint 1, 31,31, 77
> rnum 1, 33,33, 83
> scala.NotImplementedError: No parse rules for ASTNode type: 882, text: 
> TOK_WINDOWSPEC :
> TOK_WINDOWSPEC 1, 25,34, 82
>   TOK_PARTITIONINGSPEC 1, 27,33, 82
> TOK_ORDERBY 1, 27,33, 82
>   TOK_TABSORTCOLNAMEASC 1, 31,33, 82
> . 1, 31,33, 82
>   TOK_TABLE_OR_COL 1, 31,31, 77
> tbint 1, 31,31, 77
>   rnum 1, 33,33, 83
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1261)






[jira] [Resolved] (SPARK-10111) StringIndexerModel lacks of method "labels"

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10111.
---
Resolution: Duplicate

> StringIndexerModel lacks of method "labels"
> ---
>
> Key: SPARK-10111
> URL: https://issues.apache.org/jira/browse/SPARK-10111
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Kai Sasaki
>
> Missing {{labels}} property of {{StringIndexerModel}} in pyspark.






[jira] [Updated] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-10512:

Description: 
When I tried to add @since to a function which doesn't have doc, @since didn't 
go well. It seems that {{___doc___}} is {{None]} under {{since}} decorator.

{noformat}
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 166, in MatrixFactorizationModel
@since("1.3.1")
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 63, in deco
indents = indent_p.findall(f.__doc__)
TypeError: expected string or buffer
{noformat}

  was:
When I tried to add @since to a function which doesn't have doc, @since didn't 
go well. It seems that {{___doc___}} is {{None]} under {{since}} decorator.

```
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 166, in MatrixFactorizationModel
@since("1.3.1")
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 63, in deco
indents = indent_p.findall(f.__doc__)
TypeError: expected string or buffer
```


> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None]} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}






[jira] [Updated] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-10512:

Description: 
When I tried to add @since to a function which doesn't have doc, @since didn't 
go well. It seems that {{___doc___}} is {{None}} under {{since}} decorator.

{noformat}
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 166, in MatrixFactorizationModel
@since("1.3.1")
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 63, in deco
indents = indent_p.findall(f.__doc__)
TypeError: expected string or buffer
{noformat}

  was:
When I tried to add @since to a function which doesn't have doc, @since didn't 
go well. It seems that {{___doc___}} is {{None]} under {{since}} decorator.

{noformat}
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 166, in MatrixFactorizationModel
@since("1.3.1")
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 63, in deco
indents = indent_p.findall(f.__doc__)
TypeError: expected string or buffer
{noformat}


> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}






[jira] [Assigned] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10512:


Assignee: (was: Apache Spark)

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}






[jira] [Assigned] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10512:


Assignee: Apache Spark

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>Assignee: Apache Spark
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}






[jira] [Commented] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736389#comment-14736389
 ] 

Apache Spark commented on SPARK-10512:
--

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/8667

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}






[jira] [Commented] (SPARK-10444) Remove duplication in Mesos schedulers

2015-09-09 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736434#comment-14736434
 ] 

Iulian Dragos commented on SPARK-10444:
---

Another example of duplicated logic: https://github.com/apache/spark/pull/8639

> Remove duplication in Mesos schedulers
> --
>
> Key: SPARK-10444
> URL: https://issues.apache.org/jira/browse/SPARK-10444
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.0
>Reporter: Iulian Dragos
>  Labels: refactoring
>
> Currently coarse-grained and fine-grained Mesos schedulers don't share much 
> code, and that leads to inconsistencies. For instance:
> - only coarse-grained mode respects {{spark.cores.max}}, see SPARK-9873
> - only coarse-grained mode blacklists slaves that fail repeatedly, but that 
> seems generally useful
> - constraints and memory checking are done on both sides (code is shared 
> though)
> - framework re-registration (master election) is only done for cluster-mode 
> deployment
> We should find a better design that groups together common concerns and 
> generally improves the code.






[jira] [Updated] (SPARK-7825) Poor performance in Cross Product due to no combine operations for small files.

2015-09-09 Thread Tang Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tang Yan updated SPARK-7825:

Affects Version/s: (was: 1.3.1)
                   (was: 1.2.2)
                   (was: 1.2.1)
                   (was: 1.3.0)
                   (was: 1.2.0)

> Poor performance in Cross Product due to no combine operations for small 
> files.
> ---
>
> Key: SPARK-7825
> URL: https://issues.apache.org/jira/browse/SPARK-7825
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Tang Yan
>
> When dealing with a cross product, if one table has many small files, Spark SQL 
> has to handle a very large number of tasks, which leads to poor performance, while 
> Hive has a CombineHiveInputFormat which can combine small files to decrease the 
> task count.






[jira] [Resolved] (SPARK-10227) sbt build on Scala 2.11 fails

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10227.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8433
[https://github.com/apache/spark/pull/8433]

> sbt build on Scala 2.11 fails
> -
>
> Key: SPARK-10227
> URL: https://issues.apache.org/jira/browse/SPARK-10227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Luc Bourlier
> Fix For: 1.6.0
>
>
> Scala 2.11 has additional warnings compared to Scala 2.10, and with the addition 
> of 'fatal warnings' in the sbt build, the current {{trunk}} (and {{branch-1.5}}) 
> fails to build with sbt on Scala 2.11.
> Most of the warnings are about the {{@transient}} annotation not being set on 
> relevant elements, and a few point to potential bugs.






[jira] [Updated] (SPARK-10227) sbt build on Scala 2.11 fails

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10227:
--
Assignee: Luc Bourlier

> sbt build on Scala 2.11 fails
> -
>
> Key: SPARK-10227
> URL: https://issues.apache.org/jira/browse/SPARK-10227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Luc Bourlier
>Assignee: Luc Bourlier
> Fix For: 1.6.0
>
>
> Scala 2.11 has additional warnings compared to Scala 2.10, and with the addition 
> of 'fatal warnings' in the sbt build, the current {{trunk}} (and {{branch-1.5}}) 
> fails to build with sbt on Scala 2.11.
> Most of the warnings are about the {{@transient}} annotation not being set on 
> relevant elements, and a few point to potential bugs.






[jira] [Updated] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10316:
--
Assignee: Wenchen Fan

> respect non-deterministic expressions in PhysicalOperation
> --
>
> Key: SPARK-10316
> URL: https://issues.apache.org/jira/browse/SPARK-10316
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>
> We did a lot of special handling for non-deterministic expressions in the 
> Optimizer. However, PhysicalOperation just collects all Projects and Filters 
> and messes that up. We should respect the operator order imposed by 
> non-deterministic expressions in PhysicalOperation.
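
An illustration of the concern using the public PySpark API (assumes an existing {{SQLContext}} named {{sqlContext}}; it only shows why operator order matters, it is not the PhysicalOperation change itself): a filter over a {{rand()}}-derived column has to stay above the projection that computes it, otherwise the expression would be re-evaluated and a different set of rows would pass.

{code}
from pyspark.sql import functions as F

df = sqlContext.range(0, 100)
projected = df.select(F.col("id"), F.rand(seed=42).alias("r"))
# The filter must see the same `r` values the projection produced.
filtered = projected.filter(F.col("r") > 0.5)
filtered.show()
{code}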






[jira] [Updated] (SPARK-4752) Classifier based on artificial neural network

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4752:
-
Assignee: Alexander Ulanov

> Classifier based on artificial neural network
> -
>
> Key: SPARK-4752
> URL: https://issues.apache.org/jira/browse/SPARK-4752
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
> Fix For: 1.5.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Implement classifier based on artificial neural network (ANN). Requirements:
> 1) Use the existing artificial neural network implementation 
> https://issues.apache.org/jira/browse/SPARK-2352, 
> https://github.com/apache/spark/pull/1290
> 2) Extend MLlib ClassificationModel trait, 
> 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
> 4) Be able to return the ANN model






[jira] [Updated] (SPARK-10327) Cache Table is not working while subquery has alias in its project list

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10327:
--
Assignee: Cheng Hao

> Cache Table is not working while subquery has alias in its project list
> ---
>
> Key: SPARK-10327
> URL: https://issues.apache.org/jira/browse/SPARK-10327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> Code to reproduce that:
> {code}
> import org.apache.spark.sql.hive.execution.HiveTableScan
> sql("select key, value, key + 1 from src").registerTempTable("abc")
> cacheTable("abc")
> val sparkPlan = sql(
>   """select a.key, b.key, c.key from
> |abc a join abc b on a.key=b.key
> |join abc c on a.key=c.key""".stripMargin).queryExecution.sparkPlan
> assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size 
> === 3) // failed
> assert(sparkPlan.collect { case e: HiveTableScan => e }.size === 0) // 
> failed
> {code}
> The query plan like:
> {code}
> == Parsed Logical Plan ==
> 'Project 
> [unresolvedalias('a.key),unresolvedalias('b.key),unresolvedalias('c.key)]
>  'Join Inner, Some(('a.key = 'c.key))
>   'Join Inner, Some(('a.key = 'b.key))
>'UnresolvedRelation [abc], Some(a)
>'UnresolvedRelation [abc], Some(b)
>   'UnresolvedRelation [abc], Some(c)
> == Analyzed Logical Plan ==
> key: int, key: int, key: int
> Project [key#14,key#61,key#66]
>  Join Inner, Some((key#14 = key#66))
>   Join Inner, Some((key#14 = key#61))
>Subquery a
> Subquery abc
>  Project [key#14,value#15,(key#14 + 1) AS _c2#16]
>   MetastoreRelation default, src, None
>Subquery b
> Subquery abc
>  Project [key#61,value#62,(key#61 + 1) AS _c2#58]
>   MetastoreRelation default, src, None
>   Subquery c
>Subquery abc
> Project [key#66,value#67,(key#66 + 1) AS _c2#63]
>  MetastoreRelation default, src, None
> == Optimized Logical Plan ==
> Project [key#14,key#61,key#66]
>  Join Inner, Some((key#14 = key#66))
>   Project [key#14,key#61]
>Join Inner, Some((key#14 = key#61))
> Project [key#14]
>  InMemoryRelation [key#14,value#15,_c2#16], true, 1, 
> StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 
> 1) AS _c2#16]), Some(abc)
> Project [key#61]
>  MetastoreRelation default, src, None
>   Project [key#66]
>MetastoreRelation default, src, None
> == Physical Plan ==
> TungstenProject [key#14,key#61,key#66]
>  BroadcastHashJoin [key#14], [key#66], BuildRight
>   TungstenProject [key#14,key#61]
>BroadcastHashJoin [key#14], [key#61], BuildRight
> ConvertToUnsafe
>  InMemoryColumnarTableScan [key#14], (InMemoryRelation 
> [key#14,value#15,_c2#16], true, 1, StorageLevel(true, true, false, true, 
> 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc))
> ConvertToUnsafe
>  HiveTableScan [key#61], (MetastoreRelation default, src, None)
>   ConvertToUnsafe
>HiveTableScan [key#66], (MetastoreRelation default, src, None)
> {code}






[jira] [Updated] (SPARK-10441) Cannot write timestamp to JSON

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10441:
--
Assignee: Yin Huai

> Cannot write timestamp to JSON
> --
>
> Key: SPARK-10441
> URL: https://issues.apache.org/jira/browse/SPARK-10441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>







[jira] [Updated] (SPARK-10501) support UUID as an atomic type

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10501:
--
   Priority: Minor  (was: Major)
Component/s: SQL
 Issue Type: Improvement  (was: Bug)

> support UUID as an atomic type
> --
>
> Key: SPARK-10501
> URL: https://issues.apache.org/jira/browse/SPARK-10501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jon Haddad
>Priority: Minor
>
> It's pretty common to use UUIDs instead of integers in order to avoid 
> distributed counters.  
> I've added this, which at least lets me load dataframes that use UUIDs that I 
> can cast to strings:
> {code}
> class UUIDType(AtomicType):
> pass
> _type_mappings[UUID] = UUIDType
> _atomic_types.append(UUIDType)
> {code}
> But if I try to do anything else with the UUIDs, like this:
> {code}
> ratings.select("userid").distinct().collect()
> {code}
> I get this pile of fun: 
> {code}
> scala.MatchError: UUIDType (of class 
> org.apache.spark.sql.cassandra.types.UUIDType$)
> {code}
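
A hedged workaround sketch in PySpark (assumes a DataFrame named {{ratings}} with a UUID-typed {{userid}} column as in the report; an illustration, not a proposed fix): cast the column to a string before operations that need full type support, since the custom {{UUIDType}} is unknown to the rest of the planner.

{code}
# Cast the UUID column to a plain string type first, then operate on it normally.
ratings_str = ratings.withColumn("userid", ratings["userid"].cast("string"))
distinct_users = ratings_str.select("userid").distinct().collect()
{code}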






[jira] [Commented] (SPARK-9564) Spark 1.5.0 Testing Plan

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736594#comment-14736594
 ] 

Sean Owen commented on SPARK-9564:
--

Now that 1.5.0 is released, can this be closed? 
Otherwise, I'm unclear on the role of these umbrella issues and would like to 
revisit that conversation.

> Spark 1.5.0 Testing Plan
> 
>
> Key: SPARK-9564
> URL: https://issues.apache.org/jira/browse/SPARK-9564
> Project: Spark
>  Issue Type: Epic
>  Components: Build, Tests
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> This is an epic for Spark 1.5.0 release QA plans for tracking various 
> components.






[jira] [Created] (SPARK-10513) Springleaf Marketing Response

2015-09-09 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10513:
---

 Summary: Springleaf Marketing Response
 Key: SPARK-10513
 URL: https://issues.apache.org/jira/browse/SPARK-10513
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Yanbo Liang


Apply ML pipeline API to Springleaf Marketing Response 
(https://www.kaggle.com/c/springleaf-marketing-response)






[jira] [Commented] (SPARK-10513) Springleaf Marketing Response

2015-09-09 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736648#comment-14736648
 ] 

Yanbo Liang commented on SPARK-10513:
-

I will work on this dataset.

> Springleaf Marketing Response
> -
>
> Key: SPARK-10513
> URL: https://issues.apache.org/jira/browse/SPARK-10513
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>
> Apply ML pipeline API to Springleaf Marketing Response 
> (https://www.kaggle.com/c/springleaf-marketing-response)






[jira] [Commented] (SPARK-9578) Stemmer feature transformer

2015-09-09 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736695#comment-14736695
 ] 

yuhao yang commented on SPARK-9578:
---

A better choice for LDA seems to be lemmatization, yet that requires POS tags 
and an extra vocabulary. 
If there's no other ongoing effort on this, I'd like to start with a simpler 
Porter implementation, then try to extend it to Snowball. [~josephkb] 
The plan is to cover the most common cases with shorter code. After all, MLlib 
is not specific to NLP.
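
To make the intent concrete, a toy suffix-stripping function (this is not the Porter algorithm, only an illustration of the per-token transformation such a feature transformer would apply):

{code}
def crude_stem(token):
    # Strip one of a few common English suffixes, keeping at least a 3-letter stem.
    for suffix in ("ing", "edly", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

print([crude_stem(t) for t in ["running", "cars", "parses", "reportedly"]])
# ['runn', 'car', 'pars', 'report']
{code}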

> Stemmer feature transformer
> ---
>
> Key: SPARK-9578
> URL: https://issues.apache.org/jira/browse/SPARK-9578
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Transformer mentioned first in [SPARK-5571] based on suggestion from 
> [~aloknsingh].  Very standard NLP preprocessing task.
> From [~aloknsingh]:
> {quote}
> We have one Scala stemmer in scalanlp%chalk 
> https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
>  which can easily be copied (as it is an Apache-licensed project) and is in Scala too.
> I think this will be a better alternative than the Lucene EnglishAnalyzer or 
> OpenNLP.
> Note: we already use scalanlp%breeze via the Maven dependency, so I think 
> adding a scalanlp%chalk dependency is also an option. But as you said, we 
> can copy the code as it is small.
> {quote}






[jira] [Created] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode

2015-09-09 Thread Akash Mishra (JIRA)
Akash Mishra created SPARK-10514:


 Summary: Minimum ratio of registered resources [ 
spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse 
Grained mode
 Key: SPARK-10514
 URL: https://issues.apache.org/jira/browse/SPARK-10514
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Akash Mishra


"spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not 
effecting the Mesos Coarse Grained mode. This is because the scheduler is not 
overriding the "sufficientResourcesRegistered" function which is true by 
default. 






[jira] [Updated] (SPARK-10507) reject temporal expressions such as timestamp - timestamp at parse time

2015-09-09 Thread N Campbell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

N Campbell updated SPARK-10507:
---
Summary: reject temporal expressions such as timestamp - timestamp at parse time  (was: timestamp - timestamp)

> reject temporal expressions such as timestamp - timestamp at parse time 
> 
>
> Key: SPARK-10507
> URL: https://issues.apache.org/jira/browse/SPARK-10507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
>Priority: Minor
>
> TIMESTAMP - TIMESTAMP in ISO-SQL is an interval type. Hive 0.13 fails with 
> Error: Could not create ResultSet: Required field 'type' is unset! 
> Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges".
> select cts - cts from tts 
> Operation: execute
> Errors:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 6214.0 (TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type 
> TimestampType does not support numeric operations
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> create table  if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY 
> '\n' 
>  STORED AS orc  ;






[jira] [Updated] (SPARK-10507) reject temporal expressions such as timestamp - timestamp at parse time

2015-09-09 Thread N Campbell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

N Campbell updated SPARK-10507:
---
Description: 
TIMESTAMP - TIMESTAMP in ISO-SQL should return an interval type which SPARK 
does not support.. 

A similar expression in Hive 0.13 fails with Error: Could not create ResultSet: 
Required field 'type' is unset! Struct:TPrimitiveTypeEntry(type:null) and SPARK 
has similar "challenges". While Hive 1.2.1 has added some interval type support 
it is far from complete with respect to ISO-SQL. 

The ability to compute the period of time (years, days, weeks, hours, ...) 
between timestamps or add/substract intervals from a timestamp are extremely 
common in business applications. 

Currently, a value expression such as select timestampcol - timestampcol from t 
will fail during execution and not parse time. While the error thrown states 
that fact, it would better for those value expressions to be rejected at parse 
time along with indicating the expression that is causing the parser error.


Operation: execute
Errors:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6214.0 
(TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type 
TimestampType does not support numeric operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150)
at 
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)

create table  if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY '\n' 
 STORED AS orc  ;


  was:
TIMESTAMP - TIMESTAMP in ISO-SQL is an interval type. Hive 0.13 fails with 
Error: Could not create ResultSet: Required field 'type' is unset! 
Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges".

select cts - cts from tts 



Operation: execute
Errors:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6214.0 
(TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type 
TimestampType does not support numeric operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150)
at 
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scal

[jira] [Created] (SPARK-10515) When killing an executor, there is no need to send RequestExecutors to the AM

2015-09-09 Thread KaiXinXIaoLei (JIRA)
KaiXinXIaoLei created SPARK-10515:
-

 Summary: When killing an executor, there is no need to send 
RequestExecutors to the AM
 Key: SPARK-10515
 URL: https://issues.apache.org/jira/browse/SPARK-10515
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: KaiXinXIaoLei
 Fix For: 1.6.0









[jira] [Assigned] (SPARK-10515) When killing an executor, there is no need to send RequestExecutors to the AM

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10515:


Assignee: (was: Apache Spark)

> When killing an executor, there is no need to send RequestExecutors to the AM
> ---
>
> Key: SPARK-10515
> URL: https://issues.apache.org/jira/browse/SPARK-10515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-10515) When killing an executor, there is no need to send RequestExecutors to the AM

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736853#comment-14736853
 ] 

Apache Spark commented on SPARK-10515:
--

User 'KaiXinXiaoLei' has created a pull request for this issue:
https://github.com/apache/spark/pull/8668

> When killing an executor, there is no need to send RequestExecutors to the AM
> ---
>
> Key: SPARK-10515
> URL: https://issues.apache.org/jira/browse/SPARK-10515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0
>
>







[jira] [Assigned] (SPARK-10515) When killing an executor, there is no need to send RequestExecutors to the AM

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10515:


Assignee: Apache Spark

> When killing an executor, there is no need to send RequestExecutors to the AM
> ---
>
> Key: SPARK-10515
> URL: https://issues.apache.org/jira/browse/SPARK-10515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
>Assignee: Apache Spark
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736869#comment-14736869
 ] 

Glenn Strycker commented on SPARK-10493:


The RDD I am using has the form ((String, String), (String, Long, Long, Long, 
Long)), so the key is actually a (String, String) tuple.

Are there any sorting operations that would require implicit ordering, buried 
under the covers of the reduceByKey operation, that would be causing the 
problems with non-uniqueness?

Does partitionBy(HashPartitioner(numPartitions)) not work with a (String, 
String) tuple?  I've not had any noticeable problems with this before, although 
that would certainly explain errors in reduceByKey and distinct.
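
A plain PySpark sketch of the key-equality question (assumes an existing SparkContext {{sc}}; Python rather than the Scala in the report): tuple keys hash and compare by value, so equal (String, String) keys from different partitions reduce together and each key appears once after reduceByKey.

{code}
pairs = sc.parallelize([(("a", "x"), 1), (("a", "x"), 2), (("b", "y"), 3)], 4)
reduced = pairs.partitionBy(8).reduceByKey(lambda a, b: a + b)
print(sorted(reduced.collect()))   # [(('a', 'x'), 3), (('b', 'y'), 3)]
# With an associative, commutative reduce function a follow-up .distinct()
# should not change the count.
{code}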

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736879#comment-14736879
 ] 

Sean Owen commented on SPARK-10493:
---

That much should be OK. 
zipPartitions only makes sense if you have two ordered, identically partitioned 
data sets. Is that true of the temp RDDs?
Otherwise that could be a source of nondeterminism.
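
A quick way to sanity-check that precondition (a sketch with made-up stand-ins, 
not the real tempRDDs):

{code}
import org.apache.spark.HashPartitioner

// Two RDDs are safe to zip partition-wise only if they have the same number of
// partitions (zipPartitions enforces this) and their rows line up as intended.
val p = new HashPartitioner(4)
val tempA = sc.parallelize(Seq((("a", "b"), 1L), (("c", "d"), 2L))).partitionBy(p)
val tempB = sc.parallelize(Seq((("e", "f"), 3L), (("g", "h"), 4L))).partitionBy(p)

println(tempA.partitions.length == tempB.partitions.length)  // true
println(tempA.partitioner == tempB.partitioner)              // true: both Some(p)
{code}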

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8793) error/warning with pyspark WholeTextFiles.first

2015-09-09 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll resolved SPARK-8793.
--
Resolution: Not A Problem

This is no longer occurring.

> error/warning with pyspark WholeTextFiles.first
> ---
>
> Key: SPARK-8793
> URL: https://issues.apache.org/jira/browse/SPARK-8793
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Diana Carroll
>Priority: Minor
> Attachments: wholefilesbug.txt
>
>
> In Spark 1.3.0 python, calling first() on sc.wholeTextFiles is not working 
> correctly in pyspark.  It works fine in Scala.
> I created a directory with two tiny, simple text files.  
> this works:
> {code}sc.wholeTextFiles("testdata").collect(){code}
> this doesn't:
> {code}sc.wholeTextFiles("testdata").first(){code}
> The main error message is:
> {code}15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in 
> stage 12.0 (TID 12)
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
> process()
>   File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in 
> dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
> while taken < left:
> ImportError: No module named iter
> {code}
> I will attach the full stack trace to the JIRA.
> I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0).  Tested in both Python 2.6 
> and 2.7, same results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2960) Spark executables fail to start via symlinks

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736923#comment-14736923
 ] 

Apache Spark commented on SPARK-2960:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/8669

> Spark executables fail to start via symlinks
> 
>
> Key: SPARK-2960
> URL: https://issues.apache.org/jira/browse/SPARK-2960
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Shay Rojansky
>Priority: Minor
>
> The current scripts (e.g. pyspark) fail to run when they are executed via 
> symlinks. A common Linux scenario would be to have Spark installed somewhere 
> (e.g. /opt) and have a symlink to it in /usr/bin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10428) Struct fields read from parquet are mis-aligned

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736949#comment-14736949
 ] 

Apache Spark commented on SPARK-10428:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8670

> Struct fields read from parquet are mis-aligned
> ---
>
> Key: SPARK-10428
> URL: https://issues.apache.org/jira/browse/SPARK-10428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Critical
>
> {code}
> val df1 = sqlContext
> .range(1)
> .selectExpr("NAMED_STRUCT('a', id, 'd', id + 3) AS s")
> .coalesce(1)
> val df2 = sqlContext
>   .range(1, 2)
>   .selectExpr("NAMED_STRUCT('a', id, 'b', id + 1, 'c', id + 2, 'd', id + 3) 
> AS s")
>   .coalesce(1)
> df1.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=1")
> df2.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=2")
> {code}
> {code}
> sqlContext.read.option("mergeSchema", 
> "true").parquet("/home/yin/sc_11_minimal/").selectExpr("s.a", "s.b", "s.c", 
> "s.d", “p").show
> +---+---+++---+
> |  a|  b|   c|   d|  p|
> +---+---+++---+
> |  0|  3|null|null|  1|
> |  1|  2|   3|   4|  2|
> +---+---+++---+
> {code}
> Looks like the problem is at 
> https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L185-L204,
>  we do padding when the global schema has more struct fields than the local 
> parquet file's schema. However, when we read a field from parquet, we still use 
> parquet's local schema and then put the value of {{d}} into the wrong slot.
> I tried master. Looks like this issue is resolved by 
> https://github.com/apache/spark/pull/8509. We need to decide if we want to 
> back port that to branch 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736947#comment-14736947
 ] 

Apache Spark commented on SPARK-10301:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8670

> For struct type, if parquet's global schema has less fields than a file's 
> schema, data reading will fail
> 
>
> Key: SPARK-10301
> URL: https://issues.apache.org/jira/browse/SPARK-10301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> We hit this issue when reading a complex Parquet dataset without turning on 
> schema merging.  The data set consists of Parquet files with different but 
> compatible schemas.  In this way, the schema of the dataset is defined by 
> either a summary file or a random physical Parquet file if no summary files 
> are available.  Apparently, this schema may not contain all the fields that 
> appear in all physical files.
> Parquet was designed with schema evolution and column pruning in mind, so it 
> should be legal for a user to use a tailored schema to read the dataset to 
> save disk IO.  For example, say we have a Parquet dataset consisting of two 
> physical Parquet files with the following two schemas:
> {noformat}
> message m0 {
>   optional group f0 {
> optional int64 f00;
> optional int64 f01;
>   }
> }
> message m1 {
>   optional group f0 {
> optional int64 f00;
> optional int64 f01;
> optional int64 f02;
>   }
>   optional double f1;
> }
> {noformat}
> Users should be allowed to read the dataset with the following schema:
> {noformat}
> message m1 {
>   optional group f0 {
> optional int64 f01;
> optional int64 f02;
>   }
> }
> {noformat}
> so that {{f0.f00}} and {{f1}} are never touched.  The above case can be 
> expressed by the following {{spark-shell}} snippet:
> {noformat}
> import sqlContext._
> import sqlContext.implicits._
> import org.apache.spark.sql.types.{LongType, StructType}
> val path = "/tmp/spark/parquet"
> range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id) AS f0").coalesce(1)
> .write.mode("overwrite").parquet(path)
> range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0", 
> "CAST(id AS DOUBLE) AS f1").coalesce(1)
> .write.mode("append").parquet(path)
> val tailoredSchema =
>   new StructType()
> .add(
>   "f0",
>   new StructType()
> .add("f01", LongType, nullable = true)
> .add("f02", LongType, nullable = true),
>   nullable = true)
> read.schema(tailoredSchema).parquet(path).show()
> {noformat}
> Expected output should be:
> {noformat}
> ++
> |  f0|
> ++
> |[0,null]|
> |[1,null]|
> |[2,null]|
> |   [0,0]|
> |   [1,1]|
> |   [2,2]|
> ++
> {noformat}
> However, current 1.5-SNAPSHOT version throws the following exception:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
> at 
> org.apache.spark.sql.execution.SparkPlan$$an

[jira] [Commented] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736961#comment-14736961
 ] 

Davies Liu commented on SPARK-10512:


As we discussed here 
https://github.com/apache/spark/pull/8657#discussion_r38992400, we should add a 
doc for those public APIs instead of putting a workaround in @since. 

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-10512.
--
Resolution: Won't Fix

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736973#comment-14736973
 ] 

Yu Ishikawa commented on SPARK-10512:
-

[~davies] oh, I see. Thank you for letting me know.

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7874:
---

Assignee: Apache Spark

> Add a global setting for the fine-grained mesos scheduler that limits the 
> number of concurrent tasks of a job
> -
>
> Key: SPARK-7874
> URL: https://issues.apache.org/jira/browse/SPARK-7874
> Project: Spark
>  Issue Type: Wish
>  Components: Mesos
>Affects Versions: 1.3.1
>Reporter: Thomas Dudziak
>Assignee: Apache Spark
>Priority: Minor
>
> This would be a very simple yet effective way to prevent a job dominating the 
> cluster. A way to override it per job would also be nice but not required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7874:
---

Assignee: (was: Apache Spark)

> Add a global setting for the fine-grained mesos scheduler that limits the 
> number of concurrent tasks of a job
> -
>
> Key: SPARK-7874
> URL: https://issues.apache.org/jira/browse/SPARK-7874
> Project: Spark
>  Issue Type: Wish
>  Components: Mesos
>Affects Versions: 1.3.1
>Reporter: Thomas Dudziak
>Priority: Minor
>
> This would be a very simple yet effective way to prevent a job dominating the 
> cluster. A way to override it per job would also be nice but not required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10441) Cannot write timestamp to JSON

2015-09-09 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736986#comment-14736986
 ] 

Don Drake commented on SPARK-10441:
---

Got it, thanks for the clarification.

> Cannot write timestamp to JSON
> --
>
> Key: SPARK-10441
> URL: https://issues.apache.org/jira/browse/SPARK-10441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736984#comment-14736984
 ] 

Apache Spark commented on SPARK-7874:
-

User 'dragos' has created a pull request for this issue:
https://github.com/apache/spark/pull/8671

> Add a global setting for the fine-grained mesos scheduler that limits the 
> number of concurrent tasks of a job
> -
>
> Key: SPARK-7874
> URL: https://issues.apache.org/jira/browse/SPARK-7874
> Project: Spark
>  Issue Type: Wish
>  Components: Mesos
>Affects Versions: 1.3.1
>Reporter: Thomas Dudziak
>Priority: Minor
>
> This would be a very simple yet effective way to prevent a job dominating the 
> cluster. A way to override it per job would also be nice but not required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737001#comment-14737001
 ] 

Glenn Strycker commented on SPARK-10493:


In this example, our RDDs are partitioned with a hash partition, but are not 
ordered.

I think you may be confusing zipPartitions with zipWithIndex... zipPartitions 
is used to merge two sets partition-wise, which enables a union without 
requiring any shuffles.  We use zipPartitions throughout our code to make 
things fast, and then apply partitionBy() periodically to do the shuffles only 
when needed.  No ordering is required.  We're also not concerned with 
uniqueness at this point (in fact, for my application I want to keep 
multiplicity UNTIL the reduceByKey step), so hash collisions and such are ok 
for our zipPartition union step.

As I've been investigating this the past few days, I went ahead and made an 
intermediate temp RDD that does the zipPartitions, runs partitionBy, persists, 
checkpoints, and then materializes the RDD.  So I think this rules out that 
zipPartitions is causing the problems downstream for the main RDD, which only 
runs reduceByKey on the intermediate RDD.
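
In sketch form, the intermediate RDD is built roughly like this (the names, data, 
and checkpoint directory are illustrative stand-ins, not the real code):

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val numPartitions = 4
val tempA = sc.parallelize(Seq((("a", "b"), 1L), (("c", "d"), 2L)))
val tempB = sc.parallelize(Seq((("a", "b"), 3L), (("e", "f"), 4L)))

// Partition-wise concatenation (no shuffle), then one explicit shuffle via partitionBy.
val unioned = tempA.zipPartitions(tempB, preservesPartitioning = true) {
  (iter1, iter2) => iter1 ++ iter2
}
val intermediate = unioned
  .partitionBy(new HashPartitioner(numPartitions))
  .persist(StorageLevel.MEMORY_AND_DISK)

sc.setCheckpointDir("/tmp/checkpoints")  // any reachable path
intermediate.checkpoint()
intermediate.count()                     // materialize before the downstream reduceByKey
{code}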

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10516) Add values as a property to DenseVector in PySpark

2015-09-09 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10516:
-

 Summary: Add values as a property to DenseVector in PySpark
 Key: SPARK-10516
 URL: https://issues.apache.org/jira/browse/SPARK-10516
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Priority: Trivial


We use `values` in Scala but `array` in PySpark. We should add `values` as a 
property to match the Scala implementation.
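
For reference, a small sketch of the Scala member that the PySpark property would 
mirror:

{code}
import org.apache.spark.mllib.linalg.DenseVector

val dv = new DenseVector(Array(1.0, 2.0, 3.0))
val arr: Array[Double] = dv.values  // the member a PySpark `values` property would expose
{code}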



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-09-09 Thread JIRA
Maciej Bryński created SPARK-10517:
--

 Summary: Console "Output" field is empty when using 
DataFrameWriter.json
 Key: SPARK-10517
 URL: https://issues.apache.org/jira/browse/SPARK-10517
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Maciej Bryński
Priority: Minor


On the HTTP application UI, the "Output" field is empty when using DataFrameWriter.json.

It should be the size of the written bytes.
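
A minimal sketch to reproduce (the DataFrame and output path are arbitrary):

{code}
// Write any DataFrame as JSON, then check the "Output" column of the
// corresponding stage in the application web UI.
val df = sqlContext.range(0, 1000)
df.write.json("/tmp/spark-10517-repro")
{code}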



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-09-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10517:
---
Attachment: screenshot-1.png

> Console "Output" field is empty when using DataFrameWriter.json
> ---
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> On HTTP application UI "Output" field is empty when using 
> DataFrameWriter.json.
> It should be the size of the written bytes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-09-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10517:
---
Description: 
On the HTTP application UI, the "Output" field is empty when using DataFrameWriter.json.

It should be the size of the written bytes.

Screenshot attached.

  was:
On HTTP application UI "Output" field is empty when using DataFrameWriter.json.

Should by size of written bytes.


> Console "Output" field is empty when using DataFrameWriter.json
> ---
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> On HTTP application UI "Output" field is empty when using 
> DataFrameWriter.json.
> It should be the size of the written bytes.
> Screenshot attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-09-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10517:
---
Attachment: (was: screenshot-1.png)

> Console "Output" field is empty when using DataFrameWriter.json
> ---
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> On HTTP application UI "Output" field is empty when using 
> DataFrameWriter.json.
> It should be the size of the written bytes.
> Screenshot attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-09-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10517:
---
Attachment: screenshot-1.png

> Console "Output" field is empty when using DataFrameWriter.json
> ---
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> On HTTP application UI "Output" field is empty when using 
> DataFrameWriter.json.
> It should be the size of the written bytes.
> Screenshot attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737051#comment-14737051
 ] 

Sean Owen commented on SPARK-10493:
---

I think you still have the same issue with zipPartitions, unless you have an 
ordering on the RDD, since the partitions may not appear in any particular 
order, in which case zipping them may give different results. It may still not 
be the issue though, since a lot of partitionings will happen to have the 
assumed, same order anyway.

Why would this necessarily be better than union()? If you have the same # of 
partitions and the same partitioning, you shouldn't have a shuffle. That's also by 
the by.

I can't reproduce this in a simple, similar local example. I think there's 
something else different between what you're doing and the code snippet here.
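
To illustrate the union() point, a sketch with made-up, co-partitioned inputs:

{code}
import org.apache.spark.HashPartitioner

val p = new HashPartitioner(4)
val rddA = sc.parallelize(Seq(("k1", 1), ("k2", 2))).partitionBy(p)
val rddB = sc.parallelize(Seq(("k1", 3), ("k3", 4))).partitionBy(p)

// union() never shuffles; when both sides share a partitioner the result can keep
// it, so the reduceByKey below doesn't need to repartition either.
val combined = rddA.union(rddB)
println(combined.partitioner)                       // whether the partitioner was kept
println(combined.reduceByKey(_ + _).toDebugString)  // inspect for extra shuffle stages
{code}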

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737055#comment-14737055
 ] 

Glenn Strycker edited comment on SPARK-10493 at 9/9/15 3:40 PM:


I'm still working on checking unit tests and examples and such, but I'll go 
ahead and post here some simple code I am currently running in Spark Shell.  
The attached code works correctly as expected in Spark Shell, but I am getting 
different results when running my code in an sbt-compiled jar sent to Yarn via 
spark-submit.

Pay special attention to the temp5 RDD, and the toDebugString.  This is where 
my spark-submit code results differ.  In that code, I am getting an RDD 
returned that is not collapsing the key pairs (cluster041,cluster043) or 
(cluster041,cluster044)



was (Author: glenn.stryc...@gmail.com):
I'm still working on checking unit tests and examples and such, but I'll go 
ahead and post here some simply code I am currently running in Spark Shell.  
The attached code works correctly as expected in Spark Shell, but I am getting 
different results when running my code in an sbt-compiled jar sent to Yarn via 
spark-submit.

Pay special attention to the temp5 RDD, and the toDebugString.  This is where 
my spark-submit code results differ.  In that code, I am getting an RDD 
returned that is not collapsing the key pairs (cluster041,cluster043) or 
(cluster041,cluster044)


> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glenn Strycker updated SPARK-10493:
---
Attachment: reduceByKey_example_001.scala

I'm still working on checking unit tests and examples and such, but I'll go 
ahead and post here some simple code I am currently running in Spark Shell.  
The attached code works correctly as expected in Spark Shell, but I am getting 
different results when running my code in an sbt-compiled jar sent to Yarn via 
spark-submit.

Pay special attention to the temp5 RDD, and the toDebugString.  This is where 
my spark-submit code results differ.  In that code, I am getting an RDD 
returned that is not collapsing the key pairs (cluster041,cluster043) or 
(cluster041,cluster044)


> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737095#comment-14737095
 ] 

Apache Spark commented on SPARK-10514:
--

User 'SleepyThread' has created a pull request for this issue:
https://github.com/apache/spark/pull/8672

> Minimum ratio of registered resources [ 
> spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse 
> Grained mode
> -
>
> Key: SPARK-10514
> URL: https://issues.apache.org/jira/browse/SPARK-10514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Akash Mishra
>
> "spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not 
> affecting the Mesos Coarse Grained mode. This is because the scheduler is not 
> overriding the "sufficientResourcesRegistered" function which is true by 
> default. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10514:


Assignee: (was: Apache Spark)

> Minimum ratio of registered resources [ 
> spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse 
> Grained mode
> -
>
> Key: SPARK-10514
> URL: https://issues.apache.org/jira/browse/SPARK-10514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Akash Mishra
>
> "spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not 
> affecting the Mesos Coarse Grained mode. This is because the scheduler is not 
> overriding the "sufficientResourcesRegistered" function which is true by 
> default. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10514:


Assignee: Apache Spark

> Minimum ratio of registered resources [ 
> spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse 
> Grained mode
> -
>
> Key: SPARK-10514
> URL: https://issues.apache.org/jira/browse/SPARK-10514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Akash Mishra
>Assignee: Apache Spark
>
> "spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not 
> affecting the Mesos Coarse Grained mode. This is because the scheduler is not 
> overriding the "sufficientResourcesRegistered" function which is true by 
> default. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode

2015-09-09 Thread Akash Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737103#comment-14737103
 ] 

Akash Mishra commented on SPARK-10514:
--

Created a pull request https://github.com/apache/spark/pull/8672 for this bug.
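
For reference, the setting in question is an ordinary Spark configuration entry; a 
minimal sketch of using it (the values are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Wait until 80% of the requested resources have registered (or the waiting time
// elapses) before the scheduler starts launching tasks.
val conf = new SparkConf()
  .setAppName("min-registered-resources-demo")
  .set("spark.scheduler.minRegisteredResourcesRatio", "0.8")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "30s")
val sc = new SparkContext(conf)
{code}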

> Minimum ratio of registered resources [ 
> spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse 
> Grained mode
> -
>
> Key: SPARK-10514
> URL: https://issues.apache.org/jira/browse/SPARK-10514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Akash Mishra
>
> "spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not 
> affecting the Mesos Coarse Grained mode. This is because the scheduler is not 
> overriding the "sufficientResourcesRegistered" function which is true by 
> default. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10117) Implement SQL data source API for reading LIBSVM data

2015-09-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10117.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8537
[https://github.com/apache/spark/pull/8537]

> Implement SQL data source API for reading LIBSVM data
> -
>
> Key: SPARK-10117
> URL: https://issues.apache.org/jira/browse/SPARK-10117
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
> Fix For: 1.6.0
>
>
> It would be convenient to implement a data source API for the LIBSVM format to 
> get better integration with DataFrames and the ML pipeline API.
> {code}
> import org.apache.spark.ml.source.libsvm._
> val training = sqlContext.read
>   .format("libsvm")
>   .option("numFeatures", "1")
>   .load("path")
> {code}
> This JIRA covers the following:
> 1. Read LIBSVM data as a DataFrame with two columns: label: Double and 
> features: Vector.
> 2. Accept `numFeatures` as an option.
> 3. The implementation should live under `org.apache.spark.ml.source.libsvm`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10495) For json data source, date values are saved as int strings

2015-09-09 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735964#comment-14735964
 ] 

Yin Huai edited comment on SPARK-10495 at 9/9/15 4:40 PM:
--

The bug itself is fixed by https://issues.apache.org/jira/browse/SPARK-10441.


was (Author: yhuai):
I think it is fixed by https://issues.apache.org/jira/browse/SPARK-10441.

> For json data source, date values are saved as int strings
> --
>
> Key: SPARK-10495
> URL: https://issues.apache.org/jira/browse/SPARK-10495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> {code}
> val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j")
> df.write.format("json").save("/tmp/testJson")
> sc.textFile("/tmp/testJson").collect.foreach(println)
> {"i":1,"j":"-25567"}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10495) For json data source, date values are saved as int strings

2015-09-09 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737161#comment-14737161
 ] 

Yin Huai commented on SPARK-10495:
--

Since we shipped Spark 1.5.0 with this issue, it will be good to have a way to 
read this format in 1.5.1.

> For json data source, date values are saved as int strings
> --
>
> Key: SPARK-10495
> URL: https://issues.apache.org/jira/browse/SPARK-10495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> {code}
> val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j")
> df.write.format("json").save("/tmp/testJson")
> sc.textFile("/tmp/testJson").collect.foreach(println)
> {"i":1,"j":"-25567"}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-09-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737185#comment-14737185
 ] 

Davies Liu edited comment on SPARK-10309 at 9/9/15 4:53 PM:


[~nadenf] Thanks for letting us know, just realized that your stacktrace 
already includes that fix.

Maybe there are multiple join/aggregation/sort in your query? You can show the 
physical plan by `df.explain()` 


was (Author: davies):
[~nadenf] Thanks for letting us know, just realized that your stacktrace 
already including that fix.

Maybe there are multiple join/aggregation/sort in your query? You can show the 
physical plan by `df.eplain()` 

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-09-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737185#comment-14737185
 ] 

Davies Liu commented on SPARK-10309:


[~nadenf] Thanks for letting us know, just realized that your stacktrace 
already includes that fix.

Maybe there are multiple join/aggregation/sort in your query? You can show the 
physical plan by `df.explain()` 
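
For example (a small sketch; any DataFrame will do):

{code}
// explain() prints the physical plan; explain(true) also shows the
// parsed/analyzed/optimized plans.
val df = sqlContext.range(0, 100).groupBy("id").count()
df.explain()
df.explain(true)
{code}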

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10518) Update code examples in spark.ml user guide to use LIBSVM data source instead of MLUtils

2015-09-09 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10518:
-

 Summary: Update code examples in spark.ml user guide to use LIBSVM 
data source instead of MLUtils
 Key: SPARK-10518
 URL: https://issues.apache.org/jira/browse/SPARK-10518
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Priority: Minor


SPARK-10117 was merged, so we should use the LIBSVM data source in the example 
code in the spark.ml user guide, e.g.,

{code}
val df = sqlContext.read.format("libsvm").load("path")
{code}

instead of

{code}
val df = MLUtils.loadLibSVMFile(sc, "path").toDF()
{code}

We should update the following:

{code}
ml-ensembles.md:40:val data = MLUtils.loadLibSVMFile(sc,
ml-ensembles.md:87:RDD data = MLUtils.loadLibSVMFile(jsc.sc(),
ml-features.md:866:val data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt").toDF()
ml-features.md:892:JavaRDD rdd = MLUtils.loadLibSVMFile(sc.sc(),
ml-features.md:917:data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt").toDF()
ml-features.md:940:val data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt")
ml-features.md:964:  MLUtils.loadLibSVMFile(jsc.sc(), 
"data/mllib/sample_libsvm_data.txt").toJavaRDD();
ml-features.md:985:data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt")
ml-features.md:1022:val data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt")
ml-features.md:1047:  MLUtils.loadLibSVMFile(jsc.sc(), 
"data/mllib/sample_libsvm_data.txt").toJavaRDD();
ml-features.md:1068:data = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt")
ml-linear-methods.md:44:val training = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt").toDF()
ml-linear-methods.md:84:DataFrame training = 
sql.createDataFrame(MLUtils.loadLibSVMFile(sc, path).toJavaRDD(), 
LabeledPoint.class);
ml-linear-methods.md:110:training = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt").toDF()
{code}
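
For reference, a migrated snippet could look roughly like the sketch below; the 
LogisticRegression estimator and its parameters are illustrative assumptions, 
not part of this ticket:

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Load the sample data through the LIBSVM data source (SPARK-10117) instead of
// MLUtils.loadLibSVMFile(...).toDF(); the result is a DataFrame with "label"
// and "features" columns.
val training = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)
{code}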



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10495) For json data source, date values are saved as int strings

2015-09-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10495:
-
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.1)

> For json data source, date values are saved as int strings
> --
>
> Key: SPARK-10495
> URL: https://issues.apache.org/jira/browse/SPARK-10495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> {code}
> val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j")
> df.write.format("json").save("/tmp/testJson")
> sc.textFile("/tmp/testJson").collect.foreach(println)
> {"i":1,"j":"-25567"}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10495) For json data source, date values are saved as int strings

2015-09-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10495:
-
Target Version/s: 1.5.1
Priority: Blocker  (was: Critical)

> For json data source, date values are saved as int strings
> --
>
> Key: SPARK-10495
> URL: https://issues.apache.org/jira/browse/SPARK-10495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> {code}
> val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j")
> df.write.format("json").save("/tmp/testJson")
> sc.textFile("/tmp/testJson").collect.foreach(println)
> {"i":1,"j":"-25567"}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10481) SPARK_PREPEND_CLASSES make spark-yarn related jar could not be found

2015-09-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10481.

   Resolution: Fixed
 Assignee: Jeff Zhang
Fix Version/s: 1.6.0

> SPARK_PREPEND_CLASSES make spark-yarn related jar could not be found
> 
>
> Key: SPARK-10481
> URL: https://issues.apache.org/jira/browse/SPARK-10481
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.4.1
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 1.6.0
>
>
> It happens when SPARK_PREPEND_CLASSES is set and Spark is run on YARN.
> If SPARK_PREPEND_CLASSES is set, the spark-yarn related jar won't be found, 
> because org.apache.spark.deploy.Client is detected as an individual class 
> rather than a class in a jar.
> {code}
> 15/09/08 08:57:10 ERROR SparkContext: Error initializing SparkContext.
> java.util.NoSuchElementException: head of empty list
>   at scala.collection.immutable.Nil$.head(List.scala:337)
>   at scala.collection.immutable.Nil$.head(List.scala:334)
>   at 
> org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$sparkJar(Client.scala:1048)
>   at 
> org.apache.spark.deploy.yarn.Client$.populateClasspath(Client.scala:1159)
>   at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:534)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:645)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:514)
>   at com.zjffdu.tutorial.spark.WordCount$.main(WordCount.scala:24)
>   at com.zjffdu.tutorial.spark.WordCount.main(WordCount.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737252#comment-14737252
 ] 

Sean Owen commented on SPARK-10493:
---

What do you mean that it's not collapsing key pairs? The output of temp5 shows 
the same keys and the same count in both cases. The keys are distinct and in 
order after {{temp5.sortByKey(true).collect().foreach(println)}}.

Here's my simplistic test case which gives a consistent count when I run the 
code above on this:

{code}
val bWords = sc.broadcast(sc.textFile("/usr/share/dict/words").collect())

val tempRDD1 = sc.parallelize(1 to 1000, 10).mapPartitionsWithIndex { (i, 
ns) =>
  val words = bWords.value
  val random = new scala.util.Random(i)
  ns.map { n => 
val a = words(random.nextInt(words.length))
val b = words(random.nextInt(words.length))
val c = words(random.nextInt(words.length))
val d = random.nextInt(words.length)
val e = random.nextInt(words.length)
val f = random.nextInt(words.length)
val g = random.nextInt(words.length)
((a, b), (c, d, e, f, g))
  }
}

val tempRDD2 = sc.parallelize(1 to 1000, 10).mapPartitionsWithIndex { (i, 
ns) =>
  val words = bWords.value
  val random = new scala.util.Random(i)
  ns.map { n => 
val a = words(random.nextInt(words.length))
val b = words(random.nextInt(words.length))
val c = words(random.nextInt(words.length))
val d = random.nextInt(words.length)
val e = random.nextInt(words.length)
val f = random.nextInt(words.length)
val g = random.nextInt(words.length)
((a, b), (c, d, e, f, g))
  }
}
{code}

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737296#comment-14737296
 ] 

Glenn Strycker commented on SPARK-10493:


[~srowen], the code I attached did run correctly. However, I have similar code 
that I run on YARN via spark-submit that is NOT returning 1 record per key.

I mean that when I run the code that generates temp5 via spark-submit, I get a 
set as follows:

{noformat}
((cluster021,cluster023),(cluster021,1,2,1,3))
((cluster031,cluster033),(cluster031,1,2,1,3))
((cluster041,cluster043),(cluster041,5,2,1,3))
((cluster041,cluster043),(cluster041,1,2,1,3))
((cluster041,cluster044),(cluster041,3,2,1,3))
((cluster041,cluster044),(cluster041,4,2,1,3))
((cluster051,cluster052),(cluster051,6,2,1,3))
((cluster051,cluster053),(cluster051,1,2,1,3))
((cluster051,cluster054),(cluster051,1,2,1,3))
((cluster051,cluster055),(cluster051,1,2,1,3))
((cluster051,cluster056),(cluster051,1,2,1,3))
((cluster052,cluster053),(cluster051,1,1,1,2))
((cluster052,cluster054),(cluster051,8,1,1,2))
((cluster053,cluster054),(cluster051,7,1,1,2))
((cluster055,cluster056),(cluster051,9,1,1,2))
{noformat}

Note that the keys (cluster041,cluster043) and (cluster041,cluster044) have 2 
records each in the results, which should NEVER happen!

Here is what I expected (which is what I see in my example code I attached to 
this ticket, which ran successfully in spark-shell):

{noformat}
((cluster021,cluster023),(cluster021,1,2,1,3))
((cluster031,cluster033),(cluster031,1,2,1,3))
((cluster041,cluster043),(cluster041,6,2,1,3))
((cluster041,cluster044),(cluster041,7,2,1,3))
((cluster051,cluster052),(cluster051,6,2,1,3))
((cluster051,cluster053),(cluster051,1,2,1,3))
((cluster051,cluster054),(cluster051,1,2,1,3))
((cluster051,cluster055),(cluster051,1,2,1,3))
((cluster051,cluster056),(cluster051,1,2,1,3))
((cluster052,cluster053),(cluster051,1,1,1,2))
((cluster052,cluster054),(cluster051,8,1,1,2))
((cluster053,cluster054),(cluster051,7,1,1,2))
((cluster055,cluster056),(cluster051,9,1,1,2))
{noformat}

You are right that in my example distinct is not really the issue, since the 
records with the same keys do have different values. The issue is with 
reduceByKey, which is NOT correctly reducing my RDDs down to 1 record per key. 
Does reduceByKey not support (String, String) keys?
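
(For reference, a minimal sketch -- not taken from the attached code -- showing 
that reduceByKey does collapse (String, String) tuple keys, since Scala tuples 
implement equals/hashCode consistently:)

{code}
val pairs = sc.parallelize(Seq(
  (("cluster041", "cluster043"), 5),
  (("cluster041", "cluster043"), 1),
  (("cluster041", "cluster044"), 3)))

// Expect one record per key; output order may vary:
// ((cluster041,cluster043),6)
// ((cluster041,cluster044),3)
pairs.reduceByKey(_ + _).collect().foreach(println)
{code}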

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10461) make sure `input.primitive` is always variable name not code at GenerateUnsafeProjection

2015-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10461.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8613
[https://github.com/apache/spark/pull/8613]

> make sure `input.primitive` is always variable name not code at 
> GenerateUnsafeProjection
> 
>
> Key: SPARK-10461
> URL: https://issues.apache.org/jira/browse/SPARK-10461
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Minor
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON

2015-09-09 Thread Yin Huai (JIRA)
Yin Huai created SPARK-10519:


 Summary: Investigate if we should encode timezone information to a 
timestamp value stored in JSON
 Key: SPARK-10519
 URL: https://issues.apache.org/jira/browse/SPARK-10519
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Yin Huai
Priority: Minor


Since Spark 1.3, we store a timestamp in JSON without encoding the timezone 
information, and the string representation of a timestamp stored in JSON 
implicitly uses the local timezone (see 
[1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454],
 
[2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38],
 
[3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41],
 
[4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]).
 This behavior may cause data consumers to get different values when they are 
in a different timezone from the data producers.

Since JSON is string based, if we encode timezone information to timestamp 
value, downstream applications may need to change their code (for example, 
java.sql.Timestamp.valueOf only supports the format of {{-\[m]m-\[d]d 
hh:mm:ss\[.f...]}}).

We should investigate what we should do about this issue. Right now, I can 
think of three options:

1. Encoding timezone info in the timestamp value, which can break user code and 
may change the semantic of timestamp (our timestamp value is timezone-less).
2. When saving a timestamp value to JSON, we treat this value as a value in the 
local timezone and convert it to UTC time. Then, when saving the data, we do not 
encode timezone info in the value.
3. We do not change our current behavior. But, in our doc, we explicitly say 
that users need to use a single timezone for their datasets (e.g. always use 
UTC time). 
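
For illustration only (not part of any of the options' implementation), here is 
a minimal sketch of how the same timestamp value renders differently depending 
on the formatter's timezone, which is the crux of options 1 and 2:

{code}
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone

val ts = Timestamp.valueOf("2015-09-09 12:00:00")

// Current behavior: format in the producer's local (JVM default) timezone.
val localFmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

// Option 2 direction: always render the instant in UTC before writing JSON.
val utcFmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
utcFmt.setTimeZone(TimeZone.getTimeZone("UTC"))

println(localFmt.format(ts))  // depends on the local timezone of this JVM
println(utcFmt.format(ts))    // the same instant, rendered in UTC
{code}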




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON

2015-09-09 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737311#comment-14737311
 ] 

Yin Huai commented on SPARK-10519:
--

cc [~davies]

I feel that option 3 is better.

> Investigate if we should encode timezone information to a timestamp value 
> stored in JSON
> 
>
> Key: SPARK-10519
> URL: https://issues.apache.org/jira/browse/SPARK-10519
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> Since Spark 1.3, we store a timestamp in JSON without encoding the timezone 
> information and the string representation of a timestamp stored in JSON 
> implicitly using the local timezone (see 
> [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454],
>  
> [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38],
>  
> [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41],
>  
> [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]).
>  This behavior may cause the data consumers got different values when they 
> are in a different timezone with the data producers.
> Since JSON is string based, if we encode timezone information to timestamp 
> value, downstream applications may need to change their code (for example, 
> java.sql.Timestamp.valueOf only supports the format of {{-\[m]m-\[d]d 
> hh:mm:ss\[.f...]}}).
> We should investigate what we should do about this issue. Right now, I can 
> think of three options:
> 1. Encoding timezone info in the timestamp value, which can break user code 
> and may change the semantic of timestamp (our timestamp value is 
> timezone-less).
> 2. When saving a timestamp value to json, we treat this value as a value in 
> the local timezone and convert it to UTC time. Then, when save the data, we 
> do not encode timezone info in the value.
> 3. We do not change our current behavior. But, in our doc, we explicitly say 
> that users need to use a single timezone for their datasets (e.g. always use 
> UTC time). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON

2015-09-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10519:
-
Target Version/s: 1.6.0

> Investigate if we should encode timezone information to a timestamp value 
> stored in JSON
> 
>
> Key: SPARK-10519
> URL: https://issues.apache.org/jira/browse/SPARK-10519
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> Since Spark 1.3, we store a timestamp in JSON without encoding the timezone 
> information and the string representation of a timestamp stored in JSON 
> implicitly using the local timezone (see 
> [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454],
>  
> [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38],
>  
> [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41],
>  
> [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]).
>  This behavior may cause the data consumers got different values when they 
> are in a different timezone with the data producers.
> Since JSON is string based, if we encode timezone information to timestamp 
> value, downstream applications may need to change their code (for example, 
> java.sql.Timestamp.valueOf only supports the format of {{-\[m]m-\[d]d 
> hh:mm:ss\[.f...]}}).
> We should investigate what we should do about this issue. Right now, I can 
> think of three options:
> 1. Encoding timezone info in the timestamp value, which can break user code 
> and may change the semantic of timestamp (our timestamp value is 
> timezone-less).
> 2. When saving a timestamp value to json, we treat this value as a value in 
> the local timezone and convert it to UTC time. Then, when save the data, we 
> do not encode timezone info in the value.
> 3. We do not change our current behavior. But, in our doc, we explicitly say 
> that users need to use a single timezone for their datasets (e.g. always use 
> UTC time). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON

2015-09-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737375#comment-14737375
 ] 

Davies Liu commented on SPARK-10519:


+1 for 3; users have the ability to control the timezone, and it's also compatible. 

> Investigate if we should encode timezone information to a timestamp value 
> stored in JSON
> 
>
> Key: SPARK-10519
> URL: https://issues.apache.org/jira/browse/SPARK-10519
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> Since Spark 1.3, we store a timestamp in JSON without encoding the timezone 
> information and the string representation of a timestamp stored in JSON 
> implicitly using the local timezone (see 
> [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454],
>  
> [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38],
>  
> [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41],
>  
> [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]).
>  This behavior may cause the data consumers got different values when they 
> are in a different timezone with the data producers.
> Since JSON is string based, if we encode timezone information to timestamp 
> value, downstream applications may need to change their code (for example, 
> java.sql.Timestamp.valueOf only supports the format of {{-\[m]m-\[d]d 
> hh:mm:ss\[.f...]}}).
> We should investigate what we should do about this issue. Right now, I can 
> think of three options:
> 1. Encoding timezone info in the timestamp value, which can break user code 
> and may change the semantic of timestamp (our timestamp value is 
> timezone-less).
> 2. When saving a timestamp value to json, we treat this value as a value in 
> the local timezone and convert it to UTC time. Then, when save the data, we 
> do not encode timezone info in the value.
> 3. We do not change our current behavior. But, in our doc, we explicitly say 
> that users need to use a single timezone for their datasets (e.g. always use 
> UTC time). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10474:
---
Target Version/s: 1.6.0, 1.5.1
Priority: Blocker  (was: Critical)

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In an aggregation case, a lost task failed with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9924) checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up

2015-09-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737416#comment-14737416
 ] 

Thomas Graves commented on SPARK-9924:
--

[~vanzin] Any reason this wasn't picked back into the Spark 1.5 branch?

> checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up
> ---
>
> Key: SPARK-9924
> URL: https://issues.apache.org/jira/browse/SPARK-9924
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Rohit Agarwal
>Assignee: Rohit Agarwal
> Fix For: 1.6.0
>
>
> {{checkForLogs}} and {{cleanLogs}} are scheduled using 
> {{ScheduledThreadPoolExecutor.scheduleAtFixedRate}}. When their execution 
> takes more time than the interval at which they are scheduled, they get piled 
> up.
> This is a problem on its own but the existence of SPARK-7189 makes it even 
> worse. Let's say there is an eventLog which takes 15s to parse and which 
> happens to be the last modified file (that gets reloaded again and again due 
> to SPARK-7189.) If this file stays the last modified file for, let's say, an 
> hour, then a lot of executions of that file would have piled up as the 
> default {{spark.history.fs.update.interval}} is 10s. If there is a new 
> eventLog file now, it won't show up in the history server ui for a long time.
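
(For context, a minimal sketch -- not taken from the actual patch -- of why a 
fixed rate piles up when a run outlasts its period, and the fixed-delay 
alternative:)

{code}
import java.util.concurrent.{Executors, TimeUnit}

val pool = Executors.newScheduledThreadPool(1)

val slowTask = new Runnable {
  def run(): Unit = {
    println(s"checkForLogs started at ${System.currentTimeMillis()}")
    Thread.sleep(15000)  // pretend parsing one event log takes 15s
  }
}

// With a 10s period and a 15s task, scheduleAtFixedRate queues overdue runs
// back to back as soon as the previous one finishes.
pool.scheduleAtFixedRate(slowTask, 0, 10, TimeUnit.SECONDS)

// scheduleWithFixedDelay instead waits 10s after each run completes, so
// executions cannot pile up:
// pool.scheduleWithFixedDelay(slowTask, 0, 10, TimeUnit.SECONDS)
{code}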



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9924) checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up

2015-09-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737452#comment-14737452
 ] 

Marcelo Vanzin commented on SPARK-9924:
---

Timing, I guess (it went in around code freeze time). We can backport it to 
1.5.1.

> checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up
> ---
>
> Key: SPARK-9924
> URL: https://issues.apache.org/jira/browse/SPARK-9924
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Rohit Agarwal
>Assignee: Rohit Agarwal
> Fix For: 1.6.0
>
>
> {{checkForLogs}} and {{cleanLogs}} are scheduled using 
> {{ScheduledThreadPoolExecutor.scheduleAtFixedRate}}. When their execution 
> takes more time than the interval at which they are scheduled, they get piled 
> up.
> This is a problem on its own but the existence of SPARK-7189 makes it even 
> worse. Let's say there is an eventLog which takes 15s to parse and which 
> happens to be the last modified file (that gets reloaded again and again due 
> to SPARK-7189.) If this file stays the last modified file for, let's say, an 
> hour, then a lot of executions of that file would have piled up as the 
> default {{spark.history.fs.update.interval}} is 10s. If there is a new 
> eventLog file now, it won't show up in the history server ui for a long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9503) Mesos dispatcher NullPointerException (MesosClusterScheduler)

2015-09-09 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737453#comment-14737453
 ] 

Timothy Chen commented on SPARK-9503:
-

Sorry, this is indeed a bug, and a fix is already in 1.5.
Please try out the just-released 1.5 and it shouldn't happen.

> Mesos dispatcher NullPointerException (MesosClusterScheduler)
> -
>
> Key: SPARK-9503
> URL: https://issues.apache.org/jira/browse/SPARK-9503
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.1
> Environment: branch-1.4 #8dfdca46dd2f527bf653ea96777b23652bc4eb83
>Reporter: Sebastian YEPES FERNANDEZ
>  Labels: mesosphere
>
> Hello,
> I have just started using start-mesos-dispatcher and have been noticing 
> some random NPE crashes.
> By looking at the exception, it looks like in certain situations 
> "queuedDrivers" is empty, which causes the NPE at "submission.cores"
> https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L512-L516
> {code:title=log|borderStyle=solid}
> 15/07/30 23:56:44 INFO MesosRestServer: Started REST server for submitting 
> applications on port 7077
> Exception in thread "Thread-1647" java.lang.NullPointerException
> at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:437)
> at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:436)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:436)
> at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:512)
> I0731 00:53:52.969518  7014 sched.cpp:1625] Asked to abort the driver
> I0731 00:53:52.969895  7014 sched.cpp:861] Aborting framework 
> '20150730-234528-4261456064-5050-61754-'
> 15/07/31 00:53:52 INFO MesosClusterScheduler: driver.run() returned with code 
> DRIVER_ABORTED
> {code}
> A side effect of this NPE is that after the crash the dispatcher will not 
> start because its already registered #SPARK-7831
> {code:title=log|borderStyle=solid}
> 15/07/31 09:55:47 INFO MesosClusterUI: Started MesosClusterUI at 
> http://192.168.0.254:8081
> I0731 09:55:47.715039  8162 sched.cpp:157] Version: 0.23.0
> I0731 09:55:47.717013  8163 sched.cpp:254] New master detected at 
> master@192.168.0.254:5050
> I0731 09:55:47.717381  8163 sched.cpp:264] No credentials provided. 
> Attempting to register without authentication
> I0731 09:55:47.718246  8177 sched.cpp:819] Got error 'Completed framework 
> attempted to re-register'
> I0731 09:55:47.718268  8177 sched.cpp:1625] Asked to abort the driver
> 15/07/31 09:55:47 ERROR MesosClusterScheduler: Error received: Completed 
> framework attempted to re-register
> I0731 09:55:47.719091  8177 sched.cpp:861] Aborting framework 
> '20150730-234528-4261456064-5050-61754-0038'
> 15/07/31 09:55:47 INFO MesosClusterScheduler: driver.run() returned with code 
> DRIVER_ABORTED
> 15/07/31 09:55:47 INFO Utils: Shutdown hook called
> {code}
> I can get around this by removing the zk data:
> {code:title=zkCli.sh|borderStyle=solid}
> rmr /spark_mesos_dispatcher
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9924) checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up

2015-09-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737478#comment-14737478
 ] 

Thomas Graves commented on SPARK-9924:
--

OK, thanks. I wanted to make sure there were no known issues with pulling it back.

> checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up
> ---
>
> Key: SPARK-9924
> URL: https://issues.apache.org/jira/browse/SPARK-9924
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Rohit Agarwal
>Assignee: Rohit Agarwal
> Fix For: 1.6.0
>
>
> {{checkForLogs}} and {{cleanLogs}} are scheduled using 
> {{ScheduledThreadPoolExecutor.scheduleAtFixedRate}}. When their execution 
> takes more time than the interval at which they are scheduled, they get piled 
> up.
> This is a problem on its own but the existence of SPARK-7189 makes it even 
> worse. Let's say there is an eventLog which takes 15s to parse and which 
> happens to be the last modified file (that gets reloaded again and again due 
> to SPARK-7189.) If this file stays the last modified file for, let's say, an 
> hour, then a lot of executions of that file would have piled up as the 
> default {{spark.history.fs.update.interval}} is 10s. If there is a new 
> eventLog file now, it won't show up in the history server ui for a long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10520) dates cannot be summarised in SparkR

2015-09-09 Thread Vincent Warmerdam (JIRA)
Vincent Warmerdam created SPARK-10520:
-

 Summary: dates cannot be summarised in SparkR
 Key: SPARK-10520
 URL: https://issues.apache.org/jira/browse/SPARK-10520
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.0
Reporter: Vincent Warmerdam


I create a simple dataframe in R and call the summary function on it (standard 
R, not SparkR). 

```
> library(magrittr)
> df <- data.frame(
  date = as.Date("2015-01-01") + 0:99, 
  r = runif(100)
)
> df %>% summary
  date  r  
 Min.   :2015-01-01   Min.   :0.01221  
 1st Qu.:2015-01-25   1st Qu.:0.30003  
 Median :2015-02-19   Median :0.46416  
 Mean   :2015-02-19   Mean   :0.50350  
 3rd Qu.:2015-03-16   3rd Qu.:0.73361  
 Max.   :2015-04-10   Max.   :0.99618  
```

Notice that the date can be summarised here. In SparkR; this will give an error.

```
> ddf <- createDataFrame(sqlContext, df) 
> ddf %>% summary
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
data type mismatch: function average requires numeric types, not DateType;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at org.apache.spark.sql.
```

This is a rather annoying bug since the SparkR documentation currently suggests 
that dates are now supported in SparkR. 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10520) dates cannot be summarised in SparkR

2015-09-09 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10520:
--
Component/s: SQL

> dates cannot be summarised in SparkR
> 
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> ```
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> ```
> Notice that the date can be summarised here. In SparkR; this will give an 
> error.
> ```
> > ddf <- createDataFrame(sqlContext, df) 
> > ddf %>% summary
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
> data type mismatch: function average requires numeric types, not DateType;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at org.apache.spark.sql.
> ```
> This is a rather annoying bug since the SparkR documentation currently 
> suggests that dates are now supported in SparkR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10520) dates cannot be summarised in SparkR

2015-09-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737486#comment-14737486
 ] 

Shivaram Venkataraman commented on SPARK-10520:
---

Thanks for the report -- I think this is a problem in the Spark SQL layer (so 
it should also happen in Scala and Python as well), as we don't support 
summarizing DateType fields.

cc [~rxin] [~davies]
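
A minimal sketch (an untested assumption on my part) reproducing the same 
analysis error from Scala, which suggests the limitation is in the SQL layer 
rather than in SparkR:

{code}
import java.sql.Date
import org.apache.spark.sql.functions.avg
import sqlContext.implicits._

val df = Seq(
  (Date.valueOf("2015-01-01"), 0.1),
  (Date.valueOf("2015-01-02"), 0.2)).toDF("date", "r")

// Fails analysis with: "cannot resolve 'avg(date)' due to data type mismatch:
// function average requires numeric types, not DateType"
df.select(avg($"date")).show()
{code}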

> dates cannot be summarised in SparkR
> 
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> ```
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> ```
> Notice that the date can be summarised here. In SparkR; this will give an 
> error.
> ```
> > ddf <- createDataFrame(sqlContext, df) 
> > ddf %>% summary
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
> data type mismatch: function average requires numeric types, not DateType;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at org.apache.spark.sql.
> ```
> This is a rather annoying bug since the SparkR documentation currently 
> suggests that dates are now supported in SparkR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10520) dates cannot be summarised in SparkR

2015-09-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10520:

Description: 
I create a simple dataframe in R and call the summary function on it (standard 
R, not SparkR). 

{code}
> library(magrittr)
> df <- data.frame(
  date = as.Date("2015-01-01") + 0:99, 
  r = runif(100)
)
> df %>% summary
  date  r  
 Min.   :2015-01-01   Min.   :0.01221  
 1st Qu.:2015-01-25   1st Qu.:0.30003  
 Median :2015-02-19   Median :0.46416  
 Mean   :2015-02-19   Mean   :0.50350  
 3rd Qu.:2015-03-16   3rd Qu.:0.73361  
 Max.   :2015-04-10   Max.   :0.99618  

{code}

Notice that the date can be summarised here. In SparkR; this will give an error.


{code}
> ddf <- createDataFrame(sqlContext, df) 
> ddf %>% summary
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
data type mismatch: function average requires numeric types, not DateType;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at org.apache.spark.sql.
{code}

This is a rather annoying bug since the SparkR documentation currently suggests 
that dates are now supported in SparkR. 



  was:
I create a simple dataframe in R and call the summary function on it (standard 
R, not SparkR). 

{code}
> library(magrittr)
> df <- data.frame(
  date = as.Date("2015-01-01") + 0:99, 
  r = runif(100)
)
> df %>% summary
  date  r  
 Min.   :2015-01-01   Min.   :0.01221  
 1st Qu.:2015-01-25   1st Qu.:0.30003  
 Median :2015-02-19   Median :0.46416  
 Mean   :2015-02-19   Mean   :0.50350  
 3rd Qu.:2015-03-16   3rd Qu.:0.73361  
 Max.   :2015-04-10   Max.   :0.99618  

{code}

Notice that the date can be summarised here. In SparkR; this will give an error.


{code}
> ddf <- createDataFrame(sqlContext, df) 
> ddf %>% summary
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
data type mismatch: function average requires numeric types, not DateType;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at org.apache.spark.sql.

{code}

This is a rather annoying bug since the SparkR documentation currently suggests 
that dates are now supported in SparkR. 




> dates cannot be summarised in SparkR
> 
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> {code}
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> {code}
> Notice that the date can be summa

[jira] [Updated] (SPARK-10520) dates cannot be summarised in SparkR

2015-09-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10520:

Description: 
I create a simple dataframe in R and call the summary function on it (standard 
R, not SparkR). 

{code}
> library(magrittr)
> df <- data.frame(
  date = as.Date("2015-01-01") + 0:99, 
  r = runif(100)
)
> df %>% summary
  date  r  
 Min.   :2015-01-01   Min.   :0.01221  
 1st Qu.:2015-01-25   1st Qu.:0.30003  
 Median :2015-02-19   Median :0.46416  
 Mean   :2015-02-19   Mean   :0.50350  
 3rd Qu.:2015-03-16   3rd Qu.:0.73361  
 Max.   :2015-04-10   Max.   :0.99618  

{code}

Notice that the date can be summarised here. In SparkR; this will give an error.


{code}
> ddf <- createDataFrame(sqlContext, df) 
> ddf %>% summary
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
data type mismatch: function average requires numeric types, not DateType;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at org.apache.spark.sql.

{code}

This is a rather annoying bug since the SparkR documentation currently suggests 
that dates are now supported in SparkR. 



  was:
I create a simple dataframe in R and call the summary function on it (standard 
R, not SparkR). 

```
> library(magrittr)
> df <- data.frame(
  date = as.Date("2015-01-01") + 0:99, 
  r = runif(100)
)
> df %>% summary
  date  r  
 Min.   :2015-01-01   Min.   :0.01221  
 1st Qu.:2015-01-25   1st Qu.:0.30003  
 Median :2015-02-19   Median :0.46416  
 Mean   :2015-02-19   Mean   :0.50350  
 3rd Qu.:2015-03-16   3rd Qu.:0.73361  
 Max.   :2015-04-10   Max.   :0.99618  
```

Notice that the date can be summarised here. In SparkR, this will give an error.

```
> ddf <- createDataFrame(sqlContext, df) 
> ddf %>% summary
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
data type mismatch: function average requires numeric types, not DateType;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at org.apache.spark.sql.
```

This is a rather annoying bug since the SparkR documentation currently suggests 
that dates are now supported in SparkR. 




> dates cannot be summarised in SparkR
> 
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> {code}
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> {code}
> Notice that the date can be summarised here. In

[jira] [Commented] (SPARK-10520) dates cannot be summarised in SparkR

2015-09-09 Thread Vincent Warmerdam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737520#comment-14737520
 ] 

Vincent Warmerdam commented on SPARK-10520:
---

I thought something similar; it seemed natural to post it here though, as it is 
a feature that many R users are used to. 

> dates cannot be summarised in SparkR
> 
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> {code}
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> {code}
> Notice that the date can be summarised here. In SparkR, this will give an 
> error.
> {code}
> > ddf <- createDataFrame(sqlContext, df) 
> > ddf %>% summary
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
> data type mismatch: function average requires numeric types, not DateType;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at org.apache.spark.sql.
> {code}
> This is a rather annoying bug since the SparkR documentation currently 
> suggests that dates are now supported in SparkR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10520) dates cannot be summarised in SparkR

2015-09-09 Thread Vincent Warmerdam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737520#comment-14737520
 ] 

Vincent Warmerdam edited comment on SPARK-10520 at 9/9/15 8:24 PM:
---

I figured as much; it seemed natural to post it here though, as it is a feature 
that many R users are used to. 


was (Author: cantdutchthis):
I thought something similar; it seemed natural to post it here though, as it is 
a feature that many R users are used to. 

> dates cannot be summarised in SparkR
> 
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> {code}
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> {code}
> Notice that the date can be summarised here. In SparkR, this will give an 
> error.
> {code}
> > ddf <- createDataFrame(sqlContext, df) 
> > ddf %>% summary
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
> data type mismatch: function average requires numeric types, not DateType;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at org.apache.spark.sql.
> {code}
> This is a rather annoying bug since the SparkR documentation currently 
> suggests that dates are now supported in SparkR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10436) spark-submit overwrites spark.files defaults with the job script filename

2015-09-09 Thread Sanket Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737544#comment-14737544
 ] 

Sanket Reddy commented on SPARK-10436:
--

I am a newbie and interested in this; I will take a look at it.

> spark-submit overwrites spark.files defaults with the job script filename
> -
>
> Key: SPARK-10436
> URL: https://issues.apache.org/jira/browse/SPARK-10436
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.4.0
> Environment: Ubuntu, Spark 1.4.0 Standalone
>Reporter: axel dahl
>Priority: Minor
>  Labels: easyfix, feature
>
> In my spark-defaults.conf I have configured a set of libraries to be 
> uploaded to my Spark 1.4.0 Standalone cluster.  The entry appears as:
> spark.files  libarary.zip,file1.py,file2.py
> When I execute spark-submit -v test.py,
> I see that spark-submit reads the defaults correctly, but that it overwrites 
> the "spark.files" default entry and replaces it with the name of the job 
> script, i.e. "test.py".
> This behavior doesn't seem intuitive.  test.py should be added to the Spark 
> working folder, but it should not overwrite the "spark.files" defaults.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access

2015-09-09 Thread William Cox (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737580#comment-14737580
 ] 

William Cox commented on SPARK-7442:


Between this issue with the Hadoop 2.6 deploy and the bug with Hadoop 2.4 that 
prevents reading zero byte files off HDFS 
(https://issues.apache.org/jira/browse/HADOOP-10589), I'm hosed. Looking 
forward to a fix on this!

> Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
> -
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
> Environment: OS X
>Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads 
> page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 
> works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 
> that doesn't work.
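
A commonly suggested workaround for the missing s3n filesystem (a sketch, not a confirmed fix for this ticket) is to put the hadoop-aws jar and its matching AWS SDK dependency on the classpath and point the s3n scheme at the standard Hadoop implementation explicitly, for example from spark-shell:

{code}
// Hedged workaround sketch: assumes hadoop-aws and the matching AWS SDK jar are
// on the driver and executor classpaths; credentials come from the environment
// here purely for illustration.
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
println(sc.textFile("s3n://bucket/file_*").count())
{code}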



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10521) Utilize Docker to test DB2 JDBC Dialect support

2015-09-09 Thread Luciano Resende (JIRA)
Luciano Resende created SPARK-10521:
---

 Summary: Utilize Docker to test DB2 JDBC Dialect support
 Key: SPARK-10521
 URL: https://issues.apache.org/jira/browse/SPARK-10521
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1, 1.5.0
Reporter: Luciano Resende


There was a discussion in SPARK-10170 around using a docker image to execute 
the DB2 JDBC dialect tests. I will use this jira to work on providing the basic 
image together with the test integration. We can then extend the testing 
coverage as needed.
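
For context, the kind of dialect test the Docker image would back looks roughly like the following (a sketch only; the URL, driver class, credentials and table name are placeholders, not the eventual test setup):

{code}
// Hedged sketch of a DB2 read through the Spark SQL JDBC data source.
// All connection details below are illustrative placeholders.
val db2df = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:db2://localhost:50000/testdb",
  "driver"   -> "com.ibm.db2.jcc.DB2Driver",
  "dbtable"  -> "TEST_TABLE",
  "user"     -> "db2inst1",
  "password" -> "secret"
)).load()
db2df.show()
{code}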



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1169) Add countApproxDistinctByKey to PySpark

2015-09-09 Thread William Cox (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737608#comment-14737608
 ] 

William Cox commented on SPARK-1169:


I would like this feature. 
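
For reference, the existing Scala API that this ticket asks to expose in PySpark looks like the following (a minimal sketch run in spark-shell, where {{sc}} is predefined; relativeSD controls the accuracy of the HyperLogLog estimate):

{code}
// Minimal sketch of the Scala-side API this ticket would mirror in PySpark.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 2), ("b", 9)))
val approxCounts = pairs.countApproxDistinctByKey(relativeSD = 0.05).collect()
// approxCounts is roughly Array(("a", 2), ("b", 1)) -- the values are estimates
{code}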

> Add countApproxDistinctByKey to PySpark
> ---
>
> Key: SPARK-1169
> URL: https://issues.apache.org/jira/browse/SPARK-1169
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Matei Zaharia
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737612#comment-14737612
 ] 

Sean Owen commented on SPARK-10519:
---

I always feel nervous about storing human-readable times without a timezone, 
since they aren't really timestamps without one. There is a standard ISO 8601 
encoding for this. Relying on implicit knowledge of whatever timezone the 
machine that encoded the value had set will cause errors. At least, use GMT 
consistently?
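
Concretely, the kind of encoding meant here is something like the following sketch (an illustration of ISO 8601 in UTC, not what Spark currently does):

{code}
import java.text.SimpleDateFormat
import java.util.TimeZone

// Hedged illustration: render a timestamp as ISO 8601 in UTC so consumers in
// other timezones decode the same instant.
val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
println(fmt.format(new java.sql.Timestamp(System.currentTimeMillis())))
// e.g. 2015-09-09T20:24:00.123Z
{code}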

> Investigate if we should encode timezone information to a timestamp value 
> stored in JSON
> 
>
> Key: SPARK-10519
> URL: https://issues.apache.org/jira/browse/SPARK-10519
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> Since Spark 1.3, we store a timestamp in JSON without encoding the timezone 
> information, and the string representation of a timestamp stored in JSON 
> implicitly uses the local timezone (see 
> [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454],
>  
> [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38],
>  
> [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41],
>  
> [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]).
>  This behavior may cause data consumers to get different values when they 
> are in a different timezone from the data producers.
> Since JSON is string based, if we encode timezone information into the 
> timestamp value, downstream applications may need to change their code (for 
> example, java.sql.Timestamp.valueOf only supports the format {{yyyy-\[m]m-\[d]d 
> hh:mm:ss\[.f...]}}).
> We should investigate what we should do about this issue. Right now, I can 
> think of three options:
> 1. Encoding timezone info in the timestamp value, which can break user code 
> and may change the semantic of timestamp (our timestamp value is 
> timezone-less).
> 2. When saving a timestamp value to JSON, we treat this value as a value in 
> the local timezone and convert it to UTC time. Then, when saving the data, we 
> do not encode timezone info in the value.
> 3. We do not change our current behavior. But, in our doc, we explicitly say 
> that users need to use a single timezone for their datasets (e.g. always use 
> UTC time). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10521) Utilize Docker to test DB2 JDBC Dialect support

2015-09-09 Thread Luciano Resende (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737635#comment-14737635
 ] 

Luciano Resende commented on SPARK-10521:
-

I'll be submitting a PR for this shortly.

> Utilize Docker to test DB2 JDBC Dialect support
> ---
>
> Key: SPARK-10521
> URL: https://issues.apache.org/jira/browse/SPARK-10521
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Luciano Resende
>
> There was a discussion in SPARK-10170 around using a docker image to execute 
> the DB2 JDBC dialect tests. I will use this jira to work on providing the 
> basic image together with the test integration. We can then extend the 
> testing coverage as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values

2015-09-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737644#comment-14737644
 ] 

Davies Liu commented on SPARK-10439:


There are many places where overflow could occur, even for A + B, so I think 
it's not a big deal.

If we really want to handle these cases gracefully, the bounds checks should be 
performed on the way in, turning the value into null on overflow rather than 
crashing (raising an exception).
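
A minimal sketch of the sort of inbound check being described (the method name and placement are illustrative, not the actual DateTimeUtils code):

{code}
// Hedged sketch: convert milliseconds to microseconds, returning None (which a
// caller could map to a null value) instead of silently overflowing a Long.
def millisToMicrosSafe(millis: Long): Option[Long] =
  if (millis > Long.MaxValue / 1000 || millis < Long.MinValue / 1000) None
  else Some(millis * 1000L)
{code}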

> Catalyst should check for overflow / underflow of date and timestamp values
> ---
>
> Key: SPARK-10439
> URL: https://issues.apache.org/jira/browse/SPARK-10439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> While testing some code, I noticed that a few methods in {{DateTimeUtils}} 
> are prone to overflow and underflow.
> For example, {{millisToDays}} can overflow the return type ({{Int}}) if a 
> large enough input value is provided.
> Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which 
> can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the 
> negative case).
> There might be others but these were the ones that caught my eye.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-09-09 Thread Xin Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737650#comment-14737650
 ] 

Xin Jin commented on SPARK-4036:


Are we still actively working on this task? I have some work experience with 
CRFs and want to contribute. Thanks.

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737681#comment-14737681
 ] 

Sean Owen commented on SPARK-10493:
---

If the RDD is a result of reduceByKey, I agree that the keys should be unique. 
Tuples implement equals and hashCode correctly, as does String, so that ought 
to be fine.

I still sort of suspect something is getting computed twice and not quite 
deterministically, but the persist() call on rdd4 immediately before ought to 
hide that. However, it's still distantly possible this is the cause, since rdd4 
is not computed and persisted before the computation of rdd5 starts, and might 
see its partitions re-evaluated during that process.

It's a bit of a long shot, but what about adding a temp4.count() for good 
measure before starting on temp5?
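
In code, the suggestion amounts to roughly the following (a sketch with toy data, run in spark-shell where {{sc}} is predefined; the variable names only loosely follow the report):

{code}
import org.apache.spark.HashPartitioner

// Hedged sketch: persist the reduced RDD and force it with count() so its
// partitions cannot be re-evaluated non-deterministically by a downstream job.
val rdd3 = sc.parallelize(Seq(("k1", 1), ("k1", 2), ("k2", 3)))
  .partitionBy(new HashPartitioner(4))
  .reduceByKey(_ + _)
  .persist()
rdd3.count()                            // materializes rdd3 into the cache
val rdd4 = rdd3.distinct()
println((rdd3.count(), rdd4.count()))   // should be equal if keys are unique
{code}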

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737727#comment-14737727
 ] 

Glenn Strycker commented on SPARK-10493:


I already have that added in my code that I'm testing... I've been persisting, 
checkpointing, and materializing all RDDs, including all intermediate steps.

I did try substituting union() for zipPartitions(), and that actually resulted 
in correct values!  Very weird.  What's strange is that there are no differences 
in my results in spark-shell or in a very small piece of test code I wrote to 
run with spark-submit (that is, I can't replicate the original error), but this 
change did fix things in my production code.

I'm trying to discover why zipPartitions isn't behaving identically to union in 
my code... I posted a stackoverflow question along these lines, if you want to 
read over some additional code and toDebugString results:  
http://stackoverflow.com/questions/32489112/what-is-the-difference-between-union-and-zippartitions-for-apache-spark-rdds

I attempted adding some "implicit ordering" to the original code with 
zipPartitions, but that didn't fix anything -- it only worked when I used union.

Is it possible that ShuffledRDDs (returned by union) work with reduceByKey, but 
ZippedPartitionsRDD2s (returned by zipPartitions) do not?

Or is it possible that the "++" operator I am using inside the zipPartitions 
function isn't compatible with my particular RDD structure ((String, String), 
(String, Long, Long, Long, Long))?

Thanks so much for your help... at this point I'm tempted to replace 
zipPartitions with unions everywhere in my code, just for superstition's sake.  
I just want to understand WHY zipPartitions didn't work!!
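
For anyone comparing the two code paths, here is a self-contained toy sketch of the difference (spark-shell, {{sc}} predefined; note that zipPartitions additionally requires both RDDs to have the same number of partitions, which union does not):

{code}
// Hedged illustration of the two ways of combining the inputs before the
// reduceByKey. Both should yield the same reduced result on this toy data.
val a = sc.parallelize(Seq(("x", 1), ("y", 2)), 2)
val b = sc.parallelize(Seq(("x", 3), ("z", 4)), 2)

val viaZip   = a.zipPartitions(b, preservesPartitioning = true)((i1, i2) => i1 ++ i2)
val viaUnion = a.union(b)

println(viaZip.reduceByKey(_ + _).collect().sortBy(_._1).toSeq)
println(viaUnion.reduceByKey(_ + _).collect().sortBy(_._1).toSeq)
{code}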

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737730#comment-14737730
 ] 

Sean Owen commented on SPARK-10493:
---

checkpoint doesn't materialize the RDD, which is why it occurred to me to try a 
count. I'd try that to see if it also works. If so, I do have a feeling it's 
due to the zipping and ordering of partitions -- especially if union() also 
seems to work.

++ just concatenates iterators; I don't think that can matter. I also don't 
think the parent RDD types matter. It's not impossible there's a problem, but 
there are also a lot of tests exercising reduceByKey.

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9996) Create local nested loop join operator

2015-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9996:
-
Assignee: Shixiong Zhu

> Create local nested loop join operator
> --
>
> Key: SPARK-9996
> URL: https://issues.apache.org/jira/browse/SPARK-9996
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9997) Create local Expand operator

2015-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9997:
-
Assignee: Shixiong Zhu

> Create local Expand operator
> 
>
> Key: SPARK-9997
> URL: https://issues.apache.org/jira/browse/SPARK-9997
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


