[jira] [Commented] (SPARK-10276) Add @since annotation to pyspark.mllib.recommendation
[ https://issues.apache.org/jira/browse/SPARK-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736346#comment-14736346 ] Yu Ishikawa commented on SPARK-10276: - [~mengxr] should we add `@since` to the class methods with `@classmethod` in PySpark? When I tried to do that, I got an error as follows. It seems that we can't rewrite {{__doc__}} of a `classmethod`. {noformat} Traceback (most recent call last): File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 122, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, in _run_code exec code in run_globals File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 46, in class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader): File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 175, in MatrixFactorizationModel @classmethod File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", line 62, in deco f.__doc__ = f.__doc__.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, version) AttributeError: 'classmethod' object attribute '__doc__' is read-only {noformat} > Add @since annotation to pyspark.mllib.recommendation > - > > Key: SPARK-10276 > URL: https://issues.apache.org/jira/browse/SPARK-10276 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, PySpark >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
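In Python 2 a {{classmethod}} object's {{__doc__}} attribute is read-only, so a docstring-rewriting decorator has to run before {{classmethod}} wraps the function. A minimal sketch with a simplified since-style decorator (not the actual pyspark code; the class and method names are illustrative):

{code}
# Minimal sketch (a simplified stand-in for PySpark's since(), for illustration only).
import re


def since(version):
    indent_p = re.compile(r'\n( +)')

    def deco(f):
        # f is still a plain function here, so its __doc__ is writable.
        indents = indent_p.findall(f.__doc__)
        indent = ' ' * (min(len(i) for i in indents) if indents else 0)
        f.__doc__ = f.__doc__.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, version)
        return f
    return deco


class MatrixFactorizationModelSketch(object):
    # Decorators apply bottom-up: since() sees the raw function first, then
    # classmethod wraps the result. Swapping the two decorator lines reproduces
    # the read-only __doc__ AttributeError reported above (Python 2).
    @classmethod
    @since("1.3.1")
    def load(cls, sc, path):
        """Load a model from the given path."""
        return cls()
{code}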
[jira] [Created] (SPARK-10512) Fix @since when a function doesn't have doc
Yu Ishikawa created SPARK-10512: --- Summary: Fix @since when a function doesn't have doc Key: SPARK-10512 URL: https://issues.apache.org/jira/browse/SPARK-10512 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.6.0 Reporter: Yu Ishikawa When I tried to add @since to a function which doesn't have doc, @since didn't go well. It seems that {{___doc___}} is {{None]} under {{since}} decorator. ``` Traceback (most recent call last): File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 122, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, in _run_code exec code in run_globals File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 46, in class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader): File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 166, in MatrixFactorizationModel @since("1.3.1") File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", line 63, in deco indents = indent_p.findall(f.__doc__) TypeError: expected string or buffer ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
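The underlying problem is that a function defined without a docstring has {{__doc__}} set to {{None}}, which the regex lookup cannot handle. A minimal sketch of a defensive variant, assuming that treating a missing docstring as an empty string is acceptable (illustrative only, not the actual fix):

{code}
# Sketch under the assumption that a missing docstring may be treated as "";
# this is not Spark's actual implementation.
import re


def since(version):
    indent_p = re.compile(r'\n( +)')

    def deco(f):
        doc = f.__doc__ or ""          # __doc__ is None when there is no docstring
        indents = indent_p.findall(doc)
        indent = ' ' * (min(len(i) for i in indents) if indents else 0)
        f.__doc__ = doc.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, version)
        return f
    return deco


@since("1.3.1")
def rank():
    # no docstring on purpose: the unguarded decorator raises
    # "TypeError: expected string or buffer" on this function
    return 10


print(rank.__doc__)
{code}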
[jira] [Commented] (SPARK-10507) timestamp - timestamp
[ https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736385#comment-14736385 ] Sean Owen commented on SPARK-10507: --- (Can you improve the title and description please?) > timestamp - timestamp > -- > > Key: SPARK-10507 > URL: https://issues.apache.org/jira/browse/SPARK-10507 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: N Campbell > > TIMESTAMP - TIMESTAMP in ISO-SQL is an interval type. Hive 0.13 fails with > Error: Could not create ResultSet: Required field 'type' is unset! > Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges". > select cts - cts from tts > Operation: execute > Errors: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 6214.0 (TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type > TimestampType does not support numeric operations > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138) > at > org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136) > at > org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150) > at > org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) > at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) > create table if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY > '\n' > STORED AS orc ; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
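Until interval arithmetic is supported, a common workaround is to compute the difference in seconds explicitly instead of subtracting {{TimestampType}} columns. A minimal sketch, assuming a HiveContext where the Hive {{unix_timestamp}} UDF is available and using the table/column names from the report:

{code}
# Hedged workaround sketch: seconds between two timestamps via unix_timestamp,
# since "cts - cts" on TimestampType raises the RuntimeException shown above.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="timestamp-diff-sketch")
sqlContext = HiveContext(sc)

diff = sqlContext.sql(
    "SELECT rnum, unix_timestamp(cts) - unix_timestamp(cts) AS diff_seconds FROM tts")
diff.show()
{code}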
[jira] [Updated] (SPARK-10507) timestamp - timestamp
[ https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10507: -- Priority: Minor (was: Major) > timestamp - timestamp > -- > > Key: SPARK-10507 > URL: https://issues.apache.org/jira/browse/SPARK-10507 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: N Campbell >Priority: Minor > > TIMESTAMP - TIMESTAMP in ISO-SQL is an interval type. Hive 0.13 fails with > Error: Could not create ResultSet: Required field 'type' is unset! > Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges". > select cts - cts from tts > Operation: execute > Errors: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 6214.0 (TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type > TimestampType does not support numeric operations > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138) > at > org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136) > at > org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150) > at > org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) > at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) > create table if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY > '\n' > STORED AS orc ; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10502) tidy up the exception message text to be less verbose/"User friendly"
[ https://issues.apache.org/jira/browse/SPARK-10502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10502: -- Issue Type: Improvement (was: Bug) > tidy up the exception message text to be less verbose/"User friendly" > - > > Key: SPARK-10502 > URL: https://issues.apache.org/jira/browse/SPARK-10502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: N Campbell >Priority: Minor > > When a statement is parsed, it would be preferred is the exception text were > more aligned with other vendors re indicating the syntax error without the > inclusion of the verbose parse tree. > select tbint.rnum,tbint.cbint, nth_value( tbint.cbint, '4' ) over ( order by > tbint.rnum) from certstring.tbint > Errors: > org.apache.spark.sql.AnalysisException: > Unsupported language features in query: select tbint.rnum,tbint.cbint, > nth_value( tbint.cbint, '4' ) over ( order by tbint.rnum) from > certstring.tbint > TOK_QUERY 1, 0,40, 94 > TOK_FROM 1, 36,40, 94 > TOK_TABREF 1, 38,40, 94 > TOK_TABNAME 1, 38,40, 94 > certstring 1, 38,38, 94 > tbint 1, 40,40, 105 > TOK_INSERT 0, -1,34, 0 > TOK_DESTINATION 0, -1,-1, 0 > TOK_DIR 0, -1,-1, 0 > TOK_TMP_FILE 0, -1,-1, 0 > TOK_SELECT 1, 0,34, 12 > TOK_SELEXPR 1, 2,4, 12 > . 1, 2,4, 12 > TOK_TABLE_OR_COL 1, 2,2, 7 > tbint 1, 2,2, 7 > rnum 1, 4,4, 13 > TOK_SELEXPR 1, 6,8, 23 > . 1, 6,8, 23 > TOK_TABLE_OR_COL 1, 6,6, 18 > tbint 1, 6,6, 18 > cbint 1, 8,8, 24 > TOK_SELEXPR 1, 11,34, 31 > TOK_FUNCTION 1, 11,34, 31 > nth_value 1, 11,11, 31 > . 1, 14,16, 47 > TOK_TABLE_OR_COL 1, 14,14, 42 > tbint 1, 14,14, 42 > cbint 1, 16,16, 48 > '4' 1, 19,19, 55 > TOK_WINDOWSPEC 1, 25,34, 82 > TOK_PARTITIONINGSPEC 1, 27,33, 82 > TOK_ORDERBY 1, 27,33, 82 > TOK_TABSORTCOLNAMEASC 1, 31,33, 82 > . 1, 31,33, 82 > TOK_TABLE_OR_COL 1, 31,31, 77 > tbint 1, 31,31, 77 > rnum 1, 33,33, 83 > scala.NotImplementedError: No parse rules for ASTNode type: 882, text: > TOK_WINDOWSPEC : > TOK_WINDOWSPEC 1, 25,34, 82 > TOK_PARTITIONINGSPEC 1, 27,33, 82 > TOK_ORDERBY 1, 27,33, 82 > TOK_TABSORTCOLNAMEASC 1, 31,33, 82 > . 1, 31,33, 82 > TOK_TABLE_OR_COL 1, 31,31, 77 > tbint 1, 31,31, 77 > rnum 1, 33,33, 83 > " + > > org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1261) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10111) StringIndexerModel lacks of method "labels"
[ https://issues.apache.org/jira/browse/SPARK-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10111. --- Resolution: Duplicate > StringIndexerModel lacks of method "labels" > --- > > Key: SPARK-10111 > URL: https://issues.apache.org/jira/browse/SPARK-10111 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Kai Sasaki > > Missing {{labels}} property of {{StringIndexerModel}} in pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10512) Fix @since when a function doesn't have doc
[ https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa updated SPARK-10512: Description: When I tried to add @since to a function which doesn't have doc, @since didn't go well. It seems that {{___doc___}} is {{None]} under {{since}} decorator. {noformat} Traceback (most recent call last): File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 122, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, in _run_code exec code in run_globals File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 46, in class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader): File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 166, in MatrixFactorizationModel @since("1.3.1") File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", line 63, in deco indents = indent_p.findall(f.__doc__) TypeError: expected string or buffer {noformat} was: When I tried to add @since to a function which doesn't have doc, @since didn't go well. It seems that {{___doc___}} is {{None]} under {{since}} decorator. ``` Traceback (most recent call last): File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 122, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, in _run_code exec code in run_globals File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 46, in class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader): File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 166, in MatrixFactorizationModel @since("1.3.1") File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", line 63, in deco indents = indent_p.findall(f.__doc__) TypeError: expected string or buffer ``` > Fix @since when a function doesn't have doc > --- > > Key: SPARK-10512 > URL: https://issues.apache.org/jira/browse/SPARK-10512 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Yu Ishikawa > > When I tried to add @since to a function which doesn't have doc, @since > didn't go well. It seems that {{___doc___}} is {{None]} under {{since}} > decorator. > {noformat} > Traceback (most recent call last): > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 122, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 34, in _run_code > exec code in run_globals > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 46, in > class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, > JavaLoader): > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 166, in MatrixFactorizationModel > @since("1.3.1") > File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", > line 63, in deco > indents = indent_p.findall(f.__doc__) > TypeError: expected string or buffer > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10512) Fix @since when a function doesn't have doc
[ https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa updated SPARK-10512: Description: When I tried to add @since to a function which doesn't have doc, @since didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} decorator. {noformat} Traceback (most recent call last): File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 122, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, in _run_code exec code in run_globals File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 46, in class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader): File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 166, in MatrixFactorizationModel @since("1.3.1") File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", line 63, in deco indents = indent_p.findall(f.__doc__) TypeError: expected string or buffer {noformat} was: When I tried to add @since to a function which doesn't have doc, @since didn't go well. It seems that {{___doc___}} is {{None]} under {{since}} decorator. {noformat} Traceback (most recent call last): File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 122, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, in _run_code exec code in run_globals File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 46, in class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader): File "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", line 166, in MatrixFactorizationModel @since("1.3.1") File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", line 63, in deco indents = indent_p.findall(f.__doc__) TypeError: expected string or buffer {noformat} > Fix @since when a function doesn't have doc > --- > > Key: SPARK-10512 > URL: https://issues.apache.org/jira/browse/SPARK-10512 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Yu Ishikawa > > When I tried to add @since to a function which doesn't have doc, @since > didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} > decorator. > {noformat} > Traceback (most recent call last): > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 122, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 34, in _run_code > exec code in run_globals > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 46, in > class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, > JavaLoader): > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 166, in MatrixFactorizationModel > @since("1.3.1") > File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", > line 63, in deco > indents = indent_p.findall(f.__doc__) > TypeError: expected string or buffer > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10512) Fix @since when a function doesn't have doc
[ https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10512: Assignee: (was: Apache Spark) > Fix @since when a function doesn't have doc > --- > > Key: SPARK-10512 > URL: https://issues.apache.org/jira/browse/SPARK-10512 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Yu Ishikawa > > When I tried to add @since to a function which doesn't have doc, @since > didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} > decorator. > {noformat} > Traceback (most recent call last): > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 122, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 34, in _run_code > exec code in run_globals > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 46, in > class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, > JavaLoader): > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 166, in MatrixFactorizationModel > @since("1.3.1") > File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", > line 63, in deco > indents = indent_p.findall(f.__doc__) > TypeError: expected string or buffer > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10512) Fix @since when a function doesn't have doc
[ https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10512: Assignee: Apache Spark > Fix @since when a function doesn't have doc > --- > > Key: SPARK-10512 > URL: https://issues.apache.org/jira/browse/SPARK-10512 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Yu Ishikawa >Assignee: Apache Spark > > When I tried to add @since to a function which doesn't have doc, @since > didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} > decorator. > {noformat} > Traceback (most recent call last): > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 122, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 34, in _run_code > exec code in run_globals > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 46, in > class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, > JavaLoader): > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 166, in MatrixFactorizationModel > @since("1.3.1") > File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", > line 63, in deco > indents = indent_p.findall(f.__doc__) > TypeError: expected string or buffer > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10512) Fix @since when a function doesn't have doc
[ https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736389#comment-14736389 ] Apache Spark commented on SPARK-10512: -- User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/8667 > Fix @since when a function doesn't have doc > --- > > Key: SPARK-10512 > URL: https://issues.apache.org/jira/browse/SPARK-10512 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Yu Ishikawa > > When I tried to add @since to a function which doesn't have doc, @since > didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} > decorator. > {noformat} > Traceback (most recent call last): > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 122, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 34, in _run_code > exec code in run_globals > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 46, in > class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, > JavaLoader): > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 166, in MatrixFactorizationModel > @since("1.3.1") > File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", > line 63, in deco > indents = indent_p.findall(f.__doc__) > TypeError: expected string or buffer > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10444) Remove duplication in Mesos schedulers
[ https://issues.apache.org/jira/browse/SPARK-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736434#comment-14736434 ] Iulian Dragos commented on SPARK-10444: --- Another example of duplicated logic: https://github.com/apache/spark/pull/8639 > Remove duplication in Mesos schedulers > -- > > Key: SPARK-10444 > URL: https://issues.apache.org/jira/browse/SPARK-10444 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.5.0 >Reporter: Iulian Dragos > Labels: refactoring > > Currently coarse-grained and fine-grained Mesos schedulers don't share much > code, and that leads to inconsistencies. For instance: > - only coarse-grained mode respects {{spark.cores.max}}, see SPARK-9873 > - only coarse-grained mode blacklists slaves that fail repeatedly, but that > seems generally useful > - constraints and memory checking are done on both sides (code is shared > though) > - framework re-registration (master election) is only done for cluster-mode > deployment > We should find a better design that groups together common concerns and > generally improves the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7825) Poor performance in Cross Product due to no combine operations for small files.
[ https://issues.apache.org/jira/browse/SPARK-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tang Yan updated SPARK-7825: Affects Version/s: (was: 1.3.1) (was: 1.2.2) (was: 1.2.1) (was: 1.3.0) (was: 1.2.0) > Poor performance in Cross Product due to no combine operations for small > files. > --- > > Key: SPARK-7825 > URL: https://issues.apache.org/jira/browse/SPARK-7825 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Tang Yan > > When dealing with a cross product, if one table has many small files, Spark SQL > has to handle many tasks, which leads to poor performance, while Hive > has a CombineHiveInputFormat that can combine small files to reduce the > task count. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
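Until a combine-style input format exists on the Spark side, one mitigation is to collapse the small-file partitions before the cross product. A minimal sketch, assuming Spark 1.4+ where {{DataFrame.coalesce}} is available; the table names are illustrative and this is not the requested feature:

{code}
# Hedged mitigation sketch: fewer partitions on the small-file side means
# fewer tasks in the cartesian join.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cross-product-sketch")
sqlContext = SQLContext(sc)

small = sqlContext.table("many_small_files_table").coalesce(8)  # merge tiny partitions
big = sqlContext.table("big_table")

crossed = big.join(small)  # join without a condition -> cross product
crossed.count()
{code}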
[jira] [Resolved] (SPARK-10227) sbt build on Scala 2.11 fails
[ https://issues.apache.org/jira/browse/SPARK-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10227. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8433 [https://github.com/apache/spark/pull/8433] > sbt build on Scala 2.11 fails > - > > Key: SPARK-10227 > URL: https://issues.apache.org/jira/browse/SPARK-10227 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Luc Bourlier > Fix For: 1.6.0 > > > Scala 2.11 has additional warnings compared to Scala 2.10, and with the addition of > 'fatal warnings' in the sbt build, the current {{trunk}} (and {{branch-1.5}}) > fails to build with sbt on Scala 2.11. > Most of the warnings are about the {{@transient}} annotation not being set on > relevant elements, and a few point to some potential bugs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10227) sbt build on Scala 2.11 fails
[ https://issues.apache.org/jira/browse/SPARK-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10227: -- Assignee: Luc Bourlier > sbt build on Scala 2.11 fails > - > > Key: SPARK-10227 > URL: https://issues.apache.org/jira/browse/SPARK-10227 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Luc Bourlier >Assignee: Luc Bourlier > Fix For: 1.6.0 > > > Scala 2.11 has additional warnings compared to Scala 2.10, and with the addition of > 'fatal warnings' in the sbt build, the current {{trunk}} (and {{branch-1.5}}) > fails to build with sbt on Scala 2.11. > Most of the warnings are about the {{@transient}} annotation not being set on > relevant elements, and a few point to some potential bugs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10316: -- Assignee: Wenchen Fan > respect non-deterministic expressions in PhysicalOperation > -- > > Key: SPARK-10316 > URL: https://issues.apache.org/jira/browse/SPARK-10316 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.6.0 > > > We did a lot of special handling for non-deterministic expressions in > Optimizer. However, PhysicalOperation just collects all Projects and Filters > and messes it up. We should respect the operator order caused by > non-deterministic expressions in PhysicalOperation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4752) Classifier based on artificial neural network
[ https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4752: - Assignee: Alexander Ulanov > Classifier based on artificial neural network > - > > Key: SPARK-4752 > URL: https://issues.apache.org/jira/browse/SPARK-4752 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Alexander Ulanov >Assignee: Alexander Ulanov > Fix For: 1.5.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > Implement classifier based on artificial neural network (ANN). Requirements: > 1) Use the existing artificial neural network implementation > https://issues.apache.org/jira/browse/SPARK-2352, > https://github.com/apache/spark/pull/1290 > 2) Extend MLlib ClassificationModel trait, > 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training, > 4) Be able to return the ANN model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10327) Cache Table is not working while subquery has alias in its project list
[ https://issues.apache.org/jira/browse/SPARK-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10327: -- Assignee: Cheng Hao > Cache Table is not working while subquery has alias in its project list > --- > > Key: SPARK-10327 > URL: https://issues.apache.org/jira/browse/SPARK-10327 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao > Fix For: 1.6.0 > > > Code to reproduce that: > {code} > import org.apache.spark.sql.hive.execution.HiveTableScan > sql("select key, value, key + 1 from src").registerTempTable("abc") > cacheTable("abc") > val sparkPlan = sql( > """select a.key, b.key, c.key from > |abc a join abc b on a.key=b.key > |join abc c on a.key=c.key""".stripMargin).queryExecution.sparkPlan > assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size > === 3) // failed > assert(sparkPlan.collect { case e: HiveTableScan => e }.size === 0) // > failed > {code} > The query plan like: > {code} > == Parsed Logical Plan == > 'Project > [unresolvedalias('a.key),unresolvedalias('b.key),unresolvedalias('c.key)] > 'Join Inner, Some(('a.key = 'c.key)) > 'Join Inner, Some(('a.key = 'b.key)) >'UnresolvedRelation [abc], Some(a) >'UnresolvedRelation [abc], Some(b) > 'UnresolvedRelation [abc], Some(c) > == Analyzed Logical Plan == > key: int, key: int, key: int > Project [key#14,key#61,key#66] > Join Inner, Some((key#14 = key#66)) > Join Inner, Some((key#14 = key#61)) >Subquery a > Subquery abc > Project [key#14,value#15,(key#14 + 1) AS _c2#16] > MetastoreRelation default, src, None >Subquery b > Subquery abc > Project [key#61,value#62,(key#61 + 1) AS _c2#58] > MetastoreRelation default, src, None > Subquery c >Subquery abc > Project [key#66,value#67,(key#66 + 1) AS _c2#63] > MetastoreRelation default, src, None > == Optimized Logical Plan == > Project [key#14,key#61,key#66] > Join Inner, Some((key#14 = key#66)) > Project [key#14,key#61] >Join Inner, Some((key#14 = key#61)) > Project [key#14] > InMemoryRelation [key#14,value#15,_c2#16], true, 1, > StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + > 1) AS _c2#16]), Some(abc) > Project [key#61] > MetastoreRelation default, src, None > Project [key#66] >MetastoreRelation default, src, None > == Physical Plan == > TungstenProject [key#14,key#61,key#66] > BroadcastHashJoin [key#14], [key#66], BuildRight > TungstenProject [key#14,key#61] >BroadcastHashJoin [key#14], [key#61], BuildRight > ConvertToUnsafe > InMemoryColumnarTableScan [key#14], (InMemoryRelation > [key#14,value#15,_c2#16], true, 1, StorageLevel(true, true, false, true, > 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc)) > ConvertToUnsafe > HiveTableScan [key#61], (MetastoreRelation default, src, None) > ConvertToUnsafe >HiveTableScan [key#66], (MetastoreRelation default, src, None) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10441) Cannot write timestamp to JSON
[ https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10441: -- Assignee: Yin Huai > Cannot write timestamp to JSON > -- > > Key: SPARK-10441 > URL: https://issues.apache.org/jira/browse/SPARK-10441 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > Fix For: 1.6.0, 1.5.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10501) support UUID as an atomic type
[ https://issues.apache.org/jira/browse/SPARK-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10501: -- Priority: Minor (was: Major) Component/s: SQL Issue Type: Improvement (was: Bug) > support UUID as an atomic type > -- > > Key: SPARK-10501 > URL: https://issues.apache.org/jira/browse/SPARK-10501 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jon Haddad >Priority: Minor > > It's pretty common to use UUIDs instead of integers in order to avoid > distributed counters. > I've added this, which at least lets me load dataframes that use UUIDs that I > can cast to strings: > {code} > class UUIDType(AtomicType): > pass > _type_mappings[UUID] = UUIDType > _atomic_types.append(UUIDType) > {code} > But if I try to do anything else with the UUIDs, like this: > {code} > ratings.select("userid").distinct().collect() > {code} > I get this pile of fun: > {code} > scala.MatchError: UUIDType (of class > org.apache.spark.sql.cassandra.types.UUIDType$) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
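Pending a real UUID atomic type, a common interim approach is to keep UUIDs as plain strings so that operations such as {{distinct()}} work. A minimal sketch with illustrative data (not the proposed {{UUIDType}} support):

{code}
# Hedged sketch: store the UUID as a string column up front.
from uuid import uuid4

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="uuid-as-string-sketch")
sqlContext = SQLContext(sc)

ratings = sqlContext.createDataFrame(
    [(str(uuid4()), 5.0), (str(uuid4()), 3.0)], ["userid", "rating"])

print(ratings.select("userid").distinct().collect())
{code}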
[jira] [Commented] (SPARK-9564) Spark 1.5.0 Testing Plan
[ https://issues.apache.org/jira/browse/SPARK-9564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736594#comment-14736594 ] Sean Owen commented on SPARK-9564: -- Now that 1.5.0 is released, can this be closed? Or else I'm unclear on the role of these umbrellas and would like to rehash that conversation again. > Spark 1.5.0 Testing Plan > > > Key: SPARK-9564 > URL: https://issues.apache.org/jira/browse/SPARK-9564 > Project: Spark > Issue Type: Epic > Components: Build, Tests >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > > This is an epic for Spark 1.5.0 release QA plans for tracking various > components. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10513) Springleaf Marketing Response
Yanbo Liang created SPARK-10513: --- Summary: Springleaf Marketing Response Key: SPARK-10513 URL: https://issues.apache.org/jira/browse/SPARK-10513 Project: Spark Issue Type: Sub-task Components: ML Reporter: Yanbo Liang Apply ML pipeline API to Springleaf Marketing Response (https://www.kaggle.com/c/springleaf-marketing-response) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10513) Springleaf Marketing Response
[ https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736648#comment-14736648 ] Yanbo Liang commented on SPARK-10513: - I will work on this dataset. > Springleaf Marketing Response > - > > Key: SPARK-10513 > URL: https://issues.apache.org/jira/browse/SPARK-10513 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang > > Apply ML pipeline API to Springleaf Marketing Response > (https://www.kaggle.com/c/springleaf-marketing-response) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9578) Stemmer feature transformer
[ https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736695#comment-14736695 ] yuhao yang commented on SPARK-9578: --- A better choice for LDA seems to be lemmatization. Yet that requires pos tags and extra vocabulary. If there's no other ongoing effort on this, I'd like to start with a simpler porter implementation, then try to enhance it to snowball. [~josephkb] The plan is to cover the most general cases with shorter code. After all, MLlib is not specific for NLP. > Stemmer feature transformer > --- > > Key: SPARK-9578 > URL: https://issues.apache.org/jira/browse/SPARK-9578 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Transformer mentioned first in [SPARK-5571] based on suggestion from > [~aloknsingh]. Very standard NLP preprocessing task. > From [~aloknsingh]: > {quote} > We have one scala stemmer in scalanlp%chalk > https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze > which can easily copied (as it is apache project) and is in scala too. > I think this will be better alternative than lucene englishAnalyzer or > opennlp. > Note: we already use the scalanlp%breeze via the maven dependency so I think > adding scalanlp%chalk dependency is also the options. But as you had said we > can copy the code as it is small. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
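As a stopgap before any built-in Transformer, stemming can be applied as a user-side step after tokenization. A minimal sketch that uses NLTK's {{PorterStemmer}} inside a UDF purely for illustration; it is not the proposed chalk-based MLlib implementation and assumes NLTK is installed on the workers:

{code}
# Hedged sketch: Porter stemming over a tokenized column via a Python UDF.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from nltk.stem.porter import PorterStemmer


def stem_words(words):
    stemmer = PorterStemmer()  # created per call to avoid pickling concerns
    return [stemmer.stem(w) for w in words]


sc = SparkContext(appName="stemming-sketch")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([(["models", "running", "quickly"],)], ["words"])
stem_udf = udf(stem_words, ArrayType(StringType()))

df.withColumn("stemmed", stem_udf(df["words"])).show()
{code}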
[jira] [Created] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode
Akash Mishra created SPARK-10514: Summary: Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode Key: SPARK-10514 URL: https://issues.apache.org/jira/browse/SPARK-10514 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Akash Mishra The "spark.scheduler.minRegisteredResourcesRatio" configuration parameter has no effect in Mesos coarse-grained mode. This is because the scheduler does not override the "sufficientResourcesRegistered" function, which returns true by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
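For context, this is how the ratio is normally configured (the master URL and values are illustrative); per this report the setting was silently ignored in Mesos coarse-grained mode because {{sufficientResourcesRegistered}} is not overridden there:

{code}
# Usage sketch only; it does not work around the bug described above.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("min-registered-resources-sketch")
        .setMaster("mesos://mesos-master:5050")  # illustrative master URL
        .set("spark.scheduler.minRegisteredResourcesRatio", "0.8")
        .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "30s"))

sc = SparkContext(conf=conf)
{code}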
[jira] [Updated] (SPARK-10507) reject temporal expressions such as timestamp - timestamp at parse time
[ https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] N Campbell updated SPARK-10507: --- Summary: reject temporal expressions such as timestamp - timestamp at parse time (was: timestamp - timestamp ) > reject temporal expressions such as timestamp - timestamp at parse time > > > Key: SPARK-10507 > URL: https://issues.apache.org/jira/browse/SPARK-10507 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: N Campbell >Priority: Minor > > TIMESTAMP - TIMESTAMP in ISO-SQL is an interval type. Hive 0.13 fails with > Error: Could not create ResultSet: Required field 'type' is unset! > Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges". > select cts - cts from tts > Operation: execute > Errors: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 6214.0 (TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type > TimestampType does not support numeric operations > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138) > at > org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136) > at > org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150) > at > org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) > at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) > create table if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY > '\n' > STORED AS orc ; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10507) reject temporal expressions such as timestamp - timestamp at parse time
[ https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] N Campbell updated SPARK-10507: --- Description: TIMESTAMP - TIMESTAMP in ISO-SQL should return an interval type, which SPARK does not support. A similar expression in Hive 0.13 fails with Error: Could not create ResultSet: Required field 'type' is unset! Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges". While Hive 1.2.1 has added some interval type support it is far from complete with respect to ISO-SQL. The ability to compute the period of time (years, days, weeks, hours, ...) between timestamps or add/subtract intervals from a timestamp is extremely common in business applications. Currently, a value expression such as select timestampcol - timestampcol from t will fail during execution, not at parse time. While the error thrown states that fact, it would be better for those value expressions to be rejected at parse time along with indicating the expression that is causing the parser error. Operation: execute Errors: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6214.0 (TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type TimestampType does not support numeric operations at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138) at org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136) at org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) create table if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY '\n' STORED AS orc ; was: TIMESTAMP - TIMESTAMP in ISO-SQL is an interval type. Hive 0.13 fails with Error: Could not create ResultSet: Required field 'type' is unset! Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges". 
select cts - cts from tts Operation: execute Errors: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6214.0 (TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type TimestampType does not support numeric operations at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138) at org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136) at org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scal
[jira] [Created] (SPARK-10515) When kill executor, there is no need to seed RequestExecutors to AM
KaiXinXIaoLei created SPARK-10515: - Summary: When kill executor, there is no need to seed RequestExecutors to AM Key: SPARK-10515 URL: https://issues.apache.org/jira/browse/SPARK-10515 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: KaiXinXIaoLei Fix For: 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10515) When kill executor, there is no need to seed RequestExecutors to AM
[ https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10515: Assignee: (was: Apache Spark) > When kill executor, there is no need to seed RequestExecutors to AM > --- > > Key: SPARK-10515 > URL: https://issues.apache.org/jira/browse/SPARK-10515 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: KaiXinXIaoLei > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10515) When kill executor, there is no need to seed RequestExecutors to AM
[ https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736853#comment-14736853 ] Apache Spark commented on SPARK-10515: -- User 'KaiXinXiaoLei' has created a pull request for this issue: https://github.com/apache/spark/pull/8668 > When kill executor, there is no need to seed RequestExecutors to AM > --- > > Key: SPARK-10515 > URL: https://issues.apache.org/jira/browse/SPARK-10515 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: KaiXinXIaoLei > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10515) When kill executor, there is no need to seed RequestExecutors to AM
[ https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10515: Assignee: Apache Spark > When kill executor, there is no need to seed RequestExecutors to AM > --- > > Key: SPARK-10515 > URL: https://issues.apache.org/jira/browse/SPARK-10515 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: KaiXinXIaoLei >Assignee: Apache Spark > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results
[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736869#comment-14736869 ] Glenn Strycker commented on SPARK-10493: The RDD I am using has the form ((String, String), (String, Long, Long, Long, Long)), so the key is actually a (String, String) tuple. Are there any sorting operations that would require implicit ordering, buried under the covers of the reduceByKey operation, that would be causing the problems with non-uniqueness? Does partitionBy(HashPartitioner(numPartitions)) not work with a (String, String) tuple? I've not had any noticeable problems with this before, although that would certainly explain errors in reduceByKey and distinct. > reduceByKey not returning distinct results > -- > > Key: SPARK-10493 > URL: https://issues.apache.org/jira/browse/SPARK-10493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs > (using zipPartitions), partitioning by a hash partitioner, and then applying > a reduceByKey to summarize statistics by key. > Since my set before the reduceByKey consists of records such as (K, V1), (K, > V2), (K, V3), I expect the results after reduceByKey to be just (K, > f(V1,V2,V3)), where the function f is appropriately associative, commutative, > etc. Therefore, the results after reduceByKey ought to be distinct, correct? > I am running counts of my RDD and finding that adding an additional > .distinct after my .reduceByKey is changing the final count!! > Here is some example code: > rdd3 = tempRDD1. >zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2). >partitionBy(new HashPartitioner(numPartitions)). >reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, > math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))) > println(rdd3.count) > rdd4 = rdd3.distinct > println(rdd4.count) > I am using persistence, checkpointing, and other stuff in my actual code that > I did not paste here, so I can paste my actual code if it would be helpful. > This issue may be related to SPARK-2620, except I am not using case classes, > to my knowledge. > See also > http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results
[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736879#comment-14736879 ] Sean Owen commented on SPARK-10493: --- That much should be OK. zipPartitions only makes sense if you have two ordered, identically partitioned data sets. Is that true of the temp RDDs? Otherwise that could be a source of nondeterminism. > reduceByKey not returning distinct results > -- > > Key: SPARK-10493 > URL: https://issues.apache.org/jira/browse/SPARK-10493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs > (using zipPartitions), partitioning by a hash partitioner, and then applying > a reduceByKey to summarize statistics by key. > Since my set before the reduceByKey consists of records such as (K, V1), (K, > V2), (K, V3), I expect the results after reduceByKey to be just (K, > f(V1,V2,V3)), where the function f is appropriately associative, commutative, > etc. Therefore, the results after reduceByKey ought to be distinct, correct? > I am running counts of my RDD and finding that adding an additional > .distinct after my .reduceByKey is changing the final count!! > Here is some example code: > rdd3 = tempRDD1. >zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2). >partitionBy(new HashPartitioner(numPartitions)). >reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, > math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))) > println(rdd3.count) > rdd4 = rdd3.distinct > println(rdd4.count) > I am using persistence, checkpointing, and other stuff in my actual code that > I did not paste here, so I can paste my actual code if it would be helpful. > This issue may be related to SPARK-2620, except I am not using case classes, > to my knowledge. > See also > http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
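A deterministic way to express the same combination, avoiding the co-partitioning and ordering assumptions behind {{zipPartitions}}, is a plain union followed by {{partitionBy}} and {{reduceByKey}}. A minimal PySpark sketch with made-up data mirroring the ((String, String), (String, Long, Long, Long, Long)) layout from this report:

{code}
# Hedged sketch: union + partitionBy + reduceByKey needs no co-ordering of inputs.
from pyspark import SparkContext

sc = SparkContext(appName="reduce-by-key-sketch")
num_partitions = 4

rdd1 = sc.parallelize([(("a", "b"), ("x", 1, 2, 3, 4))])
rdd2 = sc.parallelize([(("a", "b"), ("w", 10, 1, 1, 1))])

combined = (rdd1.union(rdd2)
            .partitionBy(num_partitions)
            .reduceByKey(lambda a, b: (min(a[0], b[0]), a[1] + b[1],
                                       max(a[2], b[2]), max(a[3], b[3]),
                                       max(a[4], b[4]))))

# one record per key, so a separate distinct() should not change the count
print(combined.count())
print(combined.collect())
{code}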
[jira] [Resolved] (SPARK-8793) error/warning with pyspark WholeTextFiles.first
[ https://issues.apache.org/jira/browse/SPARK-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diana Carroll resolved SPARK-8793. -- Resolution: Not A Problem this is no longer occurring. > error/warning with pyspark WholeTextFiles.first > --- > > Key: SPARK-8793 > URL: https://issues.apache.org/jira/browse/SPARK-8793 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0 >Reporter: Diana Carroll >Priority: Minor > Attachments: wholefilesbug.txt > > > In Spark 1.3.0 python, calling first() on sc.wholeTextFiles is not working > correctly in pyspark. It works fine in Scala. > I created a directory with two tiny, simple text files. > this works: > {code}sc.wholeTextFiles("testdata").collect(){code} > this doesn't: > {code}sc.wholeTextFiles("testdata").first(){code} > The main error message is: > {code}15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in > stage 12.0 (TID 12) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main > process() > File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in > dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft > while taken < left: > ImportError: No module named iter > {code} > I will attach the full stack trace to the JIRA. > I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0). Tested in both Python 2.6 > and 2.7, same results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2960) Spark executables fail to start via symlinks
[ https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736923#comment-14736923 ] Apache Spark commented on SPARK-2960: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/8669 > Spark executables fail to start via symlinks > > > Key: SPARK-2960 > URL: https://issues.apache.org/jira/browse/SPARK-2960 > Project: Spark > Issue Type: Bug > Components: Deploy >Reporter: Shay Rojansky >Priority: Minor > > The current scripts (e.g. pyspark) fail to run when they are executed via > symlinks. A common Linux scenario would be to have Spark installed somewhere > (e.g. /opt) and have a symlink to it in /usr/bin. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10428) Struct fields read from parquet are mis-aligned
[ https://issues.apache.org/jira/browse/SPARK-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736949#comment-14736949 ] Apache Spark commented on SPARK-10428: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8670 > Struct fields read from parquet are mis-aligned > --- > > Key: SPARK-10428 > URL: https://issues.apache.org/jira/browse/SPARK-10428 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Priority: Critical > > {code} > val df1 = sqlContext > .range(1) > .selectExpr("NAMED_STRUCT('a', id, 'd', id + 3) AS s") > .coalesce(1) > val df2 = sqlContext > .range(1, 2) > .selectExpr("NAMED_STRUCT('a', id, 'b', id + 1, 'c', id + 2, 'd', id + 3) > AS s") > .coalesce(1) > df1.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=1") > df2.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=2") > {code} > {code} > sqlContext.read.option("mergeSchema", > "true").parquet("/home/yin/sc_11_minimal/").selectExpr("s.a", "s.b", "s.c", > "s.d", “p").show > +---+---+++---+ > | a| b| c| d| p| > +---+---+++---+ > | 0| 3|null|null| 1| > | 1| 2| 3| 4| 2| > +---+---+++---+ > {code} > Looks like the problem is at > https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L185-L204, > we do padding when global schema has more struct fields than local parquet > file's schema. However, when we read field from parquet, we still use > parquet's local schema and then we put the value of {{d}} to the wrong slot. > I tried master. Looks like this issue is resolved by > https://github.com/apache/spark/pull/8509. We need to decide if we want to > back port that to branch 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail
[ https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736947#comment-14736947 ] Apache Spark commented on SPARK-10301: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8670 > For struct type, if parquet's global schema has less fields than a file's > schema, data reading will fail > > > Key: SPARK-10301 > URL: https://issues.apache.org/jira/browse/SPARK-10301 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.6.0, 1.5.1 > > > We hit this issue when reading a complex Parquet dateset without turning on > schema merging. The data set consists of Parquet files with different but > compatible schemas. In this way, the schema of the dataset is defined by > either a summary file or a random physical Parquet file if no summary files > are available. Apparently, this schema may not containing all fields > appeared in all physicla files. > Parquet was designed with schema evolution and column pruning in mind, so it > should be legal for a user to use a tailored schema to read the dataset to > save disk IO. For example, say we have a Parquet dataset consisting of two > physical Parquet files with the following two schemas: > {noformat} > message m0 { > optional group f0 { > optional int64 f00; > optional int64 f01; > } > } > message m1 { > optional group f0 { > optional int64 f01; > optional int64 f01; > optional int64 f02; > } > optional double f1; > } > {noformat} > Users should be allowed to read the dataset with the following schema: > {noformat} > message m1 { > optional group f0 { > optional int64 f01; > optional int64 f02; > } > } > {noformat} > so that {{f0.f00}} and {{f1}} are never touched. 
The above case can be > expressed by the following {{spark-shell}} snippet: > {noformat} > import sqlContext._ > import sqlContext.implicits._ > import org.apache.spark.sql.types.{LongType, StructType} > val path = "/tmp/spark/parquet" > range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id) AS f0").coalesce(1) > .write.mode("overwrite").parquet(path) > range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0", > "CAST(id AS DOUBLE) AS f1").coalesce(1) > .write.mode("append").parquet(path) > val tailoredSchema = > new StructType() > .add( > "f0", > new StructType() > .add("f01", LongType, nullable = true) > .add("f02", LongType, nullable = true), > nullable = true) > read.schema(tailoredSchema).parquet(path).show() > {noformat} > Expected output should be: > {noformat} > ++ > | f0| > ++ > |[0,null]| > |[1,null]| > |[2,null]| > | [0,0]| > | [1,1]| > | [2,2]| > ++ > {noformat} > However, current 1.5-SNAPSHOT version throws the following exception: > {noformat} > org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in > block -1 in file > hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.sql.execution.SparkPlan$$an
[jira] [Commented] (SPARK-10512) Fix @since when a function doesn't have doc
[ https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736961#comment-14736961 ] Davies Liu commented on SPARK-10512: As we discussed here https://github.com/apache/spark/pull/8657#discussion_r38992400, we should add a doc for those public API, instead putting a workaround in @since. > Fix @since when a function doesn't have doc > --- > > Key: SPARK-10512 > URL: https://issues.apache.org/jira/browse/SPARK-10512 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Yu Ishikawa > > When I tried to add @since to a function which doesn't have doc, @since > didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} > decorator. > {noformat} > Traceback (most recent call last): > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 122, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 34, in _run_code > exec code in run_globals > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 46, in > class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, > JavaLoader): > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 166, in MatrixFactorizationModel > @since("1.3.1") > File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", > line 63, in deco > indents = indent_p.findall(f.__doc__) > TypeError: expected string or buffer > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10512) Fix @since when a function doesn't have doc
[ https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-10512. -- Resolution: Won't Fix > Fix @since when a function doesn't have doc > --- > > Key: SPARK-10512 > URL: https://issues.apache.org/jira/browse/SPARK-10512 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Yu Ishikawa > > When I tried to add @since to a function which doesn't have doc, @since > didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} > decorator. > {noformat} > Traceback (most recent call last): > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 122, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 34, in _run_code > exec code in run_globals > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 46, in > class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, > JavaLoader): > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 166, in MatrixFactorizationModel > @since("1.3.1") > File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", > line 63, in deco > indents = indent_p.findall(f.__doc__) > TypeError: expected string or buffer > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10512) Fix @since when a function doesn't have doc
[ https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736973#comment-14736973 ] Yu Ishikawa commented on SPARK-10512: - [~davies] oh, I see. Thank you for letting me know. > Fix @since when a function doesn't have doc > --- > > Key: SPARK-10512 > URL: https://issues.apache.org/jira/browse/SPARK-10512 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Yu Ishikawa > > When I tried to add @since to a function which doesn't have doc, @since > didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} > decorator. > {noformat} > Traceback (most recent call last): > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 122, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 34, in _run_code > exec code in run_globals > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 46, in > class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, > JavaLoader): > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 166, in MatrixFactorizationModel > @since("1.3.1") > File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", > line 63, in deco > indents = indent_p.findall(f.__doc__) > TypeError: expected string or buffer > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job
[ https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7874: --- Assignee: Apache Spark > Add a global setting for the fine-grained mesos scheduler that limits the > number of concurrent tasks of a job > - > > Key: SPARK-7874 > URL: https://issues.apache.org/jira/browse/SPARK-7874 > Project: Spark > Issue Type: Wish > Components: Mesos >Affects Versions: 1.3.1 >Reporter: Thomas Dudziak >Assignee: Apache Spark >Priority: Minor > > This would be a very simple yet effective way to prevent a job dominating the > cluster. A way to override it per job would also be nice but not required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job
[ https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7874: --- Assignee: (was: Apache Spark) > Add a global setting for the fine-grained mesos scheduler that limits the > number of concurrent tasks of a job > - > > Key: SPARK-7874 > URL: https://issues.apache.org/jira/browse/SPARK-7874 > Project: Spark > Issue Type: Wish > Components: Mesos >Affects Versions: 1.3.1 >Reporter: Thomas Dudziak >Priority: Minor > > This would be a very simple yet effective way to prevent a job dominating the > cluster. A way to override it per job would also be nice but not required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10441) Cannot write timestamp to JSON
[ https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736986#comment-14736986 ] Don Drake commented on SPARK-10441: --- Got it, thanks for the clarification. > Cannot write timestamp to JSON > -- > > Key: SPARK-10441 > URL: https://issues.apache.org/jira/browse/SPARK-10441 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > Fix For: 1.6.0, 1.5.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job
[ https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736984#comment-14736984 ] Apache Spark commented on SPARK-7874: - User 'dragos' has created a pull request for this issue: https://github.com/apache/spark/pull/8671 > Add a global setting for the fine-grained mesos scheduler that limits the > number of concurrent tasks of a job > - > > Key: SPARK-7874 > URL: https://issues.apache.org/jira/browse/SPARK-7874 > Project: Spark > Issue Type: Wish > Components: Mesos >Affects Versions: 1.3.1 >Reporter: Thomas Dudziak >Priority: Minor > > This would be a very simple yet effective way to prevent a job dominating the > cluster. A way to override it per job would also be nice but not required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results
[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737001#comment-14737001 ] Glenn Strycker commented on SPARK-10493: In this example, our RDDs are partitioned with a hash partition, but are not ordered. I think you may be confusing zipPartitions with zipWithIndex... zipPartitions is used to merge two sets partition-wise, which enables a union without requiring any shuffles. We use zipPartitions throughout our code to make things fast, and then apply partitionBy() periodically to do the shuffles only when needed. No ordering is required. We're also not concerned with uniqueness at this point (in fact, for my application I want to keep multiplicity UNTIL the reduceByKey step), so hash collisions and such are ok for our zipPartition union step. As I've been investigating this the past few days, I went ahead and made an intermediate temp RDD that does the zipPartitions, runs partitionBy, persists, checkpoints, and then materializes the RDD. So I think this rules out that zipPartitions is causing the problems downstream for the main RDD, which only runs reduceByKey on the intermediate RDD. > reduceByKey not returning distinct results > -- > > Key: SPARK-10493 > URL: https://issues.apache.org/jira/browse/SPARK-10493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs > (using zipPartitions), partitioning by a hash partitioner, and then applying > a reduceByKey to summarize statistics by key. > Since my set before the reduceByKey consists of records such as (K, V1), (K, > V2), (K, V3), I expect the results after reduceByKey to be just (K, > f(V1,V2,V3)), where the function f is appropriately associative, commutative, > etc. Therefore, the results after reduceByKey ought to be distinct, correct? > I am running counts of my RDD and finding that adding an additional > .distinct after my .reduceByKey is changing the final count!! > Here is some example code: > rdd3 = tempRDD1. >zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2). >partitionBy(new HashPartitioner(numPartitions)). >reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, > math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))) > println(rdd3.count) > rdd4 = rdd3.distinct > println(rdd4.count) > I am using persistence, checkpointing, and other stuff in my actual code that > I did not paste here, so I can paste my actual code if it would be helpful. > This issue may be related to SPARK-2620, except I am not using case classes, > to my knowledge. > See also > http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
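For readers unfamiliar with this pattern, here is a hedged, simplified sketch of the union-by-zipPartitions approach described above (toy data, not the actual job). zipPartitions requires both RDDs to have the same number of partitions, and the only shuffle in this pipeline is the explicit partitionBy.
{code}
// Toy sketch of the pattern described above; assumes a SparkContext `sc`
// is already available (e.g. in spark-shell).
import org.apache.spark.HashPartitioner

val numPartitions = 4
val left  = sc.parallelize(Seq((("k1", "k2"), 1L), (("k3", "k4"), 2L)), numPartitions)
val right = sc.parallelize(Seq((("k1", "k2"), 10L), (("k5", "k6"), 3L)), numPartitions)

// Concatenate corresponding partitions; both RDDs must have the same number
// of partitions, and no shuffle happens at this step.
val merged = left.zipPartitions(right, preservesPartitioning = true) {
  (iter1, iter2) => iter1 ++ iter2
}

// Shuffle once to co-locate keys, then reduce.
val reduced = merged
  .partitionBy(new HashPartitioner(numPartitions))
  .reduceByKey(_ + _)

// The lineage should show a single ShuffledRDD introduced by partitionBy.
println(reduced.toDebugString)
{code}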
[jira] [Created] (SPARK-10516) Add values as a property to DenseVector in PySpark
Xiangrui Meng created SPARK-10516: - Summary: Add values as a property to DenseVector in PySpark Key: SPARK-10516 URL: https://issues.apache.org/jira/browse/SPARK-10516 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Priority: Trivial We use `values` in Scala but `array` in PySpark. We should add `values` as a property to match Scala implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json
Maciej Bryński created SPARK-10517: -- Summary: Console "Output" field is empty when using DataFrameWriter.json Key: SPARK-10517 URL: https://issues.apache.org/jira/browse/SPARK-10517 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: Maciej Bryński Priority: Minor On the HTTP application UI, the "Output" field is empty when using DataFrameWriter.json. It should show the size of the written bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json
[ https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-10517: --- Attachment: screenshot-1.png > Console "Output" field is empty when using DataFrameWriter.json > --- > > Key: SPARK-10517 > URL: https://issues.apache.org/jira/browse/SPARK-10517 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Maciej Bryński >Priority: Minor > Attachments: screenshot-1.png > > > On HTTP application UI "Output" field is empty when using > DataFrameWriter.json. > Should by size of written bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json
[ https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-10517: --- Description: On the HTTP application UI, the "Output" field is empty when using DataFrameWriter.json. It should show the size of the written bytes. Screenshot attached. was: On the HTTP application UI, the "Output" field is empty when using DataFrameWriter.json. It should show the size of the written bytes. > Console "Output" field is empty when using DataFrameWriter.json > --- > > Key: SPARK-10517 > URL: https://issues.apache.org/jira/browse/SPARK-10517 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Maciej Bryński >Priority: Minor > Attachments: screenshot-1.png > > > On the HTTP application UI, the "Output" field is empty when using > DataFrameWriter.json. > It should show the size of the written bytes. > Screenshot attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json
[ https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-10517: --- Attachment: (was: screenshot-1.png) > Console "Output" field is empty when using DataFrameWriter.json > --- > > Key: SPARK-10517 > URL: https://issues.apache.org/jira/browse/SPARK-10517 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Maciej Bryński >Priority: Minor > > On HTTP application UI "Output" field is empty when using > DataFrameWriter.json. > Should by size of written bytes. > Screenshot attached, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json
[ https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-10517: --- Attachment: screenshot-1.png > Console "Output" field is empty when using DataFrameWriter.json > --- > > Key: SPARK-10517 > URL: https://issues.apache.org/jira/browse/SPARK-10517 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Maciej Bryński >Priority: Minor > Attachments: screenshot-1.png > > > On HTTP application UI "Output" field is empty when using > DataFrameWriter.json. > Should by size of written bytes. > Screenshot attached, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results
[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737051#comment-14737051 ] Sean Owen commented on SPARK-10493: --- I think you still have the same issue with zipPartitions, unless you have an ordering on the RDD, since the partitions may not appear in any particular order, in which case zipping them may give different results. It may still not be the issue though, since a lot of partitionings will happen to have the assumed, same order anyway. Why would this necessarily be better than union()? if you have the same # of partitions and same partitioning you shouldn't have a shuffle. That's also by the by. I can't reproduce this in a simple, similar local example. I think there's something else different between what you're doing and the code snippet here. > reduceByKey not returning distinct results > -- > > Key: SPARK-10493 > URL: https://issues.apache.org/jira/browse/SPARK-10493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > > I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs > (using zipPartitions), partitioning by a hash partitioner, and then applying > a reduceByKey to summarize statistics by key. > Since my set before the reduceByKey consists of records such as (K, V1), (K, > V2), (K, V3), I expect the results after reduceByKey to be just (K, > f(V1,V2,V3)), where the function f is appropriately associative, commutative, > etc. Therefore, the results after reduceByKey ought to be distinct, correct? > I am running counts of my RDD and finding that adding an additional > .distinct after my .reduceByKey is changing the final count!! > Here is some example code: > rdd3 = tempRDD1. >zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2). >partitionBy(new HashPartitioner(numPartitions)). >reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, > math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))) > println(rdd3.count) > rdd4 = rdd3.distinct > println(rdd4.count) > I am using persistence, checkpointing, and other stuff in my actual code that > I did not paste here, so I can paste my actual code if it would be helpful. > This issue may be related to SPARK-2620, except I am not using case classes, > to my knowledge. > See also > http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
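To make the union() point concrete, here is a small hedged check with toy data (behaviour may vary by Spark version): whether a subsequent reduceByKey re-shuffles after union() can be verified directly from the RDD's partitioner and toDebugString rather than assumed.
{code}
// Hedged sketch: verify, rather than assume, whether union() of two RDDs that
// share a partitioner costs an extra shuffle in your Spark version.
import org.apache.spark.HashPartitioner

val p = new HashPartitioner(4)
val a = sc.parallelize(Seq((("x", "y"), 1L), (("u", "v"), 2L))).partitionBy(p)
val b = sc.parallelize(Seq((("x", "y"), 10L))).partitionBy(p)

val unioned = a.union(b)
// If the union preserves the common partitioner, reduceByKey can reuse it.
println(unioned.partitioner)

val reduced = unioned.reduceByKey(_ + _)
// Count the ShuffledRDD stages in the lineage to see where shuffles happen.
println(reduced.toDebugString)
{code}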
[jira] [Comment Edited] (SPARK-10493) reduceByKey not returning distinct results
[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737055#comment-14737055 ] Glenn Strycker edited comment on SPARK-10493 at 9/9/15 3:40 PM: I'm still working on checking unit tests and examples and such, but I'll go ahead and post here some simple code I am currently running in Spark Shell. The attached code works correctly as expected in Spark Shell, but I am getting different results when running my code in an sbt-compiled jar sent to Yarn via spark-submit. Pay special attention to the temp5 RDD, and the toDebugString. This is where my spark-submit code results differ. In that code, I am getting an RDD returned that is not collapsing the key pairs (cluster041,cluster043) or (cluster041,cluster044) was (Author: glenn.stryc...@gmail.com): I'm still working on checking unit tests and examples and such, but I'll go ahead and post here some simply code I am currently running in Spark Shell. The attached code works correctly as expected in Spark Shell, but I am getting different results when running my code in an sbt-compiled jar sent to Yarn via spark-submit. Pay special attention to the temp5 RDD, and the toDebugString. This is where my spark-submit code results differ. In that code, I am getting an RDD returned that is not collapsing the key pairs (cluster041,cluster043) or (cluster041,cluster044) > reduceByKey not returning distinct results > -- > > Key: SPARK-10493 > URL: https://issues.apache.org/jira/browse/SPARK-10493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > Attachments: reduceByKey_example_001.scala > > > I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs > (using zipPartitions), partitioning by a hash partitioner, and then applying > a reduceByKey to summarize statistics by key. > Since my set before the reduceByKey consists of records such as (K, V1), (K, > V2), (K, V3), I expect the results after reduceByKey to be just (K, > f(V1,V2,V3)), where the function f is appropriately associative, commutative, > etc. Therefore, the results after reduceByKey ought to be distinct, correct? > I am running counts of my RDD and finding that adding an additional > .distinct after my .reduceByKey is changing the final count!! > Here is some example code: > rdd3 = tempRDD1. >zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2). >partitionBy(new HashPartitioner(numPartitions)). >reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, > math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))) > println(rdd3.count) > rdd4 = rdd3.distinct > println(rdd4.count) > I am using persistence, checkpointing, and other stuff in my actual code that > I did not paste here, so I can paste my actual code if it would be helpful. > This issue may be related to SPARK-2620, except I am not using case classes, > to my knowledge. > See also > http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10493) reduceByKey not returning distinct results
[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glenn Strycker updated SPARK-10493: --- Attachment: reduceByKey_example_001.scala I'm still working on checking unit tests and examples and such, but I'll go ahead and post here some simply code I am currently running in Spark Shell. The attached code works correctly as expected in Spark Shell, but I am getting different results when running my code in an sbt-compiled jar sent to Yarn via spark-submit. Pay special attention to the temp5 RDD, and the toDebugString. This is where my spark-submit code results differ. In that code, I am getting an RDD returned that is not collapsing the key pairs (cluster041,cluster043) or (cluster041,cluster044) > reduceByKey not returning distinct results > -- > > Key: SPARK-10493 > URL: https://issues.apache.org/jira/browse/SPARK-10493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > Attachments: reduceByKey_example_001.scala > > > I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs > (using zipPartitions), partitioning by a hash partitioner, and then applying > a reduceByKey to summarize statistics by key. > Since my set before the reduceByKey consists of records such as (K, V1), (K, > V2), (K, V3), I expect the results after reduceByKey to be just (K, > f(V1,V2,V3)), where the function f is appropriately associative, commutative, > etc. Therefore, the results after reduceByKey ought to be distinct, correct? > I am running counts of my RDD and finding that adding an additional > .distinct after my .reduceByKey is changing the final count!! > Here is some example code: > rdd3 = tempRDD1. >zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2). >partitionBy(new HashPartitioner(numPartitions)). >reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, > math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))) > println(rdd3.count) > rdd4 = rdd3.distinct > println(rdd4.count) > I am using persistence, checkpointing, and other stuff in my actual code that > I did not paste here, so I can paste my actual code if it would be helpful. > This issue may be related to SPARK-2620, except I am not using case classes, > to my knowledge. > See also > http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode
[ https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737095#comment-14737095 ] Apache Spark commented on SPARK-10514: -- User 'SleepyThread' has created a pull request for this issue: https://github.com/apache/spark/pull/8672 > Minimum ratio of registered resources [ > spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse > Grained mode > - > > Key: SPARK-10514 > URL: https://issues.apache.org/jira/browse/SPARK-10514 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Akash Mishra > > "spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not > effecting the Mesos Coarse Grained mode. This is because the scheduler is not > overriding the "sufficientResourcesRegistered" function which is true by > default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode
[ https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10514: Assignee: (was: Apache Spark) > Minimum ratio of registered resources [ > spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse > Grained mode > - > > Key: SPARK-10514 > URL: https://issues.apache.org/jira/browse/SPARK-10514 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Akash Mishra > > "spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not > effecting the Mesos Coarse Grained mode. This is because the scheduler is not > overriding the "sufficientResourcesRegistered" function which is true by > default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode
[ https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10514: Assignee: Apache Spark > Minimum ratio of registered resources [ > spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse > Grained mode > - > > Key: SPARK-10514 > URL: https://issues.apache.org/jira/browse/SPARK-10514 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Akash Mishra >Assignee: Apache Spark > > "spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not > effecting the Mesos Coarse Grained mode. This is because the scheduler is not > overriding the "sufficientResourcesRegistered" function which is true by > default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode
[ https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737103#comment-14737103 ] Akash Mishra commented on SPARK-10514: -- Created a pull request https://github.com/apache/spark/pull/8672 for this bug. > Minimum ratio of registered resources [ > spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse > Grained mode > - > > Key: SPARK-10514 > URL: https://issues.apache.org/jira/browse/SPARK-10514 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Akash Mishra > > "spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not > effecting the Mesos Coarse Grained mode. This is because the scheduler is not > overriding the "sufficientResourcesRegistered" function which is true by > default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
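For context, the check that spark.scheduler.minRegisteredResourcesRatio is meant to drive boils down to a ratio comparison like the self-contained sketch below. This is only an illustration of the logic, not the actual patch in PR 8672; the names registeredCores and requestedCores are illustrative assumptions, not Spark API.
{code}
// Self-contained illustration of the minRegisteredResourcesRatio check.
// The actual fix overrides sufficientResourcesRegistered() in the Mesos
// coarse-grained scheduler backend.
object MinRegisteredRatioCheck {
  def sufficientResourcesRegistered(registeredCores: Int,
                                    requestedCores: Int,
                                    minRegisteredRatio: Double): Boolean =
    registeredCores >= requestedCores * minRegisteredRatio

  def main(args: Array[String]): Unit = {
    // With a ratio of 0.8 and 10 requested cores, the scheduler should wait
    // until at least 8 cores have registered before launching tasks.
    println(sufficientResourcesRegistered(7, 10, 0.8))  // false
    println(sufficientResourcesRegistered(8, 10, 0.8))  // true
  }
}
{code}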
[jira] [Resolved] (SPARK-10117) Implement SQL data source API for reading LIBSVM data
[ https://issues.apache.org/jira/browse/SPARK-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10117. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8537 [https://github.com/apache/spark/pull/8537] > Implement SQL data source API for reading LIBSVM data > - > > Key: SPARK-10117 > URL: https://issues.apache.org/jira/browse/SPARK-10117 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Kai Sasaki > Fix For: 1.6.0 > > > It is convenient to implement data source API for LIBSVM format to have a > better integration with DataFrames and ML pipeline API. > {code} > import org.apache.spark.ml.source.libsvm._ > val training = sqlContext.read > .format("libsvm") > .option("numFeatures", "1") > .load("path") > {code} > This JIRA covers the following: > 1. Read LIBSVM data as a DataFrame with two columns: label: Double and > features: Vector. > 2. Accept `numFeatures` as an option. > 3. The implementation should live under `org.apache.spark.ml.source.libsvm`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10495) For json data source, date values are saved as int strings
[ https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735964#comment-14735964 ] Yin Huai edited comment on SPARK-10495 at 9/9/15 4:40 PM: -- The bug itself is fixed by https://issues.apache.org/jira/browse/SPARK-10441. was (Author: yhuai): I think it is fixed by https://issues.apache.org/jira/browse/SPARK-10441. > For json data source, date values are saved as int strings > -- > > Key: SPARK-10495 > URL: https://issues.apache.org/jira/browse/SPARK-10495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > {code} > val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j") > df.write.format("json").save("/tmp/testJson") > sc.textFile("/tmp/testJson").collect.foreach(println) > {"i":1,"j":"-25567"} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10495) For json data source, date values are saved as int strings
[ https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737161#comment-14737161 ] Yin Huai commented on SPARK-10495: -- Since we shipped Spark 1.5.0 with this issue, it will be good to have a way to read this format in 1.5.1. > For json data source, date values are saved as int strings > -- > > Key: SPARK-10495 > URL: https://issues.apache.org/jira/browse/SPARK-10495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > {code} > val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j") > df.write.format("json").save("/tmp/testJson") > sc.textFile("/tmp/testJson").collect.foreach(println) > {"i":1,"j":"-25567"} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
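Until data written with this bug is regenerated, one possible (hedged) way to read it back is to rebuild the date from the days-since-epoch string, as sketched below. This assumes the values look like the "-25567" in the snippet above; the path and column name also follow that snippet.
{code}
// Hedged sketch: recover dates that were written as days-since-epoch strings.
import org.apache.spark.sql.functions._

val raw = sqlContext.read.json("/tmp/testJson")
val fixed = raw.withColumn(
  "j",
  expr("date_add(to_date('1970-01-01'), cast(j as int))")
)
// "-25567" corresponds to 25567 days before 1970-01-01, i.e. 1900-01-01.
fixed.show()
{code}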
[jira] [Comment Edited] (SPARK-10309) Some tasks failed with Unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737185#comment-14737185 ] Davies Liu edited comment on SPARK-10309 at 9/9/15 4:53 PM: [~nadenf] Thanks for letting us know, just realized that your stacktrace already including that fix. Maybe there are multiple join/aggregation/sort in your query? You can show the physical plan by `df.explain()` was (Author: davies): [~nadenf] Thanks for letting us know, just realized that your stacktrace already including that fix. Maybe there are multiple join/aggregation/sort in your query? You can show the physical plan by `df.eplain()` > Some tasks failed with Unable to acquire memory > --- > > Key: SPARK-10309 > URL: https://issues.apache.org/jira/browse/SPARK-10309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu > > While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on > executor): > {code} > java.io.IOException: Unable to acquire 33554432 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The task could finished after retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737185#comment-14737185 ] Davies Liu commented on SPARK-10309: [~nadenf] Thanks for letting us know, just realized that your stacktrace already including that fix. Maybe there are multiple join/aggregation/sort in your query? You can show the physical plan by `df.eplain()` > Some tasks failed with Unable to acquire memory > --- > > Key: SPARK-10309 > URL: https://issues.apache.org/jira/browse/SPARK-10309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu > > While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on > executor): > {code} > java.io.IOException: Unable to acquire 33554432 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The task could finished after retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
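As a concrete illustration of the df.explain() suggestion (entirely made-up DataFrames, not the original workload), the sketch below produces a plan with a join, an aggregation, and a sort, each of which acquires its own memory pages at execution time.
{code}
// Made-up DataFrames, only here to produce a plan with a join, an aggregation
// and a sort; assumes a sqlContext as in spark-shell.
import org.apache.spark.sql.functions._

val customersDF = sqlContext.range(0, 100).withColumnRenamed("id", "customerId")
val ordersDF = sqlContext.range(0, 1000)
  .select((col("id") % 100).as("customerId"), col("id").as("orderId"))

val summary = ordersDF.join(customersDF, "customerId")
  .groupBy("customerId").count()
  .orderBy(desc("count"))

// Look for the sort / aggregate / join operators in the physical plan.
summary.explain()
{code}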
[jira] [Created] (SPARK-10518) Update code examples in spark.ml user guide to use LIBSVM data source instead of MLUtils
Xiangrui Meng created SPARK-10518: - Summary: Update code examples in spark.ml user guide to use LIBSVM data source instead of MLUtils Key: SPARK-10518 URL: https://issues.apache.org/jira/browse/SPARK-10518 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Priority: Minor SPARK-10117 was merged, we should use LIBSVM data source in the example code in spark.ml user guide, e.g., {code} val df = sqlContext.read.format("libsvm").load("path") {code} instead of {code} val df = MLUtils.loadLibSVMFile(sc, "path").toDF() {code} We should update the following: {code} ml-ensembles.md:40:val data = MLUtils.loadLibSVMFile(sc, ml-ensembles.md:87:RDD data = MLUtils.loadLibSVMFile(jsc.sc(), ml-features.md:866:val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF() ml-features.md:892:JavaRDD rdd = MLUtils.loadLibSVMFile(sc.sc(), ml-features.md:917:data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF() ml-features.md:940:val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt") ml-features.md:964: MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD(); ml-features.md:985:data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt") ml-features.md:1022:val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt") ml-features.md:1047: MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD(); ml-features.md:1068:data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt") ml-linear-methods.md:44:val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF() ml-linear-methods.md:84:DataFrame training = sql.createDataFrame(MLUtils.loadLibSVMFile(sc, path).toJavaRDD(), LabeledPoint.class); ml-linear-methods.md:110:training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10495) For json data source, date values are saved as int strings
[ https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10495: - Target Version/s: 1.6.0, 1.5.1 (was: 1.5.1) > For json data source, date values are saved as int strings > -- > > Key: SPARK-10495 > URL: https://issues.apache.org/jira/browse/SPARK-10495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > {code} > val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j") > df.write.format("json").save("/tmp/testJson") > sc.textFile("/tmp/testJson").collect.foreach(println) > {"i":1,"j":"-25567"} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10495) For json data source, date values are saved as int strings
[ https://issues.apache.org/jira/browse/SPARK-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10495: - Target Version/s: 1.5.1 Priority: Blocker (was: Critical) > For json data source, date values are saved as int strings > -- > > Key: SPARK-10495 > URL: https://issues.apache.org/jira/browse/SPARK-10495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > {code} > val df = Seq((1, java.sql.Date.valueOf("1900-01-01"))).toDF("i", "j") > df.write.format("json").save("/tmp/testJson") > sc.textFile("/tmp/testJson").collect.foreach(println) > {"i":1,"j":"-25567"} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10481) SPARK_PREPEND_CLASSES make spark-yarn related jar could not be found
[ https://issues.apache.org/jira/browse/SPARK-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-10481. Resolution: Fixed Assignee: Jeff Zhang Fix Version/s: 1.6.0 > SPARK_PREPEND_CLASSES make spark-yarn related jar could not be found > > > Key: SPARK-10481 > URL: https://issues.apache.org/jira/browse/SPARK-10481 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.4.1 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Minor > Fix For: 1.6.0 > > > It happens when SPARK_PREPEND_CLASSES is set and run spark on yarn. > If SPARK_PREPEND_CLASSES, spark-yarn related jar won't be found. Because the > org.apache.spark.deploy.Client is detected as individual class rather class > in jar. > {code} > 15/09/08 08:57:10 ERROR SparkContext: Error initializing SparkContext. > java.util.NoSuchElementException: head of empty list > at scala.collection.immutable.Nil$.head(List.scala:337) > at scala.collection.immutable.Nil$.head(List.scala:334) > at > org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$sparkJar(Client.scala:1048) > at > org.apache.spark.deploy.yarn.Client$.populateClasspath(Client.scala:1159) > at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:534) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:645) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144) > at org.apache.spark.SparkContext.(SparkContext.scala:514) > at com.zjffdu.tutorial.spark.WordCount$.main(WordCount.scala:24) > at com.zjffdu.tutorial.spark.WordCount.main(WordCount.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results
[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737252#comment-14737252 ] Sean Owen commented on SPARK-10493: --- What do you mean that it's not collapsing key pairs? the output of temp5 shows the same keys and same count in both cases. The keys are distinct and in order after {{temp5.sortByKey(true).collect().foreach(println)}} Here's my simplistic test case which gives a consistent count when I run the code above on this: {code} val bWords = sc.broadcast(sc.textFile("/usr/share/dict/words").collect()) val tempRDD1 = sc.parallelize(1 to 1000, 10).mapPartitionsWithIndex { (i, ns) => val words = bWords.value val random = new scala.util.Random(i) ns.map { n => val a = words(random.nextInt(words.length)) val b = words(random.nextInt(words.length)) val c = words(random.nextInt(words.length)) val d = random.nextInt(words.length) val e = random.nextInt(words.length) val f = random.nextInt(words.length) val g = random.nextInt(words.length) ((a, b), (c, d, e, f, g)) } } val tempRDD2 = sc.parallelize(1 to 1000, 10).mapPartitionsWithIndex { (i, ns) => val words = bWords.value val random = new scala.util.Random(i) ns.map { n => val a = words(random.nextInt(words.length)) val b = words(random.nextInt(words.length)) val c = words(random.nextInt(words.length)) val d = random.nextInt(words.length) val e = random.nextInt(words.length) val f = random.nextInt(words.length) val g = random.nextInt(words.length) ((a, b), (c, d, e, f, g)) } } {code} > reduceByKey not returning distinct results > -- > > Key: SPARK-10493 > URL: https://issues.apache.org/jira/browse/SPARK-10493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > Attachments: reduceByKey_example_001.scala > > > I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs > (using zipPartitions), partitioning by a hash partitioner, and then applying > a reduceByKey to summarize statistics by key. > Since my set before the reduceByKey consists of records such as (K, V1), (K, > V2), (K, V3), I expect the results after reduceByKey to be just (K, > f(V1,V2,V3)), where the function f is appropriately associative, commutative, > etc. Therefore, the results after reduceByKey ought to be distinct, correct? > I am running counts of my RDD and finding that adding an additional > .distinct after my .reduceByKey is changing the final count!! > Here is some example code: > rdd3 = tempRDD1. >zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2). >partitionBy(new HashPartitioner(numPartitions)). >reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, > math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))) > println(rdd3.count) > rdd4 = rdd3.distinct > println(rdd4.count) > I am using persistence, checkpointing, and other stuff in my actual code that > I did not paste here, so I can paste my actual code if it would be helpful. > This issue may be related to SPARK-2620, except I am not using case classes, > to my knowledge. > See also > http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results
[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737296#comment-14737296 ] Glenn Strycker commented on SPARK-10493: [~srowen], the code I attached did run correctly. However, I have similar code that I run in Yarn via spark-submit that is NOT returning 1 record per key. I mean that when I run code spark-submit that generates temp5, I get a set as follows: {noformat} ((cluster021,cluster023),(cluster021,1,2,1,3)) ((cluster031,cluster033),(cluster031,1,2,1,3)) ((cluster041,cluster043),(cluster041,5,2,1,3)) ((cluster041,cluster043),(cluster041,1,2,1,3)) ((cluster041,cluster044),(cluster041,3,2,1,3)) ((cluster041,cluster044),(cluster041,4,2,1,3)) ((cluster051,cluster052),(cluster051,6,2,1,3)) ((cluster051,cluster053),(cluster051,1,2,1,3)) ((cluster051,cluster054),(cluster051,1,2,1,3)) ((cluster051,cluster055),(cluster051,1,2,1,3)) ((cluster051,cluster056),(cluster051,1,2,1,3)) ((cluster052,cluster053),(cluster051,1,1,1,2)) ((cluster052,cluster054),(cluster051,8,1,1,2)) ((cluster053,cluster054),(cluster051,7,1,1,2)) ((cluster055,cluster056),(cluster051,9,1,1,2)) {noformat} note that the keys (cluster041,cluster043) or (cluster041,cluster044) have 2 records each in the results, which should NEVER happen! Here is what I expected (which is what I see in my example code I attached to this ticket, which ran successfully in spark-shell): {noformat} ((cluster021,cluster023),(cluster021,1,2,1,3)) ((cluster031,cluster033),(cluster031,1,2,1,3)) ((cluster041,cluster043),(cluster041,6,2,1,3)) ((cluster041,cluster044),(cluster041,7,2,1,3)) ((cluster051,cluster052),(cluster051,6,2,1,3)) ((cluster051,cluster053),(cluster051,1,2,1,3)) ((cluster051,cluster054),(cluster051,1,2,1,3)) ((cluster051,cluster055),(cluster051,1,2,1,3)) ((cluster051,cluster056),(cluster051,1,2,1,3)) ((cluster052,cluster053),(cluster051,1,1,1,2)) ((cluster052,cluster054),(cluster051,8,1,1,2)) ((cluster053,cluster054),(cluster051,7,1,1,2)) ((cluster055,cluster056),(cluster051,9,1,1,2)) {noformat} You are right that in my example, distinct is not really the issue, since the records with the same keys do have different values. The issue is with reduceByKey, which is NOT reducing my RDDs correctly and resulting in 1 record per key. Does reduceByKey not support (String, String) keys? > reduceByKey not returning distinct results > -- > > Key: SPARK-10493 > URL: https://issues.apache.org/jira/browse/SPARK-10493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > Attachments: reduceByKey_example_001.scala > > > I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs > (using zipPartitions), partitioning by a hash partitioner, and then applying > a reduceByKey to summarize statistics by key. > Since my set before the reduceByKey consists of records such as (K, V1), (K, > V2), (K, V3), I expect the results after reduceByKey to be just (K, > f(V1,V2,V3)), where the function f is appropriately associative, commutative, > etc. Therefore, the results after reduceByKey ought to be distinct, correct? > I am running counts of my RDD and finding that adding an additional > .distinct after my .reduceByKey is changing the final count!! > Here is some example code: > rdd3 = tempRDD1. >zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2). >partitionBy(new HashPartitioner(numPartitions)). 
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, > math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))) > println(rdd3.count) > rdd4 = rdd3.distinct > println(rdd4.count) > I am using persistence, checkpointing, and other stuff in my actual code that > I did not paste here, so I can paste my actual code if it would be helpful. > This issue may be related to SPARK-2620, except I am not using case classes, > to my knowledge. > See also > http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
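For what it's worth, reduceByKey does accept composite (String, String) keys; grouping uses the tuple's equals/hashCode. Below is a minimal, self-contained sketch (not the reporter's job) that should produce exactly one record per key. If duplicates survive in a real job, the usual suspects are keys that are not byte-for-byte equal (e.g. hidden whitespace) or a reduce function that is not associative and commutative; that is speculation, not a diagnosis of this ticket.

{code}
// Minimal sketch, independent of the reporter's job: (String, String) keys
// reduce to one record per key because grouping uses tuple equality.
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(
  (("cluster041", "cluster043"), ("cluster041", 5, 2, 1, 3)),
  (("cluster041", "cluster043"), ("cluster041", 1, 2, 1, 3)),
  (("cluster041", "cluster044"), ("cluster041", 3, 2, 1, 3)),
  (("cluster041", "cluster044"), ("cluster041", 4, 2, 1, 3))
))

val reduced = pairs
  .partitionBy(new HashPartitioner(4))
  .reduceByKey((a, b) =>
    (math.Ordering.String.min(a._1, b._1), a._2 + b._2,
     math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))

reduced.collect().foreach(println)
// expected: ((cluster041,cluster043),(cluster041,6,2,1,3))
//           ((cluster041,cluster044),(cluster041,7,2,1,3))
{code}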
[jira] [Resolved] (SPARK-10461) make sure `input.primitive` is always variable name not code at GenerateUnsafeProjection
[ https://issues.apache.org/jira/browse/SPARK-10461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10461. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8613 [https://github.com/apache/spark/pull/8613] > make sure `input.primitive` is always variable name not code at > GenerateUnsafeProjection > > > Key: SPARK-10461 > URL: https://issues.apache.org/jira/browse/SPARK-10461 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON
Yin Huai created SPARK-10519: Summary: Investigate if we should encode timezone information to a timestamp value stored in JSON Key: SPARK-10519 URL: https://issues.apache.org/jira/browse/SPARK-10519 Project: Spark Issue Type: Task Components: SQL Reporter: Yin Huai Priority: Minor Since Spark 1.3, we store a timestamp in JSON without encoding the timezone information, and the string representation of a timestamp stored in JSON implicitly uses the local timezone (see [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454], [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38], [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41], [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]). This behavior may cause data consumers to get different values when they are in a different timezone than the data producers. Since JSON is string based, if we encode timezone information into the timestamp value, downstream applications may need to change their code (for example, java.sql.Timestamp.valueOf only supports the format of {{yyyy-\[m]m-\[d]d hh:mm:ss\[.f...]}}). We should investigate what we should do about this issue. Right now, I can think of three options: 1. Encoding timezone info in the timestamp value, which can break user code and may change the semantics of timestamp (our timestamp value is timezone-less). 2. When saving a timestamp value to JSON, we treat this value as a value in the local timezone and convert it to UTC time. Then, when saving the data, we do not encode timezone info in the value. 3. We do not change our current behavior. But, in our doc, we explicitly say that users need to use a single timezone for their datasets (e.g. always use UTC time). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
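As a value-level illustration of option 2 only (an assumption about the shape of the conversion, not the JacksonGenerator code path), rendering in UTC with no zone suffix removes the dependence on the writer's local timezone:

{code}
// Illustration of option 2; not the JacksonGenerator code path.
import java.text.SimpleDateFormat
import java.util.TimeZone

def toUtcString(ts: java.sql.Timestamp): String = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")  // no zone suffix
  fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
  fmt.format(ts)                                             // always UTC wall-clock
}
{code}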
[jira] [Commented] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON
[ https://issues.apache.org/jira/browse/SPARK-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737311#comment-14737311 ] Yin Huai commented on SPARK-10519: -- cc [~davies] I feel that option 3 is better. > Investigate if we should encode timezone information to a timestamp value > stored in JSON > > > Key: SPARK-10519 > URL: https://issues.apache.org/jira/browse/SPARK-10519 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Yin Huai >Priority: Minor > > Since Spark 1.3, we store a timestamp in JSON without encoding the timezone > information and the string representation of a timestamp stored in JSON > implicitly using the local timezone (see > [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454], > > [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38], > > [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41], > > [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]). > This behavior may cause the data consumers got different values when they > are in a different timezone with the data producers. > Since JSON is string based, if we encode timezone information to timestamp > value, downstream applications may need to change their code (for example, > java.sql.Timestamp.valueOf only supports the format of {{-\[m]m-\[d]d > hh:mm:ss\[.f...]}}). > We should investigate what we should do about this issue. Right now, I can > think of three options: > 1. Encoding timezone info in the timestamp value, which can break user code > and may change the semantic of timestamp (our timestamp value is > timezone-less). > 2. When saving a timestamp value to json, we treat this value as a value in > the local timezone and convert it to UTC time. Then, when save the data, we > do not encode timezone info in the value. > 3. We do not change our current behavior. But, in our doc, we explicitly say > that users need to use a single timezone for their datasets (e.g. always use > UTC time). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON
[ https://issues.apache.org/jira/browse/SPARK-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10519: - Target Version/s: 1.6.0 > Investigate if we should encode timezone information to a timestamp value > stored in JSON > > > Key: SPARK-10519 > URL: https://issues.apache.org/jira/browse/SPARK-10519 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Yin Huai >Priority: Minor > > Since Spark 1.3, we store a timestamp in JSON without encoding the timezone > information and the string representation of a timestamp stored in JSON > implicitly using the local timezone (see > [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454], > > [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38], > > [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41], > > [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]). > This behavior may cause the data consumers got different values when they > are in a different timezone with the data producers. > Since JSON is string based, if we encode timezone information to timestamp > value, downstream applications may need to change their code (for example, > java.sql.Timestamp.valueOf only supports the format of {{-\[m]m-\[d]d > hh:mm:ss\[.f...]}}). > We should investigate what we should do about this issue. Right now, I can > think of three options: > 1. Encoding timezone info in the timestamp value, which can break user code > and may change the semantic of timestamp (our timestamp value is > timezone-less). > 2. When saving a timestamp value to json, we treat this value as a value in > the local timezone and convert it to UTC time. Then, when save the data, we > do not encode timezone info in the value. > 3. We do not change our current behavior. But, in our doc, we explicitly say > that users need to use a single timezone for their datasets (e.g. always use > UTC time). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON
[ https://issues.apache.org/jira/browse/SPARK-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737375#comment-14737375 ] Davies Liu commented on SPARK-10519: +1 for 3, user have the ability to control timezone, it's also compatible. > Investigate if we should encode timezone information to a timestamp value > stored in JSON > > > Key: SPARK-10519 > URL: https://issues.apache.org/jira/browse/SPARK-10519 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Yin Huai >Priority: Minor > > Since Spark 1.3, we store a timestamp in JSON without encoding the timezone > information and the string representation of a timestamp stored in JSON > implicitly using the local timezone (see > [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454], > > [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38], > > [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41], > > [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]). > This behavior may cause the data consumers got different values when they > are in a different timezone with the data producers. > Since JSON is string based, if we encode timezone information to timestamp > value, downstream applications may need to change their code (for example, > java.sql.Timestamp.valueOf only supports the format of {{-\[m]m-\[d]d > hh:mm:ss\[.f...]}}). > We should investigate what we should do about this issue. Right now, I can > think of three options: > 1. Encoding timezone info in the timestamp value, which can break user code > and may change the semantic of timestamp (our timestamp value is > timezone-less). > 2. When saving a timestamp value to json, we treat this value as a value in > the local timezone and convert it to UTC time. Then, when save the data, we > do not encode timezone info in the value. > 3. We do not change our current behavior. But, in our doc, we explicitly say > that users need to use a single timezone for their datasets (e.g. always use > UTC time). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10474) Aggregation failed with unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10474: --- Target Version/s: 1.6.0, 1.5.1 Priority: Blocker (was: Critical) > Aggregation failed with unable to acquire memory > > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Priority: Blocker > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE 
NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
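For readers who prefer the DataFrame API, the aggregation pattern in the query above maps onto count(when(...)); a sketch assuming store_sales and item are available as DataFrames named storeSales and item, with only two of the id columns shown and the HAVING clause omitted:

{code}
// Sketch of the same aggregation pattern via the DataFrame API; assumes
// storeSales and item DataFrames with the column names used in the SQL.
import org.apache.spark.sql.functions.{count, when}

val agg = storeSales
  .join(item, storeSales("ss_item_sk") === item("i_item_sk"))
  .where(item("i_category") === "Books" && storeSales("ss_customer_sk").isNotNull)
  .groupBy(storeSales("ss_customer_sk"))
  .agg(count(when(item("i_class_id") === 1, 1)).as("id1"),  // count(CASE WHEN ... END)
       count(when(item("i_class_id") === 3, 1)).as("id3"))  // remaining ids follow the same pattern
{code}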
[jira] [Commented] (SPARK-9924) checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up
[ https://issues.apache.org/jira/browse/SPARK-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737416#comment-14737416 ] Thomas Graves commented on SPARK-9924: -- [~vanzin] Any reason this wasn't picked back into spark 1.5 branch? > checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up > --- > > Key: SPARK-9924 > URL: https://issues.apache.org/jira/browse/SPARK-9924 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Rohit Agarwal >Assignee: Rohit Agarwal > Fix For: 1.6.0 > > > {{checkForLogs}} and {{cleanLogs}} are scheduled using > {{ScheduledThreadPoolExecutor.scheduleAtFixedRate}}. When their execution > takes more time than the interval at which they are scheduled, they get piled > up. > This is a problem on its own but the existence of SPARK-7189 makes it even > worse. Let's say there is an eventLog which takes 15s to parse and which > happens to be the last modified file (that gets reloaded again and again due > to SPARK-7189.) If this file stays the last modified file for, let's say, an > hour, then a lot of executions of that file would have piled up as the > default {{spark.history.fs.update.interval}} is 10s. If there is a new > eventLog file now, it won't show up in the history server ui for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
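The crux is the scheduling mode; a minimal sketch (not the history-server code itself) contrasting the two java.util.concurrent options:

{code}
// Not the history-server code; just the contrast between the two modes.
import java.util.concurrent.{Executors, TimeUnit}

val pool = Executors.newScheduledThreadPool(1)
val task = new Runnable { def run(): Unit = { /* e.g. checkForLogs() */ } }

// scheduleAtFixedRate: with a 10s period and a 15s run, the missed executions
// bunch up and run back to back (the pile-up described above).
pool.scheduleAtFixedRate(task, 0, 10, TimeUnit.SECONDS)

// scheduleWithFixedDelay: the next run starts 10s after the previous one
// finishes, so slow runs cannot pile up.
// pool.scheduleWithFixedDelay(task, 0, 10, TimeUnit.SECONDS)
{code}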
[jira] [Commented] (SPARK-9924) checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up
[ https://issues.apache.org/jira/browse/SPARK-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737452#comment-14737452 ] Marcelo Vanzin commented on SPARK-9924: --- Timing, I guess (it went in around code freeze time). We can backport it to 1.5.1. > checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up > --- > > Key: SPARK-9924 > URL: https://issues.apache.org/jira/browse/SPARK-9924 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Rohit Agarwal >Assignee: Rohit Agarwal > Fix For: 1.6.0 > > > {{checkForLogs}} and {{cleanLogs}} are scheduled using > {{ScheduledThreadPoolExecutor.scheduleAtFixedRate}}. When their execution > takes more time than the interval at which they are scheduled, they get piled > up. > This is a problem on its own but the existence of SPARK-7189 makes it even > worse. Let's say there is an eventLog which takes 15s to parse and which > happens to be the last modified file (that gets reloaded again and again due > to SPARK-7189.) If this file stays the last modified file for, let's say, an > hour, then a lot of executions of that file would have piled up as the > default {{spark.history.fs.update.interval}} is 10s. If there is a new > eventLog file now, it won't show up in the history server ui for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9503) Mesos dispatcher NullPointerException (MesosClusterScheduler)
[ https://issues.apache.org/jira/browse/SPARK-9503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737453#comment-14737453 ] Timothy Chen commented on SPARK-9503: - Sorry this is indeed a bug and a fix is already in 1.5. Please try out the just released 1.5 and it shouldn't happen. > Mesos dispatcher NullPointerException (MesosClusterScheduler) > - > > Key: SPARK-9503 > URL: https://issues.apache.org/jira/browse/SPARK-9503 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.4.1 > Environment: branch-1.4 #8dfdca46dd2f527bf653ea96777b23652bc4eb83 >Reporter: Sebastian YEPES FERNANDEZ > Labels: mesosphere > > Hello, > I have just started using start-mesos-dispatcher and have been noticing that > some random crashes NPE's > By looking at the exception it looks like in certain situations the > "queuedDrivers" is empty and causes the NPE "submission.cores" > https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L512-L516 > {code:title=log|borderStyle=solid} > 15/07/30 23:56:44 INFO MesosRestServer: Started REST server for submitting > applications on port 7077 > Exception in thread "Thread-1647" java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:437) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:436) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:436) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:512) > I0731 00:53:52.969518 7014 sched.cpp:1625] Asked to abort the driver > I0731 00:53:52.969895 7014 sched.cpp:861] Aborting framework > '20150730-234528-4261456064-5050-61754-' > 15/07/31 00:53:52 INFO MesosClusterScheduler: driver.run() returned with code > DRIVER_ABORTED > {code} > A side effect of this NPE is that after the crash the dispatcher will not > start because its already registered #SPARK-7831 > {code:title=log|borderStyle=solid} > 15/07/31 09:55:47 INFO MesosClusterUI: Started MesosClusterUI at > http://192.168.0.254:8081 > I0731 09:55:47.715039 8162 sched.cpp:157] Version: 0.23.0 > I0731 09:55:47.717013 8163 sched.cpp:254] New master detected at > master@192.168.0.254:5050 > I0731 09:55:47.717381 8163 sched.cpp:264] No credentials provided. 
> Attempting to register without authentication > I0731 09:55:47.718246 8177 sched.cpp:819] Got error 'Completed framework > attempted to re-register' > I0731 09:55:47.718268 8177 sched.cpp:1625] Asked to abort the driver > 15/07/31 09:55:47 ERROR MesosClusterScheduler: Error received: Completed > framework attempted to re-register > I0731 09:55:47.719091 8177 sched.cpp:861] Aborting framework > '20150730-234528-4261456064-5050-61754-0038' > 15/07/31 09:55:47 INFO MesosClusterScheduler: driver.run() returned with code > DRIVER_ABORTED > 15/07/31 09:55:47 INFO Utils: Shutdown hook called > {code} > I can get around this by removing the zk data: > {code:title=zkCli.sh|borderStyle=solid} > rmr /spark_mesos_dispatcher > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9924) checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up
[ https://issues.apache.org/jira/browse/SPARK-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737478#comment-14737478 ] Thomas Graves commented on SPARK-9924: -- Ok, thanks. wanted to make sure no known issues with pulling it back. > checkForLogs and cleanLogs are scheduled at fixed rate and can get piled up > --- > > Key: SPARK-9924 > URL: https://issues.apache.org/jira/browse/SPARK-9924 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Rohit Agarwal >Assignee: Rohit Agarwal > Fix For: 1.6.0 > > > {{checkForLogs}} and {{cleanLogs}} are scheduled using > {{ScheduledThreadPoolExecutor.scheduleAtFixedRate}}. When their execution > takes more time than the interval at which they are scheduled, they get piled > up. > This is a problem on its own but the existence of SPARK-7189 makes it even > worse. Let's say there is an eventLog which takes 15s to parse and which > happens to be the last modified file (that gets reloaded again and again due > to SPARK-7189.) If this file stays the last modified file for, let's say, an > hour, then a lot of executions of that file would have piled up as the > default {{spark.history.fs.update.interval}} is 10s. If there is a new > eventLog file now, it won't show up in the history server ui for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10520) dates cannot be summarised in SparkR
Vincent Warmerdam created SPARK-10520: - Summary: dates cannot be summarised in SparkR Key: SPARK-10520 URL: https://issues.apache.org/jira/browse/SPARK-10520 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.5.0 Reporter: Vincent Warmerdam I create a simple dataframe in R and call the summary function on it (standard R, not SparkR). ``` > library(magrittr) > df <- data.frame( date = as.Date("2015-01-01") + 0:99, r = runif(100) ) > df %>% summary date r Min. :2015-01-01 Min. :0.01221 1st Qu.:2015-01-25 1st Qu.:0.30003 Median :2015-02-19 Median :0.46416 Mean :2015-02-19 Mean :0.50350 3rd Qu.:2015-03-16 3rd Qu.:0.73361 Max. :2015-04-10 Max. :0.99618 ``` Notice that the date can be summarised here. In SparkR; this will give an error. ``` > ddf <- createDataFrame(sqlContext, df) > ddf %>% summary Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to data type mismatch: function average requires numeric types, not DateType; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql. ``` This is a rather annoying bug since the SparkR documentation currently suggests that dates are now supported in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10520) dates cannot be summarised in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-10520: -- Component/s: SQL > dates cannot be summarised in SparkR > > > Key: SPARK-10520 > URL: https://issues.apache.org/jira/browse/SPARK-10520 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 1.5.0 >Reporter: Vincent Warmerdam > > I create a simple dataframe in R and call the summary function on it > (standard R, not SparkR). > ``` > > library(magrittr) > > df <- data.frame( > date = as.Date("2015-01-01") + 0:99, > r = runif(100) > ) > > df %>% summary > date r > Min. :2015-01-01 Min. :0.01221 > 1st Qu.:2015-01-25 1st Qu.:0.30003 > Median :2015-02-19 Median :0.46416 > Mean :2015-02-19 Mean :0.50350 > 3rd Qu.:2015-03-16 3rd Qu.:0.73361 > Max. :2015-04-10 Max. :0.99618 > ``` > Notice that the date can be summarised here. In SparkR; this will give an > error. > ``` > > ddf <- createDataFrame(sqlContext, df) > > ddf %>% summary > Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : > org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to > data type mismatch: function average requires numeric types, not DateType; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at org.apache.spark.sql. > ``` > This is a rather annoying bug since the SparkR documentation currently > suggests that dates are now supported in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10520) dates cannot be summarised in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737486#comment-14737486 ] Shivaram Venkataraman commented on SPARK-10520: --- Thanks for the report -- I think this is a problem in the Spark SQL layer (so it should also happen in Scala, Python as well) as we don't support summarizing DateType fields cc [~rxin] [~davies] > dates cannot be summarised in SparkR > > > Key: SPARK-10520 > URL: https://issues.apache.org/jira/browse/SPARK-10520 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 1.5.0 >Reporter: Vincent Warmerdam > > I create a simple dataframe in R and call the summary function on it > (standard R, not SparkR). > ``` > > library(magrittr) > > df <- data.frame( > date = as.Date("2015-01-01") + 0:99, > r = runif(100) > ) > > df %>% summary > date r > Min. :2015-01-01 Min. :0.01221 > 1st Qu.:2015-01-25 1st Qu.:0.30003 > Median :2015-02-19 Median :0.46416 > Mean :2015-02-19 Mean :0.50350 > 3rd Qu.:2015-03-16 3rd Qu.:0.73361 > Max. :2015-04-10 Max. :0.99618 > ``` > Notice that the date can be summarised here. In SparkR; this will give an > error. > ``` > > ddf <- createDataFrame(sqlContext, df) > > ddf %>% summary > Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : > org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to > data type mismatch: function average requires numeric types, not DateType; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at org.apache.spark.sql. > ``` > This is a rather annoying bug since the SparkR documentation currently > suggests that dates are now supported in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
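A sketch of the same symptom from Scala, under the assumption that describe("date") hits the same avg(date) analysis error, plus a possible interim workaround using min/max, which need only an ordering rather than numeric arithmetic:

{code}
// Assumes a spark-shell session; describe("date") is expected to fail with the
// same avg(date) error, while min/max on a DateType column should still work.
import org.apache.spark.sql.functions.{min, max}

val df = sqlContext.createDataFrame(Seq(
  (java.sql.Date.valueOf("2015-01-01"), 0.1),
  (java.sql.Date.valueOf("2015-04-10"), 0.9)
)).toDF("date", "r")

// df.describe("date")                    // expected: AnalysisException on avg(date)
df.agg(min("date"), max("date")).show()   // workaround sketch for date columns
{code}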
[jira] [Updated] (SPARK-10520) dates cannot be summarised in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10520: Description: I create a simple dataframe in R and call the summary function on it (standard R, not SparkR). {code} > library(magrittr) > df <- data.frame( date = as.Date("2015-01-01") + 0:99, r = runif(100) ) > df %>% summary date r Min. :2015-01-01 Min. :0.01221 1st Qu.:2015-01-25 1st Qu.:0.30003 Median :2015-02-19 Median :0.46416 Mean :2015-02-19 Mean :0.50350 3rd Qu.:2015-03-16 3rd Qu.:0.73361 Max. :2015-04-10 Max. :0.99618 {code} Notice that the date can be summarised here. In SparkR; this will give an error. {code} > ddf <- createDataFrame(sqlContext, df) > ddf %>% summary Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to data type mismatch: function average requires numeric types, not DateType; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql. {code} This is a rather annoying bug since the SparkR documentation currently suggests that dates are now supported in SparkR. was: I create a simple dataframe in R and call the summary function on it (standard R, not SparkR). {code} > library(magrittr) > df <- data.frame( date = as.Date("2015-01-01") + 0:99, r = runif(100) ) > df %>% summary date r Min. :2015-01-01 Min. :0.01221 1st Qu.:2015-01-25 1st Qu.:0.30003 Median :2015-02-19 Median :0.46416 Mean :2015-02-19 Mean :0.50350 3rd Qu.:2015-03-16 3rd Qu.:0.73361 Max. :2015-04-10 Max. :0.99618 {code} Notice that the date can be summarised here. In SparkR; this will give an error. {code} > ddf <- createDataFrame(sqlContext, df) > ddf %>% summary Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to data type mismatch: function average requires numeric types, not DateType; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql. 
{code} This is a rather annoying bug since the SparkR documentation currently suggests that dates are now supported in SparkR. > dates cannot be summarised in SparkR > > > Key: SPARK-10520 > URL: https://issues.apache.org/jira/browse/SPARK-10520 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 1.5.0 >Reporter: Vincent Warmerdam > > I create a simple dataframe in R and call the summary function on it > (standard R, not SparkR). > {code} > > library(magrittr) > > df <- data.frame( > date = as.Date("2015-01-01") + 0:99, > r = runif(100) > ) > > df %>% summary > date r > Min. :2015-01-01 Min. :0.01221 > 1st Qu.:2015-01-25 1st Qu.:0.30003 > Median :2015-02-19 Median :0.46416 > Mean :2015-02-19 Mean :0.50350 > 3rd Qu.:2015-03-16 3rd Qu.:0.73361 > Max. :2015-04-10 Max. :0.99618 > {code} > Notice that the date can be summa
[jira] [Updated] (SPARK-10520) dates cannot be summarised in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10520: Description: I create a simple dataframe in R and call the summary function on it (standard R, not SparkR). {code} > library(magrittr) > df <- data.frame( date = as.Date("2015-01-01") + 0:99, r = runif(100) ) > df %>% summary date r Min. :2015-01-01 Min. :0.01221 1st Qu.:2015-01-25 1st Qu.:0.30003 Median :2015-02-19 Median :0.46416 Mean :2015-02-19 Mean :0.50350 3rd Qu.:2015-03-16 3rd Qu.:0.73361 Max. :2015-04-10 Max. :0.99618 {code} Notice that the date can be summarised here. In SparkR; this will give an error. {code} > ddf <- createDataFrame(sqlContext, df) > ddf %>% summary Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to data type mismatch: function average requires numeric types, not DateType; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql. {code} This is a rather annoying bug since the SparkR documentation currently suggests that dates are now supported in SparkR. was: I create a simple dataframe in R and call the summary function on it (standard R, not SparkR). ``` > library(magrittr) > df <- data.frame( date = as.Date("2015-01-01") + 0:99, r = runif(100) ) > df %>% summary date r Min. :2015-01-01 Min. :0.01221 1st Qu.:2015-01-25 1st Qu.:0.30003 Median :2015-02-19 Median :0.46416 Mean :2015-02-19 Mean :0.50350 3rd Qu.:2015-03-16 3rd Qu.:0.73361 Max. :2015-04-10 Max. :0.99618 ``` Notice that the date can be summarised here. In SparkR; this will give an error. ``` > ddf <- createDataFrame(sqlContext, df) > ddf %>% summary Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to data type mismatch: function average requires numeric types, not DateType; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql. 
``` This is a rather annoying bug since the SparkR documentation currently suggests that dates are now supported in SparkR. > dates cannot be summarised in SparkR > > > Key: SPARK-10520 > URL: https://issues.apache.org/jira/browse/SPARK-10520 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 1.5.0 >Reporter: Vincent Warmerdam > > I create a simple dataframe in R and call the summary function on it > (standard R, not SparkR). > {code} > > library(magrittr) > > df <- data.frame( > date = as.Date("2015-01-01") + 0:99, > r = runif(100) > ) > > df %>% summary > date r > Min. :2015-01-01 Min. :0.01221 > 1st Qu.:2015-01-25 1st Qu.:0.30003 > Median :2015-02-19 Median :0.46416 > Mean :2015-02-19 Mean :0.50350 > 3rd Qu.:2015-03-16 3rd Qu.:0.73361 > Max. :2015-04-10 Max. :0.99618 > {code} > Notice that the date can be summarised here. In
[jira] [Commented] (SPARK-10520) dates cannot be summarised in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737520#comment-14737520 ] Vincent Warmerdam commented on SPARK-10520: --- Thought something similar, it seemed natural to post it here though as it is a feature that many R users are used to. > dates cannot be summarised in SparkR > > > Key: SPARK-10520 > URL: https://issues.apache.org/jira/browse/SPARK-10520 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 1.5.0 >Reporter: Vincent Warmerdam > > I create a simple dataframe in R and call the summary function on it > (standard R, not SparkR). > {code} > > library(magrittr) > > df <- data.frame( > date = as.Date("2015-01-01") + 0:99, > r = runif(100) > ) > > df %>% summary > date r > Min. :2015-01-01 Min. :0.01221 > 1st Qu.:2015-01-25 1st Qu.:0.30003 > Median :2015-02-19 Median :0.46416 > Mean :2015-02-19 Mean :0.50350 > 3rd Qu.:2015-03-16 3rd Qu.:0.73361 > Max. :2015-04-10 Max. :0.99618 > {code} > Notice that the date can be summarised here. In SparkR; this will give an > error. > {code} > > ddf <- createDataFrame(sqlContext, df) > > ddf %>% summary > Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : > org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to > data type mismatch: function average requires numeric types, not DateType; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at org.apache.spark.sql. > {code} > This is a rather annoying bug since the SparkR documentation currently > suggests that dates are now supported in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10520) dates cannot be summarised in SparkR
[ https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737520#comment-14737520 ] Vincent Warmerdam edited comment on SPARK-10520 at 9/9/15 8:24 PM: --- I figured as such, it seemed natural to post it here though as it is a feature that many R users are used to. was (Author: cantdutchthis): Thought something similar, it seemed natural to post it here though as it is a feature that many R users are used to. > dates cannot be summarised in SparkR > > > Key: SPARK-10520 > URL: https://issues.apache.org/jira/browse/SPARK-10520 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 1.5.0 >Reporter: Vincent Warmerdam > > I create a simple dataframe in R and call the summary function on it > (standard R, not SparkR). > {code} > > library(magrittr) > > df <- data.frame( > date = as.Date("2015-01-01") + 0:99, > r = runif(100) > ) > > df %>% summary > date r > Min. :2015-01-01 Min. :0.01221 > 1st Qu.:2015-01-25 1st Qu.:0.30003 > Median :2015-02-19 Median :0.46416 > Mean :2015-02-19 Mean :0.50350 > 3rd Qu.:2015-03-16 3rd Qu.:0.73361 > Max. :2015-04-10 Max. :0.99618 > {code} > Notice that the date can be summarised here. In SparkR; this will give an > error. > {code} > > ddf <- createDataFrame(sqlContext, df) > > ddf %>% summary > Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : > org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to > data type mismatch: function average requires numeric types, not DateType; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at org.apache.spark.sql. > {code} > This is a rather annoying bug since the SparkR documentation currently > suggests that dates are now supported in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10436) spark-submit overwrites spark.files defaults with the job script filename
[ https://issues.apache.org/jira/browse/SPARK-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737544#comment-14737544 ] Sanket Reddy commented on SPARK-10436: -- I am a newbie and interested in this; I will take a look at it. > spark-submit overwrites spark.files defaults with the job script filename > - > > Key: SPARK-10436 > URL: https://issues.apache.org/jira/browse/SPARK-10436 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.4.0 > Environment: Ubuntu, Spark 1.4.0 Standalone >Reporter: axel dahl >Priority: Minor > Labels: easyfix, feature > > In my spark-defaults.conf I have configured a set of libraries to be > uploaded to my Spark 1.4.0 Standalone cluster. The entry appears as: > spark.files libarary.zip,file1.py,file2.py > When I execute spark-submit -v test.py > I see that spark-submit reads the defaults correctly, but that it overwrites > the "spark.files" default entry and replaces it with the name of the job > script, i.e. "test.py". > This behavior doesn't seem intuitive. test.py should be added to the spark > working folder, but it should not overwrite the "spark.files" defaults. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737580#comment-14737580 ] William Cox commented on SPARK-7442: Between this issue with the Hadoop 2.6 deploy and the bug with Hadoop 2.4 that prevents reading zero-byte files off HDFS (https://issues.apache.org/jira/browse/HADOOP-10589), I'm hosed. Looking forward to a fix on this! > Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access > - > > Key: SPARK-7442 > URL: https://issues.apache.org/jira/browse/SPARK-7442 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.1 > Environment: OS X >Reporter: Nicholas Chammas > > # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads > page|http://spark.apache.org/downloads.html]. > # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}} > # Fire up PySpark and try reading from S3 with something like this: > {code}sc.textFile('s3n://bucket/file_*').count(){code} > # You will get an error like this: > {code}py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.io.IOException: No FileSystem for scheme: s3n{code} > {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 > works. > It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 > that doesn't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
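The "No FileSystem for scheme: s3n" error means the s3n filesystem classes are not on the classpath at all; the commonly reported workaround is to add the hadoop-aws module (and its AWS SDK dependency) to the driver and executors, since the Hadoop 2.6 line no longer bundles those classes in hadoop-common. A hedged Scala sketch of the driver-side configuration, assuming those jars have already been supplied (for example via --jars) and with placeholder credential values:
{code}
// Sketch only: assumes the hadoop-aws and AWS SDK jars are on the classpath;
// the access/secret key values below are placeholders, not real credentials.
sc.hadoopConfiguration.set("fs.s3n.impl",
  "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// Same read as in the reproduction steps above.
println(sc.textFile("s3n://bucket/file_*").count())
{code}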
[jira] [Created] (SPARK-10521) Utilize Docker to test DB2 JDBC Dialect support
Luciano Resende created SPARK-10521: --- Summary: Utilize Docker to test DB2 JDBC Dialect support Key: SPARK-10521 URL: https://issues.apache.org/jira/browse/SPARK-10521 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1, 1.5.0 Reporter: Luciano Resende There was a discussion in SPARK-10170 around using a docker image to execute the DB2 JDBC dialect tests. I will use this jira to work on providing the basic image together with the test integration. We can then extend the testing coverage as needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1169) Add countApproxDistinctByKey to PySpark
[ https://issues.apache.org/jira/browse/SPARK-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737608#comment-14737608 ] William Cox commented on SPARK-1169: I would like this feature. > Add countApproxDistinctByKey to PySpark > --- > > Key: SPARK-1169 > URL: https://issues.apache.org/jira/browse/SPARK-1169 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Matei Zaharia >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
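For context, the Scala (and Java) pair-RDD API already exposes this method; a short sketch of what the requested PySpark binding would mirror, with made-up data:
{code}
// Existing Scala API that the requested PySpark method would mirror.
// The sample data is made up for illustration.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 2), ("b", 7)))

// relativeSD controls the accuracy of the HyperLogLog-based estimate.
val approxDistinct = pairs.countApproxDistinctByKey(relativeSD = 0.05)

approxDistinct.collect().foreach { case (k, estimate) =>
  println(s"$k -> ~$estimate distinct values")
}
{code}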
[jira] [Commented] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON
[ https://issues.apache.org/jira/browse/SPARK-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737612#comment-14737612 ] Sean Owen commented on SPARK-10519: --- I always feel nervous about storing human-readable times without a timezone, since they aren't really timestamps without one. There is a standard ISO 8601 encoding for this. Relying on implicit knowledge of the timezone set on the machine that encoded the value will cause errors. At least, use GMT consistently? > Investigate if we should encode timezone information to a timestamp value > stored in JSON > > > Key: SPARK-10519 > URL: https://issues.apache.org/jira/browse/SPARK-10519 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Yin Huai >Priority: Minor > > Since Spark 1.3, we store a timestamp in JSON without encoding the timezone > information, and the string representation of a timestamp stored in JSON > implicitly uses the local timezone (see > [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454], > > [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38], > > [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41], > > [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]). > This behavior may cause data consumers to get different values when they > are in a different timezone from the data producers. > Since JSON is string based, if we encode timezone information into the timestamp > value, downstream applications may need to change their code (for example, > java.sql.Timestamp.valueOf only supports the format {{yyyy-\[m]m-\[d]d > hh:mm:ss\[.f...]}}). > We should investigate what we should do about this issue. Right now, I can > think of three options: > 1. Encode timezone info in the timestamp value, which can break user code > and may change the semantics of timestamp (our timestamp value is > timezone-less). > 2. When saving a timestamp value to JSON, treat it as a value in > the local timezone and convert it to UTC time. Then, when saving the data, do > not encode timezone info in the value. > 3. Do not change our current behavior, but explicitly say in our docs > that users need to use a single timezone for their datasets (e.g. always use > UTC time). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
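To make the ISO 8601 / fixed-GMT suggestion concrete, a small Scala sketch of what such an encoding could look like; this is an illustration of the idea, not the actual JacksonGenerator code:
{code}
// Illustration only -- not the actual JSON writer implementation.
// Formats a timestamp as ISO 8601 in a fixed UTC zone so the encoded string
// is unambiguous regardless of the writer's local timezone.
import java.text.SimpleDateFormat
import java.util.TimeZone

def toIso8601Utc(ts: java.sql.Timestamp): String = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
  fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
  fmt.format(ts)
}

// e.g. toIso8601Utc(new java.sql.Timestamp(0L)) == "1970-01-01T00:00:00.000Z"
{code}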
[jira] [Commented] (SPARK-10521) Utilize Docker to test DB2 JDBC Dialect support
[ https://issues.apache.org/jira/browse/SPARK-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737635#comment-14737635 ] Luciano Resende commented on SPARK-10521: - I'll be submitting a PR for this shortly. > Utilize Docker to test DB2 JDBC Dialect support > --- > > Key: SPARK-10521 > URL: https://issues.apache.org/jira/browse/SPARK-10521 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Luciano Resende > > There was a discussion in SPARK-10170 around using a docker image to execute > the DB2 JDBC dialect tests. I will use this jira to work on providing the > basic image together with the test integration. We can then extend the > testing coverage as needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values
[ https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737644#comment-14737644 ] Davies Liu commented on SPARK-10439: There are many places where overflow could occur, even for A + B, so I don't think it's a big deal. If we really want to handle these cases gracefully, the bounds checking should be performed on the way in, turning overflowing values into null rather than crashing (raising an exception). > Catalyst should check for overflow / underflow of date and timestamp values > --- > > Key: SPARK-10439 > URL: https://issues.apache.org/jira/browse/SPARK-10439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Priority: Minor > > While testing some code, I noticed that a few methods in {{DateTimeUtils}} > are prone to overflow and underflow. > For example, {{millisToDays}} can overflow the return type ({{Int}}) if a > large enough input value is provided. > Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which > can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the > negative case). > There might be others, but these were the ones that caught my eye. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
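A minimal Scala sketch of the "turn overflow into null instead of crashing" idea, using a simplified millis-to-days conversion; this is not the actual DateTimeUtils code (which also handles timezone offsets):
{code}
// Sketch of bounds checking on the way in: yield None (i.e. null) instead of
// silently wrapping the Int return type. Not the real DateTimeUtils code.
val MillisPerDay: Long = 24L * 60 * 60 * 1000

def millisToDaysChecked(millis: Long): Option[Int] = {
  val days = millis / MillisPerDay                       // still a Long here
  if (days > Int.MaxValue || days < Int.MinValue) None   // would overflow Int
  else Some(days.toInt)
}

// millisToDaysChecked(Long.MaxValue) == None, instead of a wrapped-around Int
{code}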
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737650#comment-14737650 ] Xin Jin commented on SPARK-4036: Are we still actively working on this task? I have some work experience with CRFs and want to contribute. Thanks. > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling methods > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results
[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737681#comment-14737681 ] Sean Owen commented on SPARK-10493: --- If the RDD is a result of reduceByKey, I agree that the keys should be unique. Tuples implement equals and hashCode correctly, as does String, so that ought to be fine. I still sort of suspect something is getting computed twice and is not quite deterministic, but the persist() call on rdd4 immediately before ought to hide that. However, it's still distantly possible this is the cause, since it is not computed and persisted before computing rdd5 starts, and might see its partitions reevaluated during that process. It's a bit of a long shot, but what about adding a temp4.count() for good measure before starting on temp5? > reduceByKey not returning distinct results > -- > > Key: SPARK-10493 > URL: https://issues.apache.org/jira/browse/SPARK-10493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > Attachments: reduceByKey_example_001.scala > > > I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs > (using zipPartitions), partitioning by a hash partitioner, and then applying > a reduceByKey to summarize statistics by key. > Since my set before the reduceByKey consists of records such as (K, V1), (K, > V2), (K, V3), I expect the results after reduceByKey to be just (K, > f(V1,V2,V3)), where the function f is appropriately associative, commutative, > etc. Therefore, the results after reduceByKey ought to be distinct, correct? > I am running counts of my RDD and finding that adding an additional > .distinct after my .reduceByKey is changing the final count!! > Here is some example code: > rdd3 = tempRDD1. >zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2). >partitionBy(new HashPartitioner(numPartitions)). >reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, > math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))) > println(rdd3.count) > rdd4 = rdd3.distinct > println(rdd4.count) > I am using persistence, checkpointing, and other stuff in my actual code that > I did not paste here, so I can paste my actual code if it would be helpful. > This issue may be related to SPARK-2620, except I am not using case classes, > to my knowledge. > See also > http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
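In code, the suggestion amounts to something like the following sketch; the data and combine function are made up, and only the persist()/count() ordering before the downstream step is the point:
{code}
// Sketch of the suggestion above: force the reduced RDD to be fully computed
// and persisted before any downstream stage (temp5) reads it, so its
// partitions cannot be re-evaluated mid-computation. Names are illustrative.
import org.apache.spark.HashPartitioner

val input = sc.parallelize(Seq(("a", 1L), ("a", 2L), ("b", 3L)))

val temp4 = input
  .partitionBy(new HashPartitioner(4))
  .reduceByKey(_ + _)        // stand-in for the real combine function
  .persist()

temp4.count()                // materializes and caches every partition of temp4

val temp5 = temp4.distinct() // now reads only the persisted partitions
println(temp5.count())
{code}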
[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results
[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737727#comment-14737727 ] Glenn Strycker commented on SPARK-10493: I already have that added in my code that I'm testing... I've been persisting, checkpointing, and materializing all RDDs, including all intermediate steps. I did try substituting union() for zipPartitions(), and that actually resulted in correct values! Very weird. What's strange is that there are no differences in my results on spark-shell or in a very small piece of test code I wrote to use spark-submit (that is, I can't replicate the original error), but this change did fix things in my production code. I'm trying to discover why zipPartitions isn't behaving identically to union in my code... I posted a stackoverflow question along these lines, if you want to read over some additional code and toDebugString results: http://stackoverflow.com/questions/32489112/what-is-the-difference-between-union-and-zippartitions-for-apache-spark-rdds I attempted adding some "implicit ordering" to the original code with zipPartitions, but that didn't fix anything -- it only worked when using union. Is it possible that ShuffledRDDs (returned by union) work with reduceByKey, but ZippedPartitionsRDD2s (returned by zipPartitions) do not? Or is it possible that the "++" operator I am using inside the zipPartitions function isn't compatible with my particular RDD structure ((String, String), (String, Long, Long, Long, Long))? Thanks so much for your help... at this point I'm tempted to replace zipPartitions with unions everywhere in my code, just for superstition's sake. I just want to understand WHY zipPartitions didn't work!! > reduceByKey not returning distinct results > -- > > Key: SPARK-10493 > URL: https://issues.apache.org/jira/browse/SPARK-10493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > Attachments: reduceByKey_example_001.scala > > > I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs > (using zipPartitions), partitioning by a hash partitioner, and then applying > a reduceByKey to summarize statistics by key. > Since my set before the reduceByKey consists of records such as (K, V1), (K, > V2), (K, V3), I expect the results after reduceByKey to be just (K, > f(V1,V2,V3)), where the function f is appropriately associative, commutative, > etc. Therefore, the results after reduceByKey ought to be distinct, correct? > I am running counts of my RDD and finding that adding an additional > .distinct after my .reduceByKey is changing the final count!! > Here is some example code: > rdd3 = tempRDD1. >zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2). >partitionBy(new HashPartitioner(numPartitions)). >reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, > math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))) > println(rdd3.count) > rdd4 = rdd3.distinct > println(rdd4.count) > I am using persistence, checkpointing, and other stuff in my actual code that > I did not paste here, so I can paste my actual code if it would be helpful. > This issue may be related to SPARK-2620, except I am not using case classes, > to my knowledge. 
> See also > http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
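For readers following along, the two variants being compared in the comment above look roughly like this; it is a self-contained sketch with made-up data and a simplified combine function, not the actual production code:
{code}
// Sketch of the two variants being compared; data and combiner are made up.
import org.apache.spark.HashPartitioner

val tempRDD1 = sc.parallelize(Seq((("k1", "k2"), 1L), (("k1", "k2"), 2L)))
val tempRDD2 = sc.parallelize(Seq((("k1", "k2"), 3L), (("k3", "k4"), 4L)))
val numPartitions = 4

// Variant that showed the problem: concatenate corresponding partitions.
val viaZip = tempRDD1
  .zipPartitions(tempRDD2, preservesPartitioning = true)((it1, it2) => it1 ++ it2)
  .partitionBy(new HashPartitioner(numPartitions))
  .reduceByKey(_ + _)

// Variant that produced the expected counts: a plain union of the two RDDs.
val viaUnion = tempRDD1.union(tempRDD2)
  .partitionBy(new HashPartitioner(numPartitions))
  .reduceByKey(_ + _)

println(viaZip.count() + " vs " + viaUnion.count())
{code}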
[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results
[ https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737730#comment-14737730 ] Sean Owen commented on SPARK-10493: --- checkpoint doesn't materialize the RDD, which is why it occurred to me to try a count. I'd try that to see if it also works. If so I do have some feeling it's due to zipping and ordering of partitions -- especially if union() also seems to work. ++ is just concatenating iterators, I don't think that can matter. I also don't think the parent RDD types matter. It's not impossible there's a problem, but there are also a lot of tests exercising reduceByKey. > reduceByKey not returning distinct results > -- > > Key: SPARK-10493 > URL: https://issues.apache.org/jira/browse/SPARK-10493 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Glenn Strycker > Attachments: reduceByKey_example_001.scala > > > I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs > (using zipPartitions), partitioning by a hash partitioner, and then applying > a reduceByKey to summarize statistics by key. > Since my set before the reduceByKey consists of records such as (K, V1), (K, > V2), (K, V3), I expect the results after reduceByKey to be just (K, > f(V1,V2,V3)), where the function f is appropriately associative, commutative, > etc. Therefore, the results after reduceByKey ought to be distinct, correct? > I am running counts of my RDD and finding that adding an additional > .distinct after my .reduceByKey is changing the final count!! > Here is some example code: > rdd3 = tempRDD1. >zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2). >partitionBy(new HashPartitioner(numPartitions)). >reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, > math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5))) > println(rdd3.count) > rdd4 = rdd3.distinct > println(rdd4.count) > I am using persistence, checkpointing, and other stuff in my actual code that > I did not paste here, so I can paste my actual code if it would be helpful. > This issue may be related to SPARK-2620, except I am not using case classes, > to my knowledge. > See also > http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
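Since checkpoint() is lazy, the ordering matters; a minimal sketch of the materialization step being discussed, with an assumed checkpoint directory and made-up data:
{code}
// Minimal sketch: checkpoint() only marks the RDD, so an action such as
// count() is still needed before the data is actually computed and saved.
sc.setCheckpointDir("/tmp/spark-checkpoints")   // assumed path for illustration

val reduced = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .reduceByKey(_ + _)
  .persist()

reduced.checkpoint()   // lazily marks the RDD for checkpointing
reduced.count()        // first action: computes, caches, then writes the checkpoint

// Later stages reuse the persisted/checkpointed data instead of recomputing it.
{code}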
[jira] [Updated] (SPARK-9996) Create local nested loop join operator
[ https://issues.apache.org/jira/browse/SPARK-9996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9996: - Assignee: Shixiong Zhu > Create local nested loop join operator > -- > > Key: SPARK-9996 > URL: https://issues.apache.org/jira/browse/SPARK-9996 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9997) Create local Expand operator
[ https://issues.apache.org/jira/browse/SPARK-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9997: - Assignee: Shixiong Zhu > Create local Expand operator > > > Key: SPARK-9997 > URL: https://issues.apache.org/jira/browse/SPARK-9997 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org