[jira] [Updated] (SPARK-10502) tidy up the exception message text to be less verbose/"User friendly"

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10502:
--
Issue Type: Improvement  (was: Bug)

> tidy up the exception message text to be less verbose/"User friendly"
> -
>
> Key: SPARK-10502
> URL: https://issues.apache.org/jira/browse/SPARK-10502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
>Priority: Minor
>
> When a statement fails to parse, it would be preferable if the exception text 
> were more aligned with other vendors in indicating the syntax error, without 
> including the verbose parse tree.
>  select tbint.rnum,tbint.cbint, nth_value( tbint.cbint, '4' ) over ( order by 
> tbint.rnum) from certstring.tbint 
> Errors:
> org.apache.spark.sql.AnalysisException: 
> Unsupported language features in query: select tbint.rnum,tbint.cbint, 
> nth_value( tbint.cbint, '4' ) over ( order by tbint.rnum) from 
> certstring.tbint
> TOK_QUERY 1, 0,40, 94
>   TOK_FROM 1, 36,40, 94
> TOK_TABREF 1, 38,40, 94
>   TOK_TABNAME 1, 38,40, 94
> certstring 1, 38,38, 94
> tbint 1, 40,40, 105
>   TOK_INSERT 0, -1,34, 0
> TOK_DESTINATION 0, -1,-1, 0
>   TOK_DIR 0, -1,-1, 0
> TOK_TMP_FILE 0, -1,-1, 0
> TOK_SELECT 1, 0,34, 12
>   TOK_SELEXPR 1, 2,4, 12
> . 1, 2,4, 12
>   TOK_TABLE_OR_COL 1, 2,2, 7
> tbint 1, 2,2, 7
>   rnum 1, 4,4, 13
>   TOK_SELEXPR 1, 6,8, 23
> . 1, 6,8, 23
>   TOK_TABLE_OR_COL 1, 6,6, 18
> tbint 1, 6,6, 18
>   cbint 1, 8,8, 24
>   TOK_SELEXPR 1, 11,34, 31
> TOK_FUNCTION 1, 11,34, 31
>   nth_value 1, 11,11, 31
>   . 1, 14,16, 47
> TOK_TABLE_OR_COL 1, 14,14, 42
>   tbint 1, 14,14, 42
> cbint 1, 16,16, 48
>   '4' 1, 19,19, 55
>   TOK_WINDOWSPEC 1, 25,34, 82
> TOK_PARTITIONINGSPEC 1, 27,33, 82
>   TOK_ORDERBY 1, 27,33, 82
> TOK_TABSORTCOLNAMEASC 1, 31,33, 82
>   . 1, 31,33, 82
> TOK_TABLE_OR_COL 1, 31,31, 77
>   tbint 1, 31,31, 77
> rnum 1, 33,33, 83
> scala.NotImplementedError: No parse rules for ASTNode type: 882, text: 
> TOK_WINDOWSPEC :
> TOK_WINDOWSPEC 1, 25,34, 82
>   TOK_PARTITIONINGSPEC 1, 27,33, 82
> TOK_ORDERBY 1, 27,33, 82
>   TOK_TABSORTCOLNAMEASC 1, 31,33, 82
> . 1, 31,33, 82
>   TOK_TABLE_OR_COL 1, 31,31, 77
> tbint 1, 31,31, 77
>   rnum 1, 33,33, 83
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1261)
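For comparison, the kind of terse reporting being asked for looks roughly like the 
following user-side sketch ({{query}} is assumed to hold the statement above; this 
is an illustration only, not a proposed change to Spark):

{code}
# Illustration only: surface just the first line of the analysis error and
# drop the parse-tree dump that follows it.
try:
    sqlContext.sql(query)
except Exception as e:  # the AnalysisException message arrives as plain text
    print(str(e).splitlines()[0])
{code}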



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7825) Poor performance in Cross Product due to no combine operations for small files.

2015-09-09 Thread Tang Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tang Yan updated SPARK-7825:

Affects Version/s: (was: 1.3.1)
   (was: 1.2.2)
   (was: 1.2.1)
   (was: 1.3.0)
   (was: 1.2.0)

> Poor performance in Cross Product due to no combine operations for small 
> files.
> ---
>
> Key: SPARK-7825
> URL: https://issues.apache.org/jira/browse/SPARK-7825
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Tang Yan
>
> When computing a cross product, if one table consists of many small files, 
> Spark SQL has to run a very large number of tasks, which leads to poor 
> performance, whereas Hive has a CombineHiveInputFormat that can combine small 
> files to reduce the task count.
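Until a combining input format exists, the effect can be approximated on the user 
side by coalescing the many-small-files side before the cross product. A minimal 
sketch, assuming the Spark 1.x DataFrame API (where a join without a condition 
yields a Cartesian product) and hypothetical paths/table names:

{code}
# Hypothetical illustration, not from the report: shrink the partition count of
# the side backed by many small files before taking the Cartesian product.
small = sqlContext.read.parquet("/path/to/many_small_files")  # assumed path
big = sqlContext.table("big_table")                           # assumed table

# join with no condition -> cross product; coalescing bounds the task count
crossed = big.join(small.coalesce(8))
{code}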



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10511) Source releases should not include maven jars

2015-09-09 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-10511:
---

 Summary: Source releases should not include maven jars
 Key: SPARK-10511
 URL: https://issues.apache.org/jira/browse/SPARK-10511
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Patrick Wendell
Priority: Blocker


I noticed our source jars seemed really big for 1.5.0. At least one 
contributing factor is that, likely due to some change in the release script, 
the Maven jars are being bundled in with the source code in our build 
directory. This runs afoul of the ASF policy on binaries in source releases; 
we should fix it in 1.5.1.

The issue (I think) is that we might invoke Maven to compute the version 
between checking out Spark from GitHub and packaging the source file. It could 
probably be fixed by simply clearing out the build/ directory after that step 
runs.
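A minimal sketch of the proposed cleanup step (the real release tooling is 
shell-based; this is only an illustration of the idea, assumed to run from the 
root of the Spark checkout):

{code}
# Hypothetical sketch, not the actual release script: once Maven has been
# invoked to compute the version, drop build/ so the downloaded Maven
# artifacts are not packaged into the source tarball.
import shutil

shutil.rmtree("build", ignore_errors=True)
{code}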



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10444) Remove duplication in Mesos schedulers

2015-09-09 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736434#comment-14736434
 ] 

Iulian Dragos commented on SPARK-10444:
---

Another example of duplicated logic: https://github.com/apache/spark/pull/8639

> Remove duplication in Mesos schedulers
> --
>
> Key: SPARK-10444
> URL: https://issues.apache.org/jira/browse/SPARK-10444
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.0
>Reporter: Iulian Dragos
>  Labels: refactoring
>
> Currently coarse-grained and fine-grained Mesos schedulers don't share much 
> code, and that leads to inconsistencies. For instance:
> - only coarse-grained mode respects {{spark.cores.max}}, see SPARK-9873
> - only coarse-grained mode blacklists slaves that fail repeatedly, but that 
> seems generally useful for both modes
> - constraints and memory checking are done on both sides (code is shared 
> though)
> - framework re-registration (master election) is only done for cluster-mode 
> deployment
> We should find a better design that groups together common concerns and 
> generally improves the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736389#comment-14736389
 ] 

Apache Spark commented on SPARK-10512:
--

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/8667

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}
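The failure is simply that the decorator assumes every decorated function has a 
docstring. A minimal sketch of a None-tolerant variant (an illustration under that 
assumption, not the actual pyspark {{since}} implementation):

{code}
import re

def since(version):
    """Hypothetical sketch: append a versionadded note, tolerating f.__doc__ == None."""
    indent_p = re.compile(r'\n( +)')

    def deco(f):
        doc = f.__doc__ or ""                     # guard against missing docstrings
        indents = indent_p.findall(doc)
        indent = ' ' * (min(len(i) for i in indents) if indents else 0)
        f.__doc__ = doc.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, version)
        return f
    return deco
{code}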



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10512:


Assignee: (was: Apache Spark)

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10512:


Assignee: Apache Spark

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>Assignee: Apache Spark
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10274) Add @since annotation to pyspark.mllib.fpm

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10274:


Assignee: Apache Spark

> Add @since annotation to pyspark.mllib.fpm
> --
>
> Key: SPARK-10274
> URL: https://issues.apache.org/jira/browse/SPARK-10274
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10274) Add @since annotation to pyspark.mllib.fpm

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736286#comment-14736286
 ] 

Apache Spark commented on SPARK-10274:
--

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/8665

> Add @since annotation to pyspark.mllib.fpm
> --
>
> Key: SPARK-10274
> URL: https://issues.apache.org/jira/browse/SPARK-10274
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10274) Add @since annotation to pyspark.mllib.fpm

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10274:


Assignee: (was: Apache Spark)

> Add @since annotation to pyspark.mllib.fpm
> --
>
> Key: SPARK-10274
> URL: https://issues.apache.org/jira/browse/SPARK-10274
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10507) timestamp - timestamp

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10507:
--
Priority: Minor  (was: Major)

> timestamp - timestamp 
> --
>
> Key: SPARK-10507
> URL: https://issues.apache.org/jira/browse/SPARK-10507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
>Priority: Minor
>
> TIMESTAMP - TIMESTAMP in ISO-SQL is an interval type. Hive 0.13 fails with 
> Error: Could not create ResultSet: Required field 'type' is unset! 
> Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges".
> select cts - cts from tts 
> Operation: execute
> Errors:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 6214.0 (TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type 
> TimestampType does not support numeric operations
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> create table  if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY 
> '\n' 
>  STORED AS orc  ;
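Until interval arithmetic is supported, a difference in seconds can be obtained by 
going through {{unix_timestamp}}. A hedged workaround sketch, assuming a 
HiveContext where the Hive {{unix_timestamp}} UDF is available and using the 
{{tts}} table from the DDL above:

{code}
# Hypothetical workaround, not from the report: subtract epoch seconds instead
# of subtracting TIMESTAMP values directly.
diff = sqlContext.sql(
    "select rnum, unix_timestamp(cts) - unix_timestamp(cts) as diff_seconds from tts")
diff.show()
{code}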



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10507) timestamp - timestamp

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736385#comment-14736385
 ] 

Sean Owen commented on SPARK-10507:
---

(Can you improve the title and description please?)

> timestamp - timestamp 
> --
>
> Key: SPARK-10507
> URL: https://issues.apache.org/jira/browse/SPARK-10507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: N Campbell
>
> TIMESTAMP - TIMESTAMP in ISO-SQL is an interval type. Hive 0.13 fails with 
> Error: Could not create ResultSet: Required field 'type' is unset! 
> Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges".
> select cts - cts from tts 
> Operation: execute
> Errors:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 6214.0 (TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type 
> TimestampType does not support numeric operations
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136)
>   at 
> org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> create table  if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY 
> '\n' 
>  STORED AS orc  ;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-10512:

Description: 
When I tried to add @since to a function which doesn't have doc, @since didn't 
go well. It seems that {{___doc___}} is {{None}} under {{since}} decorator.

{noformat}
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 166, in MatrixFactorizationModel
@since("1.3.1")
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 63, in deco
indents = indent_p.findall(f.__doc__)
TypeError: expected string or buffer
{noformat}

  was:
When I tried to add @since to a function which doesn't have doc, @since didn't 
go well. It seems that {{___doc___}} is {{None]} under {{since}} decorator.

{noformat}
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 166, in MatrixFactorizationModel
@since("1.3.1")
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 63, in deco
indents = indent_p.findall(f.__doc__)
TypeError: expected string or buffer
{noformat}


> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10111) StringIndexerModel lacks of method "labels"

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10111.
---
Resolution: Duplicate

> StringIndexerModel lacks of method "labels"
> ---
>
> Key: SPARK-10111
> URL: https://issues.apache.org/jira/browse/SPARK-10111
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Kai Sasaki
>
> Missing {{labels}} property of {{StringIndexerModel}} in pyspark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-10512:

Description: 
When I tried to add @since to a function which doesn't have doc, @since didn't 
go well. It seems that {{___doc___}} is {{None]} under {{since}} decorator.

{noformat}
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 166, in MatrixFactorizationModel
@since("1.3.1")
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 63, in deco
indents = indent_p.findall(f.__doc__)
TypeError: expected string or buffer
{noformat}

  was:
When I tried to add @since to a function which doesn't have doc, @since didn't 
go well. It seems that {{___doc___}} is {{None]} under {{since}} decorator.

```
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 166, in MatrixFactorizationModel
@since("1.3.1")
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 63, in deco
indents = indent_p.findall(f.__doc__)
TypeError: expected string or buffer
```


> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have doc, @since 
> didn't go well. It seems that {{___doc___}} is {{None]} under {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10275) Add @since annotation to pyspark.mllib.random

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736297#comment-14736297
 ] 

Apache Spark commented on SPARK-10275:
--

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/8666

> Add @since annotation to pyspark.mllib.random
> -
>
> Key: SPARK-10275
> URL: https://issues.apache.org/jira/browse/SPARK-10275
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2015-09-09 Thread Glenn Weidner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736295#comment-14736295
 ] 

Glenn Weidner commented on SPARK-7425:
--

Unit tests for other numeric types have not been added.

> spark.ml Predictor should support other numeric types for label
> ---
>
> Key: SPARK-7425
> URL: https://issues.apache.org/jira/browse/SPARK-7425
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> Currently, the Predictor abstraction expects the input labelCol type to be 
> DoubleType, but we should support other numeric types.  This will involve 
> updating the PredictorParams.validateAndTransformSchema method.
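Until the Predictor abstraction accepts other numeric types, a user-side workaround 
is to cast the label column up front. A minimal sketch, assuming a DataFrame 
{{df}} with an integer label column and any spark.ml estimator:

{code}
# Hypothetical workaround: cast a non-double label to DoubleType before fit().
from pyspark.sql.types import DoubleType

train = df.withColumn("label", df["label"].cast(DoubleType()))
model = estimator.fit(train)   # estimator is an assumed spark.ml Predictor
{code}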



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10275) Add @since annotation to pyspark.mllib.random

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10275:


Assignee: (was: Apache Spark)

> Add @since annotation to pyspark.mllib.random
> -
>
> Key: SPARK-10275
> URL: https://issues.apache.org/jira/browse/SPARK-10275
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10275) Add @since annotation to pyspark.mllib.random

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10275:


Assignee: Apache Spark

> Add @since annotation to pyspark.mllib.random
> -
>
> Key: SPARK-10275
> URL: https://issues.apache.org/jira/browse/SPARK-10275
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-10512:
---

 Summary: Fix @since when a function doesn't have doc
 Key: SPARK-10512
 URL: https://issues.apache.org/jira/browse/SPARK-10512
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.6.0
Reporter: Yu Ishikawa


When I tried to add @since to a function which doesn't have doc, @since didn't 
go well. It seems that {{___doc___}} is {{None]} under {{since}} decorator.

```
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 166, in MatrixFactorizationModel
@since("1.3.1")
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 63, in deco
indents = indent_p.findall(f.__doc__)
TypeError: expected string or buffer
```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10276) Add @since annotation to pyspark.mllib.recommendation

2015-09-09 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736346#comment-14736346
 ] 

Yu Ishikawa commented on SPARK-10276:
-

[~mengxr] should we add `@since` to the class methods annotated with 
`@classmethod` in PySpark? When I tried to do that, I got the following error. 
It seems that we can't rewrite {{___doc___}} of a `classmethod`.

{noformat}
Traceback (most recent call last):
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
122, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 34, 
in _run_code
exec code in run_globals
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 46, in 
class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
  File 
"/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
 line 175, in MatrixFactorizationModel
@classmethod
  File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
line 62, in deco
f.__doc__ = f.__doc__.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, 
version)
AttributeError: 'classmethod' object attribute '__doc__' is read-only
{noformat}
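The error comes from Python 2, where {{___doc___}} on a classmethod object is 
read-only; decorating the underlying function first, and wrapping it with 
{{classmethod}} afterwards, avoids the problem. A minimal sketch of that ordering 
(illustration only; the class and method names are placeholders):

{code}
from pyspark import since   # the decorator shown in the traceback above

class MatrixFactorizationModelExample(object):
    """Hypothetical illustration: apply @since to the plain function (whose
    __doc__ is writable) and wrap with @classmethod afterwards."""

    @classmethod            # applied last, wraps the already-annotated function
    @since("1.3.1")         # applied first, while __doc__ is still writable
    def load(cls, sc, path):
        """Load a saved model."""
        raise NotImplementedError("illustration only")
{code}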

> Add @since annotation to pyspark.mllib.recommendation
> -
>
> Key: SPARK-10276
> URL: https://issues.apache.org/jira/browse/SPARK-10276
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-09-09 Thread Naden Franciscus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736304#comment-14736304
 ] 

Naden Franciscus commented on SPARK-10309:
--

Still working on the physical plan but we have been testing with the latest 
branch-1.5.0 releases which included this fix.

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finished after retry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-09-09 Thread Naden Franciscus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736304#comment-14736304
 ] 

Naden Franciscus edited comment on SPARK-10309 at 9/9/15 6:43 AM:
--

Still working on the physical plan but we have been testing with the latest 
branch-1.5.0 releases which included this fix. It doesn't help.


was (Author: nadenf):
Still working on the physical plan but we have been testing with the latest 
branch-1.5.0 releases which included this fix.

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finished after retry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10227) sbt build on Scala 2.11 fails

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10227.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8433
[https://github.com/apache/spark/pull/8433]

> sbt build on Scala 2.11 fails
> -
>
> Key: SPARK-10227
> URL: https://issues.apache.org/jira/browse/SPARK-10227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Luc Bourlier
> Fix For: 1.6.0
>
>
> Scala 2.11 produces additional warnings compared to Scala 2.10, and with the 
> addition of 'fatal warnings' in the sbt build, the current {{trunk}} (and 
> {{branch-1.5}}) fails to build with sbt on Scala 2.11.
> Most of the warnings are about the {{@transient}} annotation not being set on 
> relevant elements, and a few point to potential bugs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10227) sbt build on Scala 2.11 fails

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10227:
--
Assignee: Luc Bourlier

> sbt build on Scala 2.11 fails
> -
>
> Key: SPARK-10227
> URL: https://issues.apache.org/jira/browse/SPARK-10227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Luc Bourlier
>Assignee: Luc Bourlier
> Fix For: 1.6.0
>
>
> Scala 2.11 produces additional warnings compared to Scala 2.10, and with the 
> addition of 'fatal warnings' in the sbt build, the current {{trunk}} (and 
> {{branch-1.5}}) fails to build with sbt on Scala 2.11.
> Most of the warnings are about the {{@transient}} annotation not being set on 
> relevant elements, and a few point to potential bugs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10327) Cache Table is not working while subquery has alias in its project list

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10327:
--
Assignee: Cheng Hao

> Cache Table is not working while subquery has alias in its project list
> ---
>
> Key: SPARK-10327
> URL: https://issues.apache.org/jira/browse/SPARK-10327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> Code to reproduce that:
> {code}
> import org.apache.spark.sql.hive.execution.HiveTableScan
> sql("select key, value, key + 1 from src").registerTempTable("abc")
> cacheTable("abc")
> val sparkPlan = sql(
>   """select a.key, b.key, c.key from
> |abc a join abc b on a.key=b.key
> |join abc c on a.key=c.key""".stripMargin).queryExecution.sparkPlan
> assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size 
> === 3) // failed
> assert(sparkPlan.collect { case e: HiveTableScan => e }.size === 0) // 
> failed
> {code}
> The query plan like:
> {code}
> == Parsed Logical Plan ==
> 'Project 
> [unresolvedalias('a.key),unresolvedalias('b.key),unresolvedalias('c.key)]
>  'Join Inner, Some(('a.key = 'c.key))
>   'Join Inner, Some(('a.key = 'b.key))
>'UnresolvedRelation [abc], Some(a)
>'UnresolvedRelation [abc], Some(b)
>   'UnresolvedRelation [abc], Some(c)
> == Analyzed Logical Plan ==
> key: int, key: int, key: int
> Project [key#14,key#61,key#66]
>  Join Inner, Some((key#14 = key#66))
>   Join Inner, Some((key#14 = key#61))
>Subquery a
> Subquery abc
>  Project [key#14,value#15,(key#14 + 1) AS _c2#16]
>   MetastoreRelation default, src, None
>Subquery b
> Subquery abc
>  Project [key#61,value#62,(key#61 + 1) AS _c2#58]
>   MetastoreRelation default, src, None
>   Subquery c
>Subquery abc
> Project [key#66,value#67,(key#66 + 1) AS _c2#63]
>  MetastoreRelation default, src, None
> == Optimized Logical Plan ==
> Project [key#14,key#61,key#66]
>  Join Inner, Some((key#14 = key#66))
>   Project [key#14,key#61]
>Join Inner, Some((key#14 = key#61))
> Project [key#14]
>  InMemoryRelation [key#14,value#15,_c2#16], true, 1, 
> StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 
> 1) AS _c2#16]), Some(abc)
> Project [key#61]
>  MetastoreRelation default, src, None
>   Project [key#66]
>MetastoreRelation default, src, None
> == Physical Plan ==
> TungstenProject [key#14,key#61,key#66]
>  BroadcastHashJoin [key#14], [key#66], BuildRight
>   TungstenProject [key#14,key#61]
>BroadcastHashJoin [key#14], [key#61], BuildRight
> ConvertToUnsafe
>  InMemoryColumnarTableScan [key#14], (InMemoryRelation 
> [key#14,value#15,_c2#16], true, 1, StorageLevel(true, true, false, true, 
> 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc))
> ConvertToUnsafe
>  HiveTableScan [key#61], (MetastoreRelation default, src, None)
>   ConvertToUnsafe
>HiveTableScan [key#66], (MetastoreRelation default, src, None)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10441) Cannot write timestamp to JSON

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10441:
--
Assignee: Yin Huai

> Cannot write timestamp to JSON
> --
>
> Key: SPARK-10441
> URL: https://issues.apache.org/jira/browse/SPARK-10441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4752) Classifier based on artificial neural network

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4752:
-
Assignee: Alexander Ulanov

> Classifier based on artificial neural network
> -
>
> Key: SPARK-4752
> URL: https://issues.apache.org/jira/browse/SPARK-4752
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
> Fix For: 1.5.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Implement a classifier based on an artificial neural network (ANN). Requirements:
> 1) Use the existing artificial neural network implementation 
> https://issues.apache.org/jira/browse/SPARK-2352, 
> https://github.com/apache/spark/pull/1290
> 2) Extend MLlib ClassificationModel trait, 
> 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
> 4) Be able to return the ANN model



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10507) reject temporal expressions such as timestamp - timestamp at parse time

2015-09-09 Thread N Campbell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

N Campbell updated SPARK-10507:
---
Description: 
TIMESTAMP - TIMESTAMP in ISO-SQL should return an interval type which SPARK 
does not support.. 

A similar expression in Hive 0.13 fails with Error: Could not create ResultSet: 
Required field 'type' is unset! Struct:TPrimitiveTypeEntry(type:null) and SPARK 
has similar "challenges". While Hive 1.2.1 has added some interval type support 
it is far from complete with respect to ISO-SQL. 

The ability to compute the period of time (years, days, weeks, hours, ...) 
between timestamps or add/substract intervals from a timestamp are extremely 
common in business applications. 

Currently, a value expression such as select timestampcol - timestampcol from t 
will fail during execution and not parse time. While the error thrown states 
that fact, it would better for those value expressions to be rejected at parse 
time along with indicating the expression that is causing the parser error.


Operation: execute
Errors:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6214.0 
(TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type 
TimestampType does not support numeric operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150)
at 
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)

create table  if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY '\n' 
 STORED AS orc  ;


  was:
TIMESTAMP - TIMESTAMP in ISO-SQL is an interval type. Hive 0.13 fails with 
Error: Could not create ResultSet: Required field 'type' is unset! 
Struct:TPrimitiveTypeEntry(type:null) and SPARK has similar "challenges".

select cts - cts from tts 



Operation: execute
Errors:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6214.0 
(TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type 
TimestampType does not support numeric operations
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136)
at 
org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150)
at 
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 

[jira] [Updated] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10316:
--
Assignee: Wenchen Fan

> respect non-deterministic expressions in PhysicalOperation
> --
>
> Key: SPARK-10316
> URL: https://issues.apache.org/jira/browse/SPARK-10316
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>
> We did a lot of special handling for non-deterministic expressions in the 
> Optimizer. However, PhysicalOperation just collects all Projects and Filters 
> and messes that up. We should respect the operator ordering caused by 
> non-deterministic expressions in PhysicalOperation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10501) support UUID as an atomic type

2015-09-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10501:
--
   Priority: Minor  (was: Major)
Component/s: SQL
 Issue Type: Improvement  (was: Bug)

> support UUID as an atomic type
> --
>
> Key: SPARK-10501
> URL: https://issues.apache.org/jira/browse/SPARK-10501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jon Haddad
>Priority: Minor
>
> It's pretty common to use UUIDs instead of integers in order to avoid 
> distributed counters.  
> I've added this, which at least lets me load dataframes that use UUIDs that I 
> can cast to strings:
> {code}
> class UUIDType(AtomicType):
> pass
> _type_mappings[UUID] = UUIDType
> _atomic_types.append(UUIDType)
> {code}
> But if I try to do anything else with the UUIDs, like this:
> {code}
> ratings.select("userid").distinct().collect()
> {code}
> I get this pile of fun: 
> {code}
> scala.MatchError: UUIDType (of class 
> org.apache.spark.sql.cassandra.types.UUIDType$)
> {code}
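
The reporter notes the values can at least be cast to strings; a minimal Scala sketch of that interim workaround (column name "userid" is from the report, and this assumes the custom UUIDType supports a cast to string as described):

{code}
import org.apache.spark.sql.functions.col

// operate on the string form of the UUID so Catalyst never has to match on UUIDType
val distinctUsers = ratings
  .select(col("userid").cast("string").as("userid"))
  .distinct()
  .collect()
{code}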



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10513) Springleaf Marketing Response

2015-09-09 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10513:
---

 Summary: Springleaf Marketing Response
 Key: SPARK-10513
 URL: https://issues.apache.org/jira/browse/SPARK-10513
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Yanbo Liang


Apply ML pipeline API to Springleaf Marketing Response 
(https://www.kaggle.com/c/springleaf-marketing-response)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10513) Springleaf Marketing Response

2015-09-09 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736648#comment-14736648
 ] 

Yanbo Liang commented on SPARK-10513:
-

I will work on this dataset.

> Springleaf Marketing Response
> -
>
> Key: SPARK-10513
> URL: https://issues.apache.org/jira/browse/SPARK-10513
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>
> Apply ML pipeline API to Springleaf Marketing Response 
> (https://www.kaggle.com/c/springleaf-marketing-response)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9578) Stemmer feature transformer

2015-09-09 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736695#comment-14736695
 ] 

yuhao yang commented on SPARK-9578:
---

A better choice for LDA seems to be lemmatization, but that requires POS tags 
and an extra vocabulary. 
If there's no other ongoing effort on this, I'd like to start with a simpler 
Porter implementation, then try to enhance it towards Snowball. [~josephkb] 
The plan is to cover the most general cases with shorter code. After all, MLlib 
is not specific to NLP.
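
A toy sketch of the kind of short suffix-stripping code meant above; it is purely illustrative and nowhere near the real Porter or Snowball algorithms:

{code}
// strip one of a handful of common English suffixes, keeping at least a 3-letter stem
def crudeStem(word: String): String = {
  val w = word.toLowerCase
  val suffixes = Seq("ization", "ational", "fulness", "ing", "ied", "ies", "ed", "s")
  suffixes.find(s => w.endsWith(s) && w.length - s.length >= 3) match {
    case Some(s) => w.dropRight(s.length)
    case None    => w
  }
}

// crudeStem("running") == "runn", crudeStem("parties") == "part"
{code}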

> Stemmer feature transformer
> ---
>
> Key: SPARK-9578
> URL: https://issues.apache.org/jira/browse/SPARK-9578
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Transformer mentioned first in [SPARK-5571] based on suggestion from 
> [~aloknsingh].  Very standard NLP preprocessing task.
> From [~aloknsingh]:
> {quote}
> We have one scala stemmer in scalanlp%chalk 
> https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
>   which can easily be copied (as it is an Apache-licensed project) and is in Scala too.
> I think this will be a better alternative than Lucene's EnglishAnalyzer or 
> OpenNLP.
> Note: we already use scalanlp%breeze via a Maven dependency, so I think 
> adding a scalanlp%chalk dependency is also an option. But as you said, we 
> can copy the code as it is small.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9564) Spark 1.5.0 Testing Plan

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736594#comment-14736594
 ] 

Sean Owen commented on SPARK-9564:
--

Now that 1.5.0 is released, can this be closed? 
Or else I'm unclear on the role of these umbrellas and would like to rehash 
that conversation again.

> Spark 1.5.0 Testing Plan
> 
>
> Key: SPARK-9564
> URL: https://issues.apache.org/jira/browse/SPARK-9564
> Project: Spark
>  Issue Type: Epic
>  Components: Build, Tests
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> This is an epic for Spark 1.5.0 release QA plans for tracking various 
> components.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode

2015-09-09 Thread Akash Mishra (JIRA)
Akash Mishra created SPARK-10514:


 Summary: Minimum ratio of registered resources [ 
spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse 
Grained mode
 Key: SPARK-10514
 URL: https://issues.apache.org/jira/browse/SPARK-10514
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Akash Mishra


"spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not 
effecting the Mesos Coarse Grained mode. This is because the scheduler is not 
overriding the "sufficientResourcesRegistered" function which is true by 
default. 
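
A hypothetical sketch of the kind of override being described, under the assumption that the coarse-grained Mesos backend tracks the cores it has acquired; the member names below (totalCoresAcquired, maxCores, minRegisteredRatio) are illustrative, not confirmed from the code base:

{code}
// only report "sufficient" once the configured fraction of requested cores is acquired
override def sufficientResourcesRegistered(): Boolean =
  totalCoresAcquired >= maxCores * minRegisteredRatio
{code}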



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736947#comment-14736947
 ] 

Apache Spark commented on SPARK-10301:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8670

> For struct type, if parquet's global schema has less fields than a file's 
> schema, data reading will fail
> 
>
> Key: SPARK-10301
> URL: https://issues.apache.org/jira/browse/SPARK-10301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> We hit this issue when reading a complex Parquet dataset without turning on 
> schema merging.  The dataset consists of Parquet files with different but 
> compatible schemas.  In this way, the schema of the dataset is defined by 
> either a summary file or a random physical Parquet file if no summary files 
> are available.  As a result, this schema may not contain all the fields that 
> appear in all physical files.
> Parquet was designed with schema evolution and column pruning in mind, so it 
> should be legal for a user to use a tailored schema to read the dataset to 
> save disk IO.  For example, say we have a Parquet dataset consisting of two 
> physical Parquet files with the following two schemas:
> {noformat}
> message m0 {
>   optional group f0 {
> optional int64 f00;
> optional int64 f01;
>   }
> }
> message m1 {
>   optional group f0 {
> optional int64 f00;
> optional int64 f01;
> optional int64 f02;
>   }
>   optional double f1;
> }
> {noformat}
> Users should be allowed to read the dataset with the following schema:
> {noformat}
> message m1 {
>   optional group f0 {
> optional int64 f01;
> optional int64 f02;
>   }
> }
> {noformat}
> so that {{f0.f00}} and {{f1}} are never touched.  The above case can be 
> expressed by the following {{spark-shell}} snippet:
> {noformat}
> import sqlContext._
> import sqlContext.implicits._
> import org.apache.spark.sql.types.{LongType, StructType}
> val path = "/tmp/spark/parquet"
> range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id) AS f0").coalesce(1)
> .write.mode("overwrite").parquet(path)
> range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0", 
> "CAST(id AS DOUBLE) AS f1").coalesce(1)
> .write.mode("append").parquet(path)
> val tailoredSchema =
>   new StructType()
> .add(
>   "f0",
>   new StructType()
> .add("f01", LongType, nullable = true)
> .add("f02", LongType, nullable = true),
>   nullable = true)
> read.schema(tailoredSchema).parquet(path).show()
> {noformat}
> Expected output should be:
> {noformat}
> ++
> |  f0|
> ++
> |[0,null]|
> |[1,null]|
> |[2,null]|
> |   [0,0]|
> |   [1,1]|
> |   [2,2]|
> ++
> {noformat}
> However, current 1.5-SNAPSHOT version throws the following exception:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
> at 
> 

[jira] [Commented] (SPARK-10428) Struct fields read from parquet are mis-aligned

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736949#comment-14736949
 ] 

Apache Spark commented on SPARK-10428:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8670

> Struct fields read from parquet are mis-aligned
> ---
>
> Key: SPARK-10428
> URL: https://issues.apache.org/jira/browse/SPARK-10428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Critical
>
> {code}
> val df1 = sqlContext
> .range(1)
> .selectExpr("NAMED_STRUCT('a', id, 'd', id + 3) AS s")
> .coalesce(1)
> val df2 = sqlContext
>   .range(1, 2)
>   .selectExpr("NAMED_STRUCT('a', id, 'b', id + 1, 'c', id + 2, 'd', id + 3) 
> AS s")
>   .coalesce(1)
> df1.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=1")
> df2.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=2")
> {code}
> {code}
> sqlContext.read.option("mergeSchema", 
> "true").parquet("/home/yin/sc_11_minimal/").selectExpr("s.a", "s.b", "s.c", 
> "s.d", “p").show
> +---+---+++---+
> |  a|  b|   c|   d|  p|
> +---+---+++---+
> |  0|  3|null|null|  1|
> |  1|  2|   3|   4|  2|
> +---+---+++---+
> {code}
> Looks like the problem is at 
> https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L185-L204,
>  we do padding when the global schema has more struct fields than the local 
> Parquet file's schema. However, when we read a field from Parquet, we still use 
> Parquet's local schema and end up putting the value of {{d}} into the wrong slot.
> I tried master. Looks like this issue is resolved by 
> https://github.com/apache/spark/pull/8509. We need to decide if we want to 
> back port that to branch 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736879#comment-14736879
 ] 

Sean Owen commented on SPARK-10493:
---

That much should be OK. 
zipPartitions only makes sense if you have two ordered, identically partitioned 
data sets. Is that true of the temp RDDs?
Otherwise that could be a source of nondeterminism.
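
An illustrative alternative rather than a confirmed fix: a plain union does not depend on the two RDDs' partitions lining up in any particular order, and reduceByKey can apply the partitioner itself. tempRDD1, tempRDD2 and numPartitions come from the report; reduceFunc stands in for the reporter's merge function.

{code}
import org.apache.spark.HashPartitioner

// union, then reduce by key with an explicit partitioner; no reliance on partition order
val rdd3 = tempRDD1
  .union(tempRDD2)
  .reduceByKey(new HashPartitioner(numPartitions), reduceFunc)
println(rdd3.count)
{code}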

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10515) When kill executor, there is no need to seed RequestExecutors to AM

2015-09-09 Thread KaiXinXIaoLei (JIRA)
KaiXinXIaoLei created SPARK-10515:
-

 Summary: When kill executor, there is no need to seed 
RequestExecutors to AM
 Key: SPARK-10515
 URL: https://issues.apache.org/jira/browse/SPARK-10515
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: KaiXinXIaoLei
 Fix For: 1.6.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8793) error/warning with pyspark WholeTextFiles.first

2015-09-09 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll resolved SPARK-8793.
--
Resolution: Not A Problem

This is no longer occurring.

> error/warning with pyspark WholeTextFiles.first
> ---
>
> Key: SPARK-8793
> URL: https://issues.apache.org/jira/browse/SPARK-8793
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Diana Carroll
>Priority: Minor
> Attachments: wholefilesbug.txt
>
>
> In Spark 1.3.0 python, calling first() on sc.wholeTextFiles is not working 
> correctly in pyspark.  It works fine in Scala.
> I created a directory with two tiny, simple text files.  
> this works:
> {code}sc.wholeTextFiles("testdata").collect(){code}
> this doesn't:
> {code}sc.wholeTextFiles("testdata").first(){code}
> The main error message is:
> {code}15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in 
> stage 12.0 (TID 12)
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
> process()
>   File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in 
> dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
> while taken < left:
> ImportError: No module named iter
> {code}
> I will attach the full stack trace to the JIRA.
> I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0).  Tested in both Python 2.6 
> and 2.7, same results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736869#comment-14736869
 ] 

Glenn Strycker commented on SPARK-10493:


The RDD I am using has the form ((String, String), (String, Long, Long, Long, 
Long)), so the key is actually a (String, String) tuple.

Are there any sorting operations that would require implicit ordering, buried 
under the covers of the reduceByKey operation, that would be causing the 
problems with non-uniqueness?

Does partitionBy(HashPartitioner(numPartitions)) not work with a (String, 
String) tuple?  I've not had any noticeable problems with this before, although 
that would certainly explain errors in reduceByKey and distinct.

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2960) Spark executables fail to start via symlinks

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736923#comment-14736923
 ] 

Apache Spark commented on SPARK-2960:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/8669

> Spark executables fail to start via symlinks
> 
>
> Key: SPARK-2960
> URL: https://issues.apache.org/jira/browse/SPARK-2960
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Shay Rojansky
>Priority: Minor
>
> The current scripts (e.g. pyspark) fail to run when they are executed via 
> symlinks. A common Linux scenario would be to have Spark installed somewhere 
> (e.g. /opt) and have a symlink to it in /usr/bin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10515) When kill executor, there is no need to seed RequestExecutors to AM

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736853#comment-14736853
 ] 

Apache Spark commented on SPARK-10515:
--

User 'KaiXinXiaoLei' has created a pull request for this issue:
https://github.com/apache/spark/pull/8668

> When kill executor, there is no need to seed RequestExecutors to AM
> ---
>
> Key: SPARK-10515
> URL: https://issues.apache.org/jira/browse/SPARK-10515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10515) When kill executor, there is no need to seed RequestExecutors to AM

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10515:


Assignee: Apache Spark

> When kill executor, there is no need to seed RequestExecutors to AM
> ---
>
> Key: SPARK-10515
> URL: https://issues.apache.org/jira/browse/SPARK-10515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
>Assignee: Apache Spark
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10515) When kill executor, there is no need to seed RequestExecutors to AM

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10515:


Assignee: (was: Apache Spark)

> When kill executor, there is no need to seed RequestExecutors to AM
> ---
>
> Key: SPARK-10515
> URL: https://issues.apache.org/jira/browse/SPARK-10515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736961#comment-14736961
 ] 

Davies Liu commented on SPARK-10512:


As we discussed at 
https://github.com/apache/spark/pull/8657#discussion_r38992400, we should add a 
doc for those public APIs instead of putting a workaround in @since. 

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have a doc, @since 
> didn't work well. It seems that {{__doc__}} is {{None}} under the {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-10512.
--
Resolution: Won't Fix

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have a doc, @since 
> didn't work well. It seems that {{__doc__}} is {{None}} under the {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-09-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10517:
---
Attachment: screenshot-1.png

> Console "Output" field is empty when using DataFrameWriter.json
> ---
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> On the HTTP application UI, the "Output" field is empty when using 
> DataFrameWriter.json.
> It should show the size of the bytes written.
> Screenshot attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736973#comment-14736973
 ] 

Yu Ishikawa commented on SPARK-10512:
-

[~davies] oh, I see. Thank you for letting me know.

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have a doc, @since 
> didn't work well. It seems that {{__doc__}} is {{None}} under the {{since}} 
> decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7874:
---

Assignee: Apache Spark

> Add a global setting for the fine-grained mesos scheduler that limits the 
> number of concurrent tasks of a job
> -
>
> Key: SPARK-7874
> URL: https://issues.apache.org/jira/browse/SPARK-7874
> Project: Spark
>  Issue Type: Wish
>  Components: Mesos
>Affects Versions: 1.3.1
>Reporter: Thomas Dudziak
>Assignee: Apache Spark
>Priority: Minor
>
> This would be a very simple yet effective way to prevent a job dominating the 
> cluster. A way to override it per job would also be nice but not required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7874:
---

Assignee: (was: Apache Spark)

> Add a global setting for the fine-grained mesos scheduler that limits the 
> number of concurrent tasks of a job
> -
>
> Key: SPARK-7874
> URL: https://issues.apache.org/jira/browse/SPARK-7874
> Project: Spark
>  Issue Type: Wish
>  Components: Mesos
>Affects Versions: 1.3.1
>Reporter: Thomas Dudziak
>Priority: Minor
>
> This would be a very simple yet effective way to prevent a job dominating the 
> cluster. A way to override it per job would also be nice but not required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10516) Add values as a property to DenseVector in PySpark

2015-09-09 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10516:
-

 Summary: Add values as a property to DenseVector in PySpark
 Key: SPARK-10516
 URL: https://issues.apache.org/jira/browse/SPARK-10516
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Priority: Trivial


We use `values` in Scala but `array` in PySpark. We should add `values` as a 
property to match the Scala implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-09-09 Thread JIRA
Maciej Bryński created SPARK-10517:
--

 Summary: Console "Output" field is empty when using 
DataFrameWriter.json
 Key: SPARK-10517
 URL: https://issues.apache.org/jira/browse/SPARK-10517
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Maciej Bryński
Priority: Minor


On the HTTP application UI, the "Output" field is empty when using DataFrameWriter.json.

It should show the size of the bytes written.
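
A minimal reproduction sketch under the assumptions of this report (any small DataFrame and any writable path should do):

{code}
// write a small DataFrame as JSON, then check the "Output" column for the
// corresponding stage on the application web UI
sqlContext.range(0, 1000).write.json("/tmp/spark-10517-repro")
{code}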



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode

2015-09-09 Thread Akash Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737103#comment-14737103
 ] 

Akash Mishra commented on SPARK-10514:
--

Created a pull request https://github.com/apache/spark/pull/8672 for this bug.

> Minimum ratio of registered resources [ 
> spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse 
> Grained mode
> -
>
> Key: SPARK-10514
> URL: https://issues.apache.org/jira/browse/SPARK-10514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Akash Mishra
>
> "spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not 
> effecting the Mesos Coarse Grained mode. This is because the scheduler is not 
> overriding the "sufficientResourcesRegistered" function which is true by 
> default. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10117) Implement SQL data source API for reading LIBSVM data

2015-09-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10117.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8537
[https://github.com/apache/spark/pull/8537]

> Implement SQL data source API for reading LIBSVM data
> -
>
> Key: SPARK-10117
> URL: https://issues.apache.org/jira/browse/SPARK-10117
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
> Fix For: 1.6.0
>
>
> It is convenient to implement a data source API for the LIBSVM format to have 
> better integration with DataFrames and the ML pipeline API.
> {code}
> import org.apache.spark.ml.source.libsvm._
> val training = sqlContext.read
>   .format("libsvm")
>   .option("numFeatures", "1")
>   .load("path")
> {code}
> This JIRA covers the following:
> 1. Read LIBSVM data as a DataFrame with two columns: label: Double and 
> features: Vector.
> 2. Accept `numFeatures` as an option.
> 3. The implementation should live under `org.apache.spark.ml.source.libsvm`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737001#comment-14737001
 ] 

Glenn Strycker commented on SPARK-10493:


In this example, our RDDs are partitioned with a hash partition, but are not 
ordered.

I think you may be confusing zipPartitions with zipWithIndex... zipPartitions 
is used to merge two sets partition-wise, which enables a union without 
requiring any shuffles.  We use zipPartitions throughout our code to make 
things fast, and then apply partitionBy() periodically to do the shuffles only 
when needed.  No ordering is required.  We're also not concerned with 
uniqueness at this point (in fact, for my application I want to keep 
multiplicity UNTIL the reduceByKey step), so hash collisions and such are ok 
for our zipPartition union step.

As I've been investigating this the past few days, I went ahead and made an 
intermediate temp RDD that does the zipPartitions, runs partitionBy, persists, 
checkpoints, and then materializes the RDD.  So I think this rules out that 
zipPartitions is causing the problems downstream for the main RDD, which only 
runs reduceByKey on the intermediate RDD.

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-09-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10517:
---
Attachment: (was: screenshot-1.png)

> Console "Output" field is empty when using DataFrameWriter.json
> ---
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> On the HTTP application UI, the "Output" field is empty when using 
> DataFrameWriter.json.
> It should show the size of the bytes written.
> Screenshot attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10514:


Assignee: Apache Spark

> Minimum ratio of registered resources [ 
> spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse 
> Grained mode
> -
>
> Key: SPARK-10514
> URL: https://issues.apache.org/jira/browse/SPARK-10514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Akash Mishra
>Assignee: Apache Spark
>
> "spark.scheduler.minRegisteredResourcesRatio" configuration parameter is not 
> effecting the Mesos Coarse Grained mode. This is because the scheduler is not 
> overriding the "sufficientResourcesRegistered" function which is true by 
> default. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-09-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10517:
---
Attachment: screenshot-1.png

> Console "Output" field is empty when using DataFrameWriter.json
> ---
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> On the HTTP application UI, the "Output" field is empty when using 
> DataFrameWriter.json.
> It should show the size of the bytes written.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737055#comment-14737055
 ] 

Glenn Strycker edited comment on SPARK-10493 at 9/9/15 3:40 PM:


I'm still working on checking unit tests and examples and such, but I'll go 
ahead and post here some simple code I am currently running in Spark Shell.  
The attached code works correctly as expected in Spark Shell, but I am getting 
different results when running my code in an sbt-compiled jar sent to Yarn via 
spark-submit.

Pay special attention to the temp5 RDD, and the toDebugString.  This is where 
my spark-submit code results differ.  In that code, I am getting an RDD 
returned that is not collapsing the key pairs (cluster041,cluster043) or 
(cluster041,cluster044)



was (Author: glenn.stryc...@gmail.com):
I'm still working on checking unit tests and examples and such, but I'll go 
ahead and post here some simply code I am currently running in Spark Shell.  
The attached code works correctly as expected in Spark Shell, but I am getting 
different results when running my code in an sbt-compiled jar sent to Yarn via 
spark-submit.

Pay special attention to the temp5 RDD, and the toDebugString.  This is where 
my spark-submit code results differ.  In that code, I am getting an RDD 
returned that is not collapsing the key pairs (cluster041,cluster043) or 
(cluster041,cluster044)


> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glenn Strycker updated SPARK-10493:
---
Attachment: reduceByKey_example_001.scala

I'm still working on checking unit tests and examples and such, but I'll go 
ahead and post here some simply code I am currently running in Spark Shell.  
The attached code works correctly as expected in Spark Shell, but I am getting 
different results when running my code in an sbt-compiled jar sent to Yarn via 
spark-submit.

Pay special attention to the temp5 RDD, and the toDebugString.  This is where 
my spark-submit code results differ.  In that code, I am getting an RDD 
returned that is not collapsing the key pairs (cluster041,cluster043) or 
(cluster041,cluster044)


> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10441) Cannot write timestamp to JSON

2015-09-09 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736986#comment-14736986
 ] 

Don Drake commented on SPARK-10441:
---

Got it, thanks for the clarification.

> Cannot write timestamp to JSON
> --
>
> Key: SPARK-10441
> URL: https://issues.apache.org/jira/browse/SPARK-10441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736984#comment-14736984
 ] 

Apache Spark commented on SPARK-7874:
-

User 'dragos' has created a pull request for this issue:
https://github.com/apache/spark/pull/8671

> Add a global setting for the fine-grained mesos scheduler that limits the 
> number of concurrent tasks of a job
> -
>
> Key: SPARK-7874
> URL: https://issues.apache.org/jira/browse/SPARK-7874
> Project: Spark
>  Issue Type: Wish
>  Components: Mesos
>Affects Versions: 1.3.1
>Reporter: Thomas Dudziak
>Priority: Minor
>
> This would be a very simple yet effective way to prevent a job dominating the 
> cluster. A way to override it per job would also be nice but not required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-09-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10517:
---
Description: 
On the HTTP application UI, the "Output" field is empty when using DataFrameWriter.json.

It should show the size of the bytes written.

Screenshot attached.

  was:
On the HTTP application UI, the "Output" field is empty when using DataFrameWriter.json.

It should show the size of the bytes written.


> Console "Output" field is empty when using DataFrameWriter.json
> ---
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> On the HTTP application UI, the "Output" field is empty when using 
> DataFrameWriter.json.
> It should show the size of the bytes written.
> Screenshot attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737051#comment-14737051
 ] 

Sean Owen commented on SPARK-10493:
---

I think you still have the same issue with zipPartitions, unless you have an 
ordering on the RDD, since the partitions may not appear in any particular 
order, in which case zipping them may give different results. It may still not 
be the issue though, since a lot of partitionings will happen to have the 
assumed, same order anyway.

Why would this necessarily be better than union()? If you have the same number of 
partitions and the same partitioning, you shouldn't have a shuffle. That's also by 
the by.

I can't reproduce this in a simple, similar local example. I think there's 
something else different between what you're doing and the code snippet here.

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737050#comment-14737050
 ] 

Sean Owen commented on SPARK-10493:
---

I think you still have the same issue with zipPartitions, unless you have an 
ordering on the RDD, since the partitions may not appear in any particular 
order, in which case zipping them may give different results. It may still not 
be the issue though, since a lot of partitionings will happen to have the 
assumed, same order anyway.

Why would this necessarily be better than union()? If you have the same number of 
partitions and the same partitioning, you shouldn't have a shuffle. That's also by 
the by.

I can't reproduce this in a simple, similar local example. I think there's 
something else different between what you're doing and the code snippet here.

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10514:


Assignee: (was: Apache Spark)

> Minimum ratio of registered resources [ 
> spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse 
> Grained mode
> -
>
> Key: SPARK-10514
> URL: https://issues.apache.org/jira/browse/SPARK-10514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Akash Mishra
>
> The "spark.scheduler.minRegisteredResourcesRatio" configuration parameter has 
> no effect in Mesos coarse-grained mode. This is because the scheduler backend 
> does not override the "sufficientResourcesRegistered" function, which returns 
> true by default.
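
For context, the check in question can be sketched roughly as below; the field 
names (totalCoresAcquired, maxCores) and the constructor are illustrative 
assumptions, not the actual CoarseMesosSchedulerBackend members or the eventual 
fix.

{code}
// Sketch only: the base implementation of sufficientResourcesRegistered()
// returns true, so unless a backend overrides it,
// spark.scheduler.minRegisteredResourcesRatio is effectively ignored.
class CoarseMesosBackendSketch(minRegisteredRatio: Double, maxCores: Int) {
  @volatile var totalCoresAcquired: Int = 0  // cores accepted from Mesos offers so far

  // An override along these lines would make the ratio meaningful in coarse-grained mode.
  def sufficientResourcesRegistered(): Boolean =
    totalCoresAcquired >= maxCores * minRegisteredRatio
}
{code}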



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10514) Minimum ratio of registered resources [ spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse Grained mode

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737095#comment-14737095
 ] 

Apache Spark commented on SPARK-10514:
--

User 'SleepyThread' has created a pull request for this issue:
https://github.com/apache/spark/pull/8672

> Minimum ratio of registered resources [ 
> spark.scheduler.minRegisteredResourcesRatio] is not enabled for Mesos Coarse 
> Grained mode
> -
>
> Key: SPARK-10514
> URL: https://issues.apache.org/jira/browse/SPARK-10514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Akash Mishra
>
> The "spark.scheduler.minRegisteredResourcesRatio" configuration parameter has 
> no effect in Mesos coarse-grained mode. This is because the scheduler backend 
> does not override the "sufficientResourcesRegistered" function, which returns 
> true by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737730#comment-14737730
 ] 

Sean Owen commented on SPARK-10493:
---

checkpoint doesn't materialize the RDD, which is why it occurred to me to try a 
count. I'd try that to see whether it also works. If so, I do suspect it's due 
to the zipping and ordering of partitions -- especially if union() also seems 
to work.

++ just concatenates iterators, so I don't think that can matter. I also don't 
think the parent RDD types matter. It's not impossible there's a problem, but 
there are also a lot of tests exercising reduceByKey.
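
As a concrete version of that suggestion, here is a small spark-shell sketch 
(toy data, placeholder checkpoint directory): checkpoint() alone does nothing 
until an action such as count() runs, and forcing the count before deriving 
the next RDD avoids having the parent's partitions re-evaluated later.

{code}
// Spark shell sketch (sc is the shell's SparkContext); toy data, placeholder path.
sc.setCheckpointDir("/tmp/spark-checkpoints")

val temp4 = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
  .reduceByKey(_ + _)
  .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)

temp4.checkpoint()       // only marks the RDD; nothing is computed or written yet
println(temp4.count())   // action: computes, caches, and triggers the checkpoint write

// Only now derive the next RDD, so temp4's partitions are not re-evaluated
// (with possibly different ordering) while the downstream job runs.
val temp5 = temp4.mapValues(_ * 10)
println(temp5.count())
{code}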

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9998) Create local intersect operator

2015-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9998:
-
Assignee: Shixiong Zhu

> Create local intersect operator
> ---
>
> Key: SPARK-9998
> URL: https://issues.apache.org/jira/browse/SPARK-9998
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9997) Create local Expand operator

2015-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9997:
-
Assignee: Shixiong Zhu

> Create local Expand operator
> 
>
> Key: SPARK-9997
> URL: https://issues.apache.org/jira/browse/SPARK-9997
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth

2015-09-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737802#comment-14737802
 ] 

Joseph K. Bradley commented on SPARK-6724:
--

Now that the 1.5 release stuff is over, yes!  Thanks for your patience.

I will assume:
* FPGrowth should continue to support arbitrary types (in the spark.mllib API). 
 I.e., we should not change its public interface.
* Like other models, FPGrowth should use DataFrame serialization for model 
save/load.

Given these constraints, I think the best way to implement save/load is:
* Use DataFrames/Catalyst to test whether the item type is a type recognized by 
Catalyst (probably using {{ScalaReflection.schemaFor}}).
* If the item type is not OK, throw an error.
* If the item type is OK, save as a DataFrame.

We should definitely support all DataFrame types.  There is no need to limit 
items to primitive Catalyst types.

In the future, once UDTs are a public API, we could allow users to make their 
custom types implement the UDT interface so that we can convert them to 
Catalyst types.

[~MeethuMathew] Can you please update your PR accordingly?  I should have time 
to give feedback or collaborate on the coding. 
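
To make the outline concrete, here is a hedged sketch of the save path, with 
FPGrowthModel's internals simplified to a local Seq of (items, freq) pairs. 
ScalaReflection.schemaFor is the check mentioned above, but its exact signature 
and visibility vary across Spark versions, so treat this as a sketch rather 
than the final code.

{code}
import scala.reflect.runtime.universe.TypeTag
import scala.util.Try
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.ScalaReflection

object FPGrowthSaveSketch {
  // Sketch: validate the item type via Catalyst, then write the itemsets as a DataFrame.
  def save[Item: TypeTag](sqlContext: SQLContext,
                          freqItemsets: Seq[(Array[Item], Long)],
                          path: String): Unit = {
    // 1. If Catalyst cannot derive a schema for the item type, fail with a clear error.
    require(Try(ScalaReflection.schemaFor[Item]).isSuccess,
      "FPGrowthModel.save: item type is not supported by DataFrame serialization")

    // 2. Otherwise persist the model data as a DataFrame (Parquet here).
    import sqlContext.implicits._
    freqItemsets.toDF("items", "freq").write.parquet(path)
  }
}
{code}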

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

2015-09-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737801#comment-14737801
 ] 

Shivaram Venkataraman commented on SPARK-10523:
---

cc [~mengxr] [~ekhliang]

> SparkR formula syntax to turn strings/factors into numerics
> ---
>
> Key: SPARK-10523
> URL: https://issues.apache.org/jira/browse/SPARK-10523
> Project: Spark
>  Issue Type: Bug
>Reporter: Vincent Warmerdam
>
> In normal (non SparkR) R the formula syntax enables strings or factors to be 
> turned into dummy variables immediately when calling a classifier. This way, 
> the following R pattern is legal and often used:
> {code}
> library(magrittr) 
> df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
> glm(class ~ i, family = "binomial", data = df)
> {code}
> The glm method knows that `class` is a string/factor and handles it 
> appropriately by casting it to a 0/1 array before applying any machine 
> learning. SparkR doesn't do this. 
> {code}
> > ddf <- sqlContext %>% 
>   createDataFrame(df)
> > glm(class ~ i, family = "binomial", data = ddf)
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.IllegalArgumentException: Unsupported type for label: StringType
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>   at 
> scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.refl
> {code}
> This can be fixed by doing a bit of manual labor. SparkR does accept booleans 
> as if they are integers here. 
> {code}
> > ddf <- ddf %>% 
>   withColumn("to_pred", .$class == "a") 
> > glm(to_pred ~ i, family = "binomial", data = ddf)
> {code}
> But this can become quite tedious, especially when you want models for 
> multiclass classification. This is perhaps less relevant for logistic 
> regression (because it is more of a one-off classification approach), but it 
> certainly is relevant if you want to use a formula for a random forest and a 
> column denotes, say, the type of flower from the iris dataset. 
> Is there a good reason why this should not be a feature of formulas in Spark? 
> I am aware of issue 8774, which looks like it is addressing a similar theme 
> but a different issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

2015-09-09 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10523:
--
Component/s: SparkR

> SparkR formula syntax to turn strings/factors into numerics
> ---
>
> Key: SPARK-10523
> URL: https://issues.apache.org/jira/browse/SPARK-10523
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Vincent Warmerdam
>
> In normal (non SparkR) R the formula syntax enables strings or factors to be 
> turned into dummy variables immediately when calling a classifier. This way, 
> the following R pattern is legal and often used:
> {code}
> library(magrittr) 
> df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
> glm(class ~ i, family = "binomial", data = df)
> {code}
> The glm method knows that `class` is a string/factor and handles it 
> appropriately by casting it to a 0/1 array before applying any machine 
> learning. SparkR doesn't do this. 
> {code}
> > ddf <- sqlContext %>% 
>   createDataFrame(df)
> > glm(class ~ i, family = "binomial", data = ddf)
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.IllegalArgumentException: Unsupported type for label: StringType
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>   at 
> scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.refl
> {code}
> This can be fixed by doing a bit of manual labor. SparkR does accept booleans 
> as if they are integers here. 
> {code}
> > ddf <- ddf %>% 
>   withColumn("to_pred", .$class == "a") 
> > glm(to_pred ~ i, family = "binomial", data = ddf)
> {code}
> But this can become quite tedious, especially when you want models for 
> multiclass classification. This is perhaps less relevant for logistic 
> regression (because it is more of a one-off classification approach), but it 
> certainly is relevant if you want to use a formula for a random forest and a 
> column denotes, say, the type of flower from the iris dataset. 
> Is there a good reason why this should not be a feature of formulas in Spark? 
> I am aware of issue 8774, which looks like it is addressing a similar theme 
> but a different issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

2015-09-09 Thread Vincent Warmerdam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Warmerdam updated SPARK-10523:
--
Issue Type: Improvement  (was: Bug)

> SparkR formula syntax to turn strings/factors into numerics
> ---
>
> Key: SPARK-10523
> URL: https://issues.apache.org/jira/browse/SPARK-10523
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Vincent Warmerdam
>
> In normal (non SparkR) R the formula syntax enables strings or factors to be 
> turned into dummy variables immediately when calling a classifier. This way, 
> the following R pattern is legal and often used:
> {code}
> library(magrittr) 
> df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
> glm(class ~ i, family = "binomial", data = df)
> {code}
> The glm method knows that `class` is a string/factor and handles it 
> appropriately by casting it to a 0/1 array before applying any machine 
> learning. SparkR doesn't do this. 
> {code}
> > ddf <- sqlContext %>% 
>   createDataFrame(df)
> > glm(class ~ i, family = "binomial", data = ddf)
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.IllegalArgumentException: Unsupported type for label: StringType
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>   at 
> scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.refl
> {code}
> This can be fixed by doing a bit of manual labor. SparkR does accept booleans 
> as if they are integers here. 
> {code}
> > ddf <- ddf %>% 
>   withColumn("to_pred", .$class == "a") 
> > glm(to_pred ~ i, family = "binomial", data = ddf)
> {code}
> But this can become quite tedious, especially when you want models for 
> multiclass classification. This is perhaps less relevant for logistic 
> regression (because it is more of a one-off classification approach), but it 
> certainly is relevant if you want to use a formula for a random forest and a 
> column denotes, say, the type of flower from the iris dataset. 
> Is there a good reason why this should not be a feature of formulas in Spark? 
> I am aware of issue 8774, which looks like it is addressing a similar theme 
> but a different issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

2015-09-09 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10523:
--
Component/s: ML

> SparkR formula syntax to turn strings/factors into numerics
> ---
>
> Key: SPARK-10523
> URL: https://issues.apache.org/jira/browse/SPARK-10523
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Vincent Warmerdam
>
> In normal (non SparkR) R the formula syntax enables strings or factors to be 
> turned into dummy variables immediately when calling a classifier. This way, 
> the following R pattern is legal and often used:
> {code}
> library(magrittr) 
> df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
> glm(class ~ i, family = "binomial", data = df)
> {code}
> The glm method knows that `class` is a string/factor and handles it 
> appropriately by casting it to a 0/1 array before applying any machine 
> learning. SparkR doesn't do this. 
> {code}
> > ddf <- sqlContext %>% 
>   createDataFrame(df)
> > glm(class ~ i, family = "binomial", data = ddf)
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.IllegalArgumentException: Unsupported type for label: StringType
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>   at 
> scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.refl
> {code}
> This can be fixed by doing a bit of manual labor. SparkR does accept booleans 
> as if they are integers here. 
> {code}
> > ddf <- ddf %>% 
>   withColumn("to_pred", .$class == "a") 
> > glm(to_pred ~ i, family = "binomial", data = ddf)
> {code}
> But this can become quite tedious, especially when you want models for 
> multiclass classification. This is perhaps less relevant for logistic 
> regression (because it is more of a one-off classification approach), but it 
> certainly is relevant if you want to use a formula for a random forest and a 
> column denotes, say, the type of flower from the iris dataset. 
> Is there a good reason why this should not be a feature of formulas in Spark? 
> I am aware of issue 8774, which looks like it is addressing a similar theme 
> but a different issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737727#comment-14737727
 ] 

Glenn Strycker commented on SPARK-10493:


I already have that added in my code that I'm testing... I've been persisting, 
checkpointing, and materializing all RDDs, including all intermediate steps.

I did try substituting union() for zipPartitions(), and that actually resulted 
in correct values!  Very weird.  What's strange is that there are no 
differences in my results in spark-shell or in a very small piece of test code 
I wrote for spark-submit (that is, I can't replicate the original error), but 
this change did fix things in my production code.

I'm trying to discover why zipPartitions isn't behaving identically to union in 
my code... I posted a stackoverflow question along these lines, if you want to 
read over some additional code and toDebugString results:  
http://stackoverflow.com/questions/32489112/what-is-the-difference-between-union-and-zippartitions-for-apache-spark-rdds

I attempted adding an "implicit ordering" to the original code with 
zipPartitions, but that didn't fix anything -- it only worked when using union.

Is it possible that ShuffledRDDs (returned by union) work with reduceByKey, but 
ZippedPartitionsRDD2s (returned by zipPartitions) do not?

Or is it possible that the "++" operator I am using inside the zipPartitions 
function isn't compatible with my particular RDD structure ((String, String), 
(String, Long, Long, Long, Long))?

Thanks so much for your help... at this point I'm tempted to replace 
zipPartitions with unions everywhere in my code, just for superstition's sake.  
I just want to understand WHY zipPartitions didn't work!!
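
On the ++ question specifically: Iterator ++ is type-agnostic, so the nested 
tuple shape by itself should not matter. A tiny plain-Scala sketch with the 
same record shape ((String, String) keys and (String, Long, Long, Long, Long) 
values) and made-up values:

{code}
// Plain Scala, no Spark needed: concatenating two iterators of the same record shape.
val a = Iterator((("k1", "k2"), ("x", 1L, 2L, 3L, 4L)))
val b = Iterator((("k1", "k2"), ("y", 5L, 6L, 7L, 8L)))

val both = (a ++ b).toList
println(both.size)   // 2 -- both records survive the concatenation unchanged
println(both)
{code}

In principle, the upstream RDD class (ShuffledRDD vs. ZippedPartitionsRDD2) 
should not change what reduceByKey computes either; any difference would have 
to come from how and when the upstream partitions are evaluated.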

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10522) Nanoseconds part of Timestamp should be positive in parquet

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737738#comment-14737738
 ] 

Apache Spark commented on SPARK-10522:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8674

> Nanoseconds part of Timestamp should be positive in parquet
> ---
>
> Key: SPARK-10522
> URL: https://issues.apache.org/jira/browse/SPARK-10522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> If the Timestamp is before the Unix epoch, the nanosecond part will be 
> negative, and Hive can't read that back correctly.
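
A small sketch of the normalization in question, assuming the timestamp is 
carried as microseconds since the Unix epoch (as Spark SQL does internally); 
floor division keeps the sub-second part non-negative even for pre-1970 values.

{code}
// Sketch only: split a (possibly negative) microseconds-since-epoch value into
// a seconds part and a nanoseconds part in the range [0, 1e9).
def toSecondsAndNanos(micros: Long): (Long, Long) = {
  val seconds = Math.floorDiv(micros, 1000000L)        // rounds toward negative infinity
  val nanos   = Math.floorMod(micros, 1000000L) * 1000L
  (seconds, nanos)
}

println(toSecondsAndNanos(-1L))        // (-1, 999999000) instead of (0, -1000)
println(toSecondsAndNanos(1500000L))   // (1, 500000000)
{code}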



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10522) Nanoseconds part of Timestamp should be positive in parquet

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10522:


Assignee: Apache Spark

> Nanoseconds part of Timestamp should be positive in parquet
> ---
>
> Key: SPARK-10522
> URL: https://issues.apache.org/jira/browse/SPARK-10522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> If the Timestamp is before the Unix epoch, the nanosecond part will be 
> negative, and Hive can't read that back correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10522) Nanoseconds part of Timestamp should be positive in parquet

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10522:


Assignee: (was: Apache Spark)

> Nanoseconds part of Timestamp should be positive in parquet
> ---
>
> Key: SPARK-10522
> URL: https://issues.apache.org/jira/browse/SPARK-10522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> If the Timestamp is before the Unix epoch, the nanosecond part will be 
> negative, and Hive can't read that back correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10520) dates cannot be summarised in SparkR

2015-09-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737743#comment-14737743
 ] 

Reynold Xin commented on SPARK-10520:
-

Is the idea here to support aggregation functions on date and timestamp?

> dates cannot be summarised in SparkR
> 
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> {code}
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> {code}
> Notice that the date can be summarised here. In SparkR; this will give an 
> error.
> {code}
> > ddf <- createDataFrame(sqlContext, df) 
> > ddf %>% summary
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
> data type mismatch: function average requires numeric types, not DateType;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at org.apache.spark.sql.
> {code}
> This is a rather annoying bug since the SparkR documentation currently 
> suggests that dates are now supported in SparkR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10520) dates cannot be summarised in SparkR

2015-09-09 Thread Vincent Warmerdam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737757#comment-14737757
 ] 

Vincent Warmerdam commented on SPARK-10520:
---

It just occurred to me that there is a very similar error with machine learning. 
In R you can pass a date/timestamp into a model and it will treat it as if it 
were a numeric. 

```
> df <- data.frame(d = as.Date('2014-01-01') + 1:100, r = runif(100) + 0.5 * 
> 1:100)
> lm(r ~ d, data = df)

Call:
lm(formula = r ~ d, data = df)

Coefficients:
(Intercept)d  
 -7994.9971   0.4975  
```

I'm not sure if Spark wants to have similar support but it may be something to 
keep in mind; the problem seems similar. 

> dates cannot be summarised in SparkR
> 
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> {code}
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> {code}
> Notice that the date can be summarised here. In SparkR; this will give an 
> error.
> {code}
> > ddf <- createDataFrame(sqlContext, df) 
> > ddf %>% summary
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
> data type mismatch: function average requires numeric types, not DateType;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at org.apache.spark.sql.
> {code}
> This is a rather annoying bug since the SparkR documentation currently 
> suggests that dates are now supported in SparkR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

2015-09-09 Thread Vincent Warmerdam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Warmerdam updated SPARK-10523:
--
Description: 
In normal (non SparkR) R the formula syntax enables strings or factors to be 
turned into dummy variables immediately when calling a classifier. This way, 
the following R code is legal and often used. 

{code}
library(magrittr) 
df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
glm(class ~ i, family = "binomial", data = df)
{code}

SparkR doesn't allow this. 

{code}
> ddf <- sqlContext %>% 
  createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
at 
org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
at 
org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at 
scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
at 
org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
at 
org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.refl
{code}

This can be fixed by doing a bit of manual labor. SparkR does accept booleans 
as if they are integers here. 

{code}
> ddf <- ddf %>% 
  withColumn("to_pred", .$class == "a") 
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}

But this can become quite tedious, especially when you want models for 
multiclass classification. This is perhaps less relevant for logistic 
regression (because it is more of a one-off classification, unless you want to 
run one per class), but it certainly is relevant if you want to use a formula 
for a random forest. 

Is there a good reason why this should not be a feature of formulas in Spark? I 
am aware of issue 8774, which looks like it is addressing a similar theme but a 
different issue. 


  was:

In normal (non SparkR) R the formula syntax enables strings or factors to be 
turned into dummy variables immediately when calling a classifier. This way, 
the following R code is legal and often used. 

{code}
library(magrittr) 
df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
glm(class ~ i, family = "binomial", data = df)
{code}

SparkR doesn't allow this. 

{code}
> ddf <- sqlContext %>% 
  createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
at 
org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
at 
org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at 
scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
at 
org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
at 
org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.refl
{code}

This can be fixed by doing a bit of manual labor. SparkR does accept booleans 
as if they are integers here. 

{code}
> ddf <- ddf %>% 
  withColumn("to_pred", .$class == "a") 
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}

But this can become quite tedious, especially when you want models for 
multiclass classification. This is perhaps less relevant for logistic 
regression (because it is more of a one-off regression, unless you want to run 
one per class), but it certainly is relevant if you want to use a formula for a 
random forest. 

Is there a good reason why this should not be a feature of formulas in 

[jira] [Commented] (SPARK-9715) Store numFeatures in all ML PredictionModel types

2015-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737797#comment-14737797
 ] 

Apache Spark commented on SPARK-9715:
-

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/8675

> Store numFeatures in all ML PredictionModel types
> -
>
> Key: SPARK-9715
> URL: https://issues.apache.org/jira/browse/SPARK-9715
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> The PredictionModel abstraction should store numFeatures.  Currently, only 
> RandomForest* types do this.
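
A rough illustration of the abstraction change (class names simplified, not the 
actual spark.ml hierarchy): storing numFeatures on the shared base type might 
look like this.

{code}
// Illustrative sketch only.
abstract class PredictionModelSketch {
  /** Number of features the model was trained on; -1 if unknown. */
  def numFeatures: Int = -1
}

class LinearModelSketch(val weights: Array[Double]) extends PredictionModelSketch {
  // A concrete model can derive the value from its own parameters.
  override def numFeatures: Int = weights.length
}

println(new LinearModelSketch(Array(0.1, 0.2, 0.3)).numFeatures)  // 3
{code}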



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9715) Store numFeatures in all ML PredictionModel types

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9715:
---

Assignee: (was: Apache Spark)

> Store numFeatures in all ML PredictionModel types
> -
>
> Key: SPARK-9715
> URL: https://issues.apache.org/jira/browse/SPARK-9715
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> The PredictionModel abstraction should store numFeatures.  Currently, only 
> RandomForest* types do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9715) Store numFeatures in all ML PredictionModel types

2015-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9715:
---

Assignee: Apache Spark

> Store numFeatures in all ML PredictionModel types
> -
>
> Key: SPARK-9715
> URL: https://issues.apache.org/jira/browse/SPARK-9715
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> The PredictionModel abstraction should store numFeatures.  Currently, only 
> RandomForest* types do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737681#comment-14737681
 ] 

Sean Owen commented on SPARK-10493:
---

If the RDD is a result of reduceByKey, I agree that the keys should be unique. 
Tuples implement equals and hashCode correctly, as does String, so that ought 
to be fine.

I still sort of suspect something is being computed twice and not quite 
deterministically, but the persist() call on rdd4 immediately before ought to 
hide that. However, it's still distantly possible this is the cause, since rdd4 
is not computed and persisted before the computation of rdd5 starts, and its 
partitions might be re-evaluated during that process.

It's a bit of a long shot, but what about adding a temp4.count() for good 
measure before starting on temp5?

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10520) dates cannot be summarised in SparkR

2015-09-09 Thread Vincent Warmerdam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737757#comment-14737757
 ] 

Vincent Warmerdam edited comment on SPARK-10520 at 9/9/15 10:58 PM:


It just occurred to me that there is a very similar error with machine learning. 
In R you can pass a date/timestamp into a model and it will treat it as if it 
were a numeric. 

{code}
> df <- data.frame(d = as.Date('2014-01-01') + 1:100, r = runif(100) + 0.5 * 
> 1:100)
> lm(r ~ d, data = df)

Call:
lm(formula = r ~ d, data = df)

Coefficients:
(Intercept)d  
 -7994.9971   0.4975  
{code}

I'm not sure if Spark wants to have similar support but it may be something to 
keep in mind; the problem seems similar. 


was (Author: cantdutchthis):
It just occurred to me that there is a very similar error with machine learning. 
In R you can pass a date/timestamp into a model and it will treat it as if it 
were a numeric. 

```
> df <- data.frame(d = as.Date('2014-01-01') + 1:100, r = runif(100) + 0.5 * 
> 1:100)
> lm(r ~ d, data = df)

Call:
lm(formula = r ~ d, data = df)

Coefficients:
(Intercept)d  
 -7994.9971   0.4975  
```

I'm not sure if Spark wants to have similar support but it may be something to 
keep in mind; the problem seems similar. 

> dates cannot be summarised in SparkR
> 
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> {code}
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> {code}
> Notice that the date can be summarised here. In SparkR; this will give an 
> error.
> {code}
> > ddf <- createDataFrame(sqlContext, df) 
> > ddf %>% summary
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
> data type mismatch: function average requires numeric types, not DateType;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at org.apache.spark.sql.
> {code}
> This is a rather annoying bug since the SparkR documentation currently 
> suggests that dates are now supported in SparkR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

2015-09-09 Thread Vincent Warmerdam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Warmerdam updated SPARK-10523:
--
Description: 

In normal (non SparkR) R the formula syntax enables strings or factors to be 
turned into dummy variables immediately when calling a classifier. This way, 
the following R code is legal and often used. 

{code}
library(magrittr) 
df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
glm(class ~ i, family = "binomial", data = df)
{code}

SparkR doesn't allow this. 

{code}
> ddf <- sqlContext %>% 
  createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
at 
org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
at 
org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at 
scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
at 
org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
at 
org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.refl
{code}

This can be fixed by doing a bit of manual labor. SparkR does accept booleans 
as if they are integers here. 

{code}
> ddf <- ddf %>% 
  withColumn("to_pred", .$class == "a") 
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}

But this can become quite tedious, especially when you want models for 
multiclass classification. This is perhaps less relevant for logistic 
regression (because it is more of a one-off regression, unless you want to run 
one per class), but it certainly is relevant if you want to use a formula for a 
random forest. 

Is there a good reason why this should not be a feature of formulas in Spark? I 
am aware of issue 8774, which looks like it is addressing a similar theme but a 
different issue. 


  was:
In normal (non SparkR) R the formula syntax enables strings or factors to be 
turned into dummy variables immediately when calling a classifier. This way, 
the following R code is legal and often used. 

{code}
library(magrittr) 
df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
glm(class ~ i, family = "binomial", data = df)
{code}

SparkR doesn't allow this. 

{code}
> ddf <- sqlContext %>% 
  createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
at 
org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
at 
org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at 
scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
at 
org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
at 
org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.refl
{code}

This can be fixed by doing a bit of manual labor. SparkR does accept booleans 
as if they are integers here. 

{code}
> ddf <- ddf %>% 
  withColumn("to_pred", .$class == "a") 
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}

But this can become quite tedious, especially when you want models for 
multiclass classification. This is perhaps less relevant for logistic 
regression (because it is more of a one-off regression, unless you want to run 
one per class), but it certainly is relevant if you want to use a formula for a 
random forest. 

Is there a good reason why this should not be a feature of formulas in 

[jira] [Updated] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

2015-09-09 Thread Vincent Warmerdam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Warmerdam updated SPARK-10523:
--
Description: 
In normal (non SparkR) R the formula syntax enables strings or factors to be 
turned into dummy variables immediately when calling a classifier. This way, 
the following R code is legal and often used. 

{code}
library(magrittr) 
df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
glm(class ~ i, family = "binomial", data = df)
{code}

SparkR doesn't allow this. 

{code}
> ddf <- sqlContext %>% 
  createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
at 
org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
at 
org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at 
scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
at 
org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
at 
org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.refl
{code}

This can be fixed by doing a bit of manual labor. SparkR does accept booleans 
as if they are integers here. 

{code}
> ddf <- ddf %>% 
  withColumn("to_pred", .$class == "a") 
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}

But this can become quite tedious, especially when you want models for 
multiclass classification. This is perhaps less relevant for logistic 
regression (because it is more of a one-off regression, unless you want to run 
one per class), but it certainly is relevant if you want to use a formula for a 
random forest. 

Is there a good reason why this should not be a feature of formulas in Spark? I 
am aware of issue 8774, which looks like it is addressing a similar theme but a 
different issue. 


  was:
In normal (non SparkR) R the formula syntax enables strings or factors to be 
turned into dummy variables immediately when calling a classifier. This way, 
the following R code is legal and often used. 

{code}
library(magrittr) 
df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
glm(class ~ i, family = "binomial", data = df)
{code}

SparkR doesn't allow this. 

{code}
> ddf <- sqlContext %>% 
  createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
at 
org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
at 
org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at 
scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
at 
org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
at 
org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.refl
{code}

This can be worked around with a bit of manual labor, since SparkR does accept 
booleans as if they were integers here. 

{code}
> ddf <- ddf %>% 
  withColumn("to_pred", .$class == "a") 
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}

But this can become quite tedious, especially when you want models that 
classify over multiple classes. 

Is there a good reason why this should not be a feature of formulas in Spark? I 
am aware of issue 8774, which looks like it is addressing a similar theme but a 
different issue. 



> SparkR formula syntax to turn strings/factors into numerics
> ---
>

[jira] [Created] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

2015-09-09 Thread Vincent Warmerdam (JIRA)
Vincent Warmerdam created SPARK-10523:
-

 Summary: SparkR formula syntax to turn strings/factors into 
numerics
 Key: SPARK-10523
 URL: https://issues.apache.org/jira/browse/SPARK-10523
 Project: Spark
  Issue Type: Bug
Reporter: Vincent Warmerdam


In normal (non-SparkR) R, the formula syntax turns strings or factors into 
dummy variables automatically when calling a classifier. As a result, the 
following R code is legal and commonly used. 

{code}
library(magrittr) 
df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
glm(class ~ i, family = "binomial", data = df)
{code}

SparkR doesn't allow this. 

{code}
> ddf <- sqlContext %>% 
  createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
at 
org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
at 
org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at 
scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
at 
org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
at 
org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.refl
{code}

This can be worked around with a bit of manual labor, since SparkR does accept 
booleans as if they were integers here. 

{code}
> ddf <- ddf %>% 
  withColumn("to_pred", .$class == "a") 
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}

But this can become quite tedious, especially when you want models that 
classify over multiple classes. 

Is there a good reason why this should not be a feature of formulas in Spark? I 
am aware of issue 8774, which looks like it is addressing a similar theme but a 
different issue. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10487) MLlib model fitting causes DataFrame write to break with OutOfMemory exception

2015-09-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737790#comment-14737790
 ] 

Joseph K. Bradley commented on SPARK-10487:
---

Does this failure require there to be an ML model at all? Or can you reproduce 
it using only DataFrames?

Also, can you reproduce it using nothing from ML (not using LabeledPoint or 
Vector)?
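
For reference, a DataFrame-only check along the lines of this question could 
look like the sketch below. This is only an illustration (not taken from the 
report): it builds a plain DataFrame with no ML classes and writes it to Parquet 
in the same cluster-mode setup.

{code}
# Hypothetical minimal check: no pyspark.ml / pyspark.mllib imports at all, just a
# plain DataFrame written to Parquet, to see whether the driver OutOfMemory still occurs.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("DataFrameOnlyWriteTest")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

plain_df = sqlContext.createDataFrame(
    [(1.0, 0.0, 1.1, 0.1), (1.0, 0.0, 1.2, -0.5)],
    ["label", "x1", "x2", "x3"])
plain_df.write.mode("overwrite").parquet("/tmp/plain_df.parquet")
{code}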

> MLlib model fitting causes DataFrame write to break with OutOfMemory exception
> --
>
> Key: SPARK-10487
> URL: https://issues.apache.org/jira/browse/SPARK-10487
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tried in a centos-based 1-node YARN in docker and on a 
> real-world CDH5 cluster
> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest 
> nightly build)
> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn 
> -DzincPort=3034
> I'm using the default resource setup
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor 
> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: )
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: )
>Reporter: Zoltan Toth
>
> After fitting a _spark.ml_ or _mllib_ model in *cluster* deploy mode, no 
> DataFrames can be written to HDFS. The driver receives an OutOfMemory 
> exception during the write. It seems, however, that the file gets written 
> successfully.
>  * This happens both in SparkR and pyspark
>  * Only happens in cluster deploy mode
> The write fails regardless of the size of the DataFrame and of whether the 
> DataFrame is associated with the ML model.
> REPRO:
> {code}
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext
> from pyspark.ml.classification import LogisticRegression
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vector, Vectors
> conf = SparkConf().setAppName("LogRegTest")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
> sqlContext.setConf("park.sql.parquet.compression.codec", "uncompressed")
> training = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))
> df = training.toDF()
> reg = LogisticRegression().setMaxIter(10).setRegParam(0.01)
> model = reg.fit(df)
> # Note that this is a brand new dataframe:
> one_df = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF()
> one_df.write.mode("overwrite").parquet("/tmp/df.parquet")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10520) dates cannot be summarised in SparkR

2015-09-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737805#comment-14737805
 ] 

Shivaram Venkataraman commented on SPARK-10520:
---

[~rxin] Yeah, the idea here is to support operators like mean, median, etc. on 
date and timestamp columns.
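
Until that is supported, one possible workaround is to cast the date column to 
a numeric representation before aggregating. The sketch below is illustrative 
only and uses pyspark rather than SparkR; it assumes a DataFrame ddf with a 
"date" column like the one in the issue.

{code}
# Illustrative workaround (assumption: ddf has a DateType column named "date").
# Cast date -> timestamp -> double (epoch seconds), take the mean, then turn the
# epoch-second mean back into a date.
from pyspark.sql import functions as F

mean_sec = ddf.select(
    F.mean(F.col("date").cast("timestamp").cast("double")).alias("mean_sec")
).first()[0]

mean_date = ddf.select(
    F.from_unixtime(F.lit(mean_sec)).cast("date").alias("mean_date")
).first()[0]
{code}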

> dates cannot be summarised in SparkR
> 
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> {code}
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> {code}
> Notice that the date can be summarised here. In SparkR, this will give an 
> error.
> {code}
> > ddf <- createDataFrame(sqlContext, df) 
> > ddf %>% summary
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
> data type mismatch: function average requires numeric types, not DateType;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at org.apache.spark.sql.
> {code}
> This is a rather annoying bug since the SparkR documentation currently 
> suggests that dates are now supported in SparkR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9996) Create local nested loop join operator

2015-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9996:
-
Assignee: Shixiong Zhu

> Create local nested loop join operator
> --
>
> Key: SPARK-9996
> URL: https://issues.apache.org/jira/browse/SPARK-9996
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9992) Create local sample operator

2015-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9992:
-
Assignee: Shixiong Zhu

> Create local sample operator
> 
>
> Key: SPARK-9992
> URL: https://issues.apache.org/jira/browse/SPARK-9992
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9994) Create local TopK operator

2015-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9994:
-
Assignee: Shixiong Zhu

> Create local TopK operator
> --
>
> Key: SPARK-9994
> URL: https://issues.apache.org/jira/browse/SPARK-9994
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>
> Similar to the existing TakeOrderedAndProject, except in a single thread.
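
As a rough, non-authoritative illustration of the single-threaded idea (not 
Spark's implementation), a local TopK-and-project step over one iterator could 
look like this:

{code}
# Rough single-threaded sketch of a TopK-and-project step (hypothetical helper, not
# Spark's operator): keep only the k smallest rows by a key, then project each row.
import heapq

def local_top_k_project(rows, k, key, project):
    smallest = heapq.nsmallest(k, rows, key=key)  # bounded selection, single thread
    return [project(r) for r in smallest]

rows = [(3, "c"), (1, "a"), (2, "b"), (5, "e")]
print(local_top_k_project(rows, k=2, key=lambda r: r[0], project=lambda r: r[1]))
# prints: ['a', 'b']
{code}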



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9990) Create local hash join operator

2015-09-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9990:
-
Assignee: Shixiong Zhu

> Create local hash join operator
> ---
>
> Key: SPARK-9990
> URL: https://issues.apache.org/jira/browse/SPARK-9990
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9730) Sort Merge Join for Full Outer Join

2015-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9730.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8579
[https://github.com/apache/spark/pull/8579]

> Sort Merge Join for Full Outer Join
> ---
>
> Key: SPARK-9730
> URL: https://issues.apache.org/jira/browse/SPARK-9730
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Josh Rosen
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-09 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737735#comment-14737735
 ] 

Glenn Strycker commented on SPARK-10493:


Of course. I have count statements everywhere in order to materialize the RDDs.

I usually additionally run RDD.sortByKey(true).collect().foreach(println) if 
I'm running on a small test set, or RDD.take(100).collect().foreach(println) if 
I have a larger set, just so I can see a few values.

So I'm positive that all of my intermediate/temporary RDDs are in fact 
materialized before getting to the zipPartitions/union step and the reduceByKey 
step.  I also monitor my jobs in Yarn and I can see the persisted RDDs as they 
are being cached.
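
As a point of reference, here is a minimal sketch of the property under 
discussion (illustrative only, in pyspark rather than the reporter's Scala): 
after reduceByKey, each key should appear in exactly one output record, so an 
extra .distinct() should not change the count.

{code}
# Minimal illustration of the expected invariant: reduceByKey yields one record per
# key, so distinct() over its output should leave the count unchanged.
from pyspark import SparkContext

sc = SparkContext(appName="ReduceByKeyCountCheck")
pairs = sc.parallelize([("K", 1), ("K", 2), ("K", 3), ("J", 5)])
reduced = pairs.reduceByKey(lambda a, b: a + b).cache()

print(reduced.count())             # expected: 2, one record per key
print(reduced.distinct().count())  # expected: also 2 when the reduce function is deterministic
{code}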

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
> Attachments: reduceByKey_example_001.scala
>
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


