[jira] [Created] (SPARK-3613) Don't record the size of each shuffle block for large jobs

2014-09-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3613:
--

 Summary: Don't record the size of each shuffle block for large jobs
 Key: SPARK-3613
 URL: https://issues.apache.org/jira/browse/SPARK-3613
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin


MapStatus records the size of each shuffle block (1 byte per block) produced by 
a particular map task. This means the total shuffle metadata is O(M*R), where 
M = num maps and R = num reduces.

If M is greater than a certain size, we should probably just send an average 
block size instead of the whole array.
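
A minimal sketch of that idea in Scala; the names and the 2000-block threshold 
below are illustrative assumptions, not Spark's actual MapStatus API:

{code}
// Illustrative sketch (hypothetical names): keep exact per-block sizes for
// small shuffles, fall back to a single average size once the number of
// blocks crosses a threshold.
sealed trait MapOutputSizes {
  def sizeFor(reduceId: Int): Long
}

// O(R) storage per map task.
case class ExactSizes(sizes: Array[Long]) extends MapOutputSizes {
  override def sizeFor(reduceId: Int): Long = sizes(reduceId)
}

// O(1) storage per map task.
case class AveragedSizes(avgSize: Long) extends MapOutputSizes {
  override def sizeFor(reduceId: Int): Long = avgSize
}

object MapOutputSizes {
  val Threshold = 2000 // illustrative cut-over point, not Spark's actual value

  def apply(sizes: Array[Long]): MapOutputSizes =
    if (sizes.length <= Threshold) {
      ExactSizes(sizes)
    } else {
      AveragedSizes(if (sizes.isEmpty) 0L else sizes.sum / sizes.length)
    }
}
{code}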









[jira] [Commented] (SPARK-3612) Executor shouldn't quit if heartbeat message fails to reach the driver

2014-09-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141807#comment-14141807
 ] 

Reynold Xin commented on SPARK-3612:


[~andrewor14] [~sandyryza] any comment on this? I think you guys worked on this 
code.

> Executor shouldn't quit if heartbeat message fails to reach the driver
> --
>
> Key: SPARK-3612
> URL: https://issues.apache.org/jira/browse/SPARK-3612
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>
> The thread started by Executor.startDriverHeartbeater can actually terminate 
> the whole executor if AkkaUtils.askWithReply[HeartbeatResponse] throws an 
> exception. 
> I don't think we should quit the executor this way. At the very least, we 
> would want to log a more meaningful exception than simply
> {code}
> 14/09/20 06:38:12 WARN AkkaUtils: Error sending message in 1 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
> 14/09/20 06:38:45 WARN AkkaUtils: Error sending message in 2 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
> 14/09/20 06:39:18 WARN AkkaUtils: Error sending message in 3 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
> 14/09/20 06:39:21 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception 
> in thread Thread[Driver Heartbeater,5,main]
> org.apache.spark.SparkException: Error sending message [message = 
> Heartbeat(281,[Lscala.Tuple2;@4d9294db,BlockManagerId(281, 
> ip-172-31-7-55.eu-west-1.compute.internal, 52303))]
> at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
> seconds]
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
> ... 1 more
> {code}






[jira] [Created] (SPARK-3612) Executor shouldn't quit if heartbeat message fails to reach the driver

2014-09-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3612:
--

 Summary: Executor shouldn't quit if heartbeat message fails to 
reach the driver
 Key: SPARK-3612
 URL: https://issues.apache.org/jira/browse/SPARK-3612
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin


The thread started by Executor.startDriverHeartbeater can actually terminate 
the whole executor if AkkaUtils.askWithReply[HeartbeatResponse] throws an 
exception. 

I don't think we should quit the executor this way. At the very least, we would 
want to log a more meaningful exception than simply
{code}
14/09/20 06:38:12 WARN AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
14/09/20 06:38:45 WARN AkkaUtils: Error sending message in 2 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
14/09/20 06:39:18 WARN AkkaUtils: Error sending message in 3 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
14/09/20 06:39:21 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in 
thread Thread[Driver Heartbeater,5,main]
org.apache.spark.SparkException: Error sending message [message = 
Heartbeat(281,[Lscala.Tuple2;@4d9294db,BlockManagerId(281, 
ip-172-31-7-55.eu-west-1.compute.internal, 52303))]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
... 1 more

{code}
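
As a rough illustration of the direction suggested above (hypothetical names, 
not the actual Executor code), the heartbeat loop could catch and log a failed 
send and simply try again on the next tick:

{code}
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical sketch, not Spark's Executor.startDriverHeartbeater: a failure
// of a single heartbeat is logged and the next tick retries, so a driver-side
// timeout cannot escape as an uncaught exception and take the executor down.
class HeartbeatSender(sendHeartbeat: () => Unit, intervalMs: Long) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = {
    val task = new Runnable {
      override def run(): Unit =
        try {
          sendHeartbeat() // e.g. an ask to the driver that may time out
        } catch {
          case e: Exception =>
            System.err.println(s"Heartbeat to driver failed, will retry: $e")
        }
    }
    scheduler.scheduleAtFixedRate(task, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
  }

  def stop(): Unit = scheduler.shutdown()
}
{code}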






[jira] [Created] (SPARK-3611) Show number of cores for each executor in application web UI

2014-09-19 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3611:


 Summary: Show number of cores for each executor in application web 
UI
 Key: SPARK-3611
 URL: https://issues.apache.org/jira/browse/SPARK-3611
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Matei Zaharia
Priority: Minor


This number is not always fully known, because executors can scale up and down 
in the number of CPUs (e.g. on Mesos), but in that case it would be nice to 
show at least the number of cores the machine has, or the number of cores the 
executor has been configured with, if known.






[jira] [Commented] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.

2014-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141573#comment-14141573
 ] 

Apache Spark commented on SPARK-3606:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2469

> Spark-on-Yarn AmIpFilter does not work with Yarn HA.
> 
>
> Key: SPARK-3606
> URL: https://issues.apache.org/jira/browse/SPARK-3606
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>
> The current IP filter only considers one of the RMs in an HA setup. If the 
> active RM is not the configured one, you get a "connection refused" error 
> when clicking on the Spark AM links in the RM UI.
> Similar to YARN-1811, but for Spark.






[jira] [Created] (SPARK-3610) Unable to load app logs for MLLib programs in history server

2014-09-19 Thread SK (JIRA)
SK created SPARK-3610:
-

 Summary: Unable to load app logs for MLLib programs in history 
server 
 Key: SPARK-3610
 URL: https://issues.apache.org/jira/browse/SPARK-3610
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: SK
 Fix For: 1.1.0


The default log files for the MLlib examples use a rather long naming 
convention that includes special characters like parentheses and commas. For 
example, one of my log files is named 
"binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032".

When I click on the program on the history server page (at port 18080) to view 
the detailed application logs, the history server crashes and I need to restart 
it. I am using Spark 1.1 on a Mesos cluster.

I renamed the log file by removing the special characters and then it loaded 
up correctly. I am not sure which program is creating the log files. Can it be 
changed so that the default log file naming convention does not include 
special characters?
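
As a possible mitigation on the application side (just a sketch, not the 
history server's or the examples' actual code), the app name could be 
sanitized before it becomes part of the event-log file name:

{code}
// Sketch only: replace characters such as parentheses and commas in an
// application name before it is used as part of an event-log file name.
def sanitizeLogName(appName: String): String =
  appName.toLowerCase.replaceAll("[^a-z0-9\\-_.]", "-")

// sanitizeLogName("binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)")
// => "binaryclassifier-with-params-input.txt-100-1.0-svm-l2-0.1-"
{code}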






[jira] [Commented] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal

2014-09-19 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141510#comment-14141510
 ] 

Cheng Lian commented on SPARK-2271:
---

[~pwendell] I can't find a Maven artifact for this. From the Hive JIRA Reynold 
pointed out, the {{Decimal128}} comes from Microsoft PolyBase, which I think is 
not open source.

> Use Hive's high performance Decimal128 to replace BigDecimal
> 
>
> Key: SPARK-2271
> URL: https://issues.apache.org/jira/browse/SPARK-2271
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
>
> Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017






[jira] [Commented] (SPARK-3609) Add sizeInBytes statistics to Limit operator

2014-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141493#comment-14141493
 ] 

Apache Spark commented on SPARK-3609:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2468

> Add sizeInBytes statistics to Limit operator
> 
>
> Key: SPARK-3609
> URL: https://issues.apache.org/jira/browse/SPARK-3609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The {{sizeInBytes}} statistics of a {{LIMIT}} operator can be estimated 
> fairly precisely when all output attributes are of native data types, since 
> all native data types except {{StringType}} have a fixed size. For 
> {{StringType}}, we can use a relatively large (say 4K) default size.






[jira] [Created] (SPARK-3609) Add sizeInBytes statistics to Limit operator

2014-09-19 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-3609:
-

 Summary: Add sizeInBytes statistics to Limit operator
 Key: SPARK-3609
 URL: https://issues.apache.org/jira/browse/SPARK-3609
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Cheng Lian


The {{sizeInBytes}} statistics of a {{LIMIT}} operator can be estimated fairly 
precisely when all output attributes are of native data types, since all native 
data types except {{StringType}} have a fixed size. For {{StringType}}, we can 
use a relatively large (say 4K) default size.
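
A back-of-the-envelope version of this estimate, as a standalone Scala sketch 
with simplified type names (not Catalyst's actual statistics code):

{code}
// Simplified, self-contained sketch of the estimate described above.
sealed trait NativeType
case object IntType extends NativeType
case object LongType extends NativeType
case object DoubleType extends NativeType
case object BooleanType extends NativeType
case object StringType extends NativeType

def defaultSize(t: NativeType): Long = t match {
  case IntType     => 4L
  case LongType    => 8L
  case DoubleType  => 8L
  case BooleanType => 1L
  case StringType  => 4096L // no fixed size, so use a relatively large 4K default
}

// Estimated sizeInBytes of `LIMIT n` over rows with the given schema.
def limitSizeInBytes(n: Int, schema: Seq[NativeType]): Long =
  n.toLong * schema.map(defaultSize).sum
{code}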






[jira] [Resolved] (SPARK-3485) should check parameter type when find constructors

2014-09-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3485.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2407
[https://github.com/apache/spark/pull/2407]

> should check parameter type when find constructors
> --
>
> Key: SPARK-3485
> URL: https://issues.apache.org/jira/browse/SPARK-3485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Adrian Wang
> Fix For: 1.2.0
>
>
> In hiveUdfs, we get constructors of primitive types by finding a constructor 
> which takes only one parameter. This is very dangerous when more than one 
> constructor matches. As the sequence of primitiveTypes grows larger, the 
> problem becomes more likely to occur.
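
The gist of the proposed check, sketched with plain Java reflection; this is 
only an illustration (it deliberately ignores primitive/boxed conversions), not 
the code in the actual fix:

{code}
import java.lang.reflect.Constructor

// Pick a one-argument constructor whose parameter type actually accepts the
// argument, instead of the first one-argument constructor that happens to be
// found.
def findConstructor(cls: Class[_], arg: AnyRef): Option[Constructor[_]] =
  cls.getConstructors.find { c =>
    val params = c.getParameterTypes
    params.length == 1 && params(0).isAssignableFrom(arg.getClass)
  }

// findConstructor(classOf[java.math.BigDecimal], "1.5")
// selects BigDecimal(String) rather than, say, BigDecimal(char[]).
{code}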






[jira] [Resolved] (SPARK-3605) Typo in SchemaRDD JavaDoc

2014-09-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3605.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2460
[https://github.com/apache/spark/pull/2460]

> Typo in SchemaRDD JavaDoc
> -
>
> Key: SPARK-3605
> URL: https://issues.apache.org/jira/browse/SPARK-3605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sandy Ryza
>Priority: Trivial
> Fix For: 1.2.0
>
>
> "Examples are loading data from Parquet files by using by using the"






[jira] [Resolved] (SPARK-3592) applySchema to an RDD of Row

2014-09-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3592.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2448
[https://github.com/apache/spark/pull/2448]

> applySchema to an RDD of Row
> 
>
> Key: SPARK-3592
> URL: https://issues.apache.org/jira/browse/SPARK-3592
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.2.0
>
>
> Right now, we can not apply a schema to an RDD of Row; this should be a bug:
> {code}
> >>> srdd = sqlCtx.jsonRDD(sc.parallelize(["""{"a":2}"""]))
> >>> sqlCtx.applySchema(srdd.map(lambda x:x), srdd.schema())
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/daviesliu/work/spark/python/pyspark/sql.py", line 1121,
> in applySchema
> _verify_type(row, schema)
>   File "/Users/daviesliu/work/spark/python/pyspark/sql.py", line 736,
> in _verify_type
> % (dataType, type(obj)))
> TypeError: StructType(List(StructField(a,IntegerType,true))) can not
> accept abject in type 
> {code}






[jira] [Resolved] (SPARK-2594) Add CACHE TABLE AS SELECT ...

2014-09-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2594.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2397
[https://github.com/apache/spark/pull/2397]

> Add CACHE TABLE  AS SELECT ...
> 
>
> Key: SPARK-2594
> URL: https://issues.apache.org/jira/browse/SPARK-2594
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Critical
> Fix For: 1.2.0
>
>







[jira] [Resolved] (SPARK-3501) Hive SimpleUDF will create duplicated type cast which cause exception in constant folding

2014-09-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3501.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2368
[https://github.com/apache/spark/pull/2368]

> Hive SimpleUDF will create duplicated type cast which cause exception in 
> constant folding
> -
>
> Key: SPARK-3501
> URL: https://issues.apache.org/jira/browse/SPARK-3501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
> Fix For: 1.2.0
>
>
> When running a query like:
> select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as 
> timestamp)) from src;
> Spark SQL will raise an exception:
> {panel}
> [info] - Cast Timestamp to Timestamp in UDF *** FAILED ***
> [info]   scala.MatchError: TimestampType (of class 
> org.apache.spark.sql.catalyst.types.TimestampType$)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> {panel}






[jira] [Commented] (SPARK-3608) Spark EC2 Script does not correctly break when AWS tagging succeeds.

2014-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141438#comment-14141438
 ] 

Apache Spark commented on SPARK-3608:
-

User 'vidaha' has created a pull request for this issue:
https://github.com/apache/spark/pull/2466

> Spark EC2 Script does not correctly break when AWS tagging succeeds.
> 
>
> Key: SPARK-3608
> URL: https://issues.apache.org/jira/browse/SPARK-3608
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Vida Ha
>Priority: Critical
>
> The Spark EC2 script will attempt tagging 5 times and will not break out of 
> the retry loop correctly even when tagging succeeds.






[jira] [Created] (SPARK-3608) Spark EC2 Script does not correctly break when AWS tagging succeeds.

2014-09-19 Thread Vida Ha (JIRA)
Vida Ha created SPARK-3608:
--

 Summary: Spark EC2 Script does not correctly break when AWS 
tagging succeeds.
 Key: SPARK-3608
 URL: https://issues.apache.org/jira/browse/SPARK-3608
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Vida Ha
Priority: Critical


The Spark EC2 script will attempt tagging 5 times and will not break out of the 
retry loop correctly even when tagging succeeds.






[jira] [Created] (SPARK-3607) ConnectionManager threads.max configs on the thread pools don't work

2014-09-19 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3607:


 Summary: ConnectionManager threads.max configs on the thread pools 
don't work
 Key: SPARK-3607
 URL: https://issues.apache.org/jira/browse/SPARK-3607
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Thomas Graves
Priority: Minor


In the ConnectionManager we have a bunch of thread pools. They have settings 
for the maximum number of threads for each thread pool (like 
spark.core.connection.handler.threads.max). 

Those configs don't work because the pools use an unbounded queue. From the 
ThreadPoolExecutor javadoc: "no more than corePoolSize threads will ever be 
created. (And the value of the maximumPoolSize therefore doesn't have any 
effect.)"

Luckily this doesn't matter too much, as you can work around it by just 
increasing the minimum, e.g. spark.core.connection.handler.threads.min. 

These configs aren't documented either, so they are more of an internal thing 
someone only notices when reading the code.
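
The ThreadPoolExecutor behaviour can be demonstrated in isolation (a generic 
illustration, not the ConnectionManager code): with an unbounded queue the pool 
never grows past corePoolSize, while a bounded queue lets it grow up to 
maximumPoolSize once the queue fills.

{code}
import java.util.concurrent.{ArrayBlockingQueue, LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

object PoolDemo {
  // Unbounded queue: the pool never grows beyond corePoolSize (4), so the
  // maximum of 32 threads is never reached.
  val unboundedPool = new ThreadPoolExecutor(
    4, 32, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue[Runnable]())

  // Bounded queue: once 128 tasks are queued, new threads are created up to
  // the maximum of 32, so the "max" setting actually takes effect.
  val boundedPool = new ThreadPoolExecutor(
    4, 32, 60L, TimeUnit.SECONDS, new ArrayBlockingQueue[Runnable](128))
}
{code}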







[jira] [Resolved] (SPARK-3491) Use pickle to serialize the data in MLlib Python

2014-09-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3491.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2378
[https://github.com/apache/spark/pull/2378]

> Use pickle to serialize the data in MLlib Python
> 
>
> Key: SPARK-3491
> URL: https://issues.apache.org/jira/browse/SPARK-3491
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.2.0
>
>
> Currently, we write the serialization/deserialization code in Python and 
> Scala manually; this can not scale to the large number of MLlib APIs.
> If the serialization could be done with pickle (using Pyrolite in the JVM) in 
> an extensible way, then it would be much easier to add Python APIs for MLlib.






[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-19 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141382#comment-14141382
 ] 

Matei Zaharia commented on SPARK-3129:
--

So Hari, what is the maximum sustainable rate in MB/second? That's the number 
we should be looking for. I think a latency of 50-100 ms to flush is fine, but 
we can't be writing just 5 Kbytes/second.
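For scale (an illustrative calculation, not a measured number): with flushes 
running back to back at roughly 100 ms each, sustaining 10 MB/second would mean 
buffering about 10 MB/s * 0.1 s = 1 MB per flush.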

> Prevent data loss in Spark Streaming
> 
>
> Key: SPARK-3129
> URL: https://issues.apache.org/jira/browse/SPARK-3129
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Hari Shreedharan
>Assignee: Hari Shreedharan
> Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf
>
>
> Spark Streaming can lose small amounts of data when the driver goes down and 
> the sending system cannot re-send the data (or the data has already expired 
> on the sender side). The document attached has more details. 






[jira] [Resolved] (SPARK-1701) Inconsistent naming: "slice" or "partition"

2014-09-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-1701.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2304
[https://github.com/apache/spark/pull/2304]

> Inconsistent naming: "slice" or "partition"
> ---
>
> Key: SPARK-1701
> URL: https://issues.apache.org/jira/browse/SPARK-1701
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>  Labels: starter
> Fix For: 1.2.0
>
>
> Throughout the documentation and code "slice" and "partition" are used 
> interchangeably. (Or so it seems to me.) It would avoid some confusion for 
> new users to settle on one name. I think "partition" is winning, since that 
> is the name of the class representing the concept.
> This should not be much more complicated to do than a search & replace. I can 
> take a stab at it, if you agree.






[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI

2014-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141344#comment-14141344
 ] 

Apache Spark commented on SPARK-1853:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/2464

> Show Streaming application code context (file, line number) in Spark Stages UI
> --
>
> Key: SPARK-1853
> URL: https://issues.apache.org/jira/browse/SPARK-1853
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Tathagata Das
>Assignee: Mubarak Seyed
> Fix For: 1.2.0
>
> Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png
>
>
> Right now, the code context (file and line number) shown for streaming jobs 
> in the stages UI is meaningless, as it refers to internal DStream code 
> rather than the user application file.






[jira] [Commented] (SPARK-3573) Dataset

2014-09-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141343#comment-14141343
 ] 

Sean Owen commented on SPARK-3573:
--

It had also not occurred to me to use an abstraction for SQL-like queries over 
tabular data here. 

On a practical level it means dragging in Spark SQL to do ML and that brings a 
fresh set of jar hell issues for apps that otherwise don't need it. 

The data types that are needed in a data frame-like structure are different. 
For example, it is important to know if a column will take on 2 or N distinct 
values and expose that via API. Spark SQL would ignore or hide this kind of 
info. Float/Double types don't matter per se to ML but matter in SQL. 

I don't think this means Spark SQL could not be a source of a data frame-like 
thing but I would not expect to conflate them. The bits used from ML do seem 
like a distinct and much simpler API that expects to be found in ML or core and 
not tied to SQL. 

> Dataset
> ---
>
> Key: SPARK-3573
> URL: https://issues.apache.org/jira/browse/SPARK-3573
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra 
> ML-specific metadata embedded in its schema.
> .Sample code
> Suppose we have training events stored on HDFS and user/ad features in Hive, 
> we want to assemble features for training and then apply decision tree.
> The proposed pipeline with dataset looks like the following (need more 
> refinements):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 
> 0.01).registerTempTable("event")
> val training = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, 
> event.action AS label,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""").cache()
> val indexer = new Indexer()
> val interactor = new Interactor()
> val fvAssembler = new FeatureVectorAssembler()
> val treeClassifer = new DecisionTreeClassifer()
> val paramMap = new ParamMap()
>   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
>   .put(indexer.sortByFrequency, true)
>   .put(iteractor.features, Map("genderMatch" -> Array("userGender", 
> "targetGender")))
>   .put(fvAssembler.features, Map("features" -> Array("genderMatch", 
> "userCountryIndex", "userFeatures")))
>   .put(fvAssembler.dense, true)
>   .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes 
> "features" and "label" columns.
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, 
> treeClassifier)
> val model = pipeline.fit(raw, paramMap)
> sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
> val test = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""")
> val prediction = model.transform(test).select('eventId, 'prediction)
> {code}






[jira] [Commented] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal

2014-09-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141336#comment-14141336
 ] 

Patrick Wendell commented on SPARK-2271:


From the Hive JIRA it looks like this was originally a Microsoft package. 
Should we just get the package directly, or is it necessary to get it from Hive?

> Use Hive's high performance Decimal128 to replace BigDecimal
> 
>
> Key: SPARK-2271
> URL: https://issues.apache.org/jira/browse/SPARK-2271
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
>
> Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017






[jira] [Commented] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD

2014-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141313#comment-14141313
 ] 

Apache Spark commented on SPARK-3604:
-

User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/2463

> unbounded recursion in getNumPartitions triggers stack overflow for large 
> UnionRDD
> --
>
> Key: SPARK-3604
> URL: https://issues.apache.org/jira/browse/SPARK-3604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: linux.  Used python, but error is in Scala land.
>Reporter: Eric Friedman
>Priority: Critical
>
> I have a large number of parquet files all with the same schema and attempted 
> to make a UnionRDD out of them.
> When I call getNumPartitions(), I get a stack overflow error
> that looks like this:
> Py4JJavaError: An error occurred while calling o3275.partitions.
> : java.lang.StackOverflowError
>   at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
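
For context, a hedged sketch of the usage pattern that triggers this versus a 
flat union (a workaround illustration assuming an existing SparkContext `sc`, 
not a fix to UnionRDD itself):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Nested unions: each .union wraps the previous result in another UnionRDD,
// so resolving partitions recurses once per input and can overflow the stack
// when there are thousands of inputs.
def nestedUnion[T: scala.reflect.ClassTag](rdds: Seq[RDD[T]]): RDD[T] =
  rdds.reduce(_ union _)

// Flat union: a single UnionRDD over all inputs keeps the recursion shallow.
def flatUnion[T: scala.reflect.ClassTag](sc: SparkContext, rdds: Seq[RDD[T]]): RDD[T] =
  sc.union(rdds)
{code}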






[jira] [Created] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.

2014-09-19 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-3606:
-

 Summary: Spark-on-Yarn AmIpFilter does not work with Yarn HA.
 Key: SPARK-3606
 URL: https://issues.apache.org/jira/browse/SPARK-3606
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Marcelo Vanzin


The current IP filter only considers one of the RMs in an HA setup. If the 
active RM is not the configured one, you get a "connection refused" error when 
clicking on the Spark AM links in the RM UI.

Similar to YARN-1811, but for Spark.
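
A sketch of the general idea, using standard YARN HA configuration keys; this 
is an assumption about the shape of a fix, not the actual patch:

{code}
import org.apache.hadoop.conf.Configuration
import scala.collection.JavaConverters._

// Sketch: when RM HA is enabled, collect the web addresses of all configured
// RMs instead of only the default one, so the AmIpFilter accepts requests
// proxied by whichever RM is currently active.
def allRmWebAddresses(conf: Configuration): Seq[String] = {
  if (conf.getBoolean("yarn.resourcemanager.ha.enabled", false)) {
    val rmIds = conf.getTrimmedStringCollection("yarn.resourcemanager.ha.rm-ids").asScala.toSeq
    rmIds.flatMap(id => Option(conf.get(s"yarn.resourcemanager.webapp.address.$id")))
  } else {
    Seq(conf.get("yarn.resourcemanager.webapp.address"))
  }
}
{code}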






[jira] [Commented] (SPARK-3573) Dataset

2014-09-19 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141271#comment-14141271
 ] 

Xiangrui Meng commented on SPARK-3573:
--

[~sandyr] SQL/Streaming/GraphX provide computation frameworks, while MLlib is 
for building machine learning pipelines. It is natural to leverage the tools we 
have built inside Spark. For example, we included streaming machine learning 
algorithms in 1.1 and we plan to implement LDA using GraphX. I'm not worried 
about MLlib depending on SQL. MLlib can provide UDFs related to machine 
learning. It will be an extension to SQL, but SQL doesn't depend on MLlib. 
There are not many types in ML. One thing we want to add is Vector, and its 
transformations are supported by MLlib's transformers. With weak types, we 
cannot prevent users from declaring a string column as numeric, but errors will 
be generated at runtime.

[~epahomov] If we are talking about a single machine learning algorithm, label, 
feature, and perhaps weight should be sufficient. However, for a data pipeline, 
we need more flexible operations. I think we should make it easier for users to 
construct such a pipeline. Libraries like R and Pandas support dataframes, 
which are very similar to SchemaRDD, while the latter also provides an 
execution plan. Do we need an execution plan? Maybe not in the first stage, but 
we definitely need it for future optimization. For training, we use 
label/features, and for prediction, we need id/features. Spark SQL can figure 
out the columns needed and optimize the query if the underlying storage is in 
columnar format.

One useful thing we can try is to write down some sample code that constructs a 
pipeline with a couple of components and re-applies the pipeline to test data. 
Then take a look at the code as users would and see whether it is simple to 
use. At the beginning, I tried to define Instance similar to Weka 
(https://github.com/mengxr/spark-ml/blob/master/doc/instance.md), but it 
doesn't work well for those pipelines.

> Dataset
> ---
>
> Key: SPARK-3573
> URL: https://issues.apache.org/jira/browse/SPARK-3573
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra 
> ML-specific metadata embedded in its schema.
> .Sample code
> Suppose we have training events stored on HDFS and user/ad features in Hive, 
> we want to assemble features for training and then apply decision tree.
> The proposed pipeline with dataset looks like the following (need more 
> refinements):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 
> 0.01).registerTempTable("event")
> val training = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, 
> event.action AS label,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""").cache()
> val indexer = new Indexer()
> val interactor = new Interactor()
> val fvAssembler = new FeatureVectorAssembler()
> val treeClassifer = new DecisionTreeClassifer()
> val paramMap = new ParamMap()
>   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
>   .put(indexer.sortByFrequency, true)
>   .put(iteractor.features, Map("genderMatch" -> Array("userGender", 
> "targetGender")))
>   .put(fvAssembler.features, Map("features" -> Array("genderMatch", 
> "userCountryIndex", "userFeatures")))
>   .put(fvAssembler.dense, true)
>   .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes 
> "features" and "label" columns.
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, 
> treeClassifier)
> val model = pipeline.fit(raw, paramMap)
> sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
> val test = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""")
> val prediction = model.transform(test).select('eventId, 'prediction)
> {code}






[jira] [Commented] (SPARK-2175) Null values when using App trait.

2014-09-19 Thread Brandon Amos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141217#comment-14141217
 ] 

Brandon Amos commented on SPARK-2175:
-

Hi, does the following snippet from the mailing list post linked above
still exhibit this behavior?

{code}
val suffix = "-suffix"
val l = sc.parallelize(List("a", "b", "c"))
println(l.map(_+suffix).collect().mkString(","))
{code}
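
For anyone hitting this, the usual explanation (an assumption consistent with 
the linked thread, not confirmed here) is that `object ... extends App` relies 
on DelayedInit, so vals may still be uninitialized when a closure referencing 
them is serialized to executors. Using an explicit main method avoids that:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the workaround: no App trait / DelayedInit, so `suffix` is fully
// initialized before the closure that captures it is serialized.
object SuffixJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("suffix-example"))
    val suffix = "-suffix"
    val l = sc.parallelize(List("a", "b", "c"))
    println(l.map(_ + suffix).collect().mkString(","))
    sc.stop()
  }
}
{code}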

> Null values when using App trait.
> -
>
> Key: SPARK-2175
> URL: https://issues.apache.org/jira/browse/SPARK-2175
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Linux
>Reporter: Brandon Amos
>Priority: Trivial
>
> See 
> http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerExceptions-when-using-val-or-broadcast-on-a-standalone-cluster-tc7524.html






[jira] [Commented] (SPARK-951) Gaussian Mixture Model

2014-09-19 Thread Anant Daksh Asthana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141124#comment-14141124
 ] 

Anant Daksh Asthana commented on SPARK-951:
---

caizhua, could you please elaborate a little more on the issue? Right now, 
'This code' and the 'input file named Gmm_spark.tbl' are unknown to me at the 
time of reading this.

> Gaussian Mixture Model
> --
>
> Key: SPARK-951
> URL: https://issues.apache.org/jira/browse/SPARK-951
> Project: Spark
>  Issue Type: Story
>  Components: Examples
>Affects Versions: 0.7.3
>Reporter: caizhua
>Priority: Critical
>  Labels: Learning, Machine, Model
>
> This code implements a Gaussian Mixture Model. The input file named 
> Gmm_spark.tbl is the input for this program.






[jira] [Updated] (SPARK-3536) SELECT on empty parquet table throws exception

2014-09-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3536:

Assignee: Ravindra Pesala

> SELECT on empty parquet table throws exception
> --
>
> Key: SPARK-3536
> URL: https://issues.apache.org/jira/browse/SPARK-3536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Ravindra Pesala
>  Labels: starter
>
> Reported by [~matei].  Reproduce as follows:
> {code}
> scala> case class Data(i: Int)
> defined class Data
> scala> createParquetFile[Data]("testParquet")
> scala> parquetFile("testParquet").count()
> 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due 
> to exception - job: 0
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438)
>   at 
> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
>   at 
> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> {code}






[jira] [Commented] (SPARK-3573) Dataset

2014-09-19 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141063#comment-14141063
 ] 

Sandy Ryza commented on SPARK-3573:
---

Currently SchemaRDD does depend on Catalyst.  Are you thinking we'd take that 
out?

I wasn't thinking about any specific drawbacks, unless SQL might need to depend 
on MLlib as well? I guess I'm thinking about it more from the perspective of 
what mental model we expect users to have when dealing with Datasets.  
SchemaRDD brings along baggage like LogicalPlans - do users need to understand 
what that is?  SQL and ML types sometimes line up, sometimes have fuzzy 
relationships, and sometimes can't be translated.  How does the mapping get 
defined?  What stops someone from annotating a String column with "numeric"?


> Dataset
> ---
>
> Key: SPARK-3573
> URL: https://issues.apache.org/jira/browse/SPARK-3573
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra 
> ML-specific metadata embedded in its schema.
> .Sample code
> Suppose we have training events stored on HDFS and user/ad features in Hive, 
> we want to assemble features for training and then apply decision tree.
> The proposed pipeline with dataset looks like the following (need more 
> refinements):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 
> 0.01).registerTempTable("event")
> val training = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, 
> event.action AS label,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""").cache()
> val indexer = new Indexer()
> val interactor = new Interactor()
> val fvAssembler = new FeatureVectorAssembler()
> val treeClassifer = new DecisionTreeClassifer()
> val paramMap = new ParamMap()
>   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
>   .put(indexer.sortByFrequency, true)
>   .put(iteractor.features, Map("genderMatch" -> Array("userGender", 
> "targetGender")))
>   .put(fvAssembler.features, Map("features" -> Array("genderMatch", 
> "userCountryIndex", "userFeatures")))
>   .put(fvAssembler.dense, true)
>   .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes 
> "features" and "label" columns.
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, 
> treeClassifier)
> val model = pipeline.fit(raw, paramMap)
> sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
> val test = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""")
> val prediction = model.transform(test).select('eventId, 'prediction)
> {code}






[jira] [Commented] (SPARK-3605) Typo in SchemaRDD JavaDoc

2014-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141028#comment-14141028
 ] 

Apache Spark commented on SPARK-3605:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/2460

> Typo in SchemaRDD JavaDoc
> -
>
> Key: SPARK-3605
> URL: https://issues.apache.org/jira/browse/SPARK-3605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sandy Ryza
>Priority: Trivial
>
> "Examples are loading data from Parquet files by using by using the"






[jira] [Created] (SPARK-3605) Typo in SchemaRDD JavaDoc

2014-09-19 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3605:
-

 Summary: Typo in SchemaRDD JavaDoc
 Key: SPARK-3605
 URL: https://issues.apache.org/jira/browse/SPARK-3605
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sandy Ryza
Priority: Trivial


"Examples are loading data from Parquet files by using by using the"






[jira] [Resolved] (SPARK-3583) Spark run slow after unexpected repartition

2014-09-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3583.

Resolution: Invalid

Hi there, can you start on the user list rather than on JIRA? We use JIRA to 
track well described bugs once we've pinpointed issues. Thanks!

> Spark run slow after unexpected repartition
> ---
>
> Key: SPARK-3583
> URL: https://issues.apache.org/jira/browse/SPARK-3583
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 0.9.1
>Reporter: ShiShu
>  Labels: easyfix
> Attachments: spark_q_001.jpg, spark_q_004.jpg, spark_q_005.jpg, 
> spark_q_006.jpg
>
>
> Hi dear all~
> My Spark application sometimes runs much slower than it used to, so I wonder 
> why this would happen.
> I found that after a repartition at stage 17, all tasks go to one executor. 
> But in my code, I only use repartition at the very beginning.
> In my application, before stage 17, every stage ran successfully within 1 
> minute, but after stage 17, every stage took more than 10 minutes. Normally 
> my application runs successfully and finishes within 9 minutes.
> My Spark version is 0.9.1, and my program is written in Scala.
> I took some screenshots but don't know how to post them; please tell me if 
> you need them.






[jira] [Commented] (SPARK-2175) Null values when using App trait.

2014-09-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140947#comment-14140947
 ] 

Patrick Wendell commented on SPARK-2175:


Thanks for reporting this - can someone provide a concise example here to post 
in the JIRA body? Thanks!

> Null values when using App trait.
> -
>
> Key: SPARK-2175
> URL: https://issues.apache.org/jira/browse/SPARK-2175
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Linux
>Reporter: Brandon Amos
>Priority: Trivial
>
> See 
> http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerExceptions-when-using-val-or-broadcast-on-a-standalone-cluster-tc7524.html






[jira] [Updated] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD

2014-09-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3604:
---
Target Version/s: 1.2.0

> unbounded recursion in getNumPartitions triggers stack overflow for large 
> UnionRDD
> --
>
> Key: SPARK-3604
> URL: https://issues.apache.org/jira/browse/SPARK-3604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: linux.  Used python, but error is in Scala land.
>Reporter: Eric Friedman
>Priority: Critical
>
> I have a large number of parquet files all with the same schema and attempted 
> to make a UnionRDD out of them.
> When I call getNumPartitions(), I get a stack overflow error
> that looks like this:
> Py4JJavaError: An error occurred while calling o3275.partitions.
> : java.lang.StackOverflowError
>   at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)






[jira] [Updated] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD

2014-09-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3604:
---
Priority: Critical  (was: Blocker)

> unbounded recursion in getNumPartitions triggers stack overflow for large 
> UnionRDD
> --
>
> Key: SPARK-3604
> URL: https://issues.apache.org/jira/browse/SPARK-3604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: linux.  Used python, but error is in Scala land.
>Reporter: Eric Friedman
>Priority: Critical
>
> I have a large number of parquet files all with the same schema and attempted 
> to make a UnionRDD out of them.
> When I call getNumPartitions(), I get a stack overflow error
> that looks like this:
> Py4JJavaError: An error occurred while calling o3275.partitions.
> : java.lang.StackOverflowError
>   at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD

2014-09-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140940#comment-14140940
 ] 

Patrick Wendell commented on SPARK-3604:


Yeah good catch, we should fix this.

> unbounded recursion in getNumPartitions triggers stack overflow for large 
> UnionRDD
> --
>
> Key: SPARK-3604
> URL: https://issues.apache.org/jira/browse/SPARK-3604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: linux.  Used python, but error is in Scala land.
>Reporter: Eric Friedman
>Priority: Critical
>
> I have a large number of parquet files all with the same schema and attempted 
> to make a UnionRDD out of them.
> When I call getNumPartitions(), I get a stack overflow error
> that looks like this:
> Py4JJavaError: An error occurred while calling o3275.partitions.
> : java.lang.StackOverflowError
>   at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3599) Avoid loading and printing properties file content frequently

2014-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140896#comment-14140896
 ] 

Apache Spark commented on SPARK-3599:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/2454

> Avoid loading and printing properties file content frequently
> -
>
> Key: SPARK-3599
> URL: https://issues.apache.org/jira/browse/SPARK-3599
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: WangTaoTheTonic
>Priority: Minor
> Attachments: too many verbose.txt
>
>
> When I use -v | --verbose in spark-submit, it prints lots of messages about the 
> contents of the properties file. 
> After checking the code in SparkSubmit.scala and SparkSubmitArguments.scala, I 
> found the "getDefaultSparkProperties" method is invoked in three places, and 
> every time we invoke it, we load properties from the properties file and print 
> them again if the -v option is used.
> We should probably use a value instead of a method for the default properties.
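A minimal sketch of that suggestion, kept generic on purpose (the class and field
names below are made up and are not the actual SparkSubmitArguments code): caching
the parsed file in a lazy val means it is read, and echoed in verbose mode, at
most once.

{code}
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._

// Illustrative only: parse the properties file once and reuse the result,
// instead of re-reading and re-printing it on every access.
class SubmitArgs(propertiesFile: String, verbose: Boolean) {
  lazy val defaultSparkProperties: Map[String, String] = {
    val props = new Properties()
    val in = new FileInputStream(propertiesFile)
    try props.load(in) finally in.close()
    val parsed = props.asScala.toMap
    if (verbose) parsed.foreach { case (k, v) => println(s"Default property: $k=$v") }
    parsed
  }
}
{code}

Every caller then refers to the same cached value, so the verbose output appears
once no matter how many code paths consult the defaults.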



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3536) SELECT on empty parquet table throws exception

2014-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140898#comment-14140898
 ] 

Apache Spark commented on SPARK-3536:
-

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/2456

> SELECT on empty parquet table throws exception
> --
>
> Key: SPARK-3536
> URL: https://issues.apache.org/jira/browse/SPARK-3536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>  Labels: starter
>
> Reported by [~matei].  Reproduce as follows:
> {code}
> scala> case class Data(i: Int)
> defined class Data
> scala> createParquetFile[Data]("testParquet")
> scala> parquetFile("testParquet").count()
> 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due 
> to exception - job: 0
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438)
>   at 
> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
>   at 
> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3598) cast to timestamp should be the same as hive

2014-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140900#comment-14140900
 ] 

Apache Spark commented on SPARK-3598:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/2458

> cast to timestamp should be the same as hive
> 
>
> Key: SPARK-3598
> URL: https://issues.apache.org/jira/browse/SPARK-3598
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>
> select cast(1000 as timestamp) from src limit 1;
> should return 1970-01-01 00:00:01
> Also, the current implementation has a bug when the time is before 1970-01-01 
> 00:00:00.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3250) More Efficient Sampling

2014-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140897#comment-14140897
 ] 

Apache Spark commented on SPARK-3250:
-

User 'erikerlandson' has created a pull request for this issue:
https://github.com/apache/spark/pull/2455

> More Efficient Sampling
> ---
>
> Key: SPARK-3250
> URL: https://issues.apache.org/jira/browse/SPARK-3250
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: RJ Nowling
>
> Sampling, as currently implemented in Spark, is an O\(n\) operation.  A 
> number of stochastic algorithms achieve speed ups by exploiting O\(k\) 
> sampling, where k is the number of data points to sample.  Examples of such 
> algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient 
> Descent with mini batching.
> More efficient sampling may be achievable by packing partitions with an 
> ArrayBuffer or other data structure supporting random access.  Since many of 
> these stochastic algorithms perform repeated rounds of sampling, it may be 
> feasible to perform a transformation to change the backing data structure 
> followed by multiple rounds of sampling.
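As a rough standalone sketch of the O\(k\) idea (plain Scala, no Spark APIs; the
helper name is made up): once a partition's elements sit in a random-access
buffer, drawing k samples with replacement costs k index lookups instead of a
pass over all n elements.

{code}
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical helper: O(k) sampling with replacement from a random-access
// buffer, as opposed to an O(n) scan over an iterator-backed partition.
def sampleWithReplacement[T](partition: ArrayBuffer[T], k: Int, rng: Random): Seq[T] =
  Seq.fill(k)(partition(rng.nextInt(partition.length)))
{code}

Repeated mini-batch rounds would then amortize the one-time cost of
materializing each partition into such a buffer.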



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3268) DoubleType should support modulus

2014-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140899#comment-14140899
 ] 

Apache Spark commented on SPARK-3268:
-

User 'gvramana' has created a pull request for this issue:
https://github.com/apache/spark/pull/2457

> DoubleType should support modulus 
> --
>
> Key: SPARK-3268
> URL: https://issues.apache.org/jira/browse/SPARK-3268
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Chris Grier
>Priority: Minor
>
> Using the modulus operator (%) on Doubles throws an exception. 
> E.g.: 
> SELECT 1388632775.0 % 60 from tablename LIMIT 1
> Throws: 
> java.lang.Exception: Type DoubleType does not support numeric operations
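For reference, the plain floating-point remainder the query above would be
expected to return once DoubleType supports %:

{code}
// 1388632775.0 / 60 = 23143879 remainder 35, so:
val expected: Double = 1388632775.0 % 60   // 35.0
{code}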



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD

2014-09-19 Thread Eric Friedman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140713#comment-14140713
 ] 

Eric Friedman commented on SPARK-3604:
--

There are many more frames of the same content than pasted above, of course.

Looks like the problem is here:

  override def getPartitions: Array[Partition] = {
val array = new Array[Partition](rdds.map(_.partitions.size).sum)

and should either be solved iteratively or as a tail call.
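A rough sketch of the iterative alternative (not the actual Spark fix; it assumes
the child RDDs of a union are reachable through the `rdds` field and ignores the
partitions cache):

{code}
import scala.collection.mutable
import org.apache.spark.rdd.{RDD, UnionRDD}

// Walk nested UnionRDDs with an explicit stack so deep nesting does not
// consume one JVM stack frame per level of union.
def totalPartitions(root: RDD[_]): Int = {
  val stack = mutable.Stack[RDD[_]](root)
  var total = 0
  while (stack.nonEmpty) {
    stack.pop() match {
      case u: UnionRDD[_] => u.rdds.foreach(stack.push)   // defer children instead of recursing
      case leaf           => total += leaf.partitions.length
    }
  }
  total
}
{code}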

> unbounded recursion in getNumPartitions triggers stack overflow for large 
> UnionRDD
> --
>
> Key: SPARK-3604
> URL: https://issues.apache.org/jira/browse/SPARK-3604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: linux.  Used python, but error is in Scala land.
>Reporter: Eric Friedman
>Priority: Blocker
>
> I have a large number of parquet files all with the same schema and attempted 
> to make a UnionRDD out of them.
> When I call getNumPartitions(), I get a stack overflow error
> that looks like this:
> Py4JJavaError: An error occurred while calling o3275.partitions.
> : java.lang.StackOverflowError
>   at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD

2014-09-19 Thread Eric Friedman (JIRA)
Eric Friedman created SPARK-3604:


 Summary: unbounded recursion in getNumPartitions triggers stack 
overflow for large UnionRDD
 Key: SPARK-3604
 URL: https://issues.apache.org/jira/browse/SPARK-3604
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: linux.  Used python, but error is in Scala land.
Reporter: Eric Friedman
Priority: Blocker


I have a large number of parquet files all with the same schema and attempted 
to make a UnionRDD out of them.

When I call getNumPartitions(), I get a stack overflow error

that looks like this:

Py4JJavaError: An error occurred while calling o3275.partitions.
: java.lang.StackOverflowError
at 
scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2706) Enable Spark to support Hive 0.13

2014-09-19 Thread Greg Senia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140657#comment-14140657
 ] 

Greg Senia commented on SPARK-2706:
---

We have been using this fix for a few weeks now against Hive 0.13. The only 
outstanding issue I see, and this could be something larger, is that the Spark 
Thrift service doesn't seem to support hive.server2.enable.doAs = true: it 
doesn't set the proxy user.

> Enable Spark to support Hive 0.13
> -
>
> Key: SPARK-2706
> URL: https://issues.apache.org/jira/browse/SPARK-2706
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: SQL
>Affects Versions: 1.0.1
>Reporter: Chunjun Xiao
>Assignee: Zhan Zhang
> Attachments: hive.diff, spark-2706-v1.txt, spark-2706-v2.txt, 
> spark-hive.err, v1.0.2.diff
>
>
> It seems Spark cannot work with Hive 0.13 well.
> When I compiled Spark with Hive 0.13.1, I got some error messages, as 
> attached below.
> So, when can Spark be enabled to support Hive 0.13?
> Compiling Error:
> {quote}
> [ERROR] 
> /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180:
>  type mismatch;
>  found   : String
>  required: Array[String]
> [ERROR]   val proc: CommandProcessor = 
> CommandProcessorFactory.get(tokens(0), hiveconf)
> [ERROR]  ^
> [ERROR] 
> /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264:
>  overloaded method constructor TableDesc with alternatives:
>   (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2: 
> Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc 
> 
>   ()org.apache.hadoop.hive.ql.plan.TableDesc
>  cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], 
> Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in 
> value tableDesc)(in value tableDesc)], java.util.Properties)
> [ERROR]   val tableDesc = new TableDesc(
> [ERROR]   ^
> [ERROR] 
> /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140:
>  value getPartitionPath is not a member of 
> org.apache.hadoop.hive.ql.metadata.Partition
> [ERROR]   val partPath = partition.getPartitionPath
> [ERROR]^
> [ERROR] 
> /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132:
>  value appendReadColumnNames is not a member of object 
> org.apache.hadoop.hive.serde2.ColumnProjectionUtils
> [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, 
> attributes.map(_.name))
> [ERROR]   ^
> [ERROR] 
> /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79:
>  org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
> [ERROR]   new HiveDecimal(bd.underlying())
> [ERROR]   ^
> [ERROR] 
> /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132:
>  type mismatch;
>  found   : org.apache.hadoop.fs.Path
>  required: String
> [ERROR]   
> SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf))
> [ERROR]   ^
> [ERROR] 
> /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179:
>  value getExternalTmpFileURI is not a member of 
> org.apache.hadoop.hive.ql.Context
> [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation)
> [ERROR]   ^
> [ERROR] 
> /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209:
>  org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
> [ERROR]   case bd: BigDecimal => new HiveDecimal(bd.underlying())
> [ERROR]  ^
> [ERROR] 8 errors found
> [DEBUG] Compilation failed (CompilerInterface)
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Spark Project Parent POM .. SUCCESS [2.579s]
> [INFO] Spark Project Core  SUCCESS [2:39.805s]
> [INFO] Spark Project Bagel ... SUCCESS [21.148s]
> [INFO] Spark Project GraphX .. SUCCESS [59.950s]
> [INFO] Spark Project ML Library .. SUCCESS [1:08.771s]
> [INFO] Spark Project Streaming ... SUCCESS [1:17.759s]
> [INFO] Spark Project Tools ... SUCCESS [15.405s]
> [INFO] Spark Project Catalyst  SUCCESS [1:17.405s]
> [INFO] Spark Project SQL . SUCCESS 

[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2014-09-19 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140647#comment-14140647
 ] 

Imran Rashid commented on SPARK-2365:
-

This looks fantastic.  I think it will see heavy use outside of GraphX as well.

I only have one question about the API -- I think the name `IndexedRDD` is more 
appropriate for an interface, not this concrete implementation.  I can imagine 
other indexing strategies that would also feel like an `IndexedRDD`.  It's 
always hard to know when it's worth putting in an interface and whether you 
will actually need more concrete implementations, but this seems like a good 
candidate to me.  You could partially deal with this by having a companion 
object for the interface which constructs this particular implementation 
(e.g., the way the `IndexedSeq` companion object's `apply()` methods build a 
`Vector`).

Not that I have any great recommendations for a better name ... 
`UpdateableHashIndexedRDD` maybe?
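A tiny, made-up illustration of that companion-object pattern (none of these
names are proposed API; it's just the shape of interface-plus-default-impl):

{code}
// The trait carries the concept's name; the companion's apply() returns the
// default concrete implementation, the way IndexedSeq(...) hands back a Vector.
trait IndexedLike[V] {
  def get(key: Long): Option[V]
}

object IndexedLike {
  def apply[V](entries: Map[Long, V]): IndexedLike[V] = new HashBacked(entries)

  private class HashBacked[V](m: Map[Long, V]) extends IndexedLike[V] {
    def get(key: Long): Option[V] = m.get(key)
  }
}
{code}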

my other comments on the design can probably be handled in future work, but 
while they are on my mind:

# Is there any way to save & load an `IndexedRDD` from HDFS?  It seems like 
you'll always need to reshuffle the data when you load it if you just do 
`IndexedRDD(sc.hadoopFile(...))`. It seems we need a way for a Hadoop file to 
be loaded with an "assumed Partitioner" (I thought I opened a ticket for that a 
while ago, but I can't find it ... I might open another one).  Also you might 
want some way to load the index from disk as well, though I suppose rebuilding 
that isn't too painful.

# I'm wondering whether we should add a "bulk multiget".  E.g., say your 
IndexedRDD has 1 B entries, and you want to look up and process 100K of them in 
parallel.  You probably don't want to do a sequential scan ... but you also 
don't want to call `multiget`, which will pull the records onto the driver.  
Actually, as I'm writing this, I'm realizing there is probably a better way to 
do this -- you should just make another RDD out of your 100K elements, and then 
do an inner join (a rough sketch of that approach follows at the end of this 
comment).  Does that sound right?  We could add a convenience method for this 
-- but maybe I'm the only one who wants this, so it's premature to do anything 
about it.

again, I think this is a fantastic addition!  I'm looking through the code now, 
but so far it all seems great.
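Here is the rough sketch of the join-based bulk lookup mentioned in point 2
above, written against plain pair-RDD operations rather than any IndexedRDD API
(the method name and concrete types are made up):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicit pair-RDD operations (join, mapValues)
import org.apache.spark.rdd.RDD

// Ship the wanted keys out as their own small RDD and inner-join, so the
// lookups run in parallel on the executors instead of being collected to
// the driver or satisfied by a full sequential scan.
def bulkGet(sc: SparkContext, big: RDD[(Long, String)], wanted: Seq[Long]): RDD[(Long, String)] = {
  val keys: RDD[(Long, Unit)] = sc.parallelize(wanted).map(k => (k, ()))
  big.join(keys).mapValues { case (v, _) => v }
}
{code}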

> Add IndexedRDD, an efficient updatable key-value store
> --
>
> Key: SPARK-2365
> URL: https://issues.apache.org/jira/browse/SPARK-2365
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This 
> imposes minimal requirements on the storage layer, which only needs to 
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient 
> support for point lookups would enable serving data out of RDDs, but it 
> currently requires iterating over an entire partition to find the desired 
> element. Point updates similarly require copying an entire iterator. Joins 
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value 
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
> uniqueness and pre-indexing the entries for efficient joins and point 
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) 
> maintaining a hash index within each partition, and (3) using purely 
> functional (immutable and efficiently updatable) data structures to enable 
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a 
> limited form of this functionality in VertexRDD. We envision a variety of 
> other uses for IndexedRDD, including streaming updates to RDDs, direct 
> serving from RDDs, and as an execution strategy for Spark SQL.
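For flavor only, a stripped-down sketch of steps (1) and (2) using stock RDD
operations; this is not the proposed IndexedRDD, and a real implementation would
use the partitioner to route a lookup to the single owning partition instead of
visiting all of them:

{code}
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // implicit pair-RDD operations (partitionBy, ...)
import org.apache.spark.rdd.RDD

// (1) hash-partition by key, then (2) build an in-memory hash index per partition.
def buildIndex[V: ClassTag](entries: RDD[(Long, V)], numPartitions: Int): RDD[Map[Long, V]] =
  entries
    .partitionBy(new HashPartitioner(numPartitions))
    .mapPartitions(it => Iterator(it.toMap), preservesPartitioning = true)

// Point lookup against the per-partition indexes (here it still visits every
// partition; the proposal would narrow this to the one owning the key).
def pointLookup[V: ClassTag](index: RDD[Map[Long, V]], key: Long): Array[V] =
  index.flatMap(_.get(key)).collect()
{code}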



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3603) InvalidClassException on a Linux VM - probably problem with serialization

2014-09-19 Thread Tomasz Dudziak (JIRA)
Tomasz Dudziak created SPARK-3603:
-

 Summary: InvalidClassException on a Linux VM - probably problem 
with serialization
 Key: SPARK-3603
 URL: https://issues.apache.org/jira/browse/SPARK-3603
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0, 1.0.0
 Environment: Linux version 2.6.32-358.32.3.el6.x86_64 
(mockbu...@x86-029.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
Hat 4.4.7-3) (GCC) ) #1 SMP Fri Jan 17 08:42:31 EST 2014

java version "1.7.0_25"
OpenJDK Runtime Environment (rhel-2.3.10.4.el6_4-x86_64)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

Spark (either 1.0.0 or 1.1.0)

Reporter: Tomasz Dudziak
Priority: Critical


I have a Scala app connecting to a standalone Spark cluster. It works fine on 
Windows or on a Linux VM; however, when I try to run the app and the Spark 
cluster on another Linux VM (the same Linux kernel, Java and Spark - tested for 
versions 1.0.0 and 1.1.0) I get the below exception. This looks kind of similar 
to the Big-Endian (IBM Power7) Spark Serialization issue (SPARK-2018), but... 
my system is definitely little endian and I understand the big endian issue 
should already be fixed in Spark 1.1.0 anyway. I'd appreciate your help.

01:34:53.251 WARN  [Result resolver thread-0][TaskSetManager] Lost TID 2 (task 
1.0:2)
01:34:53.278 WARN  [Result resolver thread-0][TaskSetManager] Loss was due to 
java.io.InvalidClassException
java.io.InvalidClassException: scala.reflect.ClassTag$$anon$1; local class 
incompatible: stream classdesc serialVersionUID = -4937928798201944954, local 
class serialVersionUID = -8102093212602380348
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.jav

[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-19 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140396#comment-14140396
 ] 

Alexander Ulanov commented on SPARK-3403:
-

Thanks, Sam! Posted to OpenBLAS: https://github.com/xianyi/OpenBLAS/issues/452

> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.2.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC005)" after displaying the first prediction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-19 Thread Sam Halliday (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140319#comment-14140319
 ] 

Sam Halliday commented on SPARK-3403:
-

Thanks guys. This looks like it's even further upstream of me. It would be good 
if you could submit it to OpenBLAS.

I've never seen great gains in OpenBLAS over ATLAS, and certainly the AMD/Intel 
versions are far superior, so I recommend them if performance is really critical.

> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.2.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC005)" after displaying the first prediction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3536) SELECT on empty parquet table throws exception

2014-09-19 Thread Ravindra Pesala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140269#comment-14140269
 ] 

Ravindra Pesala commented on SPARK-3536:


[~isaias.barroso] I submitted the PR 4 hours ago, but I am not sure why it 
is not yet linked to the JIRA.

> SELECT on empty parquet table throws exception
> --
>
> Key: SPARK-3536
> URL: https://issues.apache.org/jira/browse/SPARK-3536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>  Labels: starter
>
> Reported by [~matei].  Reproduce as follows:
> {code}
> scala> case class Data(i: Int)
> defined class Data
> scala> createParquetFile[Data]("testParquet")
> scala> parquetFile("testParquet").count()
> 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due 
> to exception - job: 0
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438)
>   at 
> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
>   at 
> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3530) Pipeline and Parameters

2014-09-19 Thread Egor Pakhomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140260#comment-14140260
 ] 

Egor Pakhomov commented on SPARK-3530:
--

Nice doc. 
Parameter passing as part of grid search and pipeline creation is a great and 
important feature, but it's only one of the features. For me it's more 
important to see the Estimator abstraction in the Spark code base early - maybe 
not earlier than introducing the dataset abstraction, but definitely earlier 
than any work on grid search. 

When we were thinking about creating such a pipeline framework, we came to the 
conclusion that transformations in this pipeline are like steps in an Oozie 
workflow - they should be easily retrievable, be persisted, and have some queue. 
That's because a transformation can take hours, and rerunning the whole 
pipeline in case of a step failure is expensive. A pipeline can consist of a 
grid search over parameters, which means there are a lot of parallel 
executions, which need wise scheduling. So I think the pipeline should be 
executed on some cluster-wide scheduler with some persistence. I'm not saying 
that it's absolutely necessary now, but it would be great to have the 
architecture open to such a possibility. 

> Pipeline and Parameters
> ---
>
> Key: SPARK-3530
> URL: https://issues.apache.org/jira/browse/SPARK-3530
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> This part of the design doc is for pipelines and parameters. I put the design 
> doc at
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> I will copy the proposed interfaces to this JIRA later. Some sample code can 
> be viewed at: https://github.com/mengxr/spark-ml/
> Please help review the design and post your comments here. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3602) Can't run cassandra_inputformat.py

2014-09-19 Thread Frens Jan Rumph (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140236#comment-14140236
 ] 

Frens Jan Rumph edited comment on SPARK-3602 at 9/19/14 9:45 AM:
-

When running this against the spark-1.1.0-bin-hadoop build I get the following 
output:

{noformat}
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/09/19 11:24:31 WARN Utils: Your hostname, laptop-x resolves to a 
loopback address: 127.0.0.1; using 192.168.2.2 instead (on interface wlan0)
14/09/19 11:24:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
14/09/19 11:24:31 INFO SecurityManager: Changing view acls to: frens-jan,
14/09/19 11:24:31 INFO SecurityManager: Changing modify acls to: frens-jan,
14/09/19 11:24:31 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(frens-jan, ); 
users with modify permissions: Set(frens-jan, )
14/09/19 11:24:31 INFO Slf4jLogger: Slf4jLogger started
14/09/19 11:24:31 INFO Remoting: Starting remoting
14/09/19 11:24:32 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@laptop-x.local:44417]
14/09/19 11:24:32 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkDriver@laptop-x.local:44417]
14/09/19 11:24:32 INFO Utils: Successfully started service 'sparkDriver' on 
port 44417.
14/09/19 11:24:32 INFO SparkEnv: Registering MapOutputTracker
14/09/19 11:24:32 INFO SparkEnv: Registering BlockManagerMaster
14/09/19 11:24:32 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140919112432-527c
14/09/19 11:24:32 INFO Utils: Successfully started service 'Connection manager 
for block manager' on port 44978.
14/09/19 11:24:32 INFO ConnectionManager: Bound socket to port 44978 with id = 
ConnectionManagerId(laptop-x.local,44978)
14/09/19 11:24:32 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/09/19 11:24:32 INFO BlockManagerMaster: Trying to register BlockManager
14/09/19 11:24:32 INFO BlockManagerMasterActor: Registering block manager 
laptop-x.local:44978 with 265.4 MB RAM
14/09/19 11:24:32 INFO BlockManagerMaster: Registered BlockManager
14/09/19 11:24:32 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-4168e04d-508f-4f3b-92b4-050ecb47dfc7
14/09/19 11:24:32 INFO HttpServer: Starting HTTP Server
14/09/19 11:24:32 INFO Utils: Successfully started service 'HTTP file server' 
on port 54892.
14/09/19 11:24:32 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
14/09/19 11:24:32 INFO SparkUI: Started SparkUI at 
http://laptop-x.local:4040
14/09/19 11:24:33 INFO SparkContext: Added JAR 
file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
 at http://192.168.2.2:54892/jars/spark-examples-1.1.0-hadoop1.0.4.jar with 
timestamp 148673018
14/09/19 11:24:33 INFO Utils: Copying 
/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/examples/src/main/python/cassandra_inputformat.py
 to /tmp/spark-be9320ce-82f7-437d-af36-a31b6f7375be/cassandra_inputformat.py
14/09/19 11:24:33 INFO SparkContext: Added file 
file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/examples/src/main/python/cassandra_inputformat.py
 at http://192.168.2.2:54892/files/cassandra_inputformat.py with timestamp 
148673019
14/09/19 11:24:33 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@laptop-x.local:44417/user/HeartbeatReceiver
14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(34980) called with 
curMem=0, maxMem=278302556
14/09/19 11:24:33 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 34.2 KB, free 265.4 MB)
14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(34980) called with 
curMem=34980, maxMem=278302556
14/09/19 11:24:33 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 34.2 KB, free 265.3 MB)
14/09/19 11:24:33 INFO Converter: Loaded converter: 
org.apache.spark.examples.pythonconverters.CassandraCQLKeyConverter
14/09/19 11:24:33 INFO Converter: Loaded converter: 
org.apache.spark.examples.pythonconverters.CassandraCQLValueConverter
14/09/19 11:24:33 INFO SparkContext: Starting job: first at SerDeUtil.scala:70
14/09/19 11:24:33 INFO DAGScheduler: Got job 0 (first at SerDeUtil.scala:70) 
with 1 output partitions (allowLocal=true)
14/09/19 11:24:33 INFO DAGScheduler: Final stage: Stage 0(first at 
SerDeUtil.scala:70)
14/09/19 11:24:33 INFO DAGScheduler: Parents of final stage: List()
14/09/19 11:24:33 INFO DAGScheduler: Missing parents: List()
14/09/19 11:24:33 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at 
PythonHadoopUtil.scala:185), which has no missing parents
14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(2440) called with 
curM

[jira] [Commented] (SPARK-3536) SELECT on empty parquet table throws exception

2014-09-19 Thread Isaias Barroso (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140239#comment-14140239
 ] 

Isaias Barroso commented on SPARK-3536:
---

Ravindra Pesala, I've fixed it on a branch and ran the test suite for Parquet. 
If someone hasn't started on a fix, I can submit a PR. If that's OK, please 
assign me to this issue.

> SELECT on empty parquet table throws exception
> --
>
> Key: SPARK-3536
> URL: https://issues.apache.org/jira/browse/SPARK-3536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>  Labels: starter
>
> Reported by [~matei].  Reproduce as follows:
> {code}
> scala> case class Data(i: Int)
> defined class Data
> scala> createParquetFile[Data]("testParquet")
> scala> parquetFile("testParquet").count()
> 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due 
> to exception - job: 0
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438)
>   at 
> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
>   at 
> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3602) Can't run cassandra_inputformat.py

2014-09-19 Thread Frens Jan Rumph (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140236#comment-14140236
 ] 

Frens Jan Rumph edited comment on SPARK-3602 at 9/19/14 9:26 AM:
-

When running this against the spark-1.1.0-bin-hadoop build I get the following 
output:

{noformat}
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/09/19 11:24:31 WARN Utils: Your hostname, laptop-x resolves to a 
loopback address: 127.0.0.1; using 192.168.2.2 instead (on interface wlan0)
14/09/19 11:24:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
14/09/19 11:24:31 INFO SecurityManager: Changing view acls to: frens-jan,
14/09/19 11:24:31 INFO SecurityManager: Changing modify acls to: frens-jan,
14/09/19 11:24:31 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(frens-jan, ); 
users with modify permissions: Set(frens-jan, )
14/09/19 11:24:31 INFO Slf4jLogger: Slf4jLogger started
14/09/19 11:24:31 INFO Remoting: Starting remoting
14/09/19 11:24:32 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@laptop-x.local:44417]
14/09/19 11:24:32 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkDriver@laptop-x.local:44417]
14/09/19 11:24:32 INFO Utils: Successfully started service 'sparkDriver' on 
port 44417.
14/09/19 11:24:32 INFO SparkEnv: Registering MapOutputTracker
14/09/19 11:24:32 INFO SparkEnv: Registering BlockManagerMaster
14/09/19 11:24:32 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140919112432-527c
14/09/19 11:24:32 INFO Utils: Successfully started service 'Connection manager 
for block manager' on port 44978.
14/09/19 11:24:32 INFO ConnectionManager: Bound socket to port 44978 with id = 
ConnectionManagerId(laptop-x.local,44978)
14/09/19 11:24:32 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/09/19 11:24:32 INFO BlockManagerMaster: Trying to register BlockManager
14/09/19 11:24:32 INFO BlockManagerMasterActor: Registering block manager 
laptop-x.local:44978 with 265.4 MB RAM
14/09/19 11:24:32 INFO BlockManagerMaster: Registered BlockManager
14/09/19 11:24:32 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-4168e04d-508f-4f3b-92b4-050ecb47dfc7
14/09/19 11:24:32 INFO HttpServer: Starting HTTP Server
14/09/19 11:24:32 INFO Utils: Successfully started service 'HTTP file server' 
on port 54892.
14/09/19 11:24:32 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
14/09/19 11:24:32 INFO SparkUI: Started SparkUI at 
http://laptop-x.local:4040
14/09/19 11:24:33 INFO SparkContext: Added JAR 
file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
 at http://192.168.2.2:54892/jars/spark-examples-1.1.0-hadoop1.0.4.jar with 
timestamp 148673018
14/09/19 11:24:33 INFO Utils: Copying 
/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/examples/src/main/python/cassandra_inputformat.py
 to /tmp/spark-be9320ce-82f7-437d-af36-a31b6f7375be/cassandra_inputformat.py
14/09/19 11:24:33 INFO SparkContext: Added file 
file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/examples/src/main/python/cassandra_inputformat.py
 at http://192.168.2.2:54892/files/cassandra_inputformat.py with timestamp 
148673019
14/09/19 11:24:33 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@laptop-x.local:44417/user/HeartbeatReceiver
14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(34980) called with 
curMem=0, maxMem=278302556
14/09/19 11:24:33 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 34.2 KB, free 265.4 MB)
14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(34980) called with 
curMem=34980, maxMem=278302556
14/09/19 11:24:33 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 34.2 KB, free 265.3 MB)
14/09/19 11:24:33 INFO Converter: Loaded converter: 
org.apache.spark.examples.pythonconverters.CassandraCQLKeyConverter
14/09/19 11:24:33 INFO Converter: Loaded converter: 
org.apache.spark.examples.pythonconverters.CassandraCQLValueConverter
14/09/19 11:24:33 INFO SparkContext: Starting job: first at SerDeUtil.scala:70
14/09/19 11:24:33 INFO DAGScheduler: Got job 0 (first at SerDeUtil.scala:70) 
with 1 output partitions (allowLocal=true)
14/09/19 11:24:33 INFO DAGScheduler: Final stage: Stage 0(first at 
SerDeUtil.scala:70)
14/09/19 11:24:33 INFO DAGScheduler: Parents of final stage: List()
14/09/19 11:24:33 INFO DAGScheduler: Missing parents: List()
14/09/19 11:24:33 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at 
PythonHadoopUtil.scala:185), which has no missing parents
14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(2440) called with 
curM

[jira] [Commented] (SPARK-3602) Can't run cassandra_inputformat.py

2014-09-19 Thread Frens Jan Rumph (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140236#comment-14140236
 ] 

Frens Jan Rumph commented on SPARK-3602:


When running this against the spark-1.1.0-bin-hadoop build I get the following 
output:

{noformat}
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/09/19 11:24:31 WARN Utils: Your hostname, laptop-x resolves to a 
loopback address: 127.0.0.1; using 192.168.2.2 instead (on interface wlan0)
14/09/19 11:24:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
14/09/19 11:24:31 INFO SecurityManager: Changing view acls to: frens-jan,
14/09/19 11:24:31 INFO SecurityManager: Changing modify acls to: frens-jan,
14/09/19 11:24:31 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(frens-jan, ); 
users with modify permissions: Set(frens-jan, )
14/09/19 11:24:31 INFO Slf4jLogger: Slf4jLogger started
14/09/19 11:24:31 INFO Remoting: Starting remoting
14/09/19 11:24:32 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@laptop-x.local:44417]
14/09/19 11:24:32 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkDriver@laptop-x.local:44417]
14/09/19 11:24:32 INFO Utils: Successfully started service 'sparkDriver' on 
port 44417.
14/09/19 11:24:32 INFO SparkEnv: Registering MapOutputTracker
14/09/19 11:24:32 INFO SparkEnv: Registering BlockManagerMaster
14/09/19 11:24:32 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140919112432-527c
14/09/19 11:24:32 INFO Utils: Successfully started service 'Connection manager 
for block manager' on port 44978.
14/09/19 11:24:32 INFO ConnectionManager: Bound socket to port 44978 with id = 
ConnectionManagerId(laptop-x.local,44978)
14/09/19 11:24:32 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/09/19 11:24:32 INFO BlockManagerMaster: Trying to register BlockManager
14/09/19 11:24:32 INFO BlockManagerMasterActor: Registering block manager 
laptop-x.local:44978 with 265.4 MB RAM
14/09/19 11:24:32 INFO BlockManagerMaster: Registered BlockManager
14/09/19 11:24:32 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-4168e04d-508f-4f3b-92b4-050ecb47dfc7
14/09/19 11:24:32 INFO HttpServer: Starting HTTP Server
14/09/19 11:24:32 INFO Utils: Successfully started service 'HTTP file server' 
on port 54892.
14/09/19 11:24:32 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
14/09/19 11:24:32 INFO SparkUI: Started SparkUI at 
http://laptop-x.local:4040
14/09/19 11:24:33 INFO SparkContext: Added JAR 
file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
 at http://192.168.2.2:54892/jars/spark-examples-1.1.0-hadoop1.0.4.jar with 
timestamp 148673018
14/09/19 11:24:33 INFO Utils: Copying 
/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/examples/src/main/python/cassandra_inputformat.py
 to /tmp/spark-be9320ce-82f7-437d-af36-a31b6f7375be/cassandra_inputformat.py
14/09/19 11:24:33 INFO SparkContext: Added file 
file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/examples/src/main/python/cassandra_inputformat.py
 at http://192.168.2.2:54892/files/cassandra_inputformat.py with timestamp 
148673019
14/09/19 11:24:33 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@laptop-x.local:44417/user/HeartbeatReceiver
14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(34980) called with 
curMem=0, maxMem=278302556
14/09/19 11:24:33 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 34.2 KB, free 265.4 MB)
14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(34980) called with 
curMem=34980, maxMem=278302556
14/09/19 11:24:33 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 34.2 KB, free 265.3 MB)
14/09/19 11:24:33 INFO Converter: Loaded converter: 
org.apache.spark.examples.pythonconverters.CassandraCQLKeyConverter
14/09/19 11:24:33 INFO Converter: Loaded converter: 
org.apache.spark.examples.pythonconverters.CassandraCQLValueConverter
14/09/19 11:24:33 INFO SparkContext: Starting job: first at SerDeUtil.scala:70
14/09/19 11:24:33 INFO DAGScheduler: Got job 0 (first at SerDeUtil.scala:70) 
with 1 output partitions (allowLocal=true)
14/09/19 11:24:33 INFO DAGScheduler: Final stage: Stage 0(first at 
SerDeUtil.scala:70)
14/09/19 11:24:33 INFO DAGScheduler: Parents of final stage: List()
14/09/19 11:24:33 INFO DAGScheduler: Missing parents: List()
14/09/19 11:24:33 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at 
PythonHadoopUtil.scala:185), which has no missing parents
14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(2440) called with 
curMem=69960, maxMem=278302556
14/09/19 11:24:33 INFO

[jira] [Commented] (SPARK-3573) Dataset

2014-09-19 Thread Egor Pakhomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140206#comment-14140206
 ] 

Egor Pakhomov commented on SPARK-3573:
--

Can you tell us more about your motivation for using SchemaRDD as the class for 
Dataset? It seems to me that there are some problems with such a decision:
1) Dataset has no connection with SQL, rows, or relational stuff. Making these 
two very different topics coexist in one class is quite dangerous.
2) The only thing I see that SchemaRDD and a dataset have in common is some 
structure inside the RDD. But this structure is very different - one is about 
rows, the other is about features and labels. Not to mention collaborative 
filtering algorithms, which could contain a very different structure. 

I can't see what motivates us to use SchemaRDD as the abstraction for a dataset. 

> Dataset
> ---
>
> Key: SPARK-3573
> URL: https://issues.apache.org/jira/browse/SPARK-3573
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra 
> ML-specific metadata embedded in its schema.
> .Sample code
> Suppose we have training events stored on HDFS and user/ad features in Hive, 
> we want to assemble features for training and then apply decision tree.
> The proposed pipeline with dataset looks like the following (need more 
> refinements):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 
> 0.01).registerTempTable("event")
> val training = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, 
> event.action AS label,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""").cache()
> val indexer = new Indexer()
> val interactor = new Interactor()
> val fvAssembler = new FeatureVectorAssembler()
> val treeClassifer = new DecisionTreeClassifer()
> val paramMap = new ParamMap()
>   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
>   .put(indexer.sortByFrequency, true)
>   .put(iteractor.features, Map("genderMatch" -> Array("userGender", 
> "targetGender")))
>   .put(fvAssembler.features, Map("features" -> Array("genderMatch", 
> "userCountryIndex", "userFeatures")))
>   .put(fvAssembler.dense, true)
>   .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes 
> "features" and "label" columns.
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, 
> treeClassifier)
> val model = pipeline.fit(raw, paramMap)
> sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
> val test = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>  user.gender AS userGender, user.country AS userCountry, 
> user.features AS userFeatures,
>  ad.targetGender AS targetGender
> FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = 
> ad.id;""")
> val prediction = model.transform(test).select('eventId, 'prediction)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3602) Can't run cassandra_inputformat.py

2014-09-19 Thread Frens Jan Rumph (JIRA)
Frens Jan Rumph created SPARK-3602:
--

 Summary: Can't run cassandra_inputformat.py
 Key: SPARK-3602
 URL: https://issues.apache.org/jira/browse/SPARK-3602
 Project: Spark
  Issue Type: Bug
  Components: Examples, PySpark
Affects Versions: 1.1.0
 Environment: Ubuntu 14.04
Reporter: Frens Jan Rumph


When I execute:
{noformat}
wget http://apache.cs.uu.nl/dist/spark/spark-1.1.0/spark-1.1.0-bin-hadoop2.4.tgz
tar xzf spark-1.1.0-bin-hadoop2.4.tgz
cd spark-1.1.0-bin-hadoop2.4/
./bin/spark-submit --jars lib/spark-examples-1.1.0-hadoop2.4.0.jar 
examples/src/main/python/cassandra_inputformat.py localhost keyspace cf
{noformat}

The output is:
{noformat}
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/09/19 10:41:10 WARN Utils: Your hostname, laptop-x resolves to a 
loopback address: 127.0.0.1; using 192.168.2.2 instead (on interface wlan0)
14/09/19 10:41:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
14/09/19 10:41:10 INFO SecurityManager: Changing view acls to: frens-jan,
14/09/19 10:41:10 INFO SecurityManager: Changing modify acls to: frens-jan,
14/09/19 10:41:10 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(frens-jan, ); 
users with modify permissions: Set(frens-jan, )
14/09/19 10:41:11 INFO Slf4jLogger: Slf4jLogger started
14/09/19 10:41:11 INFO Remoting: Starting remoting
14/09/19 10:41:11 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@laptop-x.local:43790]
14/09/19 10:41:11 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkDriver@laptop-x.local:43790]
14/09/19 10:41:11 INFO Utils: Successfully started service 'sparkDriver' on 
port 43790.
14/09/19 10:41:11 INFO SparkEnv: Registering MapOutputTracker
14/09/19 10:41:11 INFO SparkEnv: Registering BlockManagerMaster
14/09/19 10:41:11 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140919104111-145e
14/09/19 10:41:11 INFO Utils: Successfully started service 'Connection manager 
for block manager' on port 45408.
14/09/19 10:41:11 INFO ConnectionManager: Bound socket to port 45408 with id = 
ConnectionManagerId(laptop-x.local,45408)
14/09/19 10:41:11 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/09/19 10:41:11 INFO BlockManagerMaster: Trying to register BlockManager
14/09/19 10:41:11 INFO BlockManagerMasterActor: Registering block manager 
laptop-x.local:45408 with 265.4 MB RAM
14/09/19 10:41:11 INFO BlockManagerMaster: Registered BlockManager
14/09/19 10:41:11 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-5f0289d7-9b20-4bd7-a713-db84c38c4eac
14/09/19 10:41:11 INFO HttpServer: Starting HTTP Server
14/09/19 10:41:11 INFO Utils: Successfully started service 'HTTP file server' 
on port 36556.
14/09/19 10:41:11 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
14/09/19 10:41:11 INFO SparkUI: Started SparkUI at 
http://laptop-frens-jan.local:4040
14/09/19 10:41:12 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/09/19 10:41:12 INFO SparkContext: Added JAR 
file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/lib/spark-examples-1.1.0-hadoop2.4.0.jar
 at http://192.168.2.2:36556/jars/spark-examples-1.1.0-hadoop2.4.0.jar with 
timestamp 146072417
14/09/19 10:41:12 INFO Utils: Copying 
/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_inputformat.py
 to /tmp/spark-7dbb1b4d-016c-4f8b-858d-f79c9297f58f/cassandra_inputformat.py
14/09/19 10:41:12 INFO SparkContext: Added file 
file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_inputformat.py
 at http://192.168.2.2:36556/files/cassandra_inputformat.py with timestamp 
146072419
14/09/19 10:41:12 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@laptop-frens-jan.local:43790/user/HeartbeatReceiver
14/09/19 10:41:12 INFO MemoryStore: ensureFreeSpace(167659) called with 
curMem=0, maxMem=278302556
14/09/19 10:41:12 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 163.7 KB, free 265.3 MB)
14/09/19 10:41:12 INFO MemoryStore: ensureFreeSpace(167659) called with 
curMem=167659, maxMem=278302556
14/09/19 10:41:12 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 163.7 KB, free 265.1 MB)
14/09/19 10:41:12 INFO Converter: Loaded converter: 
org.apache.spark.examples.pythonconverters.CassandraCQLKeyConverter
14/09/19 10:41:12 INFO Converter: Loaded converter: 
org.apache.spark.examples.pythonconverters.CassandraCQLValueConverter
Traceback (most recent call last):
  File 
"/home/frens-jan/Desktop/spark-1.1.0-bin-had

[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-09-19 Thread Gaurav Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140160#comment-14140160
 ] 

Gaurav Mishra commented on SPARK-3434:
--

A matrix being represented by multiple RDDs of sub-matrices may be helpful when 
an operation on the matrix requires computation over only a small set of its 
sub-matrices. However, operations like matrix multiplication require 
computation over all elements of the matrix (i.e. all elements need to be 
read). Therefore, at least in the case of matrix multiplication, keeping a 
single RDD seems to be the better idea; keeping multiple RDDs in that case would 
only burden us further with the task of keeping track of all the sub-matrices.
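
For what it's worth, here is a rough sketch of why a single RDD keyed by block coordinates is convenient for multiplication: C(i, k) = sum over j of A(i, j) * B(j, k), which maps directly onto a join on the shared index j followed by a reduce over the output coordinate. This is only an illustration assuming Breeze dense blocks, not a proposed API; block sizes and partitioning are ignored.

{code}
import breeze.linalg.DenseMatrix

import org.apache.spark.SparkContext._ // pair-RDD operations (join, reduceByKey) in Spark 1.x
import org.apache.spark.rdd.RDD

object BlockMultiplySketch {
  // One RDD per matrix, keyed by block coordinates: ((blockRow, blockCol), sub-matrix).
  def multiply(a: RDD[((Int, Int), DenseMatrix[Double])],
               b: RDD[((Int, Int), DenseMatrix[Double])]): RDD[((Int, Int), DenseMatrix[Double])] = {
    val aByCol = a.map { case ((i, j), block) => (j, (i, block)) } // key A blocks by their column index j
    val bByRow = b.map { case ((j, k), block) => (j, (k, block)) } // key B blocks by their row index j
    aByCol.join(bByRow)                                            // pair every A(i, j) with every B(j, k)
      .map { case (_, ((i, blockA), (k, blockB))) => ((i, k), blockA * blockB) }
      .reduceByKey(_ + _)                                          // sum partial products into C(i, k)
  }
}
{code}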

> Distributed block matrix
> 
>
> Key: SPARK-3434
> URL: https://issues.apache.org/jira/browse/SPARK-3434
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> This JIRA is for discussing distributed matrices stored in block 
> sub-matrices. The main challenge is the partitioning scheme to allow adding 
> linear algebra operations in the future, e.g.:
> 1. matrix multiplication
> 2. matrix factorization (QR, LU, ...)
> Let's discuss the partitioning and storage and how they fit into the above 
> use cases.
> Questions:
> 1. Should it be backed by a single RDD that contains all of the sub-matrices 
> or many RDDs with each contains only one sub-matrix?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3536) SELECT on empty parquet table throws exception

2014-09-19 Thread Ravindra Pesala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140145#comment-14140145
 ] 

Ravindra Pesala commented on SPARK-3536:


Parquet returns null metadata when querying an empty parquet file while 
calculating splits. Adding a null check and returning empty splits 
solves the issue.
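
A minimal sketch of that guard (illustrative names only, not the actual FilteringParquetRowInputFormat code): if the Parquet footer/metadata list comes back null or empty, return an empty split list instead of dereferencing it.

{code}
import java.util.{ArrayList => JArrayList, List => JList}

import org.apache.hadoop.mapreduce.InputSplit

// Sketch only: short-circuit split computation when the Parquet metadata
// (footers) for an empty table is null or empty, so a SELECT/count() on an
// empty table yields zero splits instead of a NullPointerException.
def splitsOrEmpty[F](footers: JList[F])(compute: JList[F] => JList[InputSplit]): JList[InputSplit] =
  if (footers == null || footers.isEmpty) new JArrayList[InputSplit]()
  else compute(footers)
{code}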

> SELECT on empty parquet table throws exception
> --
>
> Key: SPARK-3536
> URL: https://issues.apache.org/jira/browse/SPARK-3536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>  Labels: starter
>
> Reported by [~matei].  Reproduce as follows:
> {code}
> scala> case class Data(i: Int)
> defined class Data
> scala> createParquetFile[Data]("testParquet")
> scala> parquetFile("testParquet").count()
> 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due 
> to exception - job: 0
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438)
>   at 
> parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
>   at 
> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.

2014-09-19 Thread mohan gaddam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mohan gaddam updated SPARK-3601:

Description: 
The Kryo serializer works well when Avro objects contain simple data, but when the same 
Avro object contains complex data (like unions/arrays), Kryo fails during output 
operations, while map operations work fine. Note that I have registered all the 
Avro-generated classes with Kryo. I am using Java as the programming language.

When a complex message is used, an NPE is thrown; the stack trace is as follows:
==
ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 
ms.0 
org.apache.spark.SparkException: Job aborted due to stage failure: Exception 
while getting task result: com.esotericsoftware.kryo.KryoException: 
java.lang.NullPointerException 
Serialization trace: 
value (xyz.Datum) 
data (xyz.ResMsg) 
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
 
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 
at scala.Option.foreach(Option.scala:236) 
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
 
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
 
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) 
at akka.actor.ActorCell.invoke(ActorCell.scala:456) 
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) 
at akka.dispatch.Mailbox.run(Mailbox.scala:219) 
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 


In the above exception, Datum and ResMsg are project-specific classes generated 
by Avro from the AVDL snippet below:
==
record KeyValueObject { 
union{boolean, int, long, float, double, bytes, string} name; 
union {boolean, int, long, float, double, bytes, string, 
array, 
KeyValueObject} value; 
} 
record Datum { 
union {boolean, int, long, float, double, bytes, string, 
array, 
KeyValueObject} value; 
} 
record ResMsg { 
string version; 
string sequence; 
string resourceGUID; 
string GWID; 
string GWTimestamp; 
union {Datum, array} data; 
}

avro message samples are as follows:

1)
{"version": "01", "sequence": "1", "resourceGUID": "001", "GWID": "002", 
"GWTimestamp": "1409823150737", "data": {"value": "30"}} 
2)
{"version": "01", "sequence": "1", "resource": "sensor-001", "controller": 
"002", "controllerTimestamp": "1411038710358", "data": {"value": [ {"name": 
"Temperature", "value": "30"},{"name": "Speed", "value": "60"},{"name": 
"Location", "value": ["+401213.1", "-0750015.1"]},{"name": "Timestamp", 
"value": "2014-09-09T08:15:25-05:00"}]}} 

Both 1 and 2 adhere to the Avro schema, so the decoder is able to convert them into 
Avro objects in the Spark Streaming API.
By the way, the messages were pulled from a Kafka source and decoded using a Kafka 
decoder.

  was:
Kryo serializer works well when avro objects has simple data. but when the same 
avro object has complex data(like unions/arrays) kryo fails while output 
operations. but mappings are good. Note that i have registered all the Avro 
generated classes with kryo. Im using Java as programming language.

when used complex message throws NPE, stack trace as follows:
==
ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 
ms.0 
org.apache.spark.SparkException: Job aborted due to stage failure: Exception 
while getting task result: com.esotericsoftware.kryo.KryoException: 
java.lang.NullPointerException 
Serialization trace: 
value (xyz.Datum) 
data (xyz.ResMsg) 
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
 
at 
org.apache.spark.schedul


[jira] [Created] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.

2014-09-19 Thread mohan gaddam (JIRA)
mohan gaddam created SPARK-3601:
---

 Summary: Kryo NPE for output operations on Avro complex Objects 
even after registering.
 Key: SPARK-3601
 URL: https://issues.apache.org/jira/browse/SPARK-3601
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: local, standalone cluster
Reporter: mohan gaddam


The Kryo serializer works well when Avro objects contain simple data, but when the same 
Avro object contains complex data (like unions/arrays), Kryo fails during output 
operations, while map operations work fine.

Note that I have registered all the Avro-generated classes with Kryo.
I am using Java as the programming language.

When a complex message is used, an NPE is thrown; the stack trace is as follows:

ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 
ms.0 
org.apache.spark.SparkException: Job aborted due to stage failure: Exception 
while getting task result: com.esotericsoftware.kryo.KryoException: 
java.lang.NullPointerException 
Serialization trace: 
value (xyz.Datum) 
data (xyz.ResourceMessage) 
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
 
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 
at scala.Option.foreach(Option.scala:236) 
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
 
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
 
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) 
at akka.actor.ActorCell.invoke(ActorCell.scala:456) 
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) 
at akka.dispatch.Mailbox.run(Mailbox.scala:219) 
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 


In the above exception, Datum and ResourceMessage are project-specific classes 
generated by Avro from the AVDL snippet below:
==
record KeyValueObject { 
union{boolean, int, long, float, double, bytes, string} name; 
union {boolean, int, long, float, double, bytes, string, 
array, 
KeyValueObject} value; 
} 
record Datum { 
union {boolean, int, long, float, double, bytes, string, 
array, 
KeyValueObject} value; 
} 
record ResourceMessage { 
string version; 
string sequence; 
string resourceGUID; 
string GWID; 
string GWTimestamp; 
union {Datum, array} data; 
}

avro message samples are as follows:

1)
{"version": "01", "sequence": "1", "resourceGUID": "001", "GWID": "002", 
"GWTimestamp": "1409823150737", "data": {"value": "30"}} 
2)
{"version": "01", "sequence": "1", "resource": "sensor-001", "controller": 
"002", "controllerTimestamp": "1411038710358", "data": {"value": [{"name": 
"Temperature", "value": "30"}, {"name": "Speed", "value": "60"}, {"name": 
"Location", "value": ["+401213.1", "-0750015.1"]}, {"name": "Timestamp", 
"value": "2014-09-09T08:15:25-05:00"}]}} 

Both 1 and 2 adhere to the Avro schema, so the decoder is able to convert them into 
Avro objects in the Spark Streaming API.

By the way, the messages were pulled from a Kafka source and decoded using a Kafka 
decoder.
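
For reference, the registration described above would typically look like the sketch below (Spark 1.x style, in Scala rather than the reporter's Java; xyz.ResMsg, xyz.Datum and xyz.KeyValueObject are the reporter's generated classes, so this only compiles under that assumption). A plain registration like this is what the report says was already done; the NPE on nested union/array members suggests those generated member classes may additionally need a custom Kryo serializer.

{code}
import com.esotericsoftware.kryo.Kryo

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Sketch only: register the Avro-generated classes named in the report with Kryo.
// xyz.ResMsg, xyz.Datum and xyz.KeyValueObject are the reporter's own classes and
// must be on the classpath for this to compile.
class AvroKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[xyz.ResMsg])
    kryo.register(classOf[xyz.Datum])
    kryo.register(classOf[xyz.KeyValueObject])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[AvroKryoRegistrator].getName)
{code}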



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-19 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138829#comment-14138829
 ] 

Alexander Ulanov edited comment on SPARK-3403 at 9/19/14 7:16 AM:
--

Thank you, your answers are really helpful. Should I submit this issue to 
OpenBLAS ( https://github.com/xianyi/OpenBLAS ) or netlib-java ( 
https://github.com/fommil/netlib-java )? I thought the latter has the JNI 
implementation. Is it OK to submit it as is?


was (Author: avulanov):
Thank you, your answers are really helpful. Should I submit this issue to 
OpenBLAS (https://github.com/xianyi/OpenBLAS) or netlib-java 
(https://github.com/fommil/netlib-java)? I thought the latter has jni 
implementation. I it ok to submit it as is?
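
As a quick sanity check when reporting, the concrete BLAS backend that netlib-java resolved at runtime can be printed as below (a small sketch; with working natives the class name is typically NativeSystemBLAS or NativeRefBLAS, otherwise the pure-Java F2jBLAS fallback):

{code}
import com.github.fommil.netlib.BLAS

// Print which BLAS implementation netlib-java actually loaded.
println(BLAS.getInstance().getClass.getName)
{code}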

> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.2.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC0000005)" after displaying the first prediction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-19 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140128#comment-14140128
 ] 

Alexander Ulanov commented on SPARK-3403:
-

Posted to netlib-java: https://github.com/fommil/netlib-java/issues/72

> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Fix For: 1.2.0
>
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC0000005)" after displaying the first prediction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables

2014-09-19 Thread Ravindra Pesala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140124#comment-14140124
 ] 

Ravindra Pesala commented on SPARK-3298:


I guess we should add an API like *SQLContext.isTableExists(tableName)* to 
check whether the table already exists. Using this API, a user could check for 
the table's existence and then register the table.
The current API *SQLContext.table(tableName)* throws an exception if the table is 
not present, so we cannot use it for this purpose. Please comment on this.
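
Until such an API exists, a user-side workaround is to probe through the existing call and treat the exception as "not registered" (a minimal sketch, relying only on the behaviour of SQLContext.table described above):

{code}
import scala.util.Try

import org.apache.spark.sql.SQLContext

// Sketch only: emulate an isTableExists check on top of the current API.
// SQLContext.table(name) throws if the table is not registered, so wrapping
// it in Try gives a boolean answer without a dedicated API.
def tableExists(sqlContext: SQLContext, name: String): Boolean =
  Try(sqlContext.table(name)).isSuccess
{code}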


> [SQL] registerAsTable / registerTempTable overwrites old tables
> ---
>
> Key: SPARK-3298
> URL: https://issues.apache.org/jira/browse/SPARK-3298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.2
>Reporter: Evan Chan
>Priority: Minor
>  Labels: newbie
>
> At least in Spark 1.0.2,  calling registerAsTable("a") when "a" had been 
> registered before does not cause an error.  However, there is no way to 
> access the old table, even though it may be cached and taking up space.
> How about at least throwing an error?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching

2014-09-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3600:
-
Target Version/s:   (was: 1.1.1, 1.2.0)

> RDD[Double] doesn't use primitive arrays for caching
> 
>
> Key: SPARK-3600
> URL: https://issues.apache.org/jira/browse/SPARK-3600
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> RDD's classTag is not passed in through CacheManager. So RDD[Double] uses 
> object arrays for caching, which leads to huge overhead. However, we need to 
> send the classTag down many levels to make it work.
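
A Spark-free illustration of the overhead being discussed (a rough sketch; exact sizes depend on the JVM): boxing each double adds an object header plus a reference per element, which is the difference between caching a partition as an Array[Double] versus an array of boxed objects.

{code}
// Roughly 8 bytes of payload per element as primitives...
val primitives: Array[Double] = Array.fill(1000000)(1.0)
// ...versus a reference plus a boxed java.lang.Double object per element when
// the element type is not known to be primitive (the object-array case the
// issue describes for cached RDD[Double] partitions).
val boxed: Array[java.lang.Double] = primitives.map(Double.box)
println(s"${primitives.length} primitive doubles vs ${boxed.length} boxed doubles")
{code}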



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching

2014-09-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3600:
-
Assignee: (was: Xiangrui Meng)

> RDD[Double] doesn't use primitive arrays for caching
> 
>
> Key: SPARK-3600
> URL: https://issues.apache.org/jira/browse/SPARK-3600
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>
> RDD's classTag is not passed in through CacheManager. So RDD[Double] uses 
> object arrays for caching, which leads to huge overhead. However, we need to 
> send the classTag down many levels to make it work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching

2014-09-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3600:
-
Component/s: (was: MLlib)

> RDD[Double] doesn't use primitive arrays for caching
> 
>
> Key: SPARK-3600
> URL: https://issues.apache.org/jira/browse/SPARK-3600
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> RDD's classTag is not passed in through CacheManager. So RDD[Double] uses 
> object arrays for caching, which leads to huge overhead. However, we need to 
> send the classTag down many levels to make it work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching

2014-09-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3600:
-
Issue Type: Improvement  (was: Bug)

> RDD[Double] doesn't use primitive arrays for caching
> 
>
> Key: SPARK-3600
> URL: https://issues.apache.org/jira/browse/SPARK-3600
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> RDD's classTag is not passed in through CacheManager. So RDD[Double] uses 
> object arrays for caching, which leads to huge overhead. However, we need to 
> send the classTag down many levels to make it work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching

2014-09-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3600:
-
Description: RDD's classTag is not passed in through CacheManager. So 
RDD[Double] uses object arrays for caching, which leads to huge overhead. 
However, we need to send the classTag down many levels to make it work.  (was: 
RandomDataGenerator doesn't have a classTag or @specialized. So the generated 
RDDs are RDDs of objects, which causes huge storage overhead.)

> RDD[Double] doesn't use primitive arrays for caching
> 
>
> Key: SPARK-3600
> URL: https://issues.apache.org/jira/browse/SPARK-3600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> RDD's classTag is not passed in through CacheManager. So RDD[Double] uses 
> object arrays for caching, which leads to huge overhead. However, we need to 
> send the classTag down many levels to make it work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching

2014-09-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3600:
-
Summary: RDD[Double] doesn't use primitive arrays for caching  (was: 
RandomRDDs doesn't create primitive typed RDDs)

> RDD[Double] doesn't use primitive arrays for caching
> 
>
> Key: SPARK-3600
> URL: https://issues.apache.org/jira/browse/SPARK-3600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> RandomDataGenerator doesn't have a classTag or @specialized. So the generated 
> RDDs are RDDs of objects, which causes huge storage overhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org