[jira] [Updated] (SPARK-3581) RDD API(distinct/subtract) does not work for RDD of Dictionaries
[ https://issues.apache.org/jira/browse/SPARK-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Guo updated SPARK-3581: - Affects Version/s: 1.0.0 1.0.2 RDD API(distinct/subtract) does not work for RDD of Dictionaries Key: SPARK-3581 URL: https://issues.apache.org/jira/browse/SPARK-3581 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.2, 1.1.0 Environment: Spark 1.0/1.1, JDK 1.6 Reporter: Shawn Guo Priority: Minor
Construct an RDD of dictionaries (dictRDD) and try to use the RDD API RDD.distinct() or RDD.subtract().
{code:title=PySpark RDD API Test|borderStyle=solid}
dictRDD = sc.parallelize(({'MOVIE_ID': 1, 'MOVIE_NAME': 'Lord of the Rings', 'MOVIE_DIRECTOR': 'Peter Jackson'},
                          {'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'},
                          {'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'}))
dictRDD.distinct().collect()
dictRDD.subtract(dictRDD).collect()
{code}
Both calls fail with: TypeError: unhashable type: 'dict'. I'm not sure if this is a bug or the expected result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
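A note on the behavior above: Python dicts are unhashable, so any RDD operation that needs to hash elements (distinct, subtract, keys of reduceByKey, etc.) will raise this TypeError. A minimal workaround sketch, assuming the dictionary values are themselves hashable, is to map each record to a canonical tuple form first and convert back afterwards:
{code:title=Workaround sketch|borderStyle=solid}
# Convert each dict to a hashable, canonical representation (a sorted tuple
# of key/value pairs), run the set-like operation, then convert back.
def to_key(d):
    return tuple(sorted(d.items()))

def to_dict(t):
    return dict(t)

distinct_dicts = dictRDD.map(to_key).distinct().map(to_dict)
difference = dictRDD.map(to_key).subtract(dictRDD.map(to_key)).map(to_dict)
{code}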
[jira] [Commented] (SPARK-3321) Defining a class within python main script
[ https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138569#comment-14138569 ] Shawn Guo commented on SPARK-3321: -- No idea yet, I use --py-files Null.py instead. It seems to be a pickle library limitation that has existed for a long time. Defining a class within python main script -- Key: SPARK-3321 URL: https://issues.apache.org/jira/browse/SPARK-3321 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Environment: Python version 2.6.6 Spark version 1.0.1 jdk1.6.0_43 Reporter: Shawn Guo Priority: Minor *leftOuterJoin(self, other, numPartitions=None)* Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k. *Background*: leftOuterJoin will produce None elements in the result dataset. I define a new class 'Null' in the main script to replace every Python native None with a new 'Null' object. The 'Null' object overloads the [] operator.
{code:title=Class Null|borderStyle=solid}
class Null(object):
    def __getitem__(self, key): return None
    def __getstate__(self): pass
    def __setstate__(self, dict): pass

def convert_to_null(x):
    return Null() if x is None else x

X = A.leftOuterJoin(B)
X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
{code}
The code seems to run fine in the pyspark console, however spark-submit fails with the error messages below: /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] /tmp/python_test.py {noformat} File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 134, in _write_with_length serialized = self.dumps(obj) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 279, in dumps def dumps(self, obj): return cPickle.dumps(obj, 2) PicklingError: Can't pickle class '__main__.Null': attribute lookup __main__.Null failed org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229) at
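As the reporter's own comment above suggests (--py-files Null.py), the usual fix for this PicklingError is to move the class out of the __main__ script into its own module and ship that module to the executors, so that pickle can resolve the class by its qualified name. A rough sketch under that assumption (the module name null_helper is illustrative):
{code:title=null_helper.py (shipped with --py-files)|borderStyle=solid}
# Defined in a module rather than in __main__, so cPickle can look up
# null_helper.Null on the workers.
class Null(object):
    def __getitem__(self, key):
        return None

def convert_to_null(x):
    return Null() if x is None else x
{code}
{code:title=main script|borderStyle=solid}
# spark-submit --master local[2] --py-files null_helper.py python_test.py
from null_helper import Null, convert_to_null

# A and B are the pair RDDs from the original example above.
X = A.leftOuterJoin(B)
Y = X.mapValues(lambda pair: (pair[0], convert_to_null(pair[1])))
{code}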
[jira] [Created] (SPARK-3583) Spark run slow after unexpected repartition
ShiShu created SPARK-3583: - Summary: Spark run slow after unexpected repartition Key: SPARK-3583 URL: https://issues.apache.org/jira/browse/SPARK-3583 Project: Spark Issue Type: Bug Affects Versions: 0.9.1 Reporter: ShiShu Hi dear all~ My Spark application sometimes runs much slower than it used to, and I wonder why this happens. I found out that after an unexpected repartition at stage 17, all tasks go to one executor, but in my code I only use repartition at the very beginning. Before stage 17, every stage runs successfully within 1 minute; after stage 17, every stage takes more than 10 minutes. Normally my application runs successfully and finishes within 9 minutes. My Spark version is 0.9.1, and my program is written in Scala. I took some screenshots but don't know how to post them; please tell me if you need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3583) Spark run slow after unexpected repartition
[ https://issues.apache.org/jira/browse/SPARK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ShiShu updated SPARK-3583: -- Attachment: spark_q_006.jpg spark_q_005.jpg spark_q_004.jpg spark_q_001.jpg These are the screenshots I took. Spark run slow after unexpected repartition --- Key: SPARK-3583 URL: https://issues.apache.org/jira/browse/SPARK-3583 Project: Spark Issue Type: Bug Affects Versions: 0.9.1 Reporter: ShiShu Labels: easyfix Attachments: spark_q_001.jpg, spark_q_004.jpg, spark_q_005.jpg, spark_q_006.jpg Hi dear all~ My Spark application sometimes runs much slower than it used to, and I wonder why this happens. I found out that after an unexpected repartition at stage 17, all tasks go to one executor, but in my code I only use repartition at the very beginning. Before stage 17, every stage runs successfully within 1 minute; after stage 17, every stage takes more than 10 minutes. Normally my application runs successfully and finishes within 9 minutes. My Spark version is 0.9.1, and my program is written in Scala. I took some screenshots but don't know how to post them; please tell me if you need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
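Without the attached screenshots it is hard to diagnose further, but "all tasks go to one executor" after a shuffle usually means one partition holds nearly all of the data. A quick way to check for such skew, shown here as a PySpark sketch purely for illustration (the reporter's program is Scala, where equivalent RDD calls exist; suspect_rdd is a placeholder name):
{code}
# Count the records in each partition of the RDD feeding the slow stage;
# one huge count among many small ones confirms the skew described above.
sizes = suspect_rdd.glom().map(len).collect()
print(sizes)
{code}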
[jira] [Commented] (SPARK-3578) GraphGenerators.sampleLogNormal sometimes returns too-large result
[ https://issues.apache.org/jira/browse/SPARK-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138638#comment-14138638 ] Ankur Dave commented on SPARK-3578: --- [~pwendell] Sorry, I forgot to do that this time. GraphGenerators.sampleLogNormal sometimes returns too-large result -- Key: SPARK-3578 URL: https://issues.apache.org/jira/browse/SPARK-3578 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Reporter: Ankur Dave Assignee: Ankur Dave Priority: Minor GraphGenerators.sampleLogNormal is supposed to return an integer strictly less than maxVal. However, it violates this guarantee. It generates its return value as follows:
{code}
var X: Double = maxVal
while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
{code}
When X is sampled to be close to (but less than) maxVal, it will pass the while loop condition, but the rounded result will be equal to maxVal, which will fail the test. For example, if maxVal is 5 and X is 4.9, then X < maxVal, but math.round(X.toFloat) is 5. A solution is to round X down instead of to the nearest integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
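The boundary case is easy to reproduce with plain arithmetic; a small illustration (written in Python only for brevity) of why rounding down restores the strict upper bound:
{code}
import math

max_val = 5
x = 4.9                 # accepted by the sampling loop, since x < max_val

round(x)                # 5 -> equal to max_val, violating the guarantee
math.floor(x)           # 4 -> always strictly less than max_val
{code}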
[jira] [Resolved] (SPARK-1353) IllegalArgumentException when writing to disk
[ https://issues.apache.org/jira/browse/SPARK-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1353. Resolution: Duplicate IllegalArgumentException when writing to disk - Key: SPARK-1353 URL: https://issues.apache.org/jira/browse/SPARK-1353 Project: Spark Issue Type: Bug Components: Block Manager Environment: AWS EMR 3.2.30-49.59.amzn1.x86_64 #1 SMP x86_64 GNU/Linux Spark 1.0.0-SNAPSHOT built for Hadoop 1.0.4 built 2014-03-18 Reporter: Jim Blomo Priority: Minor The Executor may fail when trying to mmap a file bigger than Integer.MAX_VALUE due to the constraints of FileChannel.map (http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode, long, long)). The signature takes longs, but the size value must be less than MAX_VALUE. This manifests with the following backtrace: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:98) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:337) at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:281) at org.apache.spark.storage.BlockManager.get(BlockManager.scala:430) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:38) at org.apache.spark.rdd.RDD.iterator(RDD.scala:220) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3525) Gradient boosting in MLLib
[ https://issues.apache.org/jira/browse/SPARK-3525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138674#comment-14138674 ] Egor Pakhomov commented on SPARK-3525: -- https://github.com/apache/spark/pull/2394 Gradient boosting in MLLib -- Key: SPARK-3525 URL: https://issues.apache.org/jira/browse/SPARK-3525 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Egor Pakhomov Assignee: Egor Pakhomov Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3584) sbin/slaves doesn't work when we use password authentication for SSH
Kousuke Saruta created SPARK-3584: - Summary: sbin/slaves doesn't work when we use password authentication for SSH Key: SPARK-3584 URL: https://issues.apache.org/jira/browse/SPARK-3584 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta In sbin/slaves, the ssh command runs in the background, but if we use password authentication a backgrounded ssh command doesn't work, so sbin/slaves doesn't work. I also suggest an improvement for sbin/slaves: in the current implementation the slaves file is tracked by Git, but it can be edited by users, so we should prepare slaves.template instead of slaves. The default slaves file has one entry, localhost, so we should use localhost as the default host list. I modified sbin/slaves to choose localhost as the default host list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3584) sbin/slaves doesn't work when we use password authentication for SSH
[ https://issues.apache.org/jira/browse/SPARK-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138716#comment-14138716 ] Apache Spark commented on SPARK-3584: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2444 sbin/slaves doesn't work when we use password authentication for SSH Key: SPARK-3584 URL: https://issues.apache.org/jira/browse/SPARK-3584 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta In sbin/slaves, the ssh command runs in the background, but if we use password authentication a backgrounded ssh command doesn't work, so sbin/slaves doesn't work. I also suggest an improvement for sbin/slaves: in the current implementation the slaves file is tracked by Git, but it can be edited by users, so we should prepare slaves.template instead of slaves. The default slaves file has one entry, localhost, so we should use localhost as the default host list. I modified sbin/slaves to choose localhost as the default host list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3585) Probability Values
Tamilselvan Palani created SPARK-3585: - Summary: Probability Values Key: SPARK-3585 URL: https://issues.apache.org/jira/browse/SPARK-3585 Project: Spark Issue Type: Question Components: MLlib Affects Versions: 1.1.0 Environment: Development Reporter: Tamilselvan Palani Priority: Blocker Fix For: 1.1.0 Unable to get actual probability values BY DEFAULT along with the prediction while running Logistic Regression or Decision Tree. Please help with the following questions: 1. Is there an API that can be called to force the output to include probability values along with the predicted values? 2. Does it require customization? If so, in which package/class? Please let me know if I can provide any further information. Thanks and Regards, Tamilselvan Palani -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3585) Probability Values in Logistic Regression/Decision Tree output
[ https://issues.apache.org/jira/browse/SPARK-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tamilselvan Palani updated SPARK-3585: -- Summary: Probability Values in Logistic Regression/Decision Tree output (was: Probability Values in Logistic Regression output ) Probability Values in Logistic Regression/Decision Tree output --- Key: SPARK-3585 URL: https://issues.apache.org/jira/browse/SPARK-3585 Project: Spark Issue Type: Question Components: MLlib Affects Versions: 1.1.0 Environment: Development Reporter: Tamilselvan Palani Priority: Blocker Labels: newbie Fix For: 1.1.0 Original Estimate: 168h Remaining Estimate: 168h Unable to get actual probability values BY DEFAULT along with the prediction while running Logistic Regression or Decision Tree. Please help with the following questions: 1. Is there an API that can be called to force the output to include probability values along with the predicted values? 2. Does it require customization? If so, in which package/class? Please let me know if I can provide any further information. Thanks and Regards, Tamilselvan Palani -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
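For the binary logistic regression case, one way to obtain a probability rather than a 0/1 prediction is to compute the logistic function from the model's weights and intercept directly. A hedged PySpark sketch (lr_model and test are placeholder names, and it assumes the weight and feature vectors can be iterated like plain sequences):
{code}
import math

def probability(model, features):
    # Raw margin w.x + b pushed through the logistic function; this is the
    # class-1 probability for a binary logistic regression model.
    margin = sum(w * x for w, x in zip(model.weights, features)) + model.intercept
    return 1.0 / (1.0 + math.exp(-margin))

scores_and_labels = test.map(lambda p: (probability(lr_model, p.features), p.label))
{code}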
[jira] [Created] (SPARK-3586) spark streaming
wangxj created SPARK-3586: - Summary: spark streaming Key: SPARK-3586 URL: https://issues.apache.org/jira/browse/SPARK-3586 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: wangxj Priority: Minor Fix For: 1.1.0 For text files, the method is streamingContext.textFileStream(dataDirectory). Spark Streaming will monitor the directory dataDirectory and process any files created in that directory, but files written in nested directories are not supported, e.g. streamingContext.textFileStream(/test). Look at the directory contents:
/test/file1
/test/file2
/test/dr/file1
With this method, textFileStream can only read /test/file1, /test/file2 and the entry /test/dr/, but the file /test/dr/file1 is not read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
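Until nested directories are supported, one workaround is to create a separate file stream per known subdirectory and union them. A rough sketch, shown in PySpark purely for illustration (the PySpark streaming API shown here may not be available in the affected version; the paths are the example ones from the report):
{code}
# textFileStream only picks up files created directly under the monitored
# directory, so watch each subdirectory explicitly and union the streams.
dirs = ["/test", "/test/dr"]
streams = [ssc.textFileStream(d) for d in dirs]
lines = ssc.union(*streams)
{code}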
[jira] [Created] (SPARK-3587) Spark SQL can't support lead() over() window function
caoli created SPARK-3587: Summary: Spark SQL can't support lead() over() window function Key: SPARK-3587 URL: https://issues.apache.org/jira/browse/SPARK-3587 Project: Spark Issue Type: Bug Affects Versions: 1.1.0, 1.0.2 Environment: operating system is SUSE 11.3, three nodes, every node has 48 GB MEM and 16 CPUs Reporter: caoli When running the following SQL: select c.plateno, c.platetype, c.bay_id as start_bay_id, c.pastkkdate as start_time, lead(c.bay_id) over(partition by c.plateno order by c.pastkkdate) as end_bay_id, lead(c.pastkkdate) over(partition by c.plateno order by c.pastkkdate) as end_time from bay_car_test1 c in Hive, it is OK, but running it in Spark SQL with HiveContext fails with: java.lang.RuntimeException: Unsupported language features in query: select c.plateno, c.platetype, c.bay_id as start_bay_id, c.pastkkdate as start_time, lead(c.bay_id) over(partition by c.plateno order by c.pastkkdate) as end_bay_id, lead(c.pastkkdate) over(partition by c.plateno order by c.pastkkdate) as end_time from bay_car_test1 c TOK_QUERY Does Spark SQL not support the lead() and over() functions? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
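Until window functions are supported, the same result can be computed with plain RDD operations by grouping the rows per plateno and pairing each row with the next one in pastkkdate order, i.e. a hand-rolled lead(). A hedged PySpark sketch (rows_rdd and the tuple layout are illustrative assumptions):
{code}
# rows_rdd elements are assumed to be tuples:
# (plateno, platetype, bay_id, pastkkdate)
def with_lead(rows):
    ordered = sorted(rows, key=lambda r: r[1])           # r = (bay_id, pastkkdate)
    out = []
    for cur, nxt in zip(ordered, ordered[1:] + [None]):
        end_bay, end_time = (None, None) if nxt is None else (nxt[0], nxt[1])
        out.append((cur[0], cur[1], end_bay, end_time))  # start_bay, start_time, end_bay, end_time
    return out

by_plate = rows_rdd.map(lambda r: (r[0], (r[2], r[3])))  # (plateno, (bay_id, pastkkdate))
result = by_plate.groupByKey().flatMapValues(with_lead)
{code}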
[jira] [Created] (SPARK-3588) Gaussian Mixture Model clustering
Meethu Mathew created SPARK-3588: Summary: Gaussian Mixture Model clustering Key: SPARK-3588 URL: https://issues.apache.org/jira/browse/SPARK-3588 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Meethu Mathew
Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
We have come up with an initial distributed implementation of GMM in pyspark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
We did an initial benchmark study on a 2 node Spark standalone cluster setup where each node config is (8 Cores, 8 GB RAM) and the spark version used is 1.0.0. We also evaluated the python version of k-means available in spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
||Instances||Dimensions||GMM: Avg time per iteration||GMM: Time for 100 iterations||Kmeans (Python): Avg time per iteration||Kmeans (Python): Time for 100 iterations||
|0.7 million|13|7s|12 min|13s|26 min|
|1.8 million|11|17s|29 min|33s|53 min|
|10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
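For context, the core of such an implementation is a standard EM loop for a mixture with diagonal covariances. A condensed single-machine numpy sketch of one EM iteration (this is only an illustration of the technique, not the attached GMMSpark.py):
{code:title=Diagonal-covariance GMM, one EM step (illustrative)|borderStyle=solid}
import numpy as np

def em_step(X, weights, means, variances):
    # X: (n, d) data; weights: (k,); means, variances: (k, d) with one
    # diagonal covariance (variance vector) per mixture component.
    n, d = X.shape
    k = len(weights)

    # E-step: log responsibility of each component for each point.
    log_prob = np.empty((n, k))
    for j in range(k):
        diff = X - means[j]
        log_prob[:, j] = (np.log(weights[j])
                          - 0.5 * np.sum(np.log(2.0 * np.pi * variances[j]))
                          - 0.5 * np.sum(diff ** 2 / variances[j], axis=1))
    log_prob -= log_prob.max(axis=1, keepdims=True)      # numerical stability
    resp = np.exp(log_prob)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate mixture weights, means and diagonal variances.
    nk = resp.sum(axis=0)
    weights = nk / n
    means = resp.T.dot(X) / nk[:, None]
    variances = np.array([(resp[:, j, None] * (X - means[j]) ** 2).sum(axis=0) / nk[j]
                          for j in range(k)])
    return weights, means, variances
{code}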
[jira] [Updated] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meethu Mathew updated SPARK-3588: - Description: Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
We have come up with an initial distributed implementation of GMM in pyspark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
We did an initial benchmark study on a 2 node Spark standalone cluster setup where each node config is 8 Cores, 8 GB RAM, the spark version used is 1.0.0. We also evaluated the python version of k-means available in spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
||Instances||Dimensions||GMM: Avg time per iteration||GMM: Time for 100 iterations||Kmeans (Python): Avg time per iteration||Kmeans (Python): Time for 100 iterations||
|0.7 million|13|7s|12 min|13s|26 min|
|1.8 million|11|17s|29 min|33s|53 min|
|10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
was: Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
We have come up with an initial distributed implementation of GMM in pyspark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
We did an initial benchmark study on a 2 node Spark standalone cluster setup where each node config is (8 Cores, 8 GB RAM) and the spark version used is 1.0.0. We also evaluated the python version of k-means available in spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
||Instances||Dimensions||GMM: Avg time per iteration||GMM: Time for 100 iterations||Kmeans (Python): Avg time per iteration||Kmeans (Python): Time for 100 iterations||
|0.7 million|13|7s|12 min|13s|26 min|
|1.8 million|11|17s|
[jira] [Updated] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meethu Mathew updated SPARK-3588: - Attachment: GMMSpark.py Gaussian Mixture Model clustering - Key: SPARK-3588 URL: https://issues.apache.org/jira/browse/SPARK-3588 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Meethu Mathew Attachments: GMMSpark.py
Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
We have come up with an initial distributed implementation of GMM in pyspark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
We did an initial benchmark study on a 2 node Spark standalone cluster setup where each node config is 8 Cores, 8 GB RAM, the spark version used is 1.0.0. We also evaluated the python version of k-means available in spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
||Instances||Dimensions||GMM: Avg time per iteration||GMM: Time for 100 iterations||Kmeans (Python): Avg time per iteration||Kmeans (Python): Time for 100 iterations||
|0.7 million|13|7s|12 min|13s|26 min|
|1.8 million|11|17s|29 min|33s|53 min|
|10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138782#comment-14138782 ] Meethu Mathew commented on SPARK-3588: -- We are interested in contributing this implementation as a patch to mllib. Could you please review and suggest how to take this forward? Gaussian Mixture Model clustering - Key: SPARK-3588 URL: https://issues.apache.org/jira/browse/SPARK-3588 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Meethu Mathew Attachments: GMMSpark.py
Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
We have come up with an initial distributed implementation of GMM in pyspark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
We did an initial benchmark study on a 2 node Spark standalone cluster setup where each node config is 8 Cores, 8 GB RAM, the spark version used is 1.0.0. We also evaluated the python version of k-means available in spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
||Instances||Dimensions||GMM: Avg time per iteration||GMM: Time for 100 iterations||Kmeans (Python): Avg time per iteration||Kmeans (Python): Time for 100 iterations||
|0.7 million|13|7s|12 min|13s|26 min|
|1.8 million|11|17s|29 min|33s|53 min|
|10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2175) Null values when using App trait.
[ https://issues.apache.org/jira/browse/SPARK-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138803#comment-14138803 ] Philip Wills commented on SPARK-2175: - Whilst the workaround for this is trivial, discovering it's the problem isn't. It would definitely reduce the amount of swearing involved in running a first app if there was some way this could be detected and warned/errored on. Null values when using App trait. - Key: SPARK-2175 URL: https://issues.apache.org/jira/browse/SPARK-2175 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Linux Reporter: Brandon Amos Priority: Trivial See http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerExceptions-when-using-val-or-broadcast-on-a-standalone-cluster-tc7524.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3589) [Minor]Remove redundant code in deploy module
WangTaoTheTonic created SPARK-3589: -- Summary: [Minor]Remove redundant code in deploy module Key: SPARK-3589 URL: https://issues.apache.org/jira/browse/SPARK-3589 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor export CLASSPATH in spark-class is redundant. We could reuse value isYarnCluster in SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3589) [Minor]Remove redundant code in deploy module
[ https://issues.apache.org/jira/browse/SPARK-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138821#comment-14138821 ] Apache Spark commented on SPARK-3589: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/2445 [Minor]Remove redundant code in deploy module - Key: SPARK-3589 URL: https://issues.apache.org/jira/browse/SPARK-3589 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor export CLASSPATH in spark-class is redundant. We could reuse value isYarnCluster in SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138829#comment-14138829 ] Alexander Ulanov commented on SPARK-3403: - Thank you, your answers are really helpful. Should I submit this issue to OpenBLAS (https://github.com/xianyi/OpenBLAS) or netlib-java (https://github.com/fommil/netlib-java)? I thought the latter has the JNI implementation. Is it ok to submit it as is? NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) - Key: SPARK-3403 URL: https://issues.apache.org/jira/browse/SPARK-3403 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and MinGW64 precompiled dlls. Reporter: Alexander Ulanov Fix For: 1.2.0 Attachments: NativeNN.scala Code:
val model = NaiveBayes.train(train)
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
predictionAndLabels.foreach(println)
Result: the program crashes with "Process finished with exit code -1073741819 (0xC0000005)" after displaying the first prediction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3321) Defining a class within python main script
[ https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138867#comment-14138867 ] Matthew Farrellee commented on SPARK-3321: -- [~guoxu1231] I think so too. OK if I close this? Defining a class within python main script -- Key: SPARK-3321 URL: https://issues.apache.org/jira/browse/SPARK-3321 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Environment: Python version 2.6.6 Spark version 1.0.1 jdk1.6.0_43 Reporter: Shawn Guo Priority: Minor *leftOuterJoin(self, other, numPartitions=None)* Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k. *Background*: leftOuterJoin will produce None elements in the result dataset. I define a new class 'Null' in the main script to replace every Python native None with a new 'Null' object. The 'Null' object overloads the [] operator.
{code:title=Class Null|borderStyle=solid}
class Null(object):
    def __getitem__(self, key): return None
    def __getstate__(self): pass
    def __setstate__(self, dict): pass

def convert_to_null(x):
    return Null() if x is None else x

X = A.leftOuterJoin(B)
X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
{code}
The code seems to run fine in the pyspark console, however spark-submit fails with the error messages below: /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] /tmp/python_test.py {noformat} File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 134, in _write_with_length serialized = self.dumps(obj) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 279, in dumps def dumps(self, obj): return cPickle.dumps(obj, 2) PicklingError: Can't pickle class '__main__.Null': attribute lookup __main__.Null failed org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229) at
[jira] [Commented] (SPARK-3447) Kryo NPE when serializing JListWrapper
[ https://issues.apache.org/jira/browse/SPARK-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138873#comment-14138873 ] mohan gaddam commented on SPARK-3447: - I am also facing the same issue with the Spark Streaming API, the Kryo serializer and Avro objects. I have observed this behavior with output operations like print, collect etc. I also observed that if the Avro object is simple there is no problem, but if the Avro objects are complex with unions/arrays, it gives the exception. Find the stack trace below: ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (com.globallogic.goliath.model.Datum) data (com.globallogic.goliath.model.ResourceMessage) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Below is the Avro message: {version: 01, sequence: 1, resource: sensor-001, controller: 002, controllerTimestamp: 1411038710358, data: {value: [{name: Temperature, value: 30}, {name: Speed, value: 60}, {name: Location, value: [+401213.1, -0750015.1]}, {name: Timestamp, value: 2014-09-09T08:15:25-05:00}]}} The message is decoded successfully in the decoder, but the output operation throws the exception. 
Kryo NPE when serializing JListWrapper -- Key: SPARK-3447 URL: https://issues.apache.org/jira/browse/SPARK-3447 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.2.0 Repro (provided by [~davies]): {code} from pyspark.sql import SQLContext; SQLContext(sc).inferSchema(sc.parallelize([{a: [3]}]))._jschema_rdd.collect() {code} {code} 14/09/05 21:59:47 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (scala.collection.convert.Wrappers$JListWrapper) values (org.apache.spark.sql.catalyst.expressions.GenericRow) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at
[jira] [Commented] (SPARK-3447) Kryo NPE when serializing JListWrapper
[ https://issues.apache.org/jira/browse/SPARK-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138884#comment-14138884 ] Yin Huai commented on SPARK-3447: - [~mohan.gadm] From the trace, seems the NPE was caused by {code} value (com.globallogic.goliath.model.Datum) data (com.globallogic.goliath.model.ResourceMessage) {code} Can you check these two classes (I can not find them online)? Seems the avro message was decoded and then you stored the message as a com.globallogic.goliath.model.ResourceMessage. In a ResourceMessage, an array was repreented by Datum, which caused the problem. Kryo NPE when serializing JListWrapper -- Key: SPARK-3447 URL: https://issues.apache.org/jira/browse/SPARK-3447 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.2.0 Repro (provided by [~davies]): {code} from pyspark.sql import SQLContext; SQLContext(sc).inferSchema(sc.parallelize([{a: [3]}]))._jschema_rdd.collect() {code} {code} 14/09/05 21:59:47 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (scala.collection.convert.Wrappers$JListWrapper) values (org.apache.spark.sql.catalyst.expressions.GenericRow) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1276) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) Caused by: java.lang.NullPointerException at 
scala.collection.convert.Wrappers$MutableBufferWrapper.add(Wrappers.scala:80) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) ... 23 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
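As a debugging aid for reports like the two above, one common way to confirm that the failure is specific to Kryo is to temporarily switch back to Java serialization. A hedged PySpark sketch (the application name is illustrative; this trades performance for a diagnosis step, it is not a fix):
{code}
from pyspark import SparkConf, SparkContext

# Falling back to Java serialization usually avoids Kryo-specific NPEs and
# helps confirm the problem is in Kryo's handling of the wrapped collections.
conf = (SparkConf()
        .setAppName("kryo-npe-repro")
        .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer"))
sc = SparkContext(conf=conf)
{code}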
[jira] [Commented] (SPARK-3447) Kryo NPE when serializing JListWrapper
[ https://issues.apache.org/jira/browse/SPARK-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138899#comment-14138899 ] mohan gaddam commented on SPARK-3447: - sorry for the mistake, those are the project specific classes. and those are the classes generated by Avro itself from an avro avdl file and available in driver jar and also worker nodes. Kryo NPE when serializing JListWrapper -- Key: SPARK-3447 URL: https://issues.apache.org/jira/browse/SPARK-3447 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.2.0 Repro (provided by [~davies]): {code} from pyspark.sql import SQLContext; SQLContext(sc).inferSchema(sc.parallelize([{a: [3]}]))._jschema_rdd.collect() {code} {code} 14/09/05 21:59:47 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (scala.collection.convert.Wrappers$JListWrapper) values (org.apache.spark.sql.catalyst.expressions.GenericRow) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1276) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) Caused by: java.lang.NullPointerException at scala.collection.convert.Wrappers$MutableBufferWrapper.add(Wrappers.scala:80) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) ... 23 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138906#comment-14138906 ] Helena Edelson commented on SPARK-2593: --- [~matei] +1 for spark streaming, that is a primary concern here. And I understand your concern over support for akka upgrades. However I am more than happy to help WRT that and am sure I can find a few others that feel the same so that time isn't taken away from your team's new features/enhancements bandwidth. I will get more data on have 2 actor systems on a node. Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3447) Kryo NPE when serializing JListWrapper
[ https://issues.apache.org/jira/browse/SPARK-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138911#comment-14138911 ] mohan gaddam commented on SPARK-3447: - record KeyValueObject { union{boolean, int, long, float, double, bytes, string} name; //can be of any one of them mentioned in union. union {boolean, int, long, float, double, bytes, string, arrayunion{boolean, int, long, float, double, bytes, string, KeyValueObject}, KeyValueObject} value; } record Datum { union {boolean, int, long, float, double, bytes, string, arrayunion{boolean, int, long, float, double, bytes, string, KeyValueObject}, KeyValueObject} value; } record ResourceMessage { string version; string sequence; string resource; string controller; string controllerTimestamp; union {Datum, arrayDatum} data; } Kryo NPE when serializing JListWrapper -- Key: SPARK-3447 URL: https://issues.apache.org/jira/browse/SPARK-3447 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.2.0 Repro (provided by [~davies]): {code} from pyspark.sql import SQLContext; SQLContext(sc).inferSchema(sc.parallelize([{a: [3]}]))._jschema_rdd.collect() {code} {code} 14/09/05 21:59:47 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (scala.collection.convert.Wrappers$JListWrapper) values (org.apache.spark.sql.catalyst.expressions.GenericRow) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1276) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) Caused by: java.lang.NullPointerException at scala.collection.convert.Wrappers$MutableBufferWrapper.add(Wrappers.scala:80) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) ... 23 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
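For readers unfamiliar with the class named in the serialization trace above: scala.collection.convert.Wrappers$JListWrapper is what the Scala/Java collection converters produce when a java.util.List is viewed as a Scala Buffer, which is one way such a wrapper can end up inside a GenericRow. A small illustrative snippet (not taken from Spark itself):
{code}
import scala.collection.JavaConverters._

val javaList = new java.util.ArrayList[Int](java.util.Arrays.asList(1, 2, 3))
// asScala does not copy the data; it wraps the Java list in Wrappers$JListWrapper,
// the class Kryo's FieldSerializer fails to reconstruct in the trace above.
val wrapped: scala.collection.mutable.Buffer[Int] = javaList.asScala
println(wrapped.getClass.getName) // scala.collection.convert.Wrappers$JListWrapper
{code}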
[jira] [Comment Edited] (SPARK-3447) Kryo NPE when serializing JListWrapper
[ https://issues.apache.org/jira/browse/SPARK-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138873#comment-14138873 ] mohan gaddam edited comment on SPARK-3447 at 9/18/14 1:21 PM: -- I am also facing the same issue with spark streaming API, kryo serializer and Avro Objects. I have observed this behavior with output operations like print, collect etc. also observed that if the avro object is simple, no problem. but if the avro objects are complex with unions/Arrays, gives the exception. find the stack trace below: ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (xyz.Datum) data (xyz.ResourceMessage) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) below is the avro message. {version: 01, sequence: 1, resource: sensor-001, controller: 002, controllerTimestamp: 1411038710358, data: {value: [{name: Temperature, value: 30}, {name: Speed, value: 60}, {name: Location, value: [+401213.1, -0750015.1]}, {name: Timestamp, value: 2014-09-09T08:15:25-05:00}]}} message is been successfully decoded in decoder, but throws exception for output operation. was (Author: mohan.gadm): I am also facing the same issue with spark streaming API, kryo serializer and Avro Objects. I have observed this behavior with output operations like print, collect etc. also observed that if the avro object is simple, no problem. but if the avro objects are complex with unions/Arrays, gives the exception. 
find the stack trace below: ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (com.globallogic.goliath.model.Datum) data (com.globallogic.goliath.model.ResourceMessage) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138970#comment-14138970 ] Apache Spark commented on SPARK-1987: - User 'larryxiao' has created a pull request for this issue: https://github.com/apache/spark/pull/2446 More memory-efficient graph construction Key: SPARK-1987 URL: https://issues.apache.org/jira/browse/SPARK-1987 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave A graph's edges are usually the largest component of the graph. GraphX currently stores edges in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for sorting, so each edge additionally takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This unnecessarily increases GraphX's memory requirements by a factor of 3. To save memory, EdgePartitionBuilder should instead use a custom sort routine that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
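A rough sketch of the proposed approach, assuming the three parallel arrays described above (this is not the actual EdgePartitionBuilder code, just an illustration of sorting by (srcId, dstId) without materializing per-edge objects):
{code}
// Sort edges stored as parallel primitive arrays in place, comparing by
// (srcId, dstId) and swapping entries in all three arrays together, so no
// intermediate Edge objects are allocated during graph construction.
def sortEdgesInPlace(srcIds: Array[Long], dstIds: Array[Long], attrs: Array[Int]): Unit = {
  def lt(i: Int, j: Int): Boolean =
    srcIds(i) < srcIds(j) || (srcIds(i) == srcIds(j) && dstIds(i) < dstIds(j))
  def swap(i: Int, j: Int): Unit = {
    val s = srcIds(i); srcIds(i) = srcIds(j); srcIds(j) = s
    val d = dstIds(i); dstIds(i) = dstIds(j); dstIds(j) = d
    val a = attrs(i); attrs(i) = attrs(j); attrs(j) = a
  }
  // plain quicksort over indices; a production version would choose pivots more carefully
  def qsort(lo: Int, hi: Int): Unit = if (lo < hi) {
    var p = lo
    for (i <- lo until hi) if (lt(i, hi)) { swap(i, p); p += 1 }
    swap(p, hi)
    qsort(lo, p - 1)
    qsort(p + 1, hi)
  }
  qsort(0, srcIds.length - 1)
}
{code}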
[jira] [Resolved] (SPARK-3557) Yarn client config prioritization is backwards
[ https://issues.apache.org/jira/browse/SPARK-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3557. -- Resolution: Duplicate Yarn client config prioritization is backwards -- Key: SPARK-3557 URL: https://issues.apache.org/jira/browse/SPARK-3557 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or In YarnClientSchedulerBackend, we have: {code} if (System.getenv(envVar) != null) { arrayBuf += (optionName, System.getenv(envVar)) } else if (sc.getConf.contains(sysProp)) { arrayBuf += (optionName, sc.getConf.get(sysProp)) } {code} Elsewhere in Spark we try to honor Spark configs over environment variables. This was introduced as a fix for the Yarn app name (SPARK-1631), but this also changed the behavior for other configs. Perhaps we should special case this particular config and correct the prioritization order of the other configs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3557) Yarn client config prioritization is backwards
[ https://issues.apache.org/jira/browse/SPARK-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138993#comment-14138993 ] Thomas Graves commented on SPARK-3557: -- This is a dup of SPARK-2872, although this has better description Yarn client config prioritization is backwards -- Key: SPARK-3557 URL: https://issues.apache.org/jira/browse/SPARK-3557 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or In YarnClientSchedulerBackend, we have: {code} if (System.getenv(envVar) != null) { arrayBuf += (optionName, System.getenv(envVar)) } else if (sc.getConf.contains(sysProp)) { arrayBuf += (optionName, sc.getConf.get(sysProp)) } {code} Elsewhere in Spark we try to honor Spark configs over environment variables. This was introduced as a fix for the Yarn app name (SPARK-1631), but this also changed the behavior for other configs. Perhaps we should special case this particular config and correct the prioritization order of the other configs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2872) Fix conflict between code and doc in YarnClientSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138996#comment-14138996 ] Thomas Graves commented on SPARK-2872: -- Adding the description from SPARK-3557 as it explains it better: In YarnClientSchedulerBackend, we have: if (System.getenv(envVar) != null) { arrayBuf += (optionName, System.getenv(envVar)) } else if (sc.getConf.contains(sysProp)) { arrayBuf += (optionName, sc.getConf.get(sysProp)) } Elsewhere in Spark we try to honor Spark configs over environment variables. This was introduced as a fix for the Yarn app name (SPARK-1631), but this also changed the behavior for other configs. Perhaps we should special case this particular config and correct the prioritization order of the other configs. Fix conflict between code and doc in YarnClientSchedulerBackend --- Key: SPARK-2872 URL: https://issues.apache.org/jira/browse/SPARK-2872 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Zhihui The doc says: system properties override environment variables. https://github.com/apache/spark/blob/master/yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala#L71 But the code conflicts with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
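For concreteness, the ordering argued for in the description would look roughly like the snippet below, i.e. the two branches quoted above swapped so that the Spark config wins and the environment variable is only a fallback (a sketch, not the actual patch):
{code}
if (sc.getConf.contains(sysProp)) {
  arrayBuf += (optionName, sc.getConf.get(sysProp))
} else if (System.getenv(envVar) != null) {
  arrayBuf += (optionName, System.getenv(envVar))
}
{code}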
[jira] [Commented] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark
[ https://issues.apache.org/jira/browse/SPARK-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139010#comment-14139010 ] Apache Spark commented on SPARK-3389: - User 'patmcdonough' has created a pull request for this issue: https://github.com/apache/spark/pull/2447 Add converter class to make reading Parquet files easy with PySpark --- Key: SPARK-3389 URL: https://issues.apache.org/jira/browse/SPARK-3389 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Uri Laserson If a user wants to read Parquet data from PySpark, they currently must use SparkContext.newAPIHadoopFile. If they do not provide a valueConverter, they will get JSON string that must be parsed. Here I add a Converter implementation based on the one in the AvroConverters.scala file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Different Languages
[ https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139043#comment-14139043 ] Matthew Farrellee commented on SPARK-3580: -- what do you think about going the other direction, adding a partitions property to RDDs in python? given that an RDD is a list of partitions, a function for computing each split, a list of deps on other RDDs, etc, it makes sense that you could access a someRDD.partitions, and doing so looks to be the preferred method in scala. so, instead of a someRDD.getNumPartitions(), python code could use a more idiomatic len(someRDD.partitions). Add Consistent Method To Get Number of RDD Partitions Across Different Languages Key: SPARK-3580 URL: https://issues.apache.org/jira/browse/SPARK-3580 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Pat McDonough Labels: starter Programmatically retrieving the number of partitions is not consistent between python and scala. A consistent method should be defined and made public across both languages. RDD.partitions.size is also used quite frequently throughout the internal code, so that might be worth refactoring as well once the new method is available. What we have today is below. In Scala: {code} scala someRDD.partitions.size res0: Int = 30 {code} In Python: {code} In [2]: someRDD.getNumPartitions() Out[2]: 30 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139061#comment-14139061 ] Gino Bustelo commented on SPARK-2892: - Any update on this? Will it get fixed for 1.0.3 or 1.0.1 Socket Receiver does not stop when streaming context is stopped --- Key: SPARK-2892 URL: https://issues.apache.org/jira/browse/SPARK-2892 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Running NetworkWordCount with {quote} ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); Thread.sleep(6) {quote} gives the following error {quote} 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 10047 ms on localhost (1/1) 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at ReceiverTracker.scala:275) finished in 10.056 s 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at ReceiverTracker.scala:275, took 10.179263 s 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been terminated 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not deregistered, Map(0 - ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after time 1407375433000 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139061#comment-14139061 ] Gino Bustelo edited comment on SPARK-2892 at 9/18/14 3:30 PM: -- Any update on this? Will it get fixed for 1.0.3 or 1.1.0 was (Author: lbustelo): Any update on this? Will it get fixed for 1.0.3 or 1.0.1 Socket Receiver does not stop when streaming context is stopped --- Key: SPARK-2892 URL: https://issues.apache.org/jira/browse/SPARK-2892 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Running NetworkWordCount with {quote} ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); Thread.sleep(6) {quote} gives the following error {quote} 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 10047 ms on localhost (1/1) 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at ReceiverTracker.scala:275) finished in 10.056 s 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at ReceiverTracker.scala:275, took 10.179263 s 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been terminated 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not deregistered, Map(0 - ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after time 1407375433000 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3270) Spark API for Application Extensions
[ https://issues.apache.org/jira/browse/SPARK-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3270: - Issue Type: New Feature (was: Improvement) Spark API for Application Extensions Key: SPARK-3270 URL: https://issues.apache.org/jira/browse/SPARK-3270 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Michal Malohlava Any application should be able to enrich spark infrastructure by services which are not available by default. Hence, to support such application extensions (aka extesions/plugins) Spark platform should provide: - an API to register an extension - an API to register a service (meaning provided functionality) - well-defined points in Spark infrastructure which can be enriched/hooked by extension - a way of deploying extension (for example, simply putting the extension on classpath and using Java service interface) - a way to access extension from application Overall proposal is available here: https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing Note: In this context, I do not mean reinventing OSGi (or another plugin platform) but it can serve as a good starting point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line
[ https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-1576. --- Resolution: Not a Problem spark-submit already supports this with existing options. Passing of JAVA_OPTS to YARN on command line Key: SPARK-1576 URL: https://issues.apache.org/jira/browse/SPARK-1576 Project: Spark Issue Type: Improvement Affects Versions: 0.9.0, 0.9.1, 0.9.2 Reporter: Nishkam Ravi Attachments: SPARK-1576.patch JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) or as config vars (after Patrick's recent change). It would be good to allow the user to pass them on command line as well to restrict scope to single application invocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139163#comment-14139163 ] Matei Zaharia commented on SPARK-2593: -- Sure, it would be great to do this for streaming. For the other stuff by the way, my concern isn't so much upgrades, it's locking users in to a specific Spark version. For example, suppose we add an API involving classes from Akka 2.2 today, and later 2.3 or 2.4 changes those classes in a way that isn't compatible. Then to meet our API stability promises to our users, we're going to keep our version of Akka at 2.2, and users simply won't be able to use a new Akka in the same application as Spark. With Akka in particular you sometimes have to upgrade in order to use a new Scala version too, which is another reason we can't lock it into our API. Basically in these cases it's just very risky to expose fast-moving third-party classes if you want to have a stable API. We've been bitten by exposing stuff as mundane as Guava or Google Protobufs (!) because of incompatible changes in minor versions. We care a lot about API stability within Spark and in particular about shielding our users from the fast-moving APIs in distributed systems land. Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3547) Maybe we should not simply make return code 1 equal to CLASS_NOT_FOUND
[ https://issues.apache.org/jira/browse/SPARK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3547. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: WangTaoTheTonic Resolved by: https://github.com/apache/spark/pull/2421 Maybe we should not simply make return code 1 equal to CLASS_NOT_FOUND -- Key: SPARK-3547 URL: https://issues.apache.org/jira/browse/SPARK-3547 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Priority: Minor Fix For: 1.2.0 It incurred runtime exception when hadoop version is not A.B.* format, which is detected by Hive. Then the jvm return code is 1, while equals to CLASS_NOT_FOUND_EXIT_STATUS in start-thriftserver.sh script. It proves even runtime exception can lead the jvm existed with code 1. Should we just modify the misleading error message in script ? The error message in script: CLASS_NOT_FOUND_EXIT_STATUS=1 if [[ exit_status -eq CLASS_NOT_FOUND_EXIT_STATUS ]]; then echo echo Failed to load Hive Thrift server main class $CLASS. echo You need to build Spark with -Phive. fi Below is exception stack I met: [omm@dc1-rack1-host2 sbin]$ ./start-thriftserver.sh log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Exception in thread main java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.lang.RuntimeException: Illegal Hadoop Version: V100R001C00 (expected A.B.* format) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:286) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:54) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:332) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:79) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.lang.RuntimeException: Illegal Hadoop Version: V100R001C00 (expected A.B.* format) at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:368) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:278) ... 9 more Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Illegal Hadoop Version: V100R001C00 (expected A.B.* format) at org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:53) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:365) ... 
10 more Caused by: java.lang.RuntimeException: Illegal Hadoop Version: V100R001C00 (expected A.B.* format) at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:141) at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:113) at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:80) at org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:51) ... 13 more Failed to load Hive Thrift server main class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2. You need to build Spark with -Phive. I tested runtime exception and ioexception today, and JVM will also return with exit code 1. Below is my code and error it lead. Code, throw ArrayIndexOutOfBoundsException and FileNotFoundException: object ExitCodeWithRuntimeException { def main(args: Array[String]): Unit = { if(args(0).equals(array)) arrayIndexOutOfBoundsException(args) else if(args(0).equals(file)) fileNotFoundException() } def arrayIndexOutOfBoundsException(args: Array[String]): Unit = {
[jira] [Resolved] (SPARK-3579) Jekyll doc generation is different across environments
[ https://issues.apache.org/jira/browse/SPARK-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3579. Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2443 [https://github.com/apache/spark/pull/2443] Jekyll doc generation is different across environments -- Key: SPARK-3579 URL: https://issues.apache.org/jira/browse/SPARK-3579 Project: Spark Issue Type: Bug Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.2.0 This can result in a lot of false changes when someone alters something with the docs. It is relevant to the both the Spark website (maintained in a separate subversion repo) and the Spark docs. There are at least two issues here. One is that the HTML character escaping can be different in certain cases. Another is that the highlighting output seems a bit different depending on (I think) what version of pygments is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1477) Add the lifecycle interface
[ https://issues.apache.org/jira/browse/SPARK-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1477. Resolution: Won't Fix Unless we are planning to interact with these components in a generic way, there is not a good reason to add this interface (see relevant discussion in JIRA). Add the lifecycle interface --- Key: SPARK-1477 URL: https://issues.apache.org/jira/browse/SPARK-1477 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Guoqiang Li Assignee: Guoqiang Li Now the Spark in the code, there are a lot of interface or class defines the stop and start method,eg:[SchedulerBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala],[HttpServer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala],[ContextCleaner|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala] . we should use a life cycle interface improve the code -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3593) Support Sorting of Binary Type Data
Paul Magid created SPARK-3593: - Summary: Support Sorting of Binary Type Data Key: SPARK-3593 URL: https://issues.apache.org/jira/browse/SPARK-3593 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: Paul Magid If you try sorting on a binary field you currently get an exception. Please add support for binary data type sorting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
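A minimal illustration of the failing pattern described above (the table and column names are hypothetical; the payload column is assumed to be of Spark SQL BinaryType):
{code}
// events(id INT, payload BINARY) registered as a temp table
sqlContext.sql("SELECT id FROM events ORDER BY payload").collect()
// in 1.1.0 the ORDER BY on the binary column throws instead of sorting
{code}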
[jira] [Comment Edited] (SPARK-1477) Add the lifecycle interface
[ https://issues.apache.org/jira/browse/SPARK-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139218#comment-14139218 ] Patrick Wendell edited comment on SPARK-1477 at 9/18/14 5:34 PM: - Unless we are planning to interact with these components in a generic way, there is not a good reason to add this interface (see relevant discussion in github). was (Author: pwendell): Unless we are planning to interact with these components in a generic way, there is not a good reason to add this interface (see relevant discussion in JIRA). Add the lifecycle interface --- Key: SPARK-1477 URL: https://issues.apache.org/jira/browse/SPARK-1477 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Guoqiang Li Assignee: Guoqiang Li Now the Spark in the code, there are a lot of interface or class defines the stop and start method,eg:[SchedulerBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala],[HttpServer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala],[ContextCleaner|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala] . we should use a life cycle interface improve the code -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139233#comment-14139233 ] Li Pu commented on SPARK-3530: -- Nice design doc! I have some experience with the parameter part. It would be great to have constraints on the individual parameters and on the Params level. For example, learning_rate must be greater than 0, and regularization can be one of "l1" or "l2". Parameter checking is something that every learning algorithm does, so some support at parameter definition time would make the code more concise. abstract class ParamConstraint[T] extends Serializable { def isValid(value: T): Boolean def invalidMessage(value: T): String } class IntRangeConstraint(min: Int, max: Int) extends ParamConstraint[Int] { def isValid(value: Int) = value >= min && value <= max def invalidMessage(value: Int) = ... } class Param[T] (..., constraints: List[ParamConstraint[T]] = List()) {...} // constraints is a list because there might be more than one type of constraint applied to this Param At definition time, we can write: val maxIter: Param[Int] = new Param(id, "maxIter", "max number of iterations", 100, List(new IntRangeConstraint(1, 500))) There shouldn't be too many types of constraints, so ml could provide a list of commonly used constraint classes. Keeping the parameter definition and its constraints on the same line also improves readability. The Params trait could use a similar structure to check constraints across multiple parameters, but this is less likely to be needed in real use cases. In the end, validateParams of Params just calls isValid on every member Param as the default implementation. Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3592) applySchema to an RDD of Row
[ https://issues.apache.org/jira/browse/SPARK-3592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139299#comment-14139299 ] Apache Spark commented on SPARK-3592: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2448 applySchema to an RDD of Row Key: SPARK-3592 URL: https://issues.apache.org/jira/browse/SPARK-3592 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Critical Right now, we can not apply a schema to an RDD of Row; this should be a bug: {code} srdd = sqlCtx.jsonRDD(sc.parallelize([{"a": 2}])) sqlCtx.applySchema(srdd.map(lambda x: x), srdd.schema()) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/daviesliu/work/spark/python/pyspark/sql.py", line 1121, in applySchema _verify_type(row, schema) File "/Users/daviesliu/work/spark/python/pyspark/sql.py", line 736, in _verify_type % (dataType, type(obj))) TypeError: StructType(List(StructField(a,IntegerType,true))) can not accept object in type <class 'pyspark.sql.Row'> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3560) In yarn-cluster mode, jars are distributed through multiple mechanisms.
[ https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139339#comment-14139339 ] Apache Spark commented on SPARK-3560: - User 'Victsm' has created a pull request for this issue: https://github.com/apache/spark/pull/2449 In yarn-cluster mode, jars are distributed through multiple mechanisms. --- Key: SPARK-3560 URL: https://issues.apache.org/jira/browse/SPARK-3560 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Priority: Critical In yarn-cluster mode, jars given to spark-submit's --jars argument should be distributed to executors through the distributed cache, not through fetching. Currently, Spark tries to distribute the jars both ways, which can cause executor errors related to trying to overwrite symlinks without write permissions. It looks like this was introduced by SPARK-2260, which sets spark.jars in yarn-cluster mode. Setting spark.jars is necessary for standalone cluster deploy mode, but harmful for yarn cluster deploy mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3566) .gitignore and .rat-excludes should consider Windows cmd file and Emacs' backup files
[ https://issues.apache.org/jira/browse/SPARK-3566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3566. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Kousuke Saruta Target Version/s: 1.2.0 (was: 1.1.1, 1.2.0) https://github.com/apache/spark/pull/2426 .gitignore and .rat-excludes should consider Windows cmd file and Emacs' backup files - Key: SPARK-3566 URL: https://issues.apache.org/jira/browse/SPARK-3566 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Minor Fix For: 1.2.0 Current .gitignore and .rat-excludes does not consider spark-env.cmd. Also, .gitignore doesn't consider emacs' meta files (backup file which starts with and ends with # and lock file which starts with .#) even though considers vi's meta file (*.swp). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3589) [Minor]Remove redundant code in deploy module
[ https://issues.apache.org/jira/browse/SPARK-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3589. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: WangTaoTheTonic [Minor]Remove redundant code in deploy module - Key: SPARK-3589 URL: https://issues.apache.org/jira/browse/SPARK-3589 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Priority: Minor Fix For: 1.1.1, 1.2.0 export CLASSPATH in spark-class is redundant. We could reuse value isYarnCluster in SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139396#comment-14139396 ] Zhan Zhang commented on SPARK-1537: --- Do you have any update on this, or any schedule in your mind yet? Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3560) In yarn-cluster mode, the same jars are distributed through multiple mechanisms.
[ https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-3560: -- Summary: In yarn-cluster mode, the same jars are distributed through multiple mechanisms. (was: In yarn-cluster mode, jars are distributed through multiple mechanisms.) In yarn-cluster mode, the same jars are distributed through multiple mechanisms. Key: SPARK-3560 URL: https://issues.apache.org/jira/browse/SPARK-3560 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Priority: Critical In yarn-cluster mode, jars given to spark-submit's --jars argument should be distributed to executors through the distributed cache, not through fetching. Currently, Spark tries to distribute the jars both ways, which can cause executor errors related to trying to overwrite symlinks without write permissions. It looks like this was introduced by SPARK-2260, which sets spark.jars in yarn-cluster mode. Setting spark.jars is necessary for standalone cluster deploy mode, but harmful for yarn cluster deploy mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3595) Spark should respect configured OutputCommitter when using saveAsHadoopFile
Ian Hummel created SPARK-3595: - Summary: Spark should respect configured OutputCommitter when using saveAsHadoopFile Key: SPARK-3595 URL: https://issues.apache.org/jira/browse/SPARK-3595 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Ian Hummel When calling {{saveAsHadoopFile}}, Spark hardcodes the OutputCommitter to be a {{FileOutputCommitter}}. When using Spark on an EMR cluster to process and write files to/from S3, the default Hadoop configuration uses a {{DirectFileOutputCommitter}} to avoid writing to a temporary directory and doing a copy. Will submit a patch via GitHub shortly. Cheers, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
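As a sketch of what the requested change would enable, the committer would be taken from the JobConf handed to saveAsHadoopFile rather than being hardcoded. The committer class name below is a placeholder (not a real class), and the RDD and output path are made up:
{code}
import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

val jobConf = new JobConf(sc.hadoopConfiguration)
// the old mapred API reads the committer from this property, defaulting to FileOutputCommitter
jobConf.set("mapred.output.committer.class", "com.example.DirectFileOutputCommitter")
// pairRdd: RDD[(String, String)], assumed to exist
pairRdd.saveAsHadoopFile(
  "s3://my-bucket/output",
  classOf[String], classOf[String],
  classOf[TextOutputFormat[String, String]],
  jobConf)
{code}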
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139461#comment-14139461 ] Xiangrui Meng commented on SPARK-3403: -- Sorry, it should be netlib-java, but the real cause may hide in JNI. Sam (@fommil) definitely knows more about this. NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) - Key: SPARK-3403 URL: https://issues.apache.org/jira/browse/SPARK-3403 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and MinGW64 precompiled dlls. Reporter: Alexander Ulanov Fix For: 1.2.0 Attachments: NativeNN.scala Code: val model = NaiveBayes.train(train) val predictionAndLabels = test.map { point = val score = model.predict(point.features) (score, point.label) } predictionAndLabels.foreach(println) Result: program crashes with: Process finished with exit code -1073741819 (0xC005) after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3596) Support changing the yarn client monitor interval
Thomas Graves created SPARK-3596: Summary: Support changing the yarn client monitor interval Key: SPARK-3596 URL: https://issues.apache.org/jira/browse/SPARK-3596 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Right now spark on yarn has a monitor interval that can be configured by spark.yarn.report.interval. This is how often the client checks with the RM to get status on the running application in cluster mode. We should allow users to set this interval as some may not need to check so often. There is another jira filed to make it so the client doesn't have to stay around for cluster mode. With the changes in https://github.com/apache/spark/pull/2350, it further extends that to affect client mode. We may want to add in specific configs for that since the monitorApplication function is now used in multiple different scenarios it actually might make sense for it to take the timeout as a parameter. You could want different timeout for different situations. for instance how quickly we poll on client side and print information (cluster mode) vs how quickly we recognize the application quit and we want to terminate (client mode), I want the latter to happen quickly where as in cluster mode I might not care as much about how often it is printing updated info to the screen. I guess its private so we could leave it as is and change if we add support for that later. my suggestion for name would be something like spark.yarn.client.progress.pollinterval. If we were to add separate ones in the future then they could be something like spark.yarn.app.ready.pollinterval and spark.yarn.app.completion.pollinterval -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
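For reference, the existing knob named above can be set like any other Spark configuration; the per-scenario names suggested in the description (e.g. spark.yarn.client.progress.pollinterval) are only proposals and do not exist yet:
{code}
import org.apache.spark.SparkConf

// poll the ResourceManager for application status every 5 seconds
// (the value is assumed to be in milliseconds, matching the 1.x yarn client)
val conf = new SparkConf().set("spark.yarn.report.interval", "5000")
{code}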
[jira] [Commented] (SPARK-3595) Spark should respect configured OutputCommitter when using saveAsHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-3595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139509#comment-14139509 ] Apache Spark commented on SPARK-3595: - User 'themodernlife' has created a pull request for this issue: https://github.com/apache/spark/pull/2450 Spark should respect configured OutputCommitter when using saveAsHadoopFile --- Key: SPARK-3595 URL: https://issues.apache.org/jira/browse/SPARK-3595 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Ian Hummel When calling {{saveAsHadoopFile}}, Spark hardcodes the OutputCommitter to be a {{FileOutputCommitter}}. When using Spark on an EMR cluster to process and write files to/from S3, the default Hadoop configuration uses a {{DirectFileOutputCommitter}} to avoid writing to a temporary directory and doing a copy. Will submit a patch via GitHub shortly. Cheers, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139549#comment-14139549 ] Apache Spark commented on SPARK-1486: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/2451 Support multi-model training in MLlib - Key: SPARK-1486 URL: https://issues.apache.org/jira/browse/SPARK-1486 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz Priority: Critical It is rare in practice to train just one model with a given set of parameters. Usually, this is done by training multiple models with different sets of parameters and then select the best based on their performance on the validation set. MLlib should provide native support for multi-model training/scoring. It requires decoupling of concepts like problem, formulation, algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar concepts, which we can borrow. There are different approaches for multi-model training: 0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler). 1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans). 2) Make multiple copies of the data (still stored distributively), and use more cores to distribute the work. 3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker. Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., 1GB. This task will be divided into sub-tasks and this JIRA is created to discuss the design and track the overall progress. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3340) Deprecate ADD_JARS and ADD_FILES
[ https://issues.apache.org/jira/browse/SPARK-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3340: - Assignee: (was: Andrew Or) Deprecate ADD_JARS and ADD_FILES Key: SPARK-3340 URL: https://issues.apache.org/jira/browse/SPARK-3340 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or These were introduced before Spark submit even existed. Now that there are many better ways of setting jars and python files through Spark submit, we should deprecate these environment variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139600#comment-14139600 ] Xiangrui Meng commented on SPARK-3530: -- [~eustache] The default implementation of multi-model training will be a for loop. But the API leaves space for future optimizations, like group weight vectors and using level-3 BLAS for better performance. It shouldn't be a meta class, because many optimizations are specific. For example, LASSO can be solved via LARS, which computes a full solution path for all regularization parameters. The level-3 BLAS optimization is another example, which can give 8x speedup (SPARK-1486). [~vrilleup] We can have a set of built-in preconditions, like positivity. Or we could accept lambda function for assertions (T) = Unit, which may be hard for Java users but they should be familiar of creating those in Spark. Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
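A minimal sketch of the lambda-based option mentioned above, assuming a simplified stand-in for the Param class (the names here are illustrative, not the proposed API):
{code}
// A Param that carries an optional validation function of type T => Unit,
// which throws (via require) when handed an invalid value.
class Param[T](val name: String, val doc: String, val defaultValue: T,
               val validate: T => Unit = (_: T) => ()) extends Serializable {
  def checkedValue(value: T): T = { validate(value); value }
}

val maxIter = new Param[Int]("maxIter", "max number of iterations", 100,
  v => require(v >= 1 && v <= 500, s"maxIter must be in [1, 500], got $v"))

maxIter.checkedValue(50)    // ok
// maxIter.checkedValue(0)  // would throw IllegalArgumentException
{code}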
[jira] [Comment Edited] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139600#comment-14139600 ] Xiangrui Meng edited comment on SPARK-3530 at 9/18/14 10:06 PM: [~eustache] The default implementation of multi-model training will be a for loop. But the API leaves space for future optimizations, like grouping weight vectors and using level-3 BLAS for better performance. It shouldn't be a meta class, because many optimizations are specific. For example, LASSO can be solved via LARS, which computes a full solution path for all regularization parameters. The level-3 BLAS optimization is another example, which can give 8x speedup (SPARK-1486). [~vrilleup] We can have a set of built-in preconditions, like positivity. Or we could accept lambda function for assertions (T) = Unit, which may be hard for Java users but they should be familiar of creating those in Spark. was (Author: mengxr): [~eustache] The default implementation of multi-model training will be a for loop. But the API leaves space for future optimizations, like group weight vectors and using level-3 BLAS for better performance. It shouldn't be a meta class, because many optimizations are specific. For example, LASSO can be solved via LARS, which computes a full solution path for all regularization parameters. The level-3 BLAS optimization is another example, which can give 8x speedup (SPARK-1486). [~vrilleup] We can have a set of built-in preconditions, like positivity. Or we could accept lambda function for assertions (T) = Unit, which may be hard for Java users but they should be familiar of creating those in Spark. Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3573: - Description: This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. .Sample code Suppose we have training events stored on HDFS and user/ad features in Hive, we want to assemble features for training and then apply decision tree. The proposed pipeline with dataset looks like the following (need more refinements): {code} sqlContext.jsonFile(/path/to/training/events, 0.01).registerTempTable(event) val training = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;).cache() val indexer = new Indexer() val interactor = new Interactor() val fvAssembler = new FeatureVectorAssembler() val treeClassifer = new DecisionTreeClassifer() val paramMap = new ParamMap() .put(indexer.features, Map(userCountryIndex - userCountry)) .put(indexer.sortByFrequency, true) .put(iteractor.features, Map(genderMatch - Array(userGender, targetGender))) .put(fvAssembler.features, Map(features - Array(genderMatch, userCountryIndex, userFeatures))) .put(fvAssembler.dense, true) .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes features and label columns. val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier) val model = pipeline.fit(raw, paramMap) sqlContext.jsonFile(/path/to/events, 0.01).registerTempTable(event) val test = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;) val prediction = model.transform(test).select('eventId, 'prediction) {code} was: This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. #Sample code Suppose we have training events stored on HDFS and user/ad features in Hive, we want to assemble features for training and then apply decision tree. The proposed pipeline with dataset looks like the following (need more refinements): {code} sqlContext.jsonFile(/path/to/training/events, 0.01).registerTempTable(event) val training = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;).cache() val indexer = new Indexer() val interactor = new Interactor() val fvAssembler = new FeatureVectorAssembler() val treeClassifer = new DecisionTreeClassifer() val paramMap = new ParamMap() .put(indexer.features, Map(userCountryIndex - userCountry)) .put(indexer.sortByFrequency, true) .put(iteractor.features, Map(genderMatch - Array(userGender, targetGender))) .put(fvAssembler.features, Map(features - Array(genderMatch, userCountryIndex, userFeatures))) .put(fvAssembler.dense, true) .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes features and label columns. 
val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier) val model = pipeline.fit(raw, paramMap) sqlContext.jsonFile(/path/to/events, 0.01).registerTempTable(event) val test = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;) val prediction = model.transform(test).select('eventId, 'prediction) {code} Dataset --- Key: SPARK-3573 URL: https://issues.apache.org/jira/browse/SPARK-3573 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. .Sample code Suppose we have training events stored on HDFS and user/ad features in Hive, we want to assemble features for training and then apply decision tree. The proposed pipeline with dataset looks like the following (need more refinements): {code} sqlContext.jsonFile(/path/to/training/events,
[jira] [Closed] (SPARK-3560) In yarn-cluster mode, the same jars are distributed through multiple mechanisms.
[ https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3560. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Fixed by https://github.com/apache/spark/pull/2449 In yarn-cluster mode, the same jars are distributed through multiple mechanisms. Key: SPARK-3560 URL: https://issues.apache.org/jira/browse/SPARK-3560 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Priority: Critical Fix For: 1.1.1, 1.2.0 In yarn-cluster mode, jars given to spark-submit's --jars argument should be distributed to executors through the distributed cache, not through fetching. Currently, Spark tries to distribute the jars both ways, which can cause executor errors related to trying to overwrite symlinks without write permissions. It looks like this was introduced by SPARK-2260, which sets spark.jars in yarn-cluster mode. Setting spark.jars is necessary for standalone cluster deploy mode, but harmful for yarn cluster deploy mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-3560) In yarn-cluster mode, the same jars are distributed through multiple mechanisms.
[ https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-3560: -- Assignee: Min Shen Reopening just to reassign. Closing right afterwards, please disregard. In yarn-cluster mode, the same jars are distributed through multiple mechanisms. Key: SPARK-3560 URL: https://issues.apache.org/jira/browse/SPARK-3560 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Assignee: Min Shen Priority: Critical Fix For: 1.1.1, 1.2.0 In yarn-cluster mode, jars given to spark-submit's --jars argument should be distributed to executors through the distributed cache, not through fetching. Currently, Spark tries to distribute the jars both ways, which can cause executor errors related to trying to overwrite symlinks without write permissions. It looks like this was introduced by SPARK-2260, which sets spark.jars in yarn-cluster mode. Setting spark.jars is necessary for standalone cluster deploy mode, but harmful for yarn cluster deploy mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3560) In yarn-cluster mode, the same jars are distributed through multiple mechanisms.
[ https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3560. Resolution: Fixed In yarn-cluster mode, the same jars are distributed through multiple mechanisms. Key: SPARK-3560 URL: https://issues.apache.org/jira/browse/SPARK-3560 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Assignee: Min Shen Priority: Critical Fix For: 1.1.1, 1.2.0 In yarn-cluster mode, jars given to spark-submit's --jars argument should be distributed to executors through the distributed cache, not through fetching. Currently, Spark tries to distribute the jars both ways, which can cause executor errors related to trying to overwrite symlinks without write permissions. It looks like this was introduced by SPARK-2260, which sets spark.jars in yarn-cluster mode. Setting spark.jars is necessary for standalone cluster deploy mode, but harmful for yarn cluster deploy mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3587) Spark SQL can't support lead() over() window function
[ https://issues.apache.org/jira/browse/SPARK-3587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3587: --- Labels: (was: features) Spark SQL can't support lead() over() window function - Key: SPARK-3587 URL: https://issues.apache.org/jira/browse/SPARK-3587 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: operating system is SUSE 11.3, three nodes, every node has 48GB MEM and 16 CPUs Reporter: caoli Original Estimate: 504h Remaining Estimate: 504h When running the following SQL: select c.plateno, c.platetype,c.bay_id as start_bay_id,c.pastkkdate as start_time,lead(c.bay_id) over(partition by c.plateno order by c.pastkkdate) as end_bay_id, lead(c.pastkkdate) over(partition by c.plateno order by c.pastkkdate) as end_time from bay_car_test1 c in Hive, it is OK, but when run in Spark SQL with HiveContext it fails with: java.lang.RuntimeException: Unsupported language features in query: select c.plateno, c.platetype,c.bay_id as start_bay_id,c.pastkkdate as start_time,lead(c.bay_id) over(partition by c.plateno order by c.pastkkdate) as end_bay_id, lead(c.pastkkdate) over(partition by c.plateno order by c.pastkkdate) as end_time from bay_car_test1 c TOK_QUERY Is it that Spark SQL can't support the lead() and over() functions? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3574) Shuffle finish time always reported as -1
[ https://issues.apache.org/jira/browse/SPARK-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3574: --- Component/s: Spark Core Shuffle finish time always reported as -1 - Key: SPARK-3574 URL: https://issues.apache.org/jira/browse/SPARK-3574 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kay Ousterhout shuffleFinishTime is always reported as -1. I think the way we fix this should be to set the shuffleFinishTime in each ShuffleWriteMetrics as the shuffles finish, but when aggregating the metrics, only report shuffleFinishTime as something other than -1 when *all* of the shuffles have completed. [~sandyr], it looks like this was introduced in your recent patch to incrementally report metrics. Any chance you can fix this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
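To make the proposed fix concrete, a minimal sketch of the aggregation rule follows: only report a real shuffle finish time once every shuffle has one, otherwise keep the -1 sentinel. The helper name is hypothetical and this is not Spark's metrics code.

{code:title=Sketch: aggregate shuffleFinishTime only when all shuffles are done|borderStyle=solid}
// Hypothetical helper: each element is one shuffle's finish time, -1 meaning "not finished yet".
def aggregateShuffleFinishTime(finishTimes: Seq[Long]): Long =
  if (finishTimes.nonEmpty && finishTimes.forall(_ != -1L)) finishTimes.max else -1L

// aggregateShuffleFinishTime(Seq(100L, 250L, 180L)) == 250L
// aggregateShuffleFinishTime(Seq(100L, -1L, 180L))  == -1L
{code}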
[jira] [Updated] (SPARK-2672) Support compression in wholeFile()
[ https://issues.apache.org/jira/browse/SPARK-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2672: --- Summary: Support compression in wholeFile() (was: support compressed file in wholeFile()) Support compression in wholeFile() -- Key: SPARK-2672 URL: https://issues.apache.org/jira/browse/SPARK-2672 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Original Estimate: 72h Remaining Estimate: 72h wholeFile() cannot read compressed files; it should be able to, just like textFile(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2761) Merge similar code paths in ExternalSorter and EAOM
[ https://issues.apache.org/jira/browse/SPARK-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2761: --- Component/s: Spark Core Merge similar code paths in ExternalSorter and EAOM --- Key: SPARK-2761 URL: https://issues.apache.org/jira/browse/SPARK-2761 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Right now there is quite a lot of duplicate code. It would be good to merge these somehow if we want to make changes to them in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3535: - Target Version/s: 1.1.1, 1.2.0 (was: 1.1.1) Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does not account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
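As a rough illustration of the 15-25% overhead rule suggested above, the sketch below adds a fractional overhead (with a floor) to the requested heap before asking Mesos for memory. The constants are assumptions modeled loosely on the linked Hadoop-on-Mesos ResourcePolicy, not values from Spark.

{code:title=Sketch: adding JVM overhead to the executor memory request|borderStyle=solid}
// Assumed constants, for illustration only.
val overheadFraction = 0.15   // lower end of the suggested 15-25% range
val overheadMinimumMb = 384   // floor so small executors still get some headroom

def totalExecutorMemoryMb(requestedHeapMb: Int): Int =
  requestedHeapMb + math.max((requestedHeapMb * overheadFraction).toInt, overheadMinimumMb)

// totalExecutorMemoryMb(4096) == 4710, i.e. request ~4.6 GB from Mesos for a 4 GB heap
{code}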
[jira] [Comment Edited] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139763#comment-14139763 ] Brenden Matthews edited comment on SPARK-3535 at 9/18/14 11:58 PM: --- After some even further digging, I noticed the following error in the Mesos slave log: E0918 23:13:48.726176 130743 slave.cpp:2204] Failed to update resources for container ae051668-6cb8-4252-8b6e-cfbac38e7e5c of executor 20140813-050807-3852091146-5050-1861-231 running task 3 on status update for terminal task, destroying container:Collect failed: No cpus resource given I'll update my patch accordingly. was (Author: brenden): After some even futher digging, I noticed the following error in the Mesos slave log: E0918 23:13:48.726176 130743 slave.cpp:2204] Failed to update resources for container ae051668-6cb8-4252-8b6e-cfbac38e7e5c of executor 20140813-050807-3852091146-5050-1861-231 running task 3 on status update for terminal task, destroying container:Collect failed: No cpus resource given I'll update my patch accordingly. Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139763#comment-14139763 ] Brenden Matthews commented on SPARK-3535: - After some even futher digging, I noticed the following error in the Mesos slave log: E0918 23:13:48.726176 130743 slave.cpp:2204] Failed to update resources for container ae051668-6cb8-4252-8b6e-cfbac38e7e5c of executor 20140813-050807-3852091146-5050-1861-231 running task 3 on status update for terminal task, destroying container:Collect failed: No cpus resource given I'll update my patch accordingly. Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3562) Periodic cleanup event logs
[ https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139786#comment-14139786 ] Matthew Farrellee commented on SPARK-3562: -- is logrotate an option for you? Periodic cleanup event logs --- Key: SPARK-3562 URL: https://issues.apache.org/jira/browse/SPARK-3562 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: xukun If we run Spark applications frequently, they will write many event logs into spark.eventLog.dir. After a long time there will be many event logs in spark.eventLog.dir that we no longer care about. Periodic cleanups would ensure that logs older than a configured duration are forgotten, so there is no need to clean up logs by hand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
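Beyond logrotate, a minimal sketch of what an age-based cleanup inside Spark could look like is shown below. The directory layout and retention period are assumptions for illustration; this is not the implementation being tracked here.

{code:title=Sketch: delete event logs older than a retention period|borderStyle=solid}
import java.io.File

// Remove files in eventLogDir whose last modification is older than maxAgeMs.
def cleanOldEventLogs(eventLogDir: String, maxAgeMs: Long): Unit = {
  val cutoff = System.currentTimeMillis() - maxAgeMs
  Option(new File(eventLogDir).listFiles()).getOrElse(Array.empty[File])
    .filter(f => f.isFile && f.lastModified() < cutoff)
    .foreach(_.delete())
}

// Keep one week of logs, assuming a local event log directory:
// cleanOldEventLogs("/tmp/spark-events", 7L * 24 * 60 * 60 * 1000)
{code}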
[jira] [Resolved] (SPARK-3554) handle large dataset in closure of PySpark
[ https://issues.apache.org/jira/browse/SPARK-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3554. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2417 [https://github.com/apache/spark/pull/2417] handle large dataset in closure of PySpark -- Key: SPARK-3554 URL: https://issues.apache.org/jira/browse/SPARK-3554 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 Sometimes a large dataset is used in a closure and the user forgets to broadcast it, so the serialized command becomes huge. Py4J cannot handle large objects efficiently, so we should compress the serialized command and use a broadcast variable for it if it is huge. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
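The fix above compresses and broadcasts the serialized command automatically; the underlying advice still holds that large datasets are better broadcast explicitly than captured in a closure. A small sketch of that pattern follows (written in Scala for consistency with the other examples here, assuming an existing SparkContext named sc; the lookup data is made up).

{code:title=Sketch: broadcast a large dataset instead of capturing it in a closure|borderStyle=solid}
// Capturing bigLookup directly in the map closure would ship it with every task;
// broadcasting sends it to each executor once.
val bigLookup: Map[Int, String] = (1 to 1000000).map(i => i -> ("value" + i)).toMap
val bcLookup = sc.broadcast(bigLookup)

val ids = sc.parallelize(1 to 100)
val resolved = ids.map(i => bcLookup.value.getOrElse(i, "missing"))
{code}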
[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139845#comment-14139845 ] Vinod Kone commented on SPARK-3535: --- This can happen if the spark executor doesn't use any cpus (or memory) and there are no tasks running on it. Note that in the next release of Mesos, such an executor is not allowed to launch. https://issues.apache.org/jira/browse/MESOS-1807 Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3597) MesosSchedulerBackend does not implement `killTask`
Brenden Matthews created SPARK-3597: --- Summary: MesosSchedulerBackend does not implement `killTask` Key: SPARK-3597 URL: https://issues.apache.org/jira/browse/SPARK-3597 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Fix For: 1.1.1 The MesosSchedulerBackend class does not implement `killTask`, and therefore results in exceptions like this: 14/09/19 01:52:53 ERROR TaskSetManager: Task 238 in stage 1.0 failed 4 times; aborting job 14/09/19 01:52:53 INFO TaskSchedulerImpl: Cancelling stage 1 14/09/19 01:52:53 INFO DAGScheduler: Could not cancel tasks for stage 1 java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:194) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:192) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:185) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1211) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1197) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
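For context on what a fix might involve, the sketch below shows one way kill requests could be forwarded to Mesos rather than falling through to SchedulerBackend's default, which throws the UnsupportedOperationException seen above. This is not the actual patch; it assumes the backend already holds the SchedulerDriver it used to launch tasks and that Spark task IDs map directly onto the Mesos task IDs it created.

{code:title=Sketch: forwarding a task kill to the Mesos driver|borderStyle=solid}
import org.apache.mesos.Protos.TaskID
import org.apache.mesos.SchedulerDriver

// Hypothetical helper; a killTask override in MesosSchedulerBackend could delegate to it.
def killMesosTask(driver: SchedulerDriver, sparkTaskId: Long): Unit = {
  driver.killTask(TaskID.newBuilder().setValue(sparkTaskId.toString).build())
}
{code}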
[jira] [Commented] (SPARK-3581) RDD API(distinct/subtract) does not work for RDD of Dictionaries
[ https://issues.apache.org/jira/browse/SPARK-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139902#comment-14139902 ] Shawn Guo commented on SPARK-3581: -- Yes, please. Thanks for clarification. RDD API(distinct/subtract) does not work for RDD of Dictionaries Key: SPARK-3581 URL: https://issues.apache.org/jira/browse/SPARK-3581 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.2, 1.1.0 Environment: Spark 1.0 1.1 JDK 1.6 Reporter: Shawn Guo Priority: Minor Construct a RDD of dictionaries(dictRDD), try to use the RDD API, RDD.distinct() or RDD.subtract(). {code:title=PySpark RDD API Test|borderStyle=solid} dictRDD = sc.parallelize(({'MOVIE_ID': 1, 'MOVIE_NAME': 'Lord of the Rings','MOVIE_DIRECTOR': 'Peter Jackson'},{'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'},{'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'})) dictRDD.distinct().collect() dictRDD.subtract(dictRDD).collect() {code} An error occurred while calling, TypeError: unhashable type: 'dict' I'm not sure if it is a bug or expected results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3598) cast to timestamp should be the same as hive
Adrian Wang created SPARK-3598: -- Summary: cast to timestamp should be the same as hive Key: SPARK-3598 URL: https://issues.apache.org/jira/browse/SPARK-3598 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang select cast(1000 as timestamp) from src limit 1; should return 1970-01-01 00:00:01. Also, the current implementation has a bug when the time is before 1970-01-01 00:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
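To make the expected values concrete, the snippet below mirrors the example in the report using plain java.sql.Timestamp (millisecond precision); it is not Spark SQL's cast implementation.

{code:title=Illustration of the expected timestamp values|borderStyle=solid}
import java.sql.Timestamp

// 1000 interpreted as milliseconds since the epoch matches the value the report expects:
val t1 = new Timestamp(1000L)    // 1970-01-01 00:00:01.0 in UTC

// A correct cast must also handle instants before the epoch, i.e. negative values:
val t2 = new Timestamp(-1000L)   // 1969-12-31 23:59:59.0 in UTC
{code}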
[jira] [Commented] (SPARK-3321) Defining a class within python main script
[ https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139903#comment-14139903 ] Shawn Guo commented on SPARK-3321: -- Yes please, thanks for clarification. Defining a class within python main script -- Key: SPARK-3321 URL: https://issues.apache.org/jira/browse/SPARK-3321 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Environment: Python version 2.6.6 Spark version version 1.0.1 jdk1.6.0_43 Reporter: Shawn Guo Priority: Minor *leftOuterJoin(self, other, numPartitions=None)* Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k. *Background*: leftOuterJoin will produce None element in result dataset. I define a new class 'Null' in the main script to replace all python native None to new 'Null' object. 'Null' object overload the [] operator. {code:title=Class Null|borderStyle=solid} class Null(object): def __getitem__(self,key): return None; def __getstate__(self): pass; def __setstate__(self, dict): pass; def convert_to_null(x): return Null() if x is None else x X = A.leftOuterJoin(B) X.mapValues(lambda line: (line[0],convert_to_null(line[1])) {code} The code seems running good in pyspark console, however spark-submit failed with below error messages: /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] /tmp/python_test.py {noformat} File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 134, in _write_with_length serialized = self.dumps(obj) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 279, in dumps def dumps(self, obj): return cPickle.dumps(obj, 2) PicklingError: Can't pickle class '__main__.Null': attribute lookup __main__.Null failed org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at
[jira] [Created] (SPARK-3599) Avoid loading and printing properties file content frequently
WangTaoTheTonic created SPARK-3599: -- Summary: Avoid loading and printing properties file content frequently Key: SPARK-3599 URL: https://issues.apache.org/jira/browse/SPARK-3599 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor When I use -v | --verbose in spark-submit, it prints lots of messages about the contents of the properties file. After checking the code in SparkSubmit.scala and SparkSubmitArguments.scala, I found that the getDefaultSparkProperties method is invoked in three places, and every time we invoke it we load the properties from the properties file again, and print them again if the -v option is used. We should probably use a value instead of a method for the default properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
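A minimal sketch of the suggested change follows: hold the parsed properties in a lazy val so the file is read (and printed under -v) only once, instead of on every call to a def. The field names and log message are illustrative, not the actual SparkSubmitArguments code.

{code:title=Sketch: cache default properties in a lazy val|borderStyle=solid}
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._

// Illustrative stand-ins for fields in SparkSubmitArguments.
val propertiesFile: String = "conf/spark-defaults.conf"
val verbose: Boolean = true

// Loaded (and optionally printed) once, on first access.
lazy val defaultSparkProperties: Map[String, String] = {
  val props = new Properties()
  val in = new FileInputStream(propertiesFile)
  try props.load(in) finally in.close()
  val loaded = props.asScala.toMap
  if (verbose) loaded.foreach { case (k, v) => println(s"Adding default property: $k=$v") }
  loaded
}
{code}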
[jira] [Closed] (SPARK-3581) RDD API(distinct/subtract) does not work for RDD of Dictionaries
[ https://issues.apache.org/jira/browse/SPARK-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-3581. Resolution: Not a Problem RDD API(distinct/subtract) does not work for RDD of Dictionaries Key: SPARK-3581 URL: https://issues.apache.org/jira/browse/SPARK-3581 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.2, 1.1.0 Environment: Spark 1.0 1.1 JDK 1.6 Reporter: Shawn Guo Priority: Minor Construct a RDD of dictionaries(dictRDD), try to use the RDD API, RDD.distinct() or RDD.subtract(). {code:title=PySpark RDD API Test|borderStyle=solid} dictRDD = sc.parallelize(({'MOVIE_ID': 1, 'MOVIE_NAME': 'Lord of the Rings','MOVIE_DIRECTOR': 'Peter Jackson'},{'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'},{'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'})) dictRDD.distinct().collect() dictRDD.subtract(dictRDD).collect() {code} An error occurred while calling, TypeError: unhashable type: 'dict' I'm not sure if it is a bug or expected results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3321) Defining a class within python main script
[ https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-3321. Resolution: Not a Problem Defining a class within python main script -- Key: SPARK-3321 URL: https://issues.apache.org/jira/browse/SPARK-3321 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Environment: Python version 2.6.6 Spark version version 1.0.1 jdk1.6.0_43 Reporter: Shawn Guo Priority: Minor *leftOuterJoin(self, other, numPartitions=None)* Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k. *Background*: leftOuterJoin will produce None element in result dataset. I define a new class 'Null' in the main script to replace all python native None to new 'Null' object. 'Null' object overload the [] operator. {code:title=Class Null|borderStyle=solid} class Null(object): def __getitem__(self,key): return None; def __getstate__(self): pass; def __setstate__(self, dict): pass; def convert_to_null(x): return Null() if x is None else x X = A.leftOuterJoin(B) X.mapValues(lambda line: (line[0],convert_to_null(line[1])) {code} The code seems running good in pyspark console, however spark-submit failed with below error messages: /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] /tmp/python_test.py {noformat} File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 134, in _write_with_length serialized = self.dumps(obj) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 279, in dumps def dumps(self, obj): return cPickle.dumps(obj, 2) PicklingError: Can't pickle class '__main__.Null': attribute lookup __main__.Null failed org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456)
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139958#comment-14139958 ] Sandy Ryza commented on SPARK-3573: --- Currently SchemaRDD lives inside SQL. Would we move it to core if we plan to use it in components that aren't related to SQL? Dataset --- Key: SPARK-3573 URL: https://issues.apache.org/jira/browse/SPARK-3573 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. .Sample code Suppose we have training events stored on HDFS and user/ad features in Hive, we want to assemble features for training and then apply decision tree. The proposed pipeline with dataset looks like the following (need more refinements): {code} sqlContext.jsonFile(/path/to/training/events, 0.01).registerTempTable(event) val training = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;).cache() val indexer = new Indexer() val interactor = new Interactor() val fvAssembler = new FeatureVectorAssembler() val treeClassifer = new DecisionTreeClassifer() val paramMap = new ParamMap() .put(indexer.features, Map(userCountryIndex - userCountry)) .put(indexer.sortByFrequency, true) .put(iteractor.features, Map(genderMatch - Array(userGender, targetGender))) .put(fvAssembler.features, Map(features - Array(genderMatch, userCountryIndex, userFeatures))) .put(fvAssembler.dense, true) .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes features and label columns. val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier) val model = pipeline.fit(raw, paramMap) sqlContext.jsonFile(/path/to/events, 0.01).registerTempTable(event) val test = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;) val prediction = model.transform(test).select('eventId, 'prediction) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139965#comment-14139965 ] Erik Erlandson commented on SPARK-3250: --- PR: https://github.com/apache/spark/pull/2455 [SPARK-3250] Implement Gap Sampling optimization for random sampling More Efficient Sampling --- Key: SPARK-3250 URL: https://issues.apache.org/jira/browse/SPARK-3250 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: RJ Nowling Sampling, as currently implemented in Spark, is an O\(n\) operation. A number of stochastic algorithms achieve speed ups by exploiting O\(k\) sampling, where k is the number of data points to sample. Examples of such algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with mini batching. More efficient sampling may be achievable by packing partitions with an ArrayBuffer or other data structure supporting random access. Since many of these stochastic algorithms perform repeated rounds of sampling, it may be feasible to perform a transformation to change the backing data structure followed by multiple rounds of sampling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
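For readers new to the gap-sampling idea behind the linked PR, here is a rough sketch of the technique on a plain Scala iterator: instead of testing every element against the sampling rate, draw geometrically distributed gaps and jump straight to the next sampled element, so the work is roughly proportional to the number of samples. This is illustrative only and not the PR's implementation, which handles more cases (sampling with replacement, very high rates, and so on).

{code:title=Sketch: gap sampling over an iterator|borderStyle=solid}
import scala.util.Random

// Bernoulli sampling at rate f (0 < f < 1) using geometric gaps between samples.
def gapSample[T](data: Iterator[T], f: Double, rng: Random = new Random()): Iterator[T] = {
  require(f > 0.0 && f < 1.0, "sampling rate must be in (0, 1)")
  new Iterator[T] {
    private def advance(): Option[T] = {
      val u = 1.0 - rng.nextDouble()                                 // in (0, 1], avoids log(0)
      var skip = math.floor(math.log(u) / math.log(1.0 - f)).toInt   // geometric gap
      while (skip > 0 && data.hasNext) { data.next(); skip -= 1 }
      if (data.hasNext) Some(data.next()) else None
    }
    private var pending: Option[T] = advance()
    def hasNext: Boolean = pending.isDefined
    def next(): T = { val out = pending.get; pending = advance(); out }
  }
}

// Expect roughly 10,000 of the 1,000,000 elements:
// gapSample((1 to 1000000).iterator, 0.01).size
{code}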
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140017#comment-14140017 ] Patrick Wendell commented on SPARK-3573: [~sandyr] This is a good question I'm not sure how easy it would be to decouple SchemaRDD from the other things inside of sql/core. This definitely doesn't need to depend on catalyst or on hive, but it might need to depend on the entire sql core. I've been thinking about whether this is bad to have a growing number cross-dependencies in the projects. Do you see specific drawbacks here if that becomes the case? Dataset --- Key: SPARK-3573 URL: https://issues.apache.org/jira/browse/SPARK-3573 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. .Sample code Suppose we have training events stored on HDFS and user/ad features in Hive, we want to assemble features for training and then apply decision tree. The proposed pipeline with dataset looks like the following (need more refinements): {code} sqlContext.jsonFile(/path/to/training/events, 0.01).registerTempTable(event) val training = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;).cache() val indexer = new Indexer() val interactor = new Interactor() val fvAssembler = new FeatureVectorAssembler() val treeClassifer = new DecisionTreeClassifer() val paramMap = new ParamMap() .put(indexer.features, Map(userCountryIndex - userCountry)) .put(indexer.sortByFrequency, true) .put(iteractor.features, Map(genderMatch - Array(userGender, targetGender))) .put(fvAssembler.features, Map(features - Array(genderMatch, userCountryIndex, userFeatures))) .put(fvAssembler.dense, true) .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes features and label columns. val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier) val model = pipeline.fit(raw, paramMap) sqlContext.jsonFile(/path/to/events, 0.01).registerTempTable(event) val test = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;) val prediction = model.transform(test).select('eventId, 'prediction) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2058) SPARK_CONF_DIR should override all present configs
[ https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140023#comment-14140023 ] David Rosenstrauch commented on SPARK-2058: --- I'm wondering the same: has this fix made it into a release? SPARK_CONF_DIR should override all present configs -- Key: SPARK-2058 URL: https://issues.apache.org/jira/browse/SPARK-2058 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0, 1.0.1, 1.1.0 Reporter: Eugen Cepoi Priority: Critical When the user defines SPARK_CONF_DIR I think spark should use all the configs available there not only spark-env. This involves changing SparkSubmitArguments to first read from SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the computed classpath for configs such as log4j, metrics, etc. I have already prepared a PR for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3270) Spark API for Application Extensions
[ https://issues.apache.org/jira/browse/SPARK-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140043#comment-14140043 ] Patrick Wendell commented on SPARK-3270: Hey There, For the particular use case here - long-lived components that want to co-exist with executors and share the executor lifecycle - applications can enact this using static objects. I.e. you lazily initialize some service when it is accessed from within tasks. Then, when the executor terminates, the object goes away. Are there specific things preventing that approach in this case? The reason I ask is that adding a generic service discovery mechanism to Spark, while useful, is a fairly large new interface, and I'd guess it's something where we'd want to look at a few specific applications to find the common set of needs, to make sure we are adding the right APIs. Spark API for Application Extensions Key: SPARK-3270 URL: https://issues.apache.org/jira/browse/SPARK-3270 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Michal Malohlava Any application should be able to enrich the Spark infrastructure with services which are not available by default. Hence, to support such application extensions (aka extensions/plugins) the Spark platform should provide: - an API to register an extension - an API to register a service (meaning provided functionality) - well-defined points in the Spark infrastructure which can be enriched/hooked by an extension - a way of deploying an extension (for example, simply putting the extension on the classpath and using the Java service interface) - a way to access an extension from the application Overall proposal is available here: https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing Note: In this context, I do not mean reinventing OSGi (or another plugin platform) but it can serve as a good starting point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
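A minimal sketch of the static-object pattern described in the comment above: a JVM-level singleton that is lazily initialized the first time a task touches it on an executor and goes away when the executor JVM exits. The object name is made up, and an existing SparkContext named sc is assumed.

{code:title=Sketch: lazily initialized executor-local service|borderStyle=solid}
import scala.collection.concurrent.TrieMap

object ExecutorLocalService {
  // Created at most once per executor JVM, on first use from a task.
  lazy val cache: TrieMap[Int, String] = {
    println("initializing executor-local service")   // runs once per executor
    TrieMap.empty[Int, String]
  }
}

val enriched = sc.parallelize(1 to 10).map { i =>
  ExecutorLocalService.cache.getOrElseUpdate(i, "value-" + i)
}
{code}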