[jira] [Updated] (SPARK-3581) RDD API(distinct/subtract) does not work for RDD of Dictionaries
[ https://issues.apache.org/jira/browse/SPARK-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Guo updated SPARK-3581: - Affects Version/s: 1.0.0 1.0.2 RDD API(distinct/subtract) does not work for RDD of Dictionaries Key: SPARK-3581 URL: https://issues.apache.org/jira/browse/SPARK-3581 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.2, 1.1.0 Environment: Spark 1.0/1.1, JDK 1.6 Reporter: Shawn Guo Priority: Minor
Construct an RDD of dictionaries (dictRDD) and try to use the RDD API RDD.distinct() or RDD.subtract().
{code:title=PySpark RDD API Test|borderStyle=solid}
dictRDD = sc.parallelize(({'MOVIE_ID': 1, 'MOVIE_NAME': 'Lord of the Rings', 'MOVIE_DIRECTOR': 'Peter Jackson'},
                          {'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'},
                          {'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'}))
dictRDD.distinct().collect()
dictRDD.subtract(dictRDD).collect()
{code}
Both calls fail with: TypeError: unhashable type: 'dict'. I'm not sure if this is a bug or the expected result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
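A note on the behavior above: Python dicts are unhashable, so any RDD operation that needs to hash elements (distinct, subtract, keys of reduceByKey, etc.) will raise this TypeError. A minimal workaround sketch, assuming the dictionary values are themselves hashable, is to map each record to a canonical tuple form first and convert back afterwards:
{code:title=Workaround sketch|borderStyle=solid}
# Convert each dict to a hashable, canonical representation (a sorted tuple
# of key/value pairs), run the set-like operation, then convert back.
def to_key(d):
    return tuple(sorted(d.items()))

def to_dict(t):
    return dict(t)

distinct_dicts = dictRDD.map(to_key).distinct().map(to_dict)
difference = dictRDD.map(to_key).subtract(dictRDD.map(to_key)).map(to_dict)
{code}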
[jira] [Commented] (SPARK-3321) Defining a class within python main script
[ https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138569#comment-14138569 ] Shawn Guo commented on SPARK-3321: -- No idea yet, I use --py-files Null.py instead. It seems to be a pickle library limitation that has existed for a long time. Defining a class within python main script -- Key: SPARK-3321 URL: https://issues.apache.org/jira/browse/SPARK-3321 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Environment: Python version 2.6.6 Spark version 1.0.1 jdk1.6.0_43 Reporter: Shawn Guo Priority: Minor *leftOuterJoin(self, other, numPartitions=None)* Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k. *Background*: leftOuterJoin will produce None elements in the result dataset. I define a new class 'Null' in the main script to replace every Python native None with a new 'Null' object. The 'Null' object overloads the [] operator.
{code:title=Class Null|borderStyle=solid}
class Null(object):
    def __getitem__(self, key): return None
    def __getstate__(self): pass
    def __setstate__(self, dict): pass

def convert_to_null(x):
    return Null() if x is None else x

X = A.leftOuterJoin(B)
X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
{code}
The code seems to run fine in the pyspark console, however spark-submit fails with the error messages below: /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] /tmp/python_test.py {noformat} File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 134, in _write_with_length serialized = self.dumps(obj) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 279, in dumps def dumps(self, obj): return cPickle.dumps(obj, 2) PicklingError: Can't pickle class '__main__.Null': attribute lookup __main__.Null failed org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229) at
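As the reporter's own comment above suggests (--py-files Null.py), the usual fix for this PicklingError is to move the class out of the __main__ script into its own module and ship that module to the executors, so that pickle can resolve the class by its qualified name. A rough sketch under that assumption (the module name null_helper is illustrative):
{code:title=null_helper.py (shipped with --py-files)|borderStyle=solid}
# Defined in a module rather than in __main__, so cPickle can look up
# null_helper.Null on the workers.
class Null(object):
    def __getitem__(self, key):
        return None

def convert_to_null(x):
    return Null() if x is None else x
{code}
{code:title=main script|borderStyle=solid}
# spark-submit --master local[2] --py-files null_helper.py python_test.py
from null_helper import Null, convert_to_null

# A and B are the pair RDDs from the original example above.
X = A.leftOuterJoin(B)
Y = X.mapValues(lambda pair: (pair[0], convert_to_null(pair[1])))
{code}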
[jira] [Created] (SPARK-3583) Spark run slow after unexpected repartition
ShiShu created SPARK-3583: - Summary: Spark run slow after unexpected repartition Key: SPARK-3583 URL: https://issues.apache.org/jira/browse/SPARK-3583 Project: Spark Issue Type: Bug Affects Versions: 0.9.1 Reporter: ShiShu Hi dear all~ My Spark application sometimes runs much slower than it used to, and I wonder why this happens. I found out that after an unexpected repartition at stage 17, all tasks go to one executor, but in my code I only use repartition at the very beginning. Before stage 17, every stage runs successfully within 1 minute; after stage 17, every stage takes more than 10 minutes. Normally my application runs successfully and finishes within 9 minutes. My Spark version is 0.9.1, and my program is written in Scala. I took some screenshots but don't know how to post them; please tell me if you need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3583) Spark run slow after unexpected repartition
[ https://issues.apache.org/jira/browse/SPARK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ShiShu updated SPARK-3583: -- Attachment: spark_q_006.jpg spark_q_005.jpg spark_q_004.jpg spark_q_001.jpg These are the screenshots I took. Spark run slow after unexpected repartition --- Key: SPARK-3583 URL: https://issues.apache.org/jira/browse/SPARK-3583 Project: Spark Issue Type: Bug Affects Versions: 0.9.1 Reporter: ShiShu Labels: easyfix Attachments: spark_q_001.jpg, spark_q_004.jpg, spark_q_005.jpg, spark_q_006.jpg Hi dear all~ My Spark application sometimes runs much slower than it used to, and I wonder why this happens. I found out that after an unexpected repartition at stage 17, all tasks go to one executor, but in my code I only use repartition at the very beginning. Before stage 17, every stage runs successfully within 1 minute; after stage 17, every stage takes more than 10 minutes. Normally my application runs successfully and finishes within 9 minutes. My Spark version is 0.9.1, and my program is written in Scala. I took some screenshots but don't know how to post them; please tell me if you need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
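Without the attached screenshots it is hard to diagnose further, but "all tasks go to one executor" after a shuffle usually means one partition holds nearly all of the data. A quick way to check for such skew, shown here as a PySpark sketch purely for illustration (the reporter's program is Scala, where equivalent RDD calls exist; suspect_rdd is a placeholder name):
{code}
# Count the records in each partition of the RDD feeding the slow stage;
# one huge count among many small ones confirms the skew described above.
sizes = suspect_rdd.glom().map(len).collect()
print(sizes)
{code}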
[jira] [Commented] (SPARK-3578) GraphGenerators.sampleLogNormal sometimes returns too-large result
[ https://issues.apache.org/jira/browse/SPARK-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138638#comment-14138638 ] Ankur Dave commented on SPARK-3578: --- [~pwendell] Sorry, I forgot to do that this time. GraphGenerators.sampleLogNormal sometimes returns too-large result -- Key: SPARK-3578 URL: https://issues.apache.org/jira/browse/SPARK-3578 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Reporter: Ankur Dave Assignee: Ankur Dave Priority: Minor GraphGenerators.sampleLogNormal is supposed to return an integer strictly less than maxVal. However, it violates this guarantee. It generates its return value as follows:
{code}
var X: Double = maxVal
while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
{code}
When X is sampled to be close to (but less than) maxVal, it will pass the while loop condition, but the rounded result will be equal to maxVal, which will fail the test. For example, if maxVal is 5 and X is 4.9, then X < maxVal, but math.round(X.toFloat) is 5. A solution is to round X down instead of to the nearest integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
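The boundary case is easy to reproduce with plain arithmetic; a small illustration (written in Python only for brevity) of why rounding down restores the strict upper bound:
{code}
import math

max_val = 5
x = 4.9                 # accepted by the sampling loop, since x < max_val

round(x)                # 5 -> equal to max_val, violating the guarantee
math.floor(x)           # 4 -> always strictly less than max_val
{code}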
[jira] [Resolved] (SPARK-1353) IllegalArgumentException when writing to disk
[ https://issues.apache.org/jira/browse/SPARK-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1353. Resolution: Duplicate IllegalArgumentException when writing to disk - Key: SPARK-1353 URL: https://issues.apache.org/jira/browse/SPARK-1353 Project: Spark Issue Type: Bug Components: Block Manager Environment: AWS EMR 3.2.30-49.59.amzn1.x86_64 #1 SMP x86_64 GNU/Linux Spark 1.0.0-SNAPSHOT built for Hadoop 1.0.4 built 2014-03-18 Reporter: Jim Blomo Priority: Minor The Executor may fail when trying to mmap a file bigger than Integer.MAX_VALUE due to the constraints of FileChannel.map (http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode, long, long)). The signature takes longs, but the size value must be less than MAX_VALUE. This manifests with the following backtrace: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:98) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:337) at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:281) at org.apache.spark.storage.BlockManager.get(BlockManager.scala:430) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:38) at org.apache.spark.rdd.RDD.iterator(RDD.scala:220) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3525) Gradient boosting in MLLib
[ https://issues.apache.org/jira/browse/SPARK-3525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138674#comment-14138674 ] Egor Pakhomov commented on SPARK-3525: -- https://github.com/apache/spark/pull/2394 Gradient boosting in MLLib -- Key: SPARK-3525 URL: https://issues.apache.org/jira/browse/SPARK-3525 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Egor Pakhomov Assignee: Egor Pakhomov Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3584) sbin/slaves doesn't work when we use password authentication for SSH
Kousuke Saruta created SPARK-3584: - Summary: sbin/slaves doesn't work when we use password authentication for SSH Key: SPARK-3584 URL: https://issues.apache.org/jira/browse/SPARK-3584 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta In sbin/slaves, the ssh command runs in the background, but if we use password authentication a backgrounded ssh command doesn't work, so sbin/slaves doesn't work. I also suggest an improvement for sbin/slaves: in the current implementation the slaves file is tracked by Git, but it can be edited by users, so we should prepare slaves.template instead of slaves. The default slaves file has one entry, localhost, so we should use localhost as the default host list. I modified sbin/slaves to choose localhost as the default host list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3584) sbin/slaves doesn't work when we use password authentication for SSH
[ https://issues.apache.org/jira/browse/SPARK-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138716#comment-14138716 ] Apache Spark commented on SPARK-3584: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2444 sbin/slaves doesn't work when we use password authentication for SSH Key: SPARK-3584 URL: https://issues.apache.org/jira/browse/SPARK-3584 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta In sbin/slaves, the ssh command runs in the background, but if we use password authentication a backgrounded ssh command doesn't work, so sbin/slaves doesn't work. I also suggest an improvement for sbin/slaves: in the current implementation the slaves file is tracked by Git, but it can be edited by users, so we should prepare slaves.template instead of slaves. The default slaves file has one entry, localhost, so we should use localhost as the default host list. I modified sbin/slaves to choose localhost as the default host list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3585) Probability Values
Tamilselvan Palani created SPARK-3585: - Summary: Probability Values Key: SPARK-3585 URL: https://issues.apache.org/jira/browse/SPARK-3585 Project: Spark Issue Type: Question Components: MLlib Affects Versions: 1.1.0 Environment: Development Reporter: Tamilselvan Palani Priority: Blocker Fix For: 1.1.0 Unable to get actual probability values BY DEFAULT along with the prediction while running Logistic Regression or Decision Tree. Please help with the following questions: 1. Is there an API that can be called to force the output to include probability values along with the predicted values? 2. Does it require customization? If so, in which package/class? Please let me know if I can provide any further information. Thanks and Regards, Tamilselvan Palani -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3585) Probability Values in Logistic Regression/Decision Tree output
[ https://issues.apache.org/jira/browse/SPARK-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tamilselvan Palani updated SPARK-3585: -- Summary: Probability Values in Logistic Regression/Decision Tree output (was: Probability Values in Logistic Regression output ) Probability Values in Logistic Regression/Decision Tree output --- Key: SPARK-3585 URL: https://issues.apache.org/jira/browse/SPARK-3585 Project: Spark Issue Type: Question Components: MLlib Affects Versions: 1.1.0 Environment: Development Reporter: Tamilselvan Palani Priority: Blocker Labels: newbie Fix For: 1.1.0 Original Estimate: 168h Remaining Estimate: 168h Unable to get actual probability values BY DEFAULT along with the prediction while running Logistic Regression or Decision Tree. Please help with the following questions: 1. Is there an API that can be called to force the output to include probability values along with the predicted values? 2. Does it require customization? If so, in which package/class? Please let me know if I can provide any further information. Thanks and Regards, Tamilselvan Palani -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
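For the binary logistic regression case, one way to obtain a probability rather than a 0/1 prediction is to compute the logistic function from the model's weights and intercept directly. A hedged PySpark sketch (lr_model and test are placeholder names, and it assumes the weight and feature vectors can be iterated like plain sequences):
{code}
import math

def probability(model, features):
    # Raw margin w.x + b pushed through the logistic function; this is the
    # class-1 probability for a binary logistic regression model.
    margin = sum(w * x for w, x in zip(model.weights, features)) + model.intercept
    return 1.0 / (1.0 + math.exp(-margin))

scores_and_labels = test.map(lambda p: (probability(lr_model, p.features), p.label))
{code}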
[jira] [Created] (SPARK-3586) spark streaming
wangxj created SPARK-3586: - Summary: spark streaming Key: SPARK-3586 URL: https://issues.apache.org/jira/browse/SPARK-3586 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: wangxj Priority: Minor Fix For: 1.1.0 For text files, the method is streamingContext.textFileStream(dataDirectory). Spark Streaming will monitor the directory dataDirectory and process any files created in that directory, but files written in nested directories are not supported, e.g. streamingContext.textFileStream(/test). Look at the directory contents:
/test/file1
/test/file2
/test/dr/file1
With this method, textFileStream can only read /test/file1, /test/file2 and the entry /test/dr/, but the file /test/dr/file1 is not read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
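Until nested directories are supported, one workaround is to create a separate file stream per known subdirectory and union them. A rough sketch, shown in PySpark purely for illustration (the PySpark streaming API shown here may not be available in the affected version; the paths are the example ones from the report):
{code}
# textFileStream only picks up files created directly under the monitored
# directory, so watch each subdirectory explicitly and union the streams.
dirs = ["/test", "/test/dr"]
streams = [ssc.textFileStream(d) for d in dirs]
lines = ssc.union(*streams)
{code}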
[jira] [Created] (SPARK-3587) Spark SQL can't support lead() over() window function
caoli created SPARK-3587: Summary: Spark SQL can't support lead() over() window function Key: SPARK-3587 URL: https://issues.apache.org/jira/browse/SPARK-3587 Project: Spark Issue Type: Bug Affects Versions: 1.1.0, 1.0.2 Environment: operating system is SUSE 11.3, three nodes, every node has 48 GB MEM and 16 CPUs Reporter: caoli When running the following SQL: select c.plateno, c.platetype, c.bay_id as start_bay_id, c.pastkkdate as start_time, lead(c.bay_id) over(partition by c.plateno order by c.pastkkdate) as end_bay_id, lead(c.pastkkdate) over(partition by c.plateno order by c.pastkkdate) as end_time from bay_car_test1 c in Hive, it is OK, but running it in Spark SQL with HiveContext fails with: java.lang.RuntimeException: Unsupported language features in query: select c.plateno, c.platetype, c.bay_id as start_bay_id, c.pastkkdate as start_time, lead(c.bay_id) over(partition by c.plateno order by c.pastkkdate) as end_bay_id, lead(c.pastkkdate) over(partition by c.plateno order by c.pastkkdate) as end_time from bay_car_test1 c TOK_QUERY Does Spark SQL not support the lead() and over() functions? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
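Until window functions are supported, the same result can be computed with plain RDD operations by grouping the rows per plateno and pairing each row with the next one in pastkkdate order, i.e. a hand-rolled lead(). A hedged PySpark sketch (rows_rdd and the tuple layout are illustrative assumptions):
{code}
# rows_rdd elements are assumed to be tuples:
# (plateno, platetype, bay_id, pastkkdate)
def with_lead(rows):
    ordered = sorted(rows, key=lambda r: r[1])           # r = (bay_id, pastkkdate)
    out = []
    for cur, nxt in zip(ordered, ordered[1:] + [None]):
        end_bay, end_time = (None, None) if nxt is None else (nxt[0], nxt[1])
        out.append((cur[0], cur[1], end_bay, end_time))  # start_bay, start_time, end_bay, end_time
    return out

by_plate = rows_rdd.map(lambda r: (r[0], (r[2], r[3])))  # (plateno, (bay_id, pastkkdate))
result = by_plate.groupByKey().flatMapValues(with_lead)
{code}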
[jira] [Created] (SPARK-3588) Gaussian Mixture Model clustering
Meethu Mathew created SPARK-3588: Summary: Gaussian Mixture Model clustering Key: SPARK-3588 URL: https://issues.apache.org/jira/browse/SPARK-3588 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Meethu Mathew
Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
We have come up with an initial distributed implementation of GMM in pyspark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
We did an initial benchmark study on a 2 node Spark standalone cluster setup where each node config is (8 Cores, 8 GB RAM) and the spark version used is 1.0.0. We also evaluated the python version of k-means available in spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
||Instances||Dimensions||GMM: Avg time per iteration||GMM: Time for 100 iterations||Kmeans (Python): Avg time per iteration||Kmeans (Python): Time for 100 iterations||
|0.7 million|13|7s|12 min|13s|26 min|
|1.8 million|11|17s|29 min|33s|53 min|
|10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
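For context, the core of such an implementation is a standard EM loop for a mixture with diagonal covariances. A condensed single-machine numpy sketch of one EM iteration (this is only an illustration of the technique, not the attached GMMSpark.py):
{code:title=Diagonal-covariance GMM, one EM step (illustrative)|borderStyle=solid}
import numpy as np

def em_step(X, weights, means, variances):
    # X: (n, d) data; weights: (k,); means, variances: (k, d) with one
    # diagonal covariance (variance vector) per mixture component.
    n, d = X.shape
    k = len(weights)

    # E-step: log responsibility of each component for each point.
    log_prob = np.empty((n, k))
    for j in range(k):
        diff = X - means[j]
        log_prob[:, j] = (np.log(weights[j])
                          - 0.5 * np.sum(np.log(2.0 * np.pi * variances[j]))
                          - 0.5 * np.sum(diff ** 2 / variances[j], axis=1))
    log_prob -= log_prob.max(axis=1, keepdims=True)      # numerical stability
    resp = np.exp(log_prob)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate mixture weights, means and diagonal variances.
    nk = resp.sum(axis=0)
    weights = nk / n
    means = resp.T.dot(X) / nk[:, None]
    variances = np.array([(resp[:, j, None] * (X - means[j]) ** 2).sum(axis=0) / nk[j]
                          for j in range(k)])
    return weights, means, variances
{code}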
[jira] [Updated] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meethu Mathew updated SPARK-3588: - Description: Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
We have come up with an initial distributed implementation of GMM in pyspark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
We did an initial benchmark study on a 2 node Spark standalone cluster setup where each node config is 8 Cores, 8 GB RAM, the spark version used is 1.0.0. We also evaluated the python version of k-means available in spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
||Instances||Dimensions||GMM: Avg time per iteration||GMM: Time for 100 iterations||Kmeans (Python): Avg time per iteration||Kmeans (Python): Time for 100 iterations||
|0.7 million|13|7s|12 min|13s|26 min|
|1.8 million|11|17s|29 min|33s|53 min|
|10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
was: Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
We have come up with an initial distributed implementation of GMM in pyspark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
We did an initial benchmark study on a 2 node Spark standalone cluster setup where each node config is (8 Cores, 8 GB RAM) and the spark version used is 1.0.0. We also evaluated the python version of k-means available in spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
||Instances||Dimensions||GMM: Avg time per iteration||GMM: Time for 100 iterations||Kmeans (Python): Avg time per iteration||Kmeans (Python): Time for 100 iterations||
|0.7 million|13|7s|12 min|13s|26 min|
|1.8 million|11|17s|
[jira] [Updated] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meethu Mathew updated SPARK-3588: - Attachment: GMMSpark.py Gaussian Mixture Model clustering - Key: SPARK-3588 URL: https://issues.apache.org/jira/browse/SPARK-3588 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Meethu Mathew Attachments: GMMSpark.py
Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
We have come up with an initial distributed implementation of GMM in pyspark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
We did an initial benchmark study on a 2 node Spark standalone cluster setup where each node config is 8 Cores, 8 GB RAM, the spark version used is 1.0.0. We also evaluated the python version of k-means available in spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
||Instances||Dimensions||GMM: Avg time per iteration||GMM: Time for 100 iterations||Kmeans (Python): Avg time per iteration||Kmeans (Python): Time for 100 iterations||
|0.7 million|13|7s|12 min|13s|26 min|
|1.8 million|11|17s|29 min|33s|53 min|
|10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138782#comment-14138782 ] Meethu Mathew commented on SPARK-3588: -- We are interested in contributing this implementation as a patch to mllib. Could you please review and suggest how to take this forward? Gaussian Mixture Model clustering - Key: SPARK-3588 URL: https://issues.apache.org/jira/browse/SPARK-3588 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Meethu Mathew Attachments: GMMSpark.py
Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
We have come up with an initial distributed implementation of GMM in pyspark where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation considers a diagonal covariance matrix for each component.
We did an initial benchmark study on a 2 node Spark standalone cluster setup where each node config is 8 Cores, 8 GB RAM, the spark version used is 1.0.0. We also evaluated the python version of k-means available in spark on the same datasets. Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
||Instances||Dimensions||GMM: Avg time per iteration||GMM: Time for 100 iterations||Kmeans (Python): Avg time per iteration||Kmeans (Python): Time for 100 iterations||
|0.7 million|13|7s|12 min|13s|26 min|
|1.8 million|11|17s|29 min|33s|53 min|
|10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2175) Null values when using App trait.
[ https://issues.apache.org/jira/browse/SPARK-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138803#comment-14138803 ] Philip Wills commented on SPARK-2175: - Whilst the workaround for this is trivial, discovering it's the problem isn't. It would definitely reduce the amount of swearing involved in running a first app if there was some way this could be detected and warned/errored on. Null values when using App trait. - Key: SPARK-2175 URL: https://issues.apache.org/jira/browse/SPARK-2175 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Linux Reporter: Brandon Amos Priority: Trivial See http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerExceptions-when-using-val-or-broadcast-on-a-standalone-cluster-tc7524.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3589) [Minor]Remove redundant code in deploy module
WangTaoTheTonic created SPARK-3589: -- Summary: [Minor]Remove redundant code in deploy module Key: SPARK-3589 URL: https://issues.apache.org/jira/browse/SPARK-3589 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor export CLASSPATH in spark-class is redundant. We could reuse value isYarnCluster in SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3589) [Minor]Remove redundant code in deploy module
[ https://issues.apache.org/jira/browse/SPARK-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138821#comment-14138821 ] Apache Spark commented on SPARK-3589: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/2445 [Minor]Remove redundant code in deploy module - Key: SPARK-3589 URL: https://issues.apache.org/jira/browse/SPARK-3589 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor export CLASSPATH in spark-class is redundant. We could reuse value isYarnCluster in SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138829#comment-14138829 ] Alexander Ulanov commented on SPARK-3403: - Thank you, your answers are really helpful. Should I submit this issue to OpenBLAS (https://github.com/xianyi/OpenBLAS) or netlib-java (https://github.com/fommil/netlib-java)? I thought the latter has the JNI implementation. Is it ok to submit it as is? NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) - Key: SPARK-3403 URL: https://issues.apache.org/jira/browse/SPARK-3403 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and MinGW64 precompiled dlls. Reporter: Alexander Ulanov Fix For: 1.2.0 Attachments: NativeNN.scala Code:
val model = NaiveBayes.train(train)
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
predictionAndLabels.foreach(println)
Result: the program crashes with "Process finished with exit code -1073741819 (0xC0000005)" after displaying the first prediction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3321) Defining a class within python main script
[ https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138867#comment-14138867 ] Matthew Farrellee commented on SPARK-3321: -- [~guoxu1231] I think so too. OK if I close this? Defining a class within python main script -- Key: SPARK-3321 URL: https://issues.apache.org/jira/browse/SPARK-3321 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Environment: Python version 2.6.6 Spark version 1.0.1 jdk1.6.0_43 Reporter: Shawn Guo Priority: Minor *leftOuterJoin(self, other, numPartitions=None)* Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k. *Background*: leftOuterJoin will produce None elements in the result dataset. I define a new class 'Null' in the main script to replace every Python native None with a new 'Null' object. The 'Null' object overloads the [] operator.
{code:title=Class Null|borderStyle=solid}
class Null(object):
    def __getitem__(self, key): return None
    def __getstate__(self): pass
    def __setstate__(self, dict): pass

def convert_to_null(x):
    return Null() if x is None else x

X = A.leftOuterJoin(B)
X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
{code}
The code seems to run fine in the pyspark console, however spark-submit fails with the error messages below: /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] /tmp/python_test.py {noformat} File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 134, in _write_with_length serialized = self.dumps(obj) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 279, in dumps def dumps(self, obj): return cPickle.dumps(obj, 2) PicklingError: Can't pickle class '__main__.Null': attribute lookup __main__.Null failed org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229) at
[jira] [Commented] (SPARK-3447) Kryo NPE when serializing JListWrapper
[ https://issues.apache.org/jira/browse/SPARK-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138873#comment-14138873 ] mohan gaddam commented on SPARK-3447: - I am also facing the same issue with the Spark Streaming API, the Kryo serializer and Avro objects. I have observed this behavior with output operations like print, collect etc. I also observed that if the Avro object is simple there is no problem, but if the Avro objects are complex with unions/arrays, it gives the exception. Find the stack trace below: ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (com.globallogic.goliath.model.Datum) data (com.globallogic.goliath.model.ResourceMessage) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Below is the Avro message: {version: 01, sequence: 1, resource: sensor-001, controller: 002, controllerTimestamp: 1411038710358, data: {value: [{name: Temperature, value: 30}, {name: Speed, value: 60}, {name: Location, value: [+401213.1, -0750015.1]}, {name: Timestamp, value: 2014-09-09T08:15:25-05:00}]}} The message is decoded successfully in the decoder, but the output operation throws the exception. 
Kryo NPE when serializing JListWrapper -- Key: SPARK-3447 URL: https://issues.apache.org/jira/browse/SPARK-3447 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.2.0 Repro (provided by [~davies]): {code} from pyspark.sql import SQLContext; SQLContext(sc).inferSchema(sc.parallelize([{a: [3]}]))._jschema_rdd.collect() {code} {code} 14/09/05 21:59:47 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (scala.collection.convert.Wrappers$JListWrapper) values (org.apache.spark.sql.catalyst.expressions.GenericRow) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at
[jira] [Commented] (SPARK-3447) Kryo NPE when serializing JListWrapper
[ https://issues.apache.org/jira/browse/SPARK-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138884#comment-14138884 ] Yin Huai commented on SPARK-3447: - [~mohan.gadm] From the trace, seems the NPE was caused by {code} value (com.globallogic.goliath.model.Datum) data (com.globallogic.goliath.model.ResourceMessage) {code} Can you check these two classes (I can not find them online)? Seems the avro message was decoded and then you stored the message as a com.globallogic.goliath.model.ResourceMessage. In a ResourceMessage, an array was repreented by Datum, which caused the problem. Kryo NPE when serializing JListWrapper -- Key: SPARK-3447 URL: https://issues.apache.org/jira/browse/SPARK-3447 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.2.0 Repro (provided by [~davies]): {code} from pyspark.sql import SQLContext; SQLContext(sc).inferSchema(sc.parallelize([{a: [3]}]))._jschema_rdd.collect() {code} {code} 14/09/05 21:59:47 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (scala.collection.convert.Wrappers$JListWrapper) values (org.apache.spark.sql.catalyst.expressions.GenericRow) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1276) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) Caused by: java.lang.NullPointerException at 
scala.collection.convert.Wrappers$MutableBufferWrapper.add(Wrappers.scala:80) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) ... 23 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
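As a debugging aid for reports like the two above, one common way to confirm that the failure is specific to Kryo is to temporarily switch back to Java serialization. A hedged PySpark sketch (the application name is illustrative; this trades performance for a diagnosis step, it is not a fix):
{code}
from pyspark import SparkConf, SparkContext

# Falling back to Java serialization usually avoids Kryo-specific NPEs and
# helps confirm the problem is in Kryo's handling of the wrapped collections.
conf = (SparkConf()
        .setAppName("kryo-npe-repro")
        .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer"))
sc = SparkContext(conf=conf)
{code}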
[jira] [Commented] (SPARK-3447) Kryo NPE when serializing JListWrapper
[ https://issues.apache.org/jira/browse/SPARK-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138899#comment-14138899 ] mohan gaddam commented on SPARK-3447: - sorry for the mistake, those are the project specific classes. and those are the classes generated by Avro itself from an avro avdl file and available in driver jar and also worker nodes. Kryo NPE when serializing JListWrapper -- Key: SPARK-3447 URL: https://issues.apache.org/jira/browse/SPARK-3447 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.2.0 Repro (provided by [~davies]): {code} from pyspark.sql import SQLContext; SQLContext(sc).inferSchema(sc.parallelize([{a: [3]}]))._jschema_rdd.collect() {code} {code} 14/09/05 21:59:47 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (scala.collection.convert.Wrappers$JListWrapper) values (org.apache.spark.sql.catalyst.expressions.GenericRow) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1276) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) Caused by: java.lang.NullPointerException at scala.collection.convert.Wrappers$MutableBufferWrapper.add(Wrappers.scala:80) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) ... 23 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138906#comment-14138906 ] Helena Edelson commented on SPARK-2593: --- [~matei] +1 for spark streaming, that is a primary concern here. And I understand your concern over support for akka upgrades. However I am more than happy to help WRT that and am sure I can find a few others that feel the same so that time isn't taken away from your team's new features/enhancements bandwidth. I will get more data on have 2 actor systems on a node. Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3447) Kryo NPE when serializing JListWrapper
[ https://issues.apache.org/jira/browse/SPARK-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138911#comment-14138911 ] mohan gaddam commented on SPARK-3447: - record KeyValueObject { union{boolean, int, long, float, double, bytes, string} name; //can be of any one of them mentioned in union. union {boolean, int, long, float, double, bytes, string, arrayunion{boolean, int, long, float, double, bytes, string, KeyValueObject}, KeyValueObject} value; } record Datum { union {boolean, int, long, float, double, bytes, string, arrayunion{boolean, int, long, float, double, bytes, string, KeyValueObject}, KeyValueObject} value; } record ResourceMessage { string version; string sequence; string resource; string controller; string controllerTimestamp; union {Datum, arrayDatum} data; } Kryo NPE when serializing JListWrapper -- Key: SPARK-3447 URL: https://issues.apache.org/jira/browse/SPARK-3447 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.2.0 Repro (provided by [~davies]): {code} from pyspark.sql import SQLContext; SQLContext(sc).inferSchema(sc.parallelize([{a: [3]}]))._jschema_rdd.collect() {code} {code} 14/09/05 21:59:47 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (scala.collection.convert.Wrappers$JListWrapper) values (org.apache.spark.sql.catalyst.expressions.GenericRow) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1276) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) Caused by: java.lang.NullPointerException at scala.collection.convert.Wrappers$MutableBufferWrapper.add(Wrappers.scala:80) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) ... 23 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
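For readers unfamiliar with the class named in the serialization trace above: scala.collection.convert.Wrappers$JListWrapper is what the Scala/Java collection converters produce when a java.util.List is viewed as a Scala Buffer, which is one way such a wrapper can end up inside a GenericRow. A small illustrative snippet (not taken from Spark itself):
{code}
import scala.collection.JavaConverters._

val javaList = new java.util.ArrayList[Int](java.util.Arrays.asList(1, 2, 3))
// asScala does not copy the data; it wraps the Java list in Wrappers$JListWrapper,
// the class Kryo's FieldSerializer fails to reconstruct in the trace above.
val wrapped: scala.collection.mutable.Buffer[Int] = javaList.asScala
println(wrapped.getClass.getName) // scala.collection.convert.Wrappers$JListWrapper
{code}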
[jira] [Comment Edited] (SPARK-3447) Kryo NPE when serializing JListWrapper
[ https://issues.apache.org/jira/browse/SPARK-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138873#comment-14138873 ] mohan gaddam edited comment on SPARK-3447 at 9/18/14 1:21 PM: -- I am also facing the same issue with spark streaming API, kryo serializer and Avro Objects. I have observed this behavior with output operations like print, collect etc. also observed that if the avro object is simple, no problem. but if the avro objects are complex with unions/Arrays, gives the exception. find the stack trace below: ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (xyz.Datum) data (xyz.ResourceMessage) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) below is the avro message. {version: 01, sequence: 1, resource: sensor-001, controller: 002, controllerTimestamp: 1411038710358, data: {value: [{name: Temperature, value: 30}, {name: Speed, value: 60}, {name: Location, value: [+401213.1, -0750015.1]}, {name: Timestamp, value: 2014-09-09T08:15:25-05:00}]}} message is been successfully decoded in decoder, but throws exception for output operation. was (Author: mohan.gadm): I am also facing the same issue with spark streaming API, kryo serializer and Avro Objects. I have observed this behavior with output operations like print, collect etc. also observed that if the avro object is simple, no problem. but if the avro objects are complex with unions/Arrays, gives the exception. 
find the stack trace below: ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (com.globallogic.goliath.model.Datum) data (com.globallogic.goliath.model.ResourceMessage) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138970#comment-14138970 ] Apache Spark commented on SPARK-1987: - User 'larryxiao' has created a pull request for this issue: https://github.com/apache/spark/pull/2446 More memory-efficient graph construction Key: SPARK-1987 URL: https://issues.apache.org/jira/browse/SPARK-1987 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave A graph's edges are usually the largest component of the graph. GraphX currently stores edges in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for sorting, so each edge additionally takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This unnecessarily increases GraphX's memory requirements by a factor of 3. To save memory, EdgePartitionBuilder should instead use a custom sort routine that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
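A rough sketch of the proposed approach, assuming the three parallel arrays described above (this is not the actual EdgePartitionBuilder code, just an illustration of sorting by (srcId, dstId) without materializing per-edge objects):
{code}
// Sort edges stored as parallel primitive arrays in place, comparing by
// (srcId, dstId) and swapping entries in all three arrays together, so no
// intermediate Edge objects are allocated during graph construction.
def sortEdgesInPlace(srcIds: Array[Long], dstIds: Array[Long], attrs: Array[Int]): Unit = {
  def lt(i: Int, j: Int): Boolean =
    srcIds(i) < srcIds(j) || (srcIds(i) == srcIds(j) && dstIds(i) < dstIds(j))
  def swap(i: Int, j: Int): Unit = {
    val s = srcIds(i); srcIds(i) = srcIds(j); srcIds(j) = s
    val d = dstIds(i); dstIds(i) = dstIds(j); dstIds(j) = d
    val a = attrs(i); attrs(i) = attrs(j); attrs(j) = a
  }
  // plain quicksort over indices; a production version would choose pivots more carefully
  def qsort(lo: Int, hi: Int): Unit = if (lo < hi) {
    var p = lo
    for (i <- lo until hi) if (lt(i, hi)) { swap(i, p); p += 1 }
    swap(p, hi)
    qsort(lo, p - 1)
    qsort(p + 1, hi)
  }
  qsort(0, srcIds.length - 1)
}
{code}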
[jira] [Resolved] (SPARK-3557) Yarn client config prioritization is backwards
[ https://issues.apache.org/jira/browse/SPARK-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3557. -- Resolution: Duplicate Yarn client config prioritization is backwards -- Key: SPARK-3557 URL: https://issues.apache.org/jira/browse/SPARK-3557 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or In YarnClientSchedulerBackend, we have: {code} if (System.getenv(envVar) != null) { arrayBuf += (optionName, System.getenv(envVar)) } else if (sc.getConf.contains(sysProp)) { arrayBuf += (optionName, sc.getConf.get(sysProp)) } {code} Elsewhere in Spark we try to honor Spark configs over environment variables. This was introduced as a fix for the Yarn app name (SPARK-1631), but this also changed the behavior for other configs. Perhaps we should special case this particular config and correct the prioritization order of the other configs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3557) Yarn client config prioritization is backwards
[ https://issues.apache.org/jira/browse/SPARK-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138993#comment-14138993 ] Thomas Graves commented on SPARK-3557: -- This is a dup of SPARK-2872, although this has better description Yarn client config prioritization is backwards -- Key: SPARK-3557 URL: https://issues.apache.org/jira/browse/SPARK-3557 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or In YarnClientSchedulerBackend, we have: {code} if (System.getenv(envVar) != null) { arrayBuf += (optionName, System.getenv(envVar)) } else if (sc.getConf.contains(sysProp)) { arrayBuf += (optionName, sc.getConf.get(sysProp)) } {code} Elsewhere in Spark we try to honor Spark configs over environment variables. This was introduced as a fix for the Yarn app name (SPARK-1631), but this also changed the behavior for other configs. Perhaps we should special case this particular config and correct the prioritization order of the other configs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2872) Fix conflict between code and doc in YarnClientSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138996#comment-14138996 ] Thomas Graves commented on SPARK-2872: -- Adding the description from SPARK-3557 as it explains it better: In YarnClientSchedulerBackend, we have: if (System.getenv(envVar) != null) { arrayBuf += (optionName, System.getenv(envVar)) } else if (sc.getConf.contains(sysProp)) { arrayBuf += (optionName, sc.getConf.get(sysProp)) } Elsewhere in Spark we try to honor Spark configs over environment variables. This was introduced as a fix for the Yarn app name (SPARK-1631), but this also changed the behavior for other configs. Perhaps we should special case this particular config and correct the prioritization order of the other configs. Fix conflict between code and doc in YarnClientSchedulerBackend --- Key: SPARK-2872 URL: https://issues.apache.org/jira/browse/SPARK-2872 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Zhihui The doc says: system properties override environment variables. https://github.com/apache/spark/blob/master/yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala#L71 But the code conflicts with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
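For concreteness, the ordering argued for in the description would look roughly like the snippet below, i.e. the two branches quoted above swapped so that the Spark config wins and the environment variable is only a fallback (a sketch, not the actual patch):
{code}
if (sc.getConf.contains(sysProp)) {
  arrayBuf += (optionName, sc.getConf.get(sysProp))
} else if (System.getenv(envVar) != null) {
  arrayBuf += (optionName, System.getenv(envVar))
}
{code}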
[jira] [Commented] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark
[ https://issues.apache.org/jira/browse/SPARK-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139010#comment-14139010 ] Apache Spark commented on SPARK-3389: - User 'patmcdonough' has created a pull request for this issue: https://github.com/apache/spark/pull/2447 Add converter class to make reading Parquet files easy with PySpark --- Key: SPARK-3389 URL: https://issues.apache.org/jira/browse/SPARK-3389 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Uri Laserson If a user wants to read Parquet data from PySpark, they currently must use SparkContext.newAPIHadoopFile. If they do not provide a valueConverter, they will get JSON string that must be parsed. Here I add a Converter implementation based on the one in the AvroConverters.scala file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Different Languages
[ https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139043#comment-14139043 ] Matthew Farrellee commented on SPARK-3580: -- what do you think about going the other direction, adding a partitions property to RDDs in python? given that an RDD is a list of partitions, a function for computing each split, a list of deps on other RDDs, etc, it makes sense that you could access a someRDD.partitions, and doing so looks to be the preferred method in scala. so, instead of a someRDD.getNumPartitions(), python code could use a more idiomatic len(someRDD.partitions). Add Consistent Method To Get Number of RDD Partitions Across Different Languages Key: SPARK-3580 URL: https://issues.apache.org/jira/browse/SPARK-3580 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Pat McDonough Labels: starter Programmatically retrieving the number of partitions is not consistent between python and scala. A consistent method should be defined and made public across both languages. RDD.partitions.size is also used quite frequently throughout the internal code, so that might be worth refactoring as well once the new method is available. What we have today is below. In Scala: {code} scala someRDD.partitions.size res0: Int = 30 {code} In Python: {code} In [2]: someRDD.getNumPartitions() Out[2]: 30 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139061#comment-14139061 ] Gino Bustelo commented on SPARK-2892: - Any update on this? Will it get fixed for 1.0.3 or 1.0.1 Socket Receiver does not stop when streaming context is stopped --- Key: SPARK-2892 URL: https://issues.apache.org/jira/browse/SPARK-2892 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Running NetworkWordCount with {quote} ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); Thread.sleep(6) {quote} gives the following error {quote} 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 10047 ms on localhost (1/1) 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at ReceiverTracker.scala:275) finished in 10.056 s 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at ReceiverTracker.scala:275, took 10.179263 s 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been terminated 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not deregistered, Map(0 - ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after time 1407375433000 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139061#comment-14139061 ] Gino Bustelo edited comment on SPARK-2892 at 9/18/14 3:30 PM: -- Any update on this? Will it get fixed for 1.0.3 or 1.1.0 was (Author: lbustelo): Any update on this? Will it get fixed for 1.0.3 or 1.0.1 Socket Receiver does not stop when streaming context is stopped --- Key: SPARK-2892 URL: https://issues.apache.org/jira/browse/SPARK-2892 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Running NetworkWordCount with {quote} ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); Thread.sleep(6) {quote} gives the following error {quote} 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 10047 ms on localhost (1/1) 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at ReceiverTracker.scala:275) finished in 10.056 s 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at ReceiverTracker.scala:275, took 10.179263 s 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been terminated 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not deregistered, Map(0 - ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after time 1407375433000 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3270) Spark API for Application Extensions
[ https://issues.apache.org/jira/browse/SPARK-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3270: - Issue Type: New Feature (was: Improvement) Spark API for Application Extensions Key: SPARK-3270 URL: https://issues.apache.org/jira/browse/SPARK-3270 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Michal Malohlava Any application should be able to enrich spark infrastructure by services which are not available by default. Hence, to support such application extensions (aka extesions/plugins) Spark platform should provide: - an API to register an extension - an API to register a service (meaning provided functionality) - well-defined points in Spark infrastructure which can be enriched/hooked by extension - a way of deploying extension (for example, simply putting the extension on classpath and using Java service interface) - a way to access extension from application Overall proposal is available here: https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing Note: In this context, I do not mean reinventing OSGi (or another plugin platform) but it can serve as a good starting point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line
[ https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-1576. --- Resolution: Not a Problem spark-submit already supports this with existing options. Passing of JAVA_OPTS to YARN on command line Key: SPARK-1576 URL: https://issues.apache.org/jira/browse/SPARK-1576 Project: Spark Issue Type: Improvement Affects Versions: 0.9.0, 0.9.1, 0.9.2 Reporter: Nishkam Ravi Attachments: SPARK-1576.patch JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) or as config vars (after Patrick's recent change). It would be good to allow the user to pass them on command line as well to restrict scope to single application invocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139163#comment-14139163 ] Matei Zaharia commented on SPARK-2593: -- Sure, it would be great to do this for streaming. For the other stuff by the way, my concern isn't so much upgrades, it's locking users in to a specific Spark version. For example, suppose we add an API involving classes from Akka 2.2 today, and later 2.3 or 2.4 changes those classes in a way that isn't compatible. Then to meet our API stability promises to our users, we're going to keep our version of Akka at 2.2, and users simply won't be able to use a new Akka in the same application as Spark. With Akka in particular you sometimes have to upgrade in order to use a new Scala version too, which is another reason we can't lock it into our API. Basically in these cases it's just very risky to expose fast-moving third-party classes if you want to have a stable API. We've been bitten by exposing stuff as mundane as Guava or Google Protobufs (!) because of incompatible changes in minor versions. We care a lot about API stability within Spark and in particular about shielding our users from the fast-moving APIs in distributed systems land. Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson As a developer I want to pass an existing ActorSystem into StreamingContext in load-time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named-dispatchers as well as exposing the new private creation of its own actor system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3547) Maybe we should not simply make return code 1 equal to CLASS_NOT_FOUND
[ https://issues.apache.org/jira/browse/SPARK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3547. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: WangTaoTheTonic Resolved by: https://github.com/apache/spark/pull/2421 Maybe we should not simply make return code 1 equal to CLASS_NOT_FOUND -- Key: SPARK-3547 URL: https://issues.apache.org/jira/browse/SPARK-3547 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Priority: Minor Fix For: 1.2.0 It incurred runtime exception when hadoop version is not A.B.* format, which is detected by Hive. Then the jvm return code is 1, while equals to CLASS_NOT_FOUND_EXIT_STATUS in start-thriftserver.sh script. It proves even runtime exception can lead the jvm existed with code 1. Should we just modify the misleading error message in script ? The error message in script: CLASS_NOT_FOUND_EXIT_STATUS=1 if [[ exit_status -eq CLASS_NOT_FOUND_EXIT_STATUS ]]; then echo echo Failed to load Hive Thrift server main class $CLASS. echo You need to build Spark with -Phive. fi Below is exception stack I met: [omm@dc1-rack1-host2 sbin]$ ./start-thriftserver.sh log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Exception in thread main java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.lang.RuntimeException: Illegal Hadoop Version: V100R001C00 (expected A.B.* format) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:286) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:54) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:332) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:79) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.lang.RuntimeException: Illegal Hadoop Version: V100R001C00 (expected A.B.* format) at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:368) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:278) ... 9 more Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Illegal Hadoop Version: V100R001C00 (expected A.B.* format) at org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:53) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:365) ... 
10 more Caused by: java.lang.RuntimeException: Illegal Hadoop Version: V100R001C00 (expected A.B.* format) at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:141) at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:113) at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:80) at org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:51) ... 13 more Failed to load Hive Thrift server main class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2. You need to build Spark with -Phive. I tested runtime exception and ioexception today, and JVM will also return with exit code 1. Below is my code and error it lead. Code, throw ArrayIndexOutOfBoundsException and FileNotFoundException: object ExitCodeWithRuntimeException { def main(args: Array[String]): Unit = { if(args(0).equals(array)) arrayIndexOutOfBoundsException(args) else if(args(0).equals(file)) fileNotFoundException() } def arrayIndexOutOfBoundsException(args: Array[String]): Unit = {
[jira] [Resolved] (SPARK-3579) Jekyll doc generation is different across environments
[ https://issues.apache.org/jira/browse/SPARK-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3579. Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2443 [https://github.com/apache/spark/pull/2443] Jekyll doc generation is different across environments -- Key: SPARK-3579 URL: https://issues.apache.org/jira/browse/SPARK-3579 Project: Spark Issue Type: Bug Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.2.0 This can result in a lot of false changes when someone alters something with the docs. It is relevant to the both the Spark website (maintained in a separate subversion repo) and the Spark docs. There are at least two issues here. One is that the HTML character escaping can be different in certain cases. Another is that the highlighting output seems a bit different depending on (I think) what version of pygments is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1477) Add the lifecycle interface
[ https://issues.apache.org/jira/browse/SPARK-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1477. Resolution: Won't Fix Unless we are planning to interact with these components in a generic way, there is not a good reason to add this interface (see relevant discussion in JIRA). Add the lifecycle interface --- Key: SPARK-1477 URL: https://issues.apache.org/jira/browse/SPARK-1477 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Guoqiang Li Assignee: Guoqiang Li Now the Spark in the code, there are a lot of interface or class defines the stop and start method,eg:[SchedulerBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala],[HttpServer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala],[ContextCleaner|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala] . we should use a life cycle interface improve the code -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3593) Support Sorting of Binary Type Data
Paul Magid created SPARK-3593: - Summary: Support Sorting of Binary Type Data Key: SPARK-3593 URL: https://issues.apache.org/jira/browse/SPARK-3593 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: Paul Magid If you try sorting on a binary field you currently get an exception. Please add support for binary data type sorting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
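A minimal illustration of the failing pattern described above (the table and column names are hypothetical; the payload column is assumed to be of Spark SQL BinaryType):
{code}
// events(id INT, payload BINARY) registered as a temp table
sqlContext.sql("SELECT id FROM events ORDER BY payload").collect()
// in 1.1.0 the ORDER BY on the binary column throws instead of sorting
{code}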
[jira] [Comment Edited] (SPARK-1477) Add the lifecycle interface
[ https://issues.apache.org/jira/browse/SPARK-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139218#comment-14139218 ] Patrick Wendell edited comment on SPARK-1477 at 9/18/14 5:34 PM: - Unless we are planning to interact with these components in a generic way, there is not a good reason to add this interface (see relevant discussion in github). was (Author: pwendell): Unless we are planning to interact with these components in a generic way, there is not a good reason to add this interface (see relevant discussion in JIRA). Add the lifecycle interface --- Key: SPARK-1477 URL: https://issues.apache.org/jira/browse/SPARK-1477 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Guoqiang Li Assignee: Guoqiang Li Now the Spark in the code, there are a lot of interface or class defines the stop and start method,eg:[SchedulerBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala],[HttpServer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala],[ContextCleaner|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala] . we should use a life cycle interface improve the code -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139233#comment-14139233 ] Li Pu commented on SPARK-3530: -- Nice design doc! I have some experience with the parameter part. It would be great to have constraints on the individual parameters and on the Params level. For example, learning_rate must be greater than 0, and regularization can be one of "l1" or "l2". Parameter checking is something that every learning algorithm does, so some support at parameter definition time would make the code more concise. abstract class ParamConstraint[T] extends Serializable { def isValid(value: T): Boolean def invalidMessage(value: T): String } class IntRangeConstraint(min: Int, max: Int) extends ParamConstraint[Int] { def isValid(value: Int) = value >= min && value <= max def invalidMessage(value: Int) = ... } class Param[T] (..., constraints: List[ParamConstraint[T]] = List()) {...} // constraints is a list because there might be more than one type of constraint applied to this Param At definition time, we can write: val maxIter: Param[Int] = new Param(id, "maxIter", "max number of iterations", 100, List(new IntRangeConstraint(1, 500))) There shouldn't be too many types of constraints, so ml could provide a list of commonly used constraint classes. Keeping the parameter definition and its constraints on the same line also improves readability. The Params trait could use a similar structure to check constraints across multiple parameters, but this is less likely to be needed in real use cases. In the end, validateParams of Params just calls isValid on every member Param as the default implementation. Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3592) applySchema to an RDD of Row
[ https://issues.apache.org/jira/browse/SPARK-3592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139299#comment-14139299 ] Apache Spark commented on SPARK-3592: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2448 applySchema to an RDD of Row Key: SPARK-3592 URL: https://issues.apache.org/jira/browse/SPARK-3592 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Critical Right now, we can not apply a schema to an RDD of Row; this should be a bug: {code} srdd = sqlCtx.jsonRDD(sc.parallelize([{"a": 2}])) sqlCtx.applySchema(srdd.map(lambda x: x), srdd.schema()) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/daviesliu/work/spark/python/pyspark/sql.py", line 1121, in applySchema _verify_type(row, schema) File "/Users/daviesliu/work/spark/python/pyspark/sql.py", line 736, in _verify_type % (dataType, type(obj))) TypeError: StructType(List(StructField(a,IntegerType,true))) can not accept object in type <class 'pyspark.sql.Row'> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3560) In yarn-cluster mode, jars are distributed through multiple mechanisms.
[ https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139339#comment-14139339 ] Apache Spark commented on SPARK-3560: - User 'Victsm' has created a pull request for this issue: https://github.com/apache/spark/pull/2449 In yarn-cluster mode, jars are distributed through multiple mechanisms. --- Key: SPARK-3560 URL: https://issues.apache.org/jira/browse/SPARK-3560 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Priority: Critical In yarn-cluster mode, jars given to spark-submit's --jars argument should be distributed to executors through the distributed cache, not through fetching. Currently, Spark tries to distribute the jars both ways, which can cause executor errors related to trying to overwrite symlinks without write permissions. It looks like this was introduced by SPARK-2260, which sets spark.jars in yarn-cluster mode. Setting spark.jars is necessary for standalone cluster deploy mode, but harmful for yarn cluster deploy mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3566) .gitignore and .rat-excludes should consider Windows cmd file and Emacs' backup files
[ https://issues.apache.org/jira/browse/SPARK-3566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3566. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Kousuke Saruta Target Version/s: 1.2.0 (was: 1.1.1, 1.2.0) https://github.com/apache/spark/pull/2426 .gitignore and .rat-excludes should consider Windows cmd file and Emacs' backup files - Key: SPARK-3566 URL: https://issues.apache.org/jira/browse/SPARK-3566 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Minor Fix For: 1.2.0 Current .gitignore and .rat-excludes does not consider spark-env.cmd. Also, .gitignore doesn't consider emacs' meta files (backup file which starts with and ends with # and lock file which starts with .#) even though considers vi's meta file (*.swp). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3589) [Minor]Remove redundant code in deploy module
[ https://issues.apache.org/jira/browse/SPARK-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3589. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: WangTaoTheTonic [Minor]Remove redundant code in deploy module - Key: SPARK-3589 URL: https://issues.apache.org/jira/browse/SPARK-3589 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Priority: Minor Fix For: 1.1.1, 1.2.0 export CLASSPATH in spark-class is redundant. We could reuse value isYarnCluster in SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139396#comment-14139396 ] Zhan Zhang commented on SPARK-1537: --- Do you have any update on this, or any schedule in your mind yet? Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3560) In yarn-cluster mode, the same jars are distributed through multiple mechanisms.
[ https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-3560: -- Summary: In yarn-cluster mode, the same jars are distributed through multiple mechanisms. (was: In yarn-cluster mode, jars are distributed through multiple mechanisms.) In yarn-cluster mode, the same jars are distributed through multiple mechanisms. Key: SPARK-3560 URL: https://issues.apache.org/jira/browse/SPARK-3560 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Priority: Critical In yarn-cluster mode, jars given to spark-submit's --jars argument should be distributed to executors through the distributed cache, not through fetching. Currently, Spark tries to distribute the jars both ways, which can cause executor errors related to trying to overwrite symlinks without write permissions. It looks like this was introduced by SPARK-2260, which sets spark.jars in yarn-cluster mode. Setting spark.jars is necessary for standalone cluster deploy mode, but harmful for yarn cluster deploy mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3595) Spark should respect configured OutputCommitter when using saveAsHadoopFile
Ian Hummel created SPARK-3595: - Summary: Spark should respect configured OutputCommitter when using saveAsHadoopFile Key: SPARK-3595 URL: https://issues.apache.org/jira/browse/SPARK-3595 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Ian Hummel When calling {{saveAsHadoopFile}}, Spark hardcodes the OutputCommitter to be a {{FileOutputCommitter}}. When using Spark on an EMR cluster to process and write files to/from S3, the default Hadoop configuration uses a {{DirectFileOutputCommitter}} to avoid writing to a temporary directory and doing a copy. Will submit a patch via GitHub shortly. Cheers, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
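As a sketch of what the requested change would enable, the committer would be taken from the JobConf handed to saveAsHadoopFile rather than being hardcoded. The committer class name below is a placeholder (not a real class), and the RDD and output path are made up:
{code}
import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

val jobConf = new JobConf(sc.hadoopConfiguration)
// the old mapred API reads the committer from this property, defaulting to FileOutputCommitter
jobConf.set("mapred.output.committer.class", "com.example.DirectFileOutputCommitter")
// pairRdd: RDD[(String, String)], assumed to exist
pairRdd.saveAsHadoopFile(
  "s3://my-bucket/output",
  classOf[String], classOf[String],
  classOf[TextOutputFormat[String, String]],
  jobConf)
{code}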
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139461#comment-14139461 ] Xiangrui Meng commented on SPARK-3403: -- Sorry, it should be netlib-java, but the real cause may hide in JNI. Sam (@fommil) definitely knows more about this. NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) - Key: SPARK-3403 URL: https://issues.apache.org/jira/browse/SPARK-3403 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and MinGW64 precompiled dlls. Reporter: Alexander Ulanov Fix For: 1.2.0 Attachments: NativeNN.scala Code: val model = NaiveBayes.train(train) val predictionAndLabels = test.map { point = val score = model.predict(point.features) (score, point.label) } predictionAndLabels.foreach(println) Result: program crashes with: Process finished with exit code -1073741819 (0xC005) after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3596) Support changing the yarn client monitor interval
Thomas Graves created SPARK-3596: Summary: Support changing the yarn client monitor interval Key: SPARK-3596 URL: https://issues.apache.org/jira/browse/SPARK-3596 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Right now spark on yarn has a monitor interval that can be configured by spark.yarn.report.interval. This is how often the client checks with the RM to get status on the running application in cluster mode. We should allow users to set this interval as some may not need to check so often. There is another jira filed to make it so the client doesn't have to stay around for cluster mode. With the changes in https://github.com/apache/spark/pull/2350, it further extends that to affect client mode. We may want to add in specific configs for that since the monitorApplication function is now used in multiple different scenarios it actually might make sense for it to take the timeout as a parameter. You could want different timeout for different situations. for instance how quickly we poll on client side and print information (cluster mode) vs how quickly we recognize the application quit and we want to terminate (client mode), I want the latter to happen quickly where as in cluster mode I might not care as much about how often it is printing updated info to the screen. I guess its private so we could leave it as is and change if we add support for that later. my suggestion for name would be something like spark.yarn.client.progress.pollinterval. If we were to add separate ones in the future then they could be something like spark.yarn.app.ready.pollinterval and spark.yarn.app.completion.pollinterval -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
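For reference, the existing knob named above can be set like any other Spark configuration; the per-scenario names suggested in the description (e.g. spark.yarn.client.progress.pollinterval) are only proposals and do not exist yet:
{code}
import org.apache.spark.SparkConf

// poll the ResourceManager for application status every 5 seconds
// (the value is assumed to be in milliseconds, matching the 1.x yarn client)
val conf = new SparkConf().set("spark.yarn.report.interval", "5000")
{code}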
[jira] [Commented] (SPARK-3595) Spark should respect configured OutputCommitter when using saveAsHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-3595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139509#comment-14139509 ] Apache Spark commented on SPARK-3595: - User 'themodernlife' has created a pull request for this issue: https://github.com/apache/spark/pull/2450 Spark should respect configured OutputCommitter when using saveAsHadoopFile --- Key: SPARK-3595 URL: https://issues.apache.org/jira/browse/SPARK-3595 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Ian Hummel When calling {{saveAsHadoopFile}}, Spark hardcodes the OutputCommitter to be a {{FileOutputCommitter}}. When using Spark on an EMR cluster to process and write files to/from S3, the default Hadoop configuration uses a {{DirectFileOutputCommitter}} to avoid writing to a temporary directory and doing a copy. Will submit a patch via GitHub shortly. Cheers, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139549#comment-14139549 ] Apache Spark commented on SPARK-1486: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/2451 Support multi-model training in MLlib - Key: SPARK-1486 URL: https://issues.apache.org/jira/browse/SPARK-1486 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz Priority: Critical It is rare in practice to train just one model with a given set of parameters. Usually, this is done by training multiple models with different sets of parameters and then select the best based on their performance on the validation set. MLlib should provide native support for multi-model training/scoring. It requires decoupling of concepts like problem, formulation, algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar concepts, which we can borrow. There are different approaches for multi-model training: 0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler). 1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans). 2) Make multiple copies of the data (still stored distributively), and use more cores to distribute the work. 3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker. Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., 1GB. This task will be divided into sub-tasks and this JIRA is created to discuss the design and track the overall progress. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3340) Deprecate ADD_JARS and ADD_FILES
[ https://issues.apache.org/jira/browse/SPARK-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3340: - Assignee: (was: Andrew Or) Deprecate ADD_JARS and ADD_FILES Key: SPARK-3340 URL: https://issues.apache.org/jira/browse/SPARK-3340 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or These were introduced before Spark submit even existed. Now that there are many better ways of setting jars and python files through Spark submit, we should deprecate these environment variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139600#comment-14139600 ] Xiangrui Meng commented on SPARK-3530: -- [~eustache] The default implementation of multi-model training will be a for loop. But the API leaves space for future optimizations, like group weight vectors and using level-3 BLAS for better performance. It shouldn't be a meta class, because many optimizations are specific. For example, LASSO can be solved via LARS, which computes a full solution path for all regularization parameters. The level-3 BLAS optimization is another example, which can give 8x speedup (SPARK-1486). [~vrilleup] We can have a set of built-in preconditions, like positivity. Or we could accept lambda function for assertions (T) = Unit, which may be hard for Java users but they should be familiar of creating those in Spark. Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
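A minimal sketch of the lambda-based option mentioned above, assuming a simplified stand-in for the Param class (the names here are illustrative, not the proposed API):
{code}
// A Param that carries an optional validation function of type T => Unit,
// which throws (via require) when handed an invalid value.
class Param[T](val name: String, val doc: String, val defaultValue: T,
               val validate: T => Unit = (_: T) => ()) extends Serializable {
  def checkedValue(value: T): T = { validate(value); value }
}

val maxIter = new Param[Int]("maxIter", "max number of iterations", 100,
  v => require(v >= 1 && v <= 500, s"maxIter must be in [1, 500], got $v"))

maxIter.checkedValue(50)    // ok
// maxIter.checkedValue(0)  // would throw IllegalArgumentException
{code}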
[jira] [Comment Edited] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139600#comment-14139600 ] Xiangrui Meng edited comment on SPARK-3530 at 9/18/14 10:06 PM: [~eustache] The default implementation of multi-model training will be a for loop. But the API leaves space for future optimizations, like grouping weight vectors and using level-3 BLAS for better performance. It shouldn't be a meta class, because many optimizations are specific. For example, LASSO can be solved via LARS, which computes a full solution path for all regularization parameters. The level-3 BLAS optimization is another example, which can give 8x speedup (SPARK-1486). [~vrilleup] We can have a set of built-in preconditions, like positivity. Or we could accept lambda function for assertions (T) = Unit, which may be hard for Java users but they should be familiar of creating those in Spark. was (Author: mengxr): [~eustache] The default implementation of multi-model training will be a for loop. But the API leaves space for future optimizations, like group weight vectors and using level-3 BLAS for better performance. It shouldn't be a meta class, because many optimizations are specific. For example, LASSO can be solved via LARS, which computes a full solution path for all regularization parameters. The level-3 BLAS optimization is another example, which can give 8x speedup (SPARK-1486). [~vrilleup] We can have a set of built-in preconditions, like positivity. Or we could accept lambda function for assertions (T) = Unit, which may be hard for Java users but they should be familiar of creating those in Spark. Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3573: - Description: This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. .Sample code Suppose we have training events stored on HDFS and user/ad features in Hive, we want to assemble features for training and then apply decision tree. The proposed pipeline with dataset looks like the following (need more refinements): {code} sqlContext.jsonFile(/path/to/training/events, 0.01).registerTempTable(event) val training = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;).cache() val indexer = new Indexer() val interactor = new Interactor() val fvAssembler = new FeatureVectorAssembler() val treeClassifer = new DecisionTreeClassifer() val paramMap = new ParamMap() .put(indexer.features, Map(userCountryIndex - userCountry)) .put(indexer.sortByFrequency, true) .put(iteractor.features, Map(genderMatch - Array(userGender, targetGender))) .put(fvAssembler.features, Map(features - Array(genderMatch, userCountryIndex, userFeatures))) .put(fvAssembler.dense, true) .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes features and label columns. val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier) val model = pipeline.fit(raw, paramMap) sqlContext.jsonFile(/path/to/events, 0.01).registerTempTable(event) val test = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;) val prediction = model.transform(test).select('eventId, 'prediction) {code} was: This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. #Sample code Suppose we have training events stored on HDFS and user/ad features in Hive, we want to assemble features for training and then apply decision tree. The proposed pipeline with dataset looks like the following (need more refinements): {code} sqlContext.jsonFile(/path/to/training/events, 0.01).registerTempTable(event) val training = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;).cache() val indexer = new Indexer() val interactor = new Interactor() val fvAssembler = new FeatureVectorAssembler() val treeClassifer = new DecisionTreeClassifer() val paramMap = new ParamMap() .put(indexer.features, Map(userCountryIndex - userCountry)) .put(indexer.sortByFrequency, true) .put(iteractor.features, Map(genderMatch - Array(userGender, targetGender))) .put(fvAssembler.features, Map(features - Array(genderMatch, userCountryIndex, userFeatures))) .put(fvAssembler.dense, true) .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes features and label columns. 
val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier) val model = pipeline.fit(raw, paramMap) sqlContext.jsonFile(/path/to/events, 0.01).registerTempTable(event) val test = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;) val prediction = model.transform(test).select('eventId, 'prediction) {code} Dataset --- Key: SPARK-3573 URL: https://issues.apache.org/jira/browse/SPARK-3573 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. .Sample code Suppose we have training events stored on HDFS and user/ad features in Hive, we want to assemble features for training and then apply decision tree. The proposed pipeline with dataset looks like the following (need more refinements): {code} sqlContext.jsonFile(/path/to/training/events,
[jira] [Closed] (SPARK-3560) In yarn-cluster mode, the same jars are distributed through multiple mechanisms.
[ https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3560. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Fixed by https://github.com/apache/spark/pull/2449 In yarn-cluster mode, the same jars are distributed through multiple mechanisms. Key: SPARK-3560 URL: https://issues.apache.org/jira/browse/SPARK-3560 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Priority: Critical Fix For: 1.1.1, 1.2.0 In yarn-cluster mode, jars given to spark-submit's --jars argument should be distributed to executors through the distributed cache, not through fetching. Currently, Spark tries to distribute the jars both ways, which can cause executor errors related to trying to overwrite symlinks without write permissions. It looks like this was introduced by SPARK-2260, which sets spark.jars in yarn-cluster mode. Setting spark.jars is necessary for standalone cluster deploy mode, but harmful for yarn cluster deploy mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-3560) In yarn-cluster mode, the same jars are distributed through multiple mechanisms.
[ https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-3560: -- Assignee: Min Shen Reopening just to reassign. Closing right afterwards, please disregard. In yarn-cluster mode, the same jars are distributed through multiple mechanisms. Key: SPARK-3560 URL: https://issues.apache.org/jira/browse/SPARK-3560 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Assignee: Min Shen Priority: Critical Fix For: 1.1.1, 1.2.0 In yarn-cluster mode, jars given to spark-submit's --jars argument should be distributed to executors through the distributed cache, not through fetching. Currently, Spark tries to distribute the jars both ways, which can cause executor errors related to trying to overwrite symlinks without write permissions. It looks like this was introduced by SPARK-2260, which sets spark.jars in yarn-cluster mode. Setting spark.jars is necessary for standalone cluster deploy mode, but harmful for yarn cluster deploy mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3560) In yarn-cluster mode, the same jars are distributed through multiple mechanisms.
[ https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3560. Resolution: Fixed In yarn-cluster mode, the same jars are distributed through multiple mechanisms. Key: SPARK-3560 URL: https://issues.apache.org/jira/browse/SPARK-3560 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Assignee: Min Shen Priority: Critical Fix For: 1.1.1, 1.2.0 In yarn-cluster mode, jars given to spark-submit's --jars argument should be distributed to executors through the distributed cache, not through fetching. Currently, Spark tries to distribute the jars both ways, which can cause executor errors related to trying to overwrite symlinks without write permissions. It looks like this was introduced by SPARK-2260, which sets spark.jars in yarn-cluster mode. Setting spark.jars is necessary for standalone cluster deploy mode, but harmful for yarn cluster deploy mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3587) Spark SQL can't support lead() over() window function
[ https://issues.apache.org/jira/browse/SPARK-3587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3587: --- Labels: (was: features) Spark SQL can't support lead() over() window function - Key: SPARK-3587 URL: https://issues.apache.org/jira/browse/SPARK-3587 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: operating system is SUSE 11.3, three nodes, every node has 48GB MEM and 16 CPUs Reporter: caoli Original Estimate: 504h Remaining Estimate: 504h When running the following SQL: select c.plateno, c.platetype,c.bay_id as start_bay_id,c.pastkkdate as start_time,lead(c.bay_id) over(partition by c.plateno order by c.pastkkdate) as end_bay_id, lead(c.pastkkdate) over(partition by c.plateno order by c.pastkkdate) as end_time from bay_car_test1 c in Hive, it is OK, but when run in Spark SQL with HiveContext it fails with: java.lang.RuntimeException: Unsupported language features in query: select c.plateno, c.platetype,c.bay_id as start_bay_id,c.pastkkdate as start_time,lead(c.bay_id) over(partition by c.plateno order by c.pastkkdate) as end_bay_id, lead(c.pastkkdate) over(partition by c.plateno order by c.pastkkdate) as end_time from bay_car_test1 c TOK_QUERY Is it that Spark SQL can't support the lead() and over() functions? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3574) Shuffle finish time always reported as -1
[ https://issues.apache.org/jira/browse/SPARK-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3574: --- Component/s: Spark Core Shuffle finish time always reported as -1 - Key: SPARK-3574 URL: https://issues.apache.org/jira/browse/SPARK-3574 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kay Ousterhout shuffleFinishTime is always reported as -1. I think the way we fix this should be to set the shuffleFinishTime in each ShuffleWriteMetrics as the shuffles finish, but when aggregating the metrics, only report shuffleFinishTime as something other than -1 when *all* of the shuffles have completed. [~sandyr], it looks like this was introduced in your recent patch to incrementally report metrics. Any chance you can fix this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
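To make the proposed fix concrete, a minimal sketch of the aggregation rule follows: only report a real shuffle finish time once every shuffle has one, otherwise keep the -1 sentinel. The helper name is hypothetical and this is not Spark's metrics code.

{code:title=Sketch: aggregate shuffleFinishTime only when all shuffles are done|borderStyle=solid}
// Hypothetical helper: each element is one shuffle's finish time, -1 meaning "not finished yet".
def aggregateShuffleFinishTime(finishTimes: Seq[Long]): Long =
  if (finishTimes.nonEmpty && finishTimes.forall(_ != -1L)) finishTimes.max else -1L

// aggregateShuffleFinishTime(Seq(100L, 250L, 180L)) == 250L
// aggregateShuffleFinishTime(Seq(100L, -1L, 180L))  == -1L
{code}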
[jira] [Updated] (SPARK-2672) Support compression in wholeFile()
[ https://issues.apache.org/jira/browse/SPARK-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2672: --- Summary: Support compression in wholeFile() (was: support compressed file in wholeFile()) Support compression in wholeFile() -- Key: SPARK-2672 URL: https://issues.apache.org/jira/browse/SPARK-2672 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Original Estimate: 72h Remaining Estimate: 72h wholeFile() cannot read compressed files; it should be able to, just like textFile(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2761) Merge similar code paths in ExternalSorter and EAOM
[ https://issues.apache.org/jira/browse/SPARK-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2761: --- Component/s: Spark Core Merge similar code paths in ExternalSorter and EAOM --- Key: SPARK-2761 URL: https://issues.apache.org/jira/browse/SPARK-2761 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Right now there is quite a lot of duplicate code. It would be good to merge these somehow if we want to make changes to them in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3535: - Target Version/s: 1.1.1, 1.2.0 (was: 1.1.1) Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does not account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
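As a rough illustration of the 15-25% overhead rule suggested above, the sketch below adds a fractional overhead (with a floor) to the requested heap before asking Mesos for memory. The constants are assumptions modeled loosely on the linked Hadoop-on-Mesos ResourcePolicy, not values from Spark.

{code:title=Sketch: adding JVM overhead to the executor memory request|borderStyle=solid}
// Assumed constants, for illustration only.
val overheadFraction = 0.15   // lower end of the suggested 15-25% range
val overheadMinimumMb = 384   // floor so small executors still get some headroom

def totalExecutorMemoryMb(requestedHeapMb: Int): Int =
  requestedHeapMb + math.max((requestedHeapMb * overheadFraction).toInt, overheadMinimumMb)

// totalExecutorMemoryMb(4096) == 4710, i.e. request ~4.6 GB from Mesos for a 4 GB heap
{code}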
[jira] [Comment Edited] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139763#comment-14139763 ] Brenden Matthews edited comment on SPARK-3535 at 9/18/14 11:58 PM: --- After some even further digging, I noticed the following error in the Mesos slave log: E0918 23:13:48.726176 130743 slave.cpp:2204] Failed to update resources for container ae051668-6cb8-4252-8b6e-cfbac38e7e5c of executor 20140813-050807-3852091146-5050-1861-231 running task 3 on status update for terminal task, destroying container:Collect failed: No cpus resource given I'll update my patch accordingly. was (Author: brenden): After some even futher digging, I noticed the following error in the Mesos slave log: E0918 23:13:48.726176 130743 slave.cpp:2204] Failed to update resources for container ae051668-6cb8-4252-8b6e-cfbac38e7e5c of executor 20140813-050807-3852091146-5050-1861-231 running task 3 on status update for terminal task, destroying container:Collect failed: No cpus resource given I'll update my patch accordingly. Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139763#comment-14139763 ] Brenden Matthews commented on SPARK-3535: - After some even futher digging, I noticed the following error in the Mesos slave log: E0918 23:13:48.726176 130743 slave.cpp:2204] Failed to update resources for container ae051668-6cb8-4252-8b6e-cfbac38e7e5c of executor 20140813-050807-3852091146-5050-1861-231 running task 3 on status update for terminal task, destroying container:Collect failed: No cpus resource given I'll update my patch accordingly. Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3562) Periodic cleanup event logs
[ https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139786#comment-14139786 ] Matthew Farrellee commented on SPARK-3562: -- is logrotate an option for you? Periodic cleanup event logs --- Key: SPARK-3562 URL: https://issues.apache.org/jira/browse/SPARK-3562 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: xukun If we run Spark applications frequently, they will write many event logs into spark.eventLog.dir. After a long time there will be many event logs in spark.eventLog.dir that we no longer care about. Periodic cleanups would ensure that logs older than a configured duration are forgotten, so there is no need to clean up logs by hand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
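Beyond logrotate, a minimal sketch of what an age-based cleanup inside Spark could look like is shown below. The directory layout and retention period are assumptions for illustration; this is not the implementation being tracked here.

{code:title=Sketch: delete event logs older than a retention period|borderStyle=solid}
import java.io.File

// Remove files in eventLogDir whose last modification is older than maxAgeMs.
def cleanOldEventLogs(eventLogDir: String, maxAgeMs: Long): Unit = {
  val cutoff = System.currentTimeMillis() - maxAgeMs
  Option(new File(eventLogDir).listFiles()).getOrElse(Array.empty[File])
    .filter(f => f.isFile && f.lastModified() < cutoff)
    .foreach(_.delete())
}

// Keep one week of logs, assuming a local event log directory:
// cleanOldEventLogs("/tmp/spark-events", 7L * 24 * 60 * 60 * 1000)
{code}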
[jira] [Resolved] (SPARK-3554) handle large dataset in closure of PySpark
[ https://issues.apache.org/jira/browse/SPARK-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3554. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2417 [https://github.com/apache/spark/pull/2417] handle large dataset in closure of PySpark -- Key: SPARK-3554 URL: https://issues.apache.org/jira/browse/SPARK-3554 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 Sometimes a large dataset is used in a closure and the user forgets to broadcast it, so the serialized command becomes huge. Py4J cannot handle large objects efficiently, so we should compress the serialized command and use a broadcast variable for it if it is huge. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
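The fix above compresses and broadcasts the serialized command automatically; the underlying advice still holds that large datasets are better broadcast explicitly than captured in a closure. A small sketch of that pattern follows (written in Scala for consistency with the other examples here, assuming an existing SparkContext named sc; the lookup data is made up).

{code:title=Sketch: broadcast a large dataset instead of capturing it in a closure|borderStyle=solid}
// Capturing bigLookup directly in the map closure would ship it with every task;
// broadcasting sends it to each executor once.
val bigLookup: Map[Int, String] = (1 to 1000000).map(i => i -> ("value" + i)).toMap
val bcLookup = sc.broadcast(bigLookup)

val ids = sc.parallelize(1 to 100)
val resolved = ids.map(i => bcLookup.value.getOrElse(i, "missing"))
{code}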
[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead
[ https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139845#comment-14139845 ] Vinod Kone commented on SPARK-3535: --- This can happen if the spark executor doesn't use any cpus (or memory) and there are no tasks running on it. Note that in the next release of Mesos, such an executor is not allowed to launch. https://issues.apache.org/jira/browse/MESOS-1807 Spark on Mesos not correctly setting heap overhead -- Key: SPARK-3535 URL: https://issues.apache.org/jira/browse/SPARK-3535 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark on Mesos does account for any memory overhead. The result is that tasks are OOM killed nearly 95% of the time. Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the executor memory for JVM overhead. For example, see: https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3597) MesosSchedulerBackend does not implement `killTask`
Brenden Matthews created SPARK-3597: --- Summary: MesosSchedulerBackend does not implement `killTask` Key: SPARK-3597 URL: https://issues.apache.org/jira/browse/SPARK-3597 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Fix For: 1.1.1 The MesosSchedulerBackend class does not implement `killTask`, and therefore results in exceptions like this: 14/09/19 01:52:53 ERROR TaskSetManager: Task 238 in stage 1.0 failed 4 times; aborting job 14/09/19 01:52:53 INFO TaskSchedulerImpl: Cancelling stage 1 14/09/19 01:52:53 INFO DAGScheduler: Could not cancel tasks for stage 1 java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:194) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:192) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:192) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:185) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1211) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1197) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1197) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
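For context on what a fix might involve, the sketch below shows one way kill requests could be forwarded to Mesos rather than falling through to SchedulerBackend's default, which throws the UnsupportedOperationException seen above. This is not the actual patch; it assumes the backend already holds the SchedulerDriver it used to launch tasks and that Spark task IDs map directly onto the Mesos task IDs it created.

{code:title=Sketch: forwarding a task kill to the Mesos driver|borderStyle=solid}
import org.apache.mesos.Protos.TaskID
import org.apache.mesos.SchedulerDriver

// Hypothetical helper; a killTask override in MesosSchedulerBackend could delegate to it.
def killMesosTask(driver: SchedulerDriver, sparkTaskId: Long): Unit = {
  driver.killTask(TaskID.newBuilder().setValue(sparkTaskId.toString).build())
}
{code}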
[jira] [Commented] (SPARK-3581) RDD API(distinct/subtract) does not work for RDD of Dictionaries
[ https://issues.apache.org/jira/browse/SPARK-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139902#comment-14139902 ] Shawn Guo commented on SPARK-3581: -- Yes, please. Thanks for clarification. RDD API(distinct/subtract) does not work for RDD of Dictionaries Key: SPARK-3581 URL: https://issues.apache.org/jira/browse/SPARK-3581 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.2, 1.1.0 Environment: Spark 1.0 1.1 JDK 1.6 Reporter: Shawn Guo Priority: Minor Construct a RDD of dictionaries(dictRDD), try to use the RDD API, RDD.distinct() or RDD.subtract(). {code:title=PySpark RDD API Test|borderStyle=solid} dictRDD = sc.parallelize(({'MOVIE_ID': 1, 'MOVIE_NAME': 'Lord of the Rings','MOVIE_DIRECTOR': 'Peter Jackson'},{'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'},{'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'})) dictRDD.distinct().collect() dictRDD.subtract(dictRDD).collect() {code} An error occurred while calling, TypeError: unhashable type: 'dict' I'm not sure if it is a bug or expected results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3598) cast to timestamp should be the same as hive
Adrian Wang created SPARK-3598: -- Summary: cast to timestamp should be the same as hive Key: SPARK-3598 URL: https://issues.apache.org/jira/browse/SPARK-3598 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang select cast(1000 as timestamp) from src limit 1; should return 1970-01-01 00:00:01. Also, the current implementation has a bug when the time is before 1970-01-01 00:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
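To make the expected values concrete, the snippet below mirrors the example in the report using plain java.sql.Timestamp (millisecond precision); it is not Spark SQL's cast implementation.

{code:title=Illustration of the expected timestamp values|borderStyle=solid}
import java.sql.Timestamp

// 1000 interpreted as milliseconds since the epoch matches the value the report expects:
val t1 = new Timestamp(1000L)    // 1970-01-01 00:00:01.0 in UTC

// A correct cast must also handle instants before the epoch, i.e. negative values:
val t2 = new Timestamp(-1000L)   // 1969-12-31 23:59:59.0 in UTC
{code}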
[jira] [Commented] (SPARK-3321) Defining a class within python main script
[ https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139903#comment-14139903 ] Shawn Guo commented on SPARK-3321: -- Yes please, thanks for clarification. Defining a class within python main script -- Key: SPARK-3321 URL: https://issues.apache.org/jira/browse/SPARK-3321 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Environment: Python version 2.6.6 Spark version version 1.0.1 jdk1.6.0_43 Reporter: Shawn Guo Priority: Minor *leftOuterJoin(self, other, numPartitions=None)* Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k. *Background*: leftOuterJoin will produce None element in result dataset. I define a new class 'Null' in the main script to replace all python native None to new 'Null' object. 'Null' object overload the [] operator. {code:title=Class Null|borderStyle=solid} class Null(object): def __getitem__(self,key): return None; def __getstate__(self): pass; def __setstate__(self, dict): pass; def convert_to_null(x): return Null() if x is None else x X = A.leftOuterJoin(B) X.mapValues(lambda line: (line[0],convert_to_null(line[1])) {code} The code seems running good in pyspark console, however spark-submit failed with below error messages: /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] /tmp/python_test.py {noformat} File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 134, in _write_with_length serialized = self.dumps(obj) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 279, in dumps def dumps(self, obj): return cPickle.dumps(obj, 2) PicklingError: Can't pickle class '__main__.Null': attribute lookup __main__.Null failed org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at
[jira] [Created] (SPARK-3599) Avoid loading and printing properties file content frequently
WangTaoTheTonic created SPARK-3599: -- Summary: Avoid loading and printing properties file content frequently Key: SPARK-3599 URL: https://issues.apache.org/jira/browse/SPARK-3599 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor When I use -v | --verbose in spark-submit, it prints lots of messages about the contents of the properties file. After checking the code in SparkSubmit.scala and SparkSubmitArguments.scala, I found that the getDefaultSparkProperties method is invoked in three places, and every time we invoke it we load the properties from the properties file again, and print them again if the -v option is used. We should probably use a value instead of a method for the default properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
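A minimal sketch of the suggested change follows: hold the parsed properties in a lazy val so the file is read (and printed under -v) only once, instead of on every call to a def. The field names and log message are illustrative, not the actual SparkSubmitArguments code.

{code:title=Sketch: cache default properties in a lazy val|borderStyle=solid}
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._

// Illustrative stand-ins for fields in SparkSubmitArguments.
val propertiesFile: String = "conf/spark-defaults.conf"
val verbose: Boolean = true

// Loaded (and optionally printed) once, on first access.
lazy val defaultSparkProperties: Map[String, String] = {
  val props = new Properties()
  val in = new FileInputStream(propertiesFile)
  try props.load(in) finally in.close()
  val loaded = props.asScala.toMap
  if (verbose) loaded.foreach { case (k, v) => println(s"Adding default property: $k=$v") }
  loaded
}
{code}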
[jira] [Closed] (SPARK-3581) RDD API(distinct/subtract) does not work for RDD of Dictionaries
[ https://issues.apache.org/jira/browse/SPARK-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-3581. Resolution: Not a Problem RDD API(distinct/subtract) does not work for RDD of Dictionaries Key: SPARK-3581 URL: https://issues.apache.org/jira/browse/SPARK-3581 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.2, 1.1.0 Environment: Spark 1.0 1.1 JDK 1.6 Reporter: Shawn Guo Priority: Minor Construct a RDD of dictionaries(dictRDD), try to use the RDD API, RDD.distinct() or RDD.subtract(). {code:title=PySpark RDD API Test|borderStyle=solid} dictRDD = sc.parallelize(({'MOVIE_ID': 1, 'MOVIE_NAME': 'Lord of the Rings','MOVIE_DIRECTOR': 'Peter Jackson'},{'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'},{'MOVIE_ID': 2, 'MOVIE_NAME': 'King King', 'MOVIE_DIRECTOR': 'Peter Jackson'})) dictRDD.distinct().collect() dictRDD.subtract(dictRDD).collect() {code} An error occurred while calling, TypeError: unhashable type: 'dict' I'm not sure if it is a bug or expected results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3321) Defining a class within python main script
[ https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-3321. Resolution: Not a Problem Defining a class within python main script -- Key: SPARK-3321 URL: https://issues.apache.org/jira/browse/SPARK-3321 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Environment: Python version 2.6.6 Spark version version 1.0.1 jdk1.6.0_43 Reporter: Shawn Guo Priority: Minor *leftOuterJoin(self, other, numPartitions=None)* Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k. *Background*: leftOuterJoin will produce None element in result dataset. I define a new class 'Null' in the main script to replace all python native None to new 'Null' object. 'Null' object overload the [] operator. {code:title=Class Null|borderStyle=solid} class Null(object): def __getitem__(self,key): return None; def __getstate__(self): pass; def __setstate__(self, dict): pass; def convert_to_null(x): return Null() if x is None else x X = A.leftOuterJoin(B) X.mapValues(lambda line: (line[0],convert_to_null(line[1])) {code} The code seems running good in pyspark console, however spark-submit failed with below error messages: /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] /tmp/python_test.py {noformat} File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 134, in _write_with_length serialized = self.dumps(obj) File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, line 279, in dumps def dumps(self, obj): return cPickle.dumps(obj, 2) PicklingError: Can't pickle class '__main__.Null': attribute lookup __main__.Null failed org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456)
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139958#comment-14139958 ] Sandy Ryza commented on SPARK-3573: --- Currently SchemaRDD lives inside SQL. Would we move it to core if we plan to use it in components that aren't related to SQL? Dataset --- Key: SPARK-3573 URL: https://issues.apache.org/jira/browse/SPARK-3573 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. .Sample code Suppose we have training events stored on HDFS and user/ad features in Hive, we want to assemble features for training and then apply decision tree. The proposed pipeline with dataset looks like the following (need more refinements): {code} sqlContext.jsonFile(/path/to/training/events, 0.01).registerTempTable(event) val training = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;).cache() val indexer = new Indexer() val interactor = new Interactor() val fvAssembler = new FeatureVectorAssembler() val treeClassifer = new DecisionTreeClassifer() val paramMap = new ParamMap() .put(indexer.features, Map(userCountryIndex - userCountry)) .put(indexer.sortByFrequency, true) .put(iteractor.features, Map(genderMatch - Array(userGender, targetGender))) .put(fvAssembler.features, Map(features - Array(genderMatch, userCountryIndex, userFeatures))) .put(fvAssembler.dense, true) .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes features and label columns. val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier) val model = pipeline.fit(raw, paramMap) sqlContext.jsonFile(/path/to/events, 0.01).registerTempTable(event) val test = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;) val prediction = model.transform(test).select('eventId, 'prediction) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139965#comment-14139965 ] Erik Erlandson commented on SPARK-3250: --- PR: https://github.com/apache/spark/pull/2455 [SPARK-3250] Implement Gap Sampling optimization for random sampling More Efficient Sampling --- Key: SPARK-3250 URL: https://issues.apache.org/jira/browse/SPARK-3250 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: RJ Nowling Sampling, as currently implemented in Spark, is an O\(n\) operation. A number of stochastic algorithms achieve speed ups by exploiting O\(k\) sampling, where k is the number of data points to sample. Examples of such algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with mini batching. More efficient sampling may be achievable by packing partitions with an ArrayBuffer or other data structure supporting random access. Since many of these stochastic algorithms perform repeated rounds of sampling, it may be feasible to perform a transformation to change the backing data structure followed by multiple rounds of sampling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
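For readers new to the gap-sampling idea behind the linked PR, here is a rough sketch of the technique on a plain Scala iterator: instead of testing every element against the sampling rate, draw geometrically distributed gaps and jump straight to the next sampled element, so the work is roughly proportional to the number of samples. This is illustrative only and not the PR's implementation, which handles more cases (sampling with replacement, very high rates, and so on).

{code:title=Sketch: gap sampling over an iterator|borderStyle=solid}
import scala.util.Random

// Bernoulli sampling at rate f (0 < f < 1) using geometric gaps between samples.
def gapSample[T](data: Iterator[T], f: Double, rng: Random = new Random()): Iterator[T] = {
  require(f > 0.0 && f < 1.0, "sampling rate must be in (0, 1)")
  new Iterator[T] {
    private def advance(): Option[T] = {
      val u = 1.0 - rng.nextDouble()                                 // in (0, 1], avoids log(0)
      var skip = math.floor(math.log(u) / math.log(1.0 - f)).toInt   // geometric gap
      while (skip > 0 && data.hasNext) { data.next(); skip -= 1 }
      if (data.hasNext) Some(data.next()) else None
    }
    private var pending: Option[T] = advance()
    def hasNext: Boolean = pending.isDefined
    def next(): T = { val out = pending.get; pending = advance(); out }
  }
}

// Expect roughly 10,000 of the 1,000,000 elements:
// gapSample((1 to 1000000).iterator, 0.01).size
{code}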
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140017#comment-14140017 ] Patrick Wendell commented on SPARK-3573: [~sandyr] This is a good question I'm not sure how easy it would be to decouple SchemaRDD from the other things inside of sql/core. This definitely doesn't need to depend on catalyst or on hive, but it might need to depend on the entire sql core. I've been thinking about whether this is bad to have a growing number cross-dependencies in the projects. Do you see specific drawbacks here if that becomes the case? Dataset --- Key: SPARK-3573 URL: https://issues.apache.org/jira/browse/SPARK-3573 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. .Sample code Suppose we have training events stored on HDFS and user/ad features in Hive, we want to assemble features for training and then apply decision tree. The proposed pipeline with dataset looks like the following (need more refinements): {code} sqlContext.jsonFile(/path/to/training/events, 0.01).registerTempTable(event) val training = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;).cache() val indexer = new Indexer() val interactor = new Interactor() val fvAssembler = new FeatureVectorAssembler() val treeClassifer = new DecisionTreeClassifer() val paramMap = new ParamMap() .put(indexer.features, Map(userCountryIndex - userCountry)) .put(indexer.sortByFrequency, true) .put(iteractor.features, Map(genderMatch - Array(userGender, targetGender))) .put(fvAssembler.features, Map(features - Array(genderMatch, userCountryIndex, userFeatures))) .put(fvAssembler.dense, true) .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes features and label columns. val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier) val model = pipeline.fit(raw, paramMap) sqlContext.jsonFile(/path/to/events, 0.01).registerTempTable(event) val test = sqlContext.sql( SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures, ad.targetGender AS targetGender FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;) val prediction = model.transform(test).select('eventId, 'prediction) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2058) SPARK_CONF_DIR should override all present configs
[ https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140023#comment-14140023 ] David Rosenstrauch commented on SPARK-2058: --- I'm wondering the same: has this fix made it into a release? SPARK_CONF_DIR should override all present configs -- Key: SPARK-2058 URL: https://issues.apache.org/jira/browse/SPARK-2058 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0, 1.0.1, 1.1.0 Reporter: Eugen Cepoi Priority: Critical When the user defines SPARK_CONF_DIR I think spark should use all the configs available there not only spark-env. This involves changing SparkSubmitArguments to first read from SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the computed classpath for configs such as log4j, metrics, etc. I have already prepared a PR for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3270) Spark API for Application Extensions
[ https://issues.apache.org/jira/browse/SPARK-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140043#comment-14140043 ] Patrick Wendell commented on SPARK-3270: Hey There, For the particular use case here - long-lived components that want to co-exist with executors and share the executor lifecycle - applications can enact this using static objects. I.e. you lazily initialize some service when it is accessed from within tasks. Then, when the executor terminates, the object goes away. Are there specific things preventing that approach in this case? The reason I ask is that adding a generic service discovery mechanism to Spark, while useful, is a fairly large new interface, and I'd guess it's something where we'd want to look at a few specific applications to find the common set of needs, to make sure we are adding the right APIs. Spark API for Application Extensions Key: SPARK-3270 URL: https://issues.apache.org/jira/browse/SPARK-3270 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Michal Malohlava Any application should be able to enrich the Spark infrastructure with services which are not available by default. Hence, to support such application extensions (aka extensions/plugins) the Spark platform should provide: - an API to register an extension - an API to register a service (meaning provided functionality) - well-defined points in the Spark infrastructure which can be enriched/hooked by an extension - a way of deploying an extension (for example, simply putting the extension on the classpath and using the Java service interface) - a way to access an extension from the application Overall proposal is available here: https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing Note: In this context, I do not mean reinventing OSGi (or another plugin platform) but it can serve as a good starting point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
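A minimal sketch of the static-object pattern described in the comment above: a JVM-level singleton that is lazily initialized the first time a task touches it on an executor and goes away when the executor JVM exits. The object name is made up, and an existing SparkContext named sc is assumed.

{code:title=Sketch: lazily initialized executor-local service|borderStyle=solid}
import scala.collection.concurrent.TrieMap

object ExecutorLocalService {
  // Created at most once per executor JVM, on first use from a task.
  lazy val cache: TrieMap[Int, String] = {
    println("initializing executor-local service")   // runs once per executor
    TrieMap.empty[Int, String]
  }
}

val enriched = sc.parallelize(1 to 10).map { i =>
  ExecutorLocalService.cache.getOrElseUpdate(i, "value-" + i)
}
{code}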