[jira] [Created] (SPARK-10542) The PySpark 1.5 closure serializer can't serialize a namedtuple instance.
Davies Liu created SPARK-10542: -- Summary: The PySpark 1.5 closure serializer can't serialize a namedtuple instance. Key: SPARK-10542 URL: https://issues.apache.org/jira/browse/SPARK-10542 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical Code to Reproduce Bug: {code} from collections import namedtuple PowerPlantRow=namedtuple("PowerPlantRow", ["AT", "V", "AP", "RH", "PE"]) rdd=sc.parallelize([1]).map(lambda x: PowerPlantRow(1.0, 2.0, 3.0, 4.0, 5.0)) rdd.count() {code} Error message on Spark 1.5: {code} AttributeError: 'builtin_function_or_method' object has no attribute '__code__' --- AttributeErrorTraceback (most recent call last) in () 2 PowerPlantRow=namedtuple("PowerPlantRow", ["AT", "V", "AP", "RH", "PE"]) 3 rdd=sc.parallelize([1]).map(lambda x: PowerPlantRow(1.0, 2.0, 3.0, 4.0, 5.0)) > 4 rdd.count() /home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in count(self) 1004 3 1005 """ -> 1006 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 1007 1008 def stats(self): /home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in sum(self) 995 6.0 996 """ --> 997 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add) 998 999 def count(self): /home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in fold(self, zeroValue, op) 869 # zeroValue provided to each partition is unique from the one provided 870 # to the final reduce call --> 871 vals = self.mapPartitions(func).collect() 872 return reduce(op, vals, zeroValue) 873 /home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in collect(self) 771 """ 772 with SCCallSiteSync(self.context) as css: --> 773 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) 774 return list(_load_from_socket(port, self._jrdd_deserializer)) 775 /home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in _jrdd(self) 2383 command = (self.func, profiler, self._prev_jrdd_deserializer, 2384self._jrdd_deserializer) -> 2385 pickled_cmd, bvars, env, includes = _prepare_for_python_RDD(self.ctx, command, self) 2386 python_rdd = self.ctx._jvm.PythonRDD(self._prev_jrdd.rdd(), 2387 bytearray(pickled_cmd), /home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in _prepare_for_python_RDD(sc, command, obj) 2303 # the serialized command will be compressed by broadcast 2304 ser = CloudPickleSerializer() -> 2305 pickled_command = ser.dumps(command) 2306 if len(pickled_command) > (1 << 20): # 1M 2307 # The broadcast will have same life cycle as created PythonRDD /home/ubuntu/databricks/spark/python/pyspark/serializers.pyc in dumps(self, obj) 425 426 def dumps(self, obj): --> 427 return cloudpickle.dumps(obj, 2) 428 429 /home/ubuntu/databricks/spark/python/pyspark/cloudpickle.pyc in dumps(obj, protocol) 639 640 cp = CloudPickler(file,protocol) --> 641 cp.dump(obj) 642 643 return file.getvalue() /home/ubuntu/databricks/spark/python/pyspark/cloudpickle.pyc in dump(self, obj) 105 self.inject_addons() 106 try: --> 107 return Pickler.dump(self, obj) 108 except RuntimeError as e: 109 if 'recursion' in e.args[0]: /usr/lib/python2.7/pickle.pyc in dump(self, obj) 222 if self.proto >= 2: 223 self.write(PROTO + chr(self.proto)) --> 224 self.save(obj) 225 self.write(STOP) 226 /usr/lib/python2.7/pickle.pyc in save(self, obj) 284 f = self.dispatch.get(t) 285 if f: --> 286 f(self, obj) # Call unbound method with explicit self 287 return 288 /usr/lib/python2.7/pickle.pyc in save_tuple(self, obj) 560 write(MARK) 561 for element in obj: --> 562 save(element) 563 564 if id(obj) in memo: /usr/lib/python2.7/pickle.pyc in save(self, obj) 284 f = self.dispatch.get(t) 285 if f: --> 286 f(self, obj) # Call unbound method with explicit self 287 return 288 ... skipped 23125 bytes ... 650 651 dispatch[DictionaryType] = save_dict /usr/lib/python2.7/pickle.pyc in _batch_setitems(self, items) 684 k, v = tmp[0] 685 save(k) --> 686 save(v) 687 write(SETITEM) 688 # else t
[jira] [Resolved] (SPARK-6931) python: struct.pack('!q', value) in write_long(value, stream) in serializers.py require int(but doesn't raise exceptions in common cases)
[ https://issues.apache.org/jira/browse/SPARK-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-6931. --- Resolution: Fixed Fix Version/s: 1.2.3 1.3.2 Issue resolved by pull request 8594 [https://github.com/apache/spark/pull/8594] > python: struct.pack('!q', value) in write_long(value, stream) in > serializers.py require int(but doesn't raise exceptions in common cases) > - > > Key: SPARK-6931 > URL: https://issues.apache.org/jira/browse/SPARK-6931 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0 >Reporter: Chunxi Zhang >Priority: Critical > Labels: easyfix > Fix For: 1.3.2, 1.2.3 > > Original Estimate: 1h > Remaining Estimate: 1h > > when I map my own feature calculation module's function, sparks raises: > Traceback (most recent call last): > File > "/usr/local/Cellar/apache-spark/1.3.0/libexec/python/pyspark/daemon.py", line > 162, in manager > code = worker(sock) > File > "/usr/local/Cellar/apache-spark/1.3.0/libexec/python/pyspark/daemon.py", line > 60, in worker > worker_main(infile, outfile) > File > "/usr/local/Cellar/apache-spark/1.3.0/libexec/python/pyspark/worker.py", line > 115, in main > report_times(outfile, boot_time, init_time, finish_time) > File > "/usr/local/Cellar/apache-spark/1.3.0/libexec/python/pyspark/worker.py", line > 40, in report_times > write_long(1000 * boot, outfile) > File > "/usr/local/Cellar/apache-spark/1.3.0/libexec/python/pyspark/serializers.py", > line 518, in write_long > stream.write(struct.pack("!q", value)) > DeprecationWarning: integer argument expected, got float > so I turn on the serializers.py, and tried to print the value out, which is a > float, came from 1000 * time.time() > when I removed my lib, or add a rdd.count() before mapping my lib, this bug > won’t appear. > so I edited the function to : > def write_long(value, stream): > stream.write(struct.pack("!q", int(value))) # added a int(value) > everything seem fine… > According to python’s doc for > struct(https://docs.python.org/2/library/struct.html)’s Note(3), the value > should be a int(for q), and if it’s a float, it’ll try use __index__(), else, > try __int__, but since __int__ is deprecated, it’ll raise DeprecationWarning. > And float doesn’t have __index__, but has __int__, so it should raise the > exception every time. > But, as you can see, in normal cases, it won’t raise the exception, and the > code works perfectly, and exec struct.pack('!q', 111.1) in console or a clean > file won't raise any exception…I can hardly tell how my lib might effect a > time.time()'s value passed to struct.pack()... it might a python's original > bug or what. > Anyway, this value should be a int, so add a int() to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10065) Avoid triple copy of var-length objects in Array in tungsten projection
[ https://issues.apache.org/jira/browse/SPARK-10065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10065. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8496 [https://github.com/apache/spark/pull/8496] > Avoid triple copy of var-length objects in Array in tungsten projection > --- > > Key: SPARK-10065 > URL: https://issues.apache.org/jira/browse/SPARK-10065 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu > Fix For: 1.6.0 > > > The first copy happens when we calculate the size of each element, after > that, we copy the elements into array buffer, finally we copy the array > buffer into row buffer. > We could calculate the total size first, then convert the elements into row > buffer directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9730) Sort Merge Join for Full Outer Join
[ https://issues.apache.org/jira/browse/SPARK-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9730. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8579 [https://github.com/apache/spark/pull/8579] > Sort Merge Join for Full Outer Join > --- > > Key: SPARK-9730 > URL: https://issues.apache.org/jira/browse/SPARK-9730 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Josh Rosen > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10522) Nanoseconds part of Timestamp should be positive in parquet
Davies Liu created SPARK-10522: -- Summary: Nanoseconds part of Timestamp should be positive in parquet Key: SPARK-10522 URL: https://issues.apache.org/jira/browse/SPARK-10522 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu If Timestamp is before unix epoch, the nanosecond part will be negative, Hive can't read that back correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values
[ https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737644#comment-14737644 ] Davies Liu commented on SPARK-10439: There are many places there could be overflow, even for A + B, so I think it's not big deal. If we really want to handle them gracefully, those bound checking should be performed during inbound, turn them into null if overflow, not crash (raise exception). > Catalyst should check for overflow / underflow of date and timestamp values > --- > > Key: SPARK-10439 > URL: https://issues.apache.org/jira/browse/SPARK-10439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Priority: Minor > > While testing some code, I noticed that a few methods in {{DateTimeUtils}} > are prone to overflow and underflow. > For example, {{millisToDays}} can overflow the return type ({{Int}}) if a > large enough input value is provided. > Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which > can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the > negative case). > There might be others but these were the ones that caught my eye. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10474) Aggregation failed with unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10474: --- Target Version/s: 1.6.0, 1.5.1 Priority: Blocker (was: Critical) > Aggregation failed with unable to acquire memory > > > Key: SPARK-10474 > URL: https://issues.apache.org/jira/browse/SPARK-10474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yi Zhou >Priority: Blocker > > In aggregation case, a Lost task happened with below error. > {code} > java.io.IOException: Could not acquire 65536 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Key SQL Query > {code:sql} > INSERT INTO TABLE test_table > SELECT > ss.ss_customer_sk AS cid, > count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1, > count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3, > count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5, > count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7, > count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9, > count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11, > count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13, > count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15, > count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2, > count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4, > count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6, > count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8, > count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10, > count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14, > count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16 > FROM store_sales ss > INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk > WHERE i.i_category IN ('Books') > AND ss.ss_customer_sk IS NOT NULL > GROUP BY ss.ss_customer_sk > HAVING count(ss.ss_item_sk) > 5 > {code} > Note: > the store_sales is a big fact table and item is a small dimension table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON
[ https://issues.apache.org/jira/browse/SPARK-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737375#comment-14737375 ] Davies Liu commented on SPARK-10519: +1 for 3, user have the ability to control timezone, it's also compatible. > Investigate if we should encode timezone information to a timestamp value > stored in JSON > > > Key: SPARK-10519 > URL: https://issues.apache.org/jira/browse/SPARK-10519 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Yin Huai >Priority: Minor > > Since Spark 1.3, we store a timestamp in JSON without encoding the timezone > information and the string representation of a timestamp stored in JSON > implicitly using the local timezone (see > [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454], > > [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38], > > [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41], > > [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]). > This behavior may cause the data consumers got different values when they > are in a different timezone with the data producers. > Since JSON is string based, if we encode timezone information to timestamp > value, downstream applications may need to change their code (for example, > java.sql.Timestamp.valueOf only supports the format of {{-\[m]m-\[d]d > hh:mm:ss\[.f...]}}). > We should investigate what we should do about this issue. Right now, I can > think of three options: > 1. Encoding timezone info in the timestamp value, which can break user code > and may change the semantic of timestamp (our timestamp value is > timezone-less). > 2. When saving a timestamp value to json, we treat this value as a value in > the local timezone and convert it to UTC time. Then, when save the data, we > do not encode timezone info in the value. > 3. We do not change our current behavior. But, in our doc, we explicitly say > that users need to use a single timezone for their datasets (e.g. always use > UTC time). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10461) make sure `input.primitive` is always variable name not code at GenerateUnsafeProjection
[ https://issues.apache.org/jira/browse/SPARK-10461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10461. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8613 [https://github.com/apache/spark/pull/8613] > make sure `input.primitive` is always variable name not code at > GenerateUnsafeProjection > > > Key: SPARK-10461 > URL: https://issues.apache.org/jira/browse/SPARK-10461 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737185#comment-14737185 ] Davies Liu commented on SPARK-10309: [~nadenf] Thanks for letting us know, just realized that your stacktrace already including that fix. Maybe there are multiple join/aggregation/sort in your query? You can show the physical plan by `df.eplain()` > Some tasks failed with Unable to acquire memory > --- > > Key: SPARK-10309 > URL: https://issues.apache.org/jira/browse/SPARK-10309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu > > While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on > executor): > {code} > java.io.IOException: Unable to acquire 33554432 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The task could finished after retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10309) Some tasks failed with Unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737185#comment-14737185 ] Davies Liu edited comment on SPARK-10309 at 9/9/15 4:53 PM: [~nadenf] Thanks for letting us know, just realized that your stacktrace already including that fix. Maybe there are multiple join/aggregation/sort in your query? You can show the physical plan by `df.explain()` was (Author: davies): [~nadenf] Thanks for letting us know, just realized that your stacktrace already including that fix. Maybe there are multiple join/aggregation/sort in your query? You can show the physical plan by `df.eplain()` > Some tasks failed with Unable to acquire memory > --- > > Key: SPARK-10309 > URL: https://issues.apache.org/jira/browse/SPARK-10309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu > > While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on > executor): > {code} > java.io.IOException: Unable to acquire 33554432 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The task could finished after retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10512) Fix @since when a function doesn't have doc
[ https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-10512. -- Resolution: Won't Fix > Fix @since when a function doesn't have doc > --- > > Key: SPARK-10512 > URL: https://issues.apache.org/jira/browse/SPARK-10512 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Yu Ishikawa > > When I tried to add @since to a function which doesn't have doc, @since > didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} > decorator. > {noformat} > Traceback (most recent call last): > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 122, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 34, in _run_code > exec code in run_globals > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 46, in > class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, > JavaLoader): > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 166, in MatrixFactorizationModel > @since("1.3.1") > File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", > line 63, in deco > indents = indent_p.findall(f.__doc__) > TypeError: expected string or buffer > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10512) Fix @since when a function doesn't have doc
[ https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736961#comment-14736961 ] Davies Liu commented on SPARK-10512: As we discussed here https://github.com/apache/spark/pull/8657#discussion_r38992400, we should add a doc for those public API, instead putting a workaround in @since. > Fix @since when a function doesn't have doc > --- > > Key: SPARK-10512 > URL: https://issues.apache.org/jira/browse/SPARK-10512 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Yu Ishikawa > > When I tried to add @since to a function which doesn't have doc, @since > didn't go well. It seems that {{___doc___}} is {{None}} under {{since}} > decorator. > {noformat} > Traceback (most recent call last): > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 122, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line > 34, in _run_code > exec code in run_globals > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 46, in > class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, > JavaLoader): > File > "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py", > line 166, in MatrixFactorizationModel > @since("1.3.1") > File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", > line 63, in deco > indents = indent_p.findall(f.__doc__) > TypeError: expected string or buffer > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735908#comment-14735908 ] Davies Liu commented on SPARK-10309: This also could be related to https://issues.apache.org/jira/browse/SPARK-10341?filter=-2, could you test with 1.5-RC3? > Some tasks failed with Unable to acquire memory > --- > > Key: SPARK-10309 > URL: https://issues.apache.org/jira/browse/SPARK-10309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu > > While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on > executor): > {code} > java.io.IOException: Unable to acquire 33554432 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The task could finished after retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735903#comment-14735903 ] Davies Liu commented on SPARK-10309: [~nadenf] Could you post the physical plan here? That could help us to understand the root cause. > Some tasks failed with Unable to acquire memory > --- > > Key: SPARK-10309 > URL: https://issues.apache.org/jira/browse/SPARK-10309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu > > While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on > executor): > {code} > java.io.IOException: Unable to acquire 33554432 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The task could finished after retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill
[ https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735776#comment-14735776 ] Davies Liu commented on SPARK-10466: [~chenghao] I tried your test case, it passed in master. Is there other things I need to reproduce the failure? > UnsafeRow exception in Sort-Based Shuffle with data spill > -- > > Key: SPARK-10466 > URL: https://issues.apache.org/jira/browse/SPARK-10466 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Priority: Blocker > > In sort-based shuffle, if we have data spill, it will cause assert exception, > the follow code can reproduce that > {code} > withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) { > withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) { > withTempTable("mytemp") { > sparkContext.parallelize(1 to 1000, 3).map(i => (i, > i)).toDF("key", "value").registerTempTable("mytemp") > sql("select key, value as v1 from mytemp where key > > 1").registerTempTable("l") > sql("select key, value as v2 from mytemp where key > > 3").registerTempTable("r") > val df3 = sql("select v1, v2 from l left join r on l.key=r.key") > df3.count() > } > } > } > {code} > {code} > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in > stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.sched
[jira] [Created] (SPARK-10494) Multiple Python UDFs together with aggregation or sort merge join may cause OOM (failed to acquire memory)
Davies Liu created SPARK-10494: -- Summary: Multiple Python UDFs together with aggregation or sort merge join may cause OOM (failed to acquire memory) Key: SPARK-10494 URL: https://issues.apache.org/jira/browse/SPARK-10494 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Priority: Critical The RDD cache for Python UDF is removed in 1.4, then N Python UDFs in one query stage could end up evaluate upstream (SparkPlan) 2^N times. In 1.5, If there is aggregation or sort merge join in upstream SparkPlan, they will cause OOM (failed to acquire memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735644#comment-14735644 ] Davies Liu commented on SPARK-8632: --- The upstream means child of current SparkPlan, could have other Python UDFs. We remove the RDD cache in 1.4, then the upstream will be evaluated twice. If you have multiple Python UDFs, for example three, it will end up evaluate the child 8 times (2 x 2 x 2), which will be really slow or cause OOM. In synchronous batch mode, what's the batch size? if it's small, the overhead of each batch will be high, if it's too large, it's easy to OOM if you have many columns. Also we need to copy the rows (serialization is not need if it's UnsafeRow). > Poor Python UDF performance because of RDD caching > -- > > Key: SPARK-8632 > URL: https://issues.apache.org/jira/browse/SPARK-8632 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Justin Uang >Assignee: Davies Liu > > {quote} > We have been running into performance problems using Python UDFs with > DataFrames at large scale. > From the implementation of BatchPythonEvaluation, it looks like the goal was > to reuse the PythonRDD code. It caches the entire child RDD so that it can do > two passes over the data. One to give to the PythonRDD, then one to join the > python lambda results with the original row (which may have java objects that > should be passed through). > In addition, it caches all the columns, even the ones that don't need to be > processed by the Python UDF. In the cases I was working with, I had a 500 > column table, and i wanted to use a python UDF for one column, and it ended > up caching all 500 columns. > {quote} > http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735401#comment-14735401 ] Davies Liu commented on SPARK-10309: [~nadenf] In my case, the job finally finished (after retry), so this seems to be a blocker for me. Could you provide more information about you job? > Some tasks failed with Unable to acquire memory > --- > > Key: SPARK-10309 > URL: https://issues.apache.org/jira/browse/SPARK-10309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu > > While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on > executor): > {code} > java.io.IOException: Unable to acquire 33554432 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The task could finished after retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735392#comment-14735392 ] Davies Liu commented on SPARK-8632: --- [~rxin] As [~justin.uang] suggested before, the batch mode will need to flush the rows in every place of the pipeline, or it get deadlock. I think the goal is to call upstream once and improve the throughput of Python UDF (which is usually the bottleneck). The batch mode is increase the overhead of Python UDF (for each batch), cause worser performance. The problem of older cache is that serialization and memory management (also not purged after used) overhead. With one time (purged after visited) tungsten cache (and spilling), the overhead should be not that high, I think this should be the most performant and stable approach. > Poor Python UDF performance because of RDD caching > -- > > Key: SPARK-8632 > URL: https://issues.apache.org/jira/browse/SPARK-8632 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Justin Uang >Assignee: Davies Liu > > {quote} > We have been running into performance problems using Python UDFs with > DataFrames at large scale. > From the implementation of BatchPythonEvaluation, it looks like the goal was > to reuse the PythonRDD code. It caches the entire child RDD so that it can do > two passes over the data. One to give to the PythonRDD, then one to join the > python lambda results with the original row (which may have java objects that > should be passed through). > In addition, it caches all the columns, even the ones that don't need to be > processed by the Python UDF. In the cases I was working with, I had a 500 > column table, and i wanted to use a python UDF for one column, and it ended > up caching all 500 columns. > {quote} > http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-8632: - Assignee: Davies Liu > Poor Python UDF performance because of RDD caching > -- > > Key: SPARK-8632 > URL: https://issues.apache.org/jira/browse/SPARK-8632 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Justin Uang >Assignee: Davies Liu > > {quote} > We have been running into performance problems using Python UDFs with > DataFrames at large scale. > From the implementation of BatchPythonEvaluation, it looks like the goal was > to reuse the PythonRDD code. It caches the entire child RDD so that it can do > two passes over the data. One to give to the PythonRDD, then one to join the > python lambda results with the original row (which may have java objects that > should be passed through). > In addition, it caches all the columns, even the ones that don't need to be > processed by the Python UDF. In the cases I was working with, I had a 500 > column table, and i wanted to use a python UDF for one column, and it ended > up caching all 500 columns. > {quote} > http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10459) PythonUDF could process UnsafeRow
Davies Liu created SPARK-10459: -- Summary: PythonUDF could process UnsafeRow Key: SPARK-10459 URL: https://issues.apache.org/jira/browse/SPARK-10459 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Currently, There will be ConvertToSafe for PythonUDF, that's not needed actually. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10436) spark-submit overwrites spark.files defaults with the job script filename
[ https://issues.apache.org/jira/browse/SPARK-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10436: --- Target Version/s: 1.6.0 > spark-submit overwrites spark.files defaults with the job script filename > - > > Key: SPARK-10436 > URL: https://issues.apache.org/jira/browse/SPARK-10436 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.4.0 > Environment: Ubuntu, Spark 1.4.0 Standalone >Reporter: axel dahl >Priority: Minor > Labels: easyfix, feature > > In my spark-defaults.conf I have configured a set of libararies to be > uploaded to my Spark 1.4.0 Standalone cluster. The entry appears as: > spark.files libarary.zip,file1.py,file2.py > When I execute spark-submit -v test.py > I see that spark-submit reads the defaults correctly, but that it overwrites > the "spark.files" default entry and replaces it with the name if the job > script, i.e. "test.py". > This behavior doesn't seem intuitive. test.py, should be added to the spark > working folder, but it should not overwrite the "spark.files" defaults. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10434) Parquet compatibility with 1.4 is broken when writing arrays that may contain nulls
[ https://issues.apache.org/jira/browse/SPARK-10434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10434: --- Priority: Minor (was: Critical) > Parquet compatibility with 1.4 is broken when writing arrays that may contain > nulls > --- > > Key: SPARK-10434 > URL: https://issues.apache.org/jira/browse/SPARK-10434 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > When writing arrays that may contain nulls, for example: > {noformat} > StructType( > StructField( > "f", > ArrayType(IntegerType, containsNull = true), > nullable = false)) > {noformat} > Spark 1.4 uses the following schema: > {noformat} > message m { > required group f (LIST) { > repeated group bag { > optional int32 array; > } > } > } > {noformat} > This behavior is a hybrid of parquet-avro and parquet-hive: the 3-level > structure and repeated group name "bag" are borrowed from parquet-hive, while > the innermost element field name "array" is borrowed from parquet-avro. > However, in Spark 1.5, I failed to notice the latter fact and used a schema > in purely parquet-hive flavor, namely: > {noformat} > message m { > required group f (LIST) { > repeated group bag { > optional int32 array_element; > } > } > } > {noformat} > One of the direct consequence is that, Parquet files containing such array > fields written by Spark 1.5 can't be read by Spark 1.4 (all array elements > become null). > To fix this issue, the name of the innermost field should be changed back to > "array". Notice that this fix doesn't affect interoperability with Hive > (saving Parquet files using {{saveAsTable()}} and then read them using Hive). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10434) Parquet compatibility with 1.4 is broken when writing arrays that may contain nulls
[ https://issues.apache.org/jira/browse/SPARK-10434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729445#comment-14729445 ] Davies Liu commented on SPARK-10434: [~lian cheng] I think it's hard to guarantee forward computability (former version can read data generated by new version), do we really need this change? > Parquet compatibility with 1.4 is broken when writing arrays that may contain > nulls > --- > > Key: SPARK-10434 > URL: https://issues.apache.org/jira/browse/SPARK-10434 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > > When writing arrays that may contain nulls, for example: > {noformat} > StructType( > StructField( > "f", > ArrayType(IntegerType, containsNull = true), > nullable = false)) > {noformat} > Spark 1.4 uses the following schema: > {noformat} > message m { > required group f (LIST) { > repeated group bag { > optional int32 array; > } > } > } > {noformat} > This behavior is a hybrid of parquet-avro and parquet-hive: the 3-level > structure and repeated group name "bag" are borrowed from parquet-hive, while > the innermost element field name "array" is borrowed from parquet-avro. > However, in Spark 1.5, I failed to notice the latter fact and used a schema > in purely parquet-hive flavor, namely: > {noformat} > message m { > required group f (LIST) { > repeated group bag { > optional int32 array_element; > } > } > } > {noformat} > One of the direct consequence is that, Parquet files containing such array > fields written by Spark 1.5 can't be read by Spark 1.4 (all array elements > become null). > To fix this issue, the name of the innermost field should be changed back to > "array". Notice that this fix doesn't affect interoperability with Hive > (saving Parquet files using {{saveAsTable()}} and then read them using Hive). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10425) Add a regression test for SPARK-10379
[ https://issues.apache.org/jira/browse/SPARK-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729421#comment-14729421 ] Davies Liu commented on SPARK-10425: [~sowen] Thanks for your comment, The reason that PR didn't have a regression test is that I can't reproduce the failure after many tries (difference scale and workload). That's something need to be done obviously, also failed in user's workload. This JIRA is requested by reviewer of that PR, I'm fine to close it. cc [~andrewor14] > Add a regression test for SPARK-10379 > - > > Key: SPARK-10425 > URL: https://issues.apache.org/jira/browse/SPARK-10425 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10422) String column in InMemoryColumnarCache needs to override clone method
[ https://issues.apache.org/jira/browse/SPARK-10422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10422. Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8578 [https://github.com/apache/spark/pull/8578] > String column in InMemoryColumnarCache needs to override clone method > - > > Key: SPARK-10422 > URL: https://issues.apache.org/jira/browse/SPARK-10422 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.5.0 > > > We have a clone method in {{ColumnType}} > (https://github.com/apache/spark/blob/v1.5.0-rc3/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala#L103). > Seems we need to override it for String > (https://github.com/apache/spark/blob/v1.5.0-rc3/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala#L314) > because we are dealing with UTF8String. > {code} > val df = > ctx.range(1, 3).selectExpr("id % 500 as id").rdd.map(id => > Tuple1(s"str_$id")).toDF("i") > val cached = df.cache() > cached.count() > [info] - SPARK-10422: String column in InMemoryColumnarCache needs to > override clone method *** FAILED *** (9 seconds, 152 milliseconds) > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in > stage 0.0 (TID 1, localhost): java.util.NoSuchElementException: key not > found: str_[0] > [info]at scala.collection.MapLike$class.default(MapLike.scala:228) > [info]at scala.collection.AbstractMap.default(Map.scala:58) > [info]at scala.collection.mutable.HashMap.apply(HashMap.scala:64) > [info]at > org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) > [info]at > org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) > [info]at > org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) > [info]at > org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152) > [info]at > org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152) > [info]at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > [info]at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > [info]at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > [info]at > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > [info]at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > [info]at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > [info]at > org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152) > [info]at > org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10425) Add a regression test for SPARK-10379
Davies Liu created SPARK-10425: -- Summary: Add a regression test for SPARK-10379 Key: SPARK-10425 URL: https://issues.apache.org/jira/browse/SPARK-10425 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10424) ShuffleHashOuterJoin should consider condition
[ https://issues.apache.org/jira/browse/SPARK-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10424: --- Priority: Blocker (was: Major) > ShuffleHashOuterJoin should consider condition > -- > > Key: SPARK-10424 > URL: https://issues.apache.org/jira/browse/SPARK-10424 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Priority: Blocker > > Currently, ShuffleHashOuterJoin does not consider condition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10424) ShuffleHashOuterJoin should consider condition
Davies Liu created SPARK-10424: -- Summary: ShuffleHashOuterJoin should consider condition Key: SPARK-10424 URL: https://issues.apache.org/jira/browse/SPARK-10424 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu Currently, ShuffleHashOuterJoin does not consider condition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10417) Iterating through Column results in infinite loop
[ https://issues.apache.org/jira/browse/SPARK-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10417. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8574 [https://github.com/apache/spark/pull/8574] > Iterating through Column results in infinite loop > - > > Key: SPARK-10417 > URL: https://issues.apache.org/jira/browse/SPARK-10417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 >Reporter: Zoltan Toth >Priority: Minor > Fix For: 1.6.0 > > > Iterating through a _Column_ object results in an infinite loop. > {code} > df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}'])) > for i in df["name"]: print i > {code} > Result: > {code} > Column > Column > Column > Column > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10392: --- Fix Version/s: 1.5.1 > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > Fix For: 1.6.0, 1.5.1 > > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10392. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8556 [https://github.com/apache/spark/pull/8556] > Pyspark - Wrong DateType support on JDBC connection > --- > > Key: SPARK-10392 > URL: https://issues.apache.org/jira/browse/SPARK-10392 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.1 >Reporter: Maciej Bryński > Fix For: 1.6.0 > > > I have following problem. > I created table. > {code} > CREATE TABLE `spark_test` ( > `id` INT(11) NULL, > `date` DATE NULL > ) > COLLATE='utf8_general_ci' > ENGINE=InnoDB > ; > INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01'); > {code} > Then I'm trying to read data - date '1970-01-01' is converted to int. This > makes data frame incompatible with its own schema. > {code} > df = > sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", > 'spark_test') > print(df.collect()) > df = sqlCtx.createDataFrame(df.rdd, df.schema) > [Row(id=1, date=0)] > --- > TypeError Traceback (most recent call last) > in () > 1 df = > sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4", > 'spark_test') > 2 print(df.collect()) > > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema) > /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, > schema, samplingRatio) > 402 > 403 if isinstance(data, RDD): > --> 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > 406 rdd, schema = self._createFromLocal(data, schema) > /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, > schema, samplingRatio) > 296 rows = rdd.take(10) > 297 for row in rows: > --> 298 _verify_type(row, schema) > 299 > 300 else: > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1152 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1153 for v, f in zip(obj, dataType.fields): > -> 1154 _verify_type(v, f.dataType) >1155 >1156 > /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType) >1136 # subclass of them can not be fromInternald in JVM >1137 if type(obj) not in _acceptable_types[_type]: > -> 1138 raise TypeError("%s can not accept object in type %s" % > (dataType, type(obj))) >1139 >1140 if isinstance(dataType, ArrayType): > TypeError: DateType can not accept object in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10162) PySpark filters with datetimes mess up when datetimes have timezones.
[ https://issues.apache.org/jira/browse/SPARK-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10162. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8555 [https://github.com/apache/spark/pull/8555] > PySpark filters with datetimes mess up when datetimes have timezones. > - > > Key: SPARK-10162 > URL: https://issues.apache.org/jira/browse/SPARK-10162 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Kevin Cox > Fix For: 1.6.0 > > > PySpark appears to ignore timezone information when filtering on (and working > in general with) datetimes. > Please see the example below. The generated filter in the query plan is 5 > hours off (my computer is EST). > {code} > In [1]: df = sc.sql.createDataFrame([], StructType([StructField("dt", > TimestampType())])) > In [2]: df.filter(df.dt > datetime(2000, 01, 01, tzinfo=UTC)).explain() > Filter (dt#9 > 9467028) > Scan PhysicalRDD[dt#9] > {code} > Note that 9467028 == Sat 1 Jan 2000 05:00:00 UTC -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10404) Worker should terminate previous executor before launch new one
Davies Liu created SPARK-10404: -- Summary: Worker should terminate previous executor before launch new one Key: SPARK-10404 URL: https://issues.apache.org/jira/browse/SPARK-10404 Project: Spark Issue Type: Bug Reporter: Davies Liu Reported here: http://apache-spark-user-list.1001560.n3.nabble.com/Hung-spark-executors-don-t-count-toward-worker-memory-limit-td16083.html#a24548 If new launched executor is overlapped with previous ones, they could run out of memory in the machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10379) UnsafeShuffleExternalSorter should preserve first page
[ https://issues.apache.org/jira/browse/SPARK-10379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10379: --- Target Version/s: 1.6.0, 1.5.1 (was: 1.5.0) > UnsafeShuffleExternalSorter should preserve first page > -- > > Key: SPARK-10379 > URL: https://issues.apache.org/jira/browse/SPARK-10379 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > {code} > 5/08/31 18:41:25 WARN TaskSetManager: Lost task 16.1 in stage 316.0 (TID > 32686, lon4-hadoopslave-b925.lon4.spotify.net): java.io.IOException: Unable > to acquire 67108864 bytes of memory > at > org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.acquireNewPageIfNecessary(UnsafeShuffleExternalSorter.java:385) > at > org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.insertRecord(UnsafeShuffleExternalSorter.java:435) > at > org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) > at > org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:174) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10403) UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)
Davies Liu created SPARK-10403: -- Summary: UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort) Key: SPARK-10403 URL: https://issues.apache.org/jira/browse/SPARK-10403 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu UnsafeRowSerializer reply on EOF in the stream, but UnsafeRowWriter will not write EOF between partitions. {code} java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:122) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:174) at org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10379) UnsafeShuffleExternalSorter should preserve first page
Davies Liu created SPARK-10379: -- Summary: UnsafeShuffleExternalSorter should preserve first page Key: SPARK-10379 URL: https://issues.apache.org/jira/browse/SPARK-10379 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Critical {code} 5/08/31 18:41:25 WARN TaskSetManager: Lost task 16.1 in stage 316.0 (TID 32686, lon4-hadoopslave-b925.lon4.spotify.net): java.io.IOException: Unable to acquire 67108864 bytes of memory at org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.acquireNewPageIfNecessary(UnsafeShuffleExternalSorter.java:385) at org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.insertRecord(UnsafeShuffleExternalSorter.java:435) at org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) at org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:174) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10373) Move @since annotator to pyspark to be shared by all components
[ https://issues.apache.org/jira/browse/SPARK-10373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724009#comment-14724009 ] Davies Liu commented on SPARK-10373: [~mengxr] Do we want to add @since for the MLLib APIs in 1.5 docs ? > Move @since annotator to pyspark to be shared by all components > --- > > Key: SPARK-10373 > URL: https://issues.apache.org/jira/browse/SPARK-10373 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Davies Liu > > Python's `@since` is defined under `pyspark.sql`. It would be nice to move it > under `pyspark` to be shared by all components. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10345) Flaky test: HiveCompatibilitySuite.nonblock_op_deduplicate
Davies Liu created SPARK-10345: -- Summary: Flaky test: HiveCompatibilitySuite.nonblock_op_deduplicate Key: SPARK-10345 URL: https://issues.apache.org/jira/browse/SPARK-10345 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41759/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/nonblock_op_deduplicate/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10341) SMJ fail with unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10341: --- Target Version/s: 1.5.0 (was: 1.5.1) > SMJ fail with unable to acquire memory > -- > > Key: SPARK-10341 > URL: https://issues.apache.org/jira/browse/SPARK-10341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > In SMJ, the first ExternalSorter could consume all the memory before > spilling, then the second can not even acquire the first page. > {code} > ava.io.IOException: Unable to acquire 16777216 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10323) NPE in code-gened In expression
[ https://issues.apache.org/jira/browse/SPARK-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10323. Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8492 [https://github.com/apache/spark/pull/8492] > NPE in code-gened In expression > --- > > Key: SPARK-10323 > URL: https://issues.apache.org/jira/browse/SPARK-10323 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.0 > > > To reproduce the problem, you can run {{null in ('str')}}. Let's also take a > look InSet and other similar expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10343) Consider nullability of expression in codegen
Davies Liu created SPARK-10343: -- Summary: Consider nullability of expression in codegen Key: SPARK-10343 URL: https://issues.apache.org/jira/browse/SPARK-10343 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Priority: Critical In codegen, we didn't consider nullability of expressions. Once considering this, we can avoid lots of null check (reduce the size of generated code, also improve performance). Before that, we should double-check the correctness of nullablity of all expressions and schema, or we will hit NPE or wrong results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10342) Cooperative memory management
[ https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10342: --- Issue Type: Improvement (was: Story) > Cooperative memory management > - > > Key: SPARK-10342 > URL: https://issues.apache.org/jira/browse/SPARK-10342 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Priority: Critical > > We have memory starving problems for a long time, it become worser in 1.5 > since we use larger page. > In order to increase the memory usage (reduce unnecessary spilling) also > reduce the risk of OOM, we should manage the memory in a cooperative way, it > means all the memory consume should be also responsive to release memory > (spilling) upon others' requests. > The requests of memory could be different, hard requirement (will crash if > not allocated) or soft requirement (worse performance if not allocated). Also > the costs of spilling are also different. We could introduce some kind of > priority to make them work together better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10342) Cooperative memory management
Davies Liu created SPARK-10342: -- Summary: Cooperative memory management Key: SPARK-10342 URL: https://issues.apache.org/jira/browse/SPARK-10342 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical We have memory starving problems for a long time, it become worser in 1.5 since we use larger page. In order to increase the memory usage (reduce unnecessary spilling) also reduce the risk of OOM, we should manage the memory in a cooperative way, it means all the memory consume should be also responsive to release memory (spilling) upon others' requests. The requests of memory could be different, hard requirement (will crash if not allocated) or soft requirement (worse performance if not allocated). Also the costs of spilling are also different. We could introduce some kind of priority to make them work together better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10341) SMJ fail with unable to acquire memory
Davies Liu created SPARK-10341: -- Summary: SMJ fail with unable to acquire memory Key: SPARK-10341 URL: https://issues.apache.org/jira/browse/SPARK-10341 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical In SMJ, the first ExternalSorter could consume all the memory before spilling, then the second can not even acquire the first page. {code} ava.io.IOException: Unable to acquire 16777216 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) at org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10321) OrcRelation doesn't override sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-10321: -- Assignee: Davies Liu > OrcRelation doesn't override sizeInBytes > > > Key: SPARK-10321 > URL: https://issues.apache.org/jira/browse/SPARK-10321 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Davies Liu >Priority: Critical > > This hurts performance badly because broadcast join can never be enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10309) Some tasks failed with Unable to acquire memory
Davies Liu created SPARK-10309: -- Summary: Some tasks failed with Unable to acquire memory Key: SPARK-10309 URL: https://issues.apache.org/jira/browse/SPARK-10309 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on executor): {code} java.io.IOException: Unable to acquire 33554432 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) at org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} The task could finished after retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10305) PySpark createDataFrame on list of LabeledPoints fails (regression)
[ https://issues.apache.org/jira/browse/SPARK-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10305. Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8470 [https://github.com/apache/spark/pull/8470] > PySpark createDataFrame on list of LabeledPoints fails (regression) > --- > > Key: SPARK-10305 > URL: https://issues.apache.org/jira/browse/SPARK-10305 > Project: Spark > Issue Type: Bug > Components: ML, PySpark, SQL >Affects Versions: 1.5.0 >Reporter: Joseph K. Bradley >Priority: Critical > Fix For: 1.5.0 > > > The following code works in 1.4 but fails in 1.5: > {code} > import numpy as np > from pyspark.mllib.regression import LabeledPoint > from pyspark.mllib.linalg import Vectors > lp1 = LabeledPoint(1.0, Vectors.sparse(5, np.array([0, 1]), np.array([2.0, > 21.0]))) > lp2 = LabeledPoint(0.0, Vectors.sparse(5, np.array([2, 3]), np.array([2.0, > 21.0]))) > tmp = [lp1, lp2] > sqlContext.createDataFrame(tmp).show() > {code} > The failure is: > {code} > ValueError: Unexpected tuple LabeledPoint(1.0, (5,[0,1],[2.0,21.0])) with > StructType > --- > ValueErrorTraceback (most recent call last) > in () > 6 lp2 = LabeledPoint(0.0, Vectors.sparse(5, np.array([2, 3]), > np.array([2.0, 21.0]))) > 7 tmp = [lp1, lp2] > > 8 sqlContext.createDataFrame(tmp).show() > /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in > createDataFrame(self, data, schema, samplingRatio) > 404 rdd, schema = self._createFromRDD(data, schema, > samplingRatio) > 405 else: > --> 406 rdd, schema = self._createFromLocal(data, schema) > 407 jrdd = > self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) > 408 jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), > schema.json()) > /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in > _createFromLocal(self, data, schema) > 335 > 336 # convert python objects to sql data > --> 337 data = [schema.toInternal(row) for row in data] > 338 return self._sc.parallelize(data), schema > 339 > /home/ubuntu/databricks/spark/python/pyspark/sql/types.pyc in > toInternal(self, obj) > 539 return tuple(f.toInternal(v) for f, v in > zip(self.fields, obj)) > 540 else: > --> 541 raise ValueError("Unexpected tuple %r with > StructType" % obj) > 542 else: > 543 if isinstance(obj, dict): > ValueError: Unexpected tuple LabeledPoint(1.0, (5,[0,1],[2.0,21.0])) with > StructType > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10302) NPE while save a DataFrame as ORC
[ https://issues.apache.org/jira/browse/SPARK-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-10302. -- Resolution: Duplicate Fix Version/s: 1.5.0 > NPE while save a DataFrame as ORC > - > > Key: SPARK-10302 > URL: https://issues.apache.org/jira/browse/SPARK-10302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Priority: Critical > Fix For: 1.5.0 > > > {code} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$2.apply(HiveInspectors.scala:377) > at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$2.apply(HiveInspectors.scala:377) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.writeInternal(OrcRelation.scala:130) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:240) > ... 8 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10302) NPE while save a DataFrame as ORC
Davies Liu created SPARK-10302: -- Summary: NPE while save a DataFrame as ORC Key: SPARK-10302 URL: https://issues.apache.org/jira/browse/SPARK-10302 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu Priority: Critical {code} Caused by: java.lang.NullPointerException at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$2.apply(HiveInspectors.scala:377) at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$2.apply(HiveInspectors.scala:377) at org.apache.spark.sql.hive.orc.OrcOutputWriter.writeInternal(OrcRelation.scala:130) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:240) ... 8 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14715512#comment-14715512 ] Davies Liu commented on SPARK-9228: --- [~jameszhouyi] unsafe.offHeap is another option that is separated to `sql.tungsten.enable`, which specify how tungsten manage the memory, on heap (default), or off heap. > Combine unsafe and codegen into a single option > --- > > Key: SPARK-9228 > URL: https://issues.apache.org/jira/browse/SPARK-9228 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.5.0 > > > Before QA, lets flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10245) SQLContext can't parse literal less than 0.1 ( 0.01)
Davies Liu created SPARK-10245: -- Summary: SQLContext can't parse literal less than 0.1 ( 0.01) Key: SPARK-10245 URL: https://issues.apache.org/jira/browse/SPARK-10245 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker {code} scala> sqlCtx.sql("select 0.01") org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).; at org.apache.spark.sql.types.PrecisionInfo.(DecimalType.scala:32) at org.apache.spark.sql.types.DecimalType.(DecimalType.scala:68) at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:40) at org.apache.spark.sql.catalyst.SqlParser$$anonfun$numericLiteral$2$$anonfun$apply$216.apply(SqlParser.scala:335) at org.apache.spark.sql.catalyst.SqlParser$$anonfun$numericLiteral$2$$anonfun$apply$216.apply(SqlParser.scala:334) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10196) Failed to save json data with a decimal type in the schema
[ https://issues.apache.org/jira/browse/SPARK-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10196. Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8408 [https://github.com/apache/spark/pull/8408] > Failed to save json data with a decimal type in the schema > -- > > Key: SPARK-10196 > URL: https://issues.apache.org/jira/browse/SPARK-10196 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.5.0 > > > I try to save a dataset with a decimal type in it and I got > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 7 in stage 19.0 failed 4 times, most recent failure: Lost task 7.3 in > stage 19.0 (TID 932, 10.0.243.5): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:391) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: (DecimalType(7,2),40.74) (of class scala.Tuple2) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) > at > org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:191) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:334) > ... 8 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1818) > at org.apache.spark.SparkContext.r
[jira] [Commented] (SPARK-10215) Div of Decimal returns null
[ https://issues.apache.org/jira/browse/SPARK-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710696#comment-14710696 ] Davies Liu commented on SPARK-10215: I think we have not enough time to figure out the right solution for 1.5 release, I'd like to target this for 1.6, dose this work for you? Or does this break any real use case? > Div of Decimal returns null > --- > > Key: SPARK-10215 > URL: https://issues.apache.org/jira/browse/SPARK-10215 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Priority: Blocker > > {code} > val d = Decimal(1.12321) > val df = Seq((d, 1)).toDF("a", "b") > df.selectExpr("b * a / b").collect() => Array(Row(null)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8580) Test Parquet interoperability and compatibility with other libraries/systems
[ https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8580. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8392 [https://github.com/apache/spark/pull/8392] > Test Parquet interoperability and compatibility with other libraries/systems > > > Key: SPARK-8580 > URL: https://issues.apache.org/jira/browse/SPARK-8580 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 1.5.0 > > > As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 > to improve interoperability with other systems (reading non-standard Parquet > files they generate, and generating standard Parquet files), it would be good > to have a set of standard test Parquet files generated by various > systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old > versions of Spark SQL) to ensure compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7506) pyspark.sql.types.StructType.fromJson() is incorrectly named
[ https://issues.apache.org/jira/browse/SPARK-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-7506. - Resolution: Won't Fix Assignee: (was: Davies Liu) These functions could be used by users (even we don't hope to), they can't be changed easily without breaking compatibility. > pyspark.sql.types.StructType.fromJson() is incorrectly named > > > Key: SPARK-7506 > URL: https://issues.apache.org/jira/browse/SPARK-7506 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.3.0, 1.3.1 >Reporter: Nicholas Chammas >Priority: Minor > > {code} > >>> json_rdd = sqlContext.jsonRDD(sc.parallelize(['{"name": "Nick"}'])) > >>> json_rdd.schema > StructType(List(StructField(name,StringType,true))) > >>> type(json_rdd.schema) > > >>> json_rdd.schema.json() > '{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"}],"type":"struct"}' > >>> pyspark.sql.types.StructType.fromJson(json_rdd.schema.json()) > Traceback (most recent call last): > File "", line 1, in > File > "/Applications/apache-spark/spark-1.3.1-bin-hadoop2.4/python/pyspark/sql/types.py", > line 346, in fromJson > return StructType([StructField.fromJson(f) for f in json["fields"]]) > TypeError: string indices must be integers, not str > >>> import json > >>> pyspark.sql.types.StructType.fromJson(json.loads(json_rdd.schema.json())) > StructType(List(StructField(name,StringType,true))) > >>> > {code} > So {{fromJson()}} doesn't actually expect JSON, which is a string. It expects > a dictionary. > This method should probably be renamed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10177) Parquet support interprets timestamp values differently from Hive 0.14.0+
[ https://issues.apache.org/jira/browse/SPARK-10177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709878#comment-14709878 ] Davies Liu commented on SPARK-10177: [~lian cheng] After some investigation, just realized that we misunderstood the format used in Hive/parquet. the INT96 has two parts, day of julian day, nano seconds of the day. Because of an integer of julian day begin at noon of a day (12 pm), so they two parts are overlapped. They can't be interpreted as Julian day, and then add them together. It's weird but not wrong, I will send out a PR to fix this soon. > Parquet support interprets timestamp values differently from Hive 0.14.0+ > - > > Key: SPARK-10177 > URL: https://issues.apache.org/jira/browse/SPARK-10177 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > Attachments: 00_0 > > > Running the following SQL under Hive 0.14.0+ (tested against 0.14.0 and > 1.2.1): > {code:sql} > CREATE TABLE ts_test STORED AS PARQUET > AS SELECT CAST("2015-01-01 00:00:00" AS TIMESTAMP); > {code} > Then read the Parquet file generated by Hive with Spark SQL: > {noformat} > scala> > sqlContext.read.parquet("hdfs://localhost:9000/user/hive/warehouse_hive14/ts_test").collect() > res1: Array[org.apache.spark.sql.Row] = Array([2015-01-01 12:00:00.0]) > {noformat} > This issue can be easily reproduced with [this test case in PR > #8392|https://github.com/apache/spark/pull/8392/files#diff-1e55698cc579cbae676f827a89c2dc2eR116]. > Spark 1.4.1 works as expected in this case. > > Update: > Seems that the problem is that we do Julian day conversion wrong in > {{DateTimeUtils}}. The following {{spark-shell}} session illustrates it: > {code} > import java.sql._ > import java.util._ > import org.apache.hadoop.hive.ql.io.parquet.timestamp._ > import org.apache.spark.sql.catalyst.util._ > TimeZone.setDefault(TimeZone.getTimeZone("GMT")) > val ts = Timestamp.valueOf("1970-01-01 00:00:00") > val nt = NanoTimeUtils.getNanoTime(ts, false) > val jts = DateTimeUtils.fromJulianDay(nt.getJulianDay, nt.getTimeOfDayNanos) > DateTimeUtils.toJavaTimestamp(jts) > // ==> java.sql.Timestamp = 1970-01-01 12:00:00.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9401) Fully implement code generation for ConcatWs
[ https://issues.apache.org/jira/browse/SPARK-9401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9401. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8353 [https://github.com/apache/spark/pull/8353] > Fully implement code generation for ConcatWs > > > Key: SPARK-9401 > URL: https://issues.apache.org/jira/browse/SPARK-9401 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > Fix For: 1.6.0 > > > In ConcatWs, we fall back to interpreted mode if the input is a mix of string > and array of strings. We should have code gen for those as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option
[ https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706289#comment-14706289 ] Davies Liu commented on SPARK-9228: --- Right now, it's an internal configuration (could be changed or removed in next release), we keep them only for debug purpose. > Combine unsafe and codegen into a single option > --- > > Key: SPARK-9228 > URL: https://issues.apache.org/jira/browse/SPARK-9228 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.5.0 > > > Before QA, lets flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9400) Implement code generation for StringLocate
[ https://issues.apache.org/jira/browse/SPARK-9400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9400. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8330 [https://github.com/apache/spark/pull/8330] > Implement code generation for StringLocate > -- > > Key: SPARK-9400 > URL: https://issues.apache.org/jira/browse/SPARK-9400 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs
[ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reopened SPARK-3533: --- Re-open this for discussion. > Add saveAsTextFileByKey() method to RDDs > > > Key: SPARK-3533 > URL: https://issues.apache.org/jira/browse/SPARK-3533 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 1.1.0 >Reporter: Nicholas Chammas > > Users often have a single RDD of key-value pairs that they want to save to > multiple locations based on the keys. > For example, say I have an RDD like this: > {code} > >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', > >>> 'Frankie']).keyBy(lambda x: x[0]) > >>> a.collect() > [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')] > >>> a.keys().distinct().collect() > ['B', 'F', 'N'] > {code} > Now I want to write the RDD out to different paths depending on the keys, so > that I have one output directory per distinct key. Each output directory > could potentially have multiple {{part-}} files, one per RDD partition. > So the output would look something like: > {code} > /path/prefix/B [/part-1, /part-2, etc] > /path/prefix/F [/part-1, /part-2, etc] > /path/prefix/N [/part-1, /part-2, etc] > {code} > Though it may be possible to do this with some combination of > {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the > {{MultipleTextOutputFormat}} output format class, it isn't straightforward. > It's not clear if it's even possible at all in PySpark. > Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs > that makes it easy to save RDDs out to multiple locations at once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10129) math function: stddev_samp
Davies Liu created SPARK-10129: -- Summary: math function: stddev_samp Key: SPARK-10129 URL: https://issues.apache.org/jira/browse/SPARK-10129 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu Use the STDDEV_SAMP function to return the standard deviation of a sample variance. http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.bigsql.doc/doc/bsql_stdev_samp.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7379) pickle.loads expects a string instead of bytes in Python 3.
[ https://issues.apache.org/jira/browse/SPARK-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703583#comment-14703583 ] Davies Liu commented on SPARK-7379: --- [~mengxr] This is a known issue that how to unpickle a bytearray from Pyrolite in Python 3 (it's OK in Python 2), is it related to that? I have not figured out a way to let Pyrolite compatible wit both Python 2 and 3. > pickle.loads expects a string instead of bytes in Python 3. > --- > > Key: SPARK-7379 > URL: https://issues.apache.org/jira/browse/SPARK-7379 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Davies Liu > > In PickleSerializer, we call pickle.loads in Python 3. However, the input obj > could be bytes, which works in Python 2 but not 3. > The error message is > {code} > File > "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/serializers.py", > line 418, in loads > return pickle.loads(obj, encoding=encoding) > TypeError: must be a unicode character, not bytes > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9644) Support update DecimalType with precision > 18 in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703474#comment-14703474 ] Davies Liu commented on SPARK-9644: --- [~robbinspg] It's fixed by https://issues.apache.org/jira/browse/SPARK-10095 > Support update DecimalType with precision > 18 in UnsafeRow > --- > > Key: SPARK-9644 > URL: https://issues.apache.org/jira/browse/SPARK-9644 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.0 > > > Currently, we don't support using DecimalType with precision > 18 in new > unsafe aggregation, it's good to support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6798) Fix Date serialization in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702625#comment-14702625 ] Davies Liu commented on SPARK-6798: --- Not a bug, it's just not efficient. > Fix Date serialization in SparkR > > > Key: SPARK-6798 > URL: https://issues.apache.org/jira/browse/SPARK-6798 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Davies Liu >Priority: Minor > > SparkR's date serialization right now sends strings from R to the JVM. We > should convert this to integers and also account for timezones correctly by > using DateUtils -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6798) Fix Date serialization in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-6798: -- Issue Type: Improvement (was: Bug) > Fix Date serialization in SparkR > > > Key: SPARK-6798 > URL: https://issues.apache.org/jira/browse/SPARK-6798 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Davies Liu >Priority: Minor > > SparkR's date serialization right now sends strings from R to the JVM. We > should convert this to integers and also account for timezones correctly by > using DateUtils -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10107) NPE in format_number
Davies Liu created SPARK-10107: -- Summary: NPE in format_number Key: SPARK-10107 URL: https://issues.apache.org/jira/browse/SPARK-10107 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker {code} sqlContext.range(1<<20).selectExpr("format_number(id, 2)").show() {code} {code} java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:93) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.TungstenProject$$anonfun$2$$anonfun$apply$3.apply(basicOperators.scala:88) at org.apache.spark.sql.execution.TungstenProject$$anonfun$2$$anonfun$apply$3.apply(basicOperators.scala:86) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:129) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10095) Should not use the private field of BigInteger
[ https://issues.apache.org/jira/browse/SPARK-10095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10095. Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8286 [https://github.com/apache/spark/pull/8286] > Should not use the private field of BigInteger > -- > > Key: SPARK-10095 > URL: https://issues.apache.org/jira/browse/SPARK-10095 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Minor > Fix For: 1.5.0 > > > In UnsafeRow, we use the private field of BigInteger for better performance, > but it actually didn't contribute much to end-to-end runtime, and make it not > portable (may fail on other JVM implementations). > So we should use the public API instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9627) SQL job failed if the dataframe with string columns is cached
[ https://issues.apache.org/jira/browse/SPARK-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702390#comment-14702390 ] Davies Liu commented on SPARK-9627: --- I can reproduce it with latest master. > SQL job failed if the dataframe with string columns is cached > - > > Key: SPARK-9627 > URL: https://issues.apache.org/jira/browse/SPARK-9627 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Cheng Lian >Priority: Blocker > > {code} > r = random.Random() > def gen(i): > d = date.today() - timedelta(r.randint(0, 5000)) > cat = str(r.randint(0, 20)) * 5 > c = r.randint(0, 1000) > price = decimal.Decimal(r.randint(0, 10)) / 100 > return (d, cat, c, price) > schema = StructType().add('date', DateType()).add('cat', > StringType()).add('count', ShortType()).add('price', DecimalType(5, 2)) > #df = sqlContext.createDataFrame(sc.range(1<<24).map(gen), schema) > #df.show() > #df.write.parquet('sales4') > df = sqlContext.read.parquet('sales4') > df.cache() > df.count() > df.show() > print df.schema > raw_input() > r = df.groupBy(df.date, df.cat).agg(sum(df['count'] * df.price)) > print r.explain(True) > r.show() > {code} > {code} > StructType(List(StructField(date,DateType,true),StructField(cat,StringType,true),StructField(count,ShortType,true),StructField(price,DecimalType(5,2),true))) > == Parsed Logical Plan == > 'Aggregate [date#0,cat#1], [date#0,cat#1,sum((count#2 * price#3)) AS > sum((count * price))#70] > Relation[date#0,cat#1,count#2,price#3] > org.apache.spark.sql.parquet.ParquetRelation@5ec8f315 > == Analyzed Logical Plan == > date: date, cat: string, sum((count * price)): decimal(21,2) > Aggregate [date#0,cat#1], > [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, > DecimalType(5,0)), DecimalType(11,2))) * > change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * > price))#70] > Relation[date#0,cat#1,count#2,price#3] > org.apache.spark.sql.parquet.ParquetRelation@5ec8f315 > == Optimized Logical Plan == > Aggregate [date#0,cat#1], > [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, > DecimalType(5,0)), DecimalType(11,2))) * > change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * > price))#70] > InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, > StorageLevel(true, true, false, true, 1), (PhysicalRDD > [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None > == Physical Plan == > NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) > ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, > DecimalType(5,0)), DecimalType(11,2))) * > change_decimal_precision(CAST(price#3, > DecimalType(11,2)2,mode=Final,isDistinct=false)) > TungstenSort [date#0 ASC,cat#1 ASC], false, 0 > ConvertToUnsafe >Exchange hashpartitioning(date#0,cat#1) > NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) > ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, > DecimalType(5,0)), DecimalType(11,2))) * > change_decimal_precision(CAST(price#3, > DecimalType(11,2)2,mode=Partial,isDistinct=false)) > TungstenSort [date#0 ASC,cat#1 ASC], false, 0 > ConvertToUnsafe >InMemoryColumnarTableScan [date#0,cat#1,count#2,price#3], > (InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, > StorageLevel(true, true, false, true, 1), (PhysicalRDD > [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None) > Code Generation: true > == RDD == > None > 15/08/04 23:21:53 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; > aborting job > Traceback (most recent call last): > File "t.py", line 34, in > r.show() > File "/Users/davies/work/spark/python/pyspark/sql/dataframe.py", line 258, > in show > print(self._jdf.showString(n, truncate)) > File "/Users/davies/work/spark/python/lib/py4j/java_gateway.py", line 538, > in __call__ > self.target_id, self.name) > File "/Users/davies/work/spark/python/pyspark/sql/utils.py", line 36, in > deco > return f(*a, **kw) > File "/Users/davies/work/spark/python/lib/py4j/protocol.py", line 300, in > get_return_value > format(target_id, '.', name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 > (TID 10, localhost): java.lang.UnsupportedOperationException: tail of empty > list > at scala.collection.immutable.Nil$.tail(List.scala:339) > at scala.collection.immutable.Nil$.tail(List.scala:334) > at scala.reflect.internal.SymbolTable.popPh
[jira] [Commented] (SPARK-9627) SQL job failed if the dataframe with string columns is cached
[ https://issues.apache.org/jira/browse/SPARK-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702387#comment-14702387 ] Davies Liu commented on SPARK-9627: --- The `df.show()` will succeed, but `df.groupBy(df.date, df.cat).agg(sum(df['count'] * df.price))` will fail (you need to press enter to run it, or remove the `raw_input()` line) > SQL job failed if the dataframe with string columns is cached > - > > Key: SPARK-9627 > URL: https://issues.apache.org/jira/browse/SPARK-9627 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Cheng Lian >Priority: Blocker > > {code} > r = random.Random() > def gen(i): > d = date.today() - timedelta(r.randint(0, 5000)) > cat = str(r.randint(0, 20)) * 5 > c = r.randint(0, 1000) > price = decimal.Decimal(r.randint(0, 10)) / 100 > return (d, cat, c, price) > schema = StructType().add('date', DateType()).add('cat', > StringType()).add('count', ShortType()).add('price', DecimalType(5, 2)) > #df = sqlContext.createDataFrame(sc.range(1<<24).map(gen), schema) > #df.show() > #df.write.parquet('sales4') > df = sqlContext.read.parquet('sales4') > df.cache() > df.count() > df.show() > print df.schema > raw_input() > r = df.groupBy(df.date, df.cat).agg(sum(df['count'] * df.price)) > print r.explain(True) > r.show() > {code} > {code} > StructType(List(StructField(date,DateType,true),StructField(cat,StringType,true),StructField(count,ShortType,true),StructField(price,DecimalType(5,2),true))) > == Parsed Logical Plan == > 'Aggregate [date#0,cat#1], [date#0,cat#1,sum((count#2 * price#3)) AS > sum((count * price))#70] > Relation[date#0,cat#1,count#2,price#3] > org.apache.spark.sql.parquet.ParquetRelation@5ec8f315 > == Analyzed Logical Plan == > date: date, cat: string, sum((count * price)): decimal(21,2) > Aggregate [date#0,cat#1], > [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, > DecimalType(5,0)), DecimalType(11,2))) * > change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * > price))#70] > Relation[date#0,cat#1,count#2,price#3] > org.apache.spark.sql.parquet.ParquetRelation@5ec8f315 > == Optimized Logical Plan == > Aggregate [date#0,cat#1], > [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, > DecimalType(5,0)), DecimalType(11,2))) * > change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * > price))#70] > InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, > StorageLevel(true, true, false, true, 1), (PhysicalRDD > [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None > == Physical Plan == > NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) > ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, > DecimalType(5,0)), DecimalType(11,2))) * > change_decimal_precision(CAST(price#3, > DecimalType(11,2)2,mode=Final,isDistinct=false)) > TungstenSort [date#0 ASC,cat#1 ASC], false, 0 > ConvertToUnsafe >Exchange hashpartitioning(date#0,cat#1) > NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) > ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, > DecimalType(5,0)), DecimalType(11,2))) * > change_decimal_precision(CAST(price#3, > DecimalType(11,2)2,mode=Partial,isDistinct=false)) > TungstenSort [date#0 ASC,cat#1 ASC], false, 0 > ConvertToUnsafe >InMemoryColumnarTableScan [date#0,cat#1,count#2,price#3], > (InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, > StorageLevel(true, true, false, true, 1), (PhysicalRDD > [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None) > Code Generation: true > == RDD == > None > 15/08/04 23:21:53 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; > aborting job > Traceback (most recent call last): > File "t.py", line 34, in > r.show() > File "/Users/davies/work/spark/python/pyspark/sql/dataframe.py", line 258, > in show > print(self._jdf.showString(n, truncate)) > File "/Users/davies/work/spark/python/lib/py4j/java_gateway.py", line 538, > in __call__ > self.target_id, self.name) > File "/Users/davies/work/spark/python/pyspark/sql/utils.py", line 36, in > deco > return f(*a, **kw) > File "/Users/davies/work/spark/python/lib/py4j/protocol.py", line 300, in > get_return_value > format(target_id, '.', name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 > (TID 10, localhost): java.lang.UnsupportedOperationException: tail of empty > list > at scala.collection.immutable.N
[jira] [Created] (SPARK-10095) Should not use the private field of BigInteger
Davies Liu created SPARK-10095: -- Summary: Should not use the private field of BigInteger Key: SPARK-10095 URL: https://issues.apache.org/jira/browse/SPARK-10095 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Minor In UnsafeRow, we use the private field of BigInteger for better performance, but it actually didn't contribute much to end-to-end runtime, and make it not portable (may fail on other JVM implementations). So we should use the public API instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9644) Support update DecimalType with precision > 18 in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701974#comment-14701974 ] Davies Liu commented on SPARK-9644: --- In a benchmark of doing aggregation on a decimal column (with precision > 18), this optimization only contribute 3% to end-to-end runtime, so I will revert this optimization to make it portable. Thanks, [~robbinspg]! > Support update DecimalType with precision > 18 in UnsafeRow > --- > > Key: SPARK-9644 > URL: https://issues.apache.org/jira/browse/SPARK-9644 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.0 > > > Currently, we don't support using DecimalType with precision > 18 in new > unsafe aggregation, it's good to support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9644) Support update DecimalType with precision > 18 in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701891#comment-14701891 ] Davies Liu commented on SPARK-9644: --- [~robbinspg] Thanks for point this out, would is be possible to have different branch for them? The public API is very slow, we could have a fast path and slow path based on the OFFSET of signum and mag, or fast path for OpenJDK and slow path for others? > Support update DecimalType with precision > 18 in UnsafeRow > --- > > Key: SPARK-9644 > URL: https://issues.apache.org/jira/browse/SPARK-9644 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.0 > > > Currently, we don't support using DecimalType with precision > 18 in new > unsafe aggregation, it's good to support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10056) PySpark Row - Support for row["columnName"] syntax
[ https://issues.apache.org/jira/browse/SPARK-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701883#comment-14701883 ] Davies Liu commented on SPARK-10056: [~maver1ck], Yes, changes in Row class and unit test for this feature. Any docs would be better, but I'm not sure where is the best place to have it. > PySpark Row - Support for row["columnName"] syntax > -- > > Key: SPARK-10056 > URL: https://issues.apache.org/jira/browse/SPARK-10056 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Maciej Bryński >Priority: Minor > > Right now you can get Row element: > - by column name: row.columnName > - by index: row[index] (where index is integer) > My proposition is to add following syntax: > row["columnName"] > It will be solution similar to DataFrame behaviour and should be quite easy > to implement with __getitem__ method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10079) Make `column` and `col` functions be S4 functions
[ https://issues.apache.org/jira/browse/SPARK-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701877#comment-14701877 ] Davies Liu commented on SPARK-10079: There are two public col/column function in Scala and Python, which return an Column from colName, I think this one is useful, could be added into SparkR. > Make `column` and `col` functions be S4 functions > - > > Key: SPARK-10079 > URL: https://issues.apache.org/jira/browse/SPARK-10079 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yu Ishikawa > > {{column}} and {{col}} function at {{R/pkg/R/Column.R}} are currently defined > as S3 functions. I think it would be better to define them as S4 functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10090) After division, Decimal may have longer precision than expected
Davies Liu created SPARK-10090: -- Summary: After division, Decimal may have longer precision than expected Key: SPARK-10090 URL: https://issues.apache.org/jira/browse/SPARK-10090 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker In TPCDS Q59, the result should be DecimalType(37, 20), but got Decimal('0.69903637110664268591656984574863203607'), should be Decimal('0.69903637110664268592'). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5901) [PySpark] pickle classes in main module
[ https://issues.apache.org/jira/browse/SPARK-5901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-5901. - Resolution: Invalid Target Version/s: (was: 1.5.0) couldpickle does support to serialize class in __main__, but pickle does not support that. > [PySpark] pickle classes in main module > --- > > Key: SPARK-5901 > URL: https://issues.apache.org/jira/browse/SPARK-5901 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Davies Liu > > Currently, couldpickle does not support to serialize class object in main > module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10065) Avoid triple copy of var-length objects in Array in tungsten projection
Davies Liu created SPARK-10065: -- Summary: Avoid triple copy of var-length objects in Array in tungsten projection Key: SPARK-10065 URL: https://issues.apache.org/jira/browse/SPARK-10065 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu The first copy happens when we calculate the size of each element, after that, we copy the elements into array buffer, finally we copy the array buffer into row buffer. We could calculate the total size first, then convert the elements into row buffer directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9427) Add expression functions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700044#comment-14700044 ] Davies Liu commented on SPARK-9427: --- Should we target this for 1.6? > Add expression functions in SparkR > -- > > Key: SPARK-9427 > URL: https://issues.apache.org/jira/browse/SPARK-9427 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yu Ishikawa > > The list of functions to add is based on SQL's functions. And it would be > better to add them in one shot PR. > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10038) TungstenProject code generation fails when applied to array
[ https://issues.apache.org/jira/browse/SPARK-10038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-10038: -- Assignee: Davies Liu > TungstenProject code generation fails when applied to array > --- > > Key: SPARK-10038 > URL: https://issues.apache.org/jira/browse/SPARK-10038 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Josh Rosen >Assignee: Davies Liu >Priority: Blocker > > During fuzz testing, I discovered that TungstenProject can crash when applied > to schemas that contain {{array}} columns. As a minimal example, the > following code crashes in spark-shell: > {code} > sc.parallelize(Seq((Array(Array[Byte](1)), 1))).toDF.select("_1").rdd.count() > {code} > Here's the stacktrace: > {code} > 15/08/16 17:11:49 ERROR Executor: Exception in task 3.0 in stage 29.0 (TID > 144) > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: Line 53, Column 63: > '{' expected instead of '[' > public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] > exprs) { > return new SpecificUnsafeProjection(exprs); > } > class SpecificUnsafeProjection extends > org.apache.spark.sql.catalyst.expressions.UnsafeProjection { > private org.apache.spark.sql.catalyst.expressions.Expression[] expressions; > private UnsafeRow convertedStruct2; > private byte[] buffer3; > private int cursor4; > private UnsafeArrayData convertedArray6; > private byte[] buffer7; > public > SpecificUnsafeProjection(org.apache.spark.sql.catalyst.expressions.Expression[] > expressions) { > this.expressions = expressions; > this.convertedStruct2 = new UnsafeRow(); > this.buffer3 = new byte[16]; > this.cursor4 = 0; > convertedArray6 = new UnsafeArrayData(); > buffer7 = new byte[64]; > } > // Scala.Function1 need this > public Object apply(Object row) { > return apply((InternalRow) row); > } > public UnsafeRow apply(InternalRow i) { > cursor4 = 16; > convertedStruct2.pointTo(buffer3, Platform.BYTE_ARRAY_OFFSET, 1, cursor4); > /* input[0, ArrayType(BinaryType,true)] */ > boolean isNull0 = i.isNullAt(0); > ArrayData primitive1 = isNull0 ? null : (i.getArray(0)); > final boolean isNull8 = isNull0; > if (!isNull8) { > final ArrayData tmp9 = primitive1; > if (tmp9 instanceof UnsafeArrayData) { > convertedArray6 = (UnsafeArrayData) tmp9; > } else { > final int numElements10 = tmp9.numElements(); > final int fixedSize11 = 4 * numElements10; > int numBytes12 = fixedSize11; > final byte[][] elements13 = new byte[][numElements10]; > for (int index15 = 0; index15 < numElements10; index15++) { > if (!tmp9.isNullAt(index15)) { > elements13[index15] = tmp9.getBinary(index15); > numBytes12 += > org.apache.spark.sql.catalyst.expressions.UnsafeWriters$BinaryWriter.getSize(elements13[index15]); > } > } > if (numBytes12 > buffer7.length) { > buffer7 = new byte[numBytes12]; > } > int cursor14 = fixedSize11; > for (int index15 = 0; index15 < numElements10; index15++) { > if (elements13[index15] == null) { > // If element is null, write the negative value address into > offset region. > Platform.putInt(buffer7, Platform.BYTE_ARRAY_OFFSET + 4 * > index15, -cursor14); > } else { > Platform.putInt(buffer7, Platform.BYTE_ARRAY_OFFSET + 4 * > index15, cursor14); > cursor14 += > org.apache.spark.sql.catalyst.expressions.UnsafeWriters$BinaryWriter.write( > buffer7, > Platform.BYTE_ARRAY_OFFSET + cursor14, > elements13[index15]); > } > } > convertedArray6.pointTo( > buffer7, > Platform.BYTE_ARRAY_OFFSET, > numElements10, > numBytes12); > } > } > int numBytes16 = cursor4 + (isNull8 ? 0 : > org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$ArrayWriter.getSize(convertedArray6)); > if (buffer3.length < numBytes16) { > // This will not happen frequently, because the buffer is re-used. > byte[] tmpBuffer5 = new byte[numBytes16 * 2]; > Platform.copyMemory(buffer3, Platform.BYTE_ARRAY_OFFSET, > tmpBuffer5, Platform.BYTE_ARRAY_OFFSET, buffer3.length); > buffer3 = tmpBuffer5; > } > convertedStruct2.pointTo(buffer3, Platform.BYTE_ARRAY_OFFSET, 1, > numBytes16); > if (isNull8) { > convertedStruct2.setNullAt(0); > } else { > cursor4 += > org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$Arr
[jira] [Commented] (SPARK-10056) PySpark Row - Support for row["columnName"] syntax
[ https://issues.apache.org/jira/browse/SPARK-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700028#comment-14700028 ] Davies Liu commented on SPARK-10056: We support accessing nested column by `df['a.b']`(it's the behavior came from Scala), should we also support that(I don't think so). [~maver1ck] It will be great if you could send out a PR for this. > PySpark Row - Support for row["columnName"] syntax > -- > > Key: SPARK-10056 > URL: https://issues.apache.org/jira/browse/SPARK-10056 > Project: Spark > Issue Type: Improvement >Reporter: Maciej Bryński > > Right now you can get Row element: > - by column name: row.columnName > - by index: row[index] (where index is integer) > My proposition is to add following syntax: > row["columnName"] > It will be solution similar to DataFrame behaviour and should be quite easy > to implement with __getitem__ method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9705) outdated Python 3 and IPython information
[ https://issues.apache.org/jira/browse/SPARK-9705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699983#comment-14699983 ] Davies Liu commented on SPARK-9705: --- The PyLab thing is already fixed by https://github.com/apache/spark/pull/5111, I will send out a PR to fix others. > outdated Python 3 and IPython information > - > > Key: SPARK-9705 > URL: https://issues.apache.org/jira/browse/SPARK-9705 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Piotr Migdał >Priority: Blocker > Labels: documentation > Original Estimate: 0.25h > Remaining Estimate: 0.25h > > https://issues.apache.org/jira/browse/SPARK-4897 adds Python 3.4 support to > 1.4.0 and above, but the official docs (1.4.1, but the same is for 1.4.0) > says explicitly: > "Spark 1.4.1 works with Python 2.6 or higher (but not Python 3)." > Affected: > https://spark.apache.org/docs/1.4.0/programming-guide.html > https://spark.apache.org/docs/1.4.1/programming-guide.html > There are some other Python-related things, which are outdated, e.g. this > line: > "For example, to launch the IPython Notebook with PyLab plot support:" > (At least since IPython 3.0 PyLab/Matplotlib support happens inside a > notebook; and the line "--pylab inline" is already removed.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9705) outdated Python 3 and IPython information
[ https://issues.apache.org/jira/browse/SPARK-9705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699973#comment-14699973 ] Davies Liu commented on SPARK-9705: --- As the exception said, PySpark cannot run with different minor versions, you need to use the same minor version of Python, for example, use both Python 3.4 on driver and workers. Be default, PySpark use the default `python` found in PATH, if you have different default python in driver and worker, then you need to specify PYSPARK_PYTHON to tell pyspark which version of Python you want to use. For example: {code} PYSPARK_PYTHON=the_path_of_python_3.4 bin/spark-submit xxx {code} > outdated Python 3 and IPython information > - > > Key: SPARK-9705 > URL: https://issues.apache.org/jira/browse/SPARK-9705 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Piotr Migdał >Priority: Blocker > Labels: documentation > Original Estimate: 0.25h > Remaining Estimate: 0.25h > > https://issues.apache.org/jira/browse/SPARK-4897 adds Python 3.4 support to > 1.4.0 and above, but the official docs (1.4.1, but the same is for 1.4.0) > says explicitly: > "Spark 1.4.1 works with Python 2.6 or higher (but not Python 3)." > Affected: > https://spark.apache.org/docs/1.4.0/programming-guide.html > https://spark.apache.org/docs/1.4.1/programming-guide.html > There are some other Python-related things, which are outdated, e.g. this > line: > "For example, to launch the IPython Notebook with PyLab plot support:" > (At least since IPython 3.0 PyLab/Matplotlib support happens inside a > notebook; and the line "--pylab inline" is already removed.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9822) Update doc about supported Python versions
[ https://issues.apache.org/jira/browse/SPARK-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-9822. - Resolution: Duplicate > Update doc about supported Python versions > -- > > Key: SPARK-9822 > URL: https://issues.apache.org/jira/browse/SPARK-9822 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Davies Liu >Assignee: Davies Liu > > Not sure if here's the right place to post this, but the documentation on the > official website appears to be outdated. For example, for spark 1.4.0 and > 1.4.1 this paragraph (python tab) seems particularly misleading. Also, the > last line of this paragraph doesn't mention python 3 support. Maybe there are > other places. > https://github.com/apache/spark/pull/5173#issuecomment-129903372 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10059) Broken test: YarnClusterSuite
Davies Liu created SPARK-10059: -- Summary: Broken test: YarnClusterSuite Key: SPARK-10059 URL: https://issues.apache.org/jira/browse/SPARK-10059 Project: Spark Issue Type: Test Reporter: Davies Liu Priority: Critical This test failed everytime: https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/116/testReport/junit/org.apache.spark.deploy.yarn/YarnClusterSuite/_It_is_not_a_test_/history/ {code} Error Message java.io.IOException: ResourceManager failed to start. Final state is STOPPED Stacktrace sbt.ForkMain$ForkError: java.io.IOException: ResourceManager failed to start. Final state is STOPPED at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:302) at org.apache.hadoop.yarn.server.MiniYARNCluster.access$500(MiniYARNCluster.java:87) at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:422) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:104) at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) at org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:46) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) at org.apache.spark.deploy.yarn.YarnClusterSuite.run(YarnClusterSuite.scala:46) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) at sbt.ForkMain$Run$2.call(ForkMain.java:294) at sbt.ForkMain$Run$2.call(ForkMain.java:284) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: sbt.ForkMain$ForkError: ResourceManager failed to start. Final state is STOPPED at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:297) ... 18 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10058) Flaky test: HeartbeatReceiverSuite: normal heartbeat
Davies Liu created SPARK-10058: -- Summary: Flaky test: HeartbeatReceiverSuite: normal heartbeat Key: SPARK-10058 URL: https://issues.apache.org/jira/browse/SPARK-10058 Project: Spark Issue Type: Test Components: Spark Core Reporter: Davies Liu Assignee: Andrew Or https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/ {code} Error Message 3 did not equal 2 Stacktrace sbt.ForkMain$ForkError: 3 did not equal 2 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply$mcV$sp(HeartbeatReceiverSuite.scala:104) at org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply(HeartbeatReceiverSuite.scala:97) at org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply(HeartbeatReceiverSuite.scala:97) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.HeartbeatReceiverSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HeartbeatReceiverSuite.scala:41) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.HeartbeatReceiverSuite.runTest(HeartbeatReceiverSuite.scala:41) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.HeartbeatReceiverSuite.org$scalatest$BeforeAndAfterAll$$super$run(HeartbeatReceiverSuite.scala:41) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) at org.apache.spark.HeartbeatReceiverSuite.run(HeartbeatReceiverSuite.scala:41) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) at sbt.ForkMain$Run$2.call(ForkMain.java:294) at sbt.ForkMain$Run$2.call(ForkMain.java:284) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} This is really flaky (fail 30%) https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/Heartbe
[jira] [Commented] (SPARK-9982) SparkR DataFrame fail to return data of Decimal type
[ https://issues.apache.org/jira/browse/SPARK-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699861#comment-14699861 ] Davies Liu commented on SPARK-9982: --- R has numeric (similar to double), no decimal, is it ok to turn all the DecimalType into Double in SparkR ? cc [~shivavam] > SparkR DataFrame fail to return data of Decimal type > > > Key: SPARK-9982 > URL: https://issues.apache.org/jira/browse/SPARK-9982 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.4.1 >Reporter: Alex Shkurenko > > Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but > with the Decimal datatype coming from a Postgres DB: > //Set up SparkR > >Sys.setenv(SPARK_HOME="/Users/ashkurenko/work/git_repos/spark") > >Sys.setenv(SPARKR_SUBMIT_ARGS="--driver-class-path > >~/Downloads/postgresql-9.4-1201.jdbc4.jar sparkr-shell") > >.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) > >library(SparkR) > >sc <- sparkR.init(master="local") > // Connect to a Postgres DB via JDBC > >sqlContext <- sparkRSQL.init(sc) > >sql(sqlContext, " > CREATE TEMPORARY TABLE mytable > USING org.apache.spark.sql.jdbc > OPTIONS (url 'jdbc:postgresql://servername:5432/dbname' > ,dbtable 'mydbtable' > ) > ") > // Try pulling a Decimal column from a table > >myDataFrame <- sql(sqlContext,("select a_decimal_column from mytable ")) > // The schema shows up fine > >show(myDataFrame) > DataFrame[a_decimal_column:decimal(10,0)] > >schema(myDataFrame) > StructType > |-name = "a_decimal_column", type = "DecimalType(10,0)", nullable = TRUE > // ... but pulling data fails: > localDF <- collect(myDataFrame) > Error in as.data.frame.default(x[[i]], optional = TRUE) : > cannot coerce class ""jobj"" to a data.frame -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9982) SparkR DataFrame fail to return data of Decimal type
[ https://issues.apache.org/jira/browse/SPARK-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699861#comment-14699861 ] Davies Liu edited comment on SPARK-9982 at 8/17/15 5:21 PM: R has numeric (similar to double), no decimal, is it ok to turn all the DecimalType into Double in SparkR ? cc @Shivaram Venkataraman was (Author: davies): R has numeric (similar to double), no decimal, is it ok to turn all the DecimalType into Double in SparkR ? cc [~shivavam] > SparkR DataFrame fail to return data of Decimal type > > > Key: SPARK-9982 > URL: https://issues.apache.org/jira/browse/SPARK-9982 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.4.1 >Reporter: Alex Shkurenko > > Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but > with the Decimal datatype coming from a Postgres DB: > //Set up SparkR > >Sys.setenv(SPARK_HOME="/Users/ashkurenko/work/git_repos/spark") > >Sys.setenv(SPARKR_SUBMIT_ARGS="--driver-class-path > >~/Downloads/postgresql-9.4-1201.jdbc4.jar sparkr-shell") > >.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) > >library(SparkR) > >sc <- sparkR.init(master="local") > // Connect to a Postgres DB via JDBC > >sqlContext <- sparkRSQL.init(sc) > >sql(sqlContext, " > CREATE TEMPORARY TABLE mytable > USING org.apache.spark.sql.jdbc > OPTIONS (url 'jdbc:postgresql://servername:5432/dbname' > ,dbtable 'mydbtable' > ) > ") > // Try pulling a Decimal column from a table > >myDataFrame <- sql(sqlContext,("select a_decimal_column from mytable ")) > // The schema shows up fine > >show(myDataFrame) > DataFrame[a_decimal_column:decimal(10,0)] > >schema(myDataFrame) > StructType > |-name = "a_decimal_column", type = "DecimalType(10,0)", nullable = TRUE > // ... but pulling data fails: > localDF <- collect(myDataFrame) > Error in as.data.frame.default(x[[i]], optional = TRUE) : > cannot coerce class ""jobj"" to a data.frame -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9427) Add expression functions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699843#comment-14699843 ] Davies Liu commented on SPARK-9427: --- [~yu_ishikawa] `rand` does work in PySpark (Python 2.7): {code} >>> sqlContext.range(10).select(rand(2), "id").show() +---+---+ | rand()| id| +---+---+ | 0.6038577325006693| 0| | 0.6319470735268434| 1| |0.22327628846133507| 2| |0.24223739932588373| 3| | 0.8395518879513995| 4| | 0.5662927043813443| 5| | 0.2057736041310516| 6| | 0.3408245196642603| 7| |0.08641290347537589| 8| |0.46561147527615276| 9| +---+---+ {code} > Add expression functions in SparkR > -- > > Key: SPARK-9427 > URL: https://issues.apache.org/jira/browse/SPARK-9427 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yu Ishikawa > > The list of functions to add is based on SQL's functions. And it would be > better to add them in one shot PR. > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10057) Faill to load class org.slf4j.impl.StaticLoggerBinder
Davies Liu created SPARK-10057: -- Summary: Faill to load class org.slf4j.impl.StaticLoggerBinder Key: SPARK-10057 URL: https://issues.apache.org/jira/browse/SPARK-10057 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: Davies Liu Some loggings are dropped, because it can't load class "org.slf4j.impl.StaticLoggerBinder" {code} SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9734) java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC
[ https://issues.apache.org/jira/browse/SPARK-9734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698121#comment-14698121 ] Davies Liu commented on SPARK-9734: --- [~rmullapudi] It's fixed by https://github.com/apache/spark/pull/8219 > java.lang.IllegalArgumentException: Don't know how to save > StructField(sal,DecimalType(7,2),true) to JDBC > - > > Key: SPARK-9734 > URL: https://issues.apache.org/jira/browse/SPARK-9734 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Greg Rahn >Assignee: Davies Liu > Fix For: 1.5.0 > > > When using a basic example of reading the EMP table from Redshift via > spark-redshift, and writing the data back to Redshift, Spark fails with the > below error, related to Numeric/Decimal data types. > Redshift table: > {code} > testdb=# \d emp > Table "public.emp" > Column | Type | Modifiers > --+---+--- > empno| integer | > ename| character varying(10) | > job | character varying(9) | > mgr | integer | > hiredate | date | > sal | numeric(7,2) | > comm | numeric(7,2) | > deptno | integer | > testdb=# select * from emp; > empno | ename |job| mgr | hiredate | sal | comm | deptno > ---++---+--++-+-+ > 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.00 |NULL | 20 > 7521 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250.00 | 500.00 | 30 > 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30 > 7782 | CLARK | MANAGER | 7839 | 1981-06-09 | 2450.00 |NULL | 10 > 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.00 |NULL | 10 > 7876 | ADAMS | CLERK | 7788 | 1983-01-12 | 1100.00 |NULL | 20 > 7902 | FORD | ANALYST | 7566 | 1981-12-03 | 3000.00 |NULL | 20 > 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 | 1600.00 | 300.00 | 30 > 7566 | JONES | MANAGER | 7839 | 1981-04-02 | 2975.00 |NULL | 20 > 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 | 2850.00 |NULL | 30 > 7788 | SCOTT | ANALYST | 7566 | 1982-12-09 | 3000.00 |NULL | 20 > 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 | 1500.00 |0.00 | 30 > 7900 | JAMES | CLERK | 7698 | 1981-12-03 | 950.00 |NULL | 30 > 7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 |NULL | 10 > (14 rows) > {code} > Spark Code: > {code} > val url = "jdbc:redshift://rshost:5439/testdb?user=xxx&password=xxx" > val driver = "com.amazon.redshift.jdbc41.Driver" > val t = > sqlContext.read.format("com.databricks.spark.redshift").option("jdbcdriver", > driver).option("url", url).option("dbtable", "emp").option("tempdir", > "s3n://spark-temp-dir").load() > t.registerTempTable("SparkTempTable") > val t1 = sqlContext.sql("select * from SparkTempTable") > t1.write.format("com.databricks.spark.redshift").option("driver", > driver).option("url", url).option("dbtable", "t1").option("tempdir", > "s3n://spark-temp-dir").option("avrocompression", > "snappy").mode("error").save() > {code} > Error Stack: > {code} > java.lang.IllegalArgumentException: Don't know how to save > StructField(sal,DecimalType(7,2),true) to JDBC > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:149) > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:136) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:135) > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:132) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.schemaString(jdbc.scala:132) > at > org.apache.spark.sql.jdbc.JDBCWrapper.schemaString(RedshiftJDBCWrapper.scala:28) > at > com.databricks.spark.redshift.RedshiftWriter.createTableSql(RedshiftWriter.scala:39) > at > com.databricks.spark.redshift.RedshiftWriter.doRedshiftLoad(RedshiftWriter.scala:105) > at > com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:145) > at > com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:92) > at org.apache.spark.sql.sources.ResolvedD
[jira] [Resolved] (SPARK-9725) spark sql query string field return empty/garbled string
[ https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9725. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8210 [https://github.com/apache/spark/pull/8210] > spark sql query string field return empty/garbled string > > > Key: SPARK-9725 > URL: https://issues.apache.org/jira/browse/SPARK-9725 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Fei Wang >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.5.0 > > > to reproduce it: > 1 deploy spark cluster mode, i use standalone mode locally > 2 set executor memory >= 32g, set following config in spark-default.xml >spark.executor.memory36g > 3 run spark-sql.sh with "show tables" it return empty/garbled string -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9734) java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC
[ https://issues.apache.org/jira/browse/SPARK-9734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697998#comment-14697998 ] Davies Liu commented on SPARK-9734: --- [~rmullapudi] Good catch! I will sending out a quick fix! > java.lang.IllegalArgumentException: Don't know how to save > StructField(sal,DecimalType(7,2),true) to JDBC > - > > Key: SPARK-9734 > URL: https://issues.apache.org/jira/browse/SPARK-9734 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Greg Rahn >Assignee: Davies Liu > Fix For: 1.5.0 > > > When using a basic example of reading the EMP table from Redshift via > spark-redshift, and writing the data back to Redshift, Spark fails with the > below error, related to Numeric/Decimal data types. > Redshift table: > {code} > testdb=# \d emp > Table "public.emp" > Column | Type | Modifiers > --+---+--- > empno| integer | > ename| character varying(10) | > job | character varying(9) | > mgr | integer | > hiredate | date | > sal | numeric(7,2) | > comm | numeric(7,2) | > deptno | integer | > testdb=# select * from emp; > empno | ename |job| mgr | hiredate | sal | comm | deptno > ---++---+--++-+-+ > 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.00 |NULL | 20 > 7521 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250.00 | 500.00 | 30 > 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30 > 7782 | CLARK | MANAGER | 7839 | 1981-06-09 | 2450.00 |NULL | 10 > 7839 | KING | PRESIDENT | NULL | 1981-11-17 | 5000.00 |NULL | 10 > 7876 | ADAMS | CLERK | 7788 | 1983-01-12 | 1100.00 |NULL | 20 > 7902 | FORD | ANALYST | 7566 | 1981-12-03 | 3000.00 |NULL | 20 > 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 | 1600.00 | 300.00 | 30 > 7566 | JONES | MANAGER | 7839 | 1981-04-02 | 2975.00 |NULL | 20 > 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 | 2850.00 |NULL | 30 > 7788 | SCOTT | ANALYST | 7566 | 1982-12-09 | 3000.00 |NULL | 20 > 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 | 1500.00 |0.00 | 30 > 7900 | JAMES | CLERK | 7698 | 1981-12-03 | 950.00 |NULL | 30 > 7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 |NULL | 10 > (14 rows) > {code} > Spark Code: > {code} > val url = "jdbc:redshift://rshost:5439/testdb?user=xxx&password=xxx" > val driver = "com.amazon.redshift.jdbc41.Driver" > val t = > sqlContext.read.format("com.databricks.spark.redshift").option("jdbcdriver", > driver).option("url", url).option("dbtable", "emp").option("tempdir", > "s3n://spark-temp-dir").load() > t.registerTempTable("SparkTempTable") > val t1 = sqlContext.sql("select * from SparkTempTable") > t1.write.format("com.databricks.spark.redshift").option("driver", > driver).option("url", url).option("dbtable", "t1").option("tempdir", > "s3n://spark-temp-dir").option("avrocompression", > "snappy").mode("error").save() > {code} > Error Stack: > {code} > java.lang.IllegalArgumentException: Don't know how to save > StructField(sal,DecimalType(7,2),true) to JDBC > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:149) > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:136) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:135) > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:132) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.schemaString(jdbc.scala:132) > at > org.apache.spark.sql.jdbc.JDBCWrapper.schemaString(RedshiftJDBCWrapper.scala:28) > at > com.databricks.spark.redshift.RedshiftWriter.createTableSql(RedshiftWriter.scala:39) > at > com.databricks.spark.redshift.RedshiftWriter.doRedshiftLoad(RedshiftWriter.scala:105) > at > com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:145) > at > com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:92) > at org.apache.spark.sql.sources.ResolvedDataSource$.a
[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN
[ https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697868#comment-14697868 ] Davies Liu commented on SPARK-9971: --- We had a long discussion about how to support NaN, but didn't find a good one from other databases to follow (they all handle them in different way from each other), then we came out a plan, see https://issues.apache.org/jira/browse/SPARK-9079. For 4 (in NaN in aggregation), we have not decided yet. The current behavior is not consistent across functions {code} >>> sf = sqlContext.createDataFrame([(1.0,), (float('nan'),)], ['f']) >>> sf.selectExpr('min(f)', 'max(f)', 'sum(f)', 'avg(f)', 'count(f)').show() +---+---+---+---+-+ |'min(f)|'max(f)|'sum(f)|'avg(f)|'count(f)| +---+---+---+---+-+ |1.0|NaN|NaN|NaN|2| +---+---+---+---+-+ {code} cc [~yhuai] [~rxin] [~joshrosen] > MaxFunction not working correctly with columns containing Double.NaN > > > Key: SPARK-9971 > URL: https://issues.apache.org/jira/browse/SPARK-9971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Frank Rosner >Priority: Minor > > h4. Problem Description > When using the {{max}} function on a {{DoubleType}} column that contains > {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. > This is because it compares all values with the running maximum. However, {{x > < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > > Double.NaN}}. > h4. How to Reproduce > {code} > import org.apache.spark.sql.{SQLContext, Row} > import org.apache.spark.sql.types._ > val sql = new SQLContext(sc) > val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d))) > val dataFrame = sql.createDataFrame(rdd, StructType(List( > StructField("col", DoubleType, false) > ))) > dataFrame.select(max("col")).first > // returns org.apache.spark.sql.Row = [NaN] > {code} > h4. Solution > The {{max}} and {{min}} functions should ignore NaN values, as they are not > numbers. If a column contains only NaN values, then the maximum and minimum > is not defined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9978) Window functions require partitionBy to work as expected
[ https://issues.apache.org/jira/browse/SPARK-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-9978: - Assignee: Davies Liu > Window functions require partitionBy to work as expected > > > Key: SPARK-9978 > URL: https://issues.apache.org/jira/browse/SPARK-9978 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Maciej Szymkiewicz >Assignee: Davies Liu > > I am trying to reproduce following SQL query: > {code} > df.registerTempTable("df") > sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM > df").show() > ++--+ > | x|rn| > ++--+ > |0.25| 1| > | 0.5| 2| > |0.75| 3| > ++--+ > {code} > using PySpark API. Unfortunately it doesn't work as expected: > {code} > from pyspark.sql.window import Window > from pyspark.sql.functions import rowNumber > df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}]) > df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show() > ++--+ > | x|rn| > ++--+ > | 0.5| 1| > |0.25| 1| > |0.75| 1| > ++--+ > {code} > As a workaround It is possible to call partitionBy without additional > arguments: > {code} > df.select( > df["x"], > rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn") > ).show() > ++--+ > | x|rn| > ++--+ > |0.25| 1| > | 0.5| 2| > |0.75| 3| > ++--+ > {code} > but as far as I can tell it is not documented and is rather counterintuitive > considering SQL syntax. Moreover this problem doesn't affect Scala API: > {code} > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.functions.rowNumber > case class Record(x: Double) > val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: > Record(0.75)) > df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show > ++--+ > | x|rn| > ++--+ > |0.25| 1| > | 0.5| 2| > |0.75| 3| > ++--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9589) Flaky test: HiveCompatibilitySuite.groupby8
[ https://issues.apache.org/jira/browse/SPARK-9589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9589. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8177 [https://github.com/apache/spark/pull/8177] > Flaky test: HiveCompatibilitySuite.groupby8 > --- > > Key: SPARK-9589 > URL: https://issues.apache.org/jira/browse/SPARK-9589 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.5.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39662/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/groupby8/ > {code} > sbt.ForkMain$ForkError: > Failed to execute query using catalyst: > Error: Job aborted due to stage failure: Task 24 in stage 3081.0 failed 1 > times, most recent failure: Lost task 24.0 in stage 3081.0 (TID 14919, > localhost): java.lang.NullPointerException > at > org.apache.spark.unsafe.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:226) > at > org.apache.spark.unsafe.map.BytesToBytesMap$Location.updateAddressesAndSizes(BytesToBytesMap.java:366) > at > org.apache.spark.unsafe.map.BytesToBytesMap$Location.putNewKey(BytesToBytesMap.java:600) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.getAggregationBuffer(UnsafeFixedWidthAggregationMap.java:134) > at > org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.initialize(UnsafeHybridAggregationIterator.scala:276) > at > org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.(UnsafeHybridAggregationIterator.scala:290) > at > org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator$.createFromInputIterator(UnsafeHybridAggregationIterator.scala:358) > at > org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:130) > at > org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:121) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9946) NPE in TaskMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9946. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8177 [https://github.com/apache/spark/pull/8177] > NPE in TaskMemoryManager > > > Key: SPARK-9946 > URL: https://issues.apache.org/jira/browse/SPARK-9946 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.5.0 > > > {code} > Failed to execute query using catalyst: > [info] Error: Job aborted due to stage failure: Task 6 in stage 6801.0 > failed 1 times, most recent failure: Lost task 6.0 in stage 6801.0 (TID > 41123, localhost): java.lang.NullPointerException > [info]at > org.apache.spark.unsafe.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:226) > [info]at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:165) > [info]at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:142) > [info]at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:129) > [info]at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84) > [info]at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:302) > [info]at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:218) > [info]at > org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:110) > [info]at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > [info]at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > [info]at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > [info]at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:99) > [info]at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > [info]at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > [info]at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > [info]at org.apache.spark.scheduler.Task.run(Task.scala:88) > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > 13:27:17.435 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 23.0 > in stage 6801.0 (TID 41140, localhost): TaskKilled (killed intentionally) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > 13:27:17.436 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 12.0 > in stage 6801.0 (TID 41129, localhost): TaskKilled (killed intentionally) > [info]at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9589) Flaky test: HiveCompatibilitySuite.groupby8
[ https://issues.apache.org/jira/browse/SPARK-9589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-9589: - Assignee: Davies Liu (was: Josh Rosen) > Flaky test: HiveCompatibilitySuite.groupby8 > --- > > Key: SPARK-9589 > URL: https://issues.apache.org/jira/browse/SPARK-9589 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39662/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/groupby8/ > {code} > sbt.ForkMain$ForkError: > Failed to execute query using catalyst: > Error: Job aborted due to stage failure: Task 24 in stage 3081.0 failed 1 > times, most recent failure: Lost task 24.0 in stage 3081.0 (TID 14919, > localhost): java.lang.NullPointerException > at > org.apache.spark.unsafe.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:226) > at > org.apache.spark.unsafe.map.BytesToBytesMap$Location.updateAddressesAndSizes(BytesToBytesMap.java:366) > at > org.apache.spark.unsafe.map.BytesToBytesMap$Location.putNewKey(BytesToBytesMap.java:600) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.getAggregationBuffer(UnsafeFixedWidthAggregationMap.java:134) > at > org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.initialize(UnsafeHybridAggregationIterator.scala:276) > at > org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.(UnsafeHybridAggregationIterator.scala:290) > at > org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator$.createFromInputIterator(UnsafeHybridAggregationIterator.scala:358) > at > org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:130) > at > org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:121) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9725) spark sql query string field return empty/garbled string
[ https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697426#comment-14697426 ] Davies Liu edited comment on SPARK-9725 at 8/14/15 5:36 PM: I can reproduce this issue now (<32G for executor, > 32G for driver). Many things are broken: {code} scala> sqlContext.sql("select 'abc'").collect() res0: Array[org.apache.spark.sql.Row] = Array([???]) scala> sqlContext.range(10).registerTempTable("t") scala> sqlContext.sql("select 'abc',id from t").collect() res2: Array[org.apache.spark.sql.Row] = Array([???,0], [???,1], [???,2], [???,3], [???,4], [???,5], [???,6], [???,7], [???,8], [???,9]) {code} was (Author: davies): I can reproduce this issue now (<32G for executor, > 32G for driver). > spark sql query string field return empty/garbled string > > > Key: SPARK-9725 > URL: https://issues.apache.org/jira/browse/SPARK-9725 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Fei Wang >Assignee: Davies Liu >Priority: Blocker > > to reproduce it: > 1 deploy spark cluster mode, i use standalone mode locally > 2 set executor memory >= 32g, set following config in spark-default.xml >spark.executor.memory36g > 3 run spark-sql.sh with "show tables" it return empty/garbled string -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org