[jira] [Created] (SPARK-10542) The PySpark 1.5 closure serializer can't serialize a namedtuple instance.

2015-09-10 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10542:
--

 Summary: The  PySpark 1.5 closure serializer can't serialize a 
namedtuple instance.
 Key: SPARK-10542
 URL: https://issues.apache.org/jira/browse/SPARK-10542
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical


Code to Reproduce Bug:
{code}
from collections import namedtuple
PowerPlantRow=namedtuple("PowerPlantRow", ["AT", "V", "AP", "RH", "PE"])
rdd=sc.parallelize([1]).map(lambda x: PowerPlantRow(1.0, 2.0, 3.0, 4.0, 5.0))
rdd.count()
{code}

Error message on Spark 1.5:
{code}
AttributeError: 'builtin_function_or_method' object has no attribute '__code__'
---
AttributeError                            Traceback (most recent call last)
 in ()
  2 PowerPlantRow=namedtuple("PowerPlantRow", ["AT", "V", "AP", "RH", "PE"])
  3 rdd=sc.parallelize([1]).map(lambda x: PowerPlantRow(1.0, 2.0, 3.0, 4.0, 
5.0))
> 4 rdd.count()

/home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in count(self)
   1004 3
   1005 """
-> 1006 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1007 
   1008 def stats(self):

/home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in sum(self)
995 6.0
996 """
--> 997 return self.mapPartitions(lambda x: [sum(x)]).fold(0, 
operator.add)
998 
999 def count(self):

/home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in fold(self, zeroValue, 
op)
869 # zeroValue provided to each partition is unique from the one 
provided
870 # to the final reduce call
--> 871 vals = self.mapPartitions(func).collect()
872 return reduce(op, vals, zeroValue)
873 

/home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in collect(self)
771 """
772 with SCCallSiteSync(self.context) as css:
--> 773 port = 
self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
774 return list(_load_from_socket(port, self._jrdd_deserializer))
775 

/home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in _jrdd(self)
   2383 command = (self.func, profiler, self._prev_jrdd_deserializer,
   2384self._jrdd_deserializer)
-> 2385 pickled_cmd, bvars, env, includes = 
_prepare_for_python_RDD(self.ctx, command, self)
   2386 python_rdd = self.ctx._jvm.PythonRDD(self._prev_jrdd.rdd(),
   2387  bytearray(pickled_cmd),

/home/ubuntu/databricks/spark/python/pyspark/rdd.pyc in 
_prepare_for_python_RDD(sc, command, obj)
   2303 # the serialized command will be compressed by broadcast
   2304 ser = CloudPickleSerializer()
-> 2305 pickled_command = ser.dumps(command)
   2306 if len(pickled_command) > (1 << 20):  # 1M
   2307 # The broadcast will have same life cycle as created PythonRDD

/home/ubuntu/databricks/spark/python/pyspark/serializers.pyc in dumps(self, obj)
425 
426 def dumps(self, obj):
--> 427 return cloudpickle.dumps(obj, 2)
428 
429 

/home/ubuntu/databricks/spark/python/pyspark/cloudpickle.pyc in dumps(obj, 
protocol)
639 
640 cp = CloudPickler(file,protocol)
--> 641 cp.dump(obj)
642 
643 return file.getvalue()

/home/ubuntu/databricks/spark/python/pyspark/cloudpickle.pyc in dump(self, obj)
105 self.inject_addons()
106 try:
--> 107 return Pickler.dump(self, obj)
108 except RuntimeError as e:
109 if 'recursion' in e.args[0]:

/usr/lib/python2.7/pickle.pyc in dump(self, obj)
222 if self.proto >= 2:
223 self.write(PROTO + chr(self.proto))
--> 224 self.save(obj)
225 self.write(STOP)
226 

/usr/lib/python2.7/pickle.pyc in save(self, obj)
284 f = self.dispatch.get(t)
285 if f:
--> 286 f(self, obj) # Call unbound method with explicit self
287 return
288 

/usr/lib/python2.7/pickle.pyc in save_tuple(self, obj)
560 write(MARK)
561 for element in obj:
--> 562 save(element)
563 
564 if id(obj) in memo:

/usr/lib/python2.7/pickle.pyc in save(self, obj)
284 f = self.dispatch.get(t)
285 if f:
--> 286 f(self, obj) # Call unbound method with explicit self
287 return
288 
... skipped 23125 bytes ...
650 
651 dispatch[DictionaryType] = save_dict

/usr/lib/python2.7/pickle.pyc in _batch_setitems(self, items)
684 k, v = tmp[0]
685 save(k)
--> 686 save(v)
687 write(SETITEM)
688 # else t
{code}
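
A possible workaround until the closure serializer is fixed (a minimal sketch, 
reusing the {{sc}} from the snippet above and assuming plain tuples are 
acceptable for the distributed part of the job): keep the values returned from 
the closure as plain tuples, so the pickled closure never references the 
namedtuple class, and rebuild the namedtuple only on the driver.

{code}
from collections import namedtuple

PowerPlantRow = namedtuple("PowerPlantRow", ["AT", "V", "AP", "RH", "PE"])

# The lambda below does not capture PowerPlantRow, so the closure serializer
# only has to pickle a function that returns a plain tuple.
rdd = sc.parallelize([1]).map(lambda x: (1.0, 2.0, 3.0, 4.0, 5.0))

# Rebuild the namedtuples on the driver, where no closure pickling is involved.
rows = [PowerPlantRow(*t) for t in rdd.collect()]
{code}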

[jira] [Resolved] (SPARK-6931) python: struct.pack('!q', value) in write_long(value, stream) in serializers.py require int(but doesn't raise exceptions in common cases)

2015-09-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-6931.
---
   Resolution: Fixed
Fix Version/s: 1.2.3
   1.3.2

Issue resolved by pull request 8594
[https://github.com/apache/spark/pull/8594]

> python: struct.pack('!q', value) in write_long(value, stream) in 
> serializers.py require int(but doesn't raise exceptions in common cases)
> -
>
> Key: SPARK-6931
> URL: https://issues.apache.org/jira/browse/SPARK-6931
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Chunxi Zhang
>Priority: Critical
>  Labels: easyfix
> Fix For: 1.3.2, 1.2.3
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When I map my own feature calculation module's function, Spark raises:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/1.3.0/libexec/python/pyspark/daemon.py", line 
> 162, in manager
> code = worker(sock)
>   File 
> "/usr/local/Cellar/apache-spark/1.3.0/libexec/python/pyspark/daemon.py", line 
> 60, in worker
> worker_main(infile, outfile)
>   File 
> "/usr/local/Cellar/apache-spark/1.3.0/libexec/python/pyspark/worker.py", line 
> 115, in main
> report_times(outfile, boot_time, init_time, finish_time)
>   File 
> "/usr/local/Cellar/apache-spark/1.3.0/libexec/python/pyspark/worker.py", line 
> 40, in report_times
> write_long(1000 * boot, outfile)
>   File 
> "/usr/local/Cellar/apache-spark/1.3.0/libexec/python/pyspark/serializers.py", 
> line 518, in write_long
> stream.write(struct.pack("!q", value))
> DeprecationWarning: integer argument expected, got float
> So I opened serializers.py and tried to print the value out; it is a float 
> that came from 1000 * time.time().
> When I removed my lib, or added an rdd.count() before mapping my lib, this bug 
> doesn't appear.
> so I edited the function to :
> def write_long(value, stream):
> stream.write(struct.pack("!q", int(value))) # added a int(value)
> everything seems fine…
> According to Python's doc for struct 
> (https://docs.python.org/2/library/struct.html), Note (3), the value should be 
> an int (for q); if it's a float, it will try __index__() first, then __int__, 
> but since using __int__ here is deprecated, it raises a DeprecationWarning. A 
> float doesn't have __index__ but does have __int__, so it should raise the 
> exception every time.
> But, as you can see, in normal cases it won't raise the exception and the code 
> works perfectly, and running struct.pack('!q', 111.1) in a console or a clean 
> file won't raise any exception… I can hardly tell how my lib might affect a 
> time.time() value passed to struct.pack()... it might be a bug in Python itself.
> Anyway, this value should be an int, so add an int() to it.
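
For reference, a standalone sketch of the proposed coercion (illustrative only, 
not the actual Spark source): once the value is truncated with int(), 
struct.pack("!q", ...) accepts it even when the caller passed a float such as 
1000 * time.time().

{code}
import io
import struct
import time

def write_long(value, stream):
    # struct.pack("!q", ...) wants an integer; coerce floats
    # (e.g. from 1000 * time.time()) before packing.
    stream.write(struct.pack("!q", int(value)))

buf = io.BytesIO()
write_long(1000 * time.time(), buf)
assert len(buf.getvalue()) == 8  # "!q" always packs to 8 bytes
{code}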






[jira] [Resolved] (SPARK-10065) Avoid triple copy of var-length objects in Array in tungsten projection

2015-09-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10065.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8496
[https://github.com/apache/spark/pull/8496]

> Avoid triple copy of var-length objects in Array in tungsten projection
> ---
>
> Key: SPARK-10065
> URL: https://issues.apache.org/jira/browse/SPARK-10065
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
> Fix For: 1.6.0
>
>
> The first copy happens when we calculate the size of each element; after that, 
> we copy the elements into the array buffer; finally, we copy the array buffer 
> into the row buffer. 
> We could calculate the total size first, then convert the elements into the 
> row buffer directly.






[jira] [Resolved] (SPARK-9730) Sort Merge Join for Full Outer Join

2015-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9730.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8579
[https://github.com/apache/spark/pull/8579]

> Sort Merge Join for Full Outer Join
> ---
>
> Key: SPARK-9730
> URL: https://issues.apache.org/jira/browse/SPARK-9730
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Josh Rosen
> Fix For: 1.6.0
>
>







[jira] [Created] (SPARK-10522) Nanoseconds part of Timestamp should be positive in parquet

2015-09-09 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10522:
--

 Summary: Nanoseconds part of Timestamp should be positive in 
parquet
 Key: SPARK-10522
 URL: https://issues.apache.org/jira/browse/SPARK-10522
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Davies Liu


If a Timestamp is before the Unix epoch, its nanosecond part will be negative, 
and Hive can't read that back correctly.
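
A minimal sketch of the intended normalization (illustrative only, assuming the 
timestamp is represented as microseconds since the epoch): floor division keeps 
the sub-second part non-negative by borrowing from the seconds part.

{code}
def split_timestamp(us):
    """Split microseconds-since-epoch into (seconds, nanoseconds) such that
    the nanosecond part is always in [0, 1e9), even before the epoch."""
    seconds, micros = divmod(us, 1000000)  # Python's divmod floors, so micros >= 0
    return seconds, micros * 1000

# One microsecond before the epoch: the nanosecond part stays positive.
assert split_timestamp(-1) == (-1, 999999000)
{code}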






[jira] [Commented] (SPARK-10439) Catalyst should check for overflow / underflow of date and timestamp values

2015-09-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737644#comment-14737644
 ] 

Davies Liu commented on SPARK-10439:


There are many places where overflow could happen, even for A + B, so I don't 
think it's a big deal.

If we really want to handle these gracefully, the bound checks should be 
performed on the way in, turning values into null on overflow rather than 
crashing (raising an exception).

> Catalyst should check for overflow / underflow of date and timestamp values
> ---
>
> Key: SPARK-10439
> URL: https://issues.apache.org/jira/browse/SPARK-10439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> While testing some code, I noticed that a few methods in {{DateTimeUtils}} 
> are prone to overflow and underflow.
> For example, {{millisToDays}} can overflow the return type ({{Int}}) if a 
> large enough input value is provided.
> Similarly, {{fromJavaTimestamp}} converts milliseconds to microseconds, which 
> can overflow if the input is {{> Long.MAX_VALUE / 1000}} (or underflow in the 
> negative case).
> There might be others but these were the ones that caught my eye.
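
To illustrate the kind of bound check being discussed (a sketch in plain Python, 
not Catalyst code; the null-on-overflow behavior follows the comment above): 
since Python integers do not overflow, the range test can be written directly 
against the 64-bit limits.

{code}
LONG_MAX = 2 ** 63 - 1
LONG_MIN = -(2 ** 63)

def millis_to_micros_or_null(ms):
    """Convert milliseconds to microseconds, returning None (i.e. SQL null)
    instead of silently overflowing a signed 64-bit long."""
    micros = ms * 1000
    if micros > LONG_MAX or micros < LONG_MIN:
        return None
    return micros

assert millis_to_micros_or_null(123) == 123000
assert millis_to_micros_or_null(LONG_MAX // 1000 + 1) is None
{code}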






[jira] [Updated] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10474:
---
Target Version/s: 1.6.0, 1.5.1
Priority: Blocker  (was: Critical)

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In an aggregation case, a lost task failed with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.






[jira] [Commented] (SPARK-10519) Investigate if we should encode timezone information to a timestamp value stored in JSON

2015-09-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737375#comment-14737375
 ] 

Davies Liu commented on SPARK-10519:


+1 for option 3: users have the ability to control the timezone, and it also 
keeps compatibility. 

> Investigate if we should encode timezone information to a timestamp value 
> stored in JSON
> 
>
> Key: SPARK-10519
> URL: https://issues.apache.org/jira/browse/SPARK-10519
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> Since Spark 1.3, we store a timestamp in JSON without encoding the timezone 
> information and the string representation of a timestamp stored in JSON 
> implicitly using the local timezone (see 
> [1|https://github.com/apache/spark/blob/branch-1.3/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L454],
>  
> [2|https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/json/JacksonGenerator.scala#L38],
>  
> [3|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L41],
>  
> [4|https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L93]).
> This behavior may cause data consumers to get different values when they 
> are in a different timezone from the data producers.
> Since JSON is string based, if we encode timezone information to timestamp 
> value, downstream applications may need to change their code (for example, 
> java.sql.Timestamp.valueOf only supports the format of {{-\[m]m-\[d]d 
> hh:mm:ss\[.f...]}}).
> We should investigate what we should do about this issue. Right now, I can 
> think of three options:
> 1. Encoding timezone info in the timestamp value, which can break user code 
> and may change the semantic of timestamp (our timestamp value is 
> timezone-less).
> 2. When saving a timestamp value to JSON, we treat this value as a value in 
> the local timezone and convert it to UTC time. Then, when saving the data, we 
> do not encode timezone info in the value.
> 3. We do not change our current behavior. But, in our doc, we explicitly say 
> that users need to use a single timezone for their datasets (e.g. always use 
> UTC time). 
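
To make the ambiguity in the current behavior concrete, a small sketch in plain 
Python (not Spark's JSON writer; the format string is only assumed to match the 
timezone-less representation described above):

{code}
from datetime import datetime

# The writer formats the timestamp with no zone designator.
written = datetime(2015, 9, 9, 12, 0, 0).strftime("%Y-%m-%d %H:%M:%S")

# The reader gets a naive datetime back; which instant it denotes depends on
# the reader's local timezone, so a consumer in another timezone interprets
# the same JSON string as a different point in time.
parsed = datetime.strptime(written, "%Y-%m-%d %H:%M:%S")
print(written)        # 2015-09-09 12:00:00  (no timezone information)
print(parsed.tzinfo)  # None -> interpreted in whatever the local timezone is
{code}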






[jira] [Resolved] (SPARK-10461) make sure `input.primitive` is always variable name not code at GenerateUnsafeProjection

2015-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10461.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8613
[https://github.com/apache/spark/pull/8613]

> make sure `input.primitive` is always variable name not code at 
> GenerateUnsafeProjection
> 
>
> Key: SPARK-10461
> URL: https://issues.apache.org/jira/browse/SPARK-10461
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Minor
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-09-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737185#comment-14737185
 ] 

Davies Liu commented on SPARK-10309:


[~nadenf] Thanks for letting us know; I just realized that your stacktrace 
already includes that fix.

Maybe there are multiple joins/aggregations/sorts in your query? You can show 
the physical plan with `df.explain()`.
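
For example (a minimal sketch, assuming {{df}} is the DataFrame from the failing 
query):

{code}
df.explain()      # print the physical plan
df.explain(True)  # print both the logical and physical plans
{code}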

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.






[jira] [Comment Edited] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-09-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737185#comment-14737185
 ] 

Davies Liu edited comment on SPARK-10309 at 9/9/15 4:53 PM:


[~nadenf] Thanks for letting us know; I just realized that your stacktrace 
already includes that fix.

Maybe there are multiple joins/aggregations/sorts in your query? You can show 
the physical plan with `df.explain()`.


was (Author: davies):
[~nadenf] Thanks for letting us know, just realized that your stacktrace 
already including that fix.

Maybe there are multiple join/aggregation/sort in your query? You can show the 
physical plan by `df.eplain()` 

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.






[jira] [Closed] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-10512.
--
Resolution: Won't Fix

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have a docstring, 
> @since didn't go well. It seems that {{__doc__}} is {{None}} under the 
> {{since}} decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}
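
For context, a simplified, hypothetical sketch of a {{since}}-style decorator 
(not the actual pyspark code) showing why a missing docstring breaks it; the 
guard shown is the kind of workaround that was rejected here in favor of adding 
docstrings to the public API.

{code}
import re

def since(version):
    """Hypothetical, simplified stand-in for PySpark's `since` decorator."""
    indent_p = re.compile(r"\n( +)")

    def deco(f):
        # If f has no docstring, f.__doc__ is None and findall(None) raises
        # "TypeError: expected string or buffer" -- the failure shown above.
        doc = f.__doc__ or ""
        indents = indent_p.findall(doc)
        indent = min(indents) if indents else ""
        f.__doc__ = doc + "\n%s.. versionadded:: %s" % (indent, version)
        return f

    return deco

@since("1.3.1")
def predict(user, product):
    pass  # no docstring: works here only because of the `or ""` guard above
{code}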






[jira] [Commented] (SPARK-10512) Fix @since when a function doesn't have doc

2015-09-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736961#comment-14736961
 ] 

Davies Liu commented on SPARK-10512:


As we discussed here 
https://github.com/apache/spark/pull/8657#discussion_r38992400, we should add 
docs for those public APIs instead of putting a workaround in @since. 

> Fix @since when a function doesn't have doc
> ---
>
> Key: SPARK-10512
> URL: https://issues.apache.org/jira/browse/SPARK-10512
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Yu Ishikawa
>
> When I tried to add @since to a function which doesn't have a docstring, 
> @since didn't go well. It seems that {{__doc__}} is {{None}} under the 
> {{since}} decorator.
> {noformat}
> Traceback (most recent call last):
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 122, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/Users/01004981/.pyenv/versions/2.6.8/lib/python2.6/runpy.py", line 
> 34, in _run_code
> exec code in run_globals
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 46, in 
> class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, 
> JavaLoader):
>   File 
> "/Users/01004981/local/src/spark/myspark3/python/pyspark/mllib/recommendation.py",
>  line 166, in MatrixFactorizationModel
> @since("1.3.1")
>   File "/Users/01004981/local/src/spark/myspark3/python/pyspark/__init__.py", 
> line 63, in deco
> indents = indent_p.findall(f.__doc__)
> TypeError: expected string or buffer
> {noformat}






[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-09-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735908#comment-14735908
 ] 

Davies Liu commented on SPARK-10309:


This also could be related to 
https://issues.apache.org/jira/browse/SPARK-10341?filter=-2, could you test 
with 1.5-RC3?

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.






[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-09-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735903#comment-14735903
 ] 

Davies Liu commented on SPARK-10309:


[~nadenf] Could you post the physical plan here? That could help us to 
understand the root cause.

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.






[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735776#comment-14735776
 ] 

Davies Liu commented on SPARK-10466:


[~chenghao] I tried your test case and it passed on master. Is there anything 
else I need to reproduce the failure?

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Blocker
>
> In sort-based shuffle, if we have a data spill, it will cause an assert 
> exception; the following code can reproduce that:
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.sched

[jira] [Created] (SPARK-10494) Multiple Python UDFs together with aggregation or sort merge join may cause OOM (failed to acquire memory)

2015-09-08 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10494:
--

 Summary: Multiple Python UDFs together with aggregation or sort 
merge join may cause OOM (failed to acquire memory)
 Key: SPARK-10494
 URL: https://issues.apache.org/jira/browse/SPARK-10494
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.5.0
Reporter: Davies Liu
Priority: Critical


The RDD cache for Python UDFs was removed in 1.4, so N Python UDFs in one query 
stage can end up evaluating the upstream (SparkPlan) 2^N times.

In 1.5, if there is an aggregation or sort merge join in the upstream SparkPlan, 
this will cause OOM (failed to acquire memory).
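
A minimal sketch of the kind of query that hits this (assuming an existing 
{{sqlContext}}; the UDF bodies and names are placeholders): each additional 
Python UDF in the same stage doubles the number of times the upstream plan is 
evaluated.

{code}
from pyspark.sql.functions import udf, sum as sum_
from pyspark.sql.types import DoubleType

f1 = udf(lambda x: x + 1.0, DoubleType())
f2 = udf(lambda x: x * 2.0, DoubleType())
f3 = udf(lambda x: x - 3.0, DoubleType())

df = sqlContext.range(1000).selectExpr("id % 10 AS key", "cast(id AS double) AS v")
agg = df.groupBy("key").agg(sum_("v").alias("s"))

# Three Python UDFs over the aggregated output: without the RDD cache the
# upstream aggregation can be evaluated 2^3 = 8 times, which is what leads to
# "failed to acquire memory" at scale.
out = agg.select(f1("s"), f2("s"), f3("s"))
out.show()
{code}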






[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735644#comment-14735644
 ] 

Davies Liu commented on SPARK-8632:
---

The upstream means the child of the current SparkPlan, which could contain other 
Python UDFs.

We removed the RDD cache in 1.4, so the upstream is evaluated twice. If you have 
multiple Python UDFs, for example three, it ends up evaluating the child 8 times 
(2 x 2 x 2), which will be really slow or cause OOM.

In synchronous batch mode, what's the batch size? If it's small, the overhead of 
each batch will be high; if it's too large, it's easy to OOM when you have many 
columns. Also, we need to copy the rows (serialization is not needed if it's 
UnsafeRow).



> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html
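
To make the quoted scenario concrete, a small sketch (hypothetical column names 
and sizes, assuming an existing {{sqlContext}}): a Python UDF is applied to a 
single column, yet the BatchPythonEvaluation strategy described above caches 
every column of the child plan.

{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

normalize = udf(lambda x: x / 100.0, DoubleType())

# A wide table: only "c0" is needed by the Python UDF, but the cached child
# RDD described above holds all of the columns.
wide = sqlContext.range(10000).selectExpr(
    *["cast(id AS double) AS c%d" % i for i in range(50)])

result = wide.withColumn("c0_norm", normalize("c0"))
result.count()
{code}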






[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-09-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735401#comment-14735401
 ] 

Davies Liu commented on SPARK-10309:


[~nadenf] In my case, the job finally finished (after a retry), so this seems to 
be a blocker for me.

Could you provide more information about your job?

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.






[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735392#comment-14735392
 ] 

Davies Liu commented on SPARK-8632:
---

[~rxin] As [~justin.uang] suggested before, the batch mode would need to flush 
the rows at every point in the pipeline, or it gets a deadlock.

I think the goal is to call the upstream once and improve the throughput of the 
Python UDF (which is usually the bottleneck). The batch mode increases the 
overhead of the Python UDF (per batch), causing worse performance. The problem 
with the older cache is the serialization and memory management overhead (and it 
was not purged after use). With a one-time (purged after visited) tungsten cache 
(and spilling), the overhead should not be that high, so I think this should be 
the most performant and stable approach.

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html






[jira] [Assigned] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-8632:
-

Assignee: Davies Liu

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html






[jira] [Created] (SPARK-10459) PythonUDF could process UnsafeRow

2015-09-04 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10459:
--

 Summary: PythonUDF could process UnsafeRow
 Key: SPARK-10459
 URL: https://issues.apache.org/jira/browse/SPARK-10459
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


Currently there is a ConvertToSafe step for PythonUDF, which isn't actually 
needed.






[jira] [Updated] (SPARK-10436) spark-submit overwrites spark.files defaults with the job script filename

2015-09-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10436:
---
Target Version/s: 1.6.0

> spark-submit overwrites spark.files defaults with the job script filename
> -
>
> Key: SPARK-10436
> URL: https://issues.apache.org/jira/browse/SPARK-10436
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.4.0
> Environment: Ubuntu, Spark 1.4.0 Standalone
>Reporter: axel dahl
>Priority: Minor
>  Labels: easyfix, feature
>
> In my spark-defaults.conf I have configured a set of libraries to be 
> uploaded to my Spark 1.4.0 Standalone cluster.  The entry appears as:
> spark.files  libarary.zip,file1.py,file2.py
> When I execute spark-submit -v test.py
> I see that spark-submit reads the defaults correctly, but that it overwrites 
> the "spark.files" default entry and replaces it with the name if the job 
> script, i.e. "test.py".
> This behavior doesn't seem intuitive.  test.py, should be added to the spark 
> working folder, but it should not overwrite the "spark.files" defaults.






[jira] [Updated] (SPARK-10434) Parquet compatibility with 1.4 is broken when writing arrays that may contain nulls

2015-09-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10434:
---
Priority: Minor  (was: Critical)

> Parquet compatibility with 1.4 is broken when writing arrays that may contain 
> nulls
> ---
>
> Key: SPARK-10434
> URL: https://issues.apache.org/jira/browse/SPARK-10434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> When writing arrays that may contain nulls, for example:
> {noformat}
> StructType(
>   StructField(
> "f",
> ArrayType(IntegerType, containsNull = true),
> nullable = false))
> {noformat}
> Spark 1.4 uses the following schema:
> {noformat}
> message m {
>   required group f (LIST) {
> repeated group bag {
>   optional int32 array;
> }
>   }
> }
> {noformat}
> This behavior is a hybrid of parquet-avro and parquet-hive: the 3-level 
> structure and repeated group name "bag" are borrowed from parquet-hive, while 
> the innermost element field name "array" is borrowed from parquet-avro.
> However, in Spark 1.5, I failed to notice the latter fact and used a schema 
> in purely parquet-hive flavor, namely:
> {noformat}
> message m {
>   required group f (LIST) {
> repeated group bag {
>   optional int32 array_element;
> }
>   }
> }
> {noformat}
> One of the direct consequences is that Parquet files containing such array 
> fields written by Spark 1.5 can't be read by Spark 1.4 (all array elements 
> become null).
> To fix this issue, the name of the innermost field should be changed back to 
> "array".  Notice that this fix doesn't affect interoperability with Hive 
> (saving Parquet files using {{saveAsTable()}} and then read them using Hive).
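
For reference, a minimal PySpark sketch (hypothetical path, assuming an existing 
{{sqlContext}}) that produces the kind of schema discussed above, i.e. an array 
field that may contain nulls written to Parquet:

{code}
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

schema = StructType([
    StructField("f", ArrayType(IntegerType(), containsNull=True), nullable=False)])

df = sqlContext.createDataFrame([([1, None, 3],)], schema)

# Files written here by Spark 1.5 use "array_element" as the innermost field
# name; Spark 1.4 expects "array", which is why it reads the elements as null.
df.write.parquet("/tmp/nullable_array_parquet")
{code}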






[jira] [Commented] (SPARK-10434) Parquet compatibility with 1.4 is broken when writing arrays that may contain nulls

2015-09-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729445#comment-14729445
 ] 

Davies Liu commented on SPARK-10434:


[~lian cheng] I think it's hard to guarantee forward compatibility (an older 
version reading data generated by a newer version); do we really need this 
change?

> Parquet compatibility with 1.4 is broken when writing arrays that may contain 
> nulls
> ---
>
> Key: SPARK-10434
> URL: https://issues.apache.org/jira/browse/SPARK-10434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> When writing arrays that may contain nulls, for example:
> {noformat}
> StructType(
>   StructField(
> "f",
> ArrayType(IntegerType, containsNull = true),
> nullable = false))
> {noformat}
> Spark 1.4 uses the following schema:
> {noformat}
> message m {
>   required group f (LIST) {
> repeated group bag {
>   optional int32 array;
> }
>   }
> }
> {noformat}
> This behavior is a hybrid of parquet-avro and parquet-hive: the 3-level 
> structure and repeated group name "bag" are borrowed from parquet-hive, while 
> the innermost element field name "array" is borrowed from parquet-avro.
> However, in Spark 1.5, I failed to notice the latter fact and used a schema 
> in purely parquet-hive flavor, namely:
> {noformat}
> message m {
>   required group f (LIST) {
> repeated group bag {
>   optional int32 array_element;
> }
>   }
> }
> {noformat}
> One direct consequence is that Parquet files containing such array fields 
> written by Spark 1.5 can't be read by Spark 1.4 (all array elements become 
> null).
> To fix this issue, the name of the innermost field should be changed back to 
> "array".  Notice that this fix doesn't affect interoperability with Hive 
> (saving Parquet files using {{saveAsTable()}} and then reading them using Hive).






[jira] [Commented] (SPARK-10425) Add a regression test for SPARK-10379

2015-09-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729421#comment-14729421
 ] 

Davies Liu commented on SPARK-10425:


[~sowen] Thanks for your comment. The reason that PR didn't have a regression 
test is that I couldn't reproduce the failure after many tries (with different 
scales and workloads). It obviously still needs to be done, since the failure 
also showed up in a user's workload.

This JIRA was requested by a reviewer of that PR; I'm fine with closing it. cc 
[~andrewor14]

> Add a regression test for SPARK-10379
> -
>
> Key: SPARK-10425
> URL: https://issues.apache.org/jira/browse/SPARK-10425
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Priority: Minor
>







[jira] [Resolved] (SPARK-10422) String column in InMemoryColumnarCache needs to override clone method

2015-09-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10422.

   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8578
[https://github.com/apache/spark/pull/8578]

> String column in InMemoryColumnarCache needs to override clone method
> -
>
> Key: SPARK-10422
> URL: https://issues.apache.org/jira/browse/SPARK-10422
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.5.0
>
>
> We have a clone method in {{ColumnType}} 
> (https://github.com/apache/spark/blob/v1.5.0-rc3/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala#L103).
>  Seems we need to override it for String 
> (https://github.com/apache/spark/blob/v1.5.0-rc3/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala#L314)
>  because we are dealing with UTF8String.
> {code}
> val df =
>   ctx.range(1, 3).selectExpr("id % 500 as id").rdd.map(id => 
> Tuple1(s"str_$id")).toDF("i")
> val cached = df.cache()
> cached.count()
> [info] - SPARK-10422: String column in InMemoryColumnarCache needs to 
> override clone method *** FAILED *** (9 seconds, 152 milliseconds)
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in 
> stage 0.0 (TID 1, localhost): java.util.NoSuchElementException: key not 
> found: str_[0]
> [info]at scala.collection.MapLike$class.default(MapLike.scala:228)
> [info]at scala.collection.AbstractMap.default(Map.scala:58)
> [info]at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> [info]at 
> org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
> [info]at 
> org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
> [info]at 
> org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
> [info]at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
> [info]at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
> [info]at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> [info]at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> [info]at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> [info]at 
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> [info]at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> [info]at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> [info]at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152)
> [info]at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
> {code}
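
For illustration only, a toy Python sketch (not the Scala fix to ColumnType) of the failure mode above: if a cached value is really a view over a buffer that the column builder keeps reusing, the cached entry changes underneath the cache, which is why the string column type needs to clone (copy) the bytes before caching.

{code}
# Toy sketch: caching a view over a reused buffer instead of a copy.
buf = bytearray(b"str_0")
cached = [memoryview(buf)]   # stores a reference into the shared buffer
buf[4:5] = b"1"              # the builder reuses the same buffer for the next value
print(bytes(cached[0]))      # prints b'str_1': the cached entry silently changed
# Cloning before caching avoids this: cached = [bytes(buf)]
{code}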






[jira] [Created] (SPARK-10425) Add a regression test for SPARK-10379

2015-09-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10425:
--

 Summary: Add a regression test for SPARK-10379
 Key: SPARK-10425
 URL: https://issues.apache.org/jira/browse/SPARK-10425
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu









[jira] [Updated] (SPARK-10424) ShuffleHashOuterJoin should consider condition

2015-09-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10424:
---
Priority: Blocker  (was: Major)

> ShuffleHashOuterJoin should consider condition
> --
>
> Key: SPARK-10424
> URL: https://issues.apache.org/jira/browse/SPARK-10424
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Priority: Blocker
>
> Currently, ShuffleHashOuterJoin does not consider condition






[jira] [Created] (SPARK-10424) ShuffleHashOuterJoin should consider condition

2015-09-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10424:
--

 Summary: ShuffleHashOuterJoin should consider condition
 Key: SPARK-10424
 URL: https://issues.apache.org/jira/browse/SPARK-10424
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu


Currently, ShuffleHashOuterJoin does not consider condition
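
For illustration, a toy sketch in plain Python (not Spark's operator) of what "considering the condition" means for a hash-based outer join: a streamed row only matches when the extra, non-equi part of the condition also holds, and otherwise it is still emitted with nulls on the other side.

{code}
def hash_left_outer_join(left, right, key, condition):
    """Toy left outer join that also evaluates an extra join condition."""
    table = {}
    for r in right:
        table.setdefault(key(r), []).append(r)
    for l in left:
        matches = [r for r in table.get(key(l), []) if condition(l, r)]
        if matches:
            for r in matches:
                yield (l, r)
        else:
            yield (l, None)  # outer side keeps the row, other side becomes null

rows = list(hash_left_outer_join(
    left=[(1, 10), (2, 20)],
    right=[(1, 5), (1, 50)],
    key=lambda row: row[0],
    condition=lambda l, r: r[1] > l[1],  # the non-equi part of the condition
))
print(rows)  # [((1, 10), (1, 50)), ((2, 20), None)]
{code}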






[jira] [Resolved] (SPARK-10417) Iterating through Column results in infinite loop

2015-09-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10417.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8574
[https://github.com/apache/spark/pull/8574]

> Iterating through Column results in infinite loop
> -
>
> Key: SPARK-10417
> URL: https://issues.apache.org/jira/browse/SPARK-10417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Zoltan Toth
>Priority: Minor
> Fix For: 1.6.0
>
>
> Iterating through a _Column_ object results in an infinite loop.
> {code}
> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
> for i in df["name"]: print i
> {code}
> Result:
> {code}
> Column
> Column
> Column
> Column
> ...
> {code}
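
For context, a minimal sketch of the kind of guard that ends the loop (a simplified Column-like class, not necessarily the exact change in pull request 8574): because the real Column supports item access and never raises IndexError, Python's __getitem__-based iteration fallback runs forever, so defining __iter__ to raise makes the failure explicit.

{code}
class Column(object):
    """Simplified stand-in for a Column-like wrapper (illustration only)."""
    def __init__(self, expr):
        self.expr = expr

    def __getitem__(self, key):
        # Item access returns another Column and never raises IndexError,
        # which is what made the old __getitem__-based iteration loop forever.
        return Column("%s[%s]" % (self.expr, key))

    def __iter__(self):
        # Failing fast turns the infinite loop into a clear error.
        raise TypeError("Column is not iterable")

# for x in Column("name"): ...  # now raises TypeError instead of looping
{code}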






[jira] [Updated] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10392:
---
Fix Version/s: 1.5.1

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
> Fix For: 1.6.0, 1.5.1
>
>
> I have the following problem.
> I created a table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I try to read the data: the date '1970-01-01' is converted to an int. 
> This makes the data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}
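
On the affected versions, one possible workaround sketch (fix_date is a hypothetical helper, and it assumes the JDBC reader returned plain "days since 1970-01-01" integers, as the Row(id=1, date=0) output above suggests) is to map the integers back to datetime.date before re-applying the schema:

{code}
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def fix_date(days):
    # hypothetical helper: turn "days since epoch" ints back into date objects
    return EPOCH + timedelta(days=days) if days is not None else None

fixed = df.rdd.map(lambda row: (row.id, fix_date(row.date)))
df2 = sqlCtx.createDataFrame(fixed, df.schema)
{code}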






[jira] [Resolved] (SPARK-10392) Pyspark - Wrong DateType support on JDBC connection

2015-09-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10392.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8556
[https://github.com/apache/spark/pull/8556]

> Pyspark - Wrong DateType support on JDBC connection
> ---
>
> Key: SPARK-10392
> URL: https://issues.apache.org/jira/browse/SPARK-10392
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Bryński
> Fix For: 1.6.0
>
>
> I have the following problem.
> I created a table.
> {code}
> CREATE TABLE `spark_test` (
>   `id` INT(11) NULL,
>   `date` DATE NULL
> )
> COLLATE='utf8_general_ci'
> ENGINE=InnoDB
> ;
> INSERT INTO `spark_test` (`id`, `date`) VALUES (1, '1970-01-01');
> {code}
> Then I try to read the data: the date '1970-01-01' is converted to an int. 
> This makes the data frame incompatible with its own schema.
> {code}
> df = 
> sqlCtx.read.jdbc("jdbc:mysql://host/sandbox?user=user&password=password", 
> 'spark_test')
> print(df.collect())
> df = sqlCtx.createDataFrame(df.rdd, df.schema)
> [Row(id=1, date=0)]
> ---
> TypeError Traceback (most recent call last)
>  in ()
>   1 df = 
> sqlCtx.read.jdbc("jdbc:mysql://a2.adpilot.co/sandbox?user=mbrynski&password=CebO3ax4",
>  'spark_test')
>   2 print(df.collect())
> > 3 df = sqlCtx.createDataFrame(df.rdd, df.schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio)
> 402 
> 403 if isinstance(data, RDD):
> --> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> 406 rdd, schema = self._createFromLocal(data, schema)
> /mnt/spark/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, 
> schema, samplingRatio)
> 296 rows = rdd.take(10)
> 297 for row in rows:
> --> 298 _verify_type(row, schema)
> 299 
> 300 else:
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1152  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
>1153 for v, f in zip(obj, dataType.fields):
> -> 1154 _verify_type(v, f.dataType)
>1155 
>1156 
> /mnt/spark/spark/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>1136 # subclass of them can not be fromInternald in JVM
>1137 if type(obj) not in _acceptable_types[_type]:
> -> 1138 raise TypeError("%s can not accept object in type %s" % 
> (dataType, type(obj)))
>1139 
>1140 if isinstance(dataType, ArrayType):
> TypeError: DateType can not accept object in type 
> {code}






[jira] [Resolved] (SPARK-10162) PySpark filters with datetimes mess up when datetimes have timezones.

2015-09-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10162.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8555
[https://github.com/apache/spark/pull/8555]

> PySpark filters with datetimes mess up when datetimes have timezones.
> -
>
> Key: SPARK-10162
> URL: https://issues.apache.org/jira/browse/SPARK-10162
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Kevin Cox
> Fix For: 1.6.0
>
>
> PySpark appears to ignore timezone information when filtering on (and working 
> in general with) datetimes.
> Please see the example below. The generated filter in the query plan is 5 
> hours off (my computer is EST).
> {code}
> In [1]: df = sc.sql.createDataFrame([], StructType([StructField("dt", 
> TimestampType())]))
> In [2]: df.filter(df.dt > datetime(2000, 01, 01, tzinfo=UTC)).explain()
> Filter (dt#9 > 9467028)
>  Scan PhysicalRDD[dt#9]
> {code}
> Note that 9467028 == Sat  1 Jan 2000 05:00:00 UTC
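
Until this is fixed, a possible workaround sketch (it assumes the conversion silently uses the driver's local time zone, and that UTC is the same tzinfo object used in the snippet above) is to turn the aware datetime into a naive local-time value before building the filter:

{code}
import calendar
from datetime import datetime

def to_naive_local(dt_aware):
    # seconds since the Unix epoch of the timezone-aware datetime ...
    epoch_seconds = calendar.timegm(dt_aware.utctimetuple())
    # ... re-expressed as a naive datetime in the driver's local time zone
    return datetime.fromtimestamp(epoch_seconds)

df.filter(df.dt > to_naive_local(datetime(2000, 1, 1, tzinfo=UTC))).explain()
{code}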






[jira] [Created] (SPARK-10404) Worker should terminate previous executor before launch new one

2015-09-01 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10404:
--

 Summary: Worker should terminate previous executor before launch 
new one
 Key: SPARK-10404
 URL: https://issues.apache.org/jira/browse/SPARK-10404
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


Reported here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hung-spark-executors-don-t-count-toward-worker-memory-limit-td16083.html#a24548

If a newly launched executor overlaps with previous ones, the machine could run 
out of memory.






[jira] [Updated] (SPARK-10379) UnsafeShuffleExternalSorter should preserve first page

2015-09-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10379:
---
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.0)

> UnsafeShuffleExternalSorter should preserve first page
> --
>
> Key: SPARK-10379
> URL: https://issues.apache.org/jira/browse/SPARK-10379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> 5/08/31 18:41:25 WARN TaskSetManager: Lost task 16.1 in stage 316.0 (TID 
> 32686, lon4-hadoopslave-b925.lon4.spotify.net): java.io.IOException: Unable 
> to acquire 67108864 bytes of memory
> at 
> org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.acquireNewPageIfNecessary(UnsafeShuffleExternalSorter.java:385)
> at 
> org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.insertRecord(UnsafeShuffleExternalSorter.java:435)
> at 
> org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
> at 
> org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:174)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Created] (SPARK-10403) UnsafeRowSerializer can't work with UnsafeShuffleManager (tungsten-sort)

2015-09-01 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10403:
--

 Summary: UnsafeRowSerializer can't work with UnsafeShuffleManager 
(tungsten-sort)
 Key: SPARK-10403
 URL: https://issues.apache.org/jira/browse/SPARK-10403
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Davies Liu


UnsafeRowSerializer relies on EOF in the stream, but UnsafeRowWriter will not 
write an EOF marker between partitions.

{code}
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:122)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:174)
at 
org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
{code}
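
To illustrate the framing mismatch in the abstract (a toy sketch of the idea only, not the actual serializer code): a length-prefixed stream either needs an explicit end-of-stream marker or a reader that treats EOF as the terminator, and the writer and reader have to agree on which.

{code}
import io
import struct

END_OF_STREAM = -1  # explicit sentinel, so the reader never depends on hitting EOF

def write_partition(out, records):
    for r in records:
        out.write(struct.pack(">i", len(r)))  # length-prefixed record
        out.write(r)
    out.write(struct.pack(">i", END_OF_STREAM))  # close the partition explicitly

def read_partition(inp):
    while True:
        (n,) = struct.unpack(">i", inp.read(4))
        if n == END_OF_STREAM:
            return
        yield inp.read(n)

buf = io.BytesIO()
write_partition(buf, [b"row1", b"row22"])
buf.seek(0)
print(list(read_partition(buf)))  # [b'row1', b'row22']
{code}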






[jira] [Created] (SPARK-10379) UnsafeShuffleExternalSorter should preserve first page

2015-08-31 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10379:
--

 Summary: UnsafeShuffleExternalSorter should preserve first page
 Key: SPARK-10379
 URL: https://issues.apache.org/jira/browse/SPARK-10379
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical



{code}

5/08/31 18:41:25 WARN TaskSetManager: Lost task 16.1 in stage 316.0 (TID 32686, 
lon4-hadoopslave-b925.lon4.spotify.net): java.io.IOException: Unable to acquire 
67108864 bytes of memory
at 
org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.acquireNewPageIfNecessary(UnsafeShuffleExternalSorter.java:385)
at 
org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.insertRecord(UnsafeShuffleExternalSorter.java:435)
at 
org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
at 
org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:174)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

{code}






[jira] [Commented] (SPARK-10373) Move @since annotator to pyspark to be shared by all components

2015-08-31 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724009#comment-14724009
 ] 

Davies Liu commented on SPARK-10373:


[~mengxr] Do we want to add @since for the MLlib APIs in the 1.5 docs?

> Move @since annotator to pyspark to be shared by all components
> ---
>
> Key: SPARK-10373
> URL: https://issues.apache.org/jira/browse/SPARK-10373
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Davies Liu
>
> Python's `@since` is defined under `pyspark.sql`. It would be nice to move it 
> under `pyspark` to be shared by all components.
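
For reference, a minimal sketch of what such a shared decorator can look like (a simplified version, not necessarily the code that was moved):

{code}
def since(version):
    """Annotate a function's docstring with the version that added it."""
    def decorator(f):
        f.__doc__ = (f.__doc__ or "").rstrip() + "\n\n.. versionadded:: %s" % version
        return f
    return decorator

@since("1.5.0")
def example_api():
    """Does something useful."""
{code}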






[jira] [Created] (SPARK-10345) Flaky test: HiveCompatibilitySuite.nonblock_op_deduplicate

2015-08-28 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10345:
--

 Summary: Flaky test: HiveCompatibilitySuite.nonblock_op_deduplicate
 Key: SPARK-10345
 URL: https://issues.apache.org/jira/browse/SPARK-10345
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41759/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/nonblock_op_deduplicate/






[jira] [Updated] (SPARK-10341) SMJ fail with unable to acquire memory

2015-08-28 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10341:
---
Target Version/s: 1.5.0  (was: 1.5.1)

> SMJ fail with unable to acquire memory
> --
>
> Key: SPARK-10341
> URL: https://issues.apache.org/jira/browse/SPARK-10341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> In SMJ (sort-merge join), the first ExternalSorter can consume all the memory 
> before spilling, so the second cannot even acquire its first page.
> {code}
> java.io.IOException: Unable to acquire 16777216 bytes of memory
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}






[jira] [Resolved] (SPARK-10323) NPE in code-gened In expression

2015-08-28 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10323.

   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8492
[https://github.com/apache/spark/pull/8492]

> NPE in code-gened In expression
> ---
>
> Key: SPARK-10323
> URL: https://issues.apache.org/jira/browse/SPARK-10323
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.0
>
>
> To reproduce the problem, you can run {{null in ('str')}}. Let's also take a 
> look at InSet and other similar expressions.






[jira] [Created] (SPARK-10343) Consider nullability of expression in codegen

2015-08-28 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10343:
--

 Summary: Consider nullability of expression in codegen
 Key: SPARK-10343
 URL: https://issues.apache.org/jira/browse/SPARK-10343
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Priority: Critical


In codegen, we currently don't consider the nullability of expressions. Once we 
do, we can avoid lots of null checks (reducing the size of the generated code 
and also improving performance).

Before that, we should double-check the correctness of the nullability of all 
expressions and schemas, or we will hit NPEs or wrong results.
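
As a toy illustration of the saving (Python pseudocode standing in for the actual Java code generator): when both inputs are statically known to be non-nullable, the generated expression can drop the null branch entirely.

{code}
def gen_add(left, right, left_nullable, right_nullable):
    # 'left' and 'right' are names of already-evaluated input variables
    if not (left_nullable or right_nullable):
        # no null checks needed: smaller generated code, straight-line execution
        return "result = %s + %s;" % (left, right)
    return ("if (%sIsNull || %sIsNull) { resultIsNull = true; } "
            "else { result = %s + %s; }" % (left, right, left, right))

print(gen_add("a", "b", False, False))  # result = a + b;
print(gen_add("a", "b", True, False))   # guarded version with the null branch
{code}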






[jira] [Updated] (SPARK-10342) Cooperative memory management

2015-08-28 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10342:
---
Issue Type: Improvement  (was: Story)

> Cooperative memory management
> -
>
> Key: SPARK-10342
> URL: https://issues.apache.org/jira/browse/SPARK-10342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Priority: Critical
>
> We have had memory starvation problems for a long time, and they became worse 
> in 1.5 since we use larger pages.
> In order to increase memory usage (reduce unnecessary spilling) and also 
> reduce the risk of OOM, we should manage memory in a cooperative way, meaning 
> every memory consumer should also be responsive to requests from others to 
> release memory (by spilling).
> Memory requests can differ: a hard requirement (will crash if not allocated) 
> or a soft requirement (worse performance if not allocated). The costs of 
> spilling also differ. We could introduce some kind of priority to make them 
> work together better.






[jira] [Created] (SPARK-10342) Cooperative memory management

2015-08-28 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10342:
--

 Summary: Cooperative memory management
 Key: SPARK-10342
 URL: https://issues.apache.org/jira/browse/SPARK-10342
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.5.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical


We have had memory starvation problems for a long time, and they became worse in 
1.5 since we use larger pages.

In order to increase memory usage (reduce unnecessary spilling) and also reduce 
the risk of OOM, we should manage memory in a cooperative way, meaning every 
memory consumer should also be responsive to requests from others to release 
memory (by spilling).

Memory requests can differ: a hard requirement (will crash if not allocated) or 
a soft requirement (worse performance if not allocated). The costs of spilling 
also differ. We could introduce some kind of priority to make them work together 
better.
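
A toy sketch of the cooperative idea (purely illustrative; MemoryManager, Consumer and spill() here are made-up names, not Spark's API): consumers register with a shared manager, and when one consumer's request cannot be satisfied, the manager asks the others to spill until it can be.

{code}
class MemoryManager(object):
    def __init__(self, total):
        self.free = total
        self.consumers = []

    def register(self, consumer):
        self.consumers.append(consumer)

    def acquire(self, requester, amount):
        # Ask other consumers to spill until the request fits or nothing is left.
        for c in self.consumers:
            if self.free >= amount:
                break
            if c is not requester and c.used > 0:
                c.spill()
        granted = min(amount, self.free)
        self.free -= granted
        return granted

    def release(self, amount):
        self.free += amount


class Consumer(object):
    def __init__(self, manager):
        self.manager = manager
        self.used = 0
        manager.register(self)

    def acquire(self, amount):
        granted = self.manager.acquire(self, amount)
        self.used += granted
        return granted

    def spill(self):
        # A real consumer would write its in-memory data to disk here.
        released, self.used = self.used, 0
        self.manager.release(released)
        return released


mm = MemoryManager(total=100)
a, b = Consumer(mm), Consumer(mm)
a.acquire(80)
print(b.acquire(50))  # prints 50: granted after asking 'a' to spill
{code}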






[jira] [Created] (SPARK-10341) SMJ fail with unable to acquire memory

2015-08-28 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10341:
--

 Summary: SMJ fail with unable to acquire memory
 Key: SPARK-10341
 URL: https://issues.apache.org/jira/browse/SPARK-10341
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical


In SMJ (sort-merge join), the first ExternalSorter can consume all the memory 
before spilling, so the second cannot even acquire its first page.

{code}
java.io.IOException: Unable to acquire 16777216 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
at 
org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
{code}






[jira] [Assigned] (SPARK-10321) OrcRelation doesn't override sizeInBytes

2015-08-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-10321:
--

Assignee: Davies Liu

> OrcRelation doesn't override sizeInBytes
> 
>
> Key: SPARK-10321
> URL: https://issues.apache.org/jira/browse/SPARK-10321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Davies Liu
>Priority: Critical
>
> This hurts performance badly because broadcast join can never be enabled.
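
A toy illustration of why the missing statistic matters (not the planner's actual code; the 10 MB threshold below is only an assumed default): the planner broadcasts a side only when its estimated size is under the threshold, and a relation that never overrides sizeInBytes reports an effectively infinite default, so it can never qualify.

{code}
DEFAULT_SIZE_IN_BYTES = 2 ** 63 - 1      # "unknown" relations report a huge default
BROADCAST_THRESHOLD = 10 * 1024 * 1024   # assumed 10 MB threshold

def choose_join(left_size, right_size):
    # Broadcast the smaller side only if its estimated size fits the threshold.
    if min(left_size, right_size) <= BROADCAST_THRESHOLD:
        return "broadcast hash join"
    return "shuffle-based join"

print(choose_join(DEFAULT_SIZE_IN_BYTES, 5 * 1024 * 1024))        # broadcast hash join
print(choose_join(DEFAULT_SIZE_IN_BYTES, DEFAULT_SIZE_IN_BYTES))  # shuffle-based join
{code}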






[jira] [Created] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-08-26 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10309:
--

 Summary: Some tasks failed with Unable to acquire memory
 Key: SPARK-10309
 URL: https://issues.apache.org/jira/browse/SPARK-10309
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Davies Liu


While running Q53 of TPC-DS (scale = 1500) on a 24-node cluster (12 GB of memory 
per executor):

{code}
java.io.IOException: Unable to acquire 33554432 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
at 
org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

The task could finish after a retry.






[jira] [Resolved] (SPARK-10305) PySpark createDataFrame on list of LabeledPoints fails (regression)

2015-08-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10305.

   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8470
[https://github.com/apache/spark/pull/8470]

> PySpark createDataFrame on list of LabeledPoints fails (regression)
> ---
>
> Key: SPARK-10305
> URL: https://issues.apache.org/jira/browse/SPARK-10305
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Critical
> Fix For: 1.5.0
>
>
> The following code works in 1.4 but fails in 1.5:
> {code}
> import numpy as np
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vectors
> lp1 = LabeledPoint(1.0, Vectors.sparse(5, np.array([0, 1]), np.array([2.0, 
> 21.0])))
> lp2 = LabeledPoint(0.0, Vectors.sparse(5, np.array([2, 3]), np.array([2.0, 
> 21.0])))
> tmp = [lp1, lp2]
> sqlContext.createDataFrame(tmp).show()
> {code}
> The failure is:
> {code}
> ValueError: Unexpected tuple LabeledPoint(1.0, (5,[0,1],[2.0,21.0])) with 
> StructType
> ---
> ValueErrorTraceback (most recent call last)
>  in ()
>   6 lp2 = LabeledPoint(0.0, Vectors.sparse(5, np.array([2, 3]), 
> np.array([2.0, 21.0])))
>   7 tmp = [lp1, lp2]
> > 8 sqlContext.createDataFrame(tmp).show()
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in 
> createDataFrame(self, data, schema, samplingRatio)
> 404 rdd, schema = self._createFromRDD(data, schema, 
> samplingRatio)
> 405 else:
> --> 406 rdd, schema = self._createFromLocal(data, schema)
> 407 jrdd = 
> self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
> 408 jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), 
> schema.json())
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in 
> _createFromLocal(self, data, schema)
> 335 
> 336 # convert python objects to sql data
> --> 337 data = [schema.toInternal(row) for row in data]
> 338 return self._sc.parallelize(data), schema
> 339 
> /home/ubuntu/databricks/spark/python/pyspark/sql/types.pyc in 
> toInternal(self, obj)
> 539 return tuple(f.toInternal(v) for f, v in 
> zip(self.fields, obj))
> 540 else:
> --> 541 raise ValueError("Unexpected tuple %r with 
> StructType" % obj)
> 542 else:
> 543 if isinstance(obj, dict):
> ValueError: Unexpected tuple LabeledPoint(1.0, (5,[0,1],[2.0,21.0])) with 
> StructType
> {code}






[jira] [Closed] (SPARK-10302) NPE while save a DataFrame as ORC

2015-08-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-10302.
--
   Resolution: Duplicate
Fix Version/s: 1.5.0

> NPE while save a DataFrame as ORC
> -
>
> Key: SPARK-10302
> URL: https://issues.apache.org/jira/browse/SPARK-10302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Priority: Critical
> Fix For: 1.5.0
>
>
> {code}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$2.apply(HiveInspectors.scala:377)
> at 
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$2.apply(HiveInspectors.scala:377)
> at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.writeInternal(OrcRelation.scala:130)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:240)
> ... 8 more
> {code}






[jira] [Created] (SPARK-10302) NPE while save a DataFrame as ORC

2015-08-26 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10302:
--

 Summary: NPE while save a DataFrame as ORC
 Key: SPARK-10302
 URL: https://issues.apache.org/jira/browse/SPARK-10302
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Davies Liu
Priority: Critical


{code}
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$2.apply(HiveInspectors.scala:377)
at 
org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$2.apply(HiveInspectors.scala:377)
at 
org.apache.spark.sql.hive.orc.OrcOutputWriter.writeInternal(OrcRelation.scala:130)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:240)
... 8 more
{code}






[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option

2015-08-26 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14715512#comment-14715512
 ] 

Davies Liu commented on SPARK-9228:
---

[~jameszhouyi] unsafe.offHeap is a separate option from `sql.tungsten.enable`; it 
specifies how Tungsten manages the memory: on heap (default) or off heap.

> Combine unsafe and codegen into a single option
> ---
>
> Key: SPARK-9228
> URL: https://issues.apache.org/jira/browse/SPARK-9228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.0
>
>
> Before QA, lets flip on features and consolidate unsafe and codegen.






[jira] [Created] (SPARK-10245) SQLContext can't parse literal less than 0.1 ( 0.01)

2015-08-25 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10245:
--

 Summary: SQLContext can't parse literal less than 0.1 ( 0.01)
 Key: SPARK-10245
 URL: https://issues.apache.org/jira/browse/SPARK-10245
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker


{code}

scala> sqlCtx.sql("select 0.01")
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater 
than precision (1).;
at org.apache.spark.sql.types.PrecisionInfo.(DecimalType.scala:32)
at org.apache.spark.sql.types.DecimalType.(DecimalType.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:40)
at 
org.apache.spark.sql.catalyst.SqlParser$$anonfun$numericLiteral$2$$anonfun$apply$216.apply(SqlParser.scala:335)
at 
org.apache.spark.sql.catalyst.SqlParser$$anonfun$numericLiteral$2$$anonfun$apply$216.apply(SqlParser.scala:334)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)

{code}
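
A minimal sketch of the constraint involved (not the parser's actual code): the unscaled digits of 0.01 give a precision of 1 while the scale is 2, so the declared precision has to be widened to at least the scale before a DecimalType can be built.

{code}
from decimal import Decimal

def literal_decimal_type(text):
    d = Decimal(text).as_tuple()
    scale = -d.exponent if d.exponent < 0 else 0
    precision = max(len(d.digits), scale)  # never let precision fall below scale
    return "DecimalType(%d,%d)" % (precision, scale)

print(literal_decimal_type("0.01"))    # DecimalType(2,2)
print(literal_decimal_type("123.45"))  # DecimalType(5,2)
{code}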






[jira] [Resolved] (SPARK-10196) Failed to save json data with a decimal type in the schema

2015-08-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10196.

   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8408
[https://github.com/apache/spark/pull/8408]

> Failed to save json data with a decimal type in the schema
> --
>
> Key: SPARK-10196
> URL: https://issues.apache.org/jira/browse/SPARK-10196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.5.0
>
>
> I tried to save a dataset with a decimal type in it, and I got:
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 7 in stage 19.0 failed 4 times, most recent failure: Lost task 7.3 in 
> stage 19.0 (TID 932, 10.0.243.5): org.apache.spark.SparkException: Task 
> failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:391)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: (DecimalType(7,2),40.74) (of class scala.Tuple2)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:191)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:334)
>   ... 8 more
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1818)
>   at org.apache.spark.SparkContext.r

[jira] [Commented] (SPARK-10215) Div of Decimal returns null

2015-08-24 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710696#comment-14710696
 ] 

Davies Liu commented on SPARK-10215:


I think we don't have enough time to figure out the right solution for the 1.5 
release, so I'd like to target this for 1.6. Does that work for you? Or does this 
break any real use case?

> Div of Decimal returns null
> ---
>
> Key: SPARK-10215
> URL: https://issues.apache.org/jira/browse/SPARK-10215
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Blocker
>
> {code}
> val d = Decimal(1.12321)
> val df = Seq((d, 1)).toDF("a", "b")
> df.selectExpr("b * a / b").collect() => Array(Row(null))
> {code}






[jira] [Resolved] (SPARK-8580) Test Parquet interoperability and compatibility with other libraries/systems

2015-08-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8580.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8392
[https://github.com/apache/spark/pull/8392]

> Test Parquet interoperability and compatibility with other libraries/systems
> 
>
> Key: SPARK-8580
> URL: https://issues.apache.org/jira/browse/SPARK-8580
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 1.5.0
>
>
> As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 
> to improve interoperability with other systems (reading non-standard Parquet 
> files they generate, and generating standard Parquet files), it would be good 
> to have a set of standard test Parquet files generated by various 
> systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old 
> versions of Spark SQL) to ensure compatibility.






[jira] [Closed] (SPARK-7506) pyspark.sql.types.StructType.fromJson() is incorrectly named

2015-08-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-7506.
-
Resolution: Won't Fix
  Assignee: (was: Davies Liu)

These functions could be used by users (even if we hope they aren't), so they 
can't be changed easily without breaking compatibility.

> pyspark.sql.types.StructType.fromJson() is incorrectly named
> 
>
> Key: SPARK-7506
> URL: https://issues.apache.org/jira/browse/SPARK-7506
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.3.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> {code}
> >>> json_rdd = sqlContext.jsonRDD(sc.parallelize(['{"name": "Nick"}']))
> >>> json_rdd.schema
> StructType(List(StructField(name,StringType,true)))
> >>> type(json_rdd.schema)
> 
> >>> json_rdd.schema.json()
> '{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"}],"type":"struct"}'
> >>> pyspark.sql.types.StructType.fromJson(json_rdd.schema.json())
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Applications/apache-spark/spark-1.3.1-bin-hadoop2.4/python/pyspark/sql/types.py",
>  line 346, in fromJson
> return StructType([StructField.fromJson(f) for f in json["fields"]])
> TypeError: string indices must be integers, not str
> >>> import json
> >>> pyspark.sql.types.StructType.fromJson(json.loads(json_rdd.schema.json()))
> StructType(List(StructField(name,StringType,true)))
> >>>
> {code}
> So {{fromJson()}} doesn't actually expect JSON, which is a string. It expects 
> a dictionary.
> This method should probably be renamed.






[jira] [Commented] (SPARK-10177) Parquet support interprets timestamp values differently from Hive 0.14.0+

2015-08-24 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709878#comment-14709878
 ] 

Davies Liu commented on SPARK-10177:


[~lian cheng] After some investigation, I just realized that we misunderstood the 
format used in Hive/Parquet: the INT96 value has two parts, a Julian day number and 
the nanoseconds within that day. Because an integer Julian day begins at noon (12 pm), 
the two parts overlap by half a day, so the first part can't be interpreted as an 
astronomical Julian day and simply added to the second.

It's weird but not wrong; I will send out a PR to fix this soon.
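
A quick numeric sanity check of the half-day overlap (a minimal sketch using only the 
well-known facts that the Unix epoch 1970-01-01 00:00 UTC is Julian Date 2440587.5 and 
that the integer Julian Day Number usually stored for that date is 2440588):

{code}
# Astronomical Julian Days start at noon, so an integer day number plus a
# nanos-since-midnight field overlap by half a day -- exactly the 12-hour
# shift observed in this ticket.
JD_UNIX_EPOCH = 2440587.5   # 1970-01-01 00:00 UTC as a fractional Julian Date
JDN_UNIX_EPOCH = 2440588    # integer Julian Day Number commonly stored for that date

print((JDN_UNIX_EPOCH - JD_UNIX_EPOCH) * 86400)  # 43200.0 seconds = 12 hours
{code}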

> Parquet support interprets timestamp values differently from Hive 0.14.0+
> -
>
> Key: SPARK-10177
> URL: https://issues.apache.org/jira/browse/SPARK-10177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Attachments: 00_0
>
>
> Running the following SQL under Hive 0.14.0+ (tested against 0.14.0 and 
> 1.2.1):
> {code:sql}
> CREATE TABLE ts_test STORED AS PARQUET
> AS SELECT CAST("2015-01-01 00:00:00" AS TIMESTAMP);
> {code}
> Then read the Parquet file generated by Hive with Spark SQL:
> {noformat}
> scala> 
> sqlContext.read.parquet("hdfs://localhost:9000/user/hive/warehouse_hive14/ts_test").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([2015-01-01 12:00:00.0])
> {noformat}
> This issue can be easily reproduced with [this test case in PR 
> #8392|https://github.com/apache/spark/pull/8392/files#diff-1e55698cc579cbae676f827a89c2dc2eR116].
> Spark 1.4.1 works as expected in this case.
> 
> Update:
> Seems that the problem is that we do Julian day conversion wrong in 
> {{DateTimeUtils}}.  The following {{spark-shell}} session illustrates it:
> {code}
> import java.sql._
> import java.util._
> import org.apache.hadoop.hive.ql.io.parquet.timestamp._
> import org.apache.spark.sql.catalyst.util._
> TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
> val ts = Timestamp.valueOf("1970-01-01 00:00:00")
> val nt = NanoTimeUtils.getNanoTime(ts, false)
> val jts = DateTimeUtils.fromJulianDay(nt.getJulianDay, nt.getTimeOfDayNanos)
> DateTimeUtils.toJavaTimestamp(jts)
> // ==> java.sql.Timestamp = 1970-01-01 12:00:00.0
> {code}






[jira] [Resolved] (SPARK-9401) Fully implement code generation for ConcatWs

2015-08-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9401.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8353
[https://github.com/apache/spark/pull/8353]

> Fully implement code generation for ConcatWs
> 
>
> Key: SPARK-9401
> URL: https://issues.apache.org/jira/browse/SPARK-9401
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 1.6.0
>
>
> In ConcatWs, we fall back to interpreted mode if the input is a mix of string 
> and array of strings. We should have code gen for those as well.






[jira] [Commented] (SPARK-9228) Combine unsafe and codegen into a single option

2015-08-20 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706289#comment-14706289
 ] 

Davies Liu commented on SPARK-9228:
---

Right now it's an internal configuration (it could be changed or removed in the next 
release); we keep it only for debugging purposes.
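
For reference, this is how the consolidated flag can be toggled for debugging, assuming 
the 1.5 flag name {{spark.sql.tungsten.enabled}} and an existing {{sqlContext}} (a 
sketch, not a supported public API):

{code}
# Internal/debug-only switch in 1.5; it may be renamed or removed in a later release.
sqlContext.setConf("spark.sql.tungsten.enabled", "false")  # fall back to non-Tungsten path
sqlContext.setConf("spark.sql.tungsten.enabled", "true")   # default
{code}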

> Combine unsafe and codegen into a single option
> ---
>
> Key: SPARK-9228
> URL: https://issues.apache.org/jira/browse/SPARK-9228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.0
>
>
> Before QA, lets flip on features and consolidate unsafe and codegen.






[jira] [Resolved] (SPARK-9400) Implement code generation for StringLocate

2015-08-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9400.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8330
[https://github.com/apache/spark/pull/8330]

> Implement code generation for StringLocate
> --
>
> Key: SPARK-9400
> URL: https://issues.apache.org/jira/browse/SPARK-9400
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 1.6.0
>
>







[jira] [Reopened] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2015-08-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reopened SPARK-3533:
---

Re-open this for discussion.

> Add saveAsTextFileByKey() method to RDDs
> 
>
> Key: SPARK-3533
> URL: https://issues.apache.org/jira/browse/SPARK-3533
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to 
> multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
> >>> 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so 
> that I have one output directory per distinct key. Each output directory 
> could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of 
> {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
> {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
> It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
> that makes it easy to save RDDs out to multiple locations at once.
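
Until such a method exists, a commonly used (if inefficient) pure-PySpark workaround is 
sketched below. It assumes a small number of distinct keys and an existing {{sc}}; the 
output prefix is a placeholder, and note that it launches one Spark job per key:

{code}
prefix = "/path/prefix"   # placeholder output location
pairs = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])

for key in pairs.keys().distinct().collect():
    # the default argument k=key avoids Python's late binding of closure variables
    (pairs.filter(lambda kv, k=key: kv[0] == k)
          .values()
          .saveAsTextFile("{0}/{1}".format(prefix, key)))
{code}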






[jira] [Created] (SPARK-10129) math function: stddev_samp

2015-08-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10129:
--

 Summary: math function: stddev_samp
 Key: SPARK-10129
 URL: https://issues.apache.org/jira/browse/SPARK-10129
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu


Use the STDDEV_SAMP function to return the sample standard deviation (the square root 
of the sample variance) of a set of values.

http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.bigsql.doc/doc/bsql_stdev_samp.html
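
For a concrete definition, Python's standard library computes the same quantity; this 
small illustration is unrelated to the SQL implementation and only pins down which 
denominator is meant:

{code}
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(statistics.stdev(values))   # sample std dev (n-1 denominator), ~2.138
print(statistics.pstdev(values))  # population std dev, for comparison: 2.0
{code}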






[jira] [Commented] (SPARK-7379) pickle.loads expects a string instead of bytes in Python 3.

2015-08-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703583#comment-14703583
 ] 

Davies Liu commented on SPARK-7379:
---

[~mengxr] There is a known issue with how to unpickle a bytearray coming from Pyrolite 
in Python 3 (it's OK in Python 2); is this related to that? I have not figured out a 
way to make Pyrolite compatible with both Python 2 and 3.

> pickle.loads expects a string instead of bytes in Python 3.
> ---
>
> Key: SPARK-7379
> URL: https://issues.apache.org/jira/browse/SPARK-7379
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Davies Liu
>
> In PickleSerializer, we call pickle.loads in Python 3. However, the input obj 
> could be bytes, which works in Python 2 but not 3.
> The error message is
> {code}
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/serializers.py",
>  line 418, in loads
> return pickle.loads(obj, encoding=encoding)
> TypeError: must be a unicode character, not bytes
> {code}






[jira] [Commented] (SPARK-9644) Support update DecimalType with precision > 18 in UnsafeRow

2015-08-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703474#comment-14703474
 ] 

Davies Liu commented on SPARK-9644:
---

[~robbinspg] It's fixed by https://issues.apache.org/jira/browse/SPARK-10095

> Support update DecimalType with precision > 18 in UnsafeRow
> ---
>
> Key: SPARK-9644
> URL: https://issues.apache.org/jira/browse/SPARK-9644
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.0
>
>
> Currently, we don't support using DecimalType with precision > 18 in the new 
> unsafe aggregation; it would be good to support it.






[jira] [Commented] (SPARK-6798) Fix Date serialization in SparkR

2015-08-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702625#comment-14702625
 ] 

Davies Liu commented on SPARK-6798:
---

Not a bug, it's just not efficient. 

> Fix Date serialization in SparkR
> 
>
> Key: SPARK-6798
> URL: https://issues.apache.org/jira/browse/SPARK-6798
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Davies Liu
>Priority: Minor
>
> SparkR's date serialization right now sends strings from R to the JVM. We 
> should convert this to integers and also account for timezones correctly by 
> using DateUtils






[jira] [Updated] (SPARK-6798) Fix Date serialization in SparkR

2015-08-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-6798:
--
Issue Type: Improvement  (was: Bug)

> Fix Date serialization in SparkR
> 
>
> Key: SPARK-6798
> URL: https://issues.apache.org/jira/browse/SPARK-6798
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Davies Liu
>Priority: Minor
>
> SparkR's date serialization right now sends strings from R to the JVM. We 
> should convert this to integers and also account for timezones correctly by 
> using DateUtils






[jira] [Created] (SPARK-10107) NPE in format_number

2015-08-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10107:
--

 Summary: NPE in format_number
 Key: SPARK-10107
 URL: https://issues.apache.org/jira/browse/SPARK-10107
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker


{code}
sqlContext.range(1<<20).selectExpr("format_number(id, 2)").show()
{code}

{code}
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:93)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.TungstenProject$$anonfun$2$$anonfun$apply$3.apply(basicOperators.scala:88)
at 
org.apache.spark.sql.execution.TungstenProject$$anonfun$2$$anonfun$apply$3.apply(basicOperators.scala:86)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:129)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}






[jira] [Resolved] (SPARK-10095) Should not use the private field of BigInteger

2015-08-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10095.

   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8286
[https://github.com/apache/spark/pull/8286]

> Should not use the private field of BigInteger
> --
>
> Key: SPARK-10095
> URL: https://issues.apache.org/jira/browse/SPARK-10095
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Minor
> Fix For: 1.5.0
>
>
> In UnsafeRow, we use the private fields of BigInteger for better performance, but 
> this actually didn't contribute much to end-to-end runtime and it makes the code 
> non-portable (it may fail on other JVM implementations).
> So we should use the public API instead.






[jira] [Commented] (SPARK-9627) SQL job failed if the dataframe with string columns is cached

2015-08-18 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702390#comment-14702390
 ] 

Davies Liu commented on SPARK-9627:
---

I can reproduce it with latest master.

> SQL job failed if the dataframe with string columns is cached
> -
>
> Key: SPARK-9627
> URL: https://issues.apache.org/jira/browse/SPARK-9627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Cheng Lian
>Priority: Blocker
>
> {code}
> r = random.Random()
> def gen(i):
> d = date.today() - timedelta(r.randint(0, 5000))
> cat = str(r.randint(0, 20)) * 5
> c = r.randint(0, 1000)
> price = decimal.Decimal(r.randint(0, 10)) / 100
> return (d, cat, c, price)
> schema = StructType().add('date', DateType()).add('cat', 
> StringType()).add('count', ShortType()).add('price', DecimalType(5, 2))
> #df = sqlContext.createDataFrame(sc.range(1<<24).map(gen), schema)
> #df.show()
> #df.write.parquet('sales4')
> df = sqlContext.read.parquet('sales4')
> df.cache()
> df.count()
> df.show()
> print df.schema
> raw_input()
> r = df.groupBy(df.date, df.cat).agg(sum(df['count'] * df.price))
> print r.explain(True)
> r.show()
> {code}
> {code}
> StructType(List(StructField(date,DateType,true),StructField(cat,StringType,true),StructField(count,ShortType,true),StructField(price,DecimalType(5,2),true)))
> == Parsed Logical Plan ==
> 'Aggregate [date#0,cat#1], [date#0,cat#1,sum((count#2 * price#3)) AS 
> sum((count * price))#70]
>  Relation[date#0,cat#1,count#2,price#3] 
> org.apache.spark.sql.parquet.ParquetRelation@5ec8f315
> == Analyzed Logical Plan ==
> date: date, cat: string, sum((count * price)): decimal(21,2)
> Aggregate [date#0,cat#1], 
> [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * 
> price))#70]
>  Relation[date#0,cat#1,count#2,price#3] 
> org.apache.spark.sql.parquet.ParquetRelation@5ec8f315
> == Optimized Logical Plan ==
> Aggregate [date#0,cat#1], 
> [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * 
> price))#70]
>  InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
> StorageLevel(true, true, false, true, 1), (PhysicalRDD 
> [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None
> == Physical Plan ==
> NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
> ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, 
> DecimalType(11,2)2,mode=Final,isDistinct=false))
>  TungstenSort [date#0 ASC,cat#1 ASC], false, 0
>   ConvertToUnsafe
>Exchange hashpartitioning(date#0,cat#1)
> NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
> ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, 
> DecimalType(11,2)2,mode=Partial,isDistinct=false))
>  TungstenSort [date#0 ASC,cat#1 ASC], false, 0
>   ConvertToUnsafe
>InMemoryColumnarTableScan [date#0,cat#1,count#2,price#3], 
> (InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
> StorageLevel(true, true, false, true, 1), (PhysicalRDD 
> [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None)
> Code Generation: true
> == RDD ==
> None
> 15/08/04 23:21:53 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "t.py", line 34, in 
> r.show()
>   File "/Users/davies/work/spark/python/pyspark/sql/dataframe.py", line 258, 
> in show
> print(self._jdf.showString(n, truncate))
>   File "/Users/davies/work/spark/python/lib/py4j/java_gateway.py", line 538, 
> in __call__
> self.target_id, self.name)
>   File "/Users/davies/work/spark/python/pyspark/sql/utils.py", line 36, in 
> deco
> return f(*a, **kw)
>   File "/Users/davies/work/spark/python/lib/py4j/protocol.py", line 300, in 
> get_return_value
> format(target_id, '.', name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 
> (TID 10, localhost): java.lang.UnsupportedOperationException: tail of empty 
> list
>   at scala.collection.immutable.Nil$.tail(List.scala:339)
>   at scala.collection.immutable.Nil$.tail(List.scala:334)
>   at scala.reflect.internal.SymbolTable.popPh

[jira] [Commented] (SPARK-9627) SQL job failed if the dataframe with string columns is cached

2015-08-18 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702387#comment-14702387
 ] 

Davies Liu commented on SPARK-9627:
---

The `df.show()` will succeed, but `df.groupBy(df.date, 
df.cat).agg(sum(df['count'] * df.price))` will fail (you need to press enter to 
run it, or remove the `raw_input()` line)

> SQL job failed if the dataframe with string columns is cached
> -
>
> Key: SPARK-9627
> URL: https://issues.apache.org/jira/browse/SPARK-9627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Cheng Lian
>Priority: Blocker
>
> {code}
> r = random.Random()
> def gen(i):
> d = date.today() - timedelta(r.randint(0, 5000))
> cat = str(r.randint(0, 20)) * 5
> c = r.randint(0, 1000)
> price = decimal.Decimal(r.randint(0, 10)) / 100
> return (d, cat, c, price)
> schema = StructType().add('date', DateType()).add('cat', 
> StringType()).add('count', ShortType()).add('price', DecimalType(5, 2))
> #df = sqlContext.createDataFrame(sc.range(1<<24).map(gen), schema)
> #df.show()
> #df.write.parquet('sales4')
> df = sqlContext.read.parquet('sales4')
> df.cache()
> df.count()
> df.show()
> print df.schema
> raw_input()
> r = df.groupBy(df.date, df.cat).agg(sum(df['count'] * df.price))
> print r.explain(True)
> r.show()
> {code}
> {code}
> StructType(List(StructField(date,DateType,true),StructField(cat,StringType,true),StructField(count,ShortType,true),StructField(price,DecimalType(5,2),true)))
> == Parsed Logical Plan ==
> 'Aggregate [date#0,cat#1], [date#0,cat#1,sum((count#2 * price#3)) AS 
> sum((count * price))#70]
>  Relation[date#0,cat#1,count#2,price#3] 
> org.apache.spark.sql.parquet.ParquetRelation@5ec8f315
> == Analyzed Logical Plan ==
> date: date, cat: string, sum((count * price)): decimal(21,2)
> Aggregate [date#0,cat#1], 
> [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * 
> price))#70]
>  Relation[date#0,cat#1,count#2,price#3] 
> org.apache.spark.sql.parquet.ParquetRelation@5ec8f315
> == Optimized Logical Plan ==
> Aggregate [date#0,cat#1], 
> [date#0,cat#1,sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, DecimalType(11,2) AS sum((count * 
> price))#70]
>  InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
> StorageLevel(true, true, false, true, 1), (PhysicalRDD 
> [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None
> == Physical Plan ==
> NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
> ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, 
> DecimalType(11,2)2,mode=Final,isDistinct=false))
>  TungstenSort [date#0 ASC,cat#1 ASC], false, 0
>   ConvertToUnsafe
>Exchange hashpartitioning(date#0,cat#1)
> NewAggregate with SortBasedAggregationIterator List(date#0, cat#1) 
> ArrayBuffer((sum((change_decimal_precision(CAST(CAST(count#2, 
> DecimalType(5,0)), DecimalType(11,2))) * 
> change_decimal_precision(CAST(price#3, 
> DecimalType(11,2)2,mode=Partial,isDistinct=false))
>  TungstenSort [date#0 ASC,cat#1 ASC], false, 0
>   ConvertToUnsafe
>InMemoryColumnarTableScan [date#0,cat#1,count#2,price#3], 
> (InMemoryRelation [date#0,cat#1,count#2,price#3], true, 1, 
> StorageLevel(true, true, false, true, 1), (PhysicalRDD 
> [date#0,cat#1,count#2,price#3], MapPartitionsRDD[3] at), None)
> Code Generation: true
> == RDD ==
> None
> 15/08/04 23:21:53 ERROR TaskSetManager: Task 0 in stage 4.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "t.py", line 34, in 
> r.show()
>   File "/Users/davies/work/spark/python/pyspark/sql/dataframe.py", line 258, 
> in show
> print(self._jdf.showString(n, truncate))
>   File "/Users/davies/work/spark/python/lib/py4j/java_gateway.py", line 538, 
> in __call__
> self.target_id, self.name)
>   File "/Users/davies/work/spark/python/pyspark/sql/utils.py", line 36, in 
> deco
> return f(*a, **kw)
>   File "/Users/davies/work/spark/python/lib/py4j/protocol.py", line 300, in 
> get_return_value
> format(target_id, '.', name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 
> (TID 10, localhost): java.lang.UnsupportedOperationException: tail of empty 
> list
>   at scala.collection.immutable.N

[jira] [Created] (SPARK-10095) Should not use the private field of BigInteger

2015-08-18 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10095:
--

 Summary: Should not use the private field of BigInteger
 Key: SPARK-10095
 URL: https://issues.apache.org/jira/browse/SPARK-10095
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Minor


In UnsafeRow, we use the private fields of BigInteger for better performance, but 
this actually didn't contribute much to end-to-end runtime and it makes the code 
non-portable (it may fail on other JVM implementations).

So we should use the public API instead.






[jira] [Commented] (SPARK-9644) Support update DecimalType with precision > 18 in UnsafeRow

2015-08-18 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701974#comment-14701974
 ] 

Davies Liu commented on SPARK-9644:
---

In a benchmark of aggregation on a decimal column (with precision > 18), this 
optimization contributed only 3% to end-to-end runtime, so I will revert it to keep 
the code portable. Thanks, [~robbinspg]!

> Support update DecimalType with precision > 18 in UnsafeRow
> ---
>
> Key: SPARK-9644
> URL: https://issues.apache.org/jira/browse/SPARK-9644
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.0
>
>
> Currently, we don't support using DecimalType with precision > 18 in the new 
> unsafe aggregation; it would be good to support it.






[jira] [Commented] (SPARK-9644) Support update DecimalType with precision > 18 in UnsafeRow

2015-08-18 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701891#comment-14701891
 ] 

Davies Liu commented on SPARK-9644:
---

[~robbinspg] Thanks for pointing this out; would it be possible to have different 
branches for them? The public API is very slow, so we could have a fast path and a 
slow path based on the offsets of signum and mag, or a fast path for OpenJDK and a 
slow path for other JVMs.

> Support update DecimalType with precision > 18 in UnsafeRow
> ---
>
> Key: SPARK-9644
> URL: https://issues.apache.org/jira/browse/SPARK-9644
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.0
>
>
> Currently, we don't support using DecimalType with precision > 18 in the new 
> unsafe aggregation; it would be good to support it.






[jira] [Commented] (SPARK-10056) PySpark Row - Support for row["columnName"] syntax

2015-08-18 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701883#comment-14701883
 ] 

Davies Liu commented on SPARK-10056:


[~maver1ck], yes: changes to the Row class plus a unit test for this feature. Docs 
would be a nice addition too, but I'm not sure where the best place for them is.

> PySpark Row - Support for row["columnName"] syntax
> --
>
> Key: SPARK-10056
> URL: https://issues.apache.org/jira/browse/SPARK-10056
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Maciej Bryński
>Priority: Minor
>
> Right now you can get a Row element:
> - by column name: row.columnName
> - by index: row[index] (where index is an integer)
> My proposal is to add the following syntax:
> row["columnName"]
> It would be a solution similar to the DataFrame behaviour and should be quite easy 
> to implement with the __getitem__ method.
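
A minimal sketch of the idea (not the actual pyspark.sql.Row implementation): dispatch 
in __getitem__ on the key type so both positional and by-name access work.

{code}
# Toy Row: a tuple that also remembers its field names (illustration only).
class Row(tuple):
    def __new__(cls, fields, values):
        row = tuple.__new__(cls, values)
        row.__fields__ = list(fields)
        return row

    def __getitem__(self, item):
        if isinstance(item, (int, slice)):
            return tuple.__getitem__(self, item)                      # row[0]
        return tuple.__getitem__(self, self.__fields__.index(item))   # row["name"]

r = Row(["name", "age"], ["Nick", 30])
print(r[0], r["name"], r["age"])   # Nick Nick 30
{code}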






[jira] [Commented] (SPARK-10079) Make `column` and `col` functions be S4 functions

2015-08-18 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701877#comment-14701877
 ] 

Davies Liu commented on SPARK-10079:


There are two public functions, col and column, in Scala and Python which return a 
Column from a column name. I think they are useful and could be added to SparkR as well.

> Make `column` and `col` functions be S4 functions
> -
>
> Key: SPARK-10079
> URL: https://issues.apache.org/jira/browse/SPARK-10079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> {{column}} and {{col}} function at {{R/pkg/R/Column.R}} are currently defined 
> as S3 functions. I think it would be better to define them as S4 functions.
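
The Python counterparts Davies mentions, for reference (a sketch that assumes an 
existing DataFrame {{df}} with {{name}} and {{age}} columns):

{code}
from pyspark.sql.functions import col, column

df.select(col("age") + 1, column("name")).show()
{code}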






[jira] [Created] (SPARK-10090) After division, Decimal may have longer precision than expected

2015-08-18 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10090:
--

 Summary: After division, Decimal may have longer precision than 
expected
 Key: SPARK-10090
 URL: https://issues.apache.org/jira/browse/SPARK-10090
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker


In TPC-DS Q59, the result type should be DecimalType(37, 20), but we got 
Decimal('0.69903637110664268591656984574863203607') when it should be 
Decimal('0.69903637110664268592').
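
The expected rounding, illustrated with Python's decimal module (illustration only, 
not the SQL code path):

{code}
from decimal import Decimal, ROUND_HALF_UP

raw = Decimal('0.69903637110664268591656984574863203607')
# A DecimalType(37, 20) result should keep 20 digits after the decimal point:
print(raw.quantize(Decimal('1E-20'), rounding=ROUND_HALF_UP))
# 0.69903637110664268592
{code}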






[jira] [Closed] (SPARK-5901) [PySpark] pickle classes in main module

2015-08-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-5901.
-
  Resolution: Invalid
Target Version/s:   (was: 1.5.0)

cloudpickle does support serializing classes defined in __main__, but pickle does not 
support that.

> [PySpark] pickle classes in main module
> ---
>
> Key: SPARK-5901
> URL: https://issues.apache.org/jira/browse/SPARK-5901
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>
> Currently, cloudpickle does not support serializing class objects defined in the 
> main module.
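
A small sketch of the difference (it assumes the standalone cloudpickle package, or 
pyspark.cloudpickle, is importable): plain pickle records a class defined in __main__ 
only by reference, while cloudpickle embeds the class definition so another process 
can rebuild it.

{code}
import pickle
import cloudpickle   # or: from pyspark import cloudpickle

class Point:          # imagine this is defined in the driver's __main__
    def __init__(self, x):
        self.x = x

data = cloudpickle.dumps(Point)   # serializes the class body itself, not just a name
Rebuilt = pickle.loads(data)      # a process without this __main__ could still do this
print(Rebuilt(3).x)               # 3
{code}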






[jira] [Created] (SPARK-10065) Avoid triple copy of var-length objects in Array in tungsten projection

2015-08-17 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10065:
--

 Summary: Avoid triple copy of var-length objects in Array in 
tungsten projection
 Key: SPARK-10065
 URL: https://issues.apache.org/jira/browse/SPARK-10065
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Davies Liu


The first copy happens when we calculate the size of each element; after that, we copy 
the elements into an array buffer; finally, we copy the array buffer into the row 
buffer.

We could calculate the total size first, then write the elements into the row buffer 
directly.
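
A simplified sketch of the proposed approach in plain Python (illustration only, not 
the generated Java): compute sizes in a first pass, then write each element exactly 
once into the final buffer.

{code}
def write_array(elements, offset_width=4):
    sizes = [len(e) for e in elements]                 # pass 1: sizes only, no copying
    total = offset_width * len(elements) + sum(sizes)
    buf = bytearray(total)                             # stand-in for the row buffer region
    cursor = offset_width * len(elements)
    for i, e in enumerate(elements):
        buf[offset_width * i:offset_width * (i + 1)] = cursor.to_bytes(offset_width, "little")
        buf[cursor:cursor + sizes[i]] = e              # pass 2: single copy per element
        cursor += sizes[i]
    return bytes(buf)

print(write_array([b"spark", b"sql"]))
{code}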






[jira] [Commented] (SPARK-9427) Add expression functions in SparkR

2015-08-17 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700044#comment-14700044
 ] 

Davies Liu commented on SPARK-9427:
---

Should we target this for 1.6? 

> Add expression functions in SparkR
> --
>
> Key: SPARK-9427
> URL: https://issues.apache.org/jira/browse/SPARK-9427
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> The list of functions to add is based on SQL's functions. And it would be 
> better to add them in one shot PR.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala






[jira] [Assigned] (SPARK-10038) TungstenProject code generation fails when applied to array

2015-08-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-10038:
--

Assignee: Davies Liu

> TungstenProject code generation fails when applied to array
> ---
>
> Key: SPARK-10038
> URL: https://issues.apache.org/jira/browse/SPARK-10038
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
>Priority: Blocker
>
> During fuzz testing, I discovered that TungstenProject can crash when applied 
> to schemas that contain {{array}} columns.  As a minimal example, the 
> following code crashes in spark-shell:
> {code}
> sc.parallelize(Seq((Array(Array[Byte](1)), 1))).toDF.select("_1").rdd.count()
> {code}
> Here's the stacktrace:
> {code}
> 15/08/16 17:11:49 ERROR Executor: Exception in task 3.0 in stage 29.0 (TID 
> 144)
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: Line 53, Column 63: 
> '{' expected instead of '['
> public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] 
> exprs) {
>   return new SpecificUnsafeProjection(exprs);
> }
> class SpecificUnsafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
>   private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
>   private UnsafeRow convertedStruct2;
>   private byte[] buffer3;
>   private int cursor4;
>   private UnsafeArrayData convertedArray6;
>   private byte[] buffer7;
>   public 
> SpecificUnsafeProjection(org.apache.spark.sql.catalyst.expressions.Expression[]
>  expressions) {
> this.expressions = expressions;
> this.convertedStruct2 = new UnsafeRow();
> this.buffer3 = new byte[16];
> this.cursor4 = 0;
> convertedArray6 = new UnsafeArrayData();
> buffer7 = new byte[64];
>   }
>   // Scala.Function1 need this
>   public Object apply(Object row) {
> return apply((InternalRow) row);
>   }
>   public UnsafeRow apply(InternalRow i) {
> cursor4 = 16;
> convertedStruct2.pointTo(buffer3, Platform.BYTE_ARRAY_OFFSET, 1, cursor4);
> /* input[0, ArrayType(BinaryType,true)] */
> boolean isNull0 = i.isNullAt(0);
> ArrayData primitive1 = isNull0 ? null : (i.getArray(0));
> final boolean isNull8 = isNull0;
> if (!isNull8) {
>   final ArrayData tmp9 = primitive1;
>   if (tmp9 instanceof UnsafeArrayData) {
> convertedArray6 = (UnsafeArrayData) tmp9;
>   } else {
> final int numElements10 = tmp9.numElements();
> final int fixedSize11 = 4 * numElements10;
> int numBytes12 = fixedSize11;
> final byte[][] elements13 = new byte[][numElements10];
> for (int index15 = 0; index15 < numElements10; index15++) {
>   if (!tmp9.isNullAt(index15)) {
> elements13[index15] = tmp9.getBinary(index15);
> numBytes12 += 
> org.apache.spark.sql.catalyst.expressions.UnsafeWriters$BinaryWriter.getSize(elements13[index15]);
>   }
> }
> if (numBytes12 > buffer7.length) {
>   buffer7 = new byte[numBytes12];
> }
> int cursor14 = fixedSize11;
> for (int index15 = 0; index15 < numElements10; index15++) {
>   if (elements13[index15] == null) {
> // If element is null, write the negative value address into 
> offset region.
> Platform.putInt(buffer7, Platform.BYTE_ARRAY_OFFSET + 4 * 
> index15, -cursor14);
>   } else {
> Platform.putInt(buffer7, Platform.BYTE_ARRAY_OFFSET + 4 * 
> index15, cursor14);
> cursor14 += 
> org.apache.spark.sql.catalyst.expressions.UnsafeWriters$BinaryWriter.write(
>   buffer7,
>   Platform.BYTE_ARRAY_OFFSET + cursor14,
>   elements13[index15]);
>   }
> }
> convertedArray6.pointTo(
>   buffer7,
>   Platform.BYTE_ARRAY_OFFSET,
>   numElements10,
>   numBytes12);
>   }
> }
> int numBytes16 = cursor4 + (isNull8 ? 0 : 
> org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$ArrayWriter.getSize(convertedArray6));
> if (buffer3.length < numBytes16) {
>   // This will not happen frequently, because the buffer is re-used.
>   byte[] tmpBuffer5 = new byte[numBytes16 * 2];
>   Platform.copyMemory(buffer3, Platform.BYTE_ARRAY_OFFSET,
> tmpBuffer5, Platform.BYTE_ARRAY_OFFSET, buffer3.length);
>   buffer3 = tmpBuffer5;
> }
> convertedStruct2.pointTo(buffer3, Platform.BYTE_ARRAY_OFFSET, 1, 
> numBytes16);
> if (isNull8) {
>   convertedStruct2.setNullAt(0);
> } else {
>   cursor4 += 
> org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$Arr

[jira] [Commented] (SPARK-10056) PySpark Row - Support for row["columnName"] syntax

2015-08-17 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700028#comment-14700028
 ] 

Davies Liu commented on SPARK-10056:


We support accessing a nested column with `df['a.b']` (this behavior came from Scala); 
should we also support that here? (I don't think so.)

[~maver1ck] It would be great if you could send out a PR for this.

> PySpark Row - Support for row["columnName"] syntax
> --
>
> Key: SPARK-10056
> URL: https://issues.apache.org/jira/browse/SPARK-10056
> Project: Spark
>  Issue Type: Improvement
>Reporter: Maciej Bryński
>
> Right now you can get a Row element:
> - by column name: row.columnName
> - by index: row[index] (where index is an integer)
> My proposal is to add the following syntax:
> row["columnName"]
> It would be a solution similar to the DataFrame behaviour and should be quite easy 
> to implement with the __getitem__ method.






[jira] [Commented] (SPARK-9705) outdated Python 3 and IPython information

2015-08-17 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699983#comment-14699983
 ] 

Davies Liu commented on SPARK-9705:
---

The PyLab thing is already fixed by https://github.com/apache/spark/pull/5111, 
I will send out a PR to fix others.

> outdated Python 3 and IPython information
> -
>
> Key: SPARK-9705
> URL: https://issues.apache.org/jira/browse/SPARK-9705
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Piotr Migdał
>Priority: Blocker
>  Labels: documentation
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> https://issues.apache.org/jira/browse/SPARK-4897 adds Python 3.4 support to 1.4.0 
> and above, but the official docs (for 1.4.1, and likewise for 1.4.0) say explicitly:
> "Spark 1.4.1 works with Python 2.6 or higher (but not Python 3)."
> Affected:
> https://spark.apache.org/docs/1.4.0/programming-guide.html
> https://spark.apache.org/docs/1.4.1/programming-guide.html
> There are some other Python-related things, which are outdated, e.g. this 
> line:
> "For example, to launch the IPython Notebook with PyLab plot support:"
> (At least since IPython 3.0 PyLab/Matplotlib support happens inside a 
> notebook; and the line "--pylab inline" is already removed.)






[jira] [Commented] (SPARK-9705) outdated Python 3 and IPython information

2015-08-17 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699973#comment-14699973
 ] 

Davies Liu commented on SPARK-9705:
---

As the exception says, PySpark cannot run with different minor versions; you need to 
use the same minor version of Python, for example Python 3.4 on both the driver and 
the workers.

By default, PySpark uses the default `python` found in PATH. If the driver and the 
workers have different default Pythons, you need to set PYSPARK_PYTHON to tell PySpark 
which Python to use. For example:

{code}
PYSPARK_PYTHON=the_path_of_python_3.4  bin/spark-submit xxx 
{code}
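
The worker interpreter can also be chosen from inside a driver script, as long as it 
is set before the SparkContext is created and the driver itself is already running the 
matching Python version (the path below is a placeholder):

{code}
import os
from pyspark import SparkConf, SparkContext

# Must be set before the SparkContext is created; placeholder path.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.4"   # interpreter used by the workers

sc = SparkContext(conf=SparkConf().setAppName("same-python-everywhere"))
{code}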

> outdated Python 3 and IPython information
> -
>
> Key: SPARK-9705
> URL: https://issues.apache.org/jira/browse/SPARK-9705
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Piotr Migdał
>Priority: Blocker
>  Labels: documentation
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> https://issues.apache.org/jira/browse/SPARK-4897 adds Python 3.4 support to 1.4.0 
> and above, but the official docs (for 1.4.1, and likewise for 1.4.0) say explicitly:
> "Spark 1.4.1 works with Python 2.6 or higher (but not Python 3)."
> Affected:
> https://spark.apache.org/docs/1.4.0/programming-guide.html
> https://spark.apache.org/docs/1.4.1/programming-guide.html
> There are some other Python-related things, which are outdated, e.g. this 
> line:
> "For example, to launch the IPython Notebook with PyLab plot support:"
> (At least since IPython 3.0 PyLab/Matplotlib support happens inside a 
> notebook; and the line "--pylab inline" is already removed.)






[jira] [Closed] (SPARK-9822) Update doc about supported Python versions

2015-08-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-9822.
-
Resolution: Duplicate

> Update doc about supported Python versions
> --
>
> Key: SPARK-9822
> URL: https://issues.apache.org/jira/browse/SPARK-9822
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Not sure if this is the right place to post this, but the documentation on the 
> official website appears to be outdated. For example, for Spark 1.4.0 and 1.4.1 this 
> paragraph (Python tab) seems particularly misleading. Also, the last line of this 
> paragraph doesn't mention Python 3 support. Maybe there are other places.
> https://github.com/apache/spark/pull/5173#issuecomment-129903372






[jira] [Created] (SPARK-10059) Broken test: YarnClusterSuite

2015-08-17 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10059:
--

 Summary: Broken test: YarnClusterSuite
 Key: SPARK-10059
 URL: https://issues.apache.org/jira/browse/SPARK-10059
 Project: Spark
  Issue Type: Test
Reporter: Davies Liu
Priority: Critical


This test failed every time:  
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/116/testReport/junit/org.apache.spark.deploy.yarn/YarnClusterSuite/_It_is_not_a_test_/history/

{code}
Error Message

java.io.IOException: ResourceManager failed to start. Final state is STOPPED
Stacktrace

sbt.ForkMain$ForkError: java.io.IOException: ResourceManager failed to start. 
Final state is STOPPED
at 
org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:302)
at 
org.apache.hadoop.yarn.server.MiniYARNCluster.access$500(MiniYARNCluster.java:87)
at 
org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:422)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:104)
at 
org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
at 
org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:46)
at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
at 
org.apache.spark.deploy.yarn.YarnClusterSuite.run(YarnClusterSuite.scala:46)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
at sbt.ForkMain$Run$2.call(ForkMain.java:294)
at sbt.ForkMain$Run$2.call(ForkMain.java:284)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: sbt.ForkMain$ForkError: ResourceManager failed to start. Final state 
is STOPPED
at 
org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:297)
... 18 more
{code}






[jira] [Created] (SPARK-10058) Flaky test: HeartbeatReceiverSuite: normal heartbeat

2015-08-17 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10058:
--

 Summary: Flaky test: HeartbeatReceiverSuite: normal heartbeat
 Key: SPARK-10058
 URL: https://issues.apache.org/jira/browse/SPARK-10058
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Reporter: Davies Liu
Assignee: Andrew Or


https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/

{code}
Error Message

3 did not equal 2
Stacktrace

sbt.ForkMain$ForkError: 3 did not equal 2
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
at 
org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply$mcV$sp(HeartbeatReceiverSuite.scala:104)
at 
org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply(HeartbeatReceiverSuite.scala:97)
at 
org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply(HeartbeatReceiverSuite.scala:97)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at 
org.apache.spark.HeartbeatReceiverSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HeartbeatReceiverSuite.scala:41)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
at 
org.apache.spark.HeartbeatReceiverSuite.runTest(HeartbeatReceiverSuite.scala:41)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at 
org.apache.spark.HeartbeatReceiverSuite.org$scalatest$BeforeAndAfterAll$$super$run(HeartbeatReceiverSuite.scala:41)
at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
at 
org.apache.spark.HeartbeatReceiverSuite.run(HeartbeatReceiverSuite.scala:41)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
at sbt.ForkMain$Run$2.call(ForkMain.java:294)
at sbt.ForkMain$Run$2.call(ForkMain.java:284)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

This is really flaky (it fails about 30% of the time):
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/Heartbe

[jira] [Commented] (SPARK-9982) SparkR DataFrame fail to return data of Decimal type

2015-08-17 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699861#comment-14699861
 ] 

Davies Liu commented on SPARK-9982:
---

R has numeric (similar to double) but no decimal type; is it OK to turn all 
DecimalType values into doubles in SparkR? cc [~shivavam]

> SparkR DataFrame fail to return data of Decimal type
> 
>
> Key: SPARK-9982
> URL: https://issues.apache.org/jira/browse/SPARK-9982
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.1
>Reporter: Alex Shkurenko
>
> Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but 
> with the Decimal datatype coming from a Postgres DB:
> //Set up SparkR
> >Sys.setenv(SPARK_HOME="/Users/ashkurenko/work/git_repos/spark")
> >Sys.setenv(SPARKR_SUBMIT_ARGS="--driver-class-path 
> >~/Downloads/postgresql-9.4-1201.jdbc4.jar sparkr-shell")
> >.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> >library(SparkR)
> >sc <- sparkR.init(master="local")
> // Connect to a Postgres DB via JDBC
> >sqlContext <- sparkRSQL.init(sc)
> >sql(sqlContext, "
> CREATE TEMPORARY TABLE mytable 
> USING org.apache.spark.sql.jdbc 
> OPTIONS (url 'jdbc:postgresql://servername:5432/dbname'
> ,dbtable 'mydbtable'
> )
> ")
> // Try pulling a Decimal column from a table
> >myDataFrame <- sql(sqlContext,("select a_decimal_column  from mytable "))
> // The schema shows up fine
> >show(myDataFrame)
> DataFrame[a_decimal_column:decimal(10,0)]
> >schema(myDataFrame)
> StructType
> |-name = "a_decimal_column", type = "DecimalType(10,0)", nullable = TRUE
> // ... but pulling data fails:
> localDF <- collect(myDataFrame)
> Error in as.data.frame.default(x[[i]], optional = TRUE) : 
>   cannot coerce class ""jobj"" to a data.frame






[jira] [Comment Edited] (SPARK-9982) SparkR DataFrame fail to return data of Decimal type

2015-08-17 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699861#comment-14699861
 ] 

Davies Liu edited comment on SPARK-9982 at 8/17/15 5:21 PM:


R has numeric (similar to double) but no decimal type; is it OK to turn all 
DecimalType values into doubles in SparkR? cc @Shivaram Venkataraman


was (Author: davies):
R has numeric (similar to double), no decimal, is it ok to turn all the 
DecimalType into Double in SparkR ? cc [~shivavam]

> SparkR DataFrame fail to return data of Decimal type
> 
>
> Key: SPARK-9982
> URL: https://issues.apache.org/jira/browse/SPARK-9982
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.1
>Reporter: Alex Shkurenko
>
> Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but 
> with the Decimal datatype coming from a Postgres DB:
> //Set up SparkR
> >Sys.setenv(SPARK_HOME="/Users/ashkurenko/work/git_repos/spark")
> >Sys.setenv(SPARKR_SUBMIT_ARGS="--driver-class-path 
> >~/Downloads/postgresql-9.4-1201.jdbc4.jar sparkr-shell")
> >.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> >library(SparkR)
> >sc <- sparkR.init(master="local")
> // Connect to a Postgres DB via JDBC
> >sqlContext <- sparkRSQL.init(sc)
> >sql(sqlContext, "
> CREATE TEMPORARY TABLE mytable 
> USING org.apache.spark.sql.jdbc 
> OPTIONS (url 'jdbc:postgresql://servername:5432/dbname'
> ,dbtable 'mydbtable'
> )
> ")
> // Try pulling a Decimal column from a table
> >myDataFrame <- sql(sqlContext,("select a_decimal_column  from mytable "))
> // The schema shows up fine
> >show(myDataFrame)
> DataFrame[a_decimal_column:decimal(10,0)]
> >schema(myDataFrame)
> StructType
> |-name = "a_decimal_column", type = "DecimalType(10,0)", nullable = TRUE
> // ... but pulling data fails:
> localDF <- collect(myDataFrame)
> Error in as.data.frame.default(x[[i]], optional = TRUE) : 
>   cannot coerce class ""jobj"" to a data.frame



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9427) Add expression functions in SparkR

2015-08-17 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699843#comment-14699843
 ] 

Davies Liu commented on SPARK-9427:
---

[~yu_ishikawa] `rand` does work in PySpark (Python 2.7):
{code}
>>> sqlContext.range(10).select(rand(2), "id").show()
+---+---+
| rand()| id|
+---+---+
| 0.6038577325006693|  0|
| 0.6319470735268434|  1|
|0.22327628846133507|  2|
|0.24223739932588373|  3|
| 0.8395518879513995|  4|
| 0.5662927043813443|  5|
| 0.2057736041310516|  6|
| 0.3408245196642603|  7|
|0.08641290347537589|  8|
|0.46561147527615276|  9|
+---+---+
{code}
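
For completeness, the snippet above assumes `rand` is already imported; a self-contained version (an existing SQLContext is assumed) would be:
{code}
from pyspark.sql.functions import rand

# rand(2) uses a fixed seed, so the generated column is reproducible.
sqlContext.range(10).select(rand(2), "id").show()
{code}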

> Add expression functions in SparkR
> --
>
> Key: SPARK-9427
> URL: https://issues.apache.org/jira/browse/SPARK-9427
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> The list of functions to add is based on Spark SQL's functions, and it would 
> be better to add them in a single PR.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10057) Fail to load class org.slf4j.impl.StaticLoggerBinder

2015-08-17 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10057:
--

 Summary: Fail to load class org.slf4j.impl.StaticLoggerBinder
 Key: SPARK-10057
 URL: https://issues.apache.org/jira/browse/SPARK-10057
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Davies Liu



Some log messages are dropped because the class 
"org.slf4j.impl.StaticLoggerBinder" cannot be loaded:
{code}
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9734) java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC

2015-08-14 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698121#comment-14698121
 ] 

Davies Liu commented on SPARK-9734:
---

[~rmullapudi] It's fixed by https://github.com/apache/spark/pull/8219
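
For users stuck on an affected 1.4.x release, one hedged workaround sketch (written in PySpark by analogy with the Scala snippet quoted below; `t`, `sal`, and `comm` come from the report) is to cast the decimal columns to double before writing, so the JDBC writer never sees DecimalType:
{code}
from pyspark.sql.functions import col

# Cast the numeric(7,2) columns to double before the JDBC/Redshift write;
# note this trades exact decimal semantics for floating point.
t1 = t.withColumn("sal", col("sal").cast("double")) \
      .withColumn("comm", col("comm").cast("double"))
{code}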

> java.lang.IllegalArgumentException: Don't know how to save 
> StructField(sal,DecimalType(7,2),true) to JDBC
> -
>
> Key: SPARK-9734
> URL: https://issues.apache.org/jira/browse/SPARK-9734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Greg Rahn
>Assignee: Davies Liu
> Fix For: 1.5.0
>
>
> When using a basic example of reading the EMP table from Redshift via 
> spark-redshift and writing the data back to Redshift, Spark fails with the 
> error below, related to numeric/decimal data types.
> Redshift table:
> {code}
> testdb=# \d emp
>   Table "public.emp"
>   Column  | Type  | Modifiers
> --+---+---
>  empno| integer   |
>  ename| character varying(10) |
>  job  | character varying(9)  |
>  mgr  | integer   |
>  hiredate | date  |
>  sal  | numeric(7,2)  |
>  comm | numeric(7,2)  |
>  deptno   | integer   |
> testdb=# select * from emp;
>  empno | ename  |job| mgr  |  hiredate  |   sal   |  comm   | deptno
> ---++---+--++-+-+
>   7369 | SMITH  | CLERK | 7902 | 1980-12-17 |  800.00 |NULL | 20
>   7521 | WARD   | SALESMAN  | 7698 | 1981-02-22 | 1250.00 |  500.00 | 30
>   7654 | MARTIN | SALESMAN  | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30
>   7782 | CLARK  | MANAGER   | 7839 | 1981-06-09 | 2450.00 |NULL | 10
>   7839 | KING   | PRESIDENT | NULL | 1981-11-17 | 5000.00 |NULL | 10
>   7876 | ADAMS  | CLERK | 7788 | 1983-01-12 | 1100.00 |NULL | 20
>   7902 | FORD   | ANALYST   | 7566 | 1981-12-03 | 3000.00 |NULL | 20
>   7499 | ALLEN  | SALESMAN  | 7698 | 1981-02-20 | 1600.00 |  300.00 | 30
>   7566 | JONES  | MANAGER   | 7839 | 1981-04-02 | 2975.00 |NULL | 20
>   7698 | BLAKE  | MANAGER   | 7839 | 1981-05-01 | 2850.00 |NULL | 30
>   7788 | SCOTT  | ANALYST   | 7566 | 1982-12-09 | 3000.00 |NULL | 20
>   7844 | TURNER | SALESMAN  | 7698 | 1981-09-08 | 1500.00 |0.00 | 30
>   7900 | JAMES  | CLERK | 7698 | 1981-12-03 |  950.00 |NULL | 30
>   7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 |NULL | 10
> (14 rows)
> {code}
> Spark Code:
> {code}
> val url = "jdbc:redshift://rshost:5439/testdb?user=xxx&password=xxx"
> val driver = "com.amazon.redshift.jdbc41.Driver"
> val t = 
> sqlContext.read.format("com.databricks.spark.redshift").option("jdbcdriver", 
> driver).option("url", url).option("dbtable", "emp").option("tempdir", 
> "s3n://spark-temp-dir").load()
> t.registerTempTable("SparkTempTable")
> val t1 = sqlContext.sql("select * from SparkTempTable")
> t1.write.format("com.databricks.spark.redshift").option("driver", 
> driver).option("url", url).option("dbtable", "t1").option("tempdir", 
> "s3n://spark-temp-dir").option("avrocompression", 
> "snappy").mode("error").save()
> {code}
> Error Stack:
> {code}
> java.lang.IllegalArgumentException: Don't know how to save 
> StructField(sal,DecimalType(7,2),true) to JDBC
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:149)
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:136)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:135)
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:132)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.schemaString(jdbc.scala:132)
>   at 
> org.apache.spark.sql.jdbc.JDBCWrapper.schemaString(RedshiftJDBCWrapper.scala:28)
>   at 
> com.databricks.spark.redshift.RedshiftWriter.createTableSql(RedshiftWriter.scala:39)
>   at 
> com.databricks.spark.redshift.RedshiftWriter.doRedshiftLoad(RedshiftWriter.scala:105)
>   at 
> com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:145)
>   at 
> com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:92)
>   at org.apache.spark.sql.sources.ResolvedD

[jira] [Resolved] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9725.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8210
[https://github.com/apache/spark/pull/8210]

> spark sql query string field return empty/garbled string
> 
>
> Key: SPARK-9725
> URL: https://issues.apache.org/jira/browse/SPARK-9725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Fei Wang
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.0
>
>
> To reproduce it:
> 1. Deploy Spark in cluster mode (I use standalone mode locally).
> 2. Set executor memory >= 32g, e.g. the following config in spark-default.xml:
>    spark.executor.memory  36g
> 3. Run spark-sql.sh with "show tables"; it returns an empty/garbled string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9734) java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC

2015-08-14 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697998#comment-14697998
 ] 

Davies Liu commented on SPARK-9734:
---

[~rmullapudi] Good catch! I will send out a quick fix!

> java.lang.IllegalArgumentException: Don't know how to save 
> StructField(sal,DecimalType(7,2),true) to JDBC
> -
>
> Key: SPARK-9734
> URL: https://issues.apache.org/jira/browse/SPARK-9734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Greg Rahn
>Assignee: Davies Liu
> Fix For: 1.5.0
>
>
> When using a basic example of reading the EMP table from Redshift via 
> spark-redshift and writing the data back to Redshift, Spark fails with the 
> error below, related to numeric/decimal data types.
> Redshift table:
> {code}
> testdb=# \d emp
>   Table "public.emp"
>   Column  | Type  | Modifiers
> --+---+---
>  empno| integer   |
>  ename| character varying(10) |
>  job  | character varying(9)  |
>  mgr  | integer   |
>  hiredate | date  |
>  sal  | numeric(7,2)  |
>  comm | numeric(7,2)  |
>  deptno   | integer   |
> testdb=# select * from emp;
>  empno | ename  |job| mgr  |  hiredate  |   sal   |  comm   | deptno
> ---++---+--++-+-+
>   7369 | SMITH  | CLERK | 7902 | 1980-12-17 |  800.00 |NULL | 20
>   7521 | WARD   | SALESMAN  | 7698 | 1981-02-22 | 1250.00 |  500.00 | 30
>   7654 | MARTIN | SALESMAN  | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30
>   7782 | CLARK  | MANAGER   | 7839 | 1981-06-09 | 2450.00 |NULL | 10
>   7839 | KING   | PRESIDENT | NULL | 1981-11-17 | 5000.00 |NULL | 10
>   7876 | ADAMS  | CLERK | 7788 | 1983-01-12 | 1100.00 |NULL | 20
>   7902 | FORD   | ANALYST   | 7566 | 1981-12-03 | 3000.00 |NULL | 20
>   7499 | ALLEN  | SALESMAN  | 7698 | 1981-02-20 | 1600.00 |  300.00 | 30
>   7566 | JONES  | MANAGER   | 7839 | 1981-04-02 | 2975.00 |NULL | 20
>   7698 | BLAKE  | MANAGER   | 7839 | 1981-05-01 | 2850.00 |NULL | 30
>   7788 | SCOTT  | ANALYST   | 7566 | 1982-12-09 | 3000.00 |NULL | 20
>   7844 | TURNER | SALESMAN  | 7698 | 1981-09-08 | 1500.00 |0.00 | 30
>   7900 | JAMES  | CLERK | 7698 | 1981-12-03 |  950.00 |NULL | 30
>   7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 |NULL | 10
> (14 rows)
> {code}
> Spark Code:
> {code}
> val url = "jdbc:redshift://rshost:5439/testdb?user=xxx&password=xxx"
> val driver = "com.amazon.redshift.jdbc41.Driver"
> val t = 
> sqlContext.read.format("com.databricks.spark.redshift").option("jdbcdriver", 
> driver).option("url", url).option("dbtable", "emp").option("tempdir", 
> "s3n://spark-temp-dir").load()
> t.registerTempTable("SparkTempTable")
> val t1 = sqlContext.sql("select * from SparkTempTable")
> t1.write.format("com.databricks.spark.redshift").option("driver", 
> driver).option("url", url).option("dbtable", "t1").option("tempdir", 
> "s3n://spark-temp-dir").option("avrocompression", 
> "snappy").mode("error").save()
> {code}
> Error Stack:
> {code}
> java.lang.IllegalArgumentException: Don't know how to save 
> StructField(sal,DecimalType(7,2),true) to JDBC
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:149)
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:136)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:135)
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:132)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.schemaString(jdbc.scala:132)
>   at 
> org.apache.spark.sql.jdbc.JDBCWrapper.schemaString(RedshiftJDBCWrapper.scala:28)
>   at 
> com.databricks.spark.redshift.RedshiftWriter.createTableSql(RedshiftWriter.scala:39)
>   at 
> com.databricks.spark.redshift.RedshiftWriter.doRedshiftLoad(RedshiftWriter.scala:105)
>   at 
> com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:145)
>   at 
> com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:92)
>   at org.apache.spark.sql.sources.ResolvedDataSource$.a

[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697868#comment-14697868
 ] 

Davies Liu commented on SPARK-9971:
---

We had a long discussion about how to support NaN, but didn't find a good 
approach from other databases to follow (they all handle it differently), so we 
came up with a plan; see https://issues.apache.org/jira/browse/SPARK-9079.

For (4), NaN in aggregation, we have not decided yet. The current behavior is 
not consistent across functions:

{code}
>>> sf = sqlContext.createDataFrame([(1.0,), (float('nan'),)], ['f'])
>>> sf.selectExpr('min(f)', 'max(f)', 'sum(f)', 'avg(f)', 'count(f)').show()
+---+---+---+---+-+
|'min(f)|'max(f)|'sum(f)|'avg(f)|'count(f)|
+---+---+---+---+-+
|1.0|NaN|NaN|NaN|2|
+---+---+---+---+-+
{code}

cc [~yhuai] [~rxin] [~joshrosen]
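
Until the semantics from SPARK-9079 are settled, one hedged workaround sketch is to drop NaN rows before aggregating. This assumes `isnan` from pyspark.sql.functions is available (it is in recent PySpark releases) and reuses `sf` from the snippet above:
{code}
from pyspark.sql.functions import col, isnan, min as min_, max as max_

# Filter out NaN rows first so min/max only ever compare real numbers.
sf.where(~isnan(col('f'))).select(min_('f'), max_('f')).show()
{code}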

> MaxFunction not working correctly with columns containing Double.NaN
> 
>
> Key: SPARK-9971
> URL: https://issues.apache.org/jira/browse/SPARK-9971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Frank Rosner
>Priority: Minor
>
> h4. Problem Description
> When using the {{max}} function on a {{DoubleType}} column that contains 
> {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 
> This is because it compares all values with the running maximum. However, 
> {{x < Double.NaN}} always evaluates to false for every {{x: Double}}, as does 
> {{x > Double.NaN}}.
> h4. How to Reproduce
> {code}
> import org.apache.spark.sql.{SQLContext, Row}
> import org.apache.spark.sql.types._
> val sql = new SQLContext(sc)
> val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
> val dataFrame = sql.createDataFrame(rdd, StructType(List(
>   StructField("col", DoubleType, false)
> )))
> dataFrame.select(max("col")).first
> // returns org.apache.spark.sql.Row = [NaN]
> {code}
> h4. Solution
> The {{max}} and {{min}} functions should ignore NaN values, as they are not 
> numbers. If a column contains only NaN values, then the maximum and minimum 
> are not defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9978) Window functions require partitionBy to work as expected

2015-08-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-9978:
-

Assignee: Davies Liu

> Window functions require partitionBy to work as expected
> 
>
> Key: SPARK-9978
> URL: https://issues.apache.org/jira/browse/SPARK-9978
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Maciej Szymkiewicz
>Assignee: Davies Liu
>
> I am trying to reproduce the following SQL query:
> {code}
> df.registerTempTable("df")
> sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM 
> df").show()
> ++--+
> |   x|rn|
> ++--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> ++--+
> {code}
> using the PySpark API. Unfortunately, it doesn't work as expected:
> {code}
> from pyspark.sql.window import Window
> from pyspark.sql.functions import rowNumber
> df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
> df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
> ++--+
> |   x|rn|
> ++--+
> | 0.5| 1|
> |0.25| 1|
> |0.75| 1|
> ++--+
> {code}
> As a workaround, it is possible to call partitionBy without additional 
> arguments:
> {code}
> df.select(
> df["x"],
> rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
> ).show()
> ++--+
> |   x|rn|
> ++--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> ++--+
> {code}
> but as far as I can tell this is not documented and is rather counterintuitive 
> given the SQL syntax. Moreover, this problem doesn't affect the Scala API:
> {code}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.rowNumber
> case class Record(x: Double)
> val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: 
> Record(0.75) :: Nil)
> df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show
> ++--+
> |   x|rn|
> ++--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> ++--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9589) Flaky test: HiveCompatibilitySuite.groupby8

2015-08-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9589.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8177
[https://github.com/apache/spark/pull/8177]

> Flaky test: HiveCompatibilitySuite.groupby8
> ---
>
> Key: SPARK-9589
> URL: https://issues.apache.org/jira/browse/SPARK-9589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39662/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/groupby8/
> {code}
> sbt.ForkMain$ForkError: 
> Failed to execute query using catalyst:
> Error: Job aborted due to stage failure: Task 24 in stage 3081.0 failed 1 
> times, most recent failure: Lost task 24.0 in stage 3081.0 (TID 14919, 
> localhost): java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:226)
>   at 
> org.apache.spark.unsafe.map.BytesToBytesMap$Location.updateAddressesAndSizes(BytesToBytesMap.java:366)
>   at 
> org.apache.spark.unsafe.map.BytesToBytesMap$Location.putNewKey(BytesToBytesMap.java:600)
>   at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.getAggregationBuffer(UnsafeFixedWidthAggregationMap.java:134)
>   at 
> org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.initialize(UnsafeHybridAggregationIterator.scala:276)
>   at 
> org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.(UnsafeHybridAggregationIterator.scala:290)
>   at 
> org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator$.createFromInputIterator(UnsafeHybridAggregationIterator.scala:358)
>   at 
> org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:130)
>   at 
> org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9946) NPE in TaskMemoryManager

2015-08-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9946.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8177
[https://github.com/apache/spark/pull/8177]

> NPE in TaskMemoryManager
> 
>
> Key: SPARK-9946
> URL: https://issues.apache.org/jira/browse/SPARK-9946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.0
>
>
> {code}
> Failed to execute query using catalyst:
> [info]   Error: Job aborted due to stage failure: Task 6 in stage 6801.0 
> failed 1 times, most recent failure: Lost task 6.0 in stage 6801.0 (TID 
> 41123, localhost): java.lang.NullPointerException
> [info]at 
> org.apache.spark.unsafe.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:226)
> [info]at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:165)
> [info]at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:142)
> [info]at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:129)
> [info]at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84)
> [info]at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:302)
> [info]at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:218)
> [info]at 
> org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:110)
> [info]at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
> [info]at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> [info]at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> [info]at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:99)
> [info]at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:88)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> 13:27:17.435 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 23.0 
> in stage 6801.0 (TID 41140, localhost): TaskKilled (killed intentionally)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 13:27:17.436 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 12.0 
> in stage 6801.0 (TID 41129, localhost): TaskKilled (killed intentionally)
> [info]at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9589) Flaky test: HiveCompatibilitySuite.groupby8

2015-08-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-9589:
-

Assignee: Davies Liu  (was: Josh Rosen)

> Flaky test: HiveCompatibilitySuite.groupby8
> ---
>
> Key: SPARK-9589
> URL: https://issues.apache.org/jira/browse/SPARK-9589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39662/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/groupby8/
> {code}
> sbt.ForkMain$ForkError: 
> Failed to execute query using catalyst:
> Error: Job aborted due to stage failure: Task 24 in stage 3081.0 failed 1 
> times, most recent failure: Lost task 24.0 in stage 3081.0 (TID 14919, 
> localhost): java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:226)
>   at 
> org.apache.spark.unsafe.map.BytesToBytesMap$Location.updateAddressesAndSizes(BytesToBytesMap.java:366)
>   at 
> org.apache.spark.unsafe.map.BytesToBytesMap$Location.putNewKey(BytesToBytesMap.java:600)
>   at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.getAggregationBuffer(UnsafeFixedWidthAggregationMap.java:134)
>   at 
> org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.initialize(UnsafeHybridAggregationIterator.scala:276)
>   at 
> org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator.(UnsafeHybridAggregationIterator.scala:290)
>   at 
> org.apache.spark.sql.execution.aggregate.UnsafeHybridAggregationIterator$.createFromInputIterator(UnsafeHybridAggregationIterator.scala:358)
>   at 
> org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:130)
>   at 
> org.apache.spark.sql.execution.aggregate.Aggregate$$anonfun$doExecute$1$$anonfun$5.apply(Aggregate.scala:121)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-14 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697426#comment-14697426
 ] 

Davies Liu edited comment on SPARK-9725 at 8/14/15 5:36 PM:


I can reproduce this issue now (<32G for executor, > 32G for driver).

Many things are broken:
{code}
scala> sqlContext.sql("select 'abc'").collect()
res0: Array[org.apache.spark.sql.Row] = Array([???])

scala> sqlContext.range(10).registerTempTable("t")

scala> sqlContext.sql("select 'abc',id from t").collect()
res2: Array[org.apache.spark.sql.Row] = Array([???,0], [???,1], [???,2], 
[???,3], [???,4], [???,5], [???,6], [???,7], [???,8], [???,9])
{code}


was (Author: davies):
I can reproduce this issue now (<32G for executor, > 32G for driver).

> spark sql query string field return empty/garbled string
> 
>
> Key: SPARK-9725
> URL: https://issues.apache.org/jira/browse/SPARK-9725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Fei Wang
>Assignee: Davies Liu
>Priority: Blocker
>
> To reproduce it:
> 1. Deploy Spark in cluster mode (I use standalone mode locally).
> 2. Set executor memory >= 32g, e.g. the following config in spark-default.xml:
>    spark.executor.memory  36g
> 3. Run spark-sql.sh with "show tables"; it returns an empty/garbled string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


