[jira] [Created] (SPARK-17763) JacksonParser silently parses null as 0 when the field is not nullable

2016-10-03 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-17763:


 Summary: JacksonParser silently parses null as 0 when the field is 
not nullable
 Key: SPARK-17763
 URL: https://issues.apache.org/jira/browse/SPARK-17763
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon


It seems that when {{JacksonParser}} parses JSON for a field that is not nullable, 
it silently parses null as 0. For example,

{code}
val testJson = """{"nullInt":null}""" :: Nil
val testSchema = StructType(StructField("nullInt", IntegerType, false) :: Nil)
val data = spark.sparkContext.parallelize(testJson)
spark.read.schema(testSchema).json(data).show()
{code}

shows

{code}
+---+
|nullInt|
+---+
|  0|
+---+
{code}

This will also affect {{from_json}} function.
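
A minimal sketch of the {{from_json}} case (assuming the {{from_json}} function added in SPARK-17699; the behaviour for the non-nullable field is an assumption that mirrors the example above):

{code}
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val df = Seq("""{"nullInt":null}""").toDF("json")
// With the same non-nullable testSchema as above, the parsed struct field would
// presumably come back as 0 as well, rather than null or a parse failure.
df.select(from_json($"json", testSchema)).show()
{code}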




[jira] [Assigned] (SPARK-17763) JacksonParser silently parses null as 0 when the field is not nullable

2016-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17763:


Assignee: Apache Spark

> JacksonParser silently parses null as 0 when the field is not nullable
> --
>
> Key: SPARK-17763
> URL: https://issues.apache.org/jira/browse/SPARK-17763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> It seems that when {{JacksonParser}} parses JSON for a field that is not nullable, 
> it silently parses null as 0. For example,
> {code}
> val testJson = """{"nullInt":null}""" :: Nil
> val testSchema = StructType(StructField("nullInt", IntegerType, false) :: Nil)
> val data = spark.sparkContext.parallelize(testJson)
> spark.read.schema(testSchema).json(data).show()
> {code}
> shows
> {code}
> +---+
> |nullInt|
> +---+
> |  0|
> +---+
> {code}
> This will also affect {{from_json}} function.




[jira] [Assigned] (SPARK-17763) JacksonParser silently parses null as 0 when the field is not nullable

2016-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17763:


Assignee: (was: Apache Spark)

> JacksonParser silently parses null as 0 when the field is not nullable
> --
>
> Key: SPARK-17763
> URL: https://issues.apache.org/jira/browse/SPARK-17763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> It seems that when {{JacksonParser}} parses JSON for a field that is not nullable, 
> it silently parses null as 0. For example,
> {code}
> val testJson = """{"nullInt":null}""" :: Nil
> val testSchema = StructType(StructField("nullInt", IntegerType, false) :: Nil)
> val data = spark.sparkContext.parallelize(testJson)
> spark.read.schema(testSchema).json(data).show()
> {code}
> shows
> {code}
> +---+
> |nullInt|
> +---+
> |  0|
> +---+
> {code}
> This will also affect {{from_json}} function.




[jira] [Commented] (SPARK-17763) JacksonParser silently parses null as 0 when the field is not nullable

2016-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15541825#comment-15541825
 ] 

Apache Spark commented on SPARK-17763:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15329

> JacksonParser silently parses null as 0 when the field is not nullable
> --
>
> Key: SPARK-17763
> URL: https://issues.apache.org/jira/browse/SPARK-17763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> It seems that when {{JacksonParser}} parses JSON for a field that is not nullable, 
> it silently parses null as 0. For example,
> {code}
> val testJson = """{"nullInt":null}""" :: Nil
> val testSchema = StructType(StructField("nullInt", IntegerType, false) :: Nil)
> val data = spark.sparkContext.parallelize(testJson)
> spark.read.schema(testSchema).json(data).show()
> {code}
> shows
> {code}
> +---+
> |nullInt|
> +---+
> |  0|
> +---+
> {code}
> This will also affect {{from_json}} function.




[jira] [Created] (SPARK-17764) to_json function for parsing Structs to json Strings

2016-10-03 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-17764:


 Summary: to_json function for parsing Structs to json Strings
 Key: SPARK-17764
 URL: https://issues.apache.org/jira/browse/SPARK-17764
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon


After SPARK-17699, Spark now supports {{from_json}}. It might be nicer if we 
also had {{to_json}}, in particular for writing out dataframes with data 
sources that do not support nested structured types.
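
A hypothetical usage sketch (the {{to_json}} name and signature are only the proposal here, mirroring {{from_json}}; the path and column names are illustrative):

{code}
import org.apache.spark.sql.functions.struct
import spark.implicits._

val df = Seq((1, "a", 2.0)).toDF("id", "b", "c")
// Collapse the nested columns into a single JSON string so that a flat data
// source such as CSV can still store the row.
df.select($"id", to_json(struct($"b", $"c")).as("payload"))
  .write.csv("/tmp/out")
{code}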




[jira] [Commented] (SPARK-17764) to_json function for parsing Structs to json Strings

2016-10-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15541842#comment-15541842
 ] 

Hyukjin Kwon commented on SPARK-17764:
--

Hi [~marmbrus], if no one is working on this and it seems worth adding, 
I'd like to proceed with it. I would appreciate it if you could give me some feedback.

> to_json function for parsing Structs to json Strings
> 
>
> Key: SPARK-17764
> URL: https://issues.apache.org/jira/browse/SPARK-17764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> After SPARK-17699, Spark now supports {{from_json}}. It might be nicer if we 
> also had {{to_json}}, in particular for writing out dataframes with data 
> sources that do not support nested structured types.




[jira] [Commented] (SPARK-17734) inner equi-join shorthand that returns Datasets, like DataFrame already has

2016-10-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15541870#comment-15541870
 ] 

Dongjoon Hyun commented on SPARK-17734:
---

Hi, [~pdxleif].
What do you think about the following?
{code}
table1.join(table2, table1.col("value") === table2.col("value"))
{code}
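
For keeping the result typed, a minimal sketch using the existing {{Dataset.joinWith}} API (assuming {{table1}} and {{table2}} are {{Dataset[Rec]}} for some case class {{Rec}}):

{code}
import org.apache.spark.sql.Dataset

case class Rec(value: String, payload: String)

// joinWith keeps both sides typed and returns Dataset[(Rec, Rec)] instead of a DataFrame.
val joined: Dataset[(Rec, Rec)] =
  table1.joinWith(table2, table1.col("value") === table2.col("value"), "inner")
{code}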

> inner equi-join shorthand that returns Datasets, like DataFrame already has
> ---
>
> Key: SPARK-17734
> URL: https://issues.apache.org/jira/browse/SPARK-17734
> Project: Spark
>  Issue Type: Wish
>Reporter: Leif Warner
>Priority: Minor
>
> There's an existing ".join(right: Dataset[_], usingColumn: String): 
> DataFrame" method on Dataset.
> Would appreciate it if a variant that returns typed Datasets were also 
> available.
> If you write a join that references the common column name, you get an 
> AnalysisException because that's ambiguous, e.g.:
> $"foo" === $"foo"
> So I wrote table1.toDF()("foo") === table2.toDF()("foo"), but that's a little 
> error-prone, and coworkers considered it a hack and didn't want to use it, 
> because it "mixes the DataFrame and Dataset APIs".




[jira] [Commented] (SPARK-17465) Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak

2016-10-03 Thread Xing Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15541946#comment-15541946
 ] 

Xing Shi commented on SPARK-17465:
--

Resolved.

In every task, method _currentUnrollMemory_ will be called several times. It 
will scan all keys of _unrollMemoryMap_ and _ pendingUnrollMemoryMap_, so the 
processing time is proportional to the map size.
https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L540-L542

I have checked the processing time of _currentUnrollMemory_. It just equals to 
the time increased from before.

Hope this will help someone who has a similar issue of increasing processing 
time when upgrade Spark to 1.6.0 :)

> Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may 
> lead to memory leak
> -
>
> Key: SPARK-17465
> URL: https://issues.apache.org/jira/browse/SPARK-17465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Xing Shi
>Assignee: Xing Shi
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> After updating Spark from 1.5.0 to 1.6.0, I found what seems to be a memory 
> leak in my Spark streaming application.
> Here is the head of the heap histogram of my application, which has been 
> running for about 160 hours:
> {code:borderStyle=solid}
>  num #instances #bytes  class name
> --
>1: 28094   71753976  [B
>2:   1188086   28514064  java.lang.Long
>3:   1183844   28412256  scala.collection.mutable.DefaultEntry
>4:102242   13098768  
>5:102242   12421000  
>6:  81849199032  
>7:388391584  [Lscala.collection.mutable.HashEntry;
>8:  81847514288  
>9:  66514874080  
>   10: 371973438040  [C
>   11:  64232445640  
>   12:  87731044808  java.lang.Class
>   13: 36869 884856  java.lang.String
>   14: 15715 848368  [[I
>   15: 13690 782808  [S
>   16: 18903 604896  
> java.util.concurrent.ConcurrentHashMap$HashEntry
>   17:13 426192  [Lscala.concurrent.forkjoin.ForkJoinTask;
> {code}
> It shows that *scala.collection.mutable.DefaultEntry* and *java.lang.Long* 
> have unexpectedly large numbers of instances. In fact, the numbers started 
> growing when the streaming process began, and keep growing in proportion to 
> the total number of tasks.
> After some further investigation, I found that the problem is caused by some 
> inappropriate memory management in _releaseUnrollMemoryForThisTask_ and 
> _unrollSafely_ method of class 
> [org.apache.spark.storage.MemoryStore|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala].
> In Spark 1.6.x, a _releaseUnrollMemoryForThisTask_ operation will be 
> processed only with the parameter _memoryToRelease_ > 0:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L530-L537
> But in fact, if a task successfully unrolled all its blocks in memory by 
> _unrollSafely_ method, the memory saved in _unrollMemoryMap_ would be set to 
> zero:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L322
> So the result is that the memory saved in _unrollMemoryMap_ will be released, but 
> the key for that part of memory will never be removed from the hash map. The 
> hash table keeps growing as new tasks keep coming in. Although the rate of 
> growth is comparatively slow (about dozens of bytes per task), it can 
> result in an OOM after weeks or months.
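
A minimal sketch of the leak pattern described above (an assumed simplification, not the actual {{MemoryStore}} code):

{code}
import scala.collection.mutable

// Simplified stand-in for unrollMemoryMap: taskAttemptId -> reserved unroll memory.
val unrollMemoryMap = mutable.HashMap[Long, Long]()

def releaseUnrollMemoryForThisTask(taskId: Long, memoryToRelease: Long): Unit = {
  // The 1.6.x guard described in the report: nothing at all happens when memoryToRelease is 0 ...
  if (memoryToRelease > 0) {
    val remaining = unrollMemoryMap.getOrElse(taskId, 0L) - memoryToRelease
    if (remaining <= 0) unrollMemoryMap.remove(taskId)
    else unrollMemoryMap(taskId) = remaining
  }
  // ... so once unrollSafely has already set the entry to 0, the zero-valued key
  // is never removed, and the map grows by roughly one entry per task.
}
{code}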




[jira] [Comment Edited] (SPARK-17465) Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak

2016-10-03 Thread Xing Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15541946#comment-15541946
 ] 

Xing Shi edited comment on SPARK-17465 at 10/3/16 9:17 AM:
---

Resolved.

In every task, method _currentUnrollMemory_ will be called several times. It 
will scan all keys of _unrollMemoryMap_ and _pendingUnrollMemoryMap_, so the 
processing time is proportional to the map size.
https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L540-L542

I have checked the processing time of _currentUnrollMemory_. It just equals to 
the time increased from before.

Hope this will help someone who has a similar issue of increasing processing 
time when upgrade Spark to 1.6.0 :)


was (Author: saturday_s):
Resolved.

In every task, method _currentUnrollMemory_ will be called several times. It 
will scan all keys of _unrollMemoryMap_ and _ pendingUnrollMemoryMap_ , so the 
processing time is proportional to the map size.
https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L540-L542

I have checked the processing time of _currentUnrollMemory_. It just equals to 
the time increased from before.

Hope this will help someone who has a similar issue of increasing processing 
time when upgrade Spark to 1.6.0 :)

> Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may 
> lead to memory leak
> -
>
> Key: SPARK-17465
> URL: https://issues.apache.org/jira/browse/SPARK-17465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Xing Shi
>Assignee: Xing Shi
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> After updating Spark from 1.5.0 to 1.6.0, I found what seems to be a memory 
> leak in my Spark streaming application.
> Here is the head of the heap histogram of my application, which has been 
> running for about 160 hours:
> {code:borderStyle=solid}
>  num #instances #bytes  class name
> --
>1: 28094   71753976  [B
>2:   1188086   28514064  java.lang.Long
>3:   1183844   28412256  scala.collection.mutable.DefaultEntry
>4:102242   13098768  
>5:102242   12421000  
>6:  81849199032  
>7:388391584  [Lscala.collection.mutable.HashEntry;
>8:  81847514288  
>9:  66514874080  
>   10: 371973438040  [C
>   11:  64232445640  
>   12:  87731044808  java.lang.Class
>   13: 36869 884856  java.lang.String
>   14: 15715 848368  [[I
>   15: 13690 782808  [S
>   16: 18903 604896  
> java.util.concurrent.ConcurrentHashMap$HashEntry
>   17:13 426192  [Lscala.concurrent.forkjoin.ForkJoinTask;
> {code}
> It shows that *scala.collection.mutable.DefaultEntry* and *java.lang.Long* 
> have unexpectedly large numbers of instances. In fact, the numbers started 
> growing when the streaming process began, and keep growing in proportion to 
> the total number of tasks.
> After some further investigation, I found that the problem is caused by some 
> inappropriate memory management in _releaseUnrollMemoryForThisTask_ and 
> _unrollSafely_ method of class 
> [org.apache.spark.storage.MemoryStore|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala].
> In Spark 1.6.x, a _releaseUnrollMemoryForThisTask_ operation will be 
> processed only with the parameter _memoryToRelease_ > 0:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L530-L537
> But in fact, if a task successfully unrolled all its blocks in memory by 
> _unrollSafely_ method, the memory saved in _unrollMemoryMap_ would be set to 
> zero:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L322
> So the result is that the memory saved in _unrollMemoryMap_ will be released, but 
> the key for that part of memory will never be removed from the hash map. The 
> hash table keeps growing as new tasks keep coming in. Although the rate of 
> growth is comparatively slow (about dozens of bytes per task), it can 
> result in an OOM after weeks or months.




[jira] [Comment Edited] (SPARK-17465) Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak

2016-10-03 Thread Xing Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15541946#comment-15541946
 ] 

Xing Shi edited comment on SPARK-17465 at 10/3/16 9:16 AM:
---

Resolved.

In every task, method _currentUnrollMemory_ will be called several times. It 
will scan all keys of _unrollMemoryMap_ and _ pendingUnrollMemoryMap_ , so the 
processing time is proportional to the map size.
https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L540-L542

I have checked the processing time of _currentUnrollMemory_. It just equals to 
the time increased from before.

Hope this will help someone who has a similar issue of increasing processing 
time when upgrade Spark to 1.6.0 :)


was (Author: saturday_s):
Resolved.

In every task, method _currentUnrollMemory_ will be called several times. It 
will scan all keys of _unrollMemoryMap_ and _ pendingUnrollMemoryMap_, so the 
processing time is proportional to the map size.
https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L540-L542

I have checked the processing time of _currentUnrollMemory_. It just equals to 
the time increased from before.

Hope this will help someone who has a similar issue of increasing processing 
time when upgrade Spark to 1.6.0 :)

> Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may 
> lead to memory leak
> -
>
> Key: SPARK-17465
> URL: https://issues.apache.org/jira/browse/SPARK-17465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Xing Shi
>Assignee: Xing Shi
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> After updating Spark from 1.5.0 to 1.6.0, I found what seems to be a memory 
> leak in my Spark streaming application.
> Here is the head of the heap histogram of my application, which has been 
> running for about 160 hours:
> {code:borderStyle=solid}
>  num #instances #bytes  class name
> --
>1: 28094   71753976  [B
>2:   1188086   28514064  java.lang.Long
>3:   1183844   28412256  scala.collection.mutable.DefaultEntry
>4:102242   13098768  
>5:102242   12421000  
>6:  81849199032  
>7:388391584  [Lscala.collection.mutable.HashEntry;
>8:  81847514288  
>9:  66514874080  
>   10: 371973438040  [C
>   11:  64232445640  
>   12:  87731044808  java.lang.Class
>   13: 36869 884856  java.lang.String
>   14: 15715 848368  [[I
>   15: 13690 782808  [S
>   16: 18903 604896  
> java.util.concurrent.ConcurrentHashMap$HashEntry
>   17:13 426192  [Lscala.concurrent.forkjoin.ForkJoinTask;
> {code}
> It shows that *scala.collection.mutable.DefaultEntry* and *java.lang.Long* 
> have unexpectedly large numbers of instances. In fact, the numbers started 
> growing when the streaming process began, and keep growing in proportion to 
> the total number of tasks.
> After some further investigation, I found that the problem is caused by some 
> inappropriate memory management in _releaseUnrollMemoryForThisTask_ and 
> _unrollSafely_ method of class 
> [org.apache.spark.storage.MemoryStore|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala].
> In Spark 1.6.x, a _releaseUnrollMemoryForThisTask_ operation will be 
> processed only with the parameter _memoryToRelease_ > 0:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L530-L537
> But in fact, if a task successfully unrolled all its blocks in memory by 
> _unrollSafely_ method, the memory saved in _unrollMemoryMap_ would be set to 
> zero:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L322
> So the result is that the memory saved in _unrollMemoryMap_ will be released, but 
> the key for that part of memory will never be removed from the hash map. The 
> hash table keeps growing as new tasks keep coming in. Although the rate of 
> growth is comparatively slow (about dozens of bytes per task), it can 
> result in an OOM after weeks or months.




[jira] [Resolved] (SPARK-17598) User-friendly name for Spark Thrift Server in web UI

2016-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17598.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15268
[https://github.com/apache/spark/pull/15268]

> User-friendly name for Spark Thrift Server in web UI
> 
>
> Key: SPARK-17598
> URL: https://issues.apache.org/jira/browse/SPARK-17598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.0.1
>Reporter: Jacek Laskowski
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: spark-thriftserver-webui.png
>
>
> The name of Spark Thrift JDBC/ODBC Server in web UI reflects the name of the 
> class, i.e. {{org.apache.spark.sql.hive.thrift.HiveThriftServer2}}. We could 
> do better and call it {{Thrift JDBC/ODBC Server}} (like {{Spark shell}} for 
> spark-shell).
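
Illustrative only (not the actual patch): the friendlier label would be set the same way {{Spark shell}} gets its name, by overriding the default application name before the context is created, e.g.:

{code}
import org.apache.spark.SparkConf

// Instead of defaulting to the fully qualified class name.
val conf = new SparkConf().setAppName("Thrift JDBC/ODBC Server")
{code}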




[jira] [Updated] (SPARK-17598) User-friendly name for Spark Thrift Server in web UI

2016-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17598:
--
Assignee: Alex Bozarth
Priority: Trivial  (was: Minor)

> User-friendly name for Spark Thrift Server in web UI
> 
>
> Key: SPARK-17598
> URL: https://issues.apache.org/jira/browse/SPARK-17598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.0.1
>Reporter: Jacek Laskowski
>Assignee: Alex Bozarth
>Priority: Trivial
> Fix For: 2.1.0
>
> Attachments: spark-thriftserver-webui.png
>
>
> The name of Spark Thrift JDBC/ODBC Server in web UI reflects the name of the 
> class, i.e. {{org.apache.spark.sql.hive.thrift.HiveThriftServer2}}. We could 
> do better and call it {{Thrift JDBC/ODBC Server}} (like {{Spark shell}} for 
> spark-shell).




[jira] [Commented] (SPARK-13288) [1.6.0] Memory leak in Spark streaming

2016-10-03 Thread Xing Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15541970#comment-15541970
 ] 

Xing Shi commented on SPARK-13288:
--

This problem is quite similar to 
[SPARK-17465|https://issues.apache.org/jira/browse/SPARK-17465]. Maybe they are 
caused by the same underlying issue.

> [1.6.0] Memory leak in Spark streaming
> --
>
> Key: SPARK-13288
> URL: https://issues.apache.org/jira/browse/SPARK-13288
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Bare metal cluster
> RHEL 6.6
>Reporter: JESSE CHEN
>  Labels: streaming
>
> Streaming in 1.6 seems to have a memory leak.
> Running the same streaming app in Spark 1.5.1 and 1.6, all things equal, 1.6 
> showed a gradually increasing processing time. 
> The app is simple: 1 Kafka receiver of a tweet stream and 20 executors 
> processing the tweets in 5-second batches. 
> Spark 1.5.0 handled this smoothly and did not show increasing processing time 
> in the 40-minute test, but 1.6 showed increasing time about 8 minutes into 
> the test. Please see the chart here:
> https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116
> I captured heap dumps in the two versions and did a comparison. I noticed that 
> the byte array class ([B) uses 50X more space in 1.5.1.
> Here are some top classes in heap histogram and references. 
> Heap Histogram
>   
> All Classes (excluding platform)  
>   1.6.0 Streaming 1.5.1 Streaming 
> Class Instance Count  Total Size  Class   Instance Count  Total 
> Size
> class [B  84533,227,649,599   class [B5095
> 62,938,466
> class [C  44682   4,255,502   class [C130482  
> 12,844,182
> class java.lang.reflect.Method90591,177,670   class 
> java.lang.String  130171  1,562,052
>   
>   
> References by TypeReferences by Type  
>   
> class [B [0x640039e38]class [B [0x6c020bb08]  
> 
>   
> Referrers by Type Referrers by Type   
>   
> Class Count   Class   Count   
> java.nio.HeapByteBuffer   3239
> sun.security.util.DerInputBuffer1233
> sun.security.util.DerInputBuffer  1233
> sun.security.util.ObjectIdentifier  620 
> sun.security.util.ObjectIdentifier620 [[B 397 
> [Ljava.lang.Object;   408 java.lang.reflect.Method
> 326 
> 
> The total size by class B is 3GB in 1.5.1 and only 60MB in 1.6.0.
> The Java.nio.HeapByteBuffer referencing class did not show up in top in 
> 1.5.1. 
> I have also placed jstack output for 1.5.1 and 1.6.0 online. You can get them 
> here:
> https://ibm.box.com/sparkstreaming-jstack160
> https://ibm.box.com/sparkstreaming-jstack151
> Jesse 




[jira] [Updated] (SPARK-17718) Make loss function formulation label note clearer in MLlib docs

2016-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17718:
--
Summary: Make loss function formulation label note clearer in MLlib docs  
(was: Update MLib Classification Documentation )

> Make loss function formulation label note clearer in MLlib docs
> ---
>
> Key: SPARK-17718
> URL: https://issues.apache.org/jira/browse/SPARK-17718
> Project: Spark
>  Issue Type: Documentation
>Reporter: Tobi Bosede
>Priority: Minor
>
> https://spark.apache.org/docs/1.6.0/mllib-linear-methods.html#mjx-eqn-eqregPrimal
> The loss function here for logistic regression is confusing. It seems to 
> imply that Spark uses only -1 and 1 class labels. However, it uses 0 and 1. The 
> note below needs to make this point more visible to avoid confusion.
> "Note that, in the mathematical formulation in this guide, a binary label
> y is denoted as either +1 (positive) or −1 (negative), which is convenient
> for the formulation. However, the negative label is represented by 0 in
> spark.mllib instead of −1, to be consistent with multiclass labeling."
> Better yet, the loss function should be replaced with that for 0, 1 despite 
> mathematical inconvenience, since that is what is actually implemented. 
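
For reference, the two equivalent logistic-loss formulations being contrasted are (standard identities, not quoted from the docs themselves):

{code}
L(w; x, y) = log(1 + exp(-y * w^T x))                          for labels y in {-1, +1}

L(w; x, y) = -[ y * log(p) + (1 - y) * log(1 - p) ],
             p = 1 / (1 + exp(-w^T x)),                        for labels y in {0, 1}
{code}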




[jira] [Updated] (SPARK-17718) Make loss function formulation label note clearer in MLlib docs

2016-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17718:
--
Assignee: Sean Owen
Priority: Trivial  (was: Minor)

> Make loss function formulation label note clearer in MLlib docs
> ---
>
> Key: SPARK-17718
> URL: https://issues.apache.org/jira/browse/SPARK-17718
> Project: Spark
>  Issue Type: Documentation
>Reporter: Tobi Bosede
>Assignee: Sean Owen
>Priority: Trivial
>
> https://spark.apache.org/docs/1.6.0/mllib-linear-methods.html#mjx-eqn-eqregPrimal
> The loss function here for logistic regression is confusing. It seems to 
> imply that Spark uses only -1 and 1 class labels. However, it uses 0 and 1. The 
> note below needs to make this point more visible to avoid confusion.
> "Note that, in the mathematical formulation in this guide, a binary label
> y is denoted as either +1 (positive) or −1 (negative), which is convenient
> for the formulation. However, the negative label is represented by 0 in
> spark.mllib instead of −1, to be consistent with multiclass labeling."
> Better yet, the loss function should be replaced with that for 0, 1 despite 
> mathematical inconvenience, since that is what is actually implemented. 




[jira] [Commented] (SPARK-17718) Make loss function formulation label note clearer in MLlib docs

2016-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542004#comment-15542004
 ] 

Apache Spark commented on SPARK-17718:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15330

> Make loss function formulation label note clearer in MLlib docs
> ---
>
> Key: SPARK-17718
> URL: https://issues.apache.org/jira/browse/SPARK-17718
> Project: Spark
>  Issue Type: Documentation
>Reporter: Tobi Bosede
>Assignee: Sean Owen
>Priority: Trivial
>
> https://spark.apache.org/docs/1.6.0/mllib-linear-methods.html#mjx-eqn-eqregPrimal
> The loss function here for logistic regression is confusing. It seems to 
> imply that Spark uses only -1 and 1 class labels. However, it uses 0 and 1. The 
> note below needs to make this point more visible to avoid confusion.
> "Note that, in the mathematical formulation in this guide, a binary label
> y is denoted as either +1 (positive) or −1 (negative), which is convenient
> for the formulation. However, the negative label is represented by 0 in
> spark.mllib instead of −1, to be consistent with multiclass labeling."
> Better yet, the loss function should be replaced with that for 0, 1 despite 
> mathematical inconvenience, since that is what is actually implemented. 




[jira] [Resolved] (SPARK-17736) Update R README for rmarkdown, pandoc

2016-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17736.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15309
[https://github.com/apache/spark/pull/15309]

> Update R README for rmarkdown, pandoc
> -
>
> Key: SPARK-17736
> URL: https://issues.apache.org/jira/browse/SPARK-17736
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
> Fix For: 2.1.0
>
>
> To build R docs (which are built when R tests are run), users need to install 
> pandoc and rmarkdown.  This was done for Jenkins in [SPARK-17420].
> We should document these dependencies here:
> [https://github.com/apache/spark/blob/master/docs/README.md#prerequisites]




[jira] [Updated] (SPARK-17736) Update R README for rmarkdown, pandoc

2016-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17736:
--
Assignee: Jagadeesan A S
Priority: Trivial  (was: Major)

> Update R README for rmarkdown, pandoc
> -
>
> Key: SPARK-17736
> URL: https://issues.apache.org/jira/browse/SPARK-17736
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Jagadeesan A S
>Priority: Trivial
> Fix For: 2.1.0
>
>
> To build R docs (which are built when R tests are run), users need to install 
> pandoc and rmarkdown.  This was done for Jenkins in [SPARK-17420].
> We should document these dependencies here:
> [https://github.com/apache/spark/blob/master/docs/README.md#prerequisites]




[jira] [Updated] (SPARK-17736) Update R README for rmarkdown, pandoc

2016-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17736:
--
Fix Version/s: 2.0.2

> Update R README for rmarkdown, pandoc
> -
>
> Key: SPARK-17736
> URL: https://issues.apache.org/jira/browse/SPARK-17736
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Jagadeesan A S
>Priority: Trivial
> Fix For: 2.0.2, 2.1.0
>
>
> To build R docs (which are built when R tests are run), users need to install 
> pandoc and rmarkdown.  This was done for Jenkins in [SPARK-17420].
> We should document these dependencies here:
> [https://github.com/apache/spark/blob/master/docs/README.md#prerequisites]




[jira] [Created] (SPARK-17765) org.apache.spark.mllib.linalg.VectorUDT cannot be cast to org.apache.spark.sql.types.StructType

2016-10-03 Thread Alexander Shorin (JIRA)
Alexander Shorin created SPARK-17765:


 Summary: org.apache.spark.mllib.linalg.VectorUDT cannot be cast to 
org.apache.spark.sql.types.StructType
 Key: SPARK-17765
 URL: https://issues.apache.org/jira/browse/SPARK-17765
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark, SQL
Affects Versions: 1.6.1
Reporter: Alexander Shorin


The issue in the subject happens when attempting to transform a DataFrame in Parquet 
format into ORC while the DF contains SparseVector data. A DF with DenseVector data 
transforms fine.

From the 
[sources|https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L192]
it looks like there shouldn't be any serialization issues, but they do happen.

{code}
In[4] pqtdf = hqlctx.read.parquet(pqt_feature)

In[5] pqtdf.take(1)
Out[5]: [Row(foo=u'abc', bar=SparseVector(100, {74: 1.0}))]

In[6]: pqtdf.write.format('orc').save('/tmp/orc')
---
Py4JJavaError Traceback (most recent call last)
 in ()
> pqtdf.write.format('orc').save('/tmp/orc')

/usr/local/share/spark/python/pyspark/sql/readwriter.pyc in save(self, path, 
format, mode, partitionBy, **options)
395 self._jwrite.save()
396 else:
--> 397 self._jwrite.save(path)
398 
399 @since(1.4)

/usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, 
*args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814 
815 for temp_arg in temp_args:

/usr/local/share/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
 43 def deco(*a, **kw):
 44 try:
---> 45 return f(*a, **kw)
 46 except py4j.protocol.Py4JJavaError as e:
 47 s = e.java_exception.toString()

/usr/local/lib/python2.7/site-packages/py4j/protocol.pyc in 
get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(

Py4JJavaError: An error occurred while calling o62.save.
: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(Ca

[jira] [Resolved] (SPARK-17512) Specifying remote files for Python based Spark jobs in Yarn cluster mode not working

2016-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17512.
---
   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.1.0
   2.0.1

Resolved by https://github.com/apache/spark/pull/15137

> Specifying remote files for Python based Spark jobs in Yarn cluster mode not 
> working
> 
>
> Key: SPARK-17512
> URL: https://issues.apache.org/jira/browse/SPARK-17512
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit, YARN
>Affects Versions: 2.0.0
>Reporter: Udit Mehrotra
>Assignee: Saisai Shao
> Fix For: 2.0.1, 2.1.0
>
>
> When I run a python application, and specify a remote path for the extra 
> files to be included in the PYTHON_PATH using the '--py-files' or 
> 'spark.submit.pyFiles' configuration option in YARN Cluster mode I get the 
> following error:
> Exception in thread "main" java.lang.IllegalArgumentException: Launching 
> Python applications through spark-submit is currently only supported for 
> local files: s3:///app.py
> at org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104)
> at 
> org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
> at 
> org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136)
> at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:636)
> at 
> org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:634)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:634)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
> Here are sample commands which would throw this error in Spark 2.0 
> (sparkApp.py requires app.py):
> spark-submit --deploy-mode cluster --py-files s3:///app.py 
> s3:///sparkApp.py (works fine in 1.6)
> spark-submit --deploy-mode cluster --conf 
> spark.submit.pyFiles=s3:///app.py s3:///sparkApp1.py (not working in 
> 1.6)
> This would work fine if app.py is downloaded locally and specified.
> This was working correctly using the '--py-files' option in earlier versions of 
> Spark, but not using the 'spark.submit.pyFiles' configuration option. But 
> now, it does not work either way.
> The following diff shows the comment which states that it should work with 
> ‘non-local’ paths for the YARN cluster mode, and we are specifically doing 
> separate validation to fail if YARN client mode is used with remote paths:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L309
> And then this code gets triggered at the end of each run, irrespective of 
> whether we are using Client or Cluster mode, and internally validates that 
> the paths should be non-local:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L634
> The above validation was not getting triggered in earlier versions of Spark 
> using the '--py-files' option because we were not storing the arguments passed to 
> '--py-files' in the 'spark.submit.pyFiles' configuration for YARN. However, 
> the following code was newly added in 2.0, which now stores it, and hence this 
> validation gets triggered even if we specify files through the '--py-files' option:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L545
> Also, we changed the logic in the YARN client to read values directly from 
> the 'spark.submit.pyFiles' configuration instead of from '--py-files' (earlier):
> https://github.com/apache/spark/commit/8ba2b7f28fee39c4839e5ea125bd25f5091a3a1e#diff-b050df3f55b82065803d6e83453b9706R543
> So now it's broken whether we use '--py-files' or 'spark.submit.pyFiles', as the 
> validation gets triggered in both cases irrespective of whether we use Client 
> or Cluster mode with YARN.




[jira] [Created] (SPARK-17766) Write ahead corruption on a toy project

2016-10-03 Thread Nadav Samet (JIRA)
Nadav Samet created SPARK-17766:
---

 Summary: Write ahead corruption on a toy project
 Key: SPARK-17766
 URL: https://issues.apache.org/jira/browse/SPARK-17766
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 2.0.0
Reporter: Nadav Samet


Write ahead log seems to get corrupted when the application is stopped abruptly 
(Ctrl-C, or kill). Then, the application refuses to run due to this exception:

{code}
2016-10-03 08:03:32,321 ERROR [Executor task launch worker-1] 
executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
13994
...skipping...
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

Code:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._

object ProtoDemo {
  def createContext(dirName: String) = {
val conf = new SparkConf().setAppName("mything").setMaster("local[4]")
conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
/*
conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", 
"true")
*/

val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint(dirName)
val lines = ssc.socketTextStream("127.0.0.1", )
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
val runningCounts = wordCounts.updateStateByKey[Int] {
  (values: Seq[Int], oldValue: Option[Int]) =>
val s = values.sum
Some(oldValue.fold(s)(_ + s))
  }

  // Print the first ten elements of each RDD generated in this DStream to the 
console
runningCounts.print()
ssc
  }

  def main(args: Array[String]) = {
val hadoopConf = new Configuration()
val dirName = "/tmp/chkp"
val ssc = StreamingContext.getOrCreate(dirName, () => 
createContext(dirName), hadoopConf)
ssc.start()
ssc.awaitTermination()
  }
}
{code}

Steps to reproduce:
1. I put the code in a repository: git clone 
https://github.com/thesamet/spark-issue
2. in one terminal: {{ while true; do nc -l localhost ; done}}
3. In another terminal "sbt run".
4. Type a few lines in the netcat terminal.
5. Kill the streaming project (Ctrl-C), 
6. Go back to step 2 until you see the exception above.

I tried the above with local filesystem and also with S3, and getting the same 
result.




[jira] [Created] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Alex Liu (JIRA)
Alex Liu created SPARK-17767:


 Summary: Spark SQL ExternalCatalog API custom implementation 
support
 Key: SPARK-17767
 URL: https://issues.apache.org/jira/browse/SPARK-17767
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Alex Liu


There is no way (or at least no easy way) to configure Spark to use a customized 
ExternalCatalog. The internal source code is hardcoded to use either the Hive or the 
in-memory metastore, and the Spark SQL thriftserver is hardcoded to use 
HiveExternalCatalog. We should be able to create a custom external catalog, and the 
thriftserver should be able to use it. Potentially, the Spark SQL thriftserver 
shouldn't depend on the Hive Thrift server.
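
A sketch of the hardcoded two-way dispatch being described (a simplification under the assumption that {{sparkConf}} is the active SparkConf; not the exact {{SharedState}} code):

{code}
// Spark 2.0 picks the external catalog class from a fixed internal setting,
// rather than accepting an arbitrary user-supplied class name.
val externalCatalogClassName = sparkConf.get("spark.sql.catalogImplementation", "in-memory") match {
  case "hive"      => "org.apache.spark.sql.hive.HiveExternalCatalog"
  case "in-memory" => "org.apache.spark.sql.catalyst.catalog.InMemoryCatalog"
}
{code}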






[jira] [Updated] (SPARK-17766) Write ahead log corruption on a toy project

2016-10-03 Thread Nadav Samet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nadav Samet updated SPARK-17766:

Summary: Write ahead log corruption on a toy project  (was: Write ahead 
corruption on a toy project)

> Write ahead log corruption on a toy project
> ---
>
> Key: SPARK-17766
> URL: https://issues.apache.org/jira/browse/SPARK-17766
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Nadav Samet
>
> Write ahead log seems to get corrupted when the application is stopped 
> abruptly (Ctrl-C, or kill). Then, the application refuses to run due to this 
> exception:
> {code}
> 2016-10-03 08:03:32,321 ERROR [Executor task launch worker-1] 
> executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
> ...skipping...
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Code:
> {code}
> import org.apache.hadoop.conf.Configuration
> import org.apache.spark._
> import org.apache.spark.streaming._
> object ProtoDemo {
>   def createContext(dirName: String) = {
> val conf = new SparkConf().setAppName("mything").setMaster("local[4]")
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
> /*
> conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", 
> "true")
> conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", 
> "true")
> */
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint(dirName)
> val lines = ssc.socketTextStream("127.0.0.1", )
> val words = lines.flatMap(_.split(" "))
> val pairs = words.map(word => (word, 1))
> val wordCounts = pairs.reduceByKey(_ + _)
> val runningCounts = wordCounts.updateStateByKey[Int] {
>   (values: Seq[Int], oldValue: Option[Int]) =>
> val s = values.sum
> Some(oldValue.fold(s)(_ + s))
>   }
>   // Print the first ten elements of each RDD generated in this DStream to 
> the console
> runningCounts.print()
> ssc
>   }
>   def main(args: Array[String]) = {
> val hadoopConf = new Configuration()
> val dirName = "/tmp/chkp"
> val ssc = StreamingContext.getOrCreate(dirName, () => 
> createContext(dirName), hadoopConf)
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}
> Steps to reproduce:
> 1. I put the code in a repository: git clone 
> https://github.com/thesamet/spark-issue
> 2. in one terminal: {{ while true; do nc -l localhost ; done}}
> 3. In another terminal "sbt run".
> 4. Type a few lines in the netcat terminal.
> 5. Kill the streaming project (Ctrl-C), 
> 6. Go back to step 2 until you see the exception above.
> I tried the above with local filesystem and also with S3, and getting the 
> same result.




[jira] [Updated] (SPARK-17766) Write ahead log corruption on a toy project

2016-10-03 Thread Nadav Samet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nadav Samet updated SPARK-17766:

Description: 
Write ahead log seems to get corrupted when the application is stopped abruptly 
(Ctrl-C, or kill). Then, the application refuses to run due to this exception:

{code}
2016-10-03 08:03:32,321 ERROR [Executor task launch worker-1] 
executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
13994
...skipping...
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

Code:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._

object ProtoDemo {
  def createContext(dirName: String) = {
val conf = new SparkConf().setAppName("mything").setMaster("local[4]")
conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
/*
conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", 
"true")
*/

val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint(dirName)
val lines = ssc.socketTextStream("127.0.0.1", )
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
val runningCounts = wordCounts.updateStateByKey[Int] {
  (values: Seq[Int], oldValue: Option[Int]) =>
val s = values.sum
Some(oldValue.fold(s)(_ + s))
  }

  // Print the first ten elements of each RDD generated in this DStream to the 
console
runningCounts.print()
ssc
  }

  def main(args: Array[String]) = {
val hadoopConf = new Configuration()
val dirName = "/tmp/chkp"
val ssc = StreamingContext.getOrCreate(dirName, () => 
createContext(dirName), hadoopConf)
ssc.start()
ssc.awaitTermination()
  }
}
{code}

Steps to reproduce:
1. I put the code in a repository: git clone 
https://github.com/thesamet/spark-issue
2. in one terminal: {{ while true; do nc -l localhost ; done}}
3. Start a new terminal
4. Run "sbt run".
5. Type a few lines in the netcat terminal.
6. Kill the streaming project (Ctrl-C), 
7. Go back to step 4 until you see the exception above.

I tried the above with local filesystem and also with S3, and getting the same 
result.

  was:
Write ahead log seems to get corrupted when the application is stopped abruptly 
(Ctrl-C, or kill). Then, the application refuses to run due to this exception:

{code}
2016-10-03 08:03:32,321 ERROR [Executor task launch worker-1] 
executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
13994
...skipping...
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.

[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-03 Thread Jacob Eisinger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542689#comment-15542689
 ] 

Jacob Eisinger commented on SPARK-17728:


Thanks for the explanation and the tricky code snippet!  I kind of figured it 
was over-optimizing / optimizing incorrectly.  It sounds like this is not a 
defect, because collapsing projects is normally the desired optimization.  
Correct?

Do you think it is worth filing a feature request to allow working with costly 
UDFs?  Possibly:
 * Memoize UDFs / other transforms on a per-row basis.
 * Manually override costs for UDFs.
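
For illustration, a minimal sketch of the caching workaround noted in the issue 
description below (the session setup, column names, and the {{Info}} struct are 
hypothetical, not taken from the attached notebook):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

case class Info(a: Int, b: Int)

object CostlyUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("costly-udf").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = spark.range(0, 5).toDF("x")

    // Stand-in for an expensive per-row call that returns a structure of results.
    val costly = udf((x: Long) => Info((x + 1).toInt, (x * 2).toInt))

    // Without cache(), selecting info.a and info.b can re-run the UDF for each derived
    // column once projects are collapsed; caching right after the UDF keeps it to
    // roughly once per row.
    val withInfo = df.withColumn("info", costly($"x")).cache()
    withInfo.count()                                   // materialize the cache
    withInfo.select($"x", $"info.a", $"info.b").show()

    spark.stop()
  }
}
{code}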

> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Consider longer-running processes that might run analytics or contact external 
> services from UDFs. The response might not just be a field, but instead a 
> structure of information. When attempting to break out this information, it 
> is critical that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17728) UDFs are run too many times

2016-10-03 Thread Jacob Eisinger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542689#comment-15542689
 ] 

Jacob Eisinger edited comment on SPARK-17728 at 10/3/16 3:37 PM:
-

Thanks for the explanation and the tricky code snippet!  I kind of figured it 
was over-optimizing / optimizing incorrectly.  It sounds like this is not a 
defect, because collapsing projects is normally the desired optimization.  
Correct?

Do you think it is worth filing a feature request to allow working with costly 
UDFs?  Possibly:
* Memoize UDFs / other transforms on a per-row basis.
* Manually override costs for UDFs.


was (Author: jeisinge):
Thanks for the explanation and the tricky code snippet!  I kind of figured it 
was optimizing incorrectly / over optimizing.  It sounds like this is not a 
defect because normally this optimization of collapsing projects is the desired 
route.  Correct?

Do you think it is worth filing a feature request to allow working with costly 
UDFs?  Possibly:
 * Memoize UDFs / other transforms on a per row basis.
 * Manually override costs for UDFs.

> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Consider longer-running processes that might run analytics or contact external 
> services from UDFs. The response might not just be a field, but instead a 
> structure of information. When attempting to break out this information, it 
> is critical that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17768) Small {Sum,Count,Mean}Evaluator problems and suboptimalities

2016-10-03 Thread Sean Owen (JIRA)
Sean Owen created SPARK-17768:
-

 Summary: Small {Sum,Count,Mean}Evaluator problems and 
suboptimalities
 Key: SPARK-17768
 URL: https://issues.apache.org/jira/browse/SPARK-17768
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Sean Owen
Assignee: Sean Owen


This tracks a few related issues with 
org.apache.spark.partial.(Count,Mean,Sum)Evaluator and their "Grouped" 
counterparts:

- GroupedMeanEvaluator and GroupedSumEvaluator are unused, as is the 
StudentTCacher support class
- CountEvaluator can return a lower bound < 0, even though counts can't be negative
- MeanEvaluator will actually fail on exactly 1 datum (yields a t-test with 0 degrees of freedom)
- CountEvaluator uses a normal distribution, which may be an inappropriate 
approximation (leading to the above)
- CountEvaluator, MeanEvaluator have no unit tests to catch these
- Duplication across CountEvaluator, GroupedCountEvaluator
- SumEvaluator might have an issue related to CountEvaluator (or could delegate 
to compute CountEvaluator times MeanEvaluator?)
- The stats in each could use a bit of documentation as I had to guess at them
- (Code could use a few cleanups and optimizations too)
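
For context, these evaluators back the approximate actions on RDDs. A minimal usage 
sketch (the object name and data here are made up; countApprox is the standard RDD API):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object PartialCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partial-count").setMaster("local[4]"))
    val rdd = sc.parallelize(1 to 1000000, 100)

    // Ask for an approximate count, waiting at most 200 ms, at 95% confidence.
    val partial = rdd.countApprox(200L, 0.95)
    val estimate = partial.initialValue   // BoundedDouble: mean plus a [low, high] interval

    // The low bound is the kind of value that, per the issue, can currently go below 0.
    println(s"count ~= ${estimate.mean}, bounds [${estimate.low}, ${estimate.high}]")

    sc.stop()
  }
}
{code}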





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17766) Write ahead log corruption on a toy project

2016-10-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542709#comment-15542709
 ] 

Sean Owen commented on SPARK-17766:
---

Where is the corruption here, though? This is just an exception you get while 
killing the stream. The bad thing is that the exception doesn't actually get 
deserialized by Kryo because it's not registered. You could temporarily turn 
off Kryo registration to see what's really going on.
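
For reference, a minimal sketch of the standard settings involved (whether either one 
actually applies to the write-ahead-log path is exactly what's being debugged in this 
thread):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("mything")
  .setMaster("local[4]")
  // Option 1: bypass Kryo entirely for data serialization.
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  // Option 2: keep Kryo but do not require classes to be registered up front
  // (this is already the default).
  .set("spark.kryo.registrationRequired", "false")
{code}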

> Write ahead log corruption on a toy project
> ---
>
> Key: SPARK-17766
> URL: https://issues.apache.org/jira/browse/SPARK-17766
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Nadav Samet
>
> Write ahead log seems to get corrupted when the application is stopped 
> abruptly (Ctrl-C, or kill). Then, the application refuses to run due to this 
> exception:
> {code}
> 2016-10-03 08:03:32,321 ERROR [Executor task launch worker-1] 
> executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
> ...skipping...
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Code:
> {code}
> import org.apache.hadoop.conf.Configuration
> import org.apache.spark._
> import org.apache.spark.streaming._
> object ProtoDemo {
>   def createContext(dirName: String) = {
> val conf = new SparkConf().setAppName("mything").setMaster("local[4]")
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
> /*
> conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", 
> "true")
> conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", 
> "true")
> */
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint(dirName)
> val lines = ssc.socketTextStream("127.0.0.1", )
> val words = lines.flatMap(_.split(" "))
> val pairs = words.map(word => (word, 1))
> val wordCounts = pairs.reduceByKey(_ + _)
> val runningCounts = wordCounts.updateStateByKey[Int] {
>   (values: Seq[Int], oldValue: Option[Int]) =>
> val s = values.sum
> Some(oldValue.fold(s)(_ + s))
>   }
>   // Print the first ten elements of each RDD generated in this DStream to 
> the console
> runningCounts.print()
> ssc
>   }
>   def main(args: Array[String]) = {
> val hadoopConf = new Configuration()
> val dirName = "/tmp/chkp"
> val ssc = StreamingContext.getOrCreate(dirName, () => 
> createContext(dirName), hadoopConf)
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}
> Steps to reproduce:
> 1. I put the code in a repository: git clone 
> https://github.com/thesamet/spark-issue
> 2. in one terminal: {{ while true; do nc -l localhost ; done}}
> 3. Start a new terminal
> 4. Run "sbt run".
> 5. Type a few lines in the netcat terminal.
> 6. Kill the streaming project (Ctrl-C), 
> 7. Go back to step 4 until you see the exception above.
> I tried the above with local filesystem and also with S3, and getting the 
> same result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17766) Write ahead log corruption on a toy project

2016-10-03 Thread Nadav Samet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542790#comment-15542790
 ] 

Nadav Samet commented on SPARK-17766:
-

Can you elaborate on how to turn off kryo registration?

I've been trying this but got the same result:

https://github.com/thesamet/spark-issue/commit/022efe8073b624ad2598eaf541e764c6f414ef84

The docs indicate that the default serializer is JavaSerializer - I'm not sure 
why Kryo is being used.


> Write ahead log corruption on a toy project
> ---
>
> Key: SPARK-17766
> URL: https://issues.apache.org/jira/browse/SPARK-17766
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Nadav Samet
>
> Write ahead log seems to get corrupted when the application is stopped 
> abruptly (Ctrl-C, or kill). Then, the application refuses to run due to this 
> exception:
> {code}
> 2016-10-03 08:03:32,321 ERROR [Executor task launch worker-1] 
> executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
> ...skipping...
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Code:
> {code}
> import org.apache.hadoop.conf.Configuration
> import org.apache.spark._
> import org.apache.spark.streaming._
> object ProtoDemo {
>   def createContext(dirName: String) = {
> val conf = new SparkConf().setAppName("mything").setMaster("local[4]")
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
> /*
> conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", 
> "true")
> conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", 
> "true")
> */
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint(dirName)
> val lines = ssc.socketTextStream("127.0.0.1", )
> val words = lines.flatMap(_.split(" "))
> val pairs = words.map(word => (word, 1))
> val wordCounts = pairs.reduceByKey(_ + _)
> val runningCounts = wordCounts.updateStateByKey[Int] {
>   (values: Seq[Int], oldValue: Option[Int]) =>
> val s = values.sum
> Some(oldValue.fold(s)(_ + s))
>   }
>   // Print the first ten elements of each RDD generated in this DStream to 
> the console
> runningCounts.print()
> ssc
>   }
>   def main(args: Array[String]) = {
> val hadoopConf = new Configuration()
> val dirName = "/tmp/chkp"
> val ssc = StreamingContext.getOrCreate(dirName, () => 
> createContext(dirName), hadoopConf)
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}
> Steps to reproduce:
> 1. I put the code in a repository: git clone 
> https://github.com/thesamet/spark-issue
> 2. in one terminal: {{ while true; do nc -l localhost ; done}}
> 3. Start a new terminal
> 4. Run "sbt run".
> 5. Type a few lines in the netcat terminal.
> 6. Kill the streaming project (Ctrl-C), 
> 7. Go back to step 4 until you see the exception above.
> I tried the above with local filesystem and also with S3, and getting the 
> same result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17766) Write ahead log corruption on a toy project

2016-10-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542799#comment-15542799
 ] 

Sean Owen commented on SPARK-17766:
---

Yes, you would need to turn it off. That may let the real exception show. I 
believe Kryo is always used in some contexts, like serializing primitives. You 
may also have some other environmental config turning it on? It's clearly using 
Kryo.

> Write ahead log corruption on a toy project
> ---
>
> Key: SPARK-17766
> URL: https://issues.apache.org/jira/browse/SPARK-17766
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Nadav Samet
>
> Write ahead log seems to get corrupted when the application is stopped 
> abruptly (Ctrl-C, or kill). Then, the application refuses to run due to this 
> exception:
> {code}
> 2016-10-03 08:03:32,321 ERROR [Executor task launch worker-1] 
> executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
> ...skipping...
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Code:
> {code}
> import org.apache.hadoop.conf.Configuration
> import org.apache.spark._
> import org.apache.spark.streaming._
> object ProtoDemo {
>   def createContext(dirName: String) = {
> val conf = new SparkConf().setAppName("mything").setMaster("local[4]")
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
> /*
> conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", 
> "true")
> conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", 
> "true")
> */
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint(dirName)
> val lines = ssc.socketTextStream("127.0.0.1", )
> val words = lines.flatMap(_.split(" "))
> val pairs = words.map(word => (word, 1))
> val wordCounts = pairs.reduceByKey(_ + _)
> val runningCounts = wordCounts.updateStateByKey[Int] {
>   (values: Seq[Int], oldValue: Option[Int]) =>
> val s = values.sum
> Some(oldValue.fold(s)(_ + s))
>   }
>   // Print the first ten elements of each RDD generated in this DStream to 
> the console
> runningCounts.print()
> ssc
>   }
>   def main(args: Array[String]) = {
> val hadoopConf = new Configuration()
> val dirName = "/tmp/chkp"
> val ssc = StreamingContext.getOrCreate(dirName, () => 
> createContext(dirName), hadoopConf)
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}
> Steps to reproduce:
> 1. I put the code in a repository: git clone 
> https://github.com/thesamet/spark-issue
> 2. in one terminal: {{ while true; do nc -l localhost ; done}}
> 3. Start a new terminal
> 4. Run "sbt run".
> 5. Type a few lines in the netcat terminal.
> 6. Kill the streaming project (Ctrl-C), 
> 7. Go back to step 4 until you see the exception above.
> I tried the above with local filesystem and also with S3, and getting the 
> same result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17074) generate histogram information for column

2016-10-03 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542830#comment-15542830
 ] 

Srinath commented on SPARK-17074:
-

IMO, if you can get reasonable error bounds (as Tim points out), the method with 
lower overhead is preferable. In general you can't rely on exact statistics 
during optimization anyway, since new data may have arrived since the last stats 
collection.
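
As a concrete reference for the equi-height case described in the quoted issue below: 
the bucket boundaries are essentially column quantiles, which can already be sketched 
with approxQuantile (the data and column name here are made up):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("equi-height-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1.0, 2.0, 2.5, 3.0, 7.0, 9.0, 15.0, 40.0).toDF("price")

// Equi-height: boundaries are quantiles, so each bucket holds roughly the same
// number of rows; a relative error of 0.0 requests exact quantiles.
val numBuckets = 4
val probabilities = (0 to numBuckets).map(_.toDouble / numBuckets).toArray
val boundaries = df.stat.approxQuantile("price", probabilities, 0.0)

println(boundaries.mkString(", "))
spark.stop()
{code}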

> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distribution. We use the 
> equi-height histogram when the number of distinct values is equal to or 
> greater than 254.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17749) Unresolved columns when nesting SQL join clauses

2016-10-03 Thread Andreas Damm (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542850#comment-15542850
 ] 

Andreas Damm commented on SPARK-17749:
--

Combining the two ON clauses into one by ANDing the conditions does indeed 
solve the problem.

> Unresolved columns when nesting SQL join clauses
> 
>
> Key: SPARK-17749
> URL: https://issues.apache.org/jira/browse/SPARK-17749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andreas Damm
>
> Given tables
> CREATE TABLE `sf_datedconversionrate2`(`isocode` string)
> CREATE TABLE `sf_opportunity2`(`currencyisocode` string, `accountid` string)
> CREATE TABLE `sf_account2`(`id` string)
> the following SQL will cause an analysis exception (cannot resolve 
> '`sf_opportunity.currencyisocode`' given input columns: [isocode, id])
> SELECT0 
> FROM  `sf_datedconversionrate2` AS `sf_datedconversionrate` 
> LEFT JOIN `sf_account2` AS `sf_account` 
> LEFT JOIN `sf_opportunity2` AS `sf_opportunity` 
> ON`sf_account`.`id` = `sf_opportunity`.`accountid` 
> ON`sf_datedconversionrate`.`isocode` = 
> `sf_opportunity`.`currencyisocode` 
> even though all columns referred to in the conditions should be in scope.
> Re-ordering the join and on clauses will make it work
> SELECT0 
> FROM  `sf_datedconversionrate2` AS `sf_datedconversionrate` 
> LEFT JOIN `sf_opportunity2` AS `sf_opportunity` 
> LEFT JOIN `sf_account2` AS `sf_account` 
> ON`sf_account`.`id` = `sf_opportunity`.`accountid` 
> ON`sf_datedconversionrate`.`isocode` = 
> `sf_opportunity`.`currencyisocode` 
> but the original should work also.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542862#comment-15542862
 ] 

Dongjoon Hyun commented on SPARK-17767:
---

+1

> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way (or at least no easy way) to configure Spark to use a customized 
> ExternalCatalog.  The internal source code is hardcoded to use either the Hive or 
> in-memory metastore. The Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog, and 
> the thriftserver should be able to use it. Potentially the Spark SQL thriftserver 
> shouldn't depend on the Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12985) Spark Hive thrift server big decimal data issue

2016-10-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542877#comment-15542877
 ] 

Dongjoon Hyun edited comment on SPARK-12985 at 10/3/16 5:05 PM:


Hi, [~alexliu68] and [~adrian-wang].
I came across this issue while reviewing some issues around 
`SparkExecuteStatementOperator.scala`.
If there is no problem with other drivers, it seems that we can close this 
issue as 'NOT A PROBLEM'.
What do you think?


was (Author: dongjoon):
Hi, [~alexliu68] and [~adrian-wang].
I reached here while I'm reviewing some issues about 
`SparkExecuteStatementOperator.scala`.
If there is no problem with other drivers, it seems that we can close this 
issue as 'NOT A PROBLE'.
How do you think about that?

> Spark Hive thrift server big decimal data issue
> ---
>
> Key: SPARK-12985
> URL: https://issues.apache.org/jira/browse/SPARK-12985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Alex Liu
>Priority: Minor
>
> I tested the trial version JDBC driver from Simba, it works for simple query. 
> But there is some issue with data mapping. e.g.
> {code}
> java.sql.SQLException: [Simba][SparkJDBCDriver](500312) Error in fetching 
> data rows: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   at 
> com.simba.spark.hivecommon.api.HS2Client.buildExceptionFromTStatus(Unknown 
> Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchNRows(Unknown Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchRows(Unknown Source)
>   at com.simba.spark.hivecommon.dataengine.BackgroundFetcher.run(Unknown 
> Source)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> Caused by: com.simba.spark.support.exceptions.GeneralException: 
> [Simba][SparkJDBCDriver](500312) Error in fetching data rows: 
> java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   ... 8 more
> {code}
> To fix it
> {code}
>case DecimalType() =>
>  -to += from.getDecimal(ordinal)
>  +to += HiveDecimal.create(from.getDecimal(ordinal))
> {code}
> to 
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L87



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12985) Spark Hive thrift server big decimal data issue

2016-10-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542877#comment-15542877
 ] 

Dongjoon Hyun commented on SPARK-12985:
---

Hi, [~alexliu68] and [~adrian-wang].
I came across this issue while reviewing some issues around 
`SparkExecuteStatementOperator.scala`.
If there is no problem with other drivers, it seems that we can close this 
issue as 'NOT A PROBLEM'.
What do you think?

> Spark Hive thrift server big decimal data issue
> ---
>
> Key: SPARK-12985
> URL: https://issues.apache.org/jira/browse/SPARK-12985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Alex Liu
>Priority: Minor
>
> I tested the trial version JDBC driver from Simba, it works for simple query. 
> But there is some issue with data mapping. e.g.
> {code}
> java.sql.SQLException: [Simba][SparkJDBCDriver](500312) Error in fetching 
> data rows: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   at 
> com.simba.spark.hivecommon.api.HS2Client.buildExceptionFromTStatus(Unknown 
> Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchNRows(Unknown Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchRows(Unknown Source)
>   at com.simba.spark.hivecommon.dataengine.BackgroundFetcher.run(Unknown 
> Source)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> Caused by: com.simba.spark.support.exceptions.GeneralException: 
> [Simba][SparkJDBCDriver](500312) Error in fetching data rows: 
> java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   ... 8 more
> {code}
> To fix it
> {code}
>case DecimalType() =>
>  -to += from.getDecimal(ordinal)
>  +to += HiveDecimal.create(from.getDecimal(ordinal))
> {code}
> to 
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L87



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17073) generate basic stats for column

2016-10-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17073.
-
   Resolution: Fixed
 Assignee: Zhenhua Wang
Fix Version/s: 2.1.0

> generate basic stats for column
> ---
>
> Key: SPARK-17073
> URL: https://issues.apache.org/jira/browse/SPARK-17073
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Zhenhua Wang
> Fix For: 2.1.0
>
>
> For a specified column, we need to generate basic stats including max, min, 
> number of nulls, number of distinct values, max column length, average column 
> length.
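
For illustration, the same statistics can be computed directly with DataFrame 
aggregates (a rough sketch on made-up data, independent of whatever the committed 
implementation does internally):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object BasicColumnStatsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("col-stats").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("spark", "sql", null, "spark").toDF("name")

    df.agg(
      max($"name").as("max"),
      min($"name").as("min"),
      count(when($"name".isNull, 1)).as("num_nulls"),
      countDistinct($"name").as("num_distinct"),
      max(length($"name")).as("max_col_len"),
      avg(length($"name")).as("avg_col_len")
    ).show()

    spark.stop()
  }
}
{code}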



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12985) Spark Hive thrift server big decimal data issue

2016-10-03 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542909#comment-15542909
 ] 

Alex Liu commented on SPARK-12985:
--

It's ok to close it, It could be fixed from Simba side.

> Spark Hive thrift server big decimal data issue
> ---
>
> Key: SPARK-12985
> URL: https://issues.apache.org/jira/browse/SPARK-12985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Alex Liu
>Priority: Minor
>
> I tested the trial version JDBC driver from Simba, it works for simple query. 
> But there is some issue with data mapping. e.g.
> {code}
> java.sql.SQLException: [Simba][SparkJDBCDriver](500312) Error in fetching 
> data rows: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   at 
> com.simba.spark.hivecommon.api.HS2Client.buildExceptionFromTStatus(Unknown 
> Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchNRows(Unknown Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchRows(Unknown Source)
>   at com.simba.spark.hivecommon.dataengine.BackgroundFetcher.run(Unknown 
> Source)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> Caused by: com.simba.spark.support.exceptions.GeneralException: 
> [Simba][SparkJDBCDriver](500312) Error in fetching data rows: 
> java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   ... 8 more
> {code}
> To fix it
> {code}
>case DecimalType() =>
>  -to += from.getDecimal(ordinal)
>  +to += HiveDecimal.create(from.getDecimal(ordinal))
> {code}
> to 
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L87



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError

2016-10-03 Thread Shreya Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542942#comment-15542942
 ] 

Shreya Agarwal commented on SPARK-5484:
---

Hi,

I am running a Spark 2.0 cluster and want to check if there is a way I can 
deploy this fix onto that. Also, it is kind of urgent :)

Regards,
Shreya

> Pregel should checkpoint periodically to avoid StackOverflowError
> -
>
> Key: SPARK-5484
> URL: https://issues.apache.org/jira/browse/SPARK-5484
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> Pregel-based iterative algorithms with more than ~50 iterations begin to slow 
> down and eventually fail with a StackOverflowError due to Spark's lack of 
> support for long lineage chains. Instead, Pregel should checkpoint the graph 
> periodically.
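
As a possible interim workaround, a rough sketch of manual periodic checkpointing 
around a Pregel-style loop (the step function, interval, and checkpoint directory are 
all hypothetical; only Graph.checkpoint and setCheckpointDir are standard API):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph

object PeriodicCheckpointSketch {
  def runIterations[VD, ED](sc: SparkContext,
                            initial: Graph[VD, ED],
                            numIters: Int,
                            checkpointInterval: Int = 25)
                           (step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
    sc.setCheckpointDir("/tmp/graph-checkpoints")   // hypothetical location
    var g = initial
    for (i <- 1 to numIters) {
      g = step(g)
      if (i % checkpointInterval == 0) {
        g.checkpoint()        // cut the lineage chain
        g.vertices.count()    // run actions so the checkpoint is materialized
        g.edges.count()
      }
    }
    g
  }
}
{code}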



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17766) Write ahead log corruption on a toy project

2016-10-03 Thread Nadav Samet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542956#comment-15542956
 ] 

Nadav Samet commented on SPARK-17766:
-

I don't believe there's anything environmental going on. In this standalone 
setup, I just run "sbt run" to get a local in-process standalone cluster. 
It's just default vanilla Spark (but it fails in the same way when I use 
spark-submit against a local standalone cluster).

> Write ahead log corruption on a toy project
> ---
>
> Key: SPARK-17766
> URL: https://issues.apache.org/jira/browse/SPARK-17766
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Nadav Samet
>
> Write ahead log seems to get corrupted when the application is stopped 
> abruptly (Ctrl-C, or kill). Then, the application refuses to run due to this 
> exception:
> {code}
> 2016-10-03 08:03:32,321 ERROR [Executor task launch worker-1] 
> executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
> ...skipping...
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Code:
> {code}
> import org.apache.hadoop.conf.Configuration
> import org.apache.spark._
> import org.apache.spark.streaming._
> object ProtoDemo {
>   def createContext(dirName: String) = {
> val conf = new SparkConf().setAppName("mything").setMaster("local[4]")
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
> /*
> conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", 
> "true")
> conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", 
> "true")
> */
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint(dirName)
> val lines = ssc.socketTextStream("127.0.0.1", )
> val words = lines.flatMap(_.split(" "))
> val pairs = words.map(word => (word, 1))
> val wordCounts = pairs.reduceByKey(_ + _)
> val runningCounts = wordCounts.updateStateByKey[Int] {
>   (values: Seq[Int], oldValue: Option[Int]) =>
> val s = values.sum
> Some(oldValue.fold(s)(_ + s))
>   }
>   // Print the first ten elements of each RDD generated in this DStream to 
> the console
> runningCounts.print()
> ssc
>   }
>   def main(args: Array[String]) = {
> val hadoopConf = new Configuration()
> val dirName = "/tmp/chkp"
> val ssc = StreamingContext.getOrCreate(dirName, () => 
> createContext(dirName), hadoopConf)
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}
> Steps to reproduce:
> 1. I put the code in a repository: git clone 
> https://github.com/thesamet/spark-issue
> 2. in one terminal: {{ while true; do nc -l localhost ; done}}
> 3. Start a new terminal
> 4. Run "sbt run".
> 5. Type a few lines in the netcat terminal.
> 6. Kill the streaming project (Ctrl-C), 
> 7. Go back to step 4 until you see the exception above.
> I tried the above with local filesystem and also with S3, and getting the 
> same result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17766) Write ahead log corruption on a toy project

2016-10-03 Thread Nadav Samet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542956#comment-15542956
 ] 

Nadav Samet edited comment on SPARK-17766 at 10/3/16 5:43 PM:
--

I don't believe there's anything environmental going on. In this standalone 
setup, I just run "sbt run" to get a local in-process standalone cluster. 
It's just default vanilla Spark (but it fails in the same way when I use 
spark-submit against a local standalone cluster).

This problem does not happen when I disable the write-ahead logs. Perhaps they 
use Kryo, but I am not aware of any knob to disable that.



was (Author: thesamet):
I don't believe there's anything environmental going on. In this standalone 
setup, I just run "sbt run" to have a local in-process standalone cluster.  
It's just default vanilla spark (but it fails in the same way when I use 
spark-submit to a local standalone cluster)

> Write ahead log corruption on a toy project
> ---
>
> Key: SPARK-17766
> URL: https://issues.apache.org/jira/browse/SPARK-17766
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Nadav Samet
>
> Write ahead log seems to get corrupted when the application is stopped 
> abruptly (Ctrl-C, or kill). Then, the application refuses to run due to this 
> exception:
> {code}
> 2016-10-03 08:03:32,321 ERROR [Executor task launch worker-1] 
> executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
> ...skipping...
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Code:
> {code}
> import org.apache.hadoop.conf.Configuration
> import org.apache.spark._
> import org.apache.spark.streaming._
> object ProtoDemo {
>   def createContext(dirName: String) = {
> val conf = new SparkConf().setAppName("mything").setMaster("local[4]")
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
> /*
> conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", 
> "true")
> conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", 
> "true")
> */
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint(dirName)
> val lines = ssc.socketTextStream("127.0.0.1", )
> val words = lines.flatMap(_.split(" "))
> val pairs = words.map(word => (word, 1))
> val wordCounts = pairs.reduceByKey(_ + _)
> val runningCounts = wordCounts.updateStateByKey[Int] {
>   (values: Seq[Int], oldValue: Option[Int]) =>
> val s = values.sum
> Some(oldValue.fold(s)(_ + s))
>   }
>   // Print the first ten elements of each RDD generated in this DStream to 
> the console
> runningCounts.print()
> ssc
>   }
>   def main(args: Array[String]) = {
> val hadoopConf = new Configuration()
> val dirName = "/tmp/chkp"
> val ssc = StreamingContext.getOrCreate(dirName, () => 
> createContext(dirName), hadoopConf)
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}
> Steps to reproduce:
> 1. I put the code in a repository: git clone 
> https://github.com/thesamet/spark-issue
> 2. in one terminal: {{ while true; do nc -l localhost ; done}}
> 3. Start a new terminal
> 4. Run "sbt run".
> 5. Type a few lines in the netcat terminal.
> 6. Kill the streaming project (Ctrl-C), 
> 7. Go back to step 4 until you see the exception above.
> I tried the above with local filesystem and also with S3, and getting the 
> same result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (SPARK-17749) Unresolved columns when nesting SQL join clauses

2016-10-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542964#comment-15542964
 ] 

Dongjoon Hyun commented on SPARK-17749:
---

Good for you!

> Unresolved columns when nesting SQL join clauses
> 
>
> Key: SPARK-17749
> URL: https://issues.apache.org/jira/browse/SPARK-17749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andreas Damm
>
> Given tables
> CREATE TABLE `sf_datedconversionrate2`(`isocode` string)
> CREATE TABLE `sf_opportunity2`(`currencyisocode` string, `accountid` string)
> CREATE TABLE `sf_account2`(`id` string)
> the following SQL will cause an analysis exception (cannot resolve 
> '`sf_opportunity.currencyisocode`' given input columns: [isocode, id])
> SELECT0 
> FROM  `sf_datedconversionrate2` AS `sf_datedconversionrate` 
> LEFT JOIN `sf_account2` AS `sf_account` 
> LEFT JOIN `sf_opportunity2` AS `sf_opportunity` 
> ON`sf_account`.`id` = `sf_opportunity`.`accountid` 
> ON`sf_datedconversionrate`.`isocode` = 
> `sf_opportunity`.`currencyisocode` 
> even though all columns referred to in the conditions should be in scope.
> Re-ordering the join and on clauses will make it work
> SELECT0 
> FROM  `sf_datedconversionrate2` AS `sf_datedconversionrate` 
> LEFT JOIN `sf_opportunity2` AS `sf_opportunity` 
> LEFT JOIN `sf_account2` AS `sf_account` 
> ON`sf_account`.`id` = `sf_opportunity`.`accountid` 
> ON`sf_datedconversionrate`.`isocode` = 
> `sf_opportunity`.`currencyisocode` 
> but the original should work also.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15777) Catalog federation

2016-10-03 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542995#comment-15542995
 ] 

Frederick Reiss commented on SPARK-15777:
-

Thanks for posting this design doc. Some comments from a quick read-through:
* I'm having some trouble following the description of optimizer rule 
pluggability in the current design doc. If the user is connected to multiple 
external data sources, will the optimizer rules for all those data sources be 
active all the time, even when running queries that only touch a single data 
source? If there are rules active from more than one external data source, in 
what order will those rules be evaluated? Will the order of evaluation (and 
hence the query plan) change if the user connects to the data sources in a 
different order?  Note that if you could post a link to the WIP implementation 
it would probably clear up my questions here quickly.
* Limitation number (2) seems pretty significant to me.  At a site that uses 
multiple external data sources, users will need to include boilerplate at the 
beginning of every application to connect to all of the data sources' catalogs 
explicitly. That situation would be similar to the problem that motivated this 
JIRA in the first place. I think it would be good to include a mechanism for 
storing a persistent "meta-catalog" of external catalogs.
* I don't see a mention of the namespace handling issues that the description 
for this JIRA mentions. With the current design, it looks like two external 
catalogs could define the same table name. Did I miss something, or is the plan 
to push out resolving namespace conflicts to a follow-on JIRA?

> Catalog federation
> --
>
> Key: SPARK-15777
> URL: https://issues.apache.org/jira/browse/SPARK-15777
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: SparkFederationDesign.pdf
>
>
> This is a ticket to track progress to support federating multiple external 
> catalogs. This would require establishing an API (similar to the current 
> ExternalCatalog API) for getting information about external catalogs, and 
> ability to convert a table into a data source table.
> As part of this, we would also need to be able to support more than a 
> two-level table identifier (database.table). At the very least we would need 
> a three level identifier for tables (catalog.database.table). A possibly 
> direction is to support arbitrary level hierarchical namespaces similar to 
> file systems.
> Once we have this implemented, we can convert the current Hive catalog 
> implementation into an external catalog that is "mounted" into an internal 
> catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17752) Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns

2016-10-03 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543008#comment-15543008
 ] 

Shivaram Venkataraman commented on SPARK-17752:
---

Thanks for the bug report - I can't seem to reproduce this on a build from 
master branch or in the 2.0.1 RC4 that just passed the vote, but I am not sure 
what change actually fixed this. 

It'll be great if you could also verify whether 2.0.1 fixes your problem and if 
so we can mark this issue as resolved.

> Spark returns incorrect result when 'collect()'ing a cached Dataset with many 
> columns
> -
>
> Key: SPARK-17752
> URL: https://issues.apache.org/jira/browse/SPARK-17752
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kevin Ushey
>Priority: Critical
>
> Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
> installation as necessary):
> {code:r}
> SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
> Sys.setenv(SPARK_HOME = SPARK_HOME)
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
> "2g"))
> n <- 1E3
> df <- as.data.frame(replicate(n, 1L, FALSE))
> names(df) <- paste("X", 1:n, sep = "")
> tbl <- as.DataFrame(df)
> cache(tbl) # works fine without this
> cl <- collect(tbl)
> identical(df, cl) # FALSE
> {code}
> Although this is reproducible with SparkR, it seems more likely that this is 
> an error in the Java / Scala Spark sources.
> For posterity:
> > sessionInfo()
> R version 3.3.1 Patched (2016-07-30 r71015)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: macOS Sierra (10.12)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12985) Spark Hive thrift server big decimal data issue

2016-10-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543011#comment-15543011
 ] 

Dongjoon Hyun commented on SPARK-12985:
---

Thank you, [~alexliu68]

> Spark Hive thrift server big decimal data issue
> ---
>
> Key: SPARK-12985
> URL: https://issues.apache.org/jira/browse/SPARK-12985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Alex Liu
>Priority: Minor
>
> I tested the trial version JDBC driver from Simba, it works for simple query. 
> But there is some issue with data mapping. e.g.
> {code}
> java.sql.SQLException: [Simba][SparkJDBCDriver](500312) Error in fetching 
> data rows: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   at 
> com.simba.spark.hivecommon.api.HS2Client.buildExceptionFromTStatus(Unknown 
> Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchNRows(Unknown Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchRows(Unknown Source)
>   at com.simba.spark.hivecommon.dataengine.BackgroundFetcher.run(Unknown 
> Source)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> Caused by: com.simba.spark.support.exceptions.GeneralException: 
> [Simba][SparkJDBCDriver](500312) Error in fetching data rows: 
> java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   ... 8 more
> {code}
> To fix it
> {code}
>case DecimalType() =>
>  -to += from.getDecimal(ordinal)
>  +to += HiveDecimal.create(from.getDecimal(ordinal))
> {code}
> to 
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L87



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12985) Spark Hive thrift server big decimal data issue

2016-10-03 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-12985.
-
Resolution: Not A Problem

> Spark Hive thrift server big decimal data issue
> ---
>
> Key: SPARK-12985
> URL: https://issues.apache.org/jira/browse/SPARK-12985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Alex Liu
>Priority: Minor
>
> I tested the trial version JDBC driver from Simba, it works for simple query. 
> But there is some issue with data mapping. e.g.
> {code}
> java.sql.SQLException: [Simba][SparkJDBCDriver](500312) Error in fetching 
> data rows: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   at 
> com.simba.spark.hivecommon.api.HS2Client.buildExceptionFromTStatus(Unknown 
> Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchNRows(Unknown Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchRows(Unknown Source)
>   at com.simba.spark.hivecommon.dataengine.BackgroundFetcher.run(Unknown 
> Source)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> Caused by: com.simba.spark.support.exceptions.GeneralException: 
> [Simba][SparkJDBCDriver](500312) Error in fetching data rows: 
> java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   ... 8 more
> {code}
> To fix it
> {code}
>case DecimalType() =>
>  -to += from.getDecimal(ordinal)
>  +to += HiveDecimal.create(from.getDecimal(ordinal))
> {code}
> to 
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L87



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17718) Make loss function formulation label note clearer in MLlib docs

2016-10-03 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-17718.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Make loss function formulation label note clearer in MLlib docs
> ---
>
> Key: SPARK-17718
> URL: https://issues.apache.org/jira/browse/SPARK-17718
> Project: Spark
>  Issue Type: Documentation
>Reporter: Tobi Bosede
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.1.0
>
>
> https://spark.apache.org/docs/1.6.0/mllib-linear-methods.html#mjx-eqn-eqregPrimal
> The loss function here for logistic regression is confusing. It seems to 
> imply that spark uses only -1 and 1 class labels. However it uses 0,1.  Note 
> below needs to make this point more visible to avoid confusion.
> "Note that, in the mathematical formulation in this guide, a binary label
> y is denoted as either +1 (positive) or −1 (negative), which is convenient
> for the formulation. However, the negative label is represented by 0 in
> spark.mllib instead of −1, to be consistent with multiclass labeling."
> Better yet, the loss function should be replaced with the 0/1 formulation, 
> despite the mathematical inconvenience, since that is what is actually 
> implemented (the two formulations are written out below).
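
For reference, the two equivalent formulations involved here, written out in full
(this is standard logistic-regression algebra, not text quoted from the Spark docs):

{code}
% labels in {-1, +1}, the form shown in the guide
L_{\pm 1}(w; x, y) = \log\bigl(1 + \exp(-y\, w^T x)\bigr), \qquad y \in \{-1, +1\}

% labels in {0, 1}, the form the reporter would like to see
L_{01}(w; x, y) = -\bigl[\, y \log \sigma(w^T x) + (1 - y) \log\bigl(1 - \sigma(w^T x)\bigr) \bigr],
\qquad y \in \{0, 1\}, \quad \sigma(z) = \frac{1}{1 + e^{-z}}

% the two coincide under the relabeling y' = 2y - 1
L_{01}(w; x, y) = L_{\pm 1}(w; x, 2y - 1)
{code}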



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-03 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543046#comment-15543046
 ] 

Dilip Biswal commented on SPARK-17709:
--

Hi Ashish, thanks a lot. I will try it and get back to you.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17752) Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns

2016-10-03 Thread Kevin Ushey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543045#comment-15543045
 ] 

Kevin Ushey commented on SPARK-17752:
-

I can confirm that everything is okay with Spark 2.0.2-SNAPSHOT (as retrieved 
from http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/).

I also reproduced this issue with Spark 1.6.2, so it might be worth 
investigating what fixed this issue, and if that fix should be backported to 
e.g. Spark 1.6.3. (not sure if long-term support of Spark 1.6 is intended)

> Spark returns incorrect result when 'collect()'ing a cached Dataset with many 
> columns
> -
>
> Key: SPARK-17752
> URL: https://issues.apache.org/jira/browse/SPARK-17752
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kevin Ushey
>Priority: Critical
>
> Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
> installation as necessary):
> {code:r}
> SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
> Sys.setenv(SPARK_HOME = SPARK_HOME)
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
> "2g"))
> n <- 1E3
> df <- as.data.frame(replicate(n, 1L, FALSE))
> names(df) <- paste("X", 1:n, sep = "")
> tbl <- as.DataFrame(df)
> cache(tbl) # works fine without this
> cl <- collect(tbl)
> identical(df, cl) # FALSE
> {code}
> Although this is reproducible with SparkR, it seems more likely that this is 
> an error in the Java / Scala Spark sources.
> For posterity:
> > sessionInfo()
> R version 3.3.1 Patched (2016-07-30 r71015)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: macOS Sierra (10.12)
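
For anyone who wants to poke at this from the JVM side, here is a rough Scala
analogue of the R reproduction above. It is purely illustrative and has not been
verified to trigger this particular bug (the report itself is against SparkR):

{code}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val n = 1000
val schema = StructType((1 to n).map(i => StructField(s"X$i", IntegerType, nullable = false)))
val row = Row.fromSeq(Seq.fill(n)(1))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(row)), schema)

df.cache()                       // the R report says collect() is fine without this
val collected = df.collect()

// Every cell was written as 1, so every collected cell should still be 1.
assert(collected.forall(_.toSeq.forall(_ == 1)))
{code}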



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17742) Spark Launcher does not get failed state in Listener

2016-10-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543048#comment-15543048
 ] 

Marcelo Vanzin commented on SPARK-17742:


This code in "LocalSchedulerBackend" causes the issue you're seeing:

{code}
  override def stop() {
stop(SparkAppHandle.State.FINISHED)
  }
{code}

That code runs both when you explicitly stop a SparkContext and when the VM 
shuts down, via a shutdown hook. You could remove that line (and the similar lines 
in other backends) and change the logic so that the app is marked "successful" only 
if the child process exits with a 0 exit code. But you need to make sure it still 
works with YARN (both client and cluster mode), since the logic to report app state 
is different there.
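
For context, this is roughly what application-side code would look like once the
final state is reported correctly. It is a minimal sketch using only the public
{{SparkAppHandle}} API; the polling loop and the paths are illustrative only:

{code}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val handle = new SparkLauncher()
  .setSparkHome("/opt/spark2")                     // placeholder, as in the report
  .setAppResource("/path/to/testsparkjob.jar")     // placeholder path
  .setMainClass("com.example.Main")
  .setMaster("local[2]")
  .startApplication()

// Wait for a terminal state instead of sleeping for a fixed time.
while (!handle.getState.isFinal) {
  Thread.sleep(500)
}

// With the change discussed above, a non-zero child exit should surface as FAILED.
val succeeded = handle.getState == SparkAppHandle.State.FINISHED
println(s"final state: ${handle.getState}, succeeded: $succeeded")
{code}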

> Spark Launcher does not get failed state in Listener 
> -
>
> Key: SPARK-17742
> URL: https://issues.apache.org/jira/browse/SPARK-17742
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
>
> I tried to launch an application using the code below. This is dummy code to 
> reproduce the problem. I tried exiting Spark with status -1, throwing an 
> exception, etc., but in no case did the listener report a failed status. If a 
> Spark job returns -1 or throws an exception from the main method, it should 
> be considered a failure. 
> {code}
> package com.example;
> import org.apache.spark.launcher.SparkAppHandle;
> import org.apache.spark.launcher.SparkLauncher;
> import java.io.IOException;
> public class Main2 {
> public static void main(String[] args) throws IOException, 
> InterruptedException {
> SparkLauncher launcher = new SparkLauncher()
> .setSparkHome("/opt/spark2")
> 
> .setAppResource("/home/aseem/projects/testsparkjob/build/libs/testsparkjob-1.0-SNAPSHOT.jar")
> .setMainClass("com.example.Main")
> .setMaster("local[2]");
> launcher.startApplication(new MyListener());
> Thread.sleep(1000 * 60);
> }
> }
> class MyListener implements SparkAppHandle.Listener {
> @Override
> public void stateChanged(SparkAppHandle handle) {
> System.out.println("state changed " + handle.getState());
> }
> @Override
> public void infoChanged(SparkAppHandle handle) {
> System.out.println("info changed " + handle.getState());
> }
> }
> {code}
> The spark job is 
> {code}
> package com.example;
> import org.apache.spark.sql.SparkSession;
> import java.io.IOException;
> public class Main {
> public static void main(String[] args) throws IOException {
> SparkSession sparkSession = SparkSession
> .builder()
> .appName("" + System.currentTimeMillis())
> .getOrCreate();
> try {
> for (int i = 0; i < 15; i++) {
> Thread.sleep(1000);
> System.out.println("sleeping 1");
> }
> } catch (InterruptedException e) {
> e.printStackTrace();
> }
> //sparkSession.stop();
> System.exit(-1);
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17720) Static configurations in SQL

2016-10-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17720:

Summary: Static configurations in SQL  (was: introduce global SQL conf)

> Static configurations in SQL
> 
>
> Key: SPARK-17720
> URL: https://issues.apache.org/jira/browse/SPARK-17720
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.

2016-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543059#comment-15543059
 ] 

Apache Spark commented on SPARK-10634:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/15332

> The spark sql fails if the where clause contains a string with " in it.
> ---
>
> Key: SPARK-10634
> URL: https://issues.apache.org/jira/browse/SPARK-10634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Prachi Burathoki
>
> When running a sql query in which the where clause contains a string with " 
> in it, the sql parser throws an error.
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but 
> identifier test found
> SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER 
> FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test""))
>   
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
>   at 
> com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106)
>   at 
> com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93)
>   at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153)
>   at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752)
>   at 
> com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011)
>   ... 31 more
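
For what it's worth, a workaround sketch (hedged, since I have not run it against
1.3.1): Spark SQL string literals can also be single-quoted, and inside a
single-quoted literal an embedded double quote needs no escaping at all. Same table
and column names as in the report:

{code}
sqlContext.sql(
  """SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER
    |FROM TABLE_1
    |WHERE clistc215647292 = 'this is a "test"'""".stripMargin)
{code}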



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17720) Static configurations in SQL

2016-10-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17720:

Description: 
Spark SQL has two kinds of configuration parameters: dynamic configs and static 
configs. Dynamic configs can be modified after Spark SQL is launched (after 
SparkSession is setup), whereas static configs are immutable once the service 
starts.

It would be useful to have this separation and to tell users clearly when they try 
to set a static config after the service starts.

> Static configurations in SQL
> 
>
> Key: SPARK-17720
> URL: https://issues.apache.org/jira/browse/SPARK-17720
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> Spark SQL has two kinds of configuration parameters: dynamic configs and 
> static configs. Dynamic configs can be modified after Spark SQL is launched 
> (after SparkSession is setup), whereas static configs are immutable once the 
> service starts.
> It would be useful to have this separation and to tell users clearly when they 
> try to set a static config after the service starts.
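
A minimal sketch of the intended user-facing behavior, assuming
{{spark.sql.shuffle.partitions}} stays dynamic and taking {{spark.sql.warehouse.dir}}
as an illustrative static config (the actual classification is up to the
implementation):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Dynamic config: can still be changed after the session is up.
spark.conf.set("spark.sql.shuffle.partitions", "400")

// Static config: the proposal is to fail fast with a clear error instead of
// silently ignoring the new value.
try {
  spark.conf.set("spark.sql.warehouse.dir", "/tmp/other-warehouse")
} catch {
  case e: Exception => println(s"rejected static config change: ${e.getMessage}")
}
{code}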



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17761) Simplify InternalRow hierarchy

2016-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17761:


Assignee: Herman van Hovell  (was: Apache Spark)

> Simplify InternalRow hierarchy
> --
>
> Key: SPARK-17761
> URL: https://issues.apache.org/jira/browse/SPARK-17761
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> The current InternalRow hierarchy makes a distinction between immutable and 
> mutable rows. In practice we cannot guarantee that an immutable internal row 
> is immutable (you can always pass a mutable object as one of its elements). 
> Let's make all internal rows mutable (and reduce the complexity).
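
A plain-Scala illustration of the point (no Spark internals involved): declaring a
row wrapper "immutable" says nothing about the mutability of the objects it holds.

{code}
import scala.collection.mutable.ArrayBuffer

// No setters, so the wrapper itself is "immutable"...
class ImmutableRow(val values: Array[Any])

val buffer = ArrayBuffer(1, 2, 3)
val row = new ImmutableRow(Array[Any](buffer))

// ...but an element it points to can still be mutated from the outside.
buffer += 4
println(row.values(0))   // ArrayBuffer(1, 2, 3, 4)
{code}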



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17761) Simplify InternalRow hierarchy

2016-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17761:


Assignee: Apache Spark  (was: Herman van Hovell)

> Simplify InternalRow hierarchy
> --
>
> Key: SPARK-17761
> URL: https://issues.apache.org/jira/browse/SPARK-17761
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> The current InternalRow hierarchy makes a distinction between immutable and 
> mutable rows. In practice we cannot guarantee that an immutable internal row 
> is immutable (you can always pass a mutable object as one of its elements). 
> Let's make all internal rows mutable (and reduce the complexity).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17761) Simplify InternalRow hierarchy

2016-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543079#comment-15543079
 ] 

Apache Spark commented on SPARK-17761:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15333

> Simplify InternalRow hierarchy
> --
>
> Key: SPARK-17761
> URL: https://issues.apache.org/jira/browse/SPARK-17761
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> The current InternalRow hierarchy makes a distinction between immutable and 
> mutable rows. In practice we cannot guarantee that an immutable internal row 
> is immutable (you can always pass a mutable object as one of its elements). 
> Let's make all internal rows mutable (and reduce the complexity).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543097#comment-15543097
 ] 

Franck Tago commented on SPARK-17758:
-

I tested the behavior of the min and max functions with a sort aggregate and an 
empty partition, and the results were correct. 

I would also note that this issue is not reproducible in Spark 1.6.

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The results from a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null.  
> My input data contains 3 rows. 
> The application resulted in 2 stages. 
> The first stage consisted of 3 tasks. 
> The first task/partition contains 2 rows.
> The second task/partition contains 1 row.
> The last task/partition contains 0 rows.
> The result from the query executed for the LAST column call is NULL, which I 
> believe is due to the PARTIAL_LAST on the last partition. 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null.
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10634) The spark sql fails if the where clause contains a string with " in it.

2016-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543104#comment-15543104
 ] 

Apache Spark commented on SPARK-10634:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/15334

> The spark sql fails if the where clause contains a string with " in it.
> ---
>
> Key: SPARK-10634
> URL: https://issues.apache.org/jira/browse/SPARK-10634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Prachi Burathoki
>
> When running a sql query in which the where clause contains a string with " 
> in it, the sql parser throws an error.
> Caused by: java.lang.RuntimeException: [1.127] failure: ``)'' expected but 
> identifier test found
> SELECT clistc215647292, corc1749453704, candc1501025950, SYSIBM_ROW_NUMBER 
> FROM TABLE_1 WHERE ((clistc215647292 = "this is a "test""))
>   
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
>   at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
>   at 
> com.ibm.is.drs.engine.spark.sql.task.SQLQueryTask.createTargetRDD(SQLQueryTask.java:106)
>   at 
> com.ibm.is.drs.engine.spark.sql.SQLQueryNode.createTargetRDD(SQLQueryNode.java:93)
>   at com.ibm.is.drs.engine.spark.sql.SQLNode.doExecute(SQLNode.java:153)
>   at com.ibm.is.drs.engine.spark.api.BaseNode.execute(BaseNode.java:291)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:840)
>   at 
> com.ibm.is.drs.engine.spark.api.SessionContext.applyDataShaping(SessionContext.java:752)
>   at 
> com.ibm.is.drs.engine.spark.api.SparkRefineEngine.applyDataShaping(SparkRefineEngine.java:1011)
>   ... 31 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17766) Write ahead log exception on a toy project

2016-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17766:
--
Priority: Minor  (was: Major)
 Summary: Write ahead log exception on a toy project  (was: Write ahead log 
corruption on a toy project)

As far as I can see, something is certainly going wrong when you enable the 
WAL, but it is not data corruption. You do have Kryo enabled one way or another 
here, and you should try disabling registration.
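
If it helps while debugging, these are the two serializer-related settings worth
toggling (standard config keys; the values are simply what I would try first, then
build the StreamingContext from this conf as in the code below):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("mything")
  .setMaster("local[4]")
  // Either fall back to Java serialization entirely...
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  // ...or keep Kryo but make sure class registration is not required.
  .set("spark.kryo.registrationRequired", "false")
{code}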

> Write ahead log exception on a toy project
> --
>
> Key: SPARK-17766
> URL: https://issues.apache.org/jira/browse/SPARK-17766
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Nadav Samet
>Priority: Minor
>
> Write ahead log seems to get corrupted when the application is stopped 
> abruptly (Ctrl-C, or kill). Then, the application refuses to run due to this 
> exception:
> {code}
> 2016-10-03 08:03:32,321 ERROR [Executor task launch worker-1] 
> executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
> ...skipping...
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Code:
> {code}
> import org.apache.hadoop.conf.Configuration
> import org.apache.spark._
> import org.apache.spark.streaming._
> object ProtoDemo {
>   def createContext(dirName: String) = {
> val conf = new SparkConf().setAppName("mything").setMaster("local[4]")
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
> /*
> conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", 
> "true")
> conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", 
> "true")
> */
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint(dirName)
> val lines = ssc.socketTextStream("127.0.0.1", )
> val words = lines.flatMap(_.split(" "))
> val pairs = words.map(word => (word, 1))
> val wordCounts = pairs.reduceByKey(_ + _)
> val runningCounts = wordCounts.updateStateByKey[Int] {
>   (values: Seq[Int], oldValue: Option[Int]) =>
> val s = values.sum
> Some(oldValue.fold(s)(_ + s))
>   }
>   // Print the first ten elements of each RDD generated in this DStream to 
> the console
> runningCounts.print()
> ssc
>   }
>   def main(args: Array[String]) = {
> val hadoopConf = new Configuration()
> val dirName = "/tmp/chkp"
> val ssc = StreamingContext.getOrCreate(dirName, () => 
> createContext(dirName), hadoopConf)
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}
> Steps to reproduce:
> 1. I put the code in a repository: git clone 
> https://github.com/thesamet/spark-issue
> 2. in one terminal: {{ while true; do nc -l localhost ; done}}
> 3. Start a new terminal
> 4. Run "sbt run".
> 5. Type a few lines in the netcat terminal.
> 6. Kill the streaming project (Ctrl-C), 
> 7. Go back to step 4 until you see the exception above.
> I tried the above with local filesystem and also with S3, and getting the 
> same result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17769) Some FetchFailure refactoring in the DAGScheduler

2016-10-03 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-17769:


 Summary: Some FetchFailure refactoring in the DAGScheduler
 Key: SPARK-17769
 URL: https://issues.apache.org/jira/browse/SPARK-17769
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Mark Hamstra
Assignee: Mark Hamstra
Priority: Minor


SPARK-17644 opened up a discussion about further refactoring of the 
DAGScheduler's handling of FetchFailure events.  These include:
* rewriting code and comments to improve readability
* doing fetchFailedAttemptIds.add(stageAttemptId) even when 
disallowStageRetryForTest is true
* issuing a ResubmitFailedStages event based on whether one is already enqueued 
for the current failed stage, not any prior failed stage
* logging the resubmission of all failed stages 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17769) Some FetchFailure refactoring in the DAGScheduler

2016-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543138#comment-15543138
 ] 

Apache Spark commented on SPARK-17769:
--

User 'markhamstra' has created a pull request for this issue:
https://github.com/apache/spark/pull/15335

> Some FetchFailure refactoring in the DAGScheduler
> -
>
> Key: SPARK-17769
> URL: https://issues.apache.org/jira/browse/SPARK-17769
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Minor
>
> SPARK-17644 opened up a discussion about further refactoring of the 
> DAGScheduler's handling of FetchFailure events.  These include:
> * rewriting code and comments to improve readability
> * doing fetchFailedAttemptIds.add(stageAttemptId) even when 
> disallowStageRetryForTest is true
> * issuing a ResubmitFailedStages event based on whether one is already 
> enqueued for the current failed stage, not any prior failed stage
> * logging the resubmission of all failed stages 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17769) Some FetchFailure refactoring in the DAGScheduler

2016-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17769:


Assignee: Mark Hamstra  (was: Apache Spark)

> Some FetchFailure refactoring in the DAGScheduler
> -
>
> Key: SPARK-17769
> URL: https://issues.apache.org/jira/browse/SPARK-17769
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Minor
>
> SPARK-17644 opened up a discussion about further refactoring of the 
> DAGScheduler's handling of FetchFailure events.  These include:
> * rewriting code and comments to improve readability
> * doing fetchFailedAttemptIds.add(stageAttemptId) even when 
> disallowStageRetryForTest is true
> * issuing a ResubmitFailedStages event based on whether one is already 
> enqueued for the current failed stage, not any prior failed stage
> * logging the resubmission of all failed stages 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17769) Some FetchFailure refactoring in the DAGScheduler

2016-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17769:


Assignee: Apache Spark  (was: Mark Hamstra)

> Some FetchFailure refactoring in the DAGScheduler
> -
>
> Key: SPARK-17769
> URL: https://issues.apache.org/jira/browse/SPARK-17769
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Mark Hamstra
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-17644 opened up a discussion about further refactoring of the 
> DAGScheduler's handling of FetchFailure events.  These include:
> * rewriting code and comments to improve readability
> * doing fetchFailedAttemptIds.add(stageAttemptId) even when 
> disallowStageRetryForTest is true
> * issuing a ResubmitFailedStages event based on whether one is already 
> enqueued for the current failed stage, not any prior failed stage
> * logging the resubmission of all failed stages 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-03 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543141#comment-15543141
 ] 

Dilip Biswal commented on SPARK-17709:
--

@ashrowty Hi Ashish, in your example the column loyaltycardnumber is not in 
the output set, which is why we see the exception. I tried using productid 
instead and got the correct result.

{code}
scala> df1.join(df2, Seq("companyid","loyaltycardnumber"));
org.apache.spark.sql.AnalysisException: using columns 
['companyid,'loyaltycardnumber] can not be resolved given input columns: 
[productid, companyid, avgprice, avgitemcount, companyid, productid] ;
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2651)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:679)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:652)
  ... 48 elided

scala> df1.join(df2, Seq("companyid","productid"));
res1: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 2 
more fields]

scala> df1.join(df2, Seq("companyid","productid")).show
+---------+---------+--------+------------+
|companyid|productid|avgprice|avgitemcount|
+---------+---------+--------+------------+
|      101|        3|    13.0|        12.0|
|      100|        1|    10.0|        10.0|
+---------+---------+--------+------------+
{code}
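
For what it's worth, one workaround that sometimes sidesteps self-join resolution
problems (hedged: it assumes the join keys really exist on both sides) is to alias
both frames and join on fully qualified columns instead of a {{Seq}} of names. Using
the column names from the original report, with {{df1}}/{{df2}} built as above:

{code}
import org.apache.spark.sql.functions.col

val joined = df1.as("a")
  .join(df2.as("b"),
    col("a.key1") === col("b.key1") && col("a.key2") === col("b.key2"))
  .select(col("a.key1"), col("a.key2"), col("a.avgtotalprice"), col("b.avgqty"))
{code}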

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17758:
--
Description: 
My Environment 
Spark 2.0.0  

I have included the physical plan of my application below.
Issue description
The results from a query that uses the LAST function are incorrect. 

The output obtained for the column that corresponds to the last function is 
null.  

My input data contains 3 rows. 

The application resulted in 2 stages. 

The first stage consisted of 3 tasks. 

The first task/partition contains 2 rows.
The second task/partition contains 1 row.
The last task/partition contains 0 rows.

The result from the query executed for the LAST column call is NULL, which I 
believe is due to the PARTIAL_LAST on the last partition. 

I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
partition should not return null.

{noformat}
== Physical Plan ==
InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
+- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) AS 
field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
   +- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
output=[max(C3_0)#50,last(C3_1)#51])
  +- SortAggregate(key=[], 
functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
output=[max#91,last#92])
 +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 AS 
C3_1#41]
+- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
DOUBLE)#27,last(C1_1)#28])
   +- Exchange SinglePartition
  +- SortAggregate(key=[], functions=[partial_sum(cast(C1_0#17 
as bigint)),partial_last(C1_1#18, false)], output=[sum#95L,last#96])
 +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
+- HiveTableScan [field1#7, field#6], MetastoreRelation 
default, bdm_3449_src, alias
{noformat}

  was:
My Environment 
Spark 2.0.0  

I have included the physical plan of my application below.
Issue description
The result from  a query that uses the LAST function are incorrect. 

The output obtained for the column that corresponds to the last function is 
null .  

My input data contain 3 rows . 

The application resulted in  2 stages 

The first stage consisted of 3 tasks . 

The first task/partition contains 2 rows
The second task/partition contains 1 row
The last task/partition contain  0 rows

The result from the query executed for the LAST column call is NULL which I 
believe is due to the  PARTIAL_LAST on the last partition . 

I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
partition should not return null .


== Physical Plan ==
InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
+- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) AS 
field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
   +- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
output=[max(C3_0)#50,last(C3_1)#51])
  +- SortAggregate(key=[], 
functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
output=[max#91,last#92])
 +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 AS 
C3_1#41]
+- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
DOUBLE)#27,last(C1_1)#28])
   +- Exchange SinglePartition
  +- SortAggregate(key=[], functions=[partial_sum(cast(C1_0#17 
as bigint)),partial_last(C1_1#18, false)], output=[sum#95L,last#96])
 +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
+- HiveTableScan [field1#7, field#6], MetastoreRelation 
default, bdm_3449_src, alias



> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result from  a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null .  
> My input data contain 3 rows . 
> The application resulted in  2 stages 
> The first stage consisted of 3 tasks . 
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contain  0 rows
> The result from the query executed for the LAST column call is NULL, which I 
> believe is due to the PARTIAL_LAST on the last partition.

[jira] [Updated] (SPARK-17752) Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns

2016-10-03 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-17752:
--
Fix Version/s: 2.1.0
   2.0.1

> Spark returns incorrect result when 'collect()'ing a cached Dataset with many 
> columns
> -
>
> Key: SPARK-17752
> URL: https://issues.apache.org/jira/browse/SPARK-17752
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kevin Ushey
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
> installation as necessary):
> {code:r}
> SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
> Sys.setenv(SPARK_HOME = SPARK_HOME)
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
> "2g"))
> n <- 1E3
> df <- as.data.frame(replicate(n, 1L, FALSE))
> names(df) <- paste("X", 1:n, sep = "")
> tbl <- as.DataFrame(df)
> cache(tbl) # works fine without this
> cl <- collect(tbl)
> identical(df, cl) # FALSE
> {code}
> Although this is reproducible with SparkR, it seems more likely that this is 
> an error in the Java / Scala Spark sources.
> For posterity:
> > sessionInfo()
> R version 3.3.1 Patched (2016-07-30 r71015)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: macOS Sierra (10.12)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543175#comment-15543175
 ] 

Herman van Hovell commented on SPARK-17758:
---

Why should last return a non-null result? It returns the last observation it 
encounters (which is pretty random in a clustered environment). Last is only 
deterministic when you have sorted on the column you are calling last on.
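
If the goal is a deterministic "latest value", one common pattern is to carry an
explicit ordering column and take the value paired with its maximum. The column
names {{ts}} and {{value}} and the input {{df}} below are hypothetical:

{code}
import org.apache.spark.sql.functions.{col, max, struct}

// max over a struct orders by its fields left to right, so this picks the
// value attached to the largest ts -- independent of partitioning.
val latest = df
  .agg(max(struct(col("ts"), col("value"))).as("m"))
  .select(col("m.value").as("last_value"))
{code}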

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result from  a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null .  
> My input data contain 3 rows . 
> The application resulted in  2 stages 
> The first stage consisted of 3 tasks . 
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contain  0 rows
> The result from the query executed for the LAST column call is NULL which I 
> believe is due to the  PARTIAL_LAST on the last partition . 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null .
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17752) Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns

2016-10-03 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543179#comment-15543179
 ] 

Shivaram Venkataraman commented on SPARK-17752:
---

I don't know what the plans are for 1.6.3 -- I'm marking this issue as resolved 
with fix versions as 2.0.1 and 2.1.0. If we figure out what to backport, we can 
update the JIRA after merging to branch-1.6 etc.

> Spark returns incorrect result when 'collect()'ing a cached Dataset with many 
> columns
> -
>
> Key: SPARK-17752
> URL: https://issues.apache.org/jira/browse/SPARK-17752
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kevin Ushey
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
> installation as necessary):
> {code:r}
> SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
> Sys.setenv(SPARK_HOME = SPARK_HOME)
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
> "2g"))
> n <- 1E3
> df <- as.data.frame(replicate(n, 1L, FALSE))
> names(df) <- paste("X", 1:n, sep = "")
> tbl <- as.DataFrame(df)
> cache(tbl) # works fine without this
> cl <- collect(tbl)
> identical(df, cl) # FALSE
> {code}
> Although this is reproducible with SparkR, it seems more likely that this is 
> an error in the Java / Scala Spark sources.
> For posterity:
> > sessionInfo()
> R version 3.3.1 Patched (2016-07-30 r71015)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: macOS Sierra (10.12)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17752) Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns

2016-10-03 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-17752.
---
Resolution: Fixed

> Spark returns incorrect result when 'collect()'ing a cached Dataset with many 
> columns
> -
>
> Key: SPARK-17752
> URL: https://issues.apache.org/jira/browse/SPARK-17752
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kevin Ushey
>Priority: Critical
> Fix For: 2.1.0, 2.0.1
>
>
> Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
> installation as necessary):
> {code:r}
> SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
> Sys.setenv(SPARK_HOME = SPARK_HOME)
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
> "2g"))
> n <- 1E3
> df <- as.data.frame(replicate(n, 1L, FALSE))
> names(df) <- paste("X", 1:n, sep = "")
> tbl <- as.DataFrame(df)
> cache(tbl) # works fine without this
> cl <- collect(tbl)
> identical(df, cl) # FALSE
> {code}
> Although this is reproducible with SparkR, it seems more likely that this is 
> an error in the Java / Scala Spark sources.
> For posterity:
> > sessionInfo()
> R version 3.3.1 Patched (2016-07-30 r71015)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: macOS Sierra (10.12)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17764) to_json function for parsing Structs to json Strings

2016-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17764:
-
Target Version/s: 2.1.0

> to_json function for parsing Structs to json Strings
> 
>
> Key: SPARK-17764
> URL: https://issues.apache.org/jira/browse/SPARK-17764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> After SPARK-17699, Spark now supports {{from_json}}. It would be nice to have 
> {{to_json}} too, in particular for writing out dataframes with data sources 
> that do not support nested structured types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17764) to_json function for parsing Structs to json Strings

2016-10-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543229#comment-15543229
 ] 

Michael Armbrust commented on SPARK-17764:
--

I would start by looking at the implementation of {{DataFrame.toJSON()}}, and 
my PR that added {{from_json}}.
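
For comparison, a small sketch of what exists today versus the proposed function.
Here {{df}} is any DataFrame with a struct column, and the {{to_json}} call is only
the proposed shape, not an existing API:

{code}
import org.apache.spark.sql.Dataset

// Existing API: serialize whole rows to JSON strings.
val jsonRows: Dataset[String] = df.toJSON

// Proposed shape (hypothetical, does not exist yet):
// df.select(to_json(col("structCol")).as("json"))
{code}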

> to_json function for parsing Structs to json Strings
> 
>
> Key: SPARK-17764
> URL: https://issues.apache.org/jira/browse/SPARK-17764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> After SPARK-17699, Spark now supports {{from_json}}. It would be nice to have 
> {{to_json}} too, in particular for writing out dataframes with data sources 
> that do not support nested structured types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10637) DataFrames: saving with nested User Data Types

2016-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543246#comment-15543246
 ] 

Apache Spark commented on SPARK-10637:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/15334

> DataFrames: saving with nested User Data Types
> --
>
> Key: SPARK-10637
> URL: https://issues.apache.org/jira/browse/SPARK-10637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Joao Duarte
>
> Cannot save data frames using nested UserDefinedType
> I wrote a simple example to show the error.
> It causes the following error java.lang.IllegalArgumentException: Nested type 
> should be repeated: required group array {
>   required int32 num;
> }
> {code}
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
> import org.apache.spark.sql.types._
> @SQLUserDefinedType(udt = classOf[AUDT])
> case class A(list:Seq[B])
> class AUDT extends UserDefinedType[A] {
>   override def sqlType: DataType = StructType(Seq(StructField("list", 
> ArrayType(BUDT, containsNull = false), nullable = true)))
>   override def userClass: Class[A] = classOf[A]
>   override def serialize(obj: Any): Any = obj match {
> case A(list) =>
>   val row = new GenericMutableRow(1)
>   row.update(0, new 
> GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
>   row
>   }
>   override def deserialize(datum: Any): A = {
> datum match {
>   case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
> }
>   }
> }
> object AUDT extends AUDT
> @SQLUserDefinedType(udt = classOf[BUDT])
> case class B(num:Int)
> class BUDT extends UserDefinedType[B] {
>   override def sqlType: DataType = StructType(Seq(StructField("num", 
> IntegerType, nullable = false)))
>   override def userClass: Class[B] = classOf[B]
>   override def serialize(obj: Any): Any = obj match {
> case B(num) =>
>   val row = new GenericMutableRow(1)
>   row.setInt(0, num)
>   row
>   }
>   override def deserialize(datum: Any): B = {
> datum match {
>   case row: InternalRow => new B(row.getInt(0))
> }
>   }
> }
> object BUDT extends BUDT
> object TestNested {
>   def main(args:Array[String]) = {
> val col = Seq(new A(Seq(new B(1), new B(2))),
>   new A(Seq(new B(3), new B(4))))
> val sc = new SparkContext(new 
> SparkConf().setMaster("local[1]").setAppName("TestSpark"))
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> val df = sc.parallelize(1 to 2 zip col).toDF()
> df.show()
> df.write.mode(SaveMode.Overwrite).save(...)
>   }
> }
> {code}
> The error log is shown below:
> {code}
> 15/09/16 16:44:36 WARN : Your hostname, X resolves to a 
> loopback/non-reachable address: fe80:0:0:0:c4c7:8c4b:4a24:f8a1%14, but we 
> couldn't find any external IP address!
> 15/09/16 16:44:38 INFO deprecation: mapred.job.id is deprecated. Instead, use 
> mapreduce.job.id
> 15/09/16 16:44:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
> mapreduce.task.id
> 15/09/16 16:44:38 INFO deprecation: mapred.task.id is deprecated. Instead, 
> use mapreduce.task.attempt.id
> 15/09/16 16:44:38 INFO deprecation: mapred.task.is.map is deprecated. 
> Instead, use mapreduce.task.ismap
> 15/09/16 16:44:38 INFO deprecation: mapred.task.partition is deprecated. 
> Instead, use mapreduce.task.partition
> 15/09/16 16:44:38 INFO ParquetRelation: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 15/09/16 16:44:38 INFO DefaultWriterContainer: Using user defined output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 15/09/16 16:44:38 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 
> localhost:50986 in memory (size: 1402.0 B, free: 973.6 MB)
> 15/09/16 16:44:38 INFO ContextCleaner: Cleaned accumulator 1
> 15/09/16 16:44:39 INFO SparkContext: Starting job: save at TestNested.scala:73
> 15/09/16 16:44:39 INFO DAGScheduler: Got job 1 (save at TestNested.scala:73) 
> with 1 output partitions
> 15/09/16 16:44:39 INFO DAGScheduler: Final stage: ResultStage 1(save at 
> TestNested.scala:73)
> 15/09/16 16:44:39 INFO DAGScheduler: Parents of final stage: List()
> 15/09/16 16:44:39 INFO DAGScheduler: Missing parents: List()
> 15/09/16 16:44:39 INFO DAGScheduler: Submitting ResultStage 1 
> (MapPartitionsRDD[1] at rddToDataFrameHolder at TestNested.scala:69), which 
> has no missing parents
> 15/09/16 16:44:39 INFO MemoryStore: ensureFreeSpace(59832) called with 
> curMem=0, maxMem=1020914565
> 15/09/16 16:44:39 INFO MemoryStore:

[jira] [Commented] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543319#comment-15543319
 ] 

Dongjoon Hyun commented on SPARK-17767:
---

It seems that supporting a custom external catalog becomes simpler in Spark 2.1.0 
after SPARK-17190.
For an external catalog based on Hive classes, we could override it fairly easily 
by turning a private string variable into a SQLConf entry. I'll make a PR as a 
first attempt.
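
As a rough sketch of the approach hinted at above (the config key, class names, and 
reflection wiring here are hypothetical, not Spark's actual internals):
{code}
import org.apache.spark.SparkConf

// Stand-in for the internal ExternalCatalog trait; illustrative only.
trait ExternalCatalogLike

object CatalogResolver {
  // Resolve an ExternalCatalog implementation from configuration instead of
  // hardcoding the "hive" vs "in-memory" choice.
  def resolve(conf: SparkConf): ExternalCatalogLike = {
    val className = conf.get(
      "spark.sql.catalogImplementation.class",  // hypothetical key
      "org.example.MyExternalCatalog")          // hypothetical default
    Class.forName(className)
      .getConstructor()
      .newInstance()
      .asInstanceOf[ExternalCatalogLike]
  }
}
{code}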

> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15777) Catalog federation

2016-10-03 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543329#comment-15543329
 ] 

Yan commented on SPARK-15777:
-

1) Currently the rules are applied on a per-session basis. Right, ideally they 
should be applied on a per-query basis; we can modify the design/implementation 
in that direction. Regarding evaluation ordering, item 5) of the "Scopes, 
Limitations and Open Questions" section covers this topic. In short, there is an 
ordering between the built-in rules and the custom rules, but not among the custom 
rules. The plugin mechanism assumes cooperative behavior, so the plugged-in rules 
are expected to apply only to the parts of the plan that belong to their specific 
data sources, probably after some plan rewriting. Once the overall ideas are 
accepted by the community, we will flesh out the design doc and post the 
implementation in a WIP fashion.
2) As mentioned in the doc, this is not a complete design. Hopefully it lays down 
some basic concepts and principles so future work can be built on top of it. For 
instance, a persistent catalog could itself be another major feature, but it is 
left out of the scope of this design for now without affecting the primary 
functionality.
3) The 3-level table identifier currently serves the namespace purpose. Yes, join 
queries against two tables with the same database and table names but different 
catalog names work well. Arbitrary levels of namespaces are not supported yet.

Thanks for your comments.
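
For a concrete sense of the per-session rule registration that exists today, here is a 
sketch using the experimental extraOptimizations hook; the rule itself is a trivial 
placeholder, and the federation plugin mechanism discussed above may well differ from this:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder custom rule; real plugin rules would only rewrite the parts of
// the plan that belong to their own data source.
object MyNoOpRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

object CustomRuleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("custom-rule").getOrCreate()
    // Registered per session; built-in rules still run in their fixed order around it.
    spark.experimental.extraOptimizations = Seq(MyNoOpRule)
    spark.stop()
  }
}
{code}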

> Catalog federation
> --
>
> Key: SPARK-15777
> URL: https://issues.apache.org/jira/browse/SPARK-15777
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: SparkFederationDesign.pdf
>
>
> This is a ticket to track progress to support federating multiple external 
> catalogs. This would require establishing an API (similar to the current 
> ExternalCatalog API) for getting information about external catalogs, and 
> ability to convert a table into a data source table.
> As part of this, we would also need to be able to support more than a 
> two-level table identifier (database.table). At the very least we would need 
> a three level identifier for tables (catalog.database.table). A possibly 
> direction is to support arbitrary level hierarchical namespaces similar to 
> file systems.
> Once we have this implemented, we can convert the current Hive catalog 
> implementation into an external catalog that is "mounted" into an internal 
> catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16589) Chained cartesian produces incorrect number of records

2016-10-03 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543349#comment-15543349
 ] 

holdenk commented on SPARK-16589:
-

Is this something you are still investigating/working on actively?

> Chained cartesian produces incorrect number of records
> --
>
> Key: SPARK-16589
> URL: https://issues.apache.org/jira/browse/SPARK-16589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> Chaining cartesian calls in PySpark results in the number of records lower 
> than expected. It can be reproduced as follows:
> {code}
> rdd = sc.parallelize(range(10), 1)
> rdd.cartesian(rdd).cartesian(rdd).count()
> ## 355
> rdd.cartesian(rdd).cartesian(rdd).distinct().count()
> ## 251
> {code}
> It looks like it is related to serialization. If we reserialize after initial 
> cartesian:
> {code}
> rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), 
> 1)).cartesian(rdd).count()
> ## 1000
> {code}
> or insert identity map:
> {code}
> rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count()
> ## 1000
> {code}
> it yields correct results.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17770) Make ObjectType SQL Type Public

2016-10-03 Thread Aleksander Eskilson (JIRA)
Aleksander Eskilson created SPARK-17770:
---

 Summary: Make ObjectType SQL Type Public
 Key: SPARK-17770
 URL: https://issues.apache.org/jira/browse/SPARK-17770
 Project: Spark
  Issue Type: Wish
  Components: SQL
Reporter: Aleksander Eskilson
Priority: Minor


Currently Catalyst supports encoding custom classes represented as Java Beans 
(among others). This Java Bean implementation depends internally on Catalyst’s 
ObjectType extension of DataType. Right now, this class is private to the sql 
package [1]. However, its private scope makes it more difficult to write full 
custom encoders for other classes, themselves perhaps composed of additional 
objects.

Making this class public would facilitate the writing of custom encoders.

[1] -- 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala#L39
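
For context, the Java Bean encoding path mentioned above is reachable through 
Encoders.bean; a minimal sketch (class and app names are illustrative):
{code}
import org.apache.spark.sql.{Encoders, SparkSession}

// A JavaBean-style class: no-arg constructor plus getters/setters.
class Point extends Serializable {
  private var x: Int = 0
  private var y: Int = 0
  def getX: Int = x
  def setX(v: Int): Unit = { x = v }
  def getY: Int = y
  def setY(v: Int): Unit = { y = v }
}

object BeanEncoderExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("bean-encoder").getOrCreate()
    val p = new Point
    p.setX(1)
    p.setY(2)
    // Encoders.bean builds the bean encoder, which is the path that relies on
    // ObjectType internally (per the description above).
    val ds = spark.createDataset(Seq(p))(Encoders.bean(classOf[Point]))
    ds.show()
    spark.stop()
  }
}
{code}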



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17771) Allow start-master/slave scripts to start in the foreground

2016-10-03 Thread Mike Ihbe (JIRA)
Mike Ihbe created SPARK-17771:
-

 Summary: Allow start-master/slave scripts to start in the 
foreground
 Key: SPARK-17771
 URL: https://issues.apache.org/jira/browse/SPARK-17771
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.1
Reporter: Mike Ihbe
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17771) Allow start-master/slave scripts to start in the foreground

2016-10-03 Thread Mike Ihbe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Ihbe updated SPARK-17771:
--
Description: 
Based on conversation from this thread: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Running-Spark-master-slave-instances-in-non-Daemon-mode-td19172.html

Some scheduler solutions like Nomad have a simple fork-exec execution model, 
and the daemonization causes the scheduler to lose track of the spark process. 
I'm proposing adding a SPARK_NO_DAEMONIZE environment variable that will 
trigger a switch in ./sbin/spark-daemon.sh to run the process in the 
foreground. 

This opens a question about whether or not to rename the bash script, but I 
think that's a potentially breaking change that we should avoid at this point.

> Allow start-master/slave scripts to start in the foreground
> ---
>
> Key: SPARK-17771
> URL: https://issues.apache.org/jira/browse/SPARK-17771
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.1
>Reporter: Mike Ihbe
>Priority: Minor
>
> Based on conversation from this thread: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Running-Spark-master-slave-instances-in-non-Daemon-mode-td19172.html
> Some scheduler solutions like Nomad have a simple fork-exec execution model, 
> and the daemonization causes the scheduler to lose track of the spark 
> process. I'm proposing adding a SPARK_NO_DAEMONIZE environment variable that 
> will trigger a switch in ./sbin/spark-daemon.sh to run the process in the 
> foreground. 
> This opens a question about whether or not to rename the bash script, but I 
> think that's a potentially breaking change that we should avoid at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter

2016-10-03 Thread Ai Deng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543412#comment-15543412
 ] 

Ai Deng commented on SPARK-10063:
-

You can add this line for your SparkContext; it changes the EMR's Hadoop 
config:
{code:java}
sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version",
 "2")
{code}

> Remove DirectParquetOutputCommitter
> ---
>
> Key: SPARK-10063
> URL: https://issues.apache.org/jira/browse/SPARK-10063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 2.0.0
>
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, 
> there is a chance that we can lose data. 
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: 
> Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
> if (attemptNumber > 0) {
>   Thread.sleep(15000)
>   throw new Exception("new exception")
> } else {
>   Thread.sleep(1)
> }
>   }
>   
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", 
> $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output 
> file to S3, then the speculative task somehow fails. Because we have to call 
> output stream's close method, which uploads data to S3, we actually upload 
> the partial result generated by the failed speculative task to S3 and this 
> file overwrites the correct file generated by the original task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17679) Remove unnecessary Py4J ListConverter patch

2016-10-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-17679.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15254
[https://github.com/apache/spark/pull/15254]

> Remove unnecessary Py4J ListConverter patch
> ---
>
> Key: SPARK-17679
> URL: https://issues.apache.org/jira/browse/SPARK-17679
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jason White
>Priority: Minor
> Fix For: 2.1.0
>
>
> In SPARK-6949 davies documented a couple of bugs with Py4J that prevented 
> Spark from registering a converter for date and datetime objects. Patched in 
> https://github.com/apache/spark/pull/5570.
> Specifically, https://github.com/bartdag/py4j/issues/160 dealt with 
> ListConverter automatically converting bytearrays into ArrayList instead of 
> leaving them alone.
> Py4J #160 has since been fixed upstream, as of the Py4J 0.9 release a couple of 
> months after Spark #5570. According to spark-core's pom.xml, we're using 
> 0.10.3.
> We should remove this patch on ListConverter since the upstream package no 
> longer has this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17771) Allow start-master/slave scripts to start in the foreground

2016-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17771.
---
Resolution: Fixed

> Allow start-master/slave scripts to start in the foreground
> ---
>
> Key: SPARK-17771
> URL: https://issues.apache.org/jira/browse/SPARK-17771
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.1
>Reporter: Mike Ihbe
>Priority: Minor
>
> Based on conversation from this thread: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Running-Spark-master-slave-instances-in-non-Daemon-mode-td19172.html
> Some scheduler solutions like Nomad have a simple fork-exec execution model, 
> and the daemonization causes the scheduler to lose track of the spark 
> process. I'm proposing adding a SPARK_NO_DAEMONIZE environment variable that 
> will trigger a switch in ./sbin/spark-daemon.sh to run the process in the 
> foreground. 
> This opens a question about whether or not to rename the bash script, but I 
> think that's a potentially breaking change that we should avoid at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17767:


Assignee: (was: Apache Spark)

> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17767:


Assignee: Apache Spark

> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>Assignee: Apache Spark
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543461#comment-15543461
 ] 

Apache Spark commented on SPARK-17767:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15336

> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543466#comment-15543466
 ] 

Dongjoon Hyun commented on SPARK-17767:
---

Hi, [~alexliu68].
I made a PR for this issue. Is this a reasonable direction to address your 
situation?

> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16532) Provide a REST API for submitting and tracking status of jobs

2016-10-03 Thread Joao Vasques (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543469#comment-15543469
 ] 

Joao Vasques commented on SPARK-16532:
--

Yes, the idea is not to build a whole new API, but to see whether the one 
mentioned here (http://arturmkrtchyan.com/apache-spark-hidden-rest-api) is worth 
documenting. Thanks!

> Provide a REST API for submitting and tracking status of jobs
> -
>
> Key: SPARK-16532
> URL: https://issues.apache.org/jira/browse/SPARK-16532
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, Spark Core
>Affects Versions: 1.6.0
>Reporter: Joao Vasques
>Priority: Minor
>
> When I was looking at the Spark Master dashboard I noticed the existence of a 
> REST URL. After searching a bit I found there was a non-documented REST API 
> that allows me to submit jobs without using the spark-submit script and 
> Launcher class 
> [https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/launcher/package-summary.html].
> Why isn't this API publicly documented? Do you have any plans to have a REST 
> API to submit and monitor job status?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17772) Add helper testing methods for instance weighting

2016-10-03 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17772:


 Summary: Add helper testing methods for instance weighting
 Key: SPARK-17772
 URL: https://issues.apache.org/jira/browse/SPARK-17772
 Project: Spark
  Issue Type: Test
  Components: ML
Reporter: Seth Hendrickson
Priority: Minor


More and more ML algos are accepting instance weights. We keep replicating code 
to test instance weighting in every test suite, which will get out of hand 
rather quickly. We can and should implement some generic instance weight test 
helper methods so that we can reduce duplicated code and standardize these 
tests.
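
A minimal sketch of the kind of helper being proposed (names and signature are 
illustrative, not the actual MLlib test utility): an estimator that supports a weight 
column should produce the same fit when every instance carries the same constant 
weight as it does with no weights at all.
{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, functions => F}

// Hypothetical helper: check that a constant weight column is a no-op.
object WeightTestUtils {
  def checkConstantWeightIsNoOp[M](
      df: DataFrame,
      fitWithWeight: (DataFrame, String) => M,  // trains using the given weight column
      fitWithoutWeight: DataFrame => M,         // trains ignoring weights
      coefficients: M => Vector,
      tol: Double = 1e-6): Unit = {
    val weighted = df.withColumn("w", F.lit(3.0))
    val m1 = fitWithWeight(weighted, "w")
    val m2 = fitWithoutWeight(df)
    val (c1, c2) = (coefficients(m1).toArray, coefficients(m2).toArray)
    require(c1.length == c2.length, "coefficient vectors differ in length")
    c1.zip(c2).foreach { case (a, b) =>
      require(math.abs(a - b) <= tol, s"coefficients differ: $a vs $b")
    }
  }
}
{code}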



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543488#comment-15543488
 ] 

Franck Tago commented on SPARK-17758:
-

[~hvanhovell] It should return non null in this case, because the null value is a 
false representation of my data.

My initial input source had 3 rows:

a,1
b,2
c,10

From my observations:

partition 1 contains 2 rows
a,1
b,2

partition 2 contains 1 row
c,10

partition 3 contains 0 rows

Why should last return null in this case?

I understand that last would not be deterministic, but why should it return null 
in this case?
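
A spark-shell sketch of the scenario described above (assuming the usual spark 
session; parallelizing 3 rows into 4 slices guarantees at least one empty partition 
feeding the partial aggregation, but this only illustrates the shape of the setup):
{code}
import org.apache.spark.sql.functions.{last, max}
import spark.implicits._

val df = spark.sparkContext
  .parallelize(Seq(("a", 1), ("b", 2), ("c", 10)), numSlices = 4)
  .toDF("key", "value")

// One of the 4 partitions is empty, so its partial_last buffer never sees a row.
df.agg(max("value"), last("value")).show()
{code}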


> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result from  a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null .  
> My input data contain 3 rows . 
> The application resulted in  2 stages 
> The first stage consisted of 3 tasks . 
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contain  0 rows
> The result from the query executed for the LAST column call is NULL which I 
> believe is due to the  PARTIAL_LAST on the last partition . 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null .
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543539#comment-15543539
 ] 

Alex Liu commented on SPARK-17767:
--

It looks good to me. How about the Hive thrift server: will you patch it as well, 
or provide an example of how to configure it?

> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-03 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543555#comment-15543555
 ] 

Don Drake commented on SPARK-16845:
---

I just hit this bug as well.  Are there any suggested workarounds?

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543613#comment-15543613
 ] 

Dongjoon Hyun commented on SPARK-17767:
---

Thank you for confirming.

For Spark Thrift Server, I'm going to create another JIRA issue if the PR is 
merged.

The current PR focuses on supporting a generic ExternalCatalog on the Spark side.

> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17759) FairSchedulableBuilder should avoid to create duplicate fair scheduler-pools.

2016-10-03 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-17759:
---
Affects Version/s: (was: 2.1.0)
   2.0.0

> FairSchedulableBuilder should avoid to create duplicate fair scheduler-pools.
> -
>
> Key: SPARK-17759
> URL: https://issues.apache.org/jira/browse/SPARK-17759
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Eren Avsarogullari
> Attachments: duplicate_pools.png, duplicate_pools2.png
>
>
> If _spark.scheduler.allocation.file_ contains duplicate pools, all of them are 
> created when _SparkContext_ is initialized, but only one of them is used and 
> the others are redundant. This _redundant pool_ creation needs to be 
> fixed. 
> *Code to Reproduce* :
> {code:java}
> val conf = new 
> SparkConf().setAppName("spark-fairscheduler").setMaster("local")
> conf.set("spark.scheduler.mode", "FAIR")
> conf.set("spark.scheduler.allocation.file", 
> "src/main/resources/fairscheduler-duplicate-pools.xml")
> val sc = new SparkContext(conf)
> {code}
> *fairscheduler-duplicate-pools.xml* :
> The following sample shows just duplicated default and duplicate_pool1 pools, but 
> the same applies to N duplicated default and/or other pools.
> {code:xml}
> <allocations>
>   <pool name="default">
>     <minShare>0</minShare>
>     <weight>1</weight>
>     <schedulingMode>FIFO</schedulingMode>
>   </pool>
>   <pool name="default">
>     <minShare>0</minShare>
>     <weight>1</weight>
>     <schedulingMode>FIFO</schedulingMode>
>   </pool>
>   <pool name="duplicate_pool1">
>     <minShare>1</minShare>
>     <weight>1</weight>
>     <schedulingMode>FAIR</schedulingMode>
>   </pool>
>   <pool name="duplicate_pool1">
>     <minShare>2</minShare>
>     <weight>2</weight>
>     <schedulingMode>FAIR</schedulingMode>
>   </pool>
> </allocations>
> {code}
> *Debug Screenshot* :
> This means Pool.schedulableQueue (a ConcurrentLinkedQueue[Schedulable]) holds 4 
> pools: default, default, duplicate_pool1, duplicate_pool1. However, 
> Pool.schedulableNameToSchedulable (a ConcurrentHashMap[String, Schedulable]) holds 
> only default and duplicate_pool1, because the pool name is the key, so one copy 
> each of default and duplicate_pool1 is redundant yet still lives in 
> Pool.schedulableQueue.
> Please have a look at the *attached screenshots*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543488#comment-15543488
 ] 

Franck Tago edited comment on SPARK-17758 at 10/3/16 11:04 PM:
---

[~hvanhovell] It should return non null in this case, because the null value is a 
false representation of my data.

My initial input source had 3 rows:

a,1
b,2
c,10

From my observations:

partition 1 contains 2 rows
a,1
b,2

partition 2 contains 1 row
c,10

partition 3 contains 0 rows

Why should last return null in this case?

I understand that last would not be deterministic, but why should it return null 
in this case?



was (Author: tafra...@gmail.com):
[~hvanhovell] It should return on null in this case because the null  value  is 
a false representation of my data . 

My initial  input source  had 3 rows . 

a,1
b,2
c,10

From  observations show 

partition 1  contains 2 rows  
a,1
b,2 

partition 2 contains 1 row  
c,10  

partition  3 contains  0 rows 

why should  last return  null in this case  . 

I understand that last would not be deterministic but why should should it 
return null in this case ?


> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result from  a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null .  
> My input data contain 3 rows . 
> The application resulted in  2 stages 
> The first stage consisted of 3 tasks . 
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contain  0 rows
> The result from the query executed for the LAST column call is NULL which I 
> believe is due to the  PARTIAL_LAST on the last partition . 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null .
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543488#comment-15543488
 ] 

Franck Tago edited comment on SPARK-17758 at 10/3/16 11:04 PM:
---

[~hvanhovell] It should return a non null value in this case, because the null 
value is a false representation of my data.

My initial input source had 3 rows:

a,1
b,2
c,10

From my observations:

partition 1 contains 2 rows
a,1
b,2

partition 2 contains 1 row
c,10

partition 3 contains 0 rows

Why should last return null in this case?

I understand that last would not be deterministic, but why should it return null 
in this case?



was (Author: tafra...@gmail.com):
[~hvanhovell] It should return non null in this case because the null  value  
is a false representation of my data . 

My initial  input source  had 3 rows . 

a,1
b,2
c,10

From  observations show 

partition 1  contains 2 rows  
a,1
b,2 

partition 2 contains 1 row  
c,10  

partition  3 contains  0 rows 

why should  last return  null in this case  . 

I understand that last would not be deterministic but why should should it 
return null in this case ?


> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result from  a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null .  
> My input data contain 3 rows . 
> The application resulted in  2 stages 
> The first stage consisted of 3 tasks . 
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contain  0 rows
> The result from the query executed for the LAST column call is NULL which I 
> believe is due to the  PARTIAL_LAST on the last partition . 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null .
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17767:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-15777

> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543733#comment-15543733
 ] 

Reynold Xin commented on SPARK-17767:
-

Guys - while I think this is very useful to do, I'm going to mark this as 
"later" for now. The reason is that there are a lot of things to consider 
before making this switch, including:

- The ExternalCatalog API is currently internal, and we can't just make it 
public without thinking about the consequences.
- SPARK-15777 We need to design this in the context of catalog federation and 
persistence.
- SPARK-15691 Refactoring of how we integrate with Hive.


> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543733#comment-15543733
 ] 

Reynold Xin edited comment on SPARK-17767 at 10/3/16 11:22 PM:
---

Guys - while I think this is very useful to do, I'm going to mark this as 
"later" for now. The reason is that there are a lot of things to consider 
before making this switch, including:

- The ExternalCatalog API is currently internal, and we can't just make it 
public without thinking about the consequences and whether this API is 
maintainable in the long run.
- SPARK-15777 We need to design this in the context of catalog federation and 
persistence.
- SPARK-15691 Refactoring of how we integrate with Hive.

This is not as simple as just submitting a PR to make it pluggable.


was (Author: rxin):
Guys - while I think this is very useful to do, I'm going to mark this as 
"later" for now. The reason is that there are a lot of things to consider 
before making this switch, including:

- The ExternalCatalog API is currently internal, and we can't just make it 
public without thinking about the consequences.
- SPARK-15777 We need to design this in the context of catalog federation and 
persistence.
- SPARK-15691 Refactoring of how we integrate with Hive.


> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-17767.
---
Resolution: Later

> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543738#comment-15543738
 ] 

Herman van Hovell commented on SPARK-17758:
---

Yeah, you have a point there. It seems to be merging in the results of the empty 
partition. We'll need to add a flag to {{Last}} aggregate, and figure out the 
merging. This will be fun to test.
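
A toy model of that merge problem in plain Scala (not Catalyst code): each partial 
buffer carries the last value it saw, which may legitimately be null, plus a flag 
recording whether the partition produced any input at all, and merge only prefers 
the right-hand buffer when that flag is set.
{code}
final case class LastBuffer(last: Any, valueSet: Boolean)

def mergeLast(left: LastBuffer, right: LastBuffer): LastBuffer =
  if (right.valueSet) right else left

// Merging a real partial result with the buffer of an empty partition keeps the
// real result instead of overwriting it with null.
val merged = mergeLast(LastBuffer(10, valueSet = true), LastBuffer(null, valueSet = false))
// merged == LastBuffer(10, true)
{code}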

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result from  a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null .  
> My input data contain 3 rows . 
> The application resulted in  2 stages 
> The first stage consisted of 3 tasks . 
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contain  0 rows
> The result from the query executed for the LAST column call is NULL which I 
> believe is due to the  PARTIAL_LAST on the last partition . 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null .
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17767) Spark SQL ExternalCatalog API custom implementation support

2016-10-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543802#comment-15543802
 ] 

Dongjoon Hyun commented on SPARK-17767:
---

Yep. I closed the PR.

> Spark SQL ExternalCatalog API custom implementation support
> ---
>
> Key: SPARK-17767
> URL: https://issues.apache.org/jira/browse/SPARK-17767
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Alex Liu
>
> There is no way/easy way to configure Spark to use customized 
> ExternalCatalog.  Internal source code is hardcoded to use either hive or 
> in-memory metastore. Spark SQL thriftserver is hardcoded to use 
> HiveExternalCatalog. We should be able to create a custom external catalog 
> and thriftserver should be able to use it. Potentially Spark SQL thriftserver 
> shouldn't depend on Hive thriftserver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


