[jira] [Commented] (SPARK-4715) ShuffleMemoryManager.tryToAcquire may return a negative value
[ https://issues.apache.org/jira/browse/SPARK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232738#comment-14232738 ] Apache Spark commented on SPARK-4715: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3575

ShuffleMemoryManager.tryToAcquire may return a negative value - Key: SPARK-4715 URL: https://issues.apache.org/jira/browse/SPARK-4715 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu

Here is a unit test to demonstrate it:
{code}
test("threads should not be granted a negative size") {
  val manager = new ShuffleMemoryManager(1000L)
  manager.tryToAcquire(700L)
  val latch = new CountDownLatch(1)
  startThread("t1") {
    manager.tryToAcquire(300L)
    latch.countDown()
  }
  latch.await() // Wait until `t1` calls `tryToAcquire`
  val granted = manager.tryToAcquire(300L)
  assert(0 === granted, "granted is negative")
}
{code}
It outputs: "0 did not equal -200 granted is negative"

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
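To make the arithmetic behind the -200 concrete, here is a minimal pure-Scala sketch of the bug class and the clamp-style fix. `ToyMemoryManager` and its fair-share accounting are invented for illustration; this is NOT Spark's real ShuffleMemoryManager.

```scala
import scala.collection.mutable

// Hypothetical manager: each registered thread is entitled to an equal share of
// maxMemory. If a thread already holds more than its share, the naive grant
// formula goes negative -- clamping it is the essence of the fix.
class ToyMemoryManager(maxMemory: Long) {
  private val threadMemory = mutable.Map.empty[String, Long]

  // The thread name is passed explicitly so the race is easy to replay sequentially.
  def tryToAcquire(thread: String, numBytes: Long): Long = synchronized {
    val curMem = threadMemory.getOrElse(thread, 0L)
    threadMemory(thread) = curMem // register this thread before computing shares
    val share = maxMemory / threadMemory.size
    // Naive grant: with 1000 total, 2 threads and 700 already held,
    // this is 1000/2 - 700 = -200, the exact value in the failing test above.
    val maxToGrant = math.min(numBytes, share - curMem)
    // The fix: never hand a negative grant back to the caller.
    val granted = math.max(0L, maxToGrant)
    threadMemory(thread) = curMem + granted
    granted
  }
}
```

With 1000 bytes total, once "main" holds 700 and "t1" is registered, the unclamped formula yields -200; the clamp returns 0 instead.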
[jira] [Created] (SPARK-4720) Remainder should also return null if the divider is 0.
Takuya Ueshin created SPARK-4720: Summary: Remainder should also return null if the divider is 0. Key: SPARK-4720 URL: https://issues.apache.org/jira/browse/SPARK-4720 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin This is a follow-up of SPARK-4593.
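The intended SQL semantics (mirroring SPARK-4593's fix for division) can be sketched with `Option` standing in for SQL NULL. This is an illustrative model, not Catalyst's actual Remainder expression.

```scala
// Sketch: x % 0 should yield NULL instead of throwing ArithmeticException,
// and NULL on either side propagates, matching SQL three-valued semantics.
object NullSafeRemainder {
  def remainder(dividend: Option[Int], divisor: Option[Int]): Option[Int] =
    (dividend, divisor) match {
      case (Some(a), Some(b)) if b != 0 => Some(a % b)
      case _                            => None // divisor is 0, or either side is NULL
    }
}
```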
[jira] [Commented] (SPARK-4397) Reorganize 'implicit's to improve the API convenience
[ https://issues.apache.org/jira/browse/SPARK-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232840#comment-14232840 ] Apache Spark commented on SPARK-4397: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/3580 Reorganize 'implicit's to improve the API convenience - Key: SPARK-4397 URL: https://issues.apache.org/jira/browse/SPARK-4397 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.3.0 As I said here, http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAPn6-YTeUwGqvGady=vUjX=9bl_re7wb5-delbvfja842qm...@mail.gmail.com%3E many people asked how to convert an RDD to a PairRDDFunctions. If we can reorganize the `implicit`s properly, the API will be more convenient without importing SparkContext._ explicitly.
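One way such a reorganization can work, sketched with invented names rather than Spark's actual classes: Scala searches the companion object of the source type when resolving implicit conversions, so placing the conversion there removes the need for a wildcard import like `import SparkContext._`.

```scala
// Toy illustration of the implicit-scope trick. MyRDD and PairFunctions are
// made-up stand-ins for RDD and PairRDDFunctions.
class MyRDD[T](val data: Seq[T])

object MyRDD {
  // Found via the implicit scope of MyRDD: no import required at call sites.
  implicit def toPairFunctions[K, V](rdd: MyRDD[(K, V)]): PairFunctions[K, V] =
    new PairFunctions(rdd)
}

class PairFunctions[K, V](rdd: MyRDD[(K, V)]) {
  def keys: Seq[K] = rdd.data.map(_._1)
}
```

A call site can now write `new MyRDD(Seq("a" -> 1)).keys` with no explicit import, because the compiler finds the conversion in `MyRDD`'s companion object.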
[jira] [Commented] (SPARK-4694) Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232778#comment-14232778 ] Apache Spark commented on SPARK-4694: - User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/3576 Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode - Key: SPARK-4694 URL: https://issues.apache.org/jira/browse/SPARK-4694 Project: Spark Issue Type: Bug Components: YARN Reporter: SaintBacchus Recently, when I used YARN HA mode to test HiveThriftServer2, I found that the driver could not exit by itself. To reproduce it, do as follows: 1. use YARN HA mode and set am.maxAttemp = 1 for convenience; 2. kill the active resource manager in the cluster. The expected result is that the application simply fails, because maxAttemp was 1. The actual result is that all executors ended, but the driver was still there and never closed.
[jira] [Resolved] (SPARK-3391) Support attaching more than 1 EBS volumes
[ https://issues.apache.org/jira/browse/SPARK-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3391. Resolution: Fixed Fix Version/s: 1.2.0 This was merged. Support attaching more than 1 EBS volumes - Key: SPARK-3391 URL: https://issues.apache.org/jira/browse/SPARK-3391 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.2.0 Currently spark-ec2 supports attaching only one EBS volume. Attaching multiple EBS volumes can increase the aggregate throughput of EBS.
[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop
[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232802#comment-14232802 ] Micael Capitão commented on SPARK-3553: --- I confirm the weird behaviour running in HDFS too. I have the Spark Streaming app with a filestream on dir hdfs:///user/altaia/cdrs/stream. It is running on YARN and uses checkpointing. For now the application only reads the files and prints the number of read lines. Having initially these files: [1] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_6_06_20.txt.gz [2] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_7_11_01.txt.gz [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz When I start the application, they are processed. When I add a new file [7] by renaming it to end with .gz it is processed too. [7] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_36_34.txt.gz But right after the [7], Spark Streaming reprocesses some of the initially present files: [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz And does not repeat anything else on the next batches. When adding yet another file, it is not detected and stays like that. 
Spark Streaming app streams files that have already been streamed in an endless loop Key: SPARK-3553 URL: https://issues.apache.org/jira/browse/SPARK-3553 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.1 Environment: Ec2 cluster - YARN Reporter: Ezequiel Bella Labels: S3, Streaming, YARN We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors, each with 1 core and 588 MB of RAM. The app streams from a directory in S3 which is constantly being written to; this is the line of code that achieves that: val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true) The purpose of using fileStream instead of textFileStream is to customize the way Spark handles existing files when the process starts. We want to process just the new files that are added after the process launches and omit the existing ones. We configured a batch duration of 10 seconds. The process goes fine while we add a small number of files to S3, say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But when we try to add a larger number of files, we face a strange behavior: the application starts streaming files that have already been streamed. For example, I add 20 files to S3. The files are processed in 3 batches. The first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these batches endlessly with the same files. Any thoughts on what can be causing this? Regards, Easyb
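For reference, the reporter's `fileStream` call reconstructed with the imports it relies on. `Settings.S3RequestsHost` is the reporter's own configuration object and `ssc` an existing StreamingContext, so this is a sketch, not a standalone program.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// fileStream takes (directory, path filter, newFilesOnly). Passing
// newFilesOnly = true asks Spark to skip files already present when the
// stream starts -- the behavior this report shows misbehaving.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  Settings.S3RequestsHost,   // reporter's S3 directory
  (path: Path) => true,      // accept every file name
  true                       // newFilesOnly
)
```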
[jira] [Commented] (SPARK-4718) spark-ec2 script creates empty spark folder
[ https://issues.apache.org/jira/browse/SPARK-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232795#comment-14232795 ] Ignacio Blasco Lopez commented on SPARK-4718: - Tested with spark-1.1.1-bin-hadoop2.3 and spark-1.1.1-bin-hadoop2.4 spark-ec2 script creates empty spark folder --- Key: SPARK-4718 URL: https://issues.apache.org/jira/browse/SPARK-4718 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.1 Reporter: Ignacio Blasco Lopez The spark-ec2 script in version 1.1.1 creates a spark folder containing only a conf directory.
[jira] [Commented] (SPARK-4714) Checking block is null or not after having gotten info.lock in remove block method
[ https://issues.apache.org/jira/browse/SPARK-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232717#comment-14232717 ] Apache Spark commented on SPARK-4714: - User 'suyanNone' has created a pull request for this issue: https://github.com/apache/spark/pull/3574 Checking block is null or not after having gotten info.lock in remove block method -- Key: SPARK-4714 URL: https://issues.apache.org/jira/browse/SPARK-4714 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 1.1.0 Reporter: SuYan Priority: Minor removeBlock()/dropOldBlock()/dropFromMemory() all share the same logic: 1. info = blockInfo.get(id); 2. if (info != null); 3. info.synchronized. There is a possibility that one thread acquires info.lock only after a previous thread, while holding info.lock, has already removed the entry from blockInfo. The current code does not re-check whether info is still valid after acquiring info.lock to remove the block, although as the code stands this does not cause any errors.
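A minimal sketch of the suggested re-check, using a plain ConcurrentHashMap rather than Spark's BlockManager internals (names are illustrative, not Spark's code):

```scala
import java.util.concurrent.ConcurrentHashMap

object BlockRemoval {
  val blockInfo = new ConcurrentHashMap[String, AnyRef]()

  def removeBlock(id: String): Boolean = {
    val info = blockInfo.get(id)
    if (info == null) return false
    info.synchronized {
      // Re-check under the lock: remove(key, value) only succeeds if the map
      // still holds this exact info object, so a racing thread that removed it
      // between our get() and the synchronized block is detected here.
      if (blockInfo.remove(id, info)) {
        // ... actually release the block's memory/disk data here ...
        true
      } else false
    }
  }
}
```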
[jira] [Issue Comment Deleted] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop
[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micael Capitão updated SPARK-3553: -- Comment: was deleted (was: I confirm the weird behaviour running in HDFS too. I have the Spark Streaming app with a filestream on dir hdfs:///user/altaia/cdrs/stream Having initially these files: [1] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_6_06_20.txt.gz [2] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_7_11_01.txt.gz [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz When I start the application, they are processed. When I add a new file [7] by renaming it to end with .gz it is processed too. [7] hdfs://blade2.ct.ptin.corppt.com:8020/user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_36_34.txt.gz But right after the [7], Spark Streaming reprocesses some of the initially present files: [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz And does not repeat anything else on the next batches. When adding yet another file, it is not detected and stays like that.) 
Spark Streaming app streams files that have already been streamed in an endless loop Key: SPARK-3553 URL: https://issues.apache.org/jira/browse/SPARK-3553 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.1 Environment: Ec2 cluster - YARN Reporter: Ezequiel Bella Labels: S3, Streaming, YARN We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors, each with 1 core and 588 MB of RAM. The app streams from a directory in S3 which is constantly being written to; this is the line of code that achieves that: val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true) The purpose of using fileStream instead of textFileStream is to customize the way Spark handles existing files when the process starts. We want to process just the new files that are added after the process launches and omit the existing ones. We configured a batch duration of 10 seconds. The process goes fine while we add a small number of files to S3, say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But when we try to add a larger number of files, we face a strange behavior: the application starts streaming files that have already been streamed. For example, I add 20 files to S3. The files are processed in 3 batches. The first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these batches endlessly with the same files. Any thoughts on what can be causing this? Regards, Easyb
[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
[ https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232734#comment-14232734 ] Ankur Dave commented on SPARK-4672: --- [~jerrylead] Thanks for investigating this bug and the excellent explanation. Now that the PRs are merged, can you confirm that the bug is fixed for you? I haven't yet been able to reproduce it locally. Cut off the super long serialization chain in GraphX to avoid the StackOverflow error - Key: SPARK-4672 URL: https://issues.apache.org/jira/browse/SPARK-4672 Project: Spark Issue Type: Bug Components: GraphX, Spark Core Affects Versions: 1.1.0 Reporter: Lijie Xu Priority: Critical Fix For: 1.2.0 While running iterative algorithms in GraphX, a StackOverflowError will reliably occur in the serialization phase at about the 300th iteration. In general, these kinds of algorithms have two things in common: # They have a long computing chain. {code:borderStyle=solid} (e.g., “degreeGraph => subGraph => degreeGraph => subGraph => …”) {code} # They will iterate many times to converge.
An example:
{code:borderStyle=solid}
// K-Core Algorithm
val kNum = 5
var degreeGraph = graph.outerJoinVertices(graph.degrees) {
  (vid, vd, degree) => degree.getOrElse(0)
}.cache()
do {
  val subGraph = degreeGraph.subgraph(
    vpred = (vid, degree) => degree >= kNum
  ).cache()
  val newDegreeGraph = subGraph.degrees
  degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
    (vid, vd, degree) => degree.getOrElse(0)
  }.cache()
  isConverged = check(degreeGraph)
} while (isConverged == false)
{code}
After about 300 iterations, the StackOverflowError will definitely occur, with the following stack trace:
{code:borderStyle=solid}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError
java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
{code}
It is a very tricky bug, which only occurs with enough iterations. Since it took us a long time to find out its causes, we detail them in the following 3 paragraphs.

h3. Phase 1: Try using checkpoint() to shorten the lineage

It is easy to suspect that the long lineage is the cause. For some RDDs, their lineages grow with the iterations. Also, for some magical references, their lineage lengths never decrease and finally become very long. As a result, the call stack of the task's serialization()/deserialization() methods becomes very long too, and finally exhausts the whole JVM stack. Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) gains 3 OneToOne dependencies in each iteration in the above example. Lineage length refers to the maximum length of OneToOne dependencies (e.g., from the finalRDD to the ShuffledRDD) in each stage. To shorten the lineage, a checkpoint() is performed every N (e.g., 10) iterations. Then, the lineage drops back down whenever it reaches a certain length (e.g., 33). However, the StackOverflowError still occurs after 300+ iterations!

h3. Phase 2: An abnormal f closure function leads to an unbreakable serialization chain

After long debugging, we found that an abnormal _*f*_ function closure and a potential bug in GraphX (detailed in Phase 3) are Suspect Zero. Together they build another serialization chain that can bypass the lineage broken by checkpoint() (as shown in Figure 1). In other words, the serialization chain can be as long as the original lineage before checkpoint(). Figure 1 shows how the unbreakable serialization chain is generated. Yes, the OneToOneDep can be cut off by checkpoint(). However, the serialization chain can still access the previous RDDs through the (1)-(2) reference chain. As a result, the checkpoint() action is meaningless and the lineage stays as long as before. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%! The (1)-(2) chain can be observed in the debug view (Figure 2). {code:borderStyle=solid} _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> partitionsRDD: MapPartitionsRDD -> RDDs in the previous iterations {code} !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g2.png|width=100%! More description: while an RDD is being serialized, its f function {code:borderStyle=solid} e.g., f: (Iterator[A], Iterator[B]) => Iterator[V] in ZippedPartitionsRDD2 {code} will be serialized too. This action will be very dangerous if the f closure has a member
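The checkpoint-every-N workaround described in Phase 1 can be sketched as follows. This is a hedged sketch using the variable names from the K-Core example above; as the report explains, this alone did not cure the error, because the f-closure chain bypasses the truncated lineage.

```scala
// Sketch: truncate the lineage every `checkpointInterval` iterations.
// Assumes ssc.checkpointDir (or sc.setCheckpointDir) is already configured.
val checkpointInterval = 10
var iteration = 0
do {
  // ... one K-Core iteration producing a new degreeGraph, as in the example above ...
  iteration += 1
  if (iteration % checkpointInterval == 0) {
    degreeGraph.vertices.checkpoint()
    degreeGraph.edges.checkpoint()
    // Checkpointing is lazy: force an action so the data is actually written
    // and the dependency chain is cut here.
    degreeGraph.vertices.count()
    degreeGraph.edges.count()
  }
  isConverged = check(degreeGraph)
} while (!isConverged)
```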
[jira] [Updated] (SPARK-2456) Scheduler refactoring
[ https://issues.apache.org/jira/browse/SPARK-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2456: --- Assignee: (was: Reynold Xin) Scheduler refactoring - Key: SPARK-2456 URL: https://issues.apache.org/jira/browse/SPARK-2456 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin This is an umbrella ticket to track scheduler refactoring. We want to clearly define semantics and responsibilities of each component, and define explicit public interfaces for them so it is easier to understand and to contribute (also less buggy).
[jira] [Created] (SPARK-4718) spark-ec2 script creates empty spark folder
Ignacio Blasco Lopez created SPARK-4718: --- Summary: spark-ec2 script creates empty spark folder Key: SPARK-4718 URL: https://issues.apache.org/jira/browse/SPARK-4718 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.1 Reporter: Ignacio Blasco Lopez The spark-ec2 script in version 1.1.1 creates a spark folder containing only a conf directory.
[jira] [Commented] (SPARK-4720) Remainder should also return null if the divider is 0.
[ https://issues.apache.org/jira/browse/SPARK-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232848#comment-14232848 ] Apache Spark commented on SPARK-4720: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/3581 Remainder should also return null if the divider is 0. -- Key: SPARK-4720 URL: https://issues.apache.org/jira/browse/SPARK-4720 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin This is a follow-up of SPARK-4593.
[jira] [Commented] (SPARK-4719) Consolidate various narrow dep RDD classes with MapPartitionsRDD
[ https://issues.apache.org/jira/browse/SPARK-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232804#comment-14232804 ] Apache Spark commented on SPARK-4719: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/3578 Consolidate various narrow dep RDD classes with MapPartitionsRDD Key: SPARK-4719 URL: https://issues.apache.org/jira/browse/SPARK-4719 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Reynold Xin Assignee: Reynold Xin Seems like we don't really need MappedRDD, MappedValuesRDD, FlatMappedValuesRDD, FilteredRDD, GlommedRDD. They can all be implemented directly using MapPartitionsRDD.
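The consolidation argument can be seen without any Spark classes: each specialized RDD corresponds to one per-partition `Iterator => Iterator` function, which is exactly the shape MapPartitionsRDD accepts. A pure-Scala sketch:

```scala
import scala.reflect.ClassTag

// Each "specialized RDD" reduces to one function applied per partition:
// MappedRDD, FilteredRDD and GlommedRDD are just these three shapes.
object NarrowOps {
  def mapFn[T, U](f: T => U): Iterator[T] => Iterator[U] = _.map(f)
  def filterFn[T](p: T => Boolean): Iterator[T] => Iterator[T] = _.filter(p)
  def glomFn[T: ClassTag]: Iterator[T] => Iterator[Array[T]] =
    it => Iterator(it.toArray)
}
```

A single `MapPartitionsRDD(prev, fn)` parameterized by such a function therefore subsumes all of them.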
[jira] [Created] (SPARK-4716) Avoid shuffle when all-to-all operation has single input and output partition
Sandy Ryza created SPARK-4716: - Summary: Avoid shuffle when all-to-all operation has single input and output partition Key: SPARK-4716 URL: https://issues.apache.org/jira/browse/SPARK-4716 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza I encountered an application that performs joins on a bunch of small RDDs, unions the results, and then performs larger aggregations across them. Many of these small RDDs fit in a single partition. For these operations with only a single partition, there's no reason to write data to disk and then fetch it over a socket.
[jira] [Commented] (SPARK-4717) Optimize BLAS library to avoid de-reference multiple times in loop
[ https://issues.apache.org/jira/browse/SPARK-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232783#comment-14232783 ] Apache Spark commented on SPARK-4717: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/3577 Optimize BLAS library to avoid de-reference multiple times in loop -- Key: SPARK-4717 URL: https://issues.apache.org/jira/browse/SPARK-4717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Have local references to the `values` and `indices` arrays in the `Vector` object so the JVM can locate the values with one operation call. See SPARK-4581 for a similar optimization and the bytecode analysis.
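A sketch of the optimization, with a made-up `SparseLike` class standing in for MLlib's sparse vector. The point is the single field dereference hoisted out of the hot loop:

```scala
// Hypothetical stand-in for MLlib's SparseVector (not the real class).
class SparseLike(val indices: Array[Int], val values: Array[Double])

object Blas {
  // Scale a vector in place. Reading x.values once into a local means the loop
  // body works from a local-variable load instead of repeating the field
  // dereference (a getfield in bytecode) on every iteration.
  def scal(a: Double, x: SparseLike): Unit = {
    val values = x.values // hoisted local reference
    var i = 0
    while (i < values.length) {
      values(i) *= a
      i += 1
    }
  }
}
```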
[jira] [Resolved] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
[ https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-4672. --- Resolution: Fixed Issue resolved by pull request 3545 [https://github.com/apache/spark/pull/3545] Cut off the super long serialization chain in GraphX to avoid the StackOverflow error - Key: SPARK-4672 URL: https://issues.apache.org/jira/browse/SPARK-4672 Project: Spark Issue Type: Bug Components: GraphX, Spark Core Affects Versions: 1.1.0 Reporter: Lijie Xu Priority: Critical Fix For: 1.2.0 While running iterative algorithms in GraphX, a StackOverflowError will reliably occur in the serialization phase at about the 300th iteration. In general, these kinds of algorithms have two things in common: # They have a long computing chain. {code:borderStyle=solid} (e.g., “degreeGraph => subGraph => degreeGraph => subGraph => …”) {code} # They will iterate many times to converge. An example:
{code:borderStyle=solid}
// K-Core Algorithm
val kNum = 5
var degreeGraph = graph.outerJoinVertices(graph.degrees) {
  (vid, vd, degree) => degree.getOrElse(0)
}.cache()
do {
  val subGraph = degreeGraph.subgraph(
    vpred = (vid, degree) => degree >= kNum
  ).cache()
  val newDegreeGraph = subGraph.degrees
  degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
    (vid, vd, degree) => degree.getOrElse(0)
  }.cache()
  isConverged = check(degreeGraph)
} while (isConverged == false)
{code}
After about 300 iterations, the StackOverflowError will definitely occur, with the following stack trace:
{code:borderStyle=solid}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError
java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
{code}
It is a very tricky bug, which only occurs with enough iterations. Since it took us a long time to find out its causes, we detail them in the following 3 paragraphs.

h3. Phase 1: Try using checkpoint() to shorten the lineage

It is easy to suspect that the long lineage is the cause. For some RDDs, their lineages grow with the iterations. Also, for some magical references, their lineage lengths never decrease and finally become very long. As a result, the call stack of the task's serialization()/deserialization() methods becomes very long too, and finally exhausts the whole JVM stack. Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) gains 3 OneToOne dependencies in each iteration in the above example. Lineage length refers to the maximum length of OneToOne dependencies (e.g., from the finalRDD to the ShuffledRDD) in each stage. To shorten the lineage, a checkpoint() is performed every N (e.g., 10) iterations. Then, the lineage drops back down whenever it reaches a certain length (e.g., 33). However, the StackOverflowError still occurs after 300+ iterations!

h3. Phase 2: An abnormal f closure function leads to an unbreakable serialization chain

After long debugging, we found that an abnormal _*f*_ function closure and a potential bug in GraphX (detailed in Phase 3) are Suspect Zero. Together they build another serialization chain that can bypass the lineage broken by checkpoint() (as shown in Figure 1). In other words, the serialization chain can be as long as the original lineage before checkpoint(). Figure 1 shows how the unbreakable serialization chain is generated. Yes, the OneToOneDep can be cut off by checkpoint(). However, the serialization chain can still access the previous RDDs through the (1)-(2) reference chain. As a result, the checkpoint() action is meaningless and the lineage stays as long as before. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%! The (1)-(2) chain can be observed in the debug view (Figure 2). {code:borderStyle=solid} _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> partitionsRDD: MapPartitionsRDD -> RDDs in the previous iterations {code} !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g2.png|width=100%! More description: while an RDD is being serialized, its f function {code:borderStyle=solid} e.g., f: (Iterator[A], Iterator[B]) => Iterator[V] in ZippedPartitionsRDD2 {code} will be serialized too. This action will be very dangerous if the f closure has a member
[jira] [Updated] (SPARK-2253) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2253: --- Fix Version/s: 1.3.0 Disable partial aggregation automatically when reduction factor is low -- Key: SPARK-2253 URL: https://issues.apache.org/jira/browse/SPARK-2253 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 Once we have seen enough rows in partial aggregation without observing any reduction, the Aggregator should just turn off partial aggregation. This reduces memory usage for high-cardinality aggregations. This ticket is for Spark core. There is another ticket tracking this for SQL.
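A hedged pure-Scala sketch of the idea (class name and thresholds are invented, not Spark's Aggregator): hash-aggregate a sample of rows, and if the map is not shrinking the data enough, fall back to passing rows straight through to the shuffle.

```scala
import scala.collection.mutable

class AdaptiveAggregator[K, V](merge: (V, V) => V,
                               sampleSize: Int = 1000,
                               minReduction: Double = 0.5) {
  private val map = mutable.Map.empty[K, V]
  private var seen = 0
  private var passThrough = false

  // Returns None while the row was absorbed into the hash map, or Some(row)
  // once partial aggregation has been switched off.
  def insert(k: K, v: V): Option[(K, V)] = {
    if (passThrough) return Some(k -> v)
    seen += 1
    map.get(k) match {
      case Some(old) => map(k) = merge(old, v)
      case None      => map(k) = v
    }
    // After the sample, check the reduction factor: if the map holds more than
    // minReduction * seen distinct keys, aggregation isn't paying for itself.
    if (seen == sampleSize && map.size > seen * minReduction) passThrough = true
    None
  }

  def aggregated: Iterator[(K, V)] = map.iterator
}
```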
[jira] [Commented] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module
[ https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232719#comment-14232719 ] Aniket Bhatnagar commented on SPARK-3638: - Yes. You may want to open another JIRA ticket for having Kinesis pre-built packages on the Spark download page.

Commons HTTP client dependency conflict in extras/kinesis-asl module Key: SPARK-3638 URL: https://issues.apache.org/jira/browse/SPARK-3638 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: dependencies Fix For: 1.1.1, 1.2.0

Followed the instructions at https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md and when running the example, I get the following error:
{code}
Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:136)
at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:117)
at com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.<init>(AmazonKinesisAsyncClient.java:132)
{code}
I believe this is due to the dependency conflict described at http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E
[jira] [Created] (SPARK-4717) Optimize BLAS library to avoid de-reference multiple times in loop
DB Tsai created SPARK-4717: -- Summary: Optimize BLAS library to avoid de-reference multiple times in loop Key: SPARK-4717 URL: https://issues.apache.org/jira/browse/SPARK-4717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Keep local references to the `values` and `indices` arrays of the `Vector` object so the JVM can locate each value with a single operation instead of dereferencing the object on every loop iteration. See SPARK-4581 for a similar optimization and the bytecode analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
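The optimization described in SPARK-4717 can be sketched as follows. This is an illustrative Scala snippet, not the actual MLlib code; the `SparseVec` class and `dot` function are hypothetical stand-ins. Hoisting the field reads into locals lets the JVM load each array reference once instead of dereferencing the vector object on every iteration:

{code}
// Illustrative sketch (not the actual MLlib code): hoist the field reads
// out of the hot loop so the JVM loads each array reference once.
class SparseVec(val indices: Array[Int], val values: Array[Double])

def dot(x: SparseVec, y: Array[Double]): Double = {
  val xIndices = x.indices // local reference: one field load, reused below
  val xValues = x.values
  var sum = 0.0
  var i = 0
  while (i < xIndices.length) {
    sum += xValues(i) * y(xIndices(i))
    i += 1
  }
  sum
}
{code}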
[jira] [Commented] (SPARK-4694) Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232724#comment-14232724 ] SaintBacchus commented on SPARK-4694: - Thanks for the reply, [~vanzin]. The problem is clear: the scheduler backend noticed that the AM had exited, so it called sc.stop to shut down the driver process, but a user thread was still alive, which causes this problem. To fix it, use System.exit(-1) instead of sc.stop, so that the JVM does not wait for all user threads to finish before exiting. Is it acceptable to use System.exit() in Spark code? Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode - Key: SPARK-4694 URL: https://issues.apache.org/jira/browse/SPARK-4694 Project: Spark Issue Type: Bug Components: YARN Reporter: SaintBacchus Recently, when I used YARN HA mode to test HiveThriftServer2, I found a problem: the driver could not exit by itself. To reproduce it, do as follows: 1. use YARN HA mode and set am.maxAttemp = 1 for convenience 2. kill the active resource manager in the cluster. The expected result is that the application simply fails, because maxAttemp was 1. The actual result is that all executors end, but the driver stays alive and never closes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
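The 'process leak' in SPARK-4694 reduces to a minimal sketch (illustrative Scala, not Spark code): a non-daemon user thread keeps the JVM alive after sc.stop() returns, while System.exit() terminates the JVM regardless of such threads.

{code}
// A non-daemon thread like HiveThriftServer2's keeps the JVM alive even
// after the Spark context is stopped.
val userThread = new Thread(new Runnable {
  override def run(): Unit = while (true) Thread.sleep(1000)
})
userThread.setDaemon(false) // the default for threads created from main
userThread.start()

// sc.stop()       // returns, but the JVM then waits on userThread forever
// System.exit(-1) // runs shutdown hooks and terminates the JVM immediately
{code}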
[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop
[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232801#comment-14232801 ] Micael Capitão commented on SPARK-3553: --- I confirm the weird behaviour running in HDFS too. I have the Spark Streaming app with a filestream on dir hdfs:///user/altaia/cdrs/stream Having initially these files: [1] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_6_06_20.txt.gz [2] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_7_11_01.txt.gz [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz When I start the application, they are processed. When I add a new file [7] by renaming it to end with .gz it is processed too. [7] hdfs://blade2.ct.ptin.corppt.com:8020/user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_36_34.txt.gz But right after the [7], Spark Streaming reprocesses some of the initially present files: [3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz [4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz [5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz [6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz And does not repeat anything else on the next batches. When adding yet another file, it is not detected and stays like that. 
Spark Streaming app streams files that have already been streamed in an endless loop Key: SPARK-3553 URL: https://issues.apache.org/jira/browse/SPARK-3553 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.1 Environment: Ec2 cluster - YARN Reporter: Ezequiel Bella Labels: S3, Streaming, YARN We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB of RAM each. The app streams from a directory in S3 which is constantly being written to; this is the line of code that achieves that: {code} val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true) {code} The purpose of using fileStream instead of textFileStream is to customize the way that Spark handles existing files when the process starts. We want to process just the new files that are added after the process launches and omit the existing ones. We configured a batch duration of 10 seconds. The process goes fine while we add a small number of files to S3, let's say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But when we try to add a larger number of files, we face a strange behavior; the application starts streaming files that have already been streamed. For example, I add 20 files to S3. The files are processed in 3 batches. The first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these phases endlessly with the same files. Any thoughts on what could be causing this? Regards, Easyb -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2253) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2253: --- Assignee: (was: Reynold Xin) Disable partial aggregation automatically when reduction factor is low -- Key: SPARK-2253 URL: https://issues.apache.org/jira/browse/SPARK-2253 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Fix For: 1.3.0 Once we have seen enough rows during partial aggregation without observing any reduction, the Aggregator should turn off partial aggregation. This reduces memory usage for high-cardinality aggregations. This ticket is for Spark core; there is another ticket tracking the same change for SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
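The heuristic proposed in SPARK-2253 can be sketched as follows; the threshold values and names here are assumptions for illustration, not taken from any eventual implementation:

{code}
// Decide whether to disable partial aggregation after sampling some rows.
// `rowsSeen` is the number of input rows inserted so far; `distinctKeys`
// is the current size of the aggregation hash map.
val sampleThreshold = 100000 // hypothetical number of rows to observe first
val minReduction = 0.5       // hypothetical: require at least 2x reduction

def shouldDisablePartialAgg(rowsSeen: Long, distinctKeys: Long): Boolean =
  rowsSeen >= sampleThreshold &&
    distinctKeys.toDouble / rowsSeen > minReduction
// When this returns true, the Aggregator would stop inserting into the
// hash map and pass rows straight through to the shuffle.
{code}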
[jira] [Commented] (SPARK-4085) Job will fail if a shuffle file that's read locally gets deleted
[ https://issues.apache.org/jira/browse/SPARK-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232839#comment-14232839 ] Apache Spark commented on SPARK-4085: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/3579 Job will fail if a shuffle file that's read locally gets deleted Key: SPARK-4085 URL: https://issues.apache.org/jira/browse/SPARK-4085 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Reynold Xin Priority: Critical This commit: https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a changed the behavior of fetching local shuffle blocks such that if a shuffle block is not found locally, the shuffle block is no longer marked as failed, and a fetch failed exception is not thrown (this is because the catch block here won't ever be invoked: https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a#diff-e6e1631fa01e17bf851f49d30d028823R202 because the exception thrown from getLocalFromDisk() doesn't get thrown until next() gets called on the iterator). [~rxin] [~matei] it looks like you guys changed the test for this to catch the new exception that gets thrown (https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a#diff-9c2e1918319de967045d04caf813a7d1R93). Was that intentional? Because the new exception is a SparkException and not a FetchFailedException, jobs with missing local shuffle data will now fail, rather than having the map stage get retried. This problem is reproducible with this test case: {code} test("hash shuffle manager recovers when local shuffle files get deleted") { val conf = new SparkConf(false) conf.set("spark.shuffle.manager", "hash") sc = new SparkContext("local", "test", conf) val rdd = sc.parallelize(1 to 10, 2).map((_, 1)).reduceByKey(_ + _) rdd.count() // Delete one of the local shuffle blocks. 
sc.env.blockManager.diskBlockManager.getFile(new ShuffleBlockId(0, 0, 0)).delete() rdd.count() } {code} which will fail on the second rdd.count(). This is a regression from 1.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4715) ShuffleMemoryManager.tryToAcquire may return a negative value
Shixiong Zhu created SPARK-4715: --- Summary: ShuffleMemoryManager.tryToAcquire may return a negative value Key: SPARK-4715 URL: https://issues.apache.org/jira/browse/SPARK-4715 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu Here is a unit test to demonstrate it: {code} test("threads should not be granted a negative size") { val manager = new ShuffleMemoryManager(1000L) manager.tryToAcquire(700L) val latch = new CountDownLatch(1) startThread("t1") { manager.tryToAcquire(300L) latch.countDown() } latch.await() // Wait until `t1` calls `tryToAcquire` val granted = manager.tryToAcquire(300L) assert(0 === granted, "granted is negative") } {code} It outputs "0 did not equal -200 granted is negative" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
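The negative grant can occur when the amount offered to a thread is computed from the remaining pool and a per-thread cap, which can go below zero once other threads hold most of the memory. A minimal sketch of one possible fix (an assumption for illustration, not necessarily the merged patch) is to clamp the grant at zero:

{code}
// Illustrative only: clamp the computed grant so it is never negative.
// `curMem` is what this thread already holds; `maxPerThread` is its cap.
def grant(requested: Long, freeMemory: Long,
          maxPerThread: Long, curMem: Long): Long = {
  val headroom = math.min(freeMemory, maxPerThread - curMem) // may be < 0
  math.max(0L, math.min(requested, headroom))
}
{code}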
[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
[ https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232845#comment-14232845 ] Lijie Xu commented on SPARK-4672: - Thank you [~ankurdave]. Yes, the StackOverflow error disappears once the PRs are merged and checkpoint() is performed every N iterations. However, if the lineage of an iterative algorithm is too long, we still need to call checkpoint() manually to cut off the lineage and avoid this error. Moreover, the fix suggested by [~jason.dai] is worthwhile, since this is a general problem. We need more elegant methods to avoid the long chain of the task's serialization(). Cut off the super long serialization chain in GraphX to avoid the StackOverflow error - Key: SPARK-4672 URL: https://issues.apache.org/jira/browse/SPARK-4672 Project: Spark Issue Type: Bug Components: GraphX, Spark Core Affects Versions: 1.1.0 Reporter: Lijie Xu Priority: Critical Fix For: 1.2.0 While running iterative algorithms in GraphX, a StackOverflow error reliably occurs in the serialization phase at about the 300th iteration. In general, these kinds of algorithms have two things in common: # They have a long computing chain. {code:borderStyle=solid} (e.g., degreeGraph => subGraph => degreeGraph => subGraph => ...) {code} # They iterate many times to converge. 
An example: {code:borderStyle=solid} //K-Core Algorithm val kNum = 5 var isConverged = false var degreeGraph = graph.outerJoinVertices(graph.degrees) { (vid, vd, degree) => degree.getOrElse(0) }.cache() do { val subGraph = degreeGraph.subgraph( vpred = (vid, degree) => degree >= kNum ).cache() val newDegreeGraph = subGraph.degrees degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) { (vid, vd, degree) => degree.getOrElse(0) }.cache() isConverged = check(degreeGraph) } while (isConverged == false) {code} After about 300 iterations, a StackOverflow error will definitely occur with the following stack trace: {code:borderStyle=solid} Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275) java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230) java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426) java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) {code} It is a very tricky bug, which only occurs with enough iterations. Since it took us a long time to find its causes, we detail them in the following 3 paragraphs. h3. Phase 1: Try using checkpoint() to shorten the lineage It is easy to suspect that the long lineage is the cause. For some RDDs, the lineage grows with the iterations. Also, for some magical references, the lineage length never decreases and finally becomes very long. As a result, the call stack of the task's serialization()/deserialization() methods becomes very long too, which finally exhausts the whole JVM stack. Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 OneToOne dependencies in each iteration in the above example. Lineage length refers to the maximum length of OneToOne dependencies (e.g., from the final RDD to the ShuffledRDD) in each stage. 
To shorten the lineage, a checkpoint() is performed every N (e.g., 10) iterations. Then, the lineage drops down when it reaches a certain length (e.g., 33). However, the StackOverflow error still occurs after 300+ iterations! h3. Phase 2: An abnormal f closure leads to an unbreakable serialization chain After a long debugging session, we found that an abnormal _*f*_ function closure and a potential bug in GraphX (detailed in Phase 3) are Suspect Zero. Together they build another serialization chain that can bypass the lineage cut made by checkpoint() (as shown in Figure 1). In other words, the serialization chain can be as long as the original lineage before checkpoint(). Figure 1 shows how the unbreakable serialization chain is generated. Yes, the OneToOneDep can be cut off by checkpoint(). However, the serialization chain can still access the previous RDDs through the (1)-(2) reference chain. As a result, the checkpoint() action is meaningless and the lineage is as long as before. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%! The (1)-(2) chain can be observed in the debug view (in Figure 2). {code:borderStyle=solid} _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) ->
[jira] [Created] (SPARK-4719) Consolidate various narrow dep RDD classes with MapPartitionsRDD
Reynold Xin created SPARK-4719: -- Summary: Consolidate various narrow dep RDD classes with MapPartitionsRDD Key: SPARK-4719 URL: https://issues.apache.org/jira/browse/SPARK-4719 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Reynold Xin Assignee: Reynold Xin Seems like we don't really need MappedRDD, MappedValuesRDD, FlatMappedValuesRDD, FilteredRDD, GlommedRDD. They can all be implemented directly using MapPartitionsRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4710) Fix MLlib compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4710. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3568 [https://github.com/apache/spark/pull/3568] Fix MLlib compilation warnings -- Key: SPARK-4710 URL: https://issues.apache.org/jira/browse/SPARK-4710 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4710) Fix MLlib compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4710: - Assignee: Joseph K. Bradley Fix MLlib compilation warnings -- Key: SPARK-4710 URL: https://issues.apache.org/jira/browse/SPARK-4710 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4708) Make k-mean runs two/three times faster with dense/sparse sample
[ https://issues.apache.org/jira/browse/SPARK-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4708: - Assignee: DB Tsai Make k-mean runs two/three times faster with dense/sparse sample Key: SPARK-4708 URL: https://issues.apache.org/jira/browse/SPARK-4708 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.2.0 Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical path, and breezeSquaredDistance is slow. We should replace it with our own implementation. Here is the benchmark against mnist8m dataset. Before DenseVector: 70.04secs SparseVector: 59.05secs With this PR DenseVector: 30.58secs SparseVector: 21.14secs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
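One way to replace the slow breezeSquaredDistance is the norm-based approach sketched below; whether this matches the committed implementation is an assumption. With precomputed norms, ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b), so each distance needs only a single dot product instead of a full element-wise subtraction pass. Note that this form can lose precision when the two vectors are nearly equal:

{code}
// Illustrative dense-vector sketch, not the actual MLlib code.
def fastSquaredDistance(a: Array[Double], normA: Double,
                        b: Array[Double], normB: Double): Double = {
  var dot = 0.0
  var i = 0
  while (i < a.length) { dot += a(i) * b(i); i += 1 }
  // ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, clamped against rounding error
  math.max(0.0, normA * normA + normB * normB - 2.0 * dot)
}
{code}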
[jira] [Resolved] (SPARK-4708) Make k-mean runs two/three times faster with dense/sparse sample
[ https://issues.apache.org/jira/browse/SPARK-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4708. -- Resolution: Implemented Fix Version/s: 1.2.0 Target Version/s: 1.2.0 Make k-mean runs two/three times faster with dense/sparse sample Key: SPARK-4708 URL: https://issues.apache.org/jira/browse/SPARK-4708 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.2.0 Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical path, and breezeSquaredDistance is slow. We should replace it with our own implementation. Here is the benchmark against mnist8m dataset. Before DenseVector: 70.04secs SparseVector: 59.05secs With this PR DenseVector: 30.58secs SparseVector: 21.14secs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4721) Improve first thread to put block failed
SuYan created SPARK-4721: Summary: Improve first thread to put block failed Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Reporter: SuYan In the current code, when multiple threads try to put a block with the same blockId into the BlockManager, the thread that first puts the info into blockinfos performs the put, and the others wait until that put fails or succeeds. This works when the put succeeds, but when it fails there are problems: 1. the failed thread removes the info from blockinfos; 2. the other threads wake up and use the old info.synchronized to retry the put; 3. if a retry succeeds, marking it as success fails, because the info is no longer in pending status. All remaining threads then do the same thing: acquire info.synchronized and mark success or failure, even after one of them has succeeded. First, I don't understand why the info is removed from blockinfos while other threads are still waiting. The comment says it is so other threads can create a new block info, but a block info is just an id and a storage level, so using the old one instead of a new one makes no difference when there are waiting threads. Second, if the first thread fails, the other waiting threads could retry the put one by one, although fewer than all of them need to; alternatively, if the first thread fails, all other threads could simply log a warning and return after waking up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4721) Improve first thread to put block failed
[ https://issues.apache.org/jira/browse/SPARK-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232889#comment-14232889 ] Apache Spark commented on SPARK-4721: - User 'suyanNone' has created a pull request for this issue: https://github.com/apache/spark/pull/3582 Improve first thread to put block failed Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Reporter: SuYan In the current code, when multiple threads try to put a block with the same blockId into the BlockManager, the thread that first puts the info into blockinfos performs the put, and the others wait until that put fails or succeeds. This works when the put succeeds, but when it fails there are problems: 1. the failed thread removes the info from blockinfos; 2. the other threads wake up and use the old info.synchronized to retry the put; 3. if a retry succeeds, marking it as success fails, because the info is no longer in pending status. All remaining threads then do the same thing: acquire info.synchronized and mark success or failure, even after one of them has succeeded. First, I don't understand why the info is removed from blockinfos while other threads are still waiting. The comment says it is so other threads can create a new block info, but a block info is just an id and a storage level, so using the old one instead of a new one makes no difference when there are waiting threads. Second, if the first thread fails, the other waiting threads could retry the put one by one, although fewer than all of them need to; alternatively, if the first thread fails, all other threads could simply log a warning and return after waking up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4722) StreamingLinearRegression should return a DStream of weights when calling trainOn
Arthur Andres created SPARK-4722: Summary: StreamingLinearRegression should return a DStream of weights when calling trainOn Key: SPARK-4722 URL: https://issues.apache.org/jira/browse/SPARK-4722 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Arthur Andres Priority: Minor When training a model with a stream of new data (Spark Streaming + Spark MLlib), the weights (and the other parts of the regression model) update at every iteration. At the moment, the only output we can get is the prediction, by calling predictOn (class StreamingLinearRegression). It would be a nice improvement if trainOn returned a DStream of weights (and any other underlying model data) so we can access it and see it evolve; at the moment they are only written to the log. For example, this could then be saved so that, when reloading the application, we can access this information without having to train the model again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4722) StreamingLinearRegression should return a DStream of weights when calling trainOn
[ https://issues.apache.org/jira/browse/SPARK-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232920#comment-14232920 ] Xiangrui Meng commented on SPARK-4722: -- [~Arthur] You can use `StreamingLinearRegression.model` to get the latest model. It may be expensive and unnecessary to make trainOn return a DStream of model weights. If you want to re-use the previously trained model, you can save the last model's coefficients in the first run and then set them as the initial weights in the second run. StreamingLinearRegression should return a DStream of weights when calling trainOn - Key: SPARK-4722 URL: https://issues.apache.org/jira/browse/SPARK-4722 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Arthur Andres Priority: Minor Labels: mllib, regression, streaming When training a model with a stream of new data (Spark Streaming + Spark MLlib), the weights (and the other parts of the regression model) update at every iteration. At the moment, the only output we can get is the prediction, by calling predictOn (class StreamingLinearRegression). It would be a nice improvement if trainOn returned a DStream of weights (and any other underlying model data) so we can access it and see it evolve; at the moment they are only written to the log. For example, this could then be saved so that, when reloading the application, we can access this information without having to train the model again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4722) StreamingLinearRegression should return a DStream of weights when calling trainOn
[ https://issues.apache.org/jira/browse/SPARK-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232929#comment-14232929 ] Arthur Andres commented on SPARK-4722: -- I understand your point about it being heavy, but I still believe it would be useful. Maybe a function could be added that creates and returns a stream of weights? As far as the model() function is concerned, is it safe/possible to access it after the streaming has started? Thanks. StreamingLinearRegression should return a DStream of weights when calling trainOn - Key: SPARK-4722 URL: https://issues.apache.org/jira/browse/SPARK-4722 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Arthur Andres Priority: Minor Labels: mllib, regression, streaming When training a model with a stream of new data (Spark Streaming + Spark MLlib), the weights (and the other parts of the regression model) update at every iteration. At the moment, the only output we can get is the prediction, by calling predictOn (class StreamingLinearRegression). It would be a nice improvement if trainOn returned a DStream of weights (and any other underlying model data) so we can access it and see it evolve; at the moment they are only written to the log. For example, this could then be saved so that, when reloading the application, we can access this information without having to train the model again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4723) To abort the stages which have attempted some times
YanTang Zhai created SPARK-4723: --- Summary: To abort the stages which have attempted some times Key: SPARK-4723 URL: https://issues.apache.org/jira/browse/SPARK-4723 Project: Spark Issue Type: Improvement Reporter: YanTang Zhai Priority: Minor For some reason, some stages may be attempted many times. A threshold could be added, and stages that have been attempted more times than the threshold could be aborted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4710) Fix MLlib compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233007#comment-14233007 ] Sean Owen commented on SPARK-4710: -- (PS, I have a JIRA/PR to clean up most all of the current build warnings: http://issues.apache.org/jira/browse/SPARK-4297 ) Fix MLlib compilation warnings -- Key: SPARK-4710 URL: https://issues.apache.org/jira/browse/SPARK-4710 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
Emre Sevinç created SPARK-4724: -- Summary: JavaNetworkWordCount.java has a wrong import Key: SPARK-4724 URL: https://issues.apache.org/jira/browse/SPARK-4724 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.1.0 Reporter: Emre Sevinç Priority: Minor JavaNetworkWordCount.java has a wrong import. [Line 28|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java#L28] in [spark/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java] reads as {code:title=JavaNetworkWordCount.java|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} import org.apache.spark.streaming.Durations; {code} But according to the [documentation|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/package-summary.html], it should be [Duration|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/Duration.html] (or [Seconds|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/Seconds.html]), not **Durations** . Also [line 60|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java#L60] should change accordingly because currently it reads as: {code:title=JavaNetworkWordCount.java|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1)); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233031#comment-14233031 ] Yana Kadiyska commented on SPARK-4702: -- I'm investigating the possibility that this is caused by Hive being switched to Hive0.13 by default. Building with 0.12 profile now, will close if this goes away Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4725) Re-think custom shuffle serializers for vertex messages
Takeshi Yamamuro created SPARK-4725: --- Summary: Re-think custom shuffle serializers for vertex messages Key: SPARK-4725 URL: https://issues.apache.org/jira/browse/SPARK-4725 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor These serializers were removed in SPARK-3649 because type mismatch errors occurred in SortShuffleWriter. https://www.mail-archive.com/commits@spark.apache.org/msg04125.html However, serialization of messages between executors can be a critical performance issue in PageRank and other communication-intensive graph tasks. In the commit log, Ankur reported that the removal caused a slowdown and an increase in per-iteration communication. I made a patch to avoid the type-mismatch error in https://github.com/maropu/spark/commit/20e74f0e41ed99cb0a89ec5e5fc0e3c9e3f1038e#diff-68f4d319d5a58cbe0729476e0cb8594aR39 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233034#comment-14233034 ] Sean Owen commented on SPARK-4724: -- No, this is correct. {{Durations}} is an object, with helper methods that create a {{Duration}}. They both exist. JavaNetworkWordCount.java has a wrong import Key: SPARK-4724 URL: https://issues.apache.org/jira/browse/SPARK-4724 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.1.0 Reporter: Emre Sevinç Priority: Minor Labels: examples JavaNetworkWordCount.java has a wrong import. [Line 28|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java#L28] in [spark/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java] reads as {code:title=JavaNetworkWordCount.java|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} import org.apache.spark.streaming.Durations; {code} But according to the [documentation|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/package-summary.html], it should be [Duration|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/Duration.html] (or [Seconds|https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/Seconds.html]), not **Durations** . 
Also [line 60|https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java#L60] should change accordingly because currently it reads as: {code:title=JavaNetworkWordCount.java|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1)); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
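The relationship Sean describes below ({{Durations}} is a factory object whose static helpers build {{Duration}} values) can be sketched with stand-in classes. This is an illustrative analogue only: the class names mirror the Spark API, but the bodies are hypothetical, not Spark's actual implementation.

```java
public class FactoryDemo {
    // Stand-in for org.apache.spark.streaming.Duration: a simple value class.
    static final class Duration {
        final long millis;
        Duration(long millis) { this.millis = millis; }
    }

    // Stand-in for org.apache.spark.streaming.Durations: a companion-style
    // factory whose static helpers construct Duration values.
    static final class Durations {
        static Duration seconds(long n) { return new Duration(n * 1000L); }
    }

    public static void main(String[] args) {
        // Both names exist and cooperate: Durations.seconds(1) yields a Duration.
        Duration d = Durations.seconds(1);
        System.out.println(d.millis); // prints 1000
    }
}
```

The same pattern explains why the example imports {{Durations}} (the factory) even though the batch-interval parameter type is {{Duration}}.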
[jira] [Resolved] (SPARK-4717) Optimize BLAS library to avoid de-reference multiple times in loop
[ https://issues.apache.org/jira/browse/SPARK-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4717. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3577 [https://github.com/apache/spark/pull/3577] Optimize BLAS library to avoid de-reference multiple times in loop -- Key: SPARK-4717 URL: https://issues.apache.org/jira/browse/SPARK-4717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Fix For: 1.2.0 Have a local reference to the `values` and `indices` arrays in the `Vector` object so the JVM can locate the value with one operation call. See `SPARK-4581` for a similar optimization and the bytecode analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
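The optimization described can be sketched outside of MLlib. `Vec` and its two methods are hypothetical stand-ins for the access pattern, not Spark code: hoisting an instance field into a local variable removes the repeated field load before each array access, letting the JIT keep the array reference in a register.

```java
public class LocalRefDemo {
    // Hypothetical stand-in for a Vector that holds a values array.
    static class Vec {
        double[] values;
        Vec(double[] v) { values = v; }

        // Each iteration re-loads the `values` field before indexing into it.
        double sumViaField() {
            double s = 0.0;
            for (int i = 0; i < values.length; i++) s += values[i];
            return s;
        }

        // Hoisting the field into a local reference performs the de-reference
        // once, outside the loop.
        double sumViaLocal() {
            double[] v = values;
            double s = 0.0;
            for (int i = 0; i < v.length; i++) s += v[i];
            return s;
        }
    }

    public static void main(String[] args) {
        Vec vec = new Vec(new double[]{1.0, 2.0, 3.0});
        System.out.println(vec.sumViaField()); // prints 6.0
        System.out.println(vec.sumViaLocal()); // prints 6.0
    }
}
```

Both methods compute the same result; the difference shows up in the generated bytecode (one `getfield` per loop iteration versus one total), which is the kind of analysis SPARK-4581 refers to.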
[jira] [Updated] (SPARK-4717) Optimize BLAS library to avoid de-reference multiple times in loop
[ https://issues.apache.org/jira/browse/SPARK-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4717: - Assignee: DB Tsai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233053#comment-14233053 ] Emre Sevinç commented on SPARK-4724: Then how do I import {{org.apache.spark.streaming.Durations}} and then use it as {{JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));}} ? Because I can't import and compile it as above (I receive {{cannot find symbol}} errors when I try to), whereas I can import {{org.apache.spark.streaming.Duration}} or {{org.apache.spark.streaming.Seconds}} and then use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233067#comment-14233067 ] Xiangrui Meng commented on SPARK-4001: -- [~jackylk] Could you share some performance testing results? Add Apriori algorithm to Spark MLlib Key: SPARK-4001 URL: https://issues.apache.org/jira/browse/SPARK-4001 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jacky Li Assignee: Jacky Li Apriori is the classic algorithm for frequent item set mining in a transactional data set. It will be useful if Apriori algorithm is added to MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4710) Fix MLlib compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233071#comment-14233071 ] Xiangrui Meng commented on SPARK-4710: -- [~srowen] Sorry I didn't see your PR. Since Joseph's was merged, do you mind updating yours? Thanks! Fix MLlib compilation warnings -- Key: SPARK-4710 URL: https://issues.apache.org/jira/browse/SPARK-4710 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4710) Fix MLlib compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233075#comment-14233075 ] Sean Owen commented on SPARK-4710: -- [~mengxr] It looks like it didn't overlap -- maybe these were newer than my PR. This is the PR I'm suggesting at the moment, which still looks mergeable: https://github.com/apache/spark/pull/3157 Fix MLlib compilation warnings -- Key: SPARK-4710 URL: https://issues.apache.org/jira/browse/SPARK-4710 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233078#comment-14233078 ] Sean Owen commented on SPARK-4724: -- I believe the class is new in 1.2. You are looking at the example code as of master, but the class documentation as of 1.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081 ] Yana Kadiyska commented on SPARK-4702: -- Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from the master branch from October 24th where this scenario works. We are running CDH4.6 / Hive 0.10. In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233100#comment-14233100 ] Emre Sevinç commented on SPARK-4724: OK, now I see. What confused me was the word _latest_ in the URL of the Javadoc: https://spark.apache.org/docs/latest/api/java/index.html. I failed to realize that the title of the HTML page indicates **Spark 1.1.1 JavaDoc**, and that _latest_ referred to the latest release of the 1.1.x series, and not the 1.2.x development branch (master) as of now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4726) NotSerializableException thrown on SystemDefaultHttpClient with stack not related to my functions
Dmitriy Makarenko created SPARK-4726: Summary: NotSerializableException thrown on SystemDefaultHttpClient with stack not related to my functions Key: SPARK-4726 URL: https://issues.apache.org/jira/browse/SPARK-4726 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.1 Reporter: Dmitriy Makarenko I get this stacktrace that doesn't contain any of my functions - Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.http.impl.client.SystemDefaultHttpClient at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:771) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:714) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:698) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1198) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) As far as I know, SystemDefaultHttpClient is used inside the SolrJ library that I use, but it is in a separate jar from my project. All of my classes are Serializable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
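The failure mode can be reproduced in miniature with plain Java serialization, which is what Spark's closure serializer ultimately hits. `FakeHttpClient` is a hypothetical stand-in for SystemDefaultHttpClient; the `transient` variant sketches one common workaround (another is to construct the client inside `mapPartitions`/`foreachPartition` so it never enters the serialized closure).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {
    // Stand-in for SystemDefaultHttpClient: a type that is NOT Serializable.
    static class FakeHttpClient {}

    // A task-like class that captures the client directly: serializing it fails,
    // even though the class itself implements Serializable.
    static class BadTask implements Serializable {
        FakeHttpClient client = new FakeHttpClient();
    }

    // Marking the field transient keeps the client out of the serialized form,
    // so serialization succeeds (the field must then be re-created after
    // deserialization, e.g. lazily per partition).
    static class GoodTask implements Serializable {
        transient FakeHttpClient client = new FakeHttpClient();
    }

    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bad=" + serializes(new BadTask()));
        System.out.println("good=" + serializes(new GoodTask()));
    }
}
```

This also explains why the stack trace shows only scheduler frames: the exception is raised when the DAGScheduler serializes the task closure, not while user code is running.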
[jira] [Resolved] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emre Sevinç resolved SPARK-4724. Resolution: Not a Problem Not a Problem. It was a misunderstanding on my side. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4724) JavaNetworkWordCount.java has a wrong import
[ https://issues.apache.org/jira/browse/SPARK-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emre Sevinç closed SPARK-4724. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233107#comment-14233107 ] Travis Galoppo commented on SPARK-4156: --- I have modified the cluster initialization strategy to derive an initial covariance matrix from the sample points used to initialize the clusters; this initial covariance matrix has the element-wise variance of the sample points on the diagonal. The final computed covariance matrix is not constrained to be diagonal. I tested this with the S1 dataset [~MeethuMathew] referenced above; while it does fix the problem of effectively finding no clusters, I find that the results are still better when the input is scaled as I mentioned above. It might be worthwhile to allow the user to provide a pre-initialized model to accommodate various initialization strategies, and provide the current functionality as a default. Thoughts? Also, I have fixed the defect in DenseGmmEM whereby it was ignoring the delta parameter. Add expectation maximization for Gaussian mixture models to MLLib clustering Key: SPARK-4156 URL: https://issues.apache.org/jira/browse/SPARK-4156 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Travis Galoppo Assignee: Travis Galoppo As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
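The element-wise-variance initialization described above can be sketched in plain Java on double arrays. This is a hypothetical stand-in for the MLlib code, showing only the diagonal of the initial covariance matrix (off-diagonal entries start at zero; the final fitted covariance is not constrained to stay diagonal).

```java
public class DiagCovDemo {
    // Element-wise (per-dimension) variance of the sample points: these values
    // form the diagonal of an initial covariance matrix for a cluster.
    static double[] diagonalVariance(double[][] points) {
        int d = points[0].length;
        double[] mean = new double[d];
        double[] var = new double[d];
        for (double[] p : points)
            for (int j = 0; j < d; j++) mean[j] += p[j] / points.length;
        for (double[] p : points)
            for (int j = 0; j < d; j++) {
                double diff = p[j] - mean[j];
                var[j] += diff * diff / points.length;
            }
        return var;
    }

    public static void main(String[] args) {
        // Two 2-D sample points: variance is 1.0 in dim 0 and 0.0 in dim 1.
        double[][] pts = {{1.0, 10.0}, {3.0, 10.0}};
        double[] v = diagonalVariance(pts);
        System.out.println(v[0] + " " + v[1]); // prints 1.0 0.0
    }
}
```

The zero-variance dimension in this toy example also illustrates why scaling the input still matters: dimensions with wildly different variances give poorly conditioned starting covariances.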
[jira] [Commented] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233124#comment-14233124 ] Sean Owen commented on SPARK-4690: -- No, it is using quadratic probing. It adds {{delta}} each time, but {{delta}} grows by 1 each time too. The offsets are thus 1, 3, 6, 10 ... or the sequence n*(n+1)/2, which is quadratic. AppendOnlyMap seems not using Quadratic probing as the JavaDoc -- Key: SPARK-4690 URL: https://issues.apache.org/jira/browse/SPARK-4690 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.1.0, 1.2.0, 1.3.0 Reporter: Yijie Shen Priority: Minor org.apache.spark.util.collection.AppendOnlyMap's documentation says: This implementation uses quadratic probing with a power-of-2 However, the probe procedure in the face of a hash collision is just linear probing. The code below: val delta = i; pos = (pos + delta) & mask; i += 1 Maybe a bug here? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
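Sean's point can be checked directly. Incrementing {{delta}} by 1 on each probe makes the cumulative offsets the triangular numbers n*(n+1)/2, and on a power-of-2 table that sequence visits every slot. The following is a small standalone check of that probing scheme, not Spark code:

```java
import java.util.HashSet;
import java.util.Set;

public class ProbeDemo {
    public static void main(String[] args) {
        int size = 16;            // power-of-2 table size, as AppendOnlyMap requires
        int mask = size - 1;
        int pos = 0;              // starting hash position
        int delta = 1;            // grows by 1 each probe, as in the snippet above
        Set<Integer> visited = new HashSet<>();
        for (int i = 0; i < size; i++) {
            visited.add(pos);
            // AppendOnlyMap-style step: advance by delta, wrap with the mask.
            // Cumulative offsets are 1, 3, 6, 10, ... (quadratic, not linear).
            pos = (pos + delta) & mask;
            delta += 1;
        }
        // Triangular-number probing reaches all 16 slots within 16 probes.
        System.out.println("slots visited: " + visited.size());
    }
}
```

If the step were a constant (true linear probing), the same loop would also cover the table, but the quadratic growth of the offsets is what breaks up clusters of colliding keys.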
[jira] [Comment Edited] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081 ] Yana Kadiyska edited comment on SPARK-4702 at 12/3/14 3:53 PM: --- Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from the master branch from October 24th where this scenario works. We are running CDH4.6 / Hive 0.10. Beeline output from October: Connected to: Hive (version 0.12.0-protobuf-2.5) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline output from RC: Connected to: Hive (version 1.2.0) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ I am wondering if -Phive-0.12.0 is fully sufficient -- not sure why the Connected to: version prints differently? In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (SPARK-4727) Add dimensional RDDs (time series, spatial)
RJ Nowling created SPARK-4727: - Summary: Add dimensional RDDs (time series, spatial) Key: SPARK-4727 URL: https://issues.apache.org/jira/browse/SPARK-4727 Project: Spark Issue Type: Brainstorming Components: Spark Core Affects Versions: 1.1.0 Reporter: RJ Nowling Certain types of data (time series, spatial) can benefit from specialized RDDs. I'd like to open a discussion about this. For example, time series data should be ordered by time and would benefit from operations like: * Subsampling (taking every n data points) * Signal processing (correlations, FFTs, filtering) * Windowing functions Spatial data benefits from ordering and partitioning along a 2D or 3D grid. For example, path finding algorithms can be optimized by only comparing points within a set distance, which can be computed more efficiently by partitioning data into a grid. Although the operations on time series and spatial data may be different, there is some commonality in that the data has ordered dimensions, and the implementations may overlap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
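The grid-partitioning idea above can be sketched without Spark: map each point to an integer cell key so that points within a set distance fall in the same or a neighboring cell, and only those cells need to be compared. A minimal illustration of the key computation (the cell size and key packing here are assumptions for the sketch, not part of the proposal):

```java
public class GridKeys {
    // Map a 2D point to a grid cell key for a given cell size. Points
    // within `cellSize` of each other land in the same or an adjacent
    // cell, so a distance-bounded search only scans 9 cells.
    static long cellKey(double x, double y, double cellSize) {
        long cx = (long) Math.floor(x / cellSize);
        long cy = (long) Math.floor(y / cellSize);
        return (cx << 32) ^ (cy & 0xffffffffL); // pack both cell indices into one key
    }

    public static void main(String[] args) {
        // Two close points share a cell; a far point does not.
        System.out.println(cellKey(1.0, 1.0, 10.0) == cellKey(2.0, 3.0, 10.0));   // true
        System.out.println(cellKey(1.0, 1.0, 10.0) == cellKey(50.0, 50.0, 10.0)); // false
    }
}
```

In a hypothetical dimensional RDD, such a key would serve as the partitioning key, so co-located points end up in the same partition.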
[jira] [Created] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
RJ Nowling created SPARK-4728: - Summary: Add exponential, log normal, and gamma distributions to data generator to MLlib Key: SPARK-4728 URL: https://issues.apache.org/jira/browse/SPARK-4728 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports sampling from normal, uniform, and Poisson distributions. I'd like to add support for sampling from exponential, gamma, and log normal distributions, using the features of math3 like the other generators. Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
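The proposal is to use commons-math3 samplers like the existing generators; for intuition, two of the requested distributions can also be derived from a uniform or Gaussian source. A hedged JDK-only sketch (not the MLlib or math3 implementation):

```java
import java.util.Random;

public class Distributions {
    // Exponential(rate) via inverse-transform sampling: -ln(1 - U) / rate.
    static double nextExponential(Random rng, double rate) {
        return -Math.log(1.0 - rng.nextDouble()) / rate;
    }

    // LogNormal(mu, sigma): exponentiate a Gaussian sample.
    static double nextLogNormal(Random rng, double mu, double sigma) {
        return Math.exp(mu + sigma * rng.nextGaussian());
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        for (int i = 0; i < 1000; i++) {
            // Both distributions have non-negative support.
            if (nextExponential(rng, 2.0) < 0) throw new AssertionError();
            if (nextLogNormal(rng, 0.0, 1.0) <= 0) throw new AssertionError();
        }
        System.out.println("ok");
    }
}
```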
[jira] [Created] (SPARK-4729) Add time series subsampling to MLlib
RJ Nowling created SPARK-4729: - Summary: Add time series subsampling to MLlib Key: SPARK-4729 URL: https://issues.apache.org/jira/browse/SPARK-4729 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports several time series functions. The ability to subsample a time series (take every n data points) is missing. I'd like to add it, so please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
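"Take every n data points" is just an index filter; inside Spark it could plausibly be expressed with zipWithIndex plus a filter on the index. A minimal plain-Java sketch of the intended semantics (names are illustrative, not the proposed MLlib API):

```java
import java.util.ArrayList;
import java.util.List;

public class Subsample {
    // Keep every n-th element (indices 0, n, 2n, ...), mirroring what a
    // zipWithIndex + filter(index % n == 0) pipeline would do on an RDD.
    static <T> List<T> everyNth(List<T> data, int n) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < data.size(); i += n) {
            out.add(data.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        System.out.println(everyNth(xs, 3)); // [0, 3, 6, 9]
    }
}
```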
[jira] [Comment Edited] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233081#comment-14233081 ] Yana Kadiyska edited comment on SPARK-4702 at 12/3/14 6:09 PM: --- Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from master branch from October 24th where this scenario works. We are running CDH4.6 Hive0.10 Beeline output from October: Connected to: Hive (version 0.12.0-protobuf-2.5) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline output from RC: Connected to: Hive (version 1.2.0) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ I am wondering if -Phive-0.12.0 is fully sufficient --not sure why the Connected to: version prints differently? In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; was (Author: yanakad): Unfortunately I still see this error after building with ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I do have a working build from master branch from October 24th where this scenario works. We are running CDH4.6 Hive0.10 Beeline output from October: Connected to: Hive (version 0.12.0-protobuf-2.5) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline output from RC: Connected to: Hive (version 1.2.0) Driver: null (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ I am wondering if -Phive-0.12.0 is fully sufficient --not sure why the Connected to: version prints differently? 
In particular, the query I tried is select count(*) from mytable where pkey ='some-non-existant-key'; Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at
[jira] [Assigned] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-4552: --- Assignee: Michael Armbrust query for empty parquet table in spark sql hive get IllegalArgumentException Key: SPARK-4552 URL: https://issues.apache.org/jira/browse/SPARK-4552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Michael Armbrust Fix For: 1.2.0 Running create table test_parquet(key int, value string) stored as parquet; select * from test_parquet; gives the following error: java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4552: Priority: Blocker (was: Major) query for empty parquet table in spark sql hive get IllegalArgumentException Key: SPARK-4552 URL: https://issues.apache.org/jira/browse/SPARK-4552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 Running create table test_parquet(key int, value string) stored as parquet; select * from test_parquet; gives the following error: java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233314#comment-14233314 ] Michael Armbrust commented on SPARK-4552: - It turns out this also manifests when writing a query that doesn't match any partitions, which seems much more common. I'm going to do a quick surgical fix given the proximity to the release. query for empty parquet table in spark sql hive get IllegalArgumentException Key: SPARK-4552 URL: https://issues.apache.org/jira/browse/SPARK-4552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 Running create table test_parquet(key int, value string) stored as parquet; select * from test_parquet; gives the following error: java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4702. - Resolution: Duplicate Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at 
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4694) Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233321#comment-14233321 ] Marcelo Vanzin commented on SPARK-4694: --- To answer your question, you can call System.exit() if you want. It's just recommended that it's done after you properly shut down the SparkContext, otherwise Yarn won't report your app status correctly. But it seems your patch doesn't use System.exit(), so this is kinda moot. Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode - Key: SPARK-4694 URL: https://issues.apache.org/jira/browse/SPARK-4694 Project: Spark Issue Type: Bug Components: YARN Reporter: SaintBacchus Recently when I used Yarn HA mode to test the HiveThriftServer2, I found a problem: the driver can't exit by itself. To reproduce it, do the following: 1. use yarn HA mode and set am.maxAttempts = 1 for convenience 2. kill the active resource manager in the cluster The expected result is that the application simply fails, because maxAttempts was 1. But the actual result is that all executors ended while the driver was still there and never closed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1429#comment-1429 ] Matei Zaharia commented on SPARK-4690: -- Yup, that's the definition of it. AppendOnlyMap seems not using Quadratic probing as the JavaDoc -- Key: SPARK-4690 URL: https://issues.apache.org/jira/browse/SPARK-4690 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.1.0, 1.2.0, 1.3.0 Reporter: Yijie Shen Priority: Minor org.apache.spark.util.collection.AppendOnlyMap's documentation says: This implementation uses quadratic probing with a power-of-2 However, the probe procedure in the face of a hash collision appears to use linear probing; see the code below: val delta = i; pos = (pos + delta) & mask; i += 1 Maybe a bug here? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
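For context on why this is quadratic probing: incrementing the step on each probe makes the cumulative offset the triangular numbers 1, 3, 6, 10, ... — a quadratic function of the probe count — and for a power-of-2 table size this sequence visits every slot. A small standalone check (hypothetical demo code, not the AppendOnlyMap source):

```java
import java.util.HashSet;
import java.util.Set;

public class TriangularProbing {
    // Probe sequence from the report: delta grows by 1 each step, so the
    // i-th probe lands at start + i*(i+1)/2 (mod capacity). Returns true
    // if the first `capacity` probes visit every slot of a power-of-2 table.
    static boolean visitsAllSlots(int capacity) {
        int mask = capacity - 1; // capacity must be a power of 2
        Set<Integer> seen = new HashSet<>();
        int pos = 0;
        for (int i = 1; i <= capacity; i++) {
            seen.add(pos);
            pos = (pos + i) & mask; // the "linear-looking" step with a growing delta
        }
        return seen.size() == capacity;
    }

    public static void main(String[] args) {
        System.out.println(visitsAllSlots(8));  // true
        System.out.println(visitsAllSlots(64)); // true
    }
}
```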
[jira] [Closed] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-4690. Resolution: Invalid AppendOnlyMap seems not using Quadratic probing as the JavaDoc -- Key: SPARK-4690 URL: https://issues.apache.org/jira/browse/SPARK-4690 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.1.0, 1.2.0, 1.3.0 Reporter: Yijie Shen Priority: Minor org.apache.spark.util.collection.AppendOnlyMap's documentation says: This implementation uses quadratic probing with a power-of-2 However, the probe procedure in the face of a hash collision appears to use linear probing; see the code below: val delta = i; pos = (pos + delta) & mask; i += 1 Maybe a bug here? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233344#comment-14233344 ] Michael Armbrust commented on SPARK-4702: - Thanks for reporting. As a workaround you should be able to SET spark.sql.hive.convertMetastoreParquet=false, but I'm going to try to fix this before the next RC. Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4697) System properties should override environment variables
[ https://issues.apache.org/jira/browse/SPARK-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233381#comment-14233381 ] Andrew Or commented on SPARK-4697: -- Hey, did you search for a duplicate or related JIRA in the history? I'm pretty sure there's another one that suggests something similar. System properties should override environment variables --- Key: SPARK-4697 URL: https://issues.apache.org/jira/browse/SPARK-4697 Project: Spark Issue Type: Bug Components: YARN Reporter: WangTaoTheTonic I found that some arguments in the yarn module read environment variables before system properties, while in the core module the latter override the former. This should be changed in org.apache.spark.deploy.yarn.ClientArguments and org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
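The precedence the reporter asks for amounts to checking the JVM system property first and only falling back to the environment variable. A hedged sketch of that lookup order (the setting names here are illustrative, not actual Spark configuration keys):

```java
public class ConfigPrecedence {
    // Core-module-style precedence: a system property wins over the
    // environment variable for the same setting; default comes last.
    static String resolve(String propKey, String envKey, String defaultValue) {
        String prop = System.getProperty(propKey);
        if (prop != null) return prop;
        String env = System.getenv(envKey);
        return env != null ? env : defaultValue;
    }

    public static void main(String[] args) {
        System.setProperty("example.queue", "prop-queue");
        // With the property set, any EXAMPLE_QUEUE env var is ignored.
        System.out.println(resolve("example.queue", "EXAMPLE_QUEUE", "default")); // prop-queue
    }
}
```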
[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information
[ https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233385#comment-14233385 ] Sandy Ryza commented on SPARK-4687: --- [~pwendell], do you think this is a reasonable API addition? If so, I'll try and add it. SparkContext#addFile doesn't keep file folder information - Key: SPARK-4687 URL: https://issues.apache.org/jira/browse/SPARK-4687 Project: Spark Issue Type: Bug Reporter: Jimmy Xiang Files added with SparkContext#addFile are loaded with Utils#fetchFile before a task starts. However, Utils#fetchFile puts all files under the Spark root on the worker node. We should have an option to keep the folder information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233393#comment-14233393 ] Apache Spark commented on SPARK-4552: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3586 query for empty parquet table in spark sql hive get IllegalArgumentException Key: SPARK-4552 URL: https://issues.apache.org/jira/browse/SPARK-4552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 Running create table test_parquet(key int, value string) stored as parquet; select * from test_parquet; gives the following error: java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233443#comment-14233443 ] Yana Kadiyska commented on SPARK-4702: -- Michael, just wanted to point out that the workaround you suggested did indeed help when the partition is missing -- I get a count of 0. But it did break the otherwise working case when a partition is present: java.lang.IllegalStateException: All the offsets listed in the split should be found in the file. expected: [4, 4] found: {my schema dumped out here} out of: [4, 121017555, 242333553, 363518600] in range 0, 134217728 It's possible that this is a very corner case -- we've added columns to our schema so it's possible that the parquet files are likely not symmetric (not quite sure what convertMetastoreParquet does under the hood). But wanted to point out that in our case the bug is truly a blocker (I'm hoping it makes it in 1.2, don't care if it makes it in the next RC or later) Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. 
Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at 
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-3926) result of JavaRDD collectAsMap() is not serializable
[ https://issues.apache.org/jira/browse/SPARK-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-3926: -- I am reopening as it is not actually serializable without a no-arg constructor. PR coming shortly that copies the implementation from Scala's Wrappers.MapWrapper. result of JavaRDD collectAsMap() is not serializable Key: SPARK-3926 URL: https://issues.apache.org/jira/browse/SPARK-3926 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0, 1.2.0 Environment: CentOS / Spark 1.1 / Hadoop Hortonworks 2.4.0.2.1.2.0-402 Reporter: Antoine Amend Assignee: Sean Owen Fix For: 1.1.1, 1.2.0 Using the Java API, I want to collect the result of an RDD<String, String> as a HashMap using the collectAsMap function: Map<String, String> map = myJavaRDD.collectAsMap(); This works fine, but when passing this map to another function, such as... myOtherJavaRDD.mapToPair(new CustomFunction(map)) ...this leads to the following error: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.map(RDD.scala:270) at org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:99) at org.apache.spark.api.java.JavaPairRDD.mapToPair(JavaPairRDD.scala:44) ../.. MY CLASS ../.. 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.NotSerializableException: scala.collection.convert.Wrappers$MapWrapper at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) This seems to be due to WrapAsJava.scala being non-serializable ../.. implicit def mapAsJavaMap[A, B](m: Map[A, B]): ju.Map[A, B] = m match { //case JConcurrentMapWrapper(wrapped) => wrapped case JMapWrapper(wrapped) => wrapped.asInstanceOf[ju.Map[A, B]] case _ => new MapWrapper(m) } ../.. 
The workaround is to manually copy this map into another one (which is serializable): Map<String, String> map = myJavaRDD.collectAsMap(); Map<String, String> tmp = new HashMap<String, String>(map); myOtherJavaRDD.mapToPair(new CustomFunction(tmp))
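The reporter's workaround can be checked outside Spark. The sketch below is a hypothetical illustration in plain Java: the `serialize` helper stands in for Spark's JavaSerializer, and the `HashMap` stands in for the map returned by `collectAsMap()` (which in Spark 1.0-1.2 is actually a non-serializable `Wrappers$MapWrapper`). It shows that copying into a `java.util.HashMap` yields something `ObjectOutputStream` can serialize:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

public class CollectAsMapWorkaround {
    // Hypothetical helper: serialize an object the way Spark's JavaSerializer
    // would; throws NotSerializableException for non-Serializable types.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the result of myJavaRDD.collectAsMap().
        Map<String, String> collected = new HashMap<>();
        collected.put("key", "value");

        // The workaround: copy into a plain HashMap, which implements
        // Serializable, before capturing it in a function shipped to executors.
        Map<String, String> tmp = new HashMap<>(collected);
        System.out.println(serialize(tmp).length > 0); // prints "true"
    }
}
```

The copy works because `HashMap` implements `Serializable` regardless of where its entries came from; Sean Owen's fix makes the wrapper itself serializable so the copy is no longer needed.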
[jira] [Commented] (SPARK-3926) result of JavaRDD collectAsMap() is not serializable
[ https://issues.apache.org/jira/browse/SPARK-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233453#comment-14233453 ] Apache Spark commented on SPARK-3926: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3587 result of JavaRDD collectAsMap() is not serializable Key: SPARK-3926 URL: https://issues.apache.org/jira/browse/SPARK-3926 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0, 1.2.0 Environment: CentOS / Spark 1.1 / Hadoop Hortonworks 2.4.0.2.1.2.0-402 Reporter: Antoine Amend Assignee: Sean Owen Fix For: 1.1.1, 1.2.0 Using the Java API, I want to collect the result of an RDD<String, String> as a HashMap using the collectAsMap function: Map<String, String> map = myJavaRDD.collectAsMap(); This works fine, but when passing this map to another function, such as... myOtherJavaRDD.mapToPair(new CustomFunction(map)) ...this leads to the following error: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.map(RDD.scala:270) at org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:99) at org.apache.spark.api.java.JavaPairRDD.mapToPair(JavaPairRDD.scala:44) ../.. MY CLASS ../.. 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.NotSerializableException: scala.collection.convert.Wrappers$MapWrapper at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) This seems to be due to WrapAsJava.scala being non-serializable ../.. implicit def mapAsJavaMap[A, B](m: Map[A, B]): ju.Map[A, B] = m match { //case JConcurrentMapWrapper(wrapped) => wrapped case JMapWrapper(wrapped) => wrapped.asInstanceOf[ju.Map[A, B]] case _ => new MapWrapper(m) } ../.. 
The workaround is to manually copy this map into another one (which is serializable): Map<String, String> map = myJavaRDD.collectAsMap(); Map<String, String> tmp = new HashMap<String, String>(map); myOtherJavaRDD.mapToPair(new CustomFunction(tmp))
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233457#comment-14233457 ] Michael Armbrust commented on SPARK-4702: - Yana, I'm a little confused. Were both of these cases working in 1.1? As far as I understand, setting spark.sql.hive.convertMetastoreParquet=false should restore the behavior as of 1.1. When convertMetastoreParquet is true, there is not currently support for heterogeneous schema. Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744)
[jira] [Closed] (SPARK-4701) Typo in sbt/sbt
[ https://issues.apache.org/jira/browse/SPARK-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4701. Resolution: Fixed Fix Version/s: 1.1.2 1.2.0 Assignee: Masayoshi TSUZUKI Target Version/s: 1.1.1, 1.2.0 Typo in sbt/sbt --- Key: SPARK-4701 URL: https://issues.apache.org/jira/browse/SPARK-4701 Project: Spark Issue Type: Bug Components: Build Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Priority: Trivial Fix For: 1.2.0, 1.1.2 in sbt/sbt {noformat} -S-X add -X to sbt's scalacOptions (-J is stripped) {noformat} but {{(-S is stripped)}} is correct.
[jira] [Updated] (SPARK-4691) code optimization for judgement
[ https://issues.apache.org/jira/browse/SPARK-4691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4691: - Affects Version/s: 1.1.0 code optimization for judgement --- Key: SPARK-4691 URL: https://issues.apache.org/jira/browse/SPARK-4691 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: maji2014 Priority: Minor aggregator and mapSideCombine judgement in HashShuffleWriter.scala SortShuffleWriter.scala HashShuffleReader.scala
[jira] [Resolved] (SPARK-2143) Display Spark version on Driver web page
[ https://issues.apache.org/jira/browse/SPARK-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2143. -- Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.2.0 (This went in for branch 1.2 but not for the 1.1 branch) Display Spark version on Driver web page Key: SPARK-2143 URL: https://issues.apache.org/jira/browse/SPARK-2143 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Jeff Hammerbacher Priority: Critical Fix For: 1.2.0
[jira] [Updated] (SPARK-4715) ShuffleMemoryManager.tryToAcquire may return a negative value
[ https://issues.apache.org/jira/browse/SPARK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4715: - Affects Version/s: 1.1.0 Assignee: Shixiong Zhu ShuffleMemoryManager.tryToAcquire may return a negative value - Key: SPARK-4715 URL: https://issues.apache.org/jira/browse/SPARK-4715 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Here is a unit test to demonstrate it:
{code}
test("threads should not be granted a negative size") {
  val manager = new ShuffleMemoryManager(1000L)
  manager.tryToAcquire(700L)
  val latch = new CountDownLatch(1)
  startThread("t1") {
    manager.tryToAcquire(300L)
    latch.countDown()
  }
  latch.await() // Wait until `t1` calls `tryToAcquire`
  val granted = manager.tryToAcquire(300L)
  assert(0 === granted, "granted is negative")
}
{code}
It outputs "0 did not equal -200 granted is negative".
[jira] [Closed] (SPARK-4715) ShuffleMemoryManager.tryToAcquire may return a negative value
[ https://issues.apache.org/jira/browse/SPARK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4715. Resolution: Fixed Fix Version/s: 1.1.2 1.2.0 Target Version/s: 1.2.0, 1.1.2 ShuffleMemoryManager.tryToAcquire may return a negative value - Key: SPARK-4715 URL: https://issues.apache.org/jira/browse/SPARK-4715 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.2.0, 1.1.2 Here is a unit test to demonstrate it:
{code}
test("threads should not be granted a negative size") {
  val manager = new ShuffleMemoryManager(1000L)
  manager.tryToAcquire(700L)
  val latch = new CountDownLatch(1)
  startThread("t1") {
    manager.tryToAcquire(300L)
    latch.countDown()
  }
  latch.await() // Wait until `t1` calls `tryToAcquire`
  val granted = manager.tryToAcquire(300L)
  assert(0 === granted, "granted is negative")
}
{code}
It outputs "0 did not equal -200 granted is negative".
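The -200 in the test output falls directly out of the fair-share arithmetic. The sketch below is a simplified model of the grant computation, not Spark's actual ShuffleMemoryManager code: with maxMemory = 1000, once a second thread becomes active the first thread's fair share drops to 1000 / 2 = 500, and 500 minus the 700 bytes it already holds is handed back as a negative grant unless the result is clamped at zero:

```java
public class NegativeGrantSketch {
    public static void main(String[] args) {
        long maxMemory = 1000L;   // total shuffle memory, as in the unit test
        long curMem = 700L;       // already acquired by the calling thread
        int numActiveThreads = 2; // t1 has also called tryToAcquire
        long request = 300L;

        // Simplified grant computation: each active thread is capped at an
        // equal share of maxMemory. When curMem exceeds that share, the
        // difference is negative and min() passes it through.
        long buggy = Math.min(request, maxMemory / numActiveThreads - curMem);

        // The fix is to clamp the grant at zero.
        long fixed = Math.max(0L, buggy);

        System.out.println(buggy + " " + fixed); // prints "-200 0"
    }
}
```

This reproduces the -200 from the test output and shows why clamping is enough: a thread over its fair share simply receives nothing until other threads release memory.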
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233499#comment-14233499 ] Michael Armbrust commented on SPARK-4702: - Also, have you tested with: https://github.com/apache/spark/pull/3586? Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744)
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233514#comment-14233514 ] Yana Kadiyska commented on SPARK-4702: -- Michael, I do not have a 1.1. In October I built master manually, I believe from commit d2987e8f7a2cb3bf971f381399d8efdccb51d3d2. At that time both types of queries worked, without setting spark.sql.hive.convertMetastoreParquet=false (I tested on a smaller cluster, will drop on the same one now to make sure there's not some data weirdness). If you meant to say "When convertMetastoreParquet is FALSE, there is not currently support for heterogeneous schema" then we are saying the same thing -- I didn't have to set this flag as the missing partitions were handled fine. Now missing partitions are broken, but setting SET spark.sql.hive.convertMetastoreParquet=false breaks the 99% case because my files have different # columns. I have not tried the PR you mentioned, will try it now -- in my case the issue is not an empty file, it's a missing directory -- our query does a partition=-mm query, where parquet files are laid out under -mm directories representing partitions. But will see if the issue is helped by that PR. In any case, I am just hoping this works before the final release, no particular rush Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. 
Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at 
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744)
[jira] [Commented] (SPARK-4575) Documentation for the pipeline features
[ https://issues.apache.org/jira/browse/SPARK-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233548#comment-14233548 ] Apache Spark commented on SPARK-4575: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/3588 Documentation for the pipeline features --- Key: SPARK-4575 URL: https://issues.apache.org/jira/browse/SPARK-4575 Project: Spark Issue Type: Improvement Components: Documentation, ML, MLlib Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Add user guide for the newly added ML pipeline feature.
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233568#comment-14233568 ] Nicholas Chammas commented on SPARK-3431: - [~joshrosen] I tried [that patch you posted earlier here|https://issues.apache.org/jira/browse/SPARK-3431?focusedCommentId=14168038page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14168038]. It appears to fork a JVM for every individual test (e.g. {{org.apache.spark.streaming.DurationSuite}}). When I tried it out on Jenkins, the tests [timed out after 2 hours|https://github.com/apache/spark/pull/3564#issuecomment-65349149]. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run.
[jira] [Created] (SPARK-4730) Warn against deprecated YARN settings
Andrew Or created SPARK-4730: Summary: Warn against deprecated YARN settings Key: SPARK-4730 URL: https://issues.apache.org/jira/browse/SPARK-4730 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or YARN currently reads from SPARK_MASTER_MEMORY and SPARK_WORKER_MEMORY. If you have these settings left over from a standalone cluster setup and you try to run Spark on YARN on the same cluster, then your executors suddenly get the amount of memory specified through SPARK_WORKER_MEMORY. This behavior is due in large part to backward compatibility. However, we should log a warning against the use of these variables at the very least.
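A warning of the kind proposed could be as simple as checking the submission environment for the standalone-mode variables. The sketch below is a hypothetical illustration, not Spark's actual implementation: the variable names come from the report, while `findDeprecated` and the warning text are assumptions:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeprecatedEnvWarning {
    // Hypothetical helper: collect standalone-mode variables that are set
    // in the environment of a YARN submission.
    static List<String> findDeprecated(Map<String, String> env) {
        List<String> hits = new ArrayList<>();
        for (String name : new String[] {"SPARK_MASTER_MEMORY", "SPARK_WORKER_MEMORY"}) {
            if (env.containsKey(name)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Simulated environment left over from a standalone cluster setup.
        Map<String, String> env = new HashMap<>();
        env.put("SPARK_WORKER_MEMORY", "8g");

        for (String name : findDeprecated(env)) {
            System.out.println("WARN: " + name + " is a standalone-mode setting; "
                + "it is read on YARN only for backward compatibility.");
        }
    }
}
```

In a real implementation the check would run once at submission time, so the operator sees the warning before executors launch with the unexpected memory size.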
[jira] [Commented] (SPARK-4730) Warn against deprecated YARN settings
[ https://issues.apache.org/jira/browse/SPARK-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233596#comment-14233596 ] Apache Spark commented on SPARK-4730: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/3590 Warn against deprecated YARN settings - Key: SPARK-4730 URL: https://issues.apache.org/jira/browse/SPARK-4730 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or YARN currently reads from SPARK_MASTER_MEMORY and SPARK_WORKER_MEMORY. If you have these settings left over from a standalone cluster setup and you try to run Spark on YARN on the same cluster, then your executors suddenly get the amount of memory specified through SPARK_WORKER_MEMORY. This behavior is due in large part to backward compatibility. However, we should log a warning against the use of these variables at the very least.
[jira] [Resolved] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4552. - Resolution: Fixed Issue resolved by pull request 3586 [https://github.com/apache/spark/pull/3586] query for empty parquet table in spark sql hive get IllegalArgumentException Key: SPARK-4552 URL: https://issues.apache.org/jira/browse/SPARK-4552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 Run "create table test_parquet(key int, value string) stored as parquet; select * from test_parquet;" and you get an error as follows: java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc
[jira] [Created] (SPARK-4731) Spark 1.1.1 launches broken EC2 clusters
Jey Kottalam created SPARK-4731: --- Summary: Spark 1.1.1 launches broken EC2 clusters Key: SPARK-4731 URL: https://issues.apache.org/jira/browse/SPARK-4731 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.1 Environment: Spark 1.1.1 on MacOS X Reporter: Jey Kottalam EC2 clusters launched using Spark 1.1.1 with the `-v 1.1.1` flag fail to initialize the master and workers correctly. The `/root/spark` directory contains only the `conf` directory and doesn't have the `bin` and other directories. [~joshrosen] suggested that [spark-ec2 #81](https://github.com/mesos/spark-ec2/pull/81) might have fixed it, but I still see this problem after that was merged.
[jira] [Updated] (SPARK-4731) Spark 1.1.1 launches broken EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jey Kottalam updated SPARK-4731: Description: EC2 clusters launched using Spark 1.1.1's `spark-ec2` script with the `-v 1.1.1` flag fail to initialize the master and workers correctly. The `/root/spark` directory contains only the `conf` directory and doesn't have the `bin` and other directories. [~joshrosen] suggested that [spark-ec2 #81](https://github.com/mesos/spark-ec2/pull/81) might have fixed it, but I still see this problem after that was merged. was: EC2 clusters launched using Spark 1.1.1 with the `-v 1.1.1` flag fail to initialize the master and workers correctly. The `/root/spark` directory contains only the `conf` directory and doesn't have the `bin` and other directories. [~joshrosen] suggested that [spark-ec2 #81](https://github.com/mesos/spark-ec2/pull/81) might have fixed it, but I still see this problem after that was merged. Spark 1.1.1 launches broken EC2 clusters Key: SPARK-4731 URL: https://issues.apache.org/jira/browse/SPARK-4731 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.1 Environment: Spark 1.1.1 on MacOS X Reporter: Jey Kottalam EC2 clusters launched using Spark 1.1.1's `spark-ec2` script with the `-v 1.1.1` flag fail to initialize the master and workers correctly. The `/root/spark` directory contains only the `conf` directory and doesn't have the `bin` and other directories. [~joshrosen] suggested that [spark-ec2 #81](https://github.com/mesos/spark-ec2/pull/81) might have fixed it, but I still see this problem after that was merged.
[jira] [Resolved] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4498. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3550 [https://github.com/apache/spark/pull/3550] Standalone Master can fail to recognize completed/failed applications - Key: SPARK-4498 URL: https://issues.apache.org/jira/browse/SPARK-4498 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.1.1, 1.2.0 Environment: - Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux - Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242 - Python 2.7.3, java version "1.7.0_71" Java(TM) SE Runtime Environment (build 1.7.0_71-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode) - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece - All client code is PySpark Reporter: Harry Brundage Priority: Blocker Fix For: 1.2.0 Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt We observe the Spark standalone master not detecting that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports applications as "Running", but the driver process has long since terminated. The master continually spawns one executor for the application. It boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted. This happens to many jobs at once, all right around the same time, two or three times a day, where they all get stuck. Before and after this blip, applications start, get resources, finish, and are marked as finished properly. 
The blip is mostly conjecture on my part; I have no hard evidence that it exists other than my identification of the pattern in the Running Applications table. See http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png : the applications started before the blip at 1.9 hours ago still have active drivers. All the applications started 1.9 hours ago do not, and the applications started less than 1.9 hours ago (at the top of the table) do in fact have active drivers. Deploy mode: - PySpark drivers running on one node outside the cluster, scheduled by a cron-like application, not master supervised Other factoids: - In most places, we call sc.stop() explicitly before shutting down our driver process - Here's the sum total of spark configuration options we don't set to the default:
{code}
spark.cores.max: 30
spark.eventLog.dir: hdfs://nn.shopify.com:8020/var/spark/event-logs
spark.eventLog.enabled: true
spark.executor.memory: 7g
spark.hadoop.fs.defaultFS: hdfs://nn.shopify.com:8020/
spark.io.compression.codec: lzf
spark.ui.killEnabled: true
{code}
- The exception the executors die with is this: {code} 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban) 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started 14/11/19 19:42:37 INFO Remoting: Starting remoting 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682] 14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682. 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread main
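One of the factoids above, "in most places, we call sc.stop() explicitly", is worth making unconditional: a driver that exits on an exception path without stopping its context is exactly the kind of process the master can keep listed as Running. A minimal sketch of the driver-side pattern — the `SparkContext` below is a local stand-in stub, not pyspark, so the snippet is self-contained:

```python
class SparkContext:
    """Stand-in stub for pyspark.SparkContext, so the sketch runs anywhere."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

sc = SparkContext()
try:
    pass  # run the job; an exception raised here still reaches the finally block
finally:
    sc.stop()  # always deregister the application from the master

print(sc.stopped)  # True
```

The try/finally guarantees `stop()` runs on every exit path, which narrows the bug down to cases where the driver process is killed outright.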
[jira] [Commented] (SPARK-2188) Support sbt/sbt for Windows
[ https://issues.apache.org/jira/browse/SPARK-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233656#comment-14233656 ] Apache Spark commented on SPARK-2188: - User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/3591 Support sbt/sbt for Windows --- Key: SPARK-2188 URL: https://issues.apache.org/jira/browse/SPARK-2188 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.0.0 Reporter: Pat McDonough Add the equivalent of sbt/sbt for Windows users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2188) Support sbt/sbt for Windows
[ https://issues.apache.org/jira/browse/SPARK-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233670#comment-14233670 ] Masayoshi TSUZUKI commented on SPARK-2188: -- I implemented the equivalent of the sbt scripts for Windows in PowerShell. If you are using Windows, you can just try it. Support sbt/sbt for Windows --- Key: SPARK-2188 URL: https://issues.apache.org/jira/browse/SPARK-2188 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.0.0 Reporter: Pat McDonough Add the equivalent of sbt/sbt for Windows users.
[jira] [Created] (SPARK-4732) All application progress on the standalone scheduler can be halted by one systematically faulty node
Harry Brundage created SPARK-4732: - Summary: All application progress on the standalone scheduler can be halted by one systematically faulty node Key: SPARK-4732 URL: https://issues.apache.org/jira/browse/SPARK-4732 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Environment: - Spark Standalone scheduler Reporter: Harry Brundage We've experienced several cluster wide outages caused by unexpected system wide faults on one of our spark workers if that worker is failing systematically. By systematically, I mean that every executor launched by that worker will definitely fail due to some reason out of Spark's control like the log directory disk being completely out of space, or a permissions error for a file that's always read during executor launch. We screw up all the time on our team and cause stuff like this to happen, but because of the way the standalone scheduler allocates resources, our cluster doesn't recover gracefully from these failures. Correct me if I am wrong but when there are more tasks to do than executors, I am pretty sure the way the scheduler works is that it just waits for more resource offers and then allocates tasks from the queue to those resources. If an executor dies immediately after starting, the worker monitor process will notice that it's dead. The master will allocate that worker's now free cores/memory to a currently running application that is below its spark.cores.max, which in our case I've observed as usually the app that just had the executor die. A new executor gets spawned on the same worker that the last one just died on, gets allocated that one task that failed, and then the whole process fails again for the same systematic reason, and lather rinse repeat. This happens 10 times or whatever the max task failure count is, and then the whole app is deemed a failure by the driver and shut down completely. For us, we usually run roughly as many cores as we have hadoop nodes. 
We also usually have many more input splits than we have tasks, which means the locality of the first few tasks (which I believe determines where our executors run) is well spread out over the cluster, and often covers 90-100% of nodes. This means the likelihood of any application getting an executor scheduled on any broken node is quite high. So, in my experience, after an application goes through the above-mentioned process and dies, the next application to start or not be at its requested max capacity gets an executor scheduled on the broken node, and is promptly taken down as well. This happens over and over as well, to the point where none of our spark jobs are making any progress because of one tiny permissions mistake on one node. Now, I totally understand this is usually an error between keyboard and screen kind of situation where it is the responsibility of the people deploying spark to ensure it is deployed correctly. The systematic issues we've encountered are almost always of this nature: permissions errors, disk full errors, one node not getting a new spark jar from a configuration error, configurations being out of sync, etc. That said, disks are going to fail or half fail, fill up, node rot is going to ruin configurations, etc etc etc, and as hadoop clusters scale in size this becomes more and more likely, so I think it's reasonable to ask that Spark be resilient to this kind of failure and keep on truckin'. I think a good simple fix would be to have applications, or the master, blacklist workers (not executors) at a failure count lower than the task failure count. This would also serve as a belt and suspenders fix for SPARK-4498. If the scheduler stopped trying to schedule on nodes that fail a lot, we could still make progress. These blacklist events are really important and I think would need to be well logged and surfaced in the UI, but I'd rather log and carry on than fail hard.
I think the tradeoff here is that you risk blacklisting every worker as well if there is something systematically wrong with communication or whatever else I can't imagine. Please let me know if I've misunderstood how the scheduler works or you need more information or anything like that and I'll be happy to provide.
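The worker-level blacklist proposed above can be sketched as a plain-Python simulation (illustrative only — the class name, the failure threshold, and the worker ids are made up here, not Spark APIs):

```python
from collections import defaultdict

class WorkerBlacklist:
    """Track executor failures per worker and exclude systematically
    faulty workers before the per-task failure limit is reached."""

    def __init__(self, max_worker_failures=3):
        self.max_worker_failures = max_worker_failures
        self.failures = defaultdict(int)

    def record_executor_failure(self, worker_id):
        self.failures[worker_id] += 1

    def is_blacklisted(self, worker_id):
        return self.failures[worker_id] >= self.max_worker_failures

    def schedulable(self, workers):
        # Offer resources only from workers that are not yet blacklisted,
        # so one bad node cannot absorb every retry.
        return [w for w in workers if not self.is_blacklisted(w)]

# A worker whose executors always die is excluded after three failures,
# while the rest of the cluster keeps making progress.
bl = WorkerBlacklist(max_worker_failures=3)
for _ in range(3):
    bl.record_executor_failure("dn11")
print(bl.schedulable(["dn11", "dn12", "dn13"]))  # ['dn12', 'dn13']
```

Setting the worker threshold below the task-failure limit is what keeps the application alive: the broken worker is removed from the offer pool before the repeated executor deaths exhaust the task's retry budget.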
[jira] [Updated] (SPARK-4732) All application progress on the standalone scheduler can be halted by one systematically faulty node
[ https://issues.apache.org/jira/browse/SPARK-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harry Brundage updated SPARK-4732: -- Description: We've experienced several cluster wide outages caused by unexpected system wide faults on one of our spark workers if that worker is failing systematically. By systematically, I mean that every executor launched by that worker will definitely fail due to some reason out of Spark's control like the log directory disk being completely out of space, or a permissions error for a file that's always read during executor launch. We screw up all the time on our team and cause stuff like this to happen, but because of the way the standalone scheduler allocates resources, our cluster doesn't recover gracefully from these failures. When there are more tasks to do than executors, I am pretty sure the way the scheduler works is that it just waits for more resource offers and then allocates tasks from the queue to those resources. If an executor dies immediately after starting, the worker monitor process will notice that it's dead. The master will allocate that worker's now free cores/memory to a currently running application that is below its spark.cores.max, which in our case I've observed as usually the app that just had the executor die. A new executor gets spawned on the same worker that the last one just died on, gets allocated that one task that failed, and then the whole process fails again for the same systematic reason, and lather rinse repeat. This happens 10 times or whatever the max task failure count is, and then the whole app is deemed a failure by the driver and shut down completely. For us, we usually run roughly as many cores as we have hadoop nodes. We also usually have many more input splits than we have tasks, which means the locality of the first few tasks which I believe determines where our executors run is well spread out over the cluster, and often covers 90-100% of nodes. 
This means the likelihood of any application getting an executor scheduled on any broken node is quite high. So, in my experience, after an application goes through the above-mentioned process and dies, the next application to start or not be at its requested max capacity gets an executor scheduled on the broken node, and is promptly taken down as well. This happens over and over as well, to the point where none of our spark jobs are making any progress because of one tiny permissions mistake on one node. Now, I totally understand this is usually an error between keyboard and screen kind of situation where it is the responsibility of the people deploying spark to ensure it is deployed correctly. The systematic issues we've encountered are almost always of this nature: permissions errors, disk full errors, one node not getting a new spark jar from a configuration error, configurations being out of sync, etc. That said, disks are going to fail or half fail, fill up, node rot is going to ruin configurations, etc etc etc, and as hadoop clusters scale in size this becomes more and more likely, so I think it's reasonable to ask that Spark be resilient to this kind of failure and keep on truckin'. I think a good simple fix would be to have applications, or the master, blacklist workers (not executors) at a failure count lower than the task failure count. This would also serve as a belt and suspenders fix for SPARK-4498. If the scheduler stopped trying to schedule on nodes that fail a lot, we could still make progress. These blacklist events are really important and I think would need to be well logged and surfaced in the UI, but I'd rather log and carry on than fail hard. I think the tradeoff here is that you risk blacklisting every worker as well if there is something systematically wrong with communication or whatever else I can't imagine.
Please let me know if I've misunderstood how the scheduler works or you need more information or anything like that and I'll be happy to provide.
[jira] [Updated] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-874: -- Fix Version/s: (was: 1.2.0) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Assignee: Archit Thakur Priority: Minor Labels: starter When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. There is some equivalent logic here written in python, we just need to add it to the shell script: https://github.com/pwendell/spark-perf/blob/master/bin/run#L117
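The poll-until-dead behavior requested above can be sketched in Python (a sketch under assumptions: `is_alive` stands in for the per-host ssh check, e.g. a `pgrep` for the Worker process; none of these names exist in Spark's scripts):

```python
import time

def wait_until_stopped(workers, is_alive, interval=2.0, timeout=120.0):
    """Poll every `interval` seconds until no worker process is alive,
    or give up after `timeout` seconds. `is_alive(host)` would wrap an
    ssh check such as `ssh host pgrep -f spark.deploy.worker.Worker`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        still_up = [w for w in workers if is_alive(w)]
        if not still_up:
            return True  # every worker has shut down
        time.sleep(interval)
    return False  # timed out with workers still running

# Stubbed check: both workers report alive on the first polling round,
# then "die" before the second, so the wait returns True.
state = {"polls": 0}
def fake_is_alive(host):
    state["polls"] += 1
    return state["polls"] <= 2

print(wait_until_stopped(["w1", "w2"], fake_is_alive, interval=0.01))  # True
```

A `--wait` flag in stop-all.sh would simply run this loop after issuing the stop commands, returning a nonzero exit code on timeout so benchmarking scripts can fail fast.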
[jira] [Updated] (SPARK-4732) All application progress on the standalone scheduler can be halted by one systematically faulty node
[ https://issues.apache.org/jira/browse/SPARK-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harry Brundage updated SPARK-4732: -- Description: We've experienced several cluster wide outages caused by unexpected system wide faults on one of our spark workers if that worker is failing systematically. By systematically, I mean that every executor launched by that worker will definitely fail due to some reason out of Spark's control like the log directory disk being completely out of space, or a permissions error for a file that's always read during executor launch. We screw up all the time on our team and cause stuff like this to happen, but because of the way the standalone scheduler allocates resources, our cluster doesn't recover gracefully from these failures. When there are more tasks to do than executors, I am pretty sure the way the scheduler works is that it just waits for more resource offers and then allocates tasks from the queue to those resources. If an executor dies immediately after starting, the worker monitor process will notice that it's dead. The master will allocate that worker's now free cores/memory to a currently running application that is below its spark.cores.max, which in our case I've observed as usually the app that just had the executor die. A new executor gets spawned on the same worker that the last one just died on, gets allocated that one task that failed, and then the whole process fails again for the same systematic reason, and lather rinse repeat. This happens 10 times or whatever the max task failure count is, and then the whole app is deemed a failure by the driver and shut down completely. This happens to us for all applications in the cluster as well. We usually run roughly as many cores as we have hadoop nodes. 
We also usually have many more input splits than we have tasks, which means the locality of the first few tasks (which I believe determines where our executors run) is well spread out over the cluster, and often covers 90-100% of nodes. This means the likelihood of any application getting an executor scheduled on any broken node is quite high. After an old application goes through the above-mentioned process and dies, the next application to start or not be at its requested max capacity gets an executor scheduled on the broken node, and is promptly taken down as well. This happens over and over as well, to the point where none of our spark jobs are making any progress because of one tiny permissions mistake on one node. Now, I totally understand this is usually an error between keyboard and screen kind of situation where it is the responsibility of the people deploying spark to ensure it is deployed correctly. The systematic issues we've encountered are almost always of this nature: permissions errors, disk full errors, one node not getting a new spark jar from a configuration error, configurations being out of sync, etc. That said, disks are going to fail or half fail, fill up, node rot is going to ruin configurations, etc etc etc, and as hadoop clusters scale in size this becomes more and more likely, so I think it's reasonable to ask that Spark be resilient to this kind of failure and keep on truckin'. I think a good simple fix would be to have applications, or the master, blacklist workers (not executors) at a failure count lower than the task failure count. This would also serve as a belt and suspenders fix for SPARK-4498. If the scheduler stopped trying to schedule on nodes that fail a lot, we could still make progress. These blacklist events are really important and I think would need to be well logged and surfaced in the UI, but I'd rather log and carry on than fail hard.
I think the tradeoff here is that you risk blacklisting every worker as well if there is something systematically wrong with communication or whatever else I can't imagine. Please let me know if I've misunderstood how the scheduler works or you need more information or anything like that and I'll be happy to provide.
[jira] [Resolved] (SPARK-4085) Job will fail if a shuffle file that's read locally gets deleted
[ https://issues.apache.org/jira/browse/SPARK-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4085. Resolution: Fixed Fix Version/s: 1.2.0 Job will fail if a shuffle file that's read locally gets deleted Key: SPARK-4085 URL: https://issues.apache.org/jira/browse/SPARK-4085 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 This commit: https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a changed the behavior of fetching local shuffle blocks such that if a shuffle block is not found locally, the shuffle block is no longer marked as failed, and a fetch failed exception is not thrown (this is because the catch block here won't ever be invoked: https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a#diff-e6e1631fa01e17bf851f49d30d028823R202 because the exception thrown from getLocalFromDisk() doesn't get thrown until next() gets called on the iterator). [~rxin] [~matei] it looks like you guys changed the test for this to catch the new exception that gets thrown (https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a#diff-9c2e1918319de967045d04caf813a7d1R93). Was that intentional? Because the new exception is a SparkException and not a FetchFailedException, jobs with missing local shuffle data will now fail, rather than having the map stage get retried. This problem is reproducible with this test case:
{code}
test("hash shuffle manager recovers when local shuffle files get deleted") {
  val conf = new SparkConf(false)
  conf.set("spark.shuffle.manager", "hash")
  sc = new SparkContext("local", "test", conf)
  val rdd = sc.parallelize(1 to 10, 2).map((_, 1)).reduceByKey(_ + _)
  rdd.count()
  // Delete one of the local shuffle blocks.
  sc.env.blockManager.diskBlockManager.getFile(new ShuffleBlockId(0, 0, 0)).delete()
  rdd.count()
}
{code}
which will fail on the second rdd.count(). This is a regression from 1.1.
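The root cause noted above — the exception from getLocalFromDisk() not surfacing until next() is called on the iterator — is ordinary lazy-iterator behavior, which a few lines of Python (not Spark code; the function and block ids are invented for illustration) make concrete:

```python
def fetch_blocks(block_ids):
    # A generator: constructing it does no work, so a try/except wrapped
    # around the construction catches nothing -- the failure only
    # surfaces once the consumer starts iterating.
    for bid in block_ids:
        if bid == "missing":
            raise IOError("shuffle block not found: %s" % bid)
        yield bid

it = fetch_blocks(["block_0", "missing"])  # no exception raised here
try:
    consumed = list(it)  # iteration triggers the fetch, and the error
except IOError as exc:
    print("caught on consumption:", exc)
```

This is why the catch block around the fetch call is never invoked in the commit above: by the time the error actually fires, control has already moved past the code that was supposed to translate it into a FetchFailedException.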
[jira] [Resolved] (SPARK-4711) MLlib optimization: docs should suggest how to choose optimizer
[ https://issues.apache.org/jira/browse/SPARK-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4711. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3569 [https://github.com/apache/spark/pull/3569] MLlib optimization: docs should suggest how to choose optimizer --- Key: SPARK-4711 URL: https://issues.apache.org/jira/browse/SPARK-4711 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).
[jira] [Updated] (SPARK-4711) MLlib optimization: docs should suggest how to choose optimizer
[ https://issues.apache.org/jira/browse/SPARK-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4711: - Assignee: Joseph K. Bradley MLlib optimization: docs should suggest how to choose optimizer --- Key: SPARK-4711 URL: https://issues.apache.org/jira/browse/SPARK-4711 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.2.0 I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).