[jira] [Created] (SPARK-4933) eventLog file not found after merging into a single file

2014-12-23 Thread Zhang, Liye (JIRA)
Zhang, Liye created SPARK-4933:
--

 Summary: eventLog file not found after merging into a single file
 Key: SPARK-4933
 URL: https://issues.apache.org/jira/browse/SPARK-4933
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Reporter: Zhang, Liye


An "event log file not found" exception will be thrown after merging the eventLog into a 
single file. The main cause is that the wrong arguments are used for getting the log file.






[jira] [Updated] (SPARK-4933) eventLog file not found after merging into a single file

2014-12-23 Thread Zhang, Liye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang, Liye updated SPARK-4933:
---
Description: event log file not found exception will be thrown after making 
eventLog into a single file. Main course is the wrong arguments for getting log 
file.  (was: event log file not found exception will be thrown after merging 
eventLog into a single file. Main course is the wrong arguments for getting log 
file.)

> eventLog file not found after merging into a single file
> 
>
> Key: SPARK-4933
> URL: https://issues.apache.org/jira/browse/SPARK-4933
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
>
> event log file not found exception will be thrown after making eventLog into 
> a single file. Main course is the wrong arguments for getting log file.






[jira] [Updated] (SPARK-4933) eventLog file not found after merging into a single file

2014-12-23 Thread Zhang, Liye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang, Liye updated SPARK-4933:
---
Description: event log file not found exception will be thrown after making 
eventLog into a single file. Main cause is the wrong arguments for getting log 
file.  (was: event log file not found exception will be thrown after making 
eventLog into a single file. Main course is the wrong arguments for getting log 
file.)

> eventLog file not found after merging into a single file
> 
>
> Key: SPARK-4933
> URL: https://issues.apache.org/jira/browse/SPARK-4933
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
>
> event log file not found exception will be thrown after making eventLog into 
> a single file. Main cause is the wrong arguments for getting log file.






[jira] [Updated] (SPARK-4933) eventLog file not found after merging into a single file

2014-12-23 Thread Zhang, Liye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang, Liye updated SPARK-4933:
---
Description: event log file not found exception will be thrown after making 
eventLog into a single file. Main cause is the wrong argument for getting log 
file.  (was: event log file not found exception will be thrown after making 
eventLog into a single file. Main cause is the wrong arguments for getting log 
file.)

> eventLog file not found after merging into a single file
> 
>
> Key: SPARK-4933
> URL: https://issues.apache.org/jira/browse/SPARK-4933
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
>
> event log file not found exception will be thrown after making eventLog into 
> a single file. Main cause is the wrong argument for getting log file.






[jira] [Commented] (SPARK-4933) eventLog file not found after merging into a single file

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256736#comment-14256736
 ] 

Apache Spark commented on SPARK-4933:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/3777

> eventLog file not found after merging into a single file
> 
>
> Key: SPARK-4933
> URL: https://issues.apache.org/jira/browse/SPARK-4933
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
>
> event log file not found exception will be thrown after making eventLog into 
> a single file. Main cause is the wrong argument for getting log file.






[jira] [Created] (SPARK-4934) Connection key is hard to read

2014-12-23 Thread Hong Shen (JIRA)
Hong Shen created SPARK-4934:


 Summary: Connection key is hard to read
 Key: SPARK-4934
 URL: https://issues.apache.org/jira/browse/SPARK-4934
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.1
Reporter: Hong Shen


When I run a big Spark job, the executors produce a lot of log entries like:
14/12/23 15:25:31 INFO network.ConnectionManager: key already cancelled ? 
sun.nio.ch.SelectionKeyImpl@52b0e278
java.nio.channels.CancelledKeyException
at 
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:310)
at 
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
It's hard to know which connection was cancelled. Maybe we can change it to:
logInfo("Connection already cancelled ? " + con.getRemoteAddress(), e)






[jira] [Commented] (SPARK-4926) Spark manipulate Hbase

2014-12-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256747#comment-14256747
 ] 

Sean Owen commented on SPARK-4926:
--

I think this is more of a question than an issue report, and JIRA is for suggesting 
changes to the code. Questions are best asked at u...@spark.apache.org

Here, you are trying to make an RDD of an unserializable type but then trying 
to copy it across the network to the driver with take(). This won't work. You 
need to operate on a different type.
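
A minimal sketch of that advice, reusing the table name and RDD construction from the quoted program (same HBase/Spark setup assumed): convert each (ImmutableBytesWritable, Result) pair into plain serializable values before calling take().
{code}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object HBaseTakeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseTakeExample"))
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "sensteer_rawdata")

    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    // Map to a serializable type (here: the row key as a String) *before* results are
    // shipped to the driver; ImmutableBytesWritable and Result themselves are not
    // serializable, which is what causes the reported failure.
    val rowKeys = hBaseRDD.map { case (key, _) => Bytes.toString(key.get()) }
    rowKeys.take(10).foreach(println)

    sc.stop()
  }
}
{code}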

> Spark manipulate Hbase
> --
>
> Key: SPARK-4926
> URL: https://issues.apache.org/jira/browse/SPARK-4926
> Project: Spark
>  Issue Type: Question
>Reporter: Lily
>
> When I run the program below, I got an error “Job aborted due to stage 
> failure: Task 0.0 in stage 2.0 (TID 14) had a not serializable 
> result:org.apache.hadoop.hbase.io.ImmutableBytesWritable”
> How can I work with the results?
> How can I implement get, put, and scan for HBase in Scala?
> There aren't any examples in the source code files.
> import org.apache.hadoop.hbase.client.HBaseAdmin
> import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> import org.apache.spark._
> object HbaseTest extends Serializable{
>   def main(args: Array[String]) {
> val sparkConf = new SparkConf().setAppName("HBaseTest")
> val sc = new SparkContext(sparkConf)
> val conf = HBaseConfiguration.create()
> conf.set("hbase.zookeeper.property.clientPort", "2181");
> conf.set("hbase.zookeeper.quorum", "192.168.179.146");
> conf.set(TableInputFormat.INPUT_TABLE, "sensteer_rawdata")
> val admin = new HBaseAdmin(conf)
> if (!admin.isTableAvailable("sensteer_rawdata")) {
>   val tableDesc = new HTableDescriptor("sensteer_rawdata")
>   admin.createTable(tableDesc)
> }
> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>   classOf[org.apache.hadoop.hbase.client.Result])
> val count = hBaseRDD.count()
> println("--" + hBaseRDD.count() + "--")
> val res = hBaseRDD.take(count.toInt)
> sc.stop()
>   }
> }






[jira] [Created] (SPARK-4935) When hive.cli.print.header configured, spark-sql aborted if passed in a invalid sql

2014-12-23 Thread wangfei (JIRA)
wangfei created SPARK-4935:
--

 Summary: When hive.cli.print.header configured, spark-sql aborted 
if passed in a invalid sql
 Key: SPARK-4935
 URL: https://issues.apache.org/jira/browse/SPARK-4935
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: wangfei
 Fix For: 1.3.0


When hive.cli.print.header is configured, spark-sql aborts if passed an invalid 
SQL statement.






[jira] [Commented] (SPARK-4935) When hive.cli.print.header configured, spark-sql aborted if passed in a invalid sql

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256811#comment-14256811
 ] 

Apache Spark commented on SPARK-4935:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3761

> When hive.cli.print.header configured, spark-sql aborted if passed in a 
> invalid sql
> ---
>
> Key: SPARK-4935
> URL: https://issues.apache.org/jira/browse/SPARK-4935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.2.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> When hive.cli.print.header is configured, spark-sql aborts if passed an 
> invalid SQL statement.






[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"

2014-12-23 Thread Joseph Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256852#comment-14256852
 ] 

Joseph Tang commented on SPARK-4846:


It sounds achievable.

I'll try this and make a PR later if it works well.

> When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: 
> Requested array size exceeds VM limit"
> ---
>
> Key: SPARK-4846
> URL: https://issues.apache.org/jira/browse/SPARK-4846
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
> Environment: Use Word2Vec to process a corpus (sized 3.5 GB) with one 
> partition.
> The corpus contains about 300 million words and its vocabulary size is about 
> 10 million.
>Reporter: Joseph Tang
>Priority: Critical
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> at 
> org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)






[jira] [Created] (SPARK-4936) Please support Named Vector so as to maintain the record ID in clustering etc.

2014-12-23 Thread mahesh bhole (JIRA)
mahesh bhole created SPARK-4936:
---

 Summary: Please support Named Vector so as to maintain the record 
ID in clustering etc.
 Key: SPARK-4936
 URL: https://issues.apache.org/jira/browse/SPARK-4936
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.1
Reporter: mahesh bhole
Priority: Minor


Hi
Please support Named Vector so as to maintain the record ID in clustering etc.

Thanks,
Mahesh






[jira] [Created] (SPARK-4937) Adding optimization to simplify the filter condition

2014-12-23 Thread wangfei (JIRA)
wangfei created SPARK-4937:
--

 Summary: Adding optimization to simplify the filter condition
 Key: SPARK-4937
 URL: https://issues.apache.org/jira/browse/SPARK-4937
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.3.0


Adding an optimization to simplify the filter condition:
1. Conditions that can be reduced to a boolean constant, such as:
a < 3 && a > 5   => false
a < 1 || a > 0   => true

2. Simplify And/Or conditions, such as the following SQL (one of the hive-testbench queries):
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from
lineitem,
part
where
(
p_partkey = l_partkey
and p_brand = 'Brand#32'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 7 and l_quantity <= 7 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#35'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 15 and l_quantity <= 15 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#24'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 26 and l_quantity <= 26 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
);
 Before the optimization the plan is a CartesianProduct; in my local test this SQL hangs 
and cannot produce a result. After the optimization the CartesianProduct is replaced by a 
ShuffledHashJoin, which needs only 20+ seconds to run this SQL.
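
A toy, self-contained sketch of point 1 above, using a hypothetical mini expression tree rather than Spark's Catalyst classes; it only illustrates how such a rule could fold contradictory or tautological integer comparisons into a boolean literal.
{code}
object FilterSimplificationSketch {
  sealed trait Expr
  case class Lt(attr: String, v: Int) extends Expr   // attr < v
  case class Gt(attr: String, v: Int) extends Expr   // attr > v
  case class And(l: Expr, r: Expr) extends Expr
  case class Or(l: Expr, r: Expr) extends Expr
  case class BoolLit(b: Boolean) extends Expr

  def simplify(e: Expr): Expr = e match {
    // a < x && a > y has no integer solution when x <= y + 1 (e.g. a < 3 && a > 5 => false)
    case And(Lt(a1, x), Gt(a2, y)) if a1 == a2 && x <= y + 1 => BoolLit(false)
    // a < x || a > y holds for every integer when y < x (e.g. a < 1 || a > 0 => true)
    case Or(Lt(a1, x), Gt(a2, y)) if a1 == a2 && y < x => BoolLit(true)
    case And(l, r) => And(simplify(l), simplify(r))
    case Or(l, r)  => Or(simplify(l), simplify(r))
    case other     => other
  }

  def main(args: Array[String]): Unit = {
    println(simplify(And(Lt("a", 3), Gt("a", 5))))  // BoolLit(false)
    println(simplify(Or(Lt("a", 1), Gt("a", 0))))   // BoolLit(true)
  }
}
{code}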






[jira] [Created] (SPARK-4938) Adding optimization to simplify the filter condition

2014-12-23 Thread wangfei (JIRA)
wangfei created SPARK-4938:
--

 Summary: Adding optimization to simplify the filter condition
 Key: SPARK-4938
 URL: https://issues.apache.org/jira/browse/SPARK-4938
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.3.0


Adding an optimization to simplify the filter condition:
1. Conditions that can be reduced to a boolean constant, such as:
a < 3 && a > 5   => false
a < 1 || a > 0   => true

2. Simplify And/Or conditions, such as the following SQL (one of the hive-testbench queries):
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from
lineitem,
part
where
(
p_partkey = l_partkey
and p_brand = 'Brand#32'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 7 and l_quantity <= 7 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#35'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 15 and l_quantity <= 15 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#24'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 26 and l_quantity <= 26 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
);
 Before the optimization the plan is a CartesianProduct; in my local test this SQL hangs 
and cannot produce a result. After the optimization the CartesianProduct is replaced by a 
ShuffledHashJoin, which needs only 20+ seconds to run this SQL.






[jira] [Commented] (SPARK-4938) Adding optimization to simplify the filter condition

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256964#comment-14256964
 ] 

Apache Spark commented on SPARK-4938:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3778

> Adding optimization to simplify the filter condition
> 
>
> Key: SPARK-4938
> URL: https://issues.apache.org/jira/browse/SPARK-4938
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> Adding an optimization to simplify the filter condition:
> 1. Conditions that can be reduced to a boolean constant, such as:
> a < 3 && a > 5   => false
> a < 1 || a > 0   => true
> 2. Simplify And/Or conditions, such as the following SQL (one of the hive-testbench queries):
> select
> sum(l_extendedprice* (1 - l_discount)) as revenue
> from
> lineitem,
> part
> where
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#32'
> and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
> and l_quantity >= 7 and l_quantity <= 7 + 10
> and p_size between 1 and 5
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> )
> or
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#35'
> and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
> and l_quantity >= 15 and l_quantity <= 15 + 10
> and p_size between 1 and 10
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> )
> or
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#24'
> and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
> and l_quantity >= 26 and l_quantity <= 26 + 10
> and p_size between 1 and 15
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> );
>  Before the optimization the plan is a CartesianProduct; in my local test this SQL hangs 
> and cannot produce a result. After the optimization the CartesianProduct is replaced by a 
> ShuffledHashJoin, which needs only 20+ seconds to run this SQL.






[jira] [Commented] (SPARK-4936) Please support Named Vector so as to maintain the record ID in clustering etc.

2014-12-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256985#comment-14256985
 ] 

Sean Owen commented on SPARK-4936:
--

Are you referring to the NamedVector idea from Mahout? I think that already 
exists in a different form here.

If you have an RDD of (ID, Vector), then you can already use a 
clustering model to map all the values to a predicted cluster with mapValues(), 
and end up with an RDD of (ID, clusterID). 

If that's what you're looking for, then it does not require further work in 
Spark.
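
A minimal sketch of that suggestion, assuming MLlib's KMeans and an RDD keyed by a record ID; mapValues keeps the ID attached to each predicted cluster, which is what NamedVector provides in Mahout.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions (values, mapValues) in Spark 1.x
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object NamedVectorClustering {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("NamedVectorClustering"))

    // (recordId, featureVector) pairs play the role of Mahout's NamedVector
    val data = sc.parallelize(Seq(
      ("rec-1", Vectors.dense(0.0, 0.1)),
      ("rec-2", Vectors.dense(9.8, 9.9)),
      ("rec-3", Vectors.dense(0.2, 0.0))))

    val model = KMeans.train(data.values, 2, 10)

    // (recordId, clusterId): the record ID survives clustering without extra support in Spark
    val assignments = data.mapValues(v => model.predict(v))
    assignments.collect().foreach(println)

    sc.stop()
  }
}
{code}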

> Please support Named Vector so as to maintain the record ID in clustering etc.
> --
>
> Key: SPARK-4936
> URL: https://issues.apache.org/jira/browse/SPARK-4936
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.1
>Reporter: mahesh bhole
>Priority: Minor
>
> Hi
> Please support Named Vector so as to maintain the record ID in clustering etc.
> Thanks,
> Mahesh






[jira] [Commented] (SPARK-4937) Adding optimization to simplify the filter condition

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257037#comment-14257037
 ] 

Apache Spark commented on SPARK-4937:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3778

> Adding optimization to simplify the filter condition
> 
>
> Key: SPARK-4937
> URL: https://issues.apache.org/jira/browse/SPARK-4937
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> Adding an optimization to simplify the filter condition:
> 1. Conditions that can be reduced to a boolean constant, such as:
> a < 3 && a > 5   => false
> a < 1 || a > 0   => true
> 2. Simplify And/Or conditions, such as the following SQL (one of the hive-testbench queries):
> select
> sum(l_extendedprice* (1 - l_discount)) as revenue
> from
> lineitem,
> part
> where
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#32'
> and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
> and l_quantity >= 7 and l_quantity <= 7 + 10
> and p_size between 1 and 5
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> )
> or
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#35'
> and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
> and l_quantity >= 15 and l_quantity <= 15 + 10
> and p_size between 1 and 10
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> )
> or
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#24'
> and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
> and l_quantity >= 26 and l_quantity <= 26 + 10
> and p_size between 1 and 15
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> );
>  Before the optimization the plan is a CartesianProduct; in my local test this SQL hangs 
> and cannot produce a result. After the optimization the CartesianProduct is replaced by a 
> ShuffledHashJoin, which needs only 20+ seconds to run this SQL.






[jira] [Commented] (SPARK-4938) Adding optimization to simplify the filter condition

2014-12-23 Thread wangfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257042#comment-14257042
 ] 

wangfei commented on SPARK-4938:


Duplicate

> Adding optimization to simplify the filter condition
> 
>
> Key: SPARK-4938
> URL: https://issues.apache.org/jira/browse/SPARK-4938
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> Adding an optimization to simplify the filter condition:
> 1. Conditions that can be reduced to a boolean constant, such as:
> a < 3 && a > 5   => false
> a < 1 || a > 0   => true
> 2. Simplify And/Or conditions, such as the following SQL (one of the hive-testbench queries):
> select
> sum(l_extendedprice* (1 - l_discount)) as revenue
> from
> lineitem,
> part
> where
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#32'
> and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
> and l_quantity >= 7 and l_quantity <= 7 + 10
> and p_size between 1 and 5
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> )
> or
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#35'
> and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
> and l_quantity >= 15 and l_quantity <= 15 + 10
> and p_size between 1 and 10
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> )
> or
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#24'
> and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
> and l_quantity >= 26 and l_quantity <= 26 + 10
> and p_size between 1 and 15
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> );
>  Before the optimization the plan is a CartesianProduct; in my local test this SQL hangs 
> and cannot produce a result. After the optimization the CartesianProduct is replaced by a 
> ShuffledHashJoin, which needs only 20+ seconds to run this SQL.






[jira] [Resolved] (SPARK-4938) Adding optimization to simplify the filter condition

2014-12-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4938.
--
  Resolution: Duplicate
   Fix Version/s: (was: 1.3.0)
Target Version/s:   (was: 1.3.0)

> Adding optimization to simplify the filter condition
> 
>
> Key: SPARK-4938
> URL: https://issues.apache.org/jira/browse/SPARK-4938
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
>
> Adding an optimization to simplify the filter condition:
> 1. Conditions that can be reduced to a boolean constant, such as:
> a < 3 && a > 5   => false
> a < 1 || a > 0   => true
> 2. Simplify And/Or conditions, such as the following SQL (one of the hive-testbench queries):
> select
> sum(l_extendedprice* (1 - l_discount)) as revenue
> from
> lineitem,
> part
> where
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#32'
> and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
> and l_quantity >= 7 and l_quantity <= 7 + 10
> and p_size between 1 and 5
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> )
> or
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#35'
> and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
> and l_quantity >= 15 and l_quantity <= 15 + 10
> and p_size between 1 and 10
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> )
> or
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#24'
> and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
> and l_quantity >= 26 and l_quantity <= 26 + 10
> and p_size between 1 and 15
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> );
>  Before the optimization the plan is a CartesianProduct; in my local test this SQL hangs 
> and cannot produce a result. After the optimization the CartesianProduct is replaced by a 
> ShuffledHashJoin, which needs only 20+ seconds to run this SQL.






[jira] [Commented] (SPARK-4585) Spark dynamic scaling executors use upper limit value as default.

2014-12-23 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257069#comment-14257069
 ] 

Brock Noland commented on SPARK-4585:
-

[~chengxiang li] has done some testing of HoS on YARN and discusses this in the 
last half of his post here: 
https://issues.apache.org/jira/browse/HIVE-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256960#comment-14256960

> Spark dynamic scaling executors use upper limit value as default.
> -
>
> Key: SPARK-4585
> URL: https://issues.apache.org/jira/browse/SPARK-4585
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.1.0
>Reporter: Chengxiang Li
>
> With SPARK-3174, one can configure a minimum and maximum number of executors 
> for a Spark application on Yarn. However, the application always starts with 
> the maximum. It seems more reasonable, at least for Hive on Spark, to start 
> from the minimum and scale up as needed up to the maximum.
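
A minimal configuration sketch, assuming Spark 1.2's dynamic allocation settings on YARN (property names from the SPARK-3174 feature; the values are illustrative): today the application still starts near the maximum, and the request here is to start near the minimum and grow.
{code}
import org.apache.spark.SparkConf

object DynamicAllocationConfSketch {
  val conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "2")   // desired starting point
    .set("spark.dynamicAllocation.maxExecutors", "50")  // currently also the initial size
    .set("spark.shuffle.service.enabled", "true")       // external shuffle service is required on YARN
}
{code}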






[jira] [Commented] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems

2014-12-23 Thread Iljya Kalai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257116#comment-14257116
 ] 

Iljya Kalai commented on SPARK-4820:


Thanks for creating this issue, and thanks to Luchesar for providing a 
workaround. Are the ramifications of pulling this into the head repository 
known? It would be great if compiling on encrypted drives just worked. It's a 
bit unfortunate that new Spark users with encrypted drives will get slammed by 
this issue upon trying to compile Spark for the first time.

> Spark build encounters "File name too long" on some encrypted filesystems
> -
>
> Key: SPARK-4820
> URL: https://issues.apache.org/jira/browse/SPARK-4820
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Wendell
>
> This was reported by Luchesar Cekov on github along with a proposed fix. The 
> fix has some potential downstream issues (it will modify the classnames) so 
> until we understand better how many users are affected we aren't going to 
> merge it. However, I'd like to include the issue and workaround here. If you 
> encounter this issue please comment on the JIRA so we can assess the 
> frequency.
> The issue produces this error:
> {code}
> [error] == Expanded type of tree ==
> [error] 
> [error] ConstantType(value = Constant(Throwable))
> [error] 
> [error] uncaught exception during compilation: java.io.IOException
> [error] File name too long
> [error] two errors found
> {code}
> The workaround is in maven under the compile options add: 
> {code}
> +  <arg>-Xmax-classfile-name</arg>
> +  <arg>128</arg>
> {code}
> In SBT add:
> {code}
> +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
> {code}






[jira] [Created] (SPARK-4939) Python updateStateByKey example hang in local mode

2014-12-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-4939:
-

 Summary: Python updateStateByKey example hang in local mode
 Key: SPARK-4939
 URL: https://issues.apache.org/jira/browse/SPARK-4939
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core, Streaming
Affects Versions: 1.2.0, 1.3.0
Reporter: Davies Liu
Priority: Blocker









[jira] [Commented] (SPARK-4160) Standalone cluster mode does not upload all needed jars to driver node

2014-12-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257300#comment-14257300
 ] 

Marcelo Vanzin commented on SPARK-4160:
---

[~gst] if that's the casa it would be a different bug. Please file a separate 
one.

> Standalone cluster mode does not upload all needed jars to driver node
> --
>
> Key: SPARK-4160
> URL: https://issues.apache.org/jira/browse/SPARK-4160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>
> If you look at the code in {{DriverRunner.scala}}, there is code to download 
> the main application jar from the launcher node. But that's the only jar 
> that's downloaded - if the driver depends on one of the jars or files 
> specified via {{spark-submit --jars <jars> --files <files>}}, it won't be able 
> to run.
> It should be possible to use the same mechanism to distribute the other files 
> to the driver node, even if that's not the most efficient way of doing it. 
> That way, at least, you don't need any external dependencies to be able to 
> distribute the files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4160) Standalone cluster mode does not upload all needed jars to driver node

2014-12-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-4160:
--
Issue Type: Improvement  (was: Bug)

> Standalone cluster mode does not upload all needed jars to driver node
> --
>
> Key: SPARK-4160
> URL: https://issues.apache.org/jira/browse/SPARK-4160
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>
> If you look at the code in {{DriverRunner.scala}}, there is code to download 
> the main application jar from the launcher node. But that's the only jar 
> that's downloaded - if the driver depends on one of the jars or files 
> specified via {{spark-submit --jars <jars> --files <files>}}, it won't be able 
> to run.
> It should be possible to use the same mechanism to distribute the other files 
> to the driver node, even if that's not the most efficient way of doing it. 
> That way, at least, you don't need any external dependencies to be able to 
> distribute the files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4160) Standalone cluster mode does not upload all needed jars to driver node

2014-12-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257300#comment-14257300
 ] 

Marcelo Vanzin edited comment on SPARK-4160 at 12/23/14 6:23 PM:
-

[~gst] if that's the case it would be a different bug. Please file a separate 
one.


was (Author: vanzin):
[~gst] if that's the casa it would be a different bug. Please file a separate 
one.

> Standalone cluster mode does not upload all needed jars to driver node
> --
>
> Key: SPARK-4160
> URL: https://issues.apache.org/jira/browse/SPARK-4160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>
> If you look at the code in {{DriverRunner.scala}}, there is code to download 
> the main application jar from the launcher node. But that's the only jar 
> that's downloaded - if the driver depends on one of the jars or files 
> specified via {{spark-submit --jars <jars> --files <files>}}, it won't be able 
> to run.
> It should be possible to use the same mechanism to distribute the other files 
> to the driver node, even if that's not the most efficient way of doing it. 
> That way, at least, you don't need any external dependencies to be able to 
> distribute the files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4940) Document or Support more evenly distributing cores for Mesos mode

2014-12-23 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-4940:
---

 Summary: Document or Support more evenly distributing cores for 
Mesos mode
 Key: SPARK-4940
 URL: https://issues.apache.org/jira/browse/SPARK-4940
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen


Currently, in coarse-grained mode the Spark scheduler simply takes all the 
resources it can on each node, which can cause an uneven distribution depending on the 
resources available on each slave.








[jira] [Commented] (SPARK-4325) Improve spark-ec2 cluster launch times

2014-12-23 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257376#comment-14257376
 ] 

Josh Rosen commented on SPARK-4325:
---

[~nchammas] - Yeah, I usually try for a one-to-one match between PRs and JIRAs 
since it makes it easier to track where PRs have been merged, where backports 
are needed, etc.  It's fine to re-open this until those other features are 
added.  You could also add them as subtasks to this issue.

> Improve spark-ec2 cluster launch times
> --
>
> Key: SPARK-4325
> URL: https://issues.apache.org/jira/browse/SPARK-4325
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.3.0
>
>
> There are several optimizations we know we can make to [{{setup.sh}} | 
> https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches 
> faster.
> There are also some improvements to the AMIs that will help a lot.
> Potential improvements:
> * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This 
> will reduce or eliminate SSH wait time and Ganglia init time.
> * Replace instances of {{download; rsync to rest of cluster}} with parallel 
> downloads on all nodes of the cluster.
> * Replace instances of 
>  {code}
> for node in $NODES; do
>   command
>   sleep 0.3
> done
> wait{code}
>  with simpler calls to {{pssh}}.
> * Remove the [linear backoff | 
> https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665]
>  when we wait for SSH availability now that we are already waiting for EC2 
> status checks to clear before testing SSH.






[jira] [Commented] (SPARK-4241) spark_ec2.py support China AWS region: cn-north-1

2014-12-23 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257374#comment-14257374
 ] 

Josh Rosen commented on SPARK-4241:
---

[~nchammas] I linked SPARK-4890 as a blocker for this issue due to [comment 
upthread|https://issues.apache.org/jira/browse/SPARK-4241?focusedCommentId=14197803&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197803]:

{quote}
In order to see region: cn-north-1 , you will have to upgrade boto to the 
latest version.
{quote}

That fix doesn't enable Spark EC2 for {{cn-north-1}}, but I think the newer 
boto version was a prerequisite for supporting that, hence the "blocks" link.

> spark_ec2.py support China AWS region: cn-north-1
> -
>
> Key: SPARK-4241
> URL: https://issues.apache.org/jira/browse/SPARK-4241
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Haitao Yao
>
> Amazon started a new region in China: cn-north-1. But in 
> https://github.com/mesos/spark-ec2/tree/v4/ami-list
> there's no AMI ID for the region cn-north-1, so ec2/spark_ec2.py fails 
> at this step. 
> We need to add an AMI ID for region cn-north-1 in 
> https://github.com/mesos/spark-ec2/tree/v4/ami-list






[jira] [Updated] (SPARK-4931) Fix the messy format about log4j in running-on-yarn.md

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4931:
--
Assignee: Shixiong Zhu

> Fix the messy format about log4j in running-on-yarn.md
> --
>
> Key: SPARK-4931
> URL: https://issues.apache.org/jira/browse/SPARK-4931
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, YARN
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Trivial
> Fix For: 1.3.0, 1.2.1
>
> Attachments: log4j.png
>
>
> The format about log4j in running-on-yarn.md is a bit messy.






[jira] [Resolved] (SPARK-4931) Fix the messy format about log4j in running-on-yarn.md

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4931.
---
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

> Fix the messy format about log4j in running-on-yarn.md
> --
>
> Key: SPARK-4931
> URL: https://issues.apache.org/jira/browse/SPARK-4931
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, YARN
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Trivial
> Fix For: 1.3.0, 1.2.1
>
> Attachments: log4j.png
>
>
> The format about log4j in running-on-yarn.md is a bit messy.






[jira] [Commented] (SPARK-4160) Standalone cluster mode does not upload all needed jars to driver node

2014-12-23 Thread Gurpreet Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257394#comment-14257394
 ] 

Gurpreet Singh commented on SPARK-4160:
---

Hi Marcelo,

Will open a separate JIRA for the issue.

Thanks!

Gurpreet




> Standalone cluster mode does not upload all needed jars to driver node
> --
>
> Key: SPARK-4160
> URL: https://issues.apache.org/jira/browse/SPARK-4160
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>
> If you look at the code in {{DriverRunner.scala}}, there is code to download 
> the main application jar from the launcher node. But that's the only jar 
> that's downloaded - if the driver depends on one of the jars or files 
> specified via {{spark-submit --jars <jars> --files <files>}}, it won't be able 
> to run.
> It should be possible to use the same mechanism to distribute the other files 
> to the driver node, even if that's not the most efficient way of doing it. 
> That way, at least, you don't need any external dependencies to be able to 
> distribute the files.






[jira] [Created] (SPARK-4941) Yarn cluster mode does not upload all needed jars to driver node (Spark 1.2.0)

2014-12-23 Thread Gurpreet Singh (JIRA)
Gurpreet Singh created SPARK-4941:
-

 Summary: Yarn cluster mode does not upload all needed jars to 
driver node (Spark 1.2.0)
 Key: SPARK-4941
 URL: https://issues.apache.org/jira/browse/SPARK-4941
 Project: Spark
  Issue Type: Bug
Reporter: Gurpreet Singh


I am specifying additional jars and a config XML file with the --jars and --files 
options to be uploaded to the driver in the following spark-submit command. However, 
they are not getting uploaded.

This results in the job failing. It was working with the Spark 1.0.2 build.

Spark build being used: spark-1.2.0.tgz




$SPARK_HOME/bin/spark-submit \
--class com.ebay.inc.scala.testScalaXML \
--driver-class-path 
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar:/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar:/apache/hadoop/share/hadoop/common/lib/guava-11.0.2.jar
 \
--master yarn \
--deploy-mode cluster \
--num-executors 3 \
--driver-memory 1G  \
--executor-memory 1G \
/export/home/b_incdata_rw/gurpreetsingh/jar/testscalaxml_2.11-1.0.jar 
/export/home/b_incdata_rw/gurpreetsingh/sqlFramework.xml next_gen_linking \
--queue hdmi-spark \
--jars 
/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-api-jdo-3.2.1.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-core-3.2.2.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-rdbms-3.2.1.jar,/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar,/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar,/apache/hadoop/share/hadoop/common/lib/hadoop-lzo-0.6.0.jar,/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar\
--files 
/export/home/b_incdata_rw/gurpreetsingh/spark-1.0.2-bin-2.4.1/conf/hive-site.xml

Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/22 23:00:17 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
to rm2
14/12/22 23:00:17 INFO yarn.Client: Requesting a new application from cluster 
with 2026 NodeManagers
14/12/22 23:00:17 INFO yarn.Client: Verifying our application has not requested 
more than the maximum memory capability of the cluster (16384 MB per container)
14/12/22 23:00:17 INFO yarn.Client: Will allocate AM container, with 1408 MB 
memory including 384 MB overhead
14/12/22 23:00:17 INFO yarn.Client: Setting up container launch context for our 
AM
14/12/22 23:00:17 INFO yarn.Client: Preparing resources for our AM container
14/12/22 23:00:18 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
14/12/22 23:00:18 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.
14/12/22 23:00:21 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 
6623380 for b_incdata_rw on 10.115.201.75:8020
14/12/22 23:00:21 INFO yarn.Client: Uploading resource 
file:/home/b_incdata_rw/gurpreetsingh/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar
 -> 
hdfs://-nn.vip.xxx.com:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/spark-assembly-1.2.0-hadoop2.4.0.jar
14/12/22 23:00:24 INFO yarn.Client: Uploading resource 
file:/export/home/b_incdata_rw/gurpreetsingh/jar/firstsparkcode_2.11-1.0.jar -> 
hdfs://-nn.vip.xxx.com:8020:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/firstsparkcode_2.11-1.0.jar
14/12/22 23:00:25 INFO yarn.Client: Setting up the launch environment for our 
AM container







[jira] [Updated] (SPARK-4834) Spark fails to clean up cache / lock files in local dirs

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4834:
--
Assignee: Marcelo Vanzin

> Spark fails to clean up cache / lock files in local dirs
> 
>
> Key: SPARK-4834
> URL: https://issues.apache.org/jira/browse/SPARK-4834
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0, 1.2.1
>
>
> This issue was caused by https://github.com/apache/spark/commit/7aacb7bfa.
> That change shares downloaded jar / files among multiple executors running on 
> the same host by using a lock file and a cache file for each file the 
> executor needs to download. The problem is that these lock and cache files 
> are never deleted.
> On Yarn, the app's dir is automatically deleted when the app ends, so no 
> files are left behind. But on standalone, there's no such thing as "the app's 
> dir"; files will end up in "/tmp" or in whatever place the user configure in 
> "SPARK_LOCAL_DIRS", and will eventually start to fill that volume.
> We should add a way to clean up these files. It's not as simple as "hey, just 
> call File.deleteOnExit()!" because we're talking about multiple processes 
> accessing these files, so to maintain the efficiency gains of the original 
> change, the files should only be deleted when the application is finished.
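
A hypothetical sketch of the lock-file-plus-cache-file pattern the description refers to (not Spark's actual implementation): the first executor process on a host downloads into a shared cache while holding a file lock, later ones copy from the cache, and neither the lock file nor the cache file is ever removed, which is the leak reported here.
{code}
import java.io.{File, RandomAccessFile}
import java.nio.file.{Files, StandardCopyOption}

object SharedFetchSketch {
  def fetchViaCache(url: String, cacheFile: File, lockFile: File, dest: File)
                   (download: (String, File) => Unit): Unit = {
    val raf = new RandomAccessFile(lockFile, "rw")
    val lock = raf.getChannel.lock()        // blocks until this process holds the lock
    try {
      if (!cacheFile.exists()) {
        download(url, cacheFile)            // only the first process on the host downloads
      }
    } finally {
      lock.release()
      raf.close()
    }
    // every executor copies from the shared cache into its own working directory;
    // cacheFile and lockFile stay behind in SPARK_LOCAL_DIRS / /tmp until someone cleans them up
    Files.copy(cacheFile.toPath, dest.toPath, StandardCopyOption.REPLACE_EXISTING)
  }
}
{code}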






[jira] [Resolved] (SPARK-4834) Spark fails to clean up cache / lock files in local dirs

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4834.
---
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

Issue resolved by pull request 3705
[https://github.com/apache/spark/pull/3705]

> Spark fails to clean up cache / lock files in local dirs
> 
>
> Key: SPARK-4834
> URL: https://issues.apache.org/jira/browse/SPARK-4834
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
> Fix For: 1.3.0, 1.2.1
>
>
> This issue was caused by https://github.com/apache/spark/commit/7aacb7bfa.
> That change shares downloaded jar / files among multiple executors running on 
> the same host by using a lock file and a cache file for each file the 
> executor needs to download. The problem is that these lock and cache files 
> are never deleted.
> On Yarn, the app's dir is automatically deleted when the app ends, so no 
> files are left behind. But on standalone, there's no such thing as "the app's 
> dir"; files will end up in "/tmp" or in whatever place the user configure in 
> "SPARK_LOCAL_DIRS", and will eventually start to fill that volume.
> We should add a way to clean up these files. It's not as simple as "hey, just 
> call File.deleteOnExit()!" because we're talking about multiple processes 
> accessing these files, so to maintain the efficiency gains of the original 
> change, the files should only be deleted when the application is finished.






[jira] [Commented] (SPARK-4939) Python updateStateByKey example hang in local mode

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257444#comment-14257444
 ] 

Apache Spark commented on SPARK-4939:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3779

> Python updateStateByKey example hang in local mode
> --
>
> Key: SPARK-4939
> URL: https://issues.apache.org/jira/browse/SPARK-4939
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, Streaming
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Davies Liu
>Priority: Blocker
>







[jira] [Updated] (SPARK-4766) ML Estimator Params should subclass Transformer Params

2014-12-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4766:
-
Description: 
Currently, in spark.ml, both Transformers and Estimators extend the same Params 
classes.  There should be one Params class for the Transformer and one for the 
Estimator, where the Estimator params class extends the Transformer one.

E.g., it is weird to be able to do:
{code}
val model: LogisticRegressionModel = ...
model.getMaxIter()
{code}


  was:
Currently, in spark.ml, both Transformers and Estimators extend the same Params 
classes.  There should be one Params class for the Transformer and one for the 
Estimator, where the Estimator params class extends the Transformer one.

E.g., it is weird to be able to do:
{code}
val model: LogisticRegressionModel = ...
model.setMaxIter(10)
{code}



> ML Estimator Params should subclass Transformer Params
> --
>
> Key: SPARK-4766
> URL: https://issues.apache.org/jira/browse/SPARK-4766
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> Currently, in spark.ml, both Transformers and Estimators extend the same 
> Params classes.  There should be one Params class for the Transformer and one 
> for the Estimator, where the Estimator params class extends the Transformer 
> one.
> E.g., it is weird to be able to do:
> {code}
> val model: LogisticRegressionModel = ...
> model.getMaxIter()
> {code}
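
A hypothetical sketch of the proposed split (the trait names and params below are illustrative, not spark.ml's actual ones): the model's params carry only what still applies after fitting, and the estimator's params extend them with training-only settings such as maxIter.
{code}
// Estimator-side params extend the Transformer-side params, not the other way around.
trait LogisticRegressionModelParams {
  def getThreshold: Double = 0.5      // meaningful for both fitting and prediction
}

trait LogisticRegressionParams extends LogisticRegressionModelParams {
  def getMaxIter: Int = 100           // training-only, so it lives with the Estimator
}

class LogisticRegression extends LogisticRegressionParams            // the Estimator
class LogisticRegressionModel extends LogisticRegressionModelParams  // the fitted Transformer

// With this split, (new LogisticRegressionModel).getMaxIter no longer compiles,
// which removes the oddity shown in the description above.
{code}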






[jira] [Resolved] (SPARK-4932) Add help comments in Analytics

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4932.
---
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0
 Assignee: Takeshi Yamamuro

Fixed by https://github.com/apache/spark/pull/3775

> Add help comments in Analytics
> --
>
> Key: SPARK-4932
> URL: https://issues.apache.org/jira/browse/SPARK-4932
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Trivial
> Fix For: 1.3.0, 1.2.1
>
>
> Add help comments for taskType in Analytics.






[jira] [Commented] (SPARK-4766) ML Estimator Params should subclass Transformer Params

2014-12-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257495#comment-14257495
 ] 

Joseph K. Bradley commented on SPARK-4766:
--

This will require modifying Params.inheritValues so that inheritance is handled 
properly.  (It should check against the child's parameters.)

> ML Estimator Params should subclass Transformer Params
> --
>
> Key: SPARK-4766
> URL: https://issues.apache.org/jira/browse/SPARK-4766
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Currently, in spark.ml, both Transformers and Estimators extend the same 
> Params classes.  There should be one Params class for the Transformer and one 
> for the Estimator, where the Estimator params class extends the Transformer 
> one.
> E.g., it is weird to be able to do:
> {code}
> val model: LogisticRegressionModel = ...
> model.getMaxIter()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4766) ML Estimator Params should subclass Transformer Params

2014-12-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-4766:


Assignee: Joseph K. Bradley

> ML Estimator Params should subclass Transformer Params
> --
>
> Key: SPARK-4766
> URL: https://issues.apache.org/jira/browse/SPARK-4766
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Currently, in spark.ml, both Transformers and Estimators extend the same 
> Params classes.  There should be one Params class for the Transformer and one 
> for the Estimator, where the Estimator params class extends the Transformer 
> one.
> E.g., it is weird to be able to do:
> {code}
> val model: LogisticRegressionModel = ...
> model.getMaxIter()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4942) ML Transformers should allow output cols to be turned on,off

2014-12-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-4942:


 Summary: ML Transformers should allow output cols to be turned 
on,off
 Key: SPARK-4942
 URL: https://issues.apache.org/jira/browse/SPARK-4942
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


ML Transformers will eventually output multiple columns (e.g., predicted 
labels, predicted confidences, probabilities, etc.).  These columns should be 
optional.

Benefits:
* more efficient (though Spark SQL may be able to optimize)
* cleaner column namespace if people do not want all output columns

Proposal:
* If a column name parameter (e.g., predictionCol) is an empty string, then do 
not output that column.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4942) ML Transformers should allow output cols to be turned on,off

2014-12-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4942:
-
Description: 
ML Transformers will eventually output multiple columns (e.g., predicted 
labels, predicted confidences, probabilities, etc.).  These columns should be 
optional.

Benefits:
* more efficient (though Spark SQL may be able to optimize)
* cleaner column namespace if people do not want all output columns

Proposal:
* If a column name parameter (e.g., predictionCol) is an empty string, then do 
not output that column.

This will require updating validateAndTransformSchema() to ignore empty output 
column names in addition to updating transform().
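A minimal sketch of the proposed convention, using a plain Scala Map as a stand-in for a SchemaRDD row (hypothetical code, not the actual spark.ml implementation):
{code}
// An empty output-column name means "do not emit that column".
def transform(data: Seq[Map[String, Any]], predictionCol: String): Seq[Map[String, Any]] =
  data.map { row =>
    val prediction = 0.5                        // stand-in for a real model score
    if (predictionCol.isEmpty) row              // column switched off
    else row + (predictionCol -> prediction)    // column switched on
  }

val input = Seq(Map[String, Any]("features" -> Seq(1.0, 2.0)))
transform(input, predictionCol = "")        // rows come back without a prediction column
transform(input, predictionCol = "score")   // rows gain a "score" column
{code}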

  was:
ML Transformers will eventually output multiple columns (e.g., predicted 
labels, predicted confidences, probabilities, etc.).  These columns should be 
optional.

Benefits:
* more efficient (though Spark SQL may be able to optimize)
* cleaner column namespace if people do not want all output columns

Proposal:
* If a column name parameter (e.g., predictionCol) is an empty string, then do 
not output that column.



> ML Transformers should allow output cols to be turned on,off
> 
>
> Key: SPARK-4942
> URL: https://issues.apache.org/jira/browse/SPARK-4942
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> ML Transformers will eventually output multiple columns (e.g., predicted 
> labels, predicted confidences, probabilities, etc.).  These columns should be 
> optional.
> Benefits:
> * more efficient (though Spark SQL may be able to optimize)
> * cleaner column namespace if people do not want all output columns
> Proposal:
> * If a column name parameter (e.g., predictionCol) is an empty string, then 
> do not output that column.
> This will require updating validateAndTransformSchema() to ignore empty 
> output column names in addition to updating transform().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4914) Two sets of datanucleus versions left in lib_managed after running dev/run-tests

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4914:
--
Assignee: Cheng Lian

> Two sets of datanucleus versions left in lib_managed after running 
> dev/run-tests
> 
>
> Key: SPARK-4914
> URL: https://issues.apache.org/jira/browse/SPARK-4914
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.3.0, 1.2.1
>
>
> The {{dev/run-tests}} script first does a clean compile with Hive 0.12.0, and 
> then builds the assembly jar for unit testing with Hive 0.13.1 *without* 
> cleaning. This leaves two sets of datanucleus jars under the {{lib_managed}} 
> folder, one set depended on by Hive 0.12.0 and the other by Hive 0.13.1.
> This behavior sometimes messes up class paths and makes {{CliSuite}} and 
> {{HiveThriftServer2Suite}} fail, because these two suites spawn external 
> processes that depend on those datanucleus jars.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4914) Two sets of datanucleus versions left in lib_managed after running dev/run-tests

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4914.
---
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

Issue resolved by pull request 3756
[https://github.com/apache/spark/pull/3756]

> Two sets of datanucleus versions left in lib_managed after running 
> dev/run-tests
> 
>
> Key: SPARK-4914
> URL: https://issues.apache.org/jira/browse/SPARK-4914
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Cheng Lian
>Priority: Minor
> Fix For: 1.3.0, 1.2.1
>
>
> The {{dev/run-tests}} script first does a clean compile with Hive 0.12.0, and 
> then builds the assembly jar for unit testing with Hive 0.13.1 *without* 
> cleaning. This leaves two sets of datanucleus jars under the {{lib_managed}} 
> folder, one set depended on by Hive 0.12.0 and the other by Hive 0.13.1.
> This behavior sometimes messes up class paths and makes {{CliSuite}} and 
> {{HiveThriftServer2Suite}} fail, because these two suites spawn external 
> processes that depend on those datanucleus jars.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4730) Warn against deprecated YARN settings

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4730.
---
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

Issue resolved by pull request 3590
[https://github.com/apache/spark/pull/3590]

> Warn against deprecated YARN settings
> -
>
> Key: SPARK-4730
> URL: https://issues.apache.org/jira/browse/SPARK-4730
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.3.0, 1.2.1
>
>
> Yarn currently reads from SPARK_MASTER_MEMORY and SPARK_WORKER_MEMORY. If you 
> have these settings left over from a standalone cluster setup and you try to 
> run Spark on Yarn on the same cluster, then your executors suddenly get the 
> amount of memory specified through SPARK_WORKER_MEMORY.
> This behavior is kept largely for backward compatibility. However, we 
> should at the very least log a warning against the use of these variables.
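For illustration, a minimal sketch of the kind of warning being proposed (hypothetical code, not the actual YARN client; spark.executor.memory and --executor-memory are the standard replacements):
{code}
// Warn when legacy standalone-mode variables are still set while running on YARN.
val deprecatedVars = Seq("SPARK_MASTER_MEMORY", "SPARK_WORKER_MEMORY")
for (key <- deprecatedVars; value <- sys.env.get(key)) {
  Console.err.println(
    s"WARNING: $key (set to $value) is a standalone-mode setting; " +
    "on YARN use spark.executor.memory / --executor-memory instead.")
}
{code}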



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4933) eventLog file not found after merging into a single file

2014-12-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-4933.
---
Resolution: Duplicate

> eventLog file not found after merging into a single file
> 
>
> Key: SPARK-4933
> URL: https://issues.apache.org/jira/browse/SPARK-4933
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
>
> An event log file not found exception will be thrown after making the eventLog into 
> a single file. The main cause is a wrong argument when getting the log file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4913) Fix incorrect event log path

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4913:
--
Affects Version/s: 1.3.0

> Fix incorrect event log path
> 
>
> Key: SPARK-4913
> URL: https://issues.apache.org/jira/browse/SPARK-4913
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0
>Reporter: Liang-Chi Hsieh
>
> SPARK-2261 uses a single file to log events for an app. `eventLogDir` in 
> `ApplicationDescription` is replaced with `eventLogFile`. However, 
> `ApplicationDescription` in `SparkDeploySchedulerBackend` is initialized with 
> `SparkContext`'s `eventLogDir`. It is just the log directory, not the actual 
> log file path. `Master.rebuildSparkUI` can not correctly rebuild a new 
> SparkUI for the app.
> Because the `ApplicationDescription` is remotely registered with `Master` and 
> the app's id is then generated in `Master`, we can not get the app id in 
> advance before registration. So the received description needs to be modified 
> with correct `eventLogFile` value.
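To illustrate the mismatch with made-up values (hypothetical paths and app id, not taken from the actual code):
{code}
// The directory alone is not enough to locate the single event log file.
val eventLogDir  = "hdfs:///spark/event-logs"     // known to SparkContext up front
val appId        = "app-20141223164500-0001"      // only assigned by the Master after registration
val eventLogFile = s"$eventLogDir/$appId"         // the path Master.rebuildSparkUI actually needs
{code}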



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file

2014-12-23 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257648#comment-14257648
 ] 

Tathagata Das commented on SPARK-4314:
--

Yes, doing a plain -put is the wrong way to upload files to HDFS for testing 
fileStream. Files must be moved into the monitored HDFS directory atomically, 
using rename. 
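For example, a sketch of an atomic hand-off using the Hadoop FileSystem API (paths are made up):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Stage the file outside the monitored directory, then rename it in,
// so textFileStream never observes a partially written file.
val fs = FileSystem.get(new Configuration())
val staging   = new Path("/user/spark/staging/200")
val monitored = new Path("/user/spark/input/200")
fs.copyFromLocalFile(new Path("file:///tmp/200"), staging)  // slow copy happens off to the side
fs.rename(staging, monitored)                               // atomic move into the watched directory
{code}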

I also looked around to see how other systems deal with _COPYING_ and found a 
Hadoop MR JIRA - https://issues.apache.org/jira/browse/MAPREDUCE-5247
They closed it because FsShell's behavior of creating _COPYING_ files should 
not require MR to ignore them, since they are perfectly visible files. 
Same resolution in https://issues.apache.org/jira/browse/HADOOP-9750. So I am 
closing this JIRA.

> Exception when textFileStream attempts to read deleted _COPYING_ file
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: maji2014
>
> [Reproduce]
>  1. Run HdfsWordCount interface, such as "ssc.textFileStream(args(0))"
>  2. Upload file to hdfs(reason as followings)
>  3. Exception as followings.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.\_COPYING\_
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> org.apache.spark.stream

[jira] [Resolved] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file

2014-12-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-4314.
--
Resolution: Invalid

> Exception when textFileStream attempts to read deleted _COPYING_ file
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: maji2014
>
> [Reproduce]
>  1. Run HdfsWordCount interface, such as "ssc.textFileStream(args(0))"
>  2. Upload file to hdfs(reason as followings)
>  3. Exception as followings.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.\_COPYING\_
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.

[jira] [Updated] (SPARK-4913) Fix incorrect event log path

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4913:
--
Assignee: Liang-Chi Hsieh

> Fix incorrect event log path
> 
>
> Key: SPARK-4913
> URL: https://issues.apache.org/jira/browse/SPARK-4913
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.3.0
>
>
> SPARK-2261 uses a single file to log events for an app. `eventLogDir` in 
> `ApplicationDescription` is replaced with `eventLogFile`. However, 
> `ApplicationDescription` in `SparkDeploySchedulerBackend` is initialized with 
> `SparkContext`'s `eventLogDir`. It is just the log directory, not the actual 
> log file path. `Master.rebuildSparkUI` can not correctly rebuild a new 
> SparkUI for the app.
> Because the `ApplicationDescription` is remotely registered with `Master` and 
> the app's id is then generated in `Master`, we can not get the app id in 
> advance before registration. So the received description needs to be modified 
> with correct `eventLogFile` value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4913) Fix incorrect event log path

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4913.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3755
[https://github.com/apache/spark/pull/3755]

> Fix incorrect event log path
> 
>
> Key: SPARK-4913
> URL: https://issues.apache.org/jira/browse/SPARK-4913
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0
>Reporter: Liang-Chi Hsieh
> Fix For: 1.3.0
>
>
> SPARK-2261 uses a single file to log events for an app. `eventLogDir` in 
> `ApplicationDescription` is replaced with `eventLogFile`. However, 
> `ApplicationDescription` in `SparkDeploySchedulerBackend` is initialized with 
> `SparkContext`'s `eventLogDir`. It is just the log directory, not the actual 
> log file path. `Master.rebuildSparkUI` can not correctly rebuild a new 
> SparkUI for the app.
> Because the `ApplicationDescription` is remotely registered with `Master` and 
> the app's id is then generated in `Master`, we can not get the app id in 
> advance before registration. So the received description needs to be modified 
> with correct `eventLogFile` value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4802) ReceiverInfo removal at ReceiverTracker upon deregistering receiver

2014-12-23 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257668#comment-14257668
 ] 

Tathagata Das commented on SPARK-4802:
--

SPARK-2892 is not a duplicate of this: they have the same symptoms, but the 
causes are different (though related). SPARK-2892 affects only the socket 
receiver and is probably due to the socket receiver not stopping cleanly. 
SPARK-4802 affects all receivers, and prevents the driver from realizing that 
all receivers have been closed.

> ReceiverInfo removal at ReceiverTracker upon deregistering receiver
> ---
>
> Key: SPARK-4802
> URL: https://issues.apache.org/jira/browse/SPARK-4802
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Ilayaperumal Gopinathan
>Priority: Minor
>
> When the streaming receiver is deregistered, the ReceiverTracker doesn't 
> remove the corresponding receiverInfo entry for the receiver.
> When the receiver is stopped at the executor, the ReceiverTrackerActor 
> processes the 'DeregisterReceiver' message. Shouldn't it then remove the 
> receiverInfo entry for that receiver, since the receiver is actually deregistered?
> Not sure if there is any specific reason for not removing it.
> Currently, I see this warning if we don't remove it:
> WARN main-EventThread scheduler.ReceiverTracker - All of the receivers have 
> not deregistered, Map(0 -> 
> ReceiverInfo(0,MyReceiver-0,null,false,localhost,Stopped by driver,))
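A simplified sketch of the proposed bookkeeping change (plain Scala stand-in, not the actual ReceiverTracker code):
{code}
import scala.collection.mutable

// streamId -> receiver info, simplified to a String here.
val receiverInfo = mutable.HashMap[Int, String]()

def deregisterReceiver(streamId: Int, message: String): Unit = {
  // Removing the entry lets the "all receivers deregistered" shutdown check pass
  // instead of logging "All of the receivers have not deregistered".
  receiverInfo -= streamId
  println(s"Deregistered receiver for stream $streamId: $message")
}
{code}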



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4802) ReceiverInfo removal at ReceiverTracker upon deregistering receiver

2014-12-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-4802.
--
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.1.2
   1.3.0

> ReceiverInfo removal at ReceiverTracker upon deregistering receiver
> ---
>
> Key: SPARK-4802
> URL: https://issues.apache.org/jira/browse/SPARK-4802
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Ilayaperumal Gopinathan
>Priority: Minor
> Fix For: 1.3.0, 1.1.2, 1.2.1
>
>
> When the streaming receiver is deregistered, the ReceiverTracker doesn't 
> remove the corresponding receiverInfo entry for the receiver.
> When the receiver is stopped at the executor, the ReceiverTrackerActor 
> processes the 'DeregisterReceiver' message. Shouldn't it then remove the 
> receiverInfo entry for that receiver, since the receiver is actually deregistered?
> Not sure if there is any specific reason for not removing it.
> Currently, I see this warning if we don't remove it:
> WARN main-EventThread scheduler.ReceiverTracker - All of the receivers have 
> not deregistered, Map(0 -> 
> ReceiverInfo(0,MyReceiver-0,null,false,localhost,Stopped by driver,))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4671) Streaming block need not to replicate 2 copies when WAL is enabled

2014-12-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-4671.
--
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

> Streaming block need not to replicate 2 copies when WAL is enabled
> --
>
> Key: SPARK-4671
> URL: https://issues.apache.org/jira/browse/SPARK-4671
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Saisai Shao
> Fix For: 1.3.0, 1.2.1
>
>
> Generated streaming blocks should not be replicated to another node when the WAL 
> is enabled: the WAL already provides fault tolerance, and the extra replication 
> only hurts the throughput of the streaming application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4606) SparkSubmitDriverBootstrapper does not propagate EOF to child JVM

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4606.
---
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0
   1.1.2

Issue resolved by pull request 3460
[https://github.com/apache/spark/pull/3460]

> SparkSubmitDriverBootstrapper does not propagate EOF to child JVM
> -
>
> Key: SPARK-4606
> URL: https://issues.apache.org/jira/browse/SPARK-4606
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
> Fix For: 1.1.2, 1.3.0, 1.2.1
>
>
> Run this with "spark.driver.extraJavaOptions" set in your spark-defaults.conf:
> {code}
>   echo "" | spark-shell --master local -Xnojline
> {code}
> You'll end up with a child process that cannot read from stdin (you can 
> CTRL-C out of it though). That's because when the bootstrapper's stdin 
> reaches EOF, that is not propagated to the child JVM that's actually doing 
> the reading.
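A rough sketch of the kind of fix this implies, assuming the bootstrapper pipes its own stdin to the child process (names and buffer size are illustrative):
{code}
// Copy our stdin to the child's stdin and, crucially, close the child's
// stream once we hit EOF so the child JVM sees EOF too.
def pipeStdin(child: Process): Unit = {
  val out = child.getOutputStream
  val buf = new Array[Byte](8192)
  var n = System.in.read(buf)
  while (n != -1) {
    out.write(buf, 0, n)
    out.flush()
    n = System.in.read(buf)
  }
  out.close()   // without this the child never observes EOF on its stdin
}
{code}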



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4606) SparkSubmitDriverBootstrapper does not propagate EOF to child JVM

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4606:
--
Assignee: Marcelo Vanzin

> SparkSubmitDriverBootstrapper does not propagate EOF to child JVM
> -
>
> Key: SPARK-4606
> URL: https://issues.apache.org/jira/browse/SPARK-4606
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0, 1.1.2, 1.2.1
>
>
> Run this with "spark.driver.extraJavaOptions" set in your spark-defaults.conf:
> {code}
>   echo "" | spark-shell --master local -Xnojline
> {code}
> You'll end up with a child process that cannot read from stdin (you can 
> CTRL-C out of it though). That's because when the bootstrapper's stdin 
> reaches EOF, that is not propagated to the child JVM that's actually doing 
> the reading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4606) SparkSubmitDriverBootstrapper does not propagate EOF to child JVM

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4606:
--
Affects Version/s: 1.1.1

> SparkSubmitDriverBootstrapper does not propagate EOF to child JVM
> -
>
> Key: SPARK-4606
> URL: https://issues.apache.org/jira/browse/SPARK-4606
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0, 1.1.2, 1.2.1
>
>
> Run this with "spark.driver.extraJavaOptions" set in your spark-defaults.conf:
> {code}
>   echo "" | spark-shell --master local -Xnojline
> {code}
> You'll end up with a child process that cannot read from stdin (you can 
> CTRL-C out of it though). That's because when the bootstrapper's stdin 
> reaches EOF, that is not propagated to the child JVM that's actually doing 
> the reading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-23 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257728#comment-14257728
 ] 

Tathagata Das commented on SPARK-4817:
--

I agree with [~srowen]'s point. 
1. Updating a database within a map operation is inherently not a good idea. The 
whole idea of the map-reduce model is based on the assumption that the map and 
reduce functions have no side effects and are idempotent; updating the 
database from a map operation (i) violates that property and (ii) is not 
good programming style with RDDs. 
2. Sean Owen's suggestion of using {{rdd.foreach}} and {{rdd.print}} is a good 
idea in this case (see the sketch below). After {{rdd.foreach}} has been 
executed, the {{rdd.print}} (which usually does not launch a job) should 
usually be cheap. 
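Something along these lines, for instance (illustrative only; {{updateDatabase}} stands in for the user's side-effecting code):
{code}
import org.apache.spark.streaming.dstream.DStream

def processAndPreview(dstream: DStream[String], updateDatabase: String => Unit): Unit =
  dstream.foreachRDD { rdd =>
    rdd.foreachPartition(_.foreach(updateDatabase))  // handle every element
    rdd.take(20).foreach(println)                    // print only the first 20 for inspection
  }
{code}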



> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The Dstream.print function prints 10 elements but handles only 11 elements.
> A new function based on Dstream.print is proposed:
> print the specified number of elements while handling all of the elements in the RDD.
> Example workload:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter needs to update a database in mapPartitions, but 
> there is no need to print every record; printing the top 20 is enough to inspect the processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-23 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257728#comment-14257728
 ] 

Tathagata Das edited comment on SPARK-4817 at 12/24/14 12:36 AM:
-

I agree with [~srowen]'s point. 
1. Updating a database within a map operation is inherently not a good idea. The 
whole idea of the map-reduce model is based on the assumption that the map and 
reduce functions have no side effects and are idempotent; updating the 
database from a map operation (i) violates that property and (ii) is not 
good programming style with RDDs. 
2. Sean Owen's suggestion of using {{rdd.foreach}} and {{rdd.print}} is a good 
idea in this case. After {{rdd.foreach}} has been executed, the {{rdd.print}} 
(which usually does not launch a job) should usually be cheap. 

Hence I am not convinced that this PR is needed.



was (Author: tdas):
I agree with [~srowen] point. 
1. Updating database within a map operation is inherently not a good idea. The 
whole idea of map-reduce model is based on the assumption that the map and 
reduce functions do not have any side-effects, and are idempotent, and updating 
the database using map operation (i) violates that property, and (ii) is not a 
good programming style with RDDs. 
2. Sean Owen's suggestion of using {{rdd.foreach}} and {{rdd.print}} is a good 
idea in this case. After {{rdd.foreach}} has been executed, the {{rdd.print}} 
(which usually does not launch a job) should be usually cheap. 



> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The Dstream.print function prints 10 elements but handles only 11 elements.
> A new function based on Dstream.print is proposed:
> print the specified number of elements while handling all of the elements in the RDD.
> Example workload:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter needs to update a database in mapPartitions, but 
> there is no need to print every record; printing the top 20 is enough to inspect the processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4943) Parsing error for query with table name having dot

2014-12-23 Thread Alex Liu (JIRA)
Alex Liu created SPARK-4943:
---

 Summary: Parsing error for query with table name having dot
 Key: SPARK-4943
 URL: https://issues.apache.org/jira/browse/SPARK-4943
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Alex Liu


When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. 
It was working with Spark 1.1.0. Basically we use a table name containing a 
dot to include the database name.

{code}
[info]   java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' 
found
[info] 

[info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT 
test2.a FROM sql_test.test2 AS test2
[info] ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info]   at 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
[info]   at 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
[info]   at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
[info]   at 
org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53)
[info]   at 
org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56)
[info]   at 
com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169)
[info]   at 
com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
[info]   at 
com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647)
[info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info]   at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683)
[info]   at 
org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1644)
[info]   at 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)
[info]   at 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1656)
[info]   at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1683)
[info]   at 
org.scalatest.FlatSpec

[jira] [Created] (SPARK-4944) Table Not Found exception in "Create Table Like registered RDD table"

2014-12-23 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-4944:


 Summary: Table Not Found exception in "Create Table Like 
registered RDD table"
 Key: SPARK-4944
 URL: https://issues.apache.org/jira/browse/SPARK-4944
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao


{code}
rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")
hiveContext.registerRDDAsTable(rdd_table, "rdd_table")
hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION 
'/user/spark/my_data.parquet'")
{code}

{panel}
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not 
found rdd_table
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284)
at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
at 
org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)
{panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4860) Improve performance of sample() and takeSample() on SchemaRDD

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4860.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3764
[https://github.com/apache/spark/pull/3764]

> Improve performance of sample() and  takeSample() on SchemaRDD
> --
>
> Key: SPARK-4860
> URL: https://issues.apache.org/jira/browse/SPARK-4860
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Davies Liu
> Fix For: 1.3.0
>
>
> In SchemaRDD, all the rows are already serialized into Java objects, so it's 
> possible to call sample()/takeSample() of JavaSchemaRDD() in Python, which 
> will be much faster than the current approach (implemented in pure Python).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4860) Improve performance of sample() and takeSample() on SchemaRDD

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4860:
--
Assignee: Ben Cook

> Improve performance of sample() and  takeSample() on SchemaRDD
> --
>
> Key: SPARK-4860
> URL: https://issues.apache.org/jira/browse/SPARK-4860
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Davies Liu
>Assignee: Ben Cook
> Fix For: 1.3.0
>
>
> In SchemaRDD, all the rows are already serialized into Java objects, so it's 
> possible to call sample()/takeSample() of JavaSchemaRDD() in Python, which 
> will be much faster than the current approach (implemented in pure Python).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4877) userClassPathFirst doesn't handle user classes inheriting from parent

2014-12-23 Thread Stephen Haberman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257855#comment-14257855
 ] 

Stephen Haberman commented on SPARK-4877:
-

FWIW two reviewers have okay'd this PR; can someone else take a look + commit 
it?

> userClassPathFirst doesn't handle user classes inheriting from parent
> -
>
> Key: SPARK-4877
> URL: https://issues.apache.org/jira/browse/SPARK-4877
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Stephen Haberman
>
> We're trying out userClassPathFirst.
> To do so, we make an uberjar that does not contain Spark or Scala classes 
> (because we want those to load from the parent classloader, otherwise we'll 
> get errors like scala.Function0 != scala.Function0 since they'd load from 
> different class loaders).
> (Tangentially, some isolation classloaders like Jetty whitelist certain 
> packages, like spark/* and scala/*, to only come from the parent classloader, 
> so that technically if the user still messes up and leaks the Scala/Spark 
> jars into their uberjar, it won't blow up; this would be a good enhancement, 
> I think.)
> Anyway, we have a custom Kryo registrar, which ships in our uberjar, but 
> since it "extends spark.KryoRegistrator", which is not in our uberjar, we get 
> a ClassNotFoundException.
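For reference, the kind of user class involved looks roughly like this (a hypothetical registrar; the point is that its superclass ships with Spark and is only visible to the parent classloader):
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Lives in the user's uberjar, but extends a class from the Spark jars,
// so resolving it under userClassPathFirst currently fails.
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[Byte]])
  }
}
{code}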



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4606) SparkSubmitDriverBootstrapper does not propagate EOF to child JVM

2014-12-23 Thread Stephen Haberman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257860#comment-14257860
 ] 

Stephen Haberman commented on SPARK-4606:
-

[~vanzin] since you're poking around in this section of the code, can you look 
at SPARK-4704 as well?

> SparkSubmitDriverBootstrapper does not propagate EOF to child JVM
> -
>
> Key: SPARK-4606
> URL: https://issues.apache.org/jira/browse/SPARK-4606
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0, 1.1.2, 1.2.1
>
>
> Run this with "spark.driver.extraJavaOptions" set in your spark-defaults.conf:
> {code}
>   echo "" | spark-shell --master local -Xnojline
> {code}
> You'll end up with a child process that cannot read from stdin (you can 
> CTRL-C out of it though). That's because when the bootstrapper's stdin 
> reaches EOF, that is not propagated to the child JVM that's actually doing 
> the reading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4704) SparkSubmitDriverBootstrap doesn't flush output

2014-12-23 Thread Stephen Haberman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257859#comment-14257859
 ] 

Stephen Haberman commented on SPARK-4704:
-

Note that PR 3655 is for a separate issue (it referenced this ticket as a typo).

> SparkSubmitDriverBootstrap doesn't flush output
> ---
>
> Key: SPARK-4704
> URL: https://issues.apache.org/jira/browse/SPARK-4704
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
> Environment: 1.2.0-rc1
>Reporter: Stephen Haberman
>
> When running spark-submit with a job that immediately blows up (say due to 
> init errors in the job code), there is no error output from spark-submit on 
> the console.
> When I ran spark-class directly, then I do see the error/stack trace on the 
> console.
> I believe the issue is in SparkSubmitDriverBootstrapper (I had 
> spark.driver.memory set in spark-defaults.conf) not waiting for the  
> RedirectThreads to flush/complete before exiting.
> E.g. here:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala#L143
> I believe around line 165 or so, stdoutThread.join() and
> stderrThread.join() calls are necessary to make sure the other threads
> have had a chance to flush process.getInputStream/getErrorStream to
> System.out/err before the process exits.
> I've been tripped up by this in similar RedirectThread/process code, hence 
> suspecting this is the problem.
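A sketch of the suggested change, reusing the thread names from the description (illustrative only, not a patch):
{code}
// Wait for the redirect threads to finish draining the child's stdout/stderr
// before the bootstrapper exits, so nothing buffered is lost.
def waitAndExit(process: Process, stdoutThread: Thread, stderrThread: Thread): Unit = {
  val exitCode = process.waitFor()
  stdoutThread.join()   // flush remaining stdout
  stderrThread.join()   // flush remaining stderr
  System.exit(exitCode)
}
{code}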



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4704) SparkSubmitDriverBootstrap doesn't flush output

2014-12-23 Thread Stephen Haberman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257859#comment-14257859
 ] 

Stephen Haberman edited comment on SPARK-4704 at 12/24/14 2:43 AM:
---

Note that PR 3655 (from the previous comment) is for a separate issue (it 
referenced this ticket number as a typo).


was (Author: stephen):
That that PR 3655 is for a separate issue (it referenced this ticket as a typo).

> SparkSubmitDriverBootstrap doesn't flush output
> ---
>
> Key: SPARK-4704
> URL: https://issues.apache.org/jira/browse/SPARK-4704
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
> Environment: 1.2.0-rc1
>Reporter: Stephen Haberman
>
> When running spark-submit with a job that immediately blows up (say due to 
> init errors in the job code), there is no error output from spark-submit on 
> the console.
> When I ran spark-class directly, then I do see the error/stack trace on the 
> console.
> I believe the issue is in SparkSubmitDriverBootstrapper (I had 
> spark.driver.memory set in spark-defaults.conf) not waiting for the  
> RedirectThreads to flush/complete before exiting.
> E.g. here:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala#L143
> I believe around line 165 or so, stdoutThread.join() and
> stderrThread.join() calls are necessary to make sure the other threads
> have had a chance to flush process.getInputStream/getErrorStream to
> System.out/err before the process exits.
> I've been tripped up by this in similar RedirectThread/process code, hence 
> suspecting this is the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4704) SparkSubmitDriverBootstrap doesn't flush output

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4704:
--
Comment: was deleted

(was: User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/3655)

> SparkSubmitDriverBootstrap doesn't flush output
> ---
>
> Key: SPARK-4704
> URL: https://issues.apache.org/jira/browse/SPARK-4704
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
> Environment: 1.2.0-rc1
>Reporter: Stephen Haberman
>
> When running spark-submit with a job that immediately blows up (say due to 
> init errors in the job code), there is no error output from spark-submit on 
> the console.
> When I ran spark-class directly, then I do see the error/stack trace on the 
> console.
> I believe the issue is in SparkSubmitDriverBootstrapper (I had 
> spark.driver.memory set in spark-defaults.conf) not waiting for the  
> RedirectThreads to flush/complete before exiting.
> E.g. here:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala#L143
> I believe around line 165 or so, stdoutThread.join() and
> stderrThread.join() calls are necessary to make sure the other threads
> have had a chance to flush process.getInputStream/getErrorStream to
> System.out/err before the process exits.
> I've been tripped up by this in similar RedirectThread/process code, hence 
> suspecting this is the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4704) SparkSubmitDriverBootstrap doesn't flush output

2014-12-23 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257861#comment-14257861
 ] 

Josh Rosen commented on SPARK-4704:
---

I've removed the link / comment to the unrelated PR.

> SparkSubmitDriverBootstrap doesn't flush output
> ---
>
> Key: SPARK-4704
> URL: https://issues.apache.org/jira/browse/SPARK-4704
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
> Environment: 1.2.0-rc1
>Reporter: Stephen Haberman
>
> When running spark-submit with a job that immediately blows up (say due to 
> init errors in the job code), there is no error output from spark-submit on 
> the console.
> When I ran spark-class directly, then I do see the error/stack trace on the 
> console.
> I believe the issue is in SparkSubmitDriverBootstrapper (I had 
> spark.driver.memory set in spark-defaults.conf) not waiting for the  
> RedirectThreads to flush/complete before exiting.
> E.g. here:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala#L143
> I believe around line 165 or so, stdoutThread.join() and
> stderrThread.join() calls are necessary to make sure the other threads
> have had a chance to flush process.getInputStream/getErrorStream to
> System.out/err before the process exits.
> I've been tripped up by this in similar RedirectThread/process code, hence 
> suspecting this is the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4606) SparkSubmitDriverBootstrapper does not propagate EOF to child JVM

2014-12-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257875#comment-14257875
 ] 

Marcelo Vanzin commented on SPARK-4606:
---

@stephen hmm, I'm working on some other changes that might make that bug 
obsolete. I'd rather concentrate on those since I'm almost finished with a 
working prototype.

> SparkSubmitDriverBootstrapper does not propagate EOF to child JVM
> -
>
> Key: SPARK-4606
> URL: https://issues.apache.org/jira/browse/SPARK-4606
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0, 1.1.2, 1.2.1
>
>
> Run this with "spark.driver.extraJavaOptions" set in your spark-defaults.conf:
> {code}
>   echo "" | spark-shell --master local -Xnojline
> {code}
> You'll end up with a child process that cannot read from stdin (you can 
> CTRL-C out of it though). That's because when the bootstrapper's stdin 
> reaches EOF, that is not propagated to the child JVM that's actually doing 
> the reading.
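
A minimal sketch of the idea behind a fix, assuming a `process` handle for the child JVM is in scope (an illustration, not the actual patch): copy the bootstrapper's stdin to the child and close the child's stdin once EOF is reached.
{code}
// Sketch only: forward the bootstrapper's stdin to the child JVM and
// propagate EOF by closing the child's stdin when System.in is exhausted.
val childStdin = process.getOutputStream
val buf = new Array[Byte](1024)
var n = System.in.read(buf)
while (n != -1) {
  childStdin.write(buf, 0, n)
  childStdin.flush()
  n = System.in.read(buf)
}
childStdin.close()   // the child now sees EOF on its stdin as well
{code}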



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4606) SparkSubmitDriverBootstrapper does not propagate EOF to child JVM

2014-12-23 Thread Stephen Haberman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257877#comment-14257877
 ] 

Stephen Haberman commented on SPARK-4606:
-

Cool, that sounds great; thanks, Marcelo.

> SparkSubmitDriverBootstrapper does not propagate EOF to child JVM
> -
>
> Key: SPARK-4606
> URL: https://issues.apache.org/jira/browse/SPARK-4606
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0, 1.1.2, 1.2.1
>
>
> Run this with "spark.driver.extraJavaOptions" set in your spark-defaults.conf:
> {code}
>   echo "" | spark-shell --master local -Xnojline
> {code}
> You'll end up with a child process that cannot read from stdin (you can 
> CTRL-C out of it though). That's because when the bootstrapper's stdin 
> reaches EOF, that is not propagated to the child JVM that's actually doing 
> the reading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4945) Add overwrite option support for SchemaRDD.saveAsParquetFile

2014-12-23 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-4945:


 Summary: Add overwrite option support for 
SchemaRDD.saveAsParquetFile
 Key: SPARK-4945
 URL: https://issues.apache.org/jira/browse/SPARK-4945
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4945) Add overwrite option support for SchemaRDD.saveAsParquetFile

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257885#comment-14257885
 ] 

Apache Spark commented on SPARK-4945:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/3780

> Add overwrite option support for SchemaRDD.saveAsParquetFile
> 
>
> Key: SPARK-4945
> URL: https://issues.apache.org/jira/browse/SPARK-4945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4881) Use SparkConf#getBoolean instead of get().toBoolean

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4881:
--
Assignee: Kousuke Saruta

> Use SparkConf#getBoolean instead of get().toBoolean
> ---
>
> Key: SPARK-4881
> URL: https://issues.apache.org/jira/browse/SPARK-4881
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Trivial
> Fix For: 1.3.0
>
>
> It's really a minor issue.
> In ApplicationMaster, there is code like the following.
> {code}
>   val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", 
> "false").toBoolean
> {code}
> I think the code can be simplified as follows.
> {code}
>   val preserveFiles = 
> sparkConf.getBoolean("spark.yarn.preserve.staging.files", false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4881) Use SparkConf#getBoolean instead of get().toBoolean

2014-12-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4881.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3733
[https://github.com/apache/spark/pull/3733]

> Use SparkConf#getBoolean instead of get().toBoolean
> ---
>
> Key: SPARK-4881
> URL: https://issues.apache.org/jira/browse/SPARK-4881
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Priority: Trivial
> Fix For: 1.3.0
>
>
> It's really a minor issue.
> In ApplicationMaster, there is code like the following.
> {code}
>   val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", 
> "false").toBoolean
> {code}
> I think the code can be simplified as follows.
> {code}
>   val preserveFiles = 
> sparkConf.getBoolean("spark.yarn.preserve.staging.files", false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4946) Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem

2014-12-23 Thread YanTang Zhai (JIRA)
YanTang Zhai created SPARK-4946:
---

 Summary: Using AkkaUtils.askWithReply in 
MapOutputTracker.askTracker to reduce the chance of the communicating problem
 Key: SPARK-4946
 URL: https://issues.apache.org/jira/browse/SPARK-4946
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: YanTang Zhai
Priority: Minor


Use AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the 
chance of communication problems.
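
The gist of the proposal, sketched with illustrative names (this is not the real AkkaUtils.askWithReply signature, and the tracker actor is passed in explicitly here as an assumption): retry the ask a few times instead of failing on the first dropped or timed-out message.
{code}
// Sketch only: retry the ask instead of failing on the first timeout.
import scala.concurrent.Await
import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout

def askWithRetry(tracker: ActorRef, message: Any, maxAttempts: Int = 3)
                (implicit timeout: Timeout): Any = {
  var lastError: Throwable = null
  for (_ <- 1 to maxAttempts) {
    try {
      return Await.result(tracker ? message, timeout.duration)   // may throw on timeout
    } catch {
      case e: Exception => lastError = e                         // remember and retry
    }
  }
  throw new Exception(
    s"Failed to communicate with the tracker after $maxAttempts attempts", lastError)
}
{code}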



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2014-12-23 Thread Jongyoul Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257935#comment-14257935
 ] 

Jongyoul Lee commented on SPARK-3619:
-

[~tnachen] I'm testing Mesos 0.21 internally on my company's cluster. If you don't 
mind, may I take on this upgrade? I hope this patch can be merged soon.

> Upgrade to Mesos 0.21 to work around MESOS-1688
> ---
>
> Key: SPARK-3619
> URL: https://issues.apache.org/jira/browse/SPARK-3619
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Matei Zaharia
>Assignee: Timothy Chen
>
> The Mesos 0.21 release has a fix for 
> https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4947) Use EC2 status checks to know when to test SSH availability

2014-12-23 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-4947:
---

 Summary: Use EC2 status checks to know when to test SSH 
availability
 Key: SPARK-4947
 URL: https://issues.apache.org/jira/browse/SPARK-4947
 Project: Spark
  Issue Type: Sub-task
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4947) Use EC2 status checks to know when to test SSH availability

2014-12-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-4947.
-
Resolution: Fixed

Resolved in [#3195|https://github.com/apache/spark/pull/3195].

> Use EC2 status checks to know when to test SSH availability
> ---
>
> Key: SPARK-4947
> URL: https://issues.apache.org/jira/browse/SPARK-4947
> Project: Spark
>  Issue Type: Sub-task
>  Components: EC2
>Affects Versions: 1.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4325) Improve spark-ec2 cluster launch times

2014-12-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257939#comment-14257939
 ] 

Nicholas Chammas commented on SPARK-4325:
-

OK, I created a [sub-task|https://issues.apache.org/jira/browse/SPARK-4947] to 
match the work done in [#3195|https://github.com/apache/spark/pull/3195]. Could 
you assign it to me [~joshrosen]?

> Improve spark-ec2 cluster launch times
> --
>
> Key: SPARK-4325
> URL: https://issues.apache.org/jira/browse/SPARK-4325
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.3.0
>
>
> There are several optimizations we know we can make to [{{setup.sh}} | 
> https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches 
> faster.
> There are also some improvements to the AMIs that will help a lot.
> Potential improvements:
> * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This 
> will reduce or eliminate SSH wait time and Ganglia init time.
> * Replace instances of {{download; rsync to rest of cluster}} with parallel 
> downloads on all nodes of the cluster.
> * Replace instances of 
>  {code}
> for node in $NODES; do
>   command
>   sleep 0.3
> done
> wait{code}
>  with simpler calls to {{pssh}}.
> * Remove the [linear backoff | 
> https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665]
>  when we wait for SSH availability now that we are already waiting for EC2 
> status checks to clear before testing SSH.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4921) Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks

2014-12-23 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257940#comment-14257940
 ] 

Sandy Ryza commented on SPARK-4921:
---

Is there a barebones Spark program that I could use to reproduce this?

> Performance issue caused by TaskSetManager returning  PROCESS_LOCAL for 
> NO_PREF tasks
> -
>
> Key: SPARK-4921
> URL: https://issues.apache.org/jira/browse/SPARK-4921
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Xuefu Zhang
> Attachments: NO_PREF.patch
>
>
> During research for HIVE-9153, we found that TaskSetManager returns 
> PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. 
> Changing the return value to NO_PREF, as demonstrated in the attached patch, 
> seemingly improves performance.
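
A minimal sketch of the kind of job that exercises NO_PREF tasks, offered as an illustration rather than a confirmed reproduction (parallelized collections carry no preferred locations, so every task in such a job should be NO_PREF):
{code}
// Sketch only: run in spark-shell and observe the reported locality level
// in the UI or logs; with the behaviour described above, these NO_PREF tasks
// are reported as PROCESS_LOCAL.
val data = sc.parallelize(1 to 1000000, 100)   // no locality preferences
data.map(_ * 2).count()
{code}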



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4948) Use pssh instead of bash-isms and remove unnecessary operations

2014-12-23 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-4948:
---

 Summary: Use pssh instead of bash-isms and remove unnecessary 
operations
 Key: SPARK-4948
 URL: https://issues.apache.org/jira/browse/SPARK-4948
 Project: Spark
  Issue Type: Sub-task
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Priority: Minor


Remove unnecessarily high sleep times in {{setup.sh}}, as well as unnecessary 
SSH calls to pre-approve keys.

Replace bash-isms like {{while ... command ... & wait}} with {{pssh}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4936) Please support Named Vector so as to maintain the record ID in clustering etc.

2014-12-23 Thread mahesh bhole (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257944#comment-14257944
 ] 

mahesh bhole commented on SPARK-4936:
-

Thanks Sean, I missed that part.

There is JavaPairRDD I can use.

Thanks for your help.

-- Mahesh

> Please support Named Vector so as to maintain the record ID in clustering etc.
> --
>
> Key: SPARK-4936
> URL: https://issues.apache.org/jira/browse/SPARK-4936
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.1
>Reporter: mahesh bhole
>Priority: Minor
>
> Hi
> Please support Named Vector so as to maintain the record ID in clustering etc.
> Thanks,
> Mahesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4949) shutdownCallback in SparkDeploySchedulerBackend should be enclosed by synchronized block.

2014-12-23 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-4949:
-

 Summary: shutdownCallback in SparkDeploySchedulerBackend should be 
enclosed by synchronized block.
 Key: SPARK-4949
 URL: https://issues.apache.org/jira/browse/SPARK-4949
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Kousuke Saruta


A variable `shutdownCallback` in SparkDeploySchedulerBackend can be accessed 
from multiple threads, so accesses to it should be enclosed in a synchronized block.
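
A minimal sketch of the proposed change, with illustrative class and method names (the actual field type and call sites in SparkDeploySchedulerBackend may differ):
{code}
// Sketch only: guard both writes and reads of the callback with the same lock
// so that threads always observe a consistent value.
class BackendSketch {
  private var shutdownCallback: BackendSketch => Unit = _

  def registerShutdownCallback(callback: BackendSketch => Unit): Unit =
    this.synchronized { shutdownCallback = callback }    // write under the lock

  def runShutdownCallback(): Unit = this.synchronized {  // read under the same lock
    if (shutdownCallback != null) shutdownCallback(this)
  }
}
{code}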



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4949) shutdownCallback in SparkDeploySchedulerBackend should be enclosed by synchronized block.

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257949#comment-14257949
 ] 

Apache Spark commented on SPARK-4949:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/3781

> shutdownCallback in SparkDeploySchedulerBackend should be enclosed by 
> synchronized block.
> -
>
> Key: SPARK-4949
> URL: https://issues.apache.org/jira/browse/SPARK-4949
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>
> A variable `shutdownCallback` in SparkDeploySchedulerBackend can be accessed 
> from multiple threads, so accesses to it should be enclosed in a synchronized block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4950) Delete obsolete mapReduceTripelets used in Pregel

2014-12-23 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-4950:
---

 Summary: Delete obsolete mapReduceTripelets used in Pregel
 Key: SPARK-4950
 URL: https://issues.apache.org/jira/browse/SPARK-4950
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
 Environment: Any reason not to replace the api along with SPARK-3936?
Reporter: Takeshi Yamamuro
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4950) Delete obsolete mapReduceTripelets used in Pregel

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257960#comment-14257960
 ] 

Apache Spark commented on SPARK-4950:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/3782

> Delete obsolete mapReduceTripelets used in Pregel
> -
>
> Key: SPARK-4950
> URL: https://issues.apache.org/jira/browse/SPARK-4950
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
> Environment: Any reason not to replace the api along with SPARK-3936?
>Reporter: Takeshi Yamamuro
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4950) Delete obsolete mapReduceTripelets used in Pregel

2014-12-23 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-4950:

Environment: (was: Any reason not to replace the api along with 
SPARK-3936?)

> Delete obsolete mapReduceTripelets used in Pregel
> -
>
> Key: SPARK-4950
> URL: https://issues.apache.org/jira/browse/SPARK-4950
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Takeshi Yamamuro
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4950) Delete obsolete mapReduceTripelets used in Pregel

2014-12-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257961#comment-14257961
 ] 

Takeshi Yamamuro commented on SPARK-4950:
-

Any reason not to replace the api along with SPARK-3936?

> Delete obsolete mapReduceTripelets used in Pregel
> -
>
> Key: SPARK-4950
> URL: https://issues.apache.org/jira/browse/SPARK-4950
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Takeshi Yamamuro
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4951) A busy executor may be killed when dynamicAllocation is enabled

2014-12-23 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-4951:
---

 Summary: A busy executor may be killed when dynamicAllocation is 
enabled
 Key: SPARK-4951
 URL: https://issues.apache.org/jira/browse/SPARK-4951
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu


If a task runs for longer than `spark.dynamicAllocation.executorIdleTimeout`, the 
executor running that task will be killed.

The following steps (yarn-client mode) can reproduce this bug:
1. Start `spark-shell` using
{code}
./bin/spark-shell --conf "spark.shuffle.service.enabled=true" \
--conf "spark.dynamicAllocation.minExecutors=1" \
--conf "spark.dynamicAllocation.maxExecutors=4" \
--conf "spark.dynamicAllocation.enabled=true" \
--conf "spark.dynamicAllocation.executorIdleTimeout=30" \
--master yarn-client \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1
{code}

2. Wait more than 30 seconds until there is only one executor.
3. Run the following code (a task needs at least 50 seconds to finish)
{code}
val r = sc.parallelize(1 to 1000, 20).map{t => Thread.sleep(1000); t}.groupBy(_ 
% 2).collect()
{code}
4. Executors are killed and re-allocated over and over, which makes the job 
fail.
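
The idea behind a fix, sketched with illustrative bookkeeping (the real ExecutorAllocationManager differs in detail): an executor should only be considered idle, and its removal timer started, once it has no running tasks.
{code}
// Sketch only: tie the idle timeout to "no running tasks" rather than to
// wall-clock time since the executor was added.
class AllocationSketch(idleTimeoutMillis: Long) {
  import scala.collection.mutable
  private val numRunningTasks = mutable.Map[String, Int]().withDefaultValue(0)
  private val removeTimes = mutable.Map[String, Long]()   // executorId -> expiry time

  def onTaskStart(executorId: String): Unit = synchronized {
    numRunningTasks(executorId) += 1
    removeTimes.remove(executorId)            // a busy executor is never idle
  }

  def onTaskEnd(executorId: String): Unit = synchronized {
    numRunningTasks(executorId) -= 1
    if (numRunningTasks(executorId) <= 0) {   // only now start the idle countdown
      removeTimes(executorId) = System.currentTimeMillis() + idleTimeoutMillis
    }
  }
}
{code}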



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4951) A busy executor may be killed when dynamicAllocation is enabled

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257986#comment-14257986
 ] 

Apache Spark commented on SPARK-4951:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/3783

> A busy executor may be killed when dynamicAllocation is enabled
> ---
>
> Key: SPARK-4951
> URL: https://issues.apache.org/jira/browse/SPARK-4951
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> If a task runs for longer than `spark.dynamicAllocation.executorIdleTimeout`, the 
> executor running that task will be killed.
> The following steps (yarn-client mode) can reproduce this bug:
> 1. Start `spark-shell` using
> {code}
> ./bin/spark-shell --conf "spark.shuffle.service.enabled=true" \
> --conf "spark.dynamicAllocation.minExecutors=1" \
> --conf "spark.dynamicAllocation.maxExecutors=4" \
> --conf "spark.dynamicAllocation.enabled=true" \
> --conf "spark.dynamicAllocation.executorIdleTimeout=30" \
> --master yarn-client \
> --driver-memory 512m \
> --executor-memory 512m \
> --executor-cores 1
> {code}
> 2. Wait more than 30 seconds until there is only one executor.
> 3. Run the following code (a task needs at least 50 seconds to finish)
> {code}
> val r = sc.parallelize(1 to 1000, 20).map{t => Thread.sleep(1000); 
> t}.groupBy(_ % 2).collect()
> {code}
> 4. Executors are killed and re-allocated over and over, which makes the job 
> fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4936) Please support Named Vector so as to maintain the record ID in clustering etc.

2014-12-23 Thread mahesh bhole (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh bhole closed SPARK-4936.
---
Resolution: Fixed

RDD of (,Vector) is already available.
e.g. JavaPairRDD for Java implementation.
Closing the issue.

> Please support Named Vector so as to maintain the record ID in clustering etc.
> --
>
> Key: SPARK-4936
> URL: https://issues.apache.org/jira/browse/SPARK-4936
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.1
>Reporter: mahesh bhole
>Priority: Minor
>
> Hi
> Please support Named Vector so as to maintain the record ID in clustering etc.
> Thanks,
> Mahesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4936) Please support Named Vector so as to maintain the record ID in clustering etc.

2014-12-23 Thread mahesh bhole (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257995#comment-14257995
 ] 

mahesh bhole edited comment on SPARK-4936 at 12/24/14 5:43 AM:
---

RDD of (,Vector) is already available.
e.g. JavaPairRDD for Java implementation.
Closing the issue.


was (Author: search4mahesh):
RDD of (,Vector) is laready available.
e.g. JavaPairRDD for Java implementation.
Closing the issue.

> Please support Named Vector so as to maintain the record ID in clustering etc.
> --
>
> Key: SPARK-4936
> URL: https://issues.apache.org/jira/browse/SPARK-4936
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.1
>Reporter: mahesh bhole
>Priority: Minor
>
> Hi
> Please support Named Vector so as to maintain the record ID in clustering etc.
> Thanks,
> Mahesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4937) Adding optimization to simplify the filter condition

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258001#comment-14258001
 ] 

Apache Spark commented on SPARK-4937:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/3784

> Adding optimization to simplify the filter condition
> 
>
> Key: SPARK-4937
> URL: https://issues.apache.org/jira/browse/SPARK-4937
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> Adding an optimization to simplify filter conditions:
> 1. Conditions that can be folded to a constant boolean result, for example:
> a < 3 && a > 5   =>  false
> a < 1 || a > 0   =>  true
> 2. Simplify And/Or conditions, such as in the following SQL (one of the 
> hive-testbench queries):
> select
> sum(l_extendedprice* (1 - l_discount)) as revenue
> from
> lineitem,
> part
> where
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#32'
> and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
> and l_quantity >= 7 and l_quantity <= 7 + 10
> and p_size between 1 and 5
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> )
> or
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#35'
> and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
> and l_quantity >= 15 and l_quantity <= 15 + 10
> and p_size between 1 and 10
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> )
> or
> (
> p_partkey = l_partkey
> and p_brand = 'Brand#24'
> and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
> and l_quantity >= 26 and l_quantity <= 26 + 10
> and p_size between 1 and 15
> and l_shipmode in ('AIR', 'AIR REG')
> and l_shipinstruct = 'DELIVER IN PERSON'
> );
> Before optimization the plan is a CartesianProduct; in my local test this sql hangs 
> and never returns a result. After optimization the CartesianProduct is replaced by a 
> ShuffledHashJoin, which needs only 20+ seconds to run this sql.
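
To make the first kind of rewrite concrete, here is a toy sketch on a hypothetical expression tree over integer attributes (not Catalyst's actual Expression classes): contradictory range predicates fold to false, tautological ones to true.
{code}
// Sketch only: illustrative AST, not Spark SQL internals.
sealed trait Expr
case class Lt(attr: String, hi: Int) extends Expr      // attr < hi
case class Gt(attr: String, lo: Int) extends Expr      // attr > lo
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr
case class Literal(value: Boolean) extends Expr

def simplify(e: Expr): Expr = e match {
  // a < hi && a > lo has no integer solution when hi <= lo + 1, e.g. a < 3 && a > 5
  case And(Lt(a1, hi), Gt(a2, lo)) if a1 == a2 && hi <= lo + 1 => Literal(false)
  // a < hi || a > lo covers every integer when lo < hi, e.g. a < 1 || a > 0
  case Or(Lt(a1, hi), Gt(a2, lo)) if a1 == a2 && lo < hi => Literal(true)
  case And(l, r) => And(simplify(l), simplify(r))
  case Or(l, r)  => Or(simplify(l), simplify(r))
  case other     => other
}
{code}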



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4946) Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258028#comment-14258028
 ] 

Apache Spark commented on SPARK-4946:
-

User 'YanTangZhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3785

> Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the 
> chance of the communicating problem
> -
>
> Key: SPARK-4946
> URL: https://issues.apache.org/jira/browse/SPARK-4946
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: YanTang Zhai
>Priority: Minor
>
> Use AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the 
> chance of communication problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4952) In some cases, spark on yarn failed to start

2014-12-23 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-4952:
--

 Summary: In some cases, spark on yarn failed to start
 Key: SPARK-4952
 URL: https://issues.apache.org/jira/browse/SPARK-4952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.1.0
Reporter: Guoqiang Li


the log:
{noformat}
Exception in thread "main" 14/12/24 12:00:25 INFO 
cluster.YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> 
tuan200, PROXY_URI_BASES -> 
http://host:9082/proxy/application_1414231702825_488625), 
/proxy/application_1414231702825_488625
java.util.ConcurrentModificationException
at java.util.Hashtable$Enumerator.next(Hashtable.java:1167)
at 
scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458)
at 
scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454)
at scala.collection.Iterator$class.toStream(Iterator.scala:1143)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1157)
at 
scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at 
scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at 
scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149)
at 
scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.Stream.length(Stream.scala:284)
at scala.collection.SeqLike$class.sorted(SeqLike.scala:608)
at scala.collection.AbstractSeq.sorted(Seq.scala:40)
at org.apache.spark.SparkEnv$.environmentDetails(SparkEnv.scala:324)
at 
org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:1319)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:344)
at com.zhe800.toona.als.computation.DealCF$.main(DealCF.scala:497)
at com.zhe800.toona.als.computation.DealCF.main(DealCF.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
14/12/24 12:00:25 INFO ui.JettyUtils: Adding filter: 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
{noformat}
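
One hedged reading of this trace: SparkEnv.environmentDetails sorts a live view of the JVM system properties while another thread is mutating them, which Hashtable's enumerator reports as a ConcurrentModificationException. A defensive snapshot avoids iterating the live Properties object (sketch only, not necessarily the fix that was eventually merged):
{code}
// Sketch only: copy the system properties into an immutable Scala map before
// sorting, so the sort never walks the live (concurrently modified) Hashtable.
import scala.collection.JavaConverters._

val snapshot: Map[String, String] =
  System.getProperties.stringPropertyNames.asScala
    .map(key => key -> System.getProperty(key))
    .toMap

val sortedProps = snapshot.toSeq.sortBy(_._1)   // safe: operates on the copy
{code}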



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4723) To abort the stages which have attempted some times

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258050#comment-14258050
 ] 

Apache Spark commented on SPARK-4723:
-

User 'YanTangZhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3786

> To abort the stages which have attempted some times
> ---
>
> Key: SPARK-4723
> URL: https://issues.apache.org/jira/browse/SPARK-4723
> Project: Spark
>  Issue Type: Improvement
>Reporter: YanTang Zhai
>Priority: Minor
>
> For various reasons, some stages may be attempted many times. A threshold could be 
> added so that stages which have been attempted more than the threshold are 
> aborted.
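
A minimal sketch of what such a guard could look like in the scheduler; the configuration key, fields and helper below are purely illustrative, not existing Spark settings or APIs:
{code}
// Sketch only: abort a stage once it has been attempted more than a
// configurable number of times ("spark.stage.maxAttempts" is a made-up key;
// conf, stage and abortStage are assumed to be in scope).
val maxStageAttempts = conf.getInt("spark.stage.maxAttempts", 4)

if (stage.attemptId >= maxStageAttempts) {
  abortStage(stage,
    s"Stage ${stage.id} aborted after $maxStageAttempts failed attempts")
}
{code}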



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4953) Fix the description of building Spark with YARN

2014-12-23 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-4953:
-

 Summary: Fix the description of building Spark with YARN
 Key: SPARK-4953
 URL: https://issues.apache.org/jira/browse/SPARK-4953
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Kousuke Saruta


In the section "Specifying the Hadoop Version" in building-spark.md, there is a 
description of building with YARN against Hadoop 0.23.
Spark 1.3.0 will not support Hadoop 0.23, so we should fix the description.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4953) Fix the description of building Spark with YARN

2014-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258057#comment-14258057
 ] 

Apache Spark commented on SPARK-4953:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/3787

> Fix the description of building Spark with YARN
> ---
>
> Key: SPARK-4953
> URL: https://issues.apache.org/jira/browse/SPARK-4953
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>
> In the section "Specifying the Hadoop Version" in building-spark.md, there is a 
> description of building with YARN against Hadoop 0.23.
> Spark 1.3.0 will not support Hadoop 0.23, so we should fix the description.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4921) Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks

2014-12-23 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258074#comment-14258074
 ] 

Rui Li commented on SPARK-4921:
---

Hi [~sandyr], I thought more about this, and the check I mentioned earlier may be 
enough to avoid resetting {{currentLocalityIndex}}, which means it won't 
degrade performance. However, it will be a little confusing for users to find 
NO_PREF tasks printed as PROCESS_LOCAL in the logs.

> Performance issue caused by TaskSetManager returning  PROCESS_LOCAL for 
> NO_PREF tasks
> -
>
> Key: SPARK-4921
> URL: https://issues.apache.org/jira/browse/SPARK-4921
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Xuefu Zhang
> Attachments: NO_PREF.patch
>
>
> During research for HIVE-9153, we found that TaskSetManager returns 
> PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. 
> Changing the return value to NO_PREF, as demonstrated in the attached patch, 
> seemingly improves performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org