[jira] [Created] (SPARK-4337) Add ability to cancel pending requests to YARN

2014-11-10 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4337:
-

 Summary: Add ability to cancel pending requests to YARN
 Key: SPARK-4337
 URL: https://issues.apache.org/jira/browse/SPARK-4337
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza


This will be useful for things like SPARK-4136






[jira] [Commented] (SPARK-4214) With dynamic allocation, avoid outstanding requests for more executors than pending tasks need

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206098#comment-14206098
 ] 

Apache Spark commented on SPARK-4214:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/3204

> With dynamic allocation, avoid outstanding requests for more executors than 
> pending tasks need
> --
>
> Key: SPARK-4214
> URL: https://issues.apache.org/jira/browse/SPARK-4214
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> Dynamic allocation tries to allocate more executors while we have pending 
> tasks remaining.  Our current policy can end up with more outstanding 
> executor requests than needed to fulfill all the pending tasks.  Capping the 
> executor requests to the number of cores needed to fulfill all pending tasks 
> would make dynamic allocation behavior less sensitive to settings for 
> maxExecutors.
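
To make the proposed policy concrete, here is a minimal sketch of the capping rule, assuming one slot per task and treating the helper name and parameters as illustrative rather than the actual ExecutorAllocationManager code:

{code}
// Hedged sketch of the capping policy, not the actual Spark implementation.
// Assumes each executor can run tasksPerExecutor tasks concurrently.
def cappedExecutorTarget(pendingTasks: Int,
                         tasksPerExecutor: Int,
                         maxExecutors: Int): Int = {
  // Executors needed to give every pending task a slot, rounded up.
  val needed = math.ceil(pendingTasks.toDouble / tasksPerExecutor).toInt
  // Never request more than needed, and never exceed the configured maximum.
  math.min(needed, maxExecutors)
}
{code}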






[jira] [Commented] (SPARK-3647) Shaded Guava patch causes access issues with package private classes

2014-11-10 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206081#comment-14206081
 ] 

Jeff Hammerbacher commented on SPARK-3647:
--

Should this issue have a Fix version set?

> Shaded Guava patch causes access issues with package private classes
> 
>
> Key: SPARK-3647
> URL: https://issues.apache.org/jira/browse/SPARK-3647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Critical
>
> The patch that introduced shading to Guava (SPARK-2848) tried to maintain 
> backwards compatibility in the Java API by not relocating the "Optional" 
> class. That causes problems when that class references package private 
> members in the Absent and Present classes, which are now in a different 
> package:
> {noformat}
> Exception in thread "main" java.lang.IllegalAccessError: tried to access 
> class org.spark-project.guava.common.base.Present from class 
> com.google.common.base.Optional
>   at com.google.common.base.Optional.of(Optional.java:86)
>   at 
> org.apache.spark.api.java.JavaUtils$.optionToOptional(JavaUtils.scala:25)
>   at 
> org.apache.spark.api.java.JavaSparkContext.getSparkHome(JavaSparkContext.scala:542)
> {noformat}
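
A minimal reproduction sketch, assuming an application compiled against the shaded 1.2.0 artifacts: calling any Java API method that returns Guava's Optional, such as JavaSparkContext.getSparkHome() from the stack trace above, trips the IllegalAccessError.

{code}
// Hedged reproduction sketch; JavaSparkContext.getSparkHome() returns
// com.google.common.base.Optional, which internally instantiates the
// relocated Present/Absent classes and triggers the IllegalAccessError.
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

object GuavaOptionalRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("guava-optional-repro").setMaster("local")
    val jsc = new JavaSparkContext(conf)
    println(jsc.getSparkHome())  // Optional.of(...) / Optional.absent() is built here
    jsc.stop()
  }
}
{code}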






[jira] [Commented] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark

2014-11-10 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206036#comment-14206036
 ] 

Xiangrui Meng commented on SPARK-3990:
--

In Spark 1.1.1, I put a note on ALS asking users to use the default serializer 
if they want to run ALS.
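
For reference, a hedged sketch of that workaround (shown here as Scala for spark-shell; PySpark users can pass the same key via spark-submit --conf spark.serializer=...):

{code}
// Sketch only: run ALS with the default Java serializer instead of Kryo.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("als-implicit")
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
val sc = new SparkContext(conf)
{code}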

> kryo.KryoException caused by ALS.trainImplicit in pyspark
> -
>
> Key: SPARK-3990
> URL: https://issues.apache.org/jira/browse/SPARK-3990
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.1.0
> Environment: 5 slaves cluster(m3.large) in AWS launched by spark-ec2
> Linux
> Python 2.6.8
>Reporter: Gen TANG
>Assignee: Xiangrui Meng
>Priority: Critical
>  Labels: test
> Fix For: 1.1.1, 1.2.0
>
>
> When we tried ALS.trainImplicit() in the pyspark environment, it only works for 
> iterations = 1. What is stranger, if we try the same code in Scala, it works 
> very well. (I ran several tests; so far, ALS.trainImplicit works in Scala.)
> For example, the following code:
> {code:title=test.py|borderStyle=solid}
>   from pyspark.mllib.recommendation import *
>   r1 = (1, 1, 1.0) 
>   r2 = (1, 2, 2.0) 
>   r3 = (2, 1, 2.0) 
>   ratings = sc.parallelize([r1, r2, r3]) 
>   model = ALS.trainImplicit(ratings, 1) 
> '''by default iterations = 5 or model = ALS.trainImplicit(ratings, 1, 2)'''
> {code}
> It causes a failed stage at count at ALS.scala:314. The error info is:
> {code:title=error information provided by ganglia}
> Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most 
> recent failure: Lost task 6.3 in stage 90.0 (TID 484, 
> ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: 
> java.lang.ArrayStoreException: scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.Thr

[jira] [Commented] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206026#comment-14206026
 ] 

Apache Spark commented on SPARK-4314:
-

User 'maji2014' has created a pull request for this issue:
https://github.com/apache/spark/pull/3203

> Exception when textFileStream attempts to read deleted _COPYING_ file
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, which uses "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (see the reason below)
>  3. The following exception is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>  at scala.collection

[jira] [Commented] (SPARK-4336) auto detect type from json string

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206025#comment-14206025
 ] 

Apache Spark commented on SPARK-4336:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/3202

> auto detect type from json string
> -
>
> Key: SPARK-4336
> URL: https://issues.apache.org/jira/browse/SPARK-4336
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Adrian Wang
>Priority: Minor
>







[jira] [Commented] (SPARK-4335) Mima check misreporting for GraphX pull request

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206019#comment-14206019
 ] 

Apache Spark commented on SPARK-4335:
-

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/3201

> Mima check misreporting for GraphX pull request
> ---
>
> Key: SPARK-4335
> URL: https://issues.apache.org/jira/browse/SPARK-4335
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Reynold Xin
>Assignee: Prashant Sharma
>Priority: Blocker
>
> Mima is reporting binary check failure for RDD/partitions in the following 
> pull request even though the pull request doesn't change those at all.
> https://github.com/apache/spark/pull/2530
> {code}
> [error]  * abstract method getPartitions()Array[org.apache.spark.Partition] 
> in class org.apache.spark.rdd.RDD does not have a correspondent in new version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions")
> [error]  * abstract method 
> compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator
>  in class org.apache.spark.rdd.RDD does not have a correspondent in new 
> version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute")
> [error]  * abstract method getPartitions()Array[org.apache.spark.Partition] 
> in class org.apache.spark.rdd.RDD does not have a correspondent in new version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions")
> [error]  * abstract method 
> compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator
>  in class org.apache.spark.rdd.RDD does not have a correspondent in new 
> version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute")
> [info] Done updating.
> {code}
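
For context, a hedged sketch of how such filters are typically registered in Spark's project/MimaExcludes.scala; the point of this issue is that these excludes should not be necessary at all, since the pull request does not touch RDD:

{code}
// Sketch of registering Mima excludes (names taken from the error output above).
import com.typesafe.tools.mima.core._

val excludes = Seq(
  ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions"),
  ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute")
)
{code}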






[jira] [Closed] (SPARK-4324) Support numpy/scipy in all Python API of MLlib

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-4324.

   Resolution: Fixed
Fix Version/s: 1.2.0

> Support numpy/scipy in all Python API of MLlib
> --
>
> Key: SPARK-4324
> URL: https://issues.apache.org/jira/browse/SPARK-4324
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.2.0
>
>
> Before 1.2, we accept numpy/scipy array as Vector in MLlib Python API. We 
> should continue to support this semantics in 1.2.






[jira] [Created] (SPARK-4336) auto detect type from json string

2014-11-10 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-4336:
--

 Summary: auto detect type from json string
 Key: SPARK-4336
 URL: https://issues.apache.org/jira/browse/SPARK-4336
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang
Priority: Minor









[jira] [Commented] (SPARK-2652) Tuning default configurations for PySpark

2014-11-10 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206013#comment-14206013
 ] 

Xiangrui Meng commented on SPARK-2652:
--

We kept Kryo as the default serialization in 1.1.1.

> Tuning default configurations for PySpark
> --
>
> Key: SPARK-2652
> URL: https://issues.apache.org/jira/browse/SPARK-2652
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>  Labels: Configuration, Python
> Fix For: 1.2.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Some default configuration values do not make sense for PySpark; change 
> them to reasonable ones, such as spark.serializer and 
> spark.kryo.referenceTracking






[jira] [Updated] (SPARK-2652) Tuning default configurations for PySpark

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2652:
-
Target Version/s: 1.2.0  (was: 1.1.0)

> Tuning default configurations for PySpark
> --
>
> Key: SPARK-2652
> URL: https://issues.apache.org/jira/browse/SPARK-2652
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>  Labels: Configuration, Python
> Fix For: 1.2.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Some default configuration values do not make sense for PySpark; change 
> them to reasonable ones, such as spark.serializer and 
> spark.kryo.referenceTracking






[jira] [Updated] (SPARK-2652) Tuning default configurations for PySpark

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2652:
-
Fix Version/s: (was: 1.1.1)

> Tuning default configurations for PySpark
> --
>
> Key: SPARK-2652
> URL: https://issues.apache.org/jira/browse/SPARK-2652
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>  Labels: Configuration, Python
> Fix For: 1.2.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Some default configuration values do not make sense for PySpark; change 
> them to reasonable ones, such as spark.serializer and 
> spark.kryo.referenceTracking






[jira] [Updated] (SPARK-4330) Link to proper URL for YARN overview

2014-11-10 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4330:
-
Assignee: Kousuke Saruta

> Link to proper URL for YARN overview
> 
>
> Key: SPARK-4330
> URL: https://issues.apache.org/jira/browse/SPARK-4330
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.1.1, 1.2.0
>
>
> In running-on-yarn.md, there is a link to the YARN overview,
> but the URL points to the YARN alpha documentation.
> It should point to the stable documentation.






[jira] [Resolved] (SPARK-4330) Link to proper URL for YARN overview

2014-11-10 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-4330.
--
  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Target Version/s:   (was: 1.3.0)

> Link to proper URL for YARN overview
> 
>
> Key: SPARK-4330
> URL: https://issues.apache.org/jira/browse/SPARK-4330
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.1.1, 1.2.0
>
>
> In running-on-yarn.md, there is a link to the YARN overview,
> but the URL points to the YARN alpha documentation.
> It should point to the stable documentation.






[jira] [Updated] (SPARK-4253) Ignore spark.driver.host in yarn-cluster and standalone-cluster mode

2014-11-10 Thread WangTaoTheTonic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WangTaoTheTonic updated SPARK-4253:
---
Summary: Ignore spark.driver.host in yarn-cluster and standalone-cluster 
mode  (was: Ignore spark.driver.host in yarn-cluster mode)

> Ignore spark.driver.host in yarn-cluster and standalone-cluster mode
> 
>
> Key: SPARK-4253
> URL: https://issues.apache.org/jira/browse/SPARK-4253
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: WangTaoTheTonic
>Priority: Minor
> Attachments: Cannot assign requested address.txt
>
>
> We don't actually know where the driver will run before it is launched in 
> yarn-cluster mode. If we set the spark.driver.host property, Spark will create 
> an Actor on the hostname or IP as set, which leads to an error.
> So we should ignore this config item in yarn-cluster mode.
> As [~joshrosen] pointed out, we should also ignore it in standalone cluster mode.
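
A hedged sketch of the proposed behaviour, using an illustrative helper rather than the actual SparkSubmit/YARN client code:

{code}
// Sketch: drop a user-set spark.driver.host when submitting in cluster mode,
// because the driver's host is not known until the cluster launches it.
import org.apache.spark.SparkConf

def sanitizeForClusterMode(conf: SparkConf, clusterMode: Boolean): SparkConf = {
  if (clusterMode) conf.remove("spark.driver.host") else conf
}
{code}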






[jira] [Updated] (SPARK-4253) Ignore spark.driver.host in yarn-cluster mode

2014-11-10 Thread WangTaoTheTonic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WangTaoTheTonic updated SPARK-4253:
---
Description: 
We don't actually know where the driver will run before it is launched in 
yarn-cluster mode. If we set the spark.driver.host property, Spark will create 
an Actor on the hostname or IP as set, which leads to an error.
So we should ignore this config item in yarn-cluster mode.

As [~joshrosen] pointed out, we should also ignore it in standalone cluster mode.

  was:
We don't actually know where the driver will run before it is launched in 
yarn-cluster mode. If we set the spark.driver.host property, Spark will create 
an Actor on the hostname or IP as set, which leads to an error.
So we should ignore this config item in yarn-cluster mode.


> Ignore spark.driver.host in yarn-cluster mode
> -
>
> Key: SPARK-4253
> URL: https://issues.apache.org/jira/browse/SPARK-4253
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: WangTaoTheTonic
>Priority: Minor
> Attachments: Cannot assign requested address.txt
>
>
> We don't actually know where the driver will run before it is launched in 
> yarn-cluster mode. If we set the spark.driver.host property, Spark will create 
> an Actor on the hostname or IP as set, which leads to an error.
> So we should ignore this config item in yarn-cluster mode.
> As [~joshrosen] pointed out, we should also ignore it in standalone cluster mode.






[jira] [Commented] (SPARK-3974) Block matrix abstractions and partitioners

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205975#comment-14205975
 ] 

Apache Spark commented on SPARK-3974:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/3200

> Block matrix abstractions and partitioners
> --
>
> Key: SPARK-3974
> URL: https://issues.apache.org/jira/browse/SPARK-3974
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
>Assignee: Burak Yavuz
>
> We need abstractions for block matrices with fixed block sizes, with each 
> block being dense. Partitioners along both rows and columns are required.
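
As a rough illustration of the kind of partitioner being asked for (a sketch under assumed names, not the eventual MLlib API), something keyed by (blockRow, blockCol) could look like:

{code}
// Hedged sketch: partition dense blocks addressed by (blockRow, blockCol).
import org.apache.spark.Partitioner

class GridPartitioner(rowBlocks: Int, colBlocks: Int, parts: Int) extends Partitioner {
  require(rowBlocks > 0 && colBlocks > 0 && parts > 0)

  override def numPartitions: Int = parts

  override def getPartition(key: Any): Int = key match {
    case (i: Int, j: Int) =>
      // Flatten the 2-D block coordinate, then spread blocks across partitions.
      (((i.toLong * colBlocks) + j) % parts).toInt
    case other =>
      throw new IllegalArgumentException(s"Unexpected key: $other")
  }
}
{code}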






[jira] [Updated] (SPARK-4335) Mima check misreporting for GraphX pull request

2014-11-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4335:
---
Assignee: Prashant Sharma

> Mima check misreporting for GraphX pull request
> ---
>
> Key: SPARK-4335
> URL: https://issues.apache.org/jira/browse/SPARK-4335
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Reynold Xin
>Assignee: Prashant Sharma
>Priority: Blocker
>
> Mima is reporting binary check failure for RDD/partitions in the following 
> pull request even though the pull request doesn't change those at all.
> https://github.com/apache/spark/pull/2530
> {code}
> [error]  * abstract method getPartitions()Array[org.apache.spark.Partition] 
> in class org.apache.spark.rdd.RDD does not have a correspondent in new version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions")
> [error]  * abstract method 
> compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator
>  in class org.apache.spark.rdd.RDD does not have a correspondent in new 
> version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute")
> [error]  * abstract method getPartitions()Array[org.apache.spark.Partition] 
> in class org.apache.spark.rdd.RDD does not have a correspondent in new version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions")
> [error]  * abstract method 
> compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator
>  in class org.apache.spark.rdd.RDD does not have a correspondent in new 
> version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute")
> [info] Done updating.
> {code}






[jira] [Updated] (SPARK-3936) Incorrect result in GraphX BytecodeUtils with closures + class/object methods

2014-11-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3936:
---
Priority: Blocker  (was: Major)

> Incorrect result in GraphX BytecodeUtils with closures + class/object methods
> -
>
> Key: SPARK-3936
> URL: https://issues.apache.org/jira/browse/SPARK-3936
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Pedro Rodriguez
>Assignee: Ankur Dave
>Priority: Blocker
>
> There seems to be a bug with the GraphX byte code inspection, specifically in 
> BytecodeUtils. 
> These are the unit tests I wrote to expose the problem:
> https://github.com/EntilZha/spark/blob/a3c38a8329545c034fae2458df134fa3829d08fb/graphx/src/test/scala/org/apache/spark/graphx/util/BytecodeUtilsSuite.scala#L93-L121
> The first two tests pass, the second two tests fail. This exposes a problem 
> with inspection of methods in closures, in this case within maps. 
> Specifically, it seems like there is a problem with inspection of non-inline 
> methods in a closure.






[jira] [Resolved] (SPARK-3649) ClassCastException in GraphX custom serializers when sort-based shuffle spills

2014-11-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3649.

   Resolution: Fixed
Fix Version/s: 1.2.0

> ClassCastException in GraphX custom serializers when sort-based shuffle spills
> --
>
> Key: SPARK-3649
> URL: https://issues.apache.org/jira/browse/SPARK-3649
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.2.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Fix For: 1.2.0
>
>
> As 
> [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501]
>  on the mailing list, GraphX throws
> {code}
> java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2
> at 
> org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39)
>  
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195)
>  
> at 
> org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329)
> {code}
> when sort-based shuffle attempts to spill to disk. This is because GraphX 
> defines custom serializers for shuffling pair RDDs that assume Spark will 
> always serialize the entire pair object rather than breaking it up into its 
> components. However, the spill code path in sort-based shuffle [violates this 
> assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329].
> GraphX uses the custom serializers to compress vertex ID keys using 
> variable-length integer encoding. However, since the serializer can no longer 
> rely on the key and value being serialized and deserialized together, 
> performing such encoding would require writing a tag byte. Therefore it may 
> be better to simply remove the custom serializers.






[jira] [Updated] (SPARK-3666) Extract interfaces for EdgeRDD and VertexRDD

2014-11-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3666:
---
Priority: Blocker  (was: Major)

> Extract interfaces for EdgeRDD and VertexRDD
> 
>
> Key: SPARK-3666
> URL: https://issues.apache.org/jira/browse/SPARK-3666
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Blocker
>







[jira] [Updated] (SPARK-4335) Mima check misreporting for GraphX pull request

2014-11-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-4335:
---
Priority: Blocker  (was: Major)

> Mima check misreporting for GraphX pull request
> ---
>
> Key: SPARK-4335
> URL: https://issues.apache.org/jira/browse/SPARK-4335
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Reynold Xin
>Priority: Blocker
>
> Mima is reporting binary check failure for RDD/partitions in the following 
> pull request even though the pull request doesn't change those at all.
> https://github.com/apache/spark/pull/2530
> {code}
> [error]  * abstract method getPartitions()Array[org.apache.spark.Partition] 
> in class org.apache.spark.rdd.RDD does not have a correspondent in new version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions")
> [error]  * abstract method 
> compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator
>  in class org.apache.spark.rdd.RDD does not have a correspondent in new 
> version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute")
> [error]  * abstract method getPartitions()Array[org.apache.spark.Partition] 
> in class org.apache.spark.rdd.RDD does not have a correspondent in new version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions")
> [error]  * abstract method 
> compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator
>  in class org.apache.spark.rdd.RDD does not have a correspondent in new 
> version
> [error]filter with: 
> ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute")
> [info] Done updating.
> {code}






[jira] [Created] (SPARK-4335) Mima check misreporting for GraphX pull request

2014-11-10 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-4335:
--

 Summary: Mima check misreporting for GraphX pull request
 Key: SPARK-4335
 URL: https://issues.apache.org/jira/browse/SPARK-4335
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Reynold Xin


Mima is reporting binary check failure for RDD/partitions in the following pull 
request even though the pull request doesn't change those at all.

https://github.com/apache/spark/pull/2530

{code}
[error]  * abstract method getPartitions()Array[org.apache.spark.Partition] in 
class org.apache.spark.rdd.RDD does not have a correspondent in new version
[error]filter with: 
ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions")
[error]  * abstract method 
compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator
 in class org.apache.spark.rdd.RDD does not have a correspondent in new version
[error]filter with: 
ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute")
[error]  * abstract method getPartitions()Array[org.apache.spark.Partition] in 
class org.apache.spark.rdd.RDD does not have a correspondent in new version
[error]filter with: 
ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions")
[error]  * abstract method 
compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator
 in class org.apache.spark.rdd.RDD does not have a correspondent in new version
[error]filter with: 
ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute")
[info] Done updating.
{code}






[jira] [Resolved] (SPARK-4334) Utils.startServiceOnPort should check whether the tryPort is less than 1024

2014-11-10 Thread YanTang Zhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YanTang Zhai resolved SPARK-4334.
-
Resolution: Duplicate

> Utils.startServiceOnPort should check whether the tryPort is less than 1024
> ---
>
> Key: SPARK-4334
> URL: https://issues.apache.org/jira/browse/SPARK-4334
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: YanTang Zhai
>Priority: Minor
>
> Utils.startServiceOnPort should check whether the tryPort is less than 1024.
> If the next attempted port is less than 1024, a SocketException with "Permission 
> denied" will be thrown.






[jira] [Created] (SPARK-4334) Utils.startServiceOnPort should check whether the tryPort is less than 1024

2014-11-10 Thread YanTang Zhai (JIRA)
YanTang Zhai created SPARK-4334:
---

 Summary: Utils.startServiceOnPort should check whether the tryPort 
is less than 1024
 Key: SPARK-4334
 URL: https://issues.apache.org/jira/browse/SPARK-4334
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: YanTang Zhai
Priority: Minor


Utils.startServiceOnPort should check whether the tryPort is less than 1024.
If the next attempted port is less than 1024, a SocketException with "Permission 
denied" will be thrown.






[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-10 Thread Varadharajan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varadharajan updated SPARK-4306:

Description: Currently we support LogisticRegressionWithSGD in the 
PySpark MLlib interface. This task is to add support for the 
LogisticRegressionWithLBFGS algorithm.  (was: Currently we support 
LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add 
support for the LogisticRegressionWithLBFGS algorithm and include examples.)

> LogisticRegressionWithLBFGS support for PySpark MLlib 
> --
>
> Key: SPARK-4306
> URL: https://issues.apache.org/jira/browse/SPARK-4306
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Varadharajan
>  Labels: newbie
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
> interface. This task is to add support for the LogisticRegressionWithLBFGS 
> algorithm.






[jira] [Updated] (SPARK-4333) Correctly log number of iterations in RuleExecutor

2014-11-10 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4333:
--
Summary: Correctly log number of iterations in RuleExecutor  (was: change 
num of iteration printed in trace log from ${iteration} to ${iteration - 1})

> Correctly log number of iterations in RuleExecutor
> --
>
> Key: SPARK-4333
> URL: https://issues.apache.org/jira/browse/SPARK-4333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: DoingDone9
>Priority: Minor
>
> RuleExecutor breaks, num of iteration should be $(iteration -1) not 
> (iteration) .
> Log looks like "Fixed point reached for batch $(batch.name) after 3 
> iterations.", but it did 2 iterations really!






[jira] [Updated] (SPARK-4333) Correctly log number of iterations in RuleExecutor

2014-11-10 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4333:
--
Description: 
RuleExecutor breaks, num of iteration should be $(iteration -1) not 
$(iteration) .
Log looks like "Fixed point reached for batch $(batch.name) after 3 
iterations.", but it did 2 iterations really!

  was:
RuleExecutor breaks, num of iteration should be $(iteration -1) not (iteration) 
.
Log looks like "Fixed point reached for batch $(batch.name) after 3 
iterations.", but it did 2 iterations really!


> Correctly log number of iterations in RuleExecutor
> --
>
> Key: SPARK-4333
> URL: https://issues.apache.org/jira/browse/SPARK-4333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: DoingDone9
>Priority: Minor
>
> RuleExecutor breaks, num of iteration should be $(iteration -1) not 
> $(iteration) .
> Log looks like "Fixed point reached for batch $(batch.name) after 3 
> iterations.", but it did 2 iterations really!






[jira] [Commented] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205853#comment-14205853
 ] 

Apache Spark commented on SPARK-4333:
-

User 'DoingDone9' has created a pull request for this issue:
https://github.com/apache/spark/pull/3180

> change num of iteration printed in trace log from ${iteration} to ${iteration 
> - 1}
> --
>
> Key: SPARK-4333
> URL: https://issues.apache.org/jira/browse/SPARK-4333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: DoingDone9
>Priority: Minor
>
> RuleExecutor breaks, num of iteration should be $(iteration -1) not 
> (iteration) .
> Log looks like "Fixed point reached for batch $(batch.name) after 3 
> iterations.", but it did 2 iterations really!






[jira] [Updated] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}

2014-11-10 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4333:
--
Description: 
RuleExecutor breaks, num of iteration should be $(iteration -1) not (iteration) 
.
Log looks like "Fixed point reached for batch $(batch.name) after 3 
iterations.", but it did 2 iterations really!

  was:
RuleExecutor breaks, num of iteration should be ${iteration -1} not {iteration} 
.
Log looks like "Fixed point reached for batch ${batch.name} after 3 
iterations.", but it did 2 iterations really!


> change num of iteration printed in trace log from ${iteration} to ${iteration 
> - 1}
> --
>
> Key: SPARK-4333
> URL: https://issues.apache.org/jira/browse/SPARK-4333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: DoingDone9
>Priority: Minor
>
> RuleExecutor breaks, num of iteration should be $(iteration -1) not 
> (iteration) .
> Log looks like "Fixed point reached for batch $(batch.name) after 3 
> iterations.", but it did 2 iterations really!






[jira] [Updated] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}

2014-11-10 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4333:
--
Description: 
RuleExecutor breaks, num of iteration should be ${iteration -1} not {iteration} 
.
Log looks like "Fixed point reached for batch ${batch.name} after 3 
iterations.", but it did 2 iterations really!

  was:RuleExecutor breaks, num of iteration should be {iteration -1} not 
{iteration} .Log looks like "Fixed point reached for batch {batch.name} after 3 
iterations.", but it did 2 iterations really!


> change num of iteration printed in trace log from ${iteration} to ${iteration 
> - 1}
> --
>
> Key: SPARK-4333
> URL: https://issues.apache.org/jira/browse/SPARK-4333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: DoingDone9
>Priority: Minor
>
> RuleExecutor breaks, num of iteration should be ${iteration -1} not 
> {iteration} .
> Log looks like "Fixed point reached for batch ${batch.name} after 3 
> iterations.", but it did 2 iterations really!






[jira] [Updated] (SPARK-3495) Block replication fails continuously when the replication target node is dead

2014-11-10 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-3495:
-
Fix Version/s: 1.1.1

> Block replication fails continuously when the replication target node is dead
> -
>
> Key: SPARK-3495
> URL: https://issues.apache.org/jira/browse/SPARK-3495
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core, Streaming
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 1.1.1, 1.2.0
>
>
> If a block manager (say, A) wants to replicate a block and the node chosen 
> for replication (say, B) is dead, then the attempt to send the block to B 
> fails. However, this continues to fail indefinitely. Even if the driver 
> learns about the demise of B, A continues to try replicating to B and 
> failing miserably. 
> The reason behind this bug is that A initially fetches a list of peers from 
> the driver (when B was active), but never updates it after B is dead. This 
> affects Spark Streaming as its receiver uses block replication.
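
A hedged sketch of the refresh-on-failure idea (function names and signatures are illustrative, not the BlockManager API):

{code}
// Sketch: if replication to a cached peer fails, re-fetch the peer list from
// the driver before retrying, instead of retrying the same dead peer forever.
def replicateWithRefresh(block: Array[Byte],
                         fetchPeers: Boolean => Seq[String],   // forceRefresh => peers
                         sendTo: (String, Array[Byte]) => Boolean,
                         maxAttempts: Int = 3): Boolean = {
  var peers = fetchPeers(false)                 // initially the cached list
  var attempt = 0
  while (attempt < maxAttempts) {
    if (peers.headOption.exists(peer => sendTo(peer, block))) return true
    peers = fetchPeers(true)                    // the cached peer may be dead; refresh
    attempt += 1
  }
  false
}
{code}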






[jira] [Updated] (SPARK-3496) Block replication can by mistake choose driver BlockManager as a peer for replication

2014-11-10 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-3496:
-
Fix Version/s: 1.1.1

> Block replication can by mistake choose driver BlockManager as a peer for 
> replication
> -
>
> Key: SPARK-3496
> URL: https://issues.apache.org/jira/browse/SPARK-3496
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core, Streaming
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 1.1.1, 1.2.0
>
>
> When selecting peer block managers for replicating a block, the driver block 
> manager can also get chosen accidentally. This is because 
> BlockManagerMasterActor did not filter out the driver block manager.
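
A hedged sketch of the shape of the fix, using illustrative types rather than the real BlockManagerId/BlockManagerMasterActor API:

{code}
// Sketch: never offer the driver (or the requester itself) as a replication peer.
case class PeerId(executorId: String, host: String, port: Int) {
  def isDriver: Boolean = executorId == "<driver>"
}

def replicationPeers(all: Seq[PeerId], requester: PeerId): Seq[PeerId] =
  all.filterNot(id => id.isDriver || id == requester)
{code}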






[jira] [Updated] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}

2014-11-10 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4333:
--
Description: RuleExecutor breaks, num of iteration should be {iteration -1} 
not {iteration} .Log looks like "Fixed point reached for batch {batch.name} 
after 3 iterations.", but it did 2 iterations really!  (was: RuleExecutor 
breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks 
like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it 
did 2 iterations really!)

> change num of iteration printed in trace log from ${iteration} to ${iteration 
> - 1}
> --
>
> Key: SPARK-4333
> URL: https://issues.apache.org/jira/browse/SPARK-4333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: DoingDone9
>Priority: Minor
>
> RuleExecutor breaks, num of iteration should be {iteration -1} not 
> {iteration} .Log looks like "Fixed point reached for batch {batch.name} after 
> 3 iterations.", but it did 2 iterations really!






[jira] [Created] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}

2014-11-10 Thread DoingDone9 (JIRA)
DoingDone9 created SPARK-4333:
-

 Summary: change num of iteration printed in trace log from 
${iteration} to ${iteration - 1}
 Key: SPARK-4333
 URL: https://issues.apache.org/jira/browse/SPARK-4333
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: DoingDone9
Priority: Minor









[jira] [Updated] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}

2014-11-10 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4333:
--
Description: RuleExecutor breaks, num of iteration should be ${iteration 
-1} not ${iteration} .Log looks like "Fixed point reached for batch 
${batch.name} after 3 iterations.", but it did 2 iterations really!

> change num of iteration printed in trace log from ${iteration} to ${iteration 
> - 1}
> --
>
> Key: SPARK-4333
> URL: https://issues.apache.org/jira/browse/SPARK-4333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: DoingDone9
>Priority: Minor
>
> RuleExecutor breaks, num of iteration should be ${iteration -1} not 
> ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} 
> after 3 iterations.", but it did 2 iterations really!






[jira] [Comment Edited] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file

2014-11-10 Thread maji2014 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205825#comment-14205825
 ] 

maji2014 edited comment on SPARK-4314 at 11/11/14 2:32 AM:
---

Yes, not all of this intermediate state is caught. 
I want to add the following code to the defaultFilter method in FileInputDStream.
Any suggestions?


was (Author: maji2014):
Yes, not all of this intermediate state is caught. 
I want to add the following code to the CustomPathFilter method in the FileInputDStream 
class to filter this state out at the FileInputDStream level.
  if (path.getName.endsWith("_COPYING_")) {
    logDebug("Ignoring file still being copied: " + path)
    return false
  }
Any suggestions?

> Exception when textFileStream attempts to read deleted _COPYING_ file
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, which uses "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (see the reason below)
>  3. The following exception is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> sc

[jira] [Closed] (SPARK-4332) RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it

2014-11-10 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 closed SPARK-4332.
-
Resolution: Invalid

> RuleExecutor breaks, num of iteration should be ${iteration -1} not 
> ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} 
> after 3 iterations.", but it did 2 iterations really! 
> 
>
> Key: SPARK-4332
> URL: https://issues.apache.org/jira/browse/SPARK-4332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: DoingDone9
>Priority: Minor
>
> When RuleExecutor breaks out of its loop, the number of iterations should be
> ${iteration - 1}, not ${iteration}. The log reads "Fixed point reached for batch
> ${batch.name} after 3 iterations.", but it really did only 2 iterations!
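For context, a minimal standalone sketch of the counting behaviour described above (a simplified fixed-point loop with a toy rule; this is an illustration, not the actual RuleExecutor source):

{code}
object FixedPointCountSketch {
  def main(args: Array[String]): Unit = {
    // Toy "rule": move a value toward 10. One pass changes the value, a second
    // pass merely confirms that nothing changed any more.
    val rule: Int => Int = n => math.min(n + 5, 10)
    var current = 7
    var iteration = 1
    var continue = true
    while (continue) {
      val next = rule(current)
      iteration += 1
      if (next == current) {
        // `iteration` has already been bumped for this confirming pass, so the
        // value logged here is one higher than the number of productive passes.
        println(s"Fixed point reached after $iteration iterations.")
        continue = false
      }
      current = next
    }
  }
}
{code}

With these numbers the sketch prints "after 3 iterations" even though only two passes ran and only one of them changed the value, which is the off-by-one being reported.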



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4332) RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it

2014-11-10 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4332:
--
Description: When RuleExecutor breaks out of its loop, the number of iterations 
should be ${iteration - 1}, not ${iteration}. The log reads "Fixed point reached 
for batch ${batch.name} after 3 iterations.", but it really did only 2 iterations!  (was: 
RuleExecutor breaks, num of iteration should be ${iteration -1} not 
${iteration} .Log looks like "Fixed point reached for batch ${batch.nam)

> RuleExecutor breaks, num of iteration should be ${iteration -1} not 
> ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} 
> after 3 iterations.", but it did 2 iterations really! 
> 
>
> Key: SPARK-4332
> URL: https://issues.apache.org/jira/browse/SPARK-4332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: DoingDone9
>Priority: Minor
>
> When RuleExecutor breaks out of its loop, the number of iterations should be
> ${iteration - 1}, not ${iteration}. The log reads "Fixed point reached for batch
> ${batch.name} after 3 iterations.", but it really did only 2 iterations!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4332) RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it

2014-11-10 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4332:
--
Description: RuleExecutor breaks, num of iteration should be ${iteration 
-1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.nam

> RuleExecutor breaks, num of iteration should be ${iteration -1} not 
> ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} 
> after 3 iterations.", but it did 2 iterations really! 
> 
>
> Key: SPARK-4332
> URL: https://issues.apache.org/jira/browse/SPARK-4332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: DoingDone9
>Priority: Minor
>
> RuleExecutor breaks, num of iteration should be ${iteration -1} not 
> ${iteration} .Log looks like "Fixed point reached for batch ${batch.nam



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4332) RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it

2014-11-10 Thread DoingDone9 (JIRA)
DoingDone9 created SPARK-4332:
-

 Summary: RuleExecutor breaks, num of iteration should be 
${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch 
${batch.name} after 3 iterations.", but it did 2 iterations really! 
 Key: SPARK-4332
 URL: https://issues.apache.org/jira/browse/SPARK-4332
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: DoingDone9
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2775) HiveContext does not support dots in column names.

2014-11-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205829#comment-14205829
 ] 

Josh Rosen commented on SPARK-2775:
---

Chatted with [~marmbrus] about this and he pointed me to a workaround: use 
{{applySchema}} to replace the dots with some other separator (underscore, in 
this case):

{code}
import sqlContext.applySchema

/** Replaces . in column names with _ (underscore) */
protected def cleanSchema(dataType: DataType): DataType = dataType match {
  case StructType(fields) =>
    StructType(
      fields.map(f =>
        f.copy(name = f.name.replaceAll("\\.", "_"), dataType = cleanSchema(f.dataType))))
  case ArrayType(elemType, nullable) => ArrayType(cleanSchema(elemType), nullable)
  case NullType => StringType
  case other => other
}

/** Replaces . signs in column names with _ */
protected def cleanSchema(schemaRDD: SchemaRDD): SchemaRDD = {
  applySchema(schemaRDD, cleanSchema(schemaRDD.schema).asInstanceOf[StructType])
}
{code}

I've translated this to Python:

{code}
from pyspark.sql import *
from copy import copy

def cleanSchema(schemaRDD):
  def renameField(field):
    field = copy(field)
    field.name = field.name.replace(".", "_")
    field.dataType = doCleanSchema(field.dataType)
    return field
  def doCleanSchema(dataType):
    dataType = copy(dataType)
    if isinstance(dataType, StructType):
      dataType.fields = [renameField(f) for f in dataType.fields]
    elif isinstance(dataType, ArrayType):
      dataType.elementType = doCleanSchema(dataType.elementType)
    return dataType

  return sqlContext.applySchema(schemaRDD.map(lambda x: x),
                                doCleanSchema(schemaRDD.schema()))

print cleanSchema(resultsRDD)
{code}
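For completeness, a hypothetical usage of the Scala helper above ({{resultsRDD}}, the table name, and the query are assumptions for illustration; the renamed column {{key_number1}} corresponds to the original {{key.number1}}):

{code}
// Hypothetical usage of the cleanSchema helper defined above. `resultsRDD` is
// assumed to be an existing SchemaRDD whose column names contain dots.
val cleaned: SchemaRDD = cleanSchema(resultsRDD)
cleaned.registerAsTable("results_cleaned")
// `key.number1` has been renamed to `key_number1`, so it now resolves normally:
sqlContext.sql("SELECT key_number1 FROM results_cleaned").collect().foreach(println)
{code}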

> HiveContext does not support dots in column names. 
> ---
>
> Key: SPARK-2775
> URL: https://issues.apache.org/jira/browse/SPARK-2775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> Try the following snippet in hive/console:
> {code}
> val data = sc.parallelize(Seq("""{"key.number1": "value1", "key.number2": 
> "value2"}"""))
> jsonRDD(data).registerAsTable("jt")
> hql("select `key.number1` from jt")
> {code}
> You will find that the name key.number1 cannot be resolved.
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
> attributes: 'key.number1, tree:
> Project ['key.number1]
>  LowerCaseSchema 
>   Subquery jt
>SparkLogicalPlan (ExistingRdd [key.number1#8,key.number2#9], MappedRDD[17] 
> at map at JsonRDD.scala:37)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file

2014-11-10 Thread maji2014 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205825#comment-14205825
 ] 

maji2014 edited comment on SPARK-4314 at 11/11/14 2:16 AM:
---

Yes, not all of these intermediate states are caught.
I want to add the following code to the CustomPathFilter method in the
FileInputDStream class to filter this state out of the files FileInputDStream picks up.
  if (path.getName.endsWith("_COPYING_")) {
    logDebug("...")
    return false
  }
Any suggestions?


was (Author: maji2014):
Yes, not all of these intermediate states are caught.
I want to add the following code to the CustomPathFilter method in the
FileInputDStream class to filter this state out of the files FileInputDStream picks up.
  if (path.getName.endsWith("_COPYING_")) {
    logDebug("...")
    return false
  }
Any suggestions?

> Exception when textFileStream attempts to read deleted _COPYING_ file
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, which calls "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (the reason is explained below)
>  3. The following exception is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.ap

[jira] [Commented] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file

2014-11-10 Thread maji2014 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205825#comment-14205825
 ] 

maji2014 commented on SPARK-4314:
-

Yes, not all of these intermediate states are caught.
I want to add the following code to the CustomPathFilter method in the
FileInputDStream class to filter this state out of the files FileInputDStream picks up.
  if (path.getName.endsWith("_COPYING_")) {
    logDebug("...")
    return false
  }
Any suggestions?

> Exception when textFileStream attempts to read deleted _COPYING_ file
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, which calls "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (the reason is explained below)
>  3. The following exception is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.mutable.ResizableArray$

[jira] [Comment Edited] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file

2014-11-10 Thread maji2014 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205774#comment-14205774
 ] 

maji2014 edited comment on SPARK-4314 at 11/11/14 2:07 AM:
---

The actual behavior is that the intermediate _COPYING_ file is picked up by the
FileInputDStream interface when I upload a normal file; I am not uploading an
intermediate file myself.
Take the following as an example: normally upload three files aa, bb, cc
**
hdfs://master:9000/user/spark/aa
14/11/09 23:16:44 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/aa:0+8
hdfs://master:9000/user/spark/bb
14/11/09 23:16:58 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/bb:0+7
hdfs://master:9000/user/spark/cc._COPYING_
14/11/09 23:17:06 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/cc._COPYING_:0+0
hdfs://master:9000/user/spark/cc
14/11/09 23:17:08 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/cc:0+6
**
File cc is then counted twice because cc._COPYING_ is also counted. The issue
only shows up in this particular scenario.
As a simple test, directly upload a file with a name like dd._COPYING_ (sometimes
this intermediate state is read by the FileInputDStream interface); the file still
gets counted, which is not the expected behavior.


was (Author: maji2014):
The actual behavior is that the intermediate _COPYING_ file is picked up by the
FileInputDStream interface when I upload a normal file; I am not uploading an
intermediate file myself.
Take the following as an example: normally upload three files aa, bb, cc
**
hdfs://master:9000/user/spark/aa
14/11/09 23:16:44 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/aa:0+8
hdfs://master:9000/user/spark/bb
14/11/09 23:16:58 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/bb:0+7
hdfs://master:9000/user/spark/cc._COPYING_
14/11/09 23:17:06 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/cc._COPYING_:0+0
hdfs://master:9000/user/spark/cc
14/11/09 23:17:08 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/cc:0+6
**
File cc is then counted twice because cc._COPYING_ is also counted. The issue
only shows up in this particular scenario.

> Exception when textFileStream attempts to read deleted _COPYING_ file
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, which calls "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (the reason is explained below)
>  3. The following exception is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream

[jira] [Commented] (SPARK-4331) SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1

2014-11-10 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205816#comment-14205816
 ] 

Kousuke Saruta commented on SPARK-4331:
---

[~pwendell] You mean the Scala 2.11 patch introduces more of these non-standard
directory structures, so it's not a good time to fix this issue right now, right?

> SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1
> 
>
> Key: SPARK-4331
> URL: https://issues.apache.org/jira/browse/SPARK-4331
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>
> v0.13.1 and v0.12.0 are not standard directory structures for sbt's scalastyle
> plugin, so scalastyle doesn't work for the sources under those directories.
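For context, an illustrative sbt fragment showing the kind of layout involved (an assumption about the build wiring, not the actual Spark build files): the version-specific directories are added as extra unmanaged source roots, which sbt compiles but the scalastyle plugin's default source lookup does not cover.

{code}
// Illustrative build.sbt fragment: add a version-specific source root such as
// sql/hive/v0.13.1/src/main/scala to the compile sources.
unmanagedSourceDirectories in Compile +=
  baseDirectory.value / "v0.13.1" / "src" / "main" / "scala"
{code}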



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4331) SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1

2014-11-10 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-4331:
--
Summary: SBT Scalastyle doesn't work for the sources under hive's v0.12.0 
and v0.13.1  (was: Scalastyle doesn't work for the sources under hive's v0.12.0 
and v0.13.1)

> SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1
> 
>
> Key: SPARK-4331
> URL: https://issues.apache.org/jira/browse/SPARK-4331
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>
> v0.13.1 and v0.12.0 are not standard directory structures for sbt's scalastyle
> plugin, so scalastyle doesn't work for the sources under those directories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4316) Utils.isBindCollision misjudges at Non-English environment

2014-11-10 Thread YanTang Zhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205810#comment-14205810
 ] 

YanTang Zhai commented on SPARK-4316:
-

Thanks @Sean Owen

> Utils.isBindCollision misjudges at Non-English environment
> --
>
> Key: SPARK-4316
> URL: https://issues.apache.org/jira/browse/SPARK-4316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: YanTang Zhai
>Priority: Minor
>
> export LANG=zh_CN.utf8
> export LC_ALL=zh_CN.utf8
> Then Utils.isBindCollision misjudges the exception, because BindException's
> message is the localized "地址已在使用" rather than "Address already in use".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file

2014-11-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4314:
---
Summary: Exception when textFileStream attempts to read deleted _COPYING_ 
file  (was: Exception throws when the upload intermediate file(_COPYING_ file) 
is read through hdfs interface)

> Exception when textFileStream attempts to read deleted _COPYING_ file
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, which calls "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (the reason is explained below)
>  3. The following exception is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at scala.collection.TraversableLike$class.flatMap(Traversable

[jira] [Comment Edited] (SPARK-4314) Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface

2014-11-10 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205799#comment-14205799
 ] 

Patrick Wendell edited comment on SPARK-4314 at 11/11/14 1:56 AM:
--

Can you add a filter that excludes COPYING files? My guess is that at one time 
it lists the file and at another time it tries to open it. IIRC Spark Streaming 
supports filters to remove ephemeral files like this.


was (Author: pwendell):
Can you add a filter that excludes COPYING files?
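A minimal sketch of that suggestion, assuming the {{fileStream}} overload that takes a {{Path => Boolean}} filter (the app name, batch interval, and filter are illustrative):

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FilteredHdfsWordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("FilteredHdfsWordCount"), Seconds(10))

    // Skip HDFS client intermediate files (present while `hadoop fs -put` runs)
    // as well as hidden files.
    def acceptPath(path: Path): Boolean =
      !path.getName.startsWith(".") && !path.getName.endsWith("_COPYING_")

    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat](args(0), acceptPath _, newFilesOnly = true)
      .map(_._2.toString)

    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}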

> Exception throws when the upload intermediate file(_COPYING_ file) is read 
> through hdfs interface
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, which calls "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (the reason is explained below)
>  3. The following exception is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.TraversableLike$$anonfun$

[jira] [Updated] (SPARK-4314) Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface

2014-11-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4314:
---
Component/s: Streaming

> Exception throws when the upload intermediate file(_COPYING_ file) is read 
> through hdfs interface
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, which calls "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (the reason is explained below)
>  3. The following exception is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>  at 
> org.ap

[jira] [Commented] (SPARK-4314) Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface

2014-11-10 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205799#comment-14205799
 ] 

Patrick Wendell commented on SPARK-4314:


Can you add a filter that excludes COPYING files?

> Exception throws when the upload intermediate file(_COPYING_ file) is read 
> through hdfs interface
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, which calls "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (the reason is explained below)
>  3. The following exception is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>  at scala.collection.AbstractTraversabl

[jira] [Resolved] (SPARK-4274) NPE in printing the details of query plan

2014-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4274.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3139
[https://github.com/apache/spark/pull/3139]

> NPE in printing the details of query plan
> -
>
> Key: SPARK-4274
> URL: https://issues.apache.org/jira/browse/SPARK-4274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
> Fix For: 1.2.0
>
>
> NPE in printing the details of the query plan when the query is not valid.
> Fixing this will be very helpful for the Hive comparison tests, which could then
> provide more information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4331) Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1

2014-11-10 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205795#comment-14205795
 ] 

Patrick Wendell commented on SPARK-4331:


This is going to be exacerbated by the Scala 2.11 patch, which will add more of 
these non-standard directory structures.

> Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1
> 
>
> Key: SPARK-4331
> URL: https://issues.apache.org/jira/browse/SPARK-4331
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>
> v0.13.1 and v0.12.0 are not standard directory structures for sbt's scalastyle
> plugin, so scalastyle doesn't work for the sources under those directories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4331) Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1

2014-11-10 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-4331:
-

 Summary: Scalastyle doesn't work for the sources under hive's 
v0.12.0 and v0.13.1
 Key: SPARK-4331
 URL: https://issues.apache.org/jira/browse/SPARK-4331
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 1.3.0
Reporter: Kousuke Saruta


v0.13.1 and v0.12.0 are not standard directory structures for sbt's scalastyle 
plugin, so scalastyle doesn't work for the sources under those directories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4274) NPE in printing the details of query plan

2014-11-10 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4274:
-
Summary: NPE in printing the details of query plan  (was: NEP in printing 
the details of query plan)

> NPE in printing the details of query plan
> -
>
> Key: SPARK-4274
> URL: https://issues.apache.org/jira/browse/SPARK-4274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> NEP in printing the details of query plan, if the query is not valid. this 
> will great helpful in Hive comparison test, which could provide more 
> information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4274) NPE in printing the details of query plan

2014-11-10 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4274:
-
Description: NPE in printing the details of the query plan when the query is not 
valid. Fixing this will be very helpful for the Hive comparison tests, which could 
then provide more information.  (was: NEP in printing the details of query plan, if the 
query is not valid. this will great helpful in Hive comparison test, which 
could provide more information.)

> NPE in printing the details of query plan
> -
>
> Key: SPARK-4274
> URL: https://issues.apache.org/jira/browse/SPARK-4274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> NPE in printing the details of the query plan when the query is not valid.
> Fixing this will be very helpful for the Hive comparison tests, which could then
> provide more information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4274) NEP in printing the details of query plan

2014-11-10 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4274:
-
Description: NEP in printing the details of query plan, if the query is not 
valid. this will great helpful in Hive comparison test, which could provide 
more information.  (was: Hive comparison test frame doesn't print informative 
message, if the unit test failed in logical plan analyzing.)

> NEP in printing the details of query plan
> -
>
> Key: SPARK-4274
> URL: https://issues.apache.org/jira/browse/SPARK-4274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> NEP in printing the details of query plan, if the query is not valid. this 
> will great helpful in Hive comparison test, which could provide more 
> information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4274) NEP in printing the details of query plan

2014-11-10 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4274:
-
Summary: NEP in printing the details of query plan  (was: Hive comparison 
test framework doesn't print effective information while logical plan analyzing 
failed )

> NEP in printing the details of query plan
> -
>
> Key: SPARK-4274
> URL: https://issues.apache.org/jira/browse/SPARK-4274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> Hive comparison test frame doesn't print informative message, if the unit 
> test failed in logical plan analyzing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4314) Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface

2014-11-10 Thread maji2014 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205774#comment-14205774
 ] 

maji2014 edited comment on SPARK-4314 at 11/11/14 1:48 AM:
---

The actual behavior is that the intermediate _COPYING_ file is picked up by the
FileInputDStream interface when I upload a normal file; I am not uploading an
intermediate file myself.
Take the following as an example: normally upload three files aa, bb, cc
**
hdfs://master:9000/user/spark/aa
14/11/09 23:16:44 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/aa:0+8
hdfs://master:9000/user/spark/bb
14/11/09 23:16:58 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/bb:0+7
hdfs://master:9000/user/spark/cc._COPYING_
14/11/09 23:17:06 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/cc._COPYING_:0+0
hdfs://master:9000/user/spark/cc
14/11/09 23:17:08 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/cc:0+6
**
File cc is then counted twice because cc._COPYING_ is also counted. The issue
only shows up in this particular scenario.


was (Author: maji2014):
The actual behavior is that the intermediate _COPYING_ file is picked up by the
FileInputDStream interface when I upload a normal file; I am not uploading an
intermediate file myself.
Take the following as an example:
**
hdfs://master:9000/user/spark/aa
14/11/09 23:16:44 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/aa:0+8
hdfs://master:9000/user/spark/bb
14/11/09 23:16:58 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/bb:0+7
hdfs://master:9000/user/spark/cc._COPYING_
14/11/09 23:17:06 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/cc._COPYING_:0+0
hdfs://master:9000/user/spark/cc
14/11/09 23:17:08 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/cc:0+6
**
File cc is then counted twice. The issue only shows up in this particular scenario.

> Exception throws when the upload intermediate file(_COPYING_ file) is read 
> through hdfs interface
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, which calls "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (the reason is explained below)
>  3. The following exception is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DS

[jira] [Updated] (SPARK-4314) Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface

2014-11-10 Thread maji2014 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

maji2014 updated SPARK-4314:

Summary: Exception throws when the upload intermediate file(_COPYING_ file) 
is read through hdfs interface  (was: Exception throws when finding new files 
like intermediate result(_COPYING_ file) through hdfs interface)

> Exception throws when the upload intermediate file(_COPYING_ file) is read 
> through hdfs interface
> -
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, e.g. "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (reason as follows)
>  3. The exception below is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at scala.collec

[jira] [Updated] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files

2014-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4119:

Target Version/s: 1.3.0  (was: 1.2.0)

> Don't rely on HIVE_DEV_HOME to find .q files
> 
>
> Key: SPARK-4119
> URL: https://issues.apache.org/jira/browse/SPARK-4119
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.1.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> After merging in Hive 0.13.1 support, a bunch of .q files and golden answer 
> files got updated. Unfortunately, some .q files had been updated in Hive 
> itself; for example, an ORDER BY clause was added to groupby1_limit.q as a bug 
> fix.
> With HIVE_DEV_HOME set, developers working on Hive 0.12.0 may end up with 
> false test failures, because .q files are looked up from HIVE_DEV_HOME and 
> the outdated .q files are used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files

2014-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4119:

Assignee: Cheng Lian

> Don't rely on HIVE_DEV_HOME to find .q files
> 
>
> Key: SPARK-4119
> URL: https://issues.apache.org/jira/browse/SPARK-4119
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.1.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> After merging in Hive 0.13.1 support, a bunch of .q files and golden answer 
> files got updated. Unfortunately, some .q files had been updated in Hive 
> itself; for example, an ORDER BY clause was added to groupby1_limit.q as a bug 
> fix.
> With HIVE_DEV_HOME set, developers working on Hive 0.12.0 may end up with 
> false test failures, because .q files are looked up from HIVE_DEV_HOME and 
> the outdated .q files are used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4314) Exception throws when finding new files like intermediate result(_COPYING_ file) through hdfs interface

2014-11-10 Thread maji2014 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205774#comment-14205774
 ] 

maji2014 commented on SPARK-4314:
-

The actual behavior is the intermediate file _COPYING_ is found by 
FileInputDStream interface when i upload a normal file rather than uploading a 
intermediate file.
take following as examples:
**
hdfs://master:9000/user/spark/aa
14/11/09 23:16:44 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/aa:0+8
hdfs://master:9000/user/spark/bb
14/11/09 23:16:58 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/bb:0+7
hdfs://master:9000/user/spark/cc._COPYING_
14/11/09 23:17:06 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/cc._COPYING_:0+0
hdfs://master:9000/user/spark/cc
14/11/09 23:17:08 INFO NewHadoopRDD: Input split: 
hdfs://master:9000/user/spark/cc:0+6
**
and then file cc is counted two times. The issue exists in a special scenario.

> Exception throws when finding new files like intermediate result(_COPYING_ 
> file) through hdfs interface
> ---
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, e.g. "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS (reason as follows)
>  3. The exception below is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.g

[jira] [Updated] (SPARK-3954) Optimization to FileInputDStream

2014-11-10 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-3954:
-
Fix Version/s: 1.1.1

> Optimization to FileInputDStream
> 
>
> Key: SPARK-3954
> URL: https://issues.apache.org/jira/browse/SPARK-3954
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0, 1.1.0
>Reporter: 宿荣全
> Fix For: 1.1.1, 1.2.0
>
>
> When converting files to RDDs, the Spark source iterates over the file 
> sequence 3 times:
> 1. files.map(...)
> 2. files.zip(fileRDDs)
> 3. foreach over the zipped result
> This change merges the 3 passes into 1.
> spark source code:
>   private def filesToRDD(files: Seq[String]): RDD[(K, V)] = {
> val fileRDDs = files.map(file => context.sparkContext.newAPIHadoopFile[K, 
> V, F](file))
> files.zip(fileRDDs).foreach { case (file, rdd) => {
>   if (rdd.partitions.size == 0) {
> logError("File " + file + " has no data in it. Spark Streaming can 
> only ingest " +
>   "files that have been \"moved\" to the directory assigned to the 
> file stream. " +
>   "Refer to the streaming programming guide for more details.")
>   }
> }}
> new UnionRDD(context.sparkContext, fileRDDs)
>   }
> // 
> ---
> modified code:
>   private def filesToRDD(files: Seq[String]): RDD[(K, V)] = {
> val fileRDDs = for (file <- files; rdd = 
> context.sparkContext.newAPIHadoopFile[K, V, F](file)) 
>   yield {
>   if (rdd.partitions.size == 0) {
> logError("File " + file + " has no data in it. Spark Streaming can 
> only ingest " +
>   "files that have been \"moved\" to the directory assigned to the 
> file stream. " +
>   "Refer to the streaming programming guide for more details.")
>   }
>   rdd
> }
> new UnionRDD(context.sparkContext, fileRDDs)
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3954) Optimization to FileInputDStream

2014-11-10 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-3954.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

> Optimization to FileInputDStream
> 
>
> Key: SPARK-3954
> URL: https://issues.apache.org/jira/browse/SPARK-3954
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0, 1.1.0
>Reporter: 宿荣全
> Fix For: 1.2.0
>
>
> When converting files to RDDs, the Spark source iterates over the file 
> sequence 3 times:
> 1. files.map(...)
> 2. files.zip(fileRDDs)
> 3. foreach over the zipped result
> This change merges the 3 passes into 1.
> spark source code:
>   private def filesToRDD(files: Seq[String]): RDD[(K, V)] = {
> val fileRDDs = files.map(file => context.sparkContext.newAPIHadoopFile[K, 
> V, F](file))
> files.zip(fileRDDs).foreach { case (file, rdd) => {
>   if (rdd.partitions.size == 0) {
> logError("File " + file + " has no data in it. Spark Streaming can 
> only ingest " +
>   "files that have been \"moved\" to the directory assigned to the 
> file stream. " +
>   "Refer to the streaming programming guide for more details.")
>   }
> }}
> new UnionRDD(context.sparkContext, fileRDDs)
>   }
> // 
> ---
> modified code:
>   private def filesToRDD(files: Seq[String]): RDD[(K, V)] = {
> val fileRDDs = for (file <- files; rdd = 
> context.sparkContext.newAPIHadoopFile[K, V, F](file)) 
>   yield {
>   if (rdd.partitions.size == 0) {
> logError("File " + file + " has no data in it. Spark Streaming can 
> only ingest " +
>   "files that have been \"moved\" to the directory assigned to the 
> file stream. " +
>   "Refer to the streaming programming guide for more details.")
>   }
>   rdd
> }
> new UnionRDD(context.sparkContext, fileRDDs)
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4149) ISO 8601 support for json date time strings

2014-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4149.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3012
[https://github.com/apache/spark/pull/3012]

> ISO 8601 support for json date time strings
> ---
>
> Key: SPARK-4149
> URL: https://issues.apache.org/jira/browse/SPARK-4149
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
> Fix For: 1.2.0
>
>
> Parse JSON date-time strings like "2014-10-29T20:05:00-08:00" or "2014-10-29T20:05:00Z".
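
For illustration only (this is not the code from PR 3012), strings in that form
can be parsed on the JVM with the standard DatatypeConverter; the object and
method names below are illustrative:

{code}
// Sketch only: parse ISO 8601 strings with an offset or a trailing 'Z' using
// javax.xml.bind.DatatypeConverter, which follows the xsd:dateTime lexical form.
import java.sql.Timestamp
import javax.xml.bind.DatatypeConverter

object Iso8601Sketch {
  def toTimestamp(s: String): Timestamp =
    new Timestamp(DatatypeConverter.parseDateTime(s).getTimeInMillis)

  def main(args: Array[String]): Unit = {
    println(toTimestamp("2014-10-29T20:05:00-08:00"))
    println(toTimestamp("2014-10-29T20:05:00Z"))
  }
}
{code}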



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4250) Create constant null value for Hive Inspectors

2014-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4250.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3114
[https://github.com/apache/spark/pull/3114]

> Create constant null value for Hive Inspectors
> --
>
> Key: SPARK-4250
> URL: https://issues.apache.org/jira/browse/SPARK-4250
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
> Fix For: 1.2.0
>
>
> Constant null value is not accepted while creating the Hive Inspectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.

2014-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2205:

Target Version/s: 1.3.0  (was: 1.2.0)

> Unnecessary exchange operators in a join on multiple tables with the same 
> join key.
> ---
>
> Key: SPARK-2205
> URL: https://issues.apache.org/jira/browse/SPARK-2205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Minor
>
> {code}
> hql("select * from src x join src y on (x.key=y.key) join src z on 
> (y.key=z.key)")
> SchemaRDD[1] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5]
>  HashJoin [key#6], [key#8], BuildRight
>   Exchange (HashPartitioning [key#6], 200)
>HashJoin [key#4], [key#6], BuildRight
> Exchange (HashPartitioning [key#4], 200)
>  HiveTableScan [key#4,value#5], (MetastoreRelation default, src, 
> Some(x)), None
> Exchange (HashPartitioning [key#6], 200)
>  HiveTableScan [key#6,value#7], (MetastoreRelation default, src, 
> Some(y)), None
>   Exchange (HashPartitioning [key#8], 200)
>HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), 
> None
> {code}
> However, this is fine...
> {code}
> hql("select * from src x join src y on (x.key=y.key) join src z on 
> (x.key=z.key)")
> res5: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[5] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5]
>  HashJoin [key#26], [key#30], BuildRight
>   HashJoin [key#26], [key#28], BuildRight
>Exchange (HashPartitioning [key#26], 200)
> HiveTableScan [key#26,value#27], (MetastoreRelation default, src, 
> Some(x)), None
>Exchange (HashPartitioning [key#28], 200)
> HiveTableScan [key#28,value#29], (MetastoreRelation default, src, 
> Some(y)), None
>   Exchange (HashPartitioning [key#30], 200)
>HiveTableScan [key#30,value#31], (MetastoreRelation default, src, 
> Some(z)), None
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205731#comment-14205731
 ] 

Apache Spark commented on SPARK-3461:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/3198

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.
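
For illustration, a minimal sketch of the idea described above under a Spark
1.2-era API (this is not the code in PR 3198): repartitionAndSortWithinPartitions
places equal keys next to each other, and a buffered (lookahead) iterator cuts
the sorted stream at key boundaries, so only one group's values are held in
memory at a time. Object and variable names are illustrative.

{code}
// Minimal sketch: sort within partitions, then cut the sorted stream into
// groups with a lookahead (buffered) iterator.
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair/ordered RDD implicits on Spark 1.x

object SortBasedGroupBy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SortBasedGroupBy").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)))

    val grouped = pairs
      .repartitionAndSortWithinPartitions(new HashPartitioner(4))
      .mapPartitions { iter =>
        val buffered = iter.buffered
        new Iterator[(String, Seq[Int])] {
          def hasNext: Boolean = buffered.hasNext
          def next(): (String, Seq[Int]) = {
            val key = buffered.head._1
            val values = scala.collection.mutable.ArrayBuffer[Int]()
            // Lookahead: consume consecutive records that share the current key.
            while (buffered.hasNext && buffered.head._1 == key) {
              values += buffered.next()._2
            }
            (key, values)
          }
        }
      }

    grouped.collect().foreach(println)  // e.g. (a,ArrayBuffer(1, 3)), (b,ArrayBuffer(2, 5)), (c,ArrayBuffer(4))
    sc.stop()
  }
}
{code}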



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4308) SQL operation state is not properly set when exception is thrown

2014-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4308.
-
   Resolution: Fixed
Fix Version/s: 1.1.1
   1.2.0

Issue resolved by pull request 3175
[https://github.com/apache/spark/pull/3175]

> SQL operation state is not properly set when exception is thrown
> 
>
> Key: SPARK-4308
> URL: https://issues.apache.org/jira/browse/SPARK-4308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1
>Reporter: Cheng Lian
>Priority: Minor
> Fix For: 1.2.0, 1.1.1
>
>
> In {{HiveThriftServer2}}, when an exception is thrown during a SQL execution, 
> the SQL operation state should be set to {{ERROR}}, but now it remains 
> {{RUNNING}}. This affects the result of the GetOperationStatus Thrift API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4202) DSL support for Scala UDF

2014-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4202.
-
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Cheng Lian

> DSL support for Scala UDF
> -
>
> Key: SPARK-4202
> URL: https://issues.apache.org/jira/browse/SPARK-4202
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 1.2.0
>
>
> Using a Scala UDF with the current DSL API is quite verbose, e.g.:
> {code}
> case class KeyValue(key: Int, value: String)
> val schemaRDD = sc.parallelize(1 to 10).map(i => KeyValue(i, 
> i.toString)).toSchemaRDD
> def foo = (a: Int, b: String) => a.toString + b
> schemaRDD.select( // SELECT
>   Star(None), // *,
>   ScalaUdf(   //
> foo,  // foo(
> StringType,   //
> 'key.attr :: 'value.attr :: Nil)  //   key, value
> ).collect()   // ) FROM ...
> {code}
> It would be good to add a DSL syntax to simplify UDF invocation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4330) Link to proper URL for YARN overview

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205680#comment-14205680
 ] 

Apache Spark commented on SPARK-4330:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/3196

> Link to proper URL for YARN overview
> 
>
> Key: SPARK-4330
> URL: https://issues.apache.org/jira/browse/SPARK-4330
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> running-on-yarn.md contains a link to the YARN overview, but the URL points 
> to the YARN alpha documentation. It should point to the stable documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4330) Link to proper URL for YARN overview

2014-11-10 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-4330:
-

 Summary: Link to proper URL for YARN overview
 Key: SPARK-4330
 URL: https://issues.apache.org/jira/browse/SPARK-4330
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Kousuke Saruta
Priority: Minor


running-on-yarn.md contains a link to the YARN overview, but the URL points to 
the YARN alpha documentation. It should point to the stable documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2014-11-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205637#comment-14205637
 ] 

Joseph K. Bradley commented on SPARK-3717:
--

[~codedeft] Are you asking me or [~bbnsumanth]?  If me, I sketched out my 
thoughts above, but please let me know if parts are unclear.

[~bbnsumanth] I'll look forward to seeing how things progress.

> DecisionTree, RandomForest: Partition by feature
> 
>
> Key: SPARK-3717
> URL: https://issues.apache.org/jira/browse/SPARK-3717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> h1. Summary
> Currently, data are partitioned by row/instance for DecisionTree and 
> RandomForest.  This JIRA argues for partitioning by feature for training deep 
> trees.  This is especially relevant for random forests, which are often 
> trained to be deeper than single decision trees.
> h1. Details
> Dataset dimensions and the depth of the tree to be trained are the main 
> problem parameters determining whether it is better to partition features or 
> instances.  For random forests (training many deep trees), partitioning 
> features could be much better.
> Notation:
> * P = # workers
> * N = # instances
> * M = # features
> * D = depth of tree
> h2. Partitioning Features
> Algorithm sketch:
> * Each worker stores:
> ** a subset of columns (i.e., a subset of features).  If a worker stores 
> feature j, then the worker stores the feature value for all instances (i.e., 
> the whole column).
> ** all labels
> * Train one level at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node in current level
> * On each iteration:
> ** Each worker: For each node in level, compute (best feature to split, info 
> gain).
> ** Reduce (P x M) values to M values to find best split for each node.
> ** Workers who have features used in best splits communicate left/right for 
> relevant instances.  Gather total of N bits to master, then broadcast.
> * Total communication:
> ** Depth D iterations
> ** On each iteration, reduce to M values (~8 bytes each), broadcast N values 
> (1 bit each).
> ** Estimate: D * (M * 8 + N)
> h2. Partitioning Instances
> Algorithm sketch:
> * Train one group of nodes at a time.
> * Invariants:
>  * Each worker stores a mapping: instance → node
> * On each iteration:
> ** Each worker: For each instance, add to aggregate statistics.
> ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
> *** (“# classes” is for classification.  3 for regression)
> ** Reduce aggregate.
> ** Master chooses best split for each node in group and broadcasts.
> * Local training: Once all instances for a node fit on one machine, it can be 
> best to shuffle data and training subtrees locally.  This can mean shuffling 
> the entire dataset for each tree trained.
> * Summing over all iterations, reduce to total of:
> ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
> ** Estimate: 2^D * M * B * C * 8
> h2. Comparing Partitioning Methods
> Partitioning features cost < partitioning instances cost when:
> * D * (M * 8 + N) < 2^D * M * B * C * 8
> * D * N < 2^D * M * B * C * 8  (assuming D * M * 8 is small compared to the 
> right hand side)
> * N < [ 2^D * M * B * C * 8 ] / D
> Example: many instances (these estimates are reproduced by a small calculator sketched after this description):
> * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 
> 5)
> * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
> * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8
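
The two estimates above can be reproduced with a throwaway calculator. It uses
the same assumptions as the text (roughly 8 bytes per reduced value, 1 bit
broadcast per instance); the function and parameter names are mine, not any
Spark API:

{code}
// Back-of-the-envelope communication cost model from the discussion above.
object TreeCommunicationCost {
  // Partitioning by feature: one iteration per level, each reducing M values
  // (~8 bytes) and broadcasting N bits.
  def byFeature(levels: Int, m: Long, n: Long): Double =
    levels.toDouble * (m * 8 + n)

  // Partitioning by instance: ~2^D nodes in total, each needing M x B x C
  // aggregated values of ~8 bytes.
  def byInstance(depth: Int, m: Long, bins: Int, classes: Int): Double =
    math.pow(2, depth) * m * bins * classes * 8

  def main(args: Array[String]): Unit = {
    // The JIRA example: 2M instances, 3500 features, 100 bins, 5 classes, 6 levels.
    println(byFeature(levels = 6, m = 3500, n = 2000000L))            // ~1.2e7
    println(byInstance(depth = 5, m = 3500, bins = 100, classes = 5)) // ~4.5e8
  }
}
{code}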



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4325) Improve spark-ec2 cluster launch times

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205627#comment-14205627
 ] 

Apache Spark commented on SPARK-4325:
-

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/3195

> Improve spark-ec2 cluster launch times
> --
>
> Key: SPARK-4325
> URL: https://issues.apache.org/jira/browse/SPARK-4325
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There are several optimizations we know we can make to [{{setup.sh}} | 
> https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches 
> faster.
> There are also some improvements to the AMIs that will help a lot.
> Potential improvements:
> * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This 
> will reduce or eliminate SSH wait time and Ganglia init time.
> * Replace instances of {{download; rsync to rest of cluster}} with parallel 
> downloads on all nodes of the cluster.
> * Replace instances of 
>  {code}
> for node in $NODES; do
>   command
>   sleep 0.3
> done
> wait{code}
>  with simpler calls to {{pssh}}.
> * Remove the [linear backoff | 
> https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665]
>  when we wait for SSH availability now that we are already waiting for EC2 
> status checks to clear before testing SSH.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4329) Add indexing feature for HistoryPage

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205625#comment-14205625
 ] 

Apache Spark commented on SPARK-4329:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/3194

> Add indexing feature for HistoryPage
> 
>
> Key: SPARK-4329
> URL: https://issues.apache.org/jira/browse/SPARK-4329
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>
> The current HistoryPage only has links to the previous and next pages.
> I suggest adding an index so that history pages can be reached directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205626#comment-14205626
 ] 

Apache Spark commented on SPARK-3398:
-

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/3195

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4329) Add indexing feature for HistoryPage

2014-11-10 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-4329:
-

 Summary: Add indexing feature for HistoryPage
 Key: SPARK-4329
 URL: https://issues.apache.org/jira/browse/SPARK-4329
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Kousuke Saruta


The current HistoryPage only has links to the previous and next pages.
I suggest adding an index so that history pages can be reached directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4327) Python API for RDD.randomSplit()

2014-11-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205603#comment-14205603
 ] 

Apache Spark commented on SPARK-4327:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3193

> Python API for RDD.randomSplit()
> 
>
> Key: SPARK-4327
> URL: https://issues.apache.org/jira/browse/SPARK-4327
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Davies Liu
>Priority: Critical
>
> randomSplit() is used in MLlib to split a dataset into training and test 
> sets.
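
For reference, a usage sketch of the existing Scala RDD.randomSplit that the
Python API would mirror (this is not the code added in PR 3193; names are
illustrative):

{code}
// Usage sketch of the existing Scala RDD.randomSplit.
import org.apache.spark.{SparkConf, SparkContext}

object RandomSplitExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RandomSplitExample").setMaster("local[2]"))
    val data = sc.parallelize(1 to 1000)

    // Weights are normalized if they don't sum to 1; the seed makes the split reproducible.
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 11L)

    println(s"training: ${training.count()}, test: ${test.count()}")
    sc.stop()
  }
}
{code}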



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4319) Enable an ignored test "null count".

2014-11-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4319.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3185
[https://github.com/apache/spark/pull/3185]

> Enable an ignored test "null count".
> 
>
> Key: SPARK-4319
> URL: https://issues.apache.org/jira/browse/SPARK-4319
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Takuya Ueshin
>Priority: Minor
> Fix For: 1.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2014-11-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3461:
---
Assignee: Sandy Ryza  (was: Davies Liu)

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2014-11-10 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205520#comment-14205520
 ] 

Patrick Wendell commented on SPARK-3461:


I think [~sandyr] wanted to take a crack at this, so I'm assigning it to him.

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3495) Block replication fails continuously when the replication target node is dead

2014-11-10 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-3495:
-
Target Version/s: 1.1.1, 1.2.0  (was: 1.2.0)

> Block replication fails continuously when the replication target node is dead
> -
>
> Key: SPARK-3495
> URL: https://issues.apache.org/jira/browse/SPARK-3495
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core, Streaming
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 1.2.0
>
>
> If a block manager (say, A) wants to replicate a block and the node chosen 
> for replication (say, B) is dead, then the attempt to send the block to B 
> fails. However, this continues to fail indefinitely. Even if the driver 
> learns about the demise of B, A continues to try replicating to B and 
> failing miserably. 
> The reason behind this bug is that A initially fetches a list of peers from 
> the driver (when B was active), but never updates it after B is dead. This 
> affects Spark Streaming as its receiver uses block replication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3496) Block replication can by mistake choose driver BlockManager as a peer for replication

2014-11-10 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-3496:
-
Target Version/s: 1.1.1, 1.2.0  (was: 1.2.0)

> Block replication can by mistake choose driver BlockManager as a peer for 
> replication
> -
>
> Key: SPARK-3496
> URL: https://issues.apache.org/jira/browse/SPARK-3496
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core, Streaming
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 1.2.0
>
>
> When selecting peer block managers for replicating a block, the driver block 
> manager can also get chosen accidentally. This is because 
> BlockManagerMasterActor did not filter out the driver block manager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2703) Make Tachyon related unit tests execute without deploying a Tachyon system locally.

2014-11-10 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205486#comment-14205486
 ] 

Patrick Wendell commented on SPARK-2703:


FYI I had to revert this patch because it looked like it broke our maven build

> Make Tachyon related unit tests execute without deploying a Tachyon system 
> locally.
> ---
>
> Key: SPARK-2703
> URL: https://issues.apache.org/jira/browse/SPARK-2703
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Haoyuan Li
>Assignee: Rong Gu
> Fix For: 1.2.0
>
>
> Use the LocalTachyonCluster class in tachyon-test.jar in 0.5.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4328) Python serialization updates make Python ML API more brittle to types

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4328.
--
Resolution: Duplicate

This is covered in the PR for SPARK-4324.

> Python serialization updates make Python ML API more brittle to types
> -
>
> Key: SPARK-4328
> URL: https://issues.apache.org/jira/browse/SPARK-4328
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> In Spark 1.1, you could create a LabeledPoint with labels specified as 
> integers, and then use it with LinearRegression.  This was broken by the 
> Python API updates since then.  E.g., this code runs in the 1.1 branch but 
> not in the current master:
> {code}
> from pyspark.mllib.regression import *
> import numpy
> features = numpy.ndarray((3))
> data = sc.parallelize([LabeledPoint(1, features)])
> LinearRegressionWithSGD.train(data)
> {code}
> Recommendation: Allow users to use integers from Python.
> The error message you get is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o55.trainLinearRegressionModelWithSGD.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
> in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 
> (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot 
> be cast to java.lang.Double
>   at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
>   at 
> org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
>   at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
>   at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
>   at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
>   at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
>   at 
> org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
>   at 
> org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1405:
-
Priority: Critical  (was: Major)

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, instead of using optimization algorithms such as gradient descent, LDA 
> uses expectation algorithms such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already solved), word segmentation (imported from Lucene), 
> and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1405:
-
Target Version/s: 1.3.0  (was: 1.2.0)

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, instead of using optimization algorithms such as gradient descent, LDA 
> uses expectation algorithms such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already solved), word segmentation (imported from Lucene), 
> and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2199:
-
Assignee: Valeriy Avanesov

> Distributed probabilistic latent semantic analysis in MLlib
> ---
>
> Key: SPARK-2199
> URL: https://issues.apache.org/jira/browse/SPARK-2199
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Denis Turdakov
>Assignee: Valeriy Avanesov
>  Labels: features
>
> Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
> topics from a text corpus. PLSA was historically a predecessor of LDA. However, 
> recent research shows that modifications of PLSA sometimes perform better 
> than LDA[1]. Furthermore, the most recent paper by the same authors shows that 
> there is a clear way to extend PLSA to LDA and beyond[2].
> We should implement distributed versions of PLSA. In addition, it should be 
> possible to easily add user-defined regularizers or combinations of them. We 
> will implement regularizers that allow us to:
> * extract sparse topics
> * extract human-interpretable topics
> * perform semi-supervised training
> * sort out non-topic-specific terms.
> [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
> Proceedings of ECIR'13.
> [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
> Regularization for Stochastic Matrix Factorization. 
> http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1473) Feature selection for high dimensional datasets

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1473:
-
Target Version/s: 1.3.0  (was: 1.2.0)

> Feature selection for high dimensional datasets
> ---
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ignacio Zendejas
>Assignee: Alexander Ulanov
>Priority: Minor
>  Labels: features
>
> For classification tasks involving large feature spaces in the order of tens 
> of thousands or higher (e.g., text classification with n-grams, where n > 1), 
> it is often useful to rank and filter features that are irrelevant thereby 
> reducing the feature space by at least one or two orders of magnitude without 
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least 
> two methods should be implemented, with Information Gain being a priority as 
> it has been shown to be among the most reliable (a toy information-gain 
> calculation is sketched after this description).
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below) which are more practical for lower 
> dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection.*The Journal of Machine Learning Research*, *13*, 27-66.
> * Forman, George. "An extensive empirical study of feature selection metrics 
> for text classification." The Journal of machine learning research 3 (2003): 
> 1289-1305.
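
A toy, single-machine sketch of the information-gain (mutual information)
ranking mentioned above, computed from co-occurrence counts between one binary
feature and the class label; this is not an MLlib API, and all names are
illustrative:

{code}
// Toy sketch: I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ),
// computed from joint counts of (feature present/absent, label).
object InfoGainSketch {
  def mutualInformation(jointCounts: Map[(Int, Int), Long]): Double = {
    val total = jointCounts.values.sum.toDouble
    val pxy = jointCounts.mapValues(_ / total)
    val px = pxy.groupBy(_._1._1).mapValues(_.values.sum)  // marginal over the feature
    val py = pxy.groupBy(_._1._2).mapValues(_.values.sum)  // marginal over the label
    pxy.collect { case ((x, y), p) if p > 0 =>
      p * math.log(p / (px(x) * py(y)))
    }.sum
  }

  def main(args: Array[String]): Unit = {
    // (feature, label) counts for one hypothetical n-gram feature.
    val counts = Map((1, 1) -> 30L, (1, 0) -> 5L, (0, 1) -> 20L, (0, 0) -> 45L)
    println(f"information gain = ${mutualInformation(counts)}%.4f nats")
  }
}
{code}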



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1486:
-
Target Version/s: 1.3.0  (was: 1.2.0)

> Support multi-model training in MLlib
> -
>
> Key: SPARK-1486
> URL: https://issues.apache.org/jira/browse/SPARK-1486
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>Priority: Critical
>
> It is rare in practice to train just one model with a given set of 
> parameters. Usually, this is done by training multiple models with different 
> sets of parameters and then select the best based on their performance on the 
> validation set. MLlib should provide native support for multi-model 
> training/scoring. It requires decoupling of concepts like problem, 
> formulation, algorithm, parameter set, and model, which are missing in MLlib 
> now. MLI implements similar concepts, which we can borrow. There are 
> different approaches for multi-model training:
> 0) Keep one copy of the data, and train models one after another (or maybe in 
> parallel, depending on the scheduler); a sketch of this mode follows after 
> this description.
> 1) Keep one copy of the data, and train multiple models at the same time 
> (similar to `runs` in KMeans).
> 2) Make multiple copies of the data (still stored distributively), and use 
> more cores to distribute the work.
> 3) Collect the data, make the entire dataset available on workers, and train 
> one or more models on each worker.
> Users should be able to choose which execution mode they want to use. Note 
> that 3) could cover many use cases in practice when the training data is not 
> huge, e.g., <1GB.
> This task will be divided into sub-tasks and this JIRA is created to discuss 
> the design and track the overall progress.
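
A sketch of approach 0) over one cached copy of the data, using existing MLlib
pieces (logistic regression with SGD, scored by AUC on a validation split);
this only illustrates the execution mode, not a proposed API, and the object
and variable names are illustrative:

{code}
// Train one model per parameter setting, sequentially, over the same cached data.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

object MultiModelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MultiModelSketch"))
    val data = MLUtils.loadLibSVMFile(sc, args(0))
    val Array(train, validation) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    train.cache()
    validation.cache()

    val numIterationsGrid = Seq(50, 100, 200)
    val scored = for (numIterations <- numIterationsGrid) yield {
      val model = LogisticRegressionWithSGD.train(train, numIterations)
      model.clearThreshold()  // predict raw scores instead of 0/1 labels
      val scoreAndLabel = validation.map(p => (model.predict(p.features), p.label))
      (numIterations, new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC())
    }

    scored.foreach { case (iters, auc) => println(s"numIterations=$iters AUC=$auc") }
    sc.stop()
  }
}
{code}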



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2199:
-
Target Version/s: 1.3.0  (was: 1.2.0)

> Distributed probabilistic latent semantic analysis in MLlib
> ---
>
> Key: SPARK-2199
> URL: https://issues.apache.org/jira/browse/SPARK-2199
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Denis Turdakov
>  Labels: features
>
> Probabilistic latent semantic analysis (PLSA) is a topic model that extracts 
> topics from a text corpus. PLSA was historically a predecessor of LDA, but 
> recent research shows that modifications of PLSA sometimes perform better 
> than LDA [1]. Furthermore, the most recent paper by the same authors shows 
> that there is a clear way to extend PLSA to LDA and beyond [2].
> We should implement distributed versions of PLSA. In addition, it should be 
> possible to easily add user-defined regularizers or combinations of them. We 
> will implement regularizers that allow users to:
> * extract sparse topics
> * extract human-interpretable topics
> * perform semi-supervised training
> * filter out non-topic-specific terms.
> [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
> Proceedings of ECIR'13.
> [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
> Regularization for Stochastic Matrix Factorization. 
> http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 
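
To make the user-defined regularizer requirement concrete, here is a minimal 
sketch of what a pluggable regularizer contract might look like, loosely 
following the additive regularization idea in [2]. The trait and class names 
are hypothetical and the gradients are only illustrative:

{code}
// Hypothetical sketch of a pluggable PLSA regularizer in the spirit of the
// additive regularization framework of [2]: each regularizer contributes its
// gradient with respect to phi(w|t) to the M-step update, and regularizers
// compose by summing their gradients.
trait TopicRegularizer {
  /** Gradient of the regularizer with respect to phi(w|t), indexed as (t)(w). */
  def gradient(phi: Array[Array[Double]]): Array[Array[Double]]
}

/** Illustrative sparsing regularizer R = -tau * sum ln phi(w|t). */
class SparsityRegularizer(tau: Double) extends TopicRegularizer {
  def gradient(phi: Array[Array[Double]]): Array[Array[Double]] =
    phi.map(_.map(p => -tau / math.max(p, 1e-12)))
}

/** Sum of several regularizers, so users can combine them freely. */
class CombinedRegularizer(rs: Seq[TopicRegularizer]) extends TopicRegularizer {
  def gradient(phi: Array[Array[Double]]): Array[Array[Double]] = {
    val grads = rs.map(_.gradient(phi))
    phi.indices.map { t =>
      phi(t).indices.map(w => grads.map(_(t)(w)).sum).toArray
    }.toArray
  }
}
{code}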



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4314) Exception throws when finding new files like intermediate result(_COPYING_ file) through hdfs interface

2014-11-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205477#comment-14205477
 ] 

Sean Owen commented on SPARK-4314:
--

To clarify, the suffix is _COPYING_, right? Yeah, that's the standard naming 
for a destination file during a copy; it's renamed on completion. I think 
you're seeing that the file was present when the dir was listed but not when 
it was subsequently read. The file could also have been encountered in an 
intermediate, partly written state.

Normally I'd say "just don't do that" but the point of streaming is to read a 
data source that is being written to with new files. I think this and other 
streaming processes that read HDFS should filter out "*._COPYING_" files, yes.
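
For reference, the fileStream overload that takes a path filter should already 
allow a user-side workaround along these lines (a sketch only; the directory 
and batch interval are placeholders, and sc is assumed to be an existing 
SparkContext):

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch of a user-side workaround: textFileStream is fileStream plus a map,
// so the filter overload can be used directly to skip files still being copied.
// The directory and batch interval are placeholders; sc is an existing SparkContext.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs://master:9000/user/spark/input",
    (path: Path) => !path.getName.endsWith("._COPYING_"),
    true)  // newFilesOnly
  .map(_._2.toString)
{code}

Baking such a filter into the default behavior would save every streaming user 
from having to do this by hand.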

> Exception throws when finding new files like intermediate result(_COPYING_ 
> file) through hdfs interface
> ---
>
> Key: SPARK-4314
> URL: https://issues.apache.org/jira/browse/SPARK-4314
> Project: Spark
>  Issue Type: Bug
>Reporter: maji2014
>
> [Reproduce]
>  1. Run the HdfsWordCount example, which uses "ssc.textFileStream(args(0))"
>  2. Upload a file to HDFS
>  3. The exception below is thrown.
> [Exception stack]
>  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
> master/192.168.84.142:9000 from ocdc sending #13
>  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
> 1415611274000 ms
>  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: hdfs://master:9000/user/spark/200.COPYING
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
>  at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>  at scala.Option.getOrElse(Option.scala:120)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at 
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
>  at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>  at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>  at 
> org.apache.spark.streaming

[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3181:
-
Assignee: Fan Jiang

> Add Robust Regression Algorithm with Huber Estimator
> 
>
> Key: SPARK-3181
> URL: https://issues.apache.org/jira/browse/SPARK-3181
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors are normally distributed and 
> can behave badly when the errors are heavy-tailed. In practice we encounter 
> various types of data, so we need to include robust regression to employ a 
> fitting criterion that is not as vulnerable to outliers as least squares.
> In 1973, Huber introduced M-estimation ("maximum likelihood type" estimation) 
> for regression. The method is resistant to outliers in the response variable 
> and has been widely used.
> The new feature for MLlib will contain 3 new files
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class HuberRobustGradient in 
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
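
For readers unfamiliar with the Huber estimator, here is a self-contained 
sketch of the Huber loss and its gradient for a linear model; the function 
name, signature, and the default delta are illustrative only and do not 
reflect the proposed HuberRobustGradient class:

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch of the Huber loss and gradient for a linear model prediction = w . x.
// delta is the robustness threshold (1.345 is a common default). Returns the
// gradient with respect to the weights and the loss value. Illustrative only.
def huberGradient(x: Vector, label: Double, weights: Vector,
                  delta: Double = 1.345): (Array[Double], Double) = {
  val xArr = x.toArray
  val residual = label - xArr.zip(weights.toArray).map { case (xi, wi) => xi * wi }.sum
  if (math.abs(residual) <= delta) {
    // quadratic region: identical to ordinary least squares
    (xArr.map(xi => -residual * xi), 0.5 * residual * residual)
  } else {
    // linear region: the gradient magnitude is capped, limiting outlier influence
    (xArr.map(xi => -delta * math.signum(residual) * xi),
      delta * (math.abs(residual) - 0.5 * delta))
  }
}

// e.g. huberGradient(Vectors.dense(1.0, 2.0), 10.0, Vectors.dense(0.1, 0.1))
{code}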



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2309) Generalize the binary logistic regression into multinomial logistic regression

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2309:
-
Target Version/s: 1.3.0  (was: 1.2.0)

> Generalize the binary logistic regression into multinomial logistic regression
> --
>
> Key: SPARK-2309
> URL: https://issues.apache.org/jira/browse/SPARK-2309
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> Currently, there is no multi-class classifier in MLlib. Logistic regression 
> can be extended to a multinomial one straightforwardly. 
> The following formula will be implemented. 
> http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25
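
For readers who cannot access the slides, the usual pivot-class formulation 
that multinomial logistic regression is based on can be sketched as follows 
(a generic sketch of the standard model, not necessarily the exact formulation 
in the slides):

{code}
// Generic sketch of the pivot-class multinomial logistic model with K classes:
// class 0 is the reference class and each remaining class k has its own weight
// vector w(k - 1). Not necessarily the exact formulation in the slides above.
def classProbabilities(x: Array[Double], w: Array[Array[Double]]): Array[Double] = {
  // margins(k) = x . w(k) for the K - 1 non-reference classes
  val margins = w.map(wk => wk.zip(x).map { case (wi, xi) => wi * xi }.sum)
  val expMargins = margins.map(math.exp)
  val norm = 1.0 + expMargins.sum
  // returns P(y = 0 | x) followed by P(y = k | x) for k = 1 .. K - 1
  (1.0 +: expMargins).map(_ / norm)
}
{code}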



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3181:
-
Target Version/s: 1.3.0  (was: 1.2.0)

> Add Robust Regression Algorithm with Huber Estimator
> 
>
> Key: SPARK-3181
> URL: https://issues.apache.org/jira/browse/SPARK-3181
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>  Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors are normally distributed and 
> can behave badly when the errors are heavy-tailed. In practice we encounter 
> various types of data, so we need to include robust regression to employ a 
> fitting criterion that is not as vulnerable to outliers as least squares.
> In 1973, Huber introduced M-estimation ("maximum likelihood type" estimation) 
> for regression. The method is resistant to outliers in the response variable 
> and has been widely used.
> The new feature for MLlib will contain 3 new files
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class HuberRobustGradient in 
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3218) K-Means clusterer can fail on degenerate data

2014-11-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3218:
-
Target Version/s: 1.3.0  (was: 1.2.0)

> K-Means clusterer can fail on degenerate data
> -
>
> Key: SPARK-3218
> URL: https://issues.apache.org/jira/browse/SPARK-3218
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Assignee: Derrick Burns
>
> The KMeans parallel implementation selects points to be cluster centers with 
> probability weighted by their distance to cluster centers.  However, if there 
> are fewer than k DISTINCT points in the data set, this approach will fail.  
> Further, the recent check-in to work around this problem results in the same 
> point being selected repeatedly as a cluster center. 
> The fix is to allow fewer than k cluster centers to be selected.  This 
> requires several changes to the code, as the number of cluster centers is 
> woven into the implementation.
> I have a version of the code that addresses this problem, AND generalizes the 
> distance metric.  However, I see that there are literally hundreds of 
> outstanding pull requests.  If someone will commit to working with me to 
> sponsor the pull request, I will create it.
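
A minimal sketch of the proposed behavior, in which the number of requested 
centers is simply capped by the number of distinct points; the helper name and 
sampling strategy are hypothetical and not the actual patch:

{code}
import org.apache.spark.rdd.RDD

// Hypothetical helper illustrating the proposed behavior: never request more
// centers than there are distinct points, so initialization cannot fail or
// repeatedly pick the same point. Assumes the point type has value-based
// equals/hashCode so that distinct() works as intended.
def chooseInitialCenters[T](data: RDD[T], k: Int): Array[T] = {
  val distinctPoints = data.distinct()
  val effectiveK = math.min(k.toLong, distinctPoints.count()).toInt
  // sampling without replacement from distinct points avoids duplicate centers
  distinctPoints.takeSample(false, effectiveK, seed = 42L)
}
{code}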



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4328) Python serialization updates make Python ML API more brittle to types

2014-11-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205457#comment-14205457
 ] 

Joseph K. Bradley commented on SPARK-4328:
--

Both are related to the Python API SerDe updates.

> Python serialization updates make Python ML API more brittle to types
> -
>
> Key: SPARK-4328
> URL: https://issues.apache.org/jira/browse/SPARK-4328
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> In Spark 1.1, you could create a LabeledPoint with labels specified as 
> integers, and then use it with LinearRegression.  This was broken by the 
> Python API updates since then.  E.g., this code runs in the 1.1 branch but 
> not in the current master:
> {code}
> from pyspark.mllib.regression import *
> import numpy
> features = numpy.ndarray((3))
> data = sc.parallelize([LabeledPoint(1, features)])
> LinearRegressionWithSGD.train(data)
> {code}
> Recommendation: Allow users to use integers from Python.
> The error message you get is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o55.trainLinearRegressionModelWithSGD.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
> in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 
> (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot 
> be cast to java.lang.Double
>   at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
>   at 
> org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
>   at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
>   at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
>   at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
>   at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
>   at 
> org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
>   at 
> org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
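
One possible direction is for the Scala-side SerDe to coerce any 
java.lang.Number to Double instead of casting. The helper below only 
illustrates that idea and is not the actual SerDe code; on the Python side, an 
immediate workaround is to pass float labels, e.g. LabeledPoint(1.0, features):

{code}
// Illustrative only, not the actual SerDe code: coerce any numeric label
// coming from Python to Double instead of casting, so Python ints are accepted.
def toDouble(value: Any): Double = value match {
  case n: java.lang.Number => n.doubleValue()  // covers Integer, Long, Double, ...
  case other => throw new IllegalArgumentException(
    s"expected a numeric label but got ${other.getClass.getName}")
}
{code}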



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4328) Python serialization updates make Python ML API more brittle to types

2014-11-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205458#comment-14205458
 ] 

Joseph K. Bradley commented on SPARK-4328:
--

[~atalwalkar]  Thanks for pointing this out!

> Python serialization updates make Python ML API more brittle to types
> -
>
> Key: SPARK-4328
> URL: https://issues.apache.org/jira/browse/SPARK-4328
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> In Spark 1.1, you could create a LabeledPoint with labels specified as 
> integers, and then use it with LinearRegression.  This was broken by the 
> Python API updates since then.  E.g., this code runs in the 1.1 branch but 
> not in the current master:
> {code}
> from pyspark.mllib.regression import *
> import numpy
> features = numpy.ndarray((3))
> data = sc.parallelize([LabeledPoint(1, features)])
> LinearRegressionWithSGD.train(data)
> {code}
> Recommendation: Allow users to use integers from Python.
> The error message you get is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o55.trainLinearRegressionModelWithSGD.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
> in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 
> (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot 
> be cast to java.lang.Double
>   at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
>   at 
> org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
>   at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
>   at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
>   at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
>   at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
>   at 
> org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
>   at 
> org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


