[jira] [Commented] (SPARK-6320) Adding new query plan strategy to SQLContext

2015-03-17 Thread Youssef Hatem (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365274#comment-14365274
 ] 

Youssef Hatem commented on SPARK-6320:
--

Thank you Michael for your response,

Actually I am trying to extend the SQL syntax with custom constructs, which would 
require altering the workflow of {{SparkSQL}} all the way from the lexer through 
optimization and physical planning.

I thought this should be possible without having to extend many classes. 
However, not being able to use {{planLater}} forces me into a seemingly 
unnecessary extension of {{SparkPlanner}}, especially since there already seems 
to be some logic for handling these scenarios (i.e. the {{extraStrategies}} sequence).
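
For concreteness, a rough sketch of the extension path being described, assuming Spark 1.3-era internals ({{SparkPlanner}} is an inner class of {{SQLContext}} and {{planLater}} comes from {{QueryPlanner}}); modifiers and member names are approximate, and the custom plan nodes exist only as placeholders in the comment:

{code}
package org.apache.spark.sql  // needed to override the protected[sql] planner field

import org.apache.spark.SparkContext
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

class MySQLContext(sc: SparkContext) extends SQLContext(sc) {
  @transient
  override protected[sql] val planner = new SparkPlanner {
    // Nesting the strategy inside the planner is what puts planLater in scope.
    object MyStrategy extends Strategy {
      def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        // A real strategy would match custom logical nodes here and plan their
        // children via planLater, e.g.
        //   case MyLogicalNode(child) => MyPhysicalNode(planLater(child)) :: Nil
        case _ => Nil
      }
    }
    override def strategies: Seq[Strategy] = MyStrategy +: super.strategies
  }
}
{code}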

 Adding new query plan strategy to SQLContext
 

 Key: SPARK-6320
 URL: https://issues.apache.org/jira/browse/SPARK-6320
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Youssef Hatem
Priority: Minor

 Hi,
 I would like to add a new strategy to {{SQLContext}}. To do this I created a 
 new class which extends {{Strategy}}. In my new class I need to call 
 {{planLater}} function. However this method is defined in {{SparkPlanner}} 
 (which itself inherits the method from {{QueryPlanner}}).
 To my knowledge, the only way to make the {{planLater}} function visible to my 
 new strategy is to define the strategy inside another class that extends 
 {{SparkPlanner}} (and thus inherits {{planLater}}). Doing so means I also have 
 to extend {{SQLContext}} so that I can override the {{planner}} field with the 
 new {{Planner}} class I created.
 This seems like a design problem, because adding a new strategy appears to 
 require extending {{SQLContext}} (unless I am doing it wrong and there is a 
 better way to do it).
 Thanks a lot,
 Youssef






[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365307#comment-14365307
 ] 

Yin Huai commented on SPARK-6250:
-

I see. The problem is that the data type string returned by Hive does not have 
backticks. Will take care of it soon.

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Blocker








[jira] [Created] (SPARK-6394) cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler

2015-03-17 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-6394:
--

 Summary: cleanup BlockManager companion object and improve the 
getCacheLocs method in DAGScheduler
 Key: SPARK-6394
 URL: https://issues.apache.org/jira/browse/SPARK-6394
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Wenchen Fan
Priority: Minor


The current implementation of getCacheLocs includes searching a HashMap many 
times; we can avoid this.
Also, BlockManager.blockIdsToExecutorIds isn't called anywhere, so we can remove it. 
Finally, we can combine BlockManager.blockIdsToHosts and blockIdsToBlockManagers 
into a single method in order to remove some unnecessary layers of indirection 
/ collection creation.
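
For illustration, the kind of change this implies for the HashMap accesses, sketched with simplified types rather than the actual DAGScheduler code:

{code}
import scala.collection.mutable

// Simplified stand-ins for the DAGScheduler internals, to show the pattern only.
object CacheLocsSketch {
  private val cacheLocs = new mutable.HashMap[Int, Seq[String]]

  private def computeLocs(rddId: Int): Seq[String] = Seq.empty  // placeholder lookup

  // One getOrElseUpdate replaces a contains check plus a get plus an update,
  // i.e. a single HashMap search instead of several.
  def getCacheLocs(rddId: Int): Seq[String] =
    cacheLocs.getOrElseUpdate(rddId, computeLocs(rddId))
}
{code}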






[jira] [Commented] (SPARK-6200) Support dialect in SQL

2015-03-17 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366575#comment-14366575
 ] 

Michael Armbrust commented on SPARK-6200:
-

Thanks for taking the time to explain your design!  I guess at a high level I 
don't see the value in having a configurable DialectManager.  Perhaps I am 
missing some use case.

That said, what do you think about the simpler implementation that I linked to 
below?  Would that suit your use case?

 Support dialect in SQL
 --

 Key: SPARK-6200
 URL: https://issues.apache.org/jira/browse/SPARK-6200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: haiyang

 Created a new dialect manager: it supports a dialect command and adding new 
 dialects via SQL statements, etc.






[jira] [Commented] (SPARK-6394) cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366622#comment-14366622
 ] 

Apache Spark commented on SPARK-6394:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/5043

 cleanup BlockManager companion object and improve the getCacheLocs method in 
 DAGScheduler
 -

 Key: SPARK-6394
 URL: https://issues.apache.org/jira/browse/SPARK-6394
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Wenchen Fan
Priority: Minor

 The current implementation of getCacheLocs includes searching a HashMap many 
 times; we can avoid this.
 Also, BlockManager.blockIdsToExecutorIds isn't called anywhere, so we can remove 
 it. Finally, we can combine BlockManager.blockIdsToHosts and 
 blockIdsToBlockManagers into a single method in order to remove some 
 unnecessary layers of indirection / collection creation.






[jira] [Updated] (SPARK-6145) ORDER BY fails to resolve nested fields

2015-03-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6145:

Target Version/s: 1.3.1  (was: 1.3.0)

 ORDER BY fails to resolve nested fields
 ---

 Key: SPARK-6145
 URL: https://issues.apache.org/jira/browse/SPARK-6145
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.3.0


 {code}
 sqlContext.jsonRDD(sc.parallelize(
   """{"a": {"b": 1}, "c": 1}""" :: Nil)).registerTempTable("nestedOrder")
 // Works
 sqlContext.sql("SELECT 1 FROM nestedOrder ORDER BY c")
 // Fails now
 sqlContext.sql("SELECT 1 FROM nestedOrder ORDER BY a.b")
 // Fails now
 sqlContext.sql("SELECT a.b FROM nestedOrder ORDER BY a.b")
 {code}
 Relatedly, the error message for bad field accesses should also include the name 
 of the field in question.






[jira] [Resolved] (SPARK-6299) ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL.

2015-03-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6299.

   Resolution: Fixed
Fix Version/s: 1.3.1
   1.4.0
 Assignee: Kevin (Sangwoo) Kim

 ClassNotFoundException in standalone mode when running groupByKey with class 
 defined in REPL.
 -

 Key: SPARK-6299
 URL: https://issues.apache.org/jira/browse/SPARK-6299
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.0, 1.2.1
Reporter: Kevin (Sangwoo) Kim
Assignee: Kevin (Sangwoo) Kim
 Fix For: 1.4.0, 1.3.1


 Anyone can reproduce this issue with the code below
 (it runs well in local mode but throws the exception below on a cluster;
 it also runs well in Spark 1.1.1):
 case class ClassA(value: String)
 val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2"))))
 rdd.groupByKey.collect
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 
 in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 
 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): 
 java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:274)
 at 
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
 at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 at 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
 at 
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
 at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 

[jira] [Updated] (SPARK-6299) ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL.

2015-03-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6299:
---
Description: 
Anyone can reproduce this issue with the code below
(it runs well in local mode but throws the exception below on a cluster;
it also runs well in Spark 1.1.1)

{code}
case class ClassA(value: String)
val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2"))))
rdd.groupByKey.collect
{code}


{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 in 
stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 1.0 
(TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): 
java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at 

[jira] [Resolved] (SPARK-6044) RDD.aggregate() should not use the closure serializer on the zero value

2015-03-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6044.

Resolution: Fixed
  Assignee: Sean Owen

 RDD.aggregate() should not use the closure serializer on the zero value
 ---

 Key: SPARK-6044
 URL: https://issues.apache.org/jira/browse/SPARK-6044
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Matt Cheah
Assignee: Sean Owen
 Fix For: 1.4.0


 PairRDDFunctions.aggregateByKey() correctly uses 
 SparkEnv.get.serializer.newInstance() to serialize the zero value. It seems 
 this logic is not mirrored in RDD.aggregate(), which computes the aggregation 
 and returns the aggregation directly at the driver. We should change 
 RDD.aggregate() to make this consistent; I ran into some serialization errors 
 because I was expecting RDD.aggregate() to Kryo serialize the zero value.
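
A sketch of the requested behaviour (not the actual RDD.aggregate source): round-trip the zero value through SparkEnv.get.serializer, mirroring aggregateByKey, instead of relying on the closure serializer.

{code}
import java.nio.ByteBuffer
import scala.reflect.ClassTag

import org.apache.spark.SparkEnv
import org.apache.spark.rdd.RDD

def aggregateWithConfiguredSerializer[T, U: ClassTag](rdd: RDD[T], zeroValue: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U): U = {
  // Serialize the zero value once on the driver with the configured serializer
  // (e.g. Kryo), not with the closure serializer.
  val buf = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroBytes = new Array[Byte](buf.remaining())
  buf.get(zeroBytes)

  val partResults = rdd.mapPartitions { iter =>
    // Each partition starts from its own deserialized copy of the zero value.
    val zero = SparkEnv.get.serializer.newInstance()
      .deserialize[U](ByteBuffer.wrap(zeroBytes))
    Iterator(iter.foldLeft(zero)(seqOp))
  }
  partResults.collect().foldLeft(zeroValue)(combOp)
}
{code}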






[jira] [Commented] (SPARK-6085) Increase default value for memory overhead

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364740#comment-14364740
 ] 

Apache Spark commented on SPARK-6085:
-

User 'jongyoul' has created a pull request for this issue:
https://github.com/apache/spark/pull/5065

 Increase default value for memory overhead
 --

 Key: SPARK-6085
 URL: https://issues.apache.org/jira/browse/SPARK-6085
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Ted Yu
Assignee: Ted Yu
Priority: Minor
 Fix For: 1.4.0


 Several users have reported that the current default memory overhead value 
 resulted in failed computations in Spark on YARN.
 See this thread:
 http://search-hadoop.com/m/JW1q58FDel
 Increasing the default value for memory overhead would improve the out-of-the-box 
 user experience.
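
Until the default changes, the usual workaround is to raise the overhead explicitly. A sketch (the key below is the YARN setting of this Spark generation and its value is in MB):

{code}
import org.apache.spark.SparkConf

// Bump the off-heap overhead that YARN adds on top of the executor heap.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.yarn.executor.memoryOverhead", "1024")  // MB
{code}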






[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-17 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364764#comment-14364764
 ] 

Antony Mayi commented on SPARK-6334:


users: 12.5 million
ratings: 3.3 billion
rank: 50
iters: 15

 spark-local dir not getting cleared during ALS
 --

 Key: SPARK-6334
 URL: https://issues.apache.org/jira/browse/SPARK-6334
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Antony Mayi
 Attachments: als-diskusage.png


 when running a bigger ALS training job, Spark spills loads of temp data into the 
 local dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running 
 on YARN from CDH 5.3.2), eventually causing all the disks on all nodes to run 
 out of space (in my case I have 12TB of available disk capacity before 
 kicking off the ALS, but it all gets used, and YARN kills the containers when 
 they reach 90%).
 even with all the recommended options (configuring checkpointing and forcing GC 
 when possible) it still doesn't get cleared.
 here is my (pseudo)code (pyspark):
 {code}
 sc.setCheckpointDir('/tmp')
 training = 
 sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
 model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
 sc._jvm.System.gc()
 {code}
 the training RDD has about 3.5 billion items (~60GB on disk). after about 
 6 hours the ALS consumes all 12TB of disk space in local-dir data and 
 gets killed. my cluster has 192 cores and 1.5TB RAM, and for this task I am using 
 37 executors of 4 cores / 28+4GB RAM each.
 this is the graph of disk consumption pattern showing the space being all 
 eaten from 7% to 90% during the ALS (90% is when YARN kills the container):
 !als-diskusage.png!






[jira] [Commented] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples

2015-03-17 Thread Marko Bonaci (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364700#comment-14364700
 ] 

Marko Bonaci commented on SPARK-6370:
-

Isn't the {{fraction}} parameter the size of the sample (the percentage of elements 
you want to sample from the calling RDD)?
Because sampling without replacement works as though it is.
If not, could you please explain?

 RDD sampling with replacement intermittently yields incorrect number of 
 samples
 ---

 Key: SPARK-6370
 URL: https://issues.apache.org/jira/browse/SPARK-6370
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
Reporter: Marko Bonaci
  Labels: PoissonSampler, sample, sampler

 Here's the repl output:
 {code:java}
 scala> uniqueIds.collect
 res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 
 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample 
 at <console>:27
 scala> swr.count
 res17: Long = 16
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample 
 at <console>:27
 scala> swr.count
 res18: Long = 8
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample 
 at <console>:27
 scala> swr.count
 res19: Long = 18
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample 
 at <console>:27
 scala> swr.count
 res20: Long = 15
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample 
 at <console>:27
 scala> swr.count
 res21: Long = 11
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample 
 at <console>:27
 scala> swr.count
 res22: Long = 10
 {code}






[jira] [Commented] (SPARK-5954) Add topByKey to pair RDDs

2015-03-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364703#comment-14364703
 ] 

Reynold Xin commented on SPARK-5954:


Shouldn't this just be an API in DataFrame?

 Add topByKey to pair RDDs
 -

 Key: SPARK-5954
 URL: https://issues.apache.org/jira/browse/SPARK-5954
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Xiangrui Meng
Assignee: Shuo Xiang

 `topByKey(num: Int): RDD[(K, V)]` finds the top-k values for each key in a 
 pair RDD. This is used, e.g., in computing top recommendations. We can use 
 the Guava implementation of finding top-k from an iterator. See also 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Utils.scala.
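
A minimal illustration of the intended semantics; this sketch uses aggregateByKey directly, whereas the proposed implementation would use Guava's top-k-from-iterator helper mentioned above:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object TopByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("topByKey-sketch"))
    val num = 2
    val scores = sc.parallelize(Seq(("item1", 5.0), ("item1", 3.0), ("item1", 4.0), ("item2", 2.0)))
    val topK = scores.aggregateByKey(List.empty[Double])(
      (acc, v) => (v :: acc).sorted(Ordering[Double].reverse).take(num),
      (a, b) => (a ++ b).sorted(Ordering[Double].reverse).take(num))
    topK.collect().foreach(println)  // (item1,List(5.0, 4.0)), (item2,List(2.0))
    sc.stop()
  }
}
{code}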






[jira] [Comment Edited] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples

2015-03-17 Thread Marko Bonaci (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364700#comment-14364700
 ] 

Marko Bonaci edited comment on SPARK-6370 at 3/17/15 7:17 AM:
--

Isn't the {{fraction}} parameter the size of the sample (the percentage of elements 
you want to sample from the calling RDD)?
Because sampling without replacement works as though it is.
If not, could you please explain (and perhaps improve the scaladoc)?


was (Author: mbonaci):
Isn't the {{fraction}} parameter the size of the sample (the percentage of elements 
you want to sample from the calling RDD)?
Because sampling without replacement works as though it is.
If not, could you please explain?

 RDD sampling with replacement intermittently yields incorrect number of 
 samples
 ---

 Key: SPARK-6370
 URL: https://issues.apache.org/jira/browse/SPARK-6370
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
Reporter: Marko Bonaci
  Labels: PoissonSampler, sample, sampler

 Here's the repl output:
 {code:java}
 scala> uniqueIds.collect
 res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 
 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample 
 at <console>:27
 scala> swr.count
 res17: Long = 16
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample 
 at <console>:27
 scala> swr.count
 res18: Long = 8
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample 
 at <console>:27
 scala> swr.count
 res19: Long = 18
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample 
 at <console>:27
 scala> swr.count
 res20: Long = 15
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample 
 at <console>:27
 scala> swr.count
 res21: Long = 11
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample 
 at <console>:27
 scala> swr.count
 res22: Long = 10
 {code}






[jira] [Updated] (SPARK-6380) Resolution of equi-join key in post-join projection

2015-03-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6380:
---
Description: 
{code}
df1.join(df2, df1("key") === df2("key")).select("key")
{code}

It would be great to just resolve "key" to df1("key") in the case of inner joins.


  was:
df1.join(df2, df1("key") === df2("key")).select("key")

It would be great to just resolve "key" to df1("key") in the case of inner joins.



 Resolution of equi-join key in post-join projection
 ---

 Key: SPARK-6380
 URL: https://issues.apache.org/jira/browse/SPARK-6380
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin

 {code}
 df1.join(df2, df1("key") === df2("key")).select("key")
 {code}
 It would be great to just resolve "key" to df1("key") in the case of inner 
 joins.
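
For reference, the disambiguation that currently has to be written by hand (df1 and df2 as in the snippet above), which is exactly what resolving the bare column would make unnecessary for inner equi-joins:

{code}
// Qualify the projected column explicitly today; after this change,
// select("key") alone would resolve to df1("key") for an inner equi-join.
df1.join(df2, df1("key") === df2("key")).select(df1("key"))
{code}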






[jira] [Commented] (SPARK-6281) Support incremental updates for Graph

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364787#comment-14364787
 ] 

Apache Spark commented on SPARK-6281:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/5067

 Support incremental updates for Graph
 -

 Key: SPARK-6281
 URL: https://issues.apache.org/jira/browse/SPARK-6281
 Project: Spark
  Issue Type: New Feature
  Components: GraphX
Reporter: Takeshi Yamamuro
Priority: Minor

 Add an API to efficiently append new vertices and edges to an existing Graph,
 e.g., Graph#append(newVerts: RDD[(VertexId, VD)], newEdges: RDD[Edge[ED]], 
 defaultVertexAttr: VD).
 This is useful for time-evolving graphs: new vertices and edges are built from
 streaming data via Spark Streaming and then incrementally appended
 to an existing graph.
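
For contrast, the naive approach available today is a full rebuild, which is exactly the cost an incremental append would avoid; a sketch:

{code}
import scala.reflect.ClassTag

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Full rebuild: union the vertex and edge RDDs and reconstruct the graph from scratch.
def appendNaive[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    newVerts: RDD[(VertexId, VD)],
    newEdges: RDD[Edge[ED]],
    defaultVertexAttr: VD): Graph[VD, ED] =
  Graph(g.vertices.union(newVerts), g.edges.union(newEdges), defaultVertexAttr)
{code}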






[jira] [Created] (SPARK-6380) Resolution of equi-join key in post-join projection

2015-03-17 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-6380:
--

 Summary: Resolution of equi-join key in post-join projection
 Key: SPARK-6380
 URL: https://issues.apache.org/jira/browse/SPARK-6380
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


df1.join(df2, df1("key") === df2("key")).select("key")

It would be great to just resolve "key" to df1("key") in the case of inner joins.







[jira] [Updated] (SPARK-6350) Make mesosExecutorCores configurable in mesos fine-grained mode

2015-03-17 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-6350:

Fix Version/s: 1.3.1
   1.4.0

 Make mesosExecutorCores configurable in mesos fine-grained mode
 -

 Key: SPARK-6350
 URL: https://issues.apache.org/jira/browse/SPARK-6350
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
Assignee: Jongyoul Lee
Priority: Minor
 Fix For: 1.4.0, 1.3.1


 When Spark runs in Mesos fine-grained mode, the Mesos slave launches the executor 
 with a given number of CPUs and amount of memory. However, the number of executor 
 cores is always CPU_PER_TASKS, the same as spark.task.cpus. If I set that value 
 to 5 for running an intensive task, the Mesos executor always consumes 5 cores 
 even when no task is running. This wastes resources. We should make the executor 
 cores a configuration variable.
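
In other words, the proposal separates the per-task setting from the per-executor one. A sketch of what that would look like for users (the second key is the one this ticket introduces, so its exact name is an assumption here):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.cpus", "5")                  // heavy tasks still reserve 5 cores each
  .set("spark.mesos.mesosExecutor.cores", "1")  // but an idle fine-grained executor holds only 1
{code}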






[jira] [Commented] (SPARK-6350) Make mesosExecutorCores configurable in mesos fine-grained mode

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364661#comment-14364661
 ] 

Apache Spark commented on SPARK-6350:
-

User 'jongyoul' has created a pull request for this issue:
https://github.com/apache/spark/pull/5063

 Make mesosExecutorCores configurable in mesos fine-grained mode
 -

 Key: SPARK-6350
 URL: https://issues.apache.org/jira/browse/SPARK-6350
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
Assignee: Jongyoul Lee
Priority: Minor

 When Spark runs in Mesos fine-grained mode, the Mesos slave launches the executor 
 with a given number of CPUs and amount of memory. However, the number of executor 
 cores is always CPU_PER_TASKS, the same as spark.task.cpus. If I set that value 
 to 5 for running an intensive task, the Mesos executor always consumes 5 cores 
 even when no task is running. This wastes resources. We should make the executor 
 cores a configuration variable.






[jira] [Comment Edited] (SPARK-6304) Checkpointing doesn't retain driver port

2015-03-17 Thread Marius Soutier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364713#comment-14364713
 ] 

Marius Soutier edited comment on SPARK-6304 at 3/17/15 7:36 AM:


Got it, thanks. In my tests it was never set automatically, so this must be set 
at some later point.


was (Author: msoutier):
Got it, thanks.

 Checkpointing doesn't retain driver port
 

 Key: SPARK-6304
 URL: https://issues.apache.org/jira/browse/SPARK-6304
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Marius Soutier

 In a check-pointed Streaming application running on a fixed driver port, the 
 setting spark.driver.port is not loaded when recovering from a checkpoint.
 (The driver is then started on a random port.)






[jira] [Commented] (SPARK-6304) Checkpointing doesn't retain driver port

2015-03-17 Thread Marius Soutier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364713#comment-14364713
 ] 

Marius Soutier commented on SPARK-6304:
---

Got it, thanks.

 Checkpointing doesn't retain driver port
 

 Key: SPARK-6304
 URL: https://issues.apache.org/jira/browse/SPARK-6304
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Marius Soutier

 In a check-pointed Streaming application running on a fixed driver port, the 
 setting spark.driver.port is not loaded when recovering from a checkpoint.
 (The driver is then started on a random port.)






[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364712#comment-14364712
 ] 

Apache Spark commented on SPARK-5523:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/5064

 TaskMetrics and TaskInfo have innumerable copies of the hostname string
 ---

 Key: SPARK-5523
 URL: https://issues.apache.org/jira/browse/SPARK-5523
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Streaming
Reporter: Tathagata Das

 TaskMetrics and TaskInfo objects carry the hostname associated with the task. 
 As these are created (directly or through deserialization of RPC messages), 
 each of them has a separate String object for the hostname even though most 
 of them contain the same string data. This results in thousands of 
 string objects, increasing the memory requirements of the driver. 
 This can easily be deduped when deserializing a TaskMetrics object, or when 
 creating a TaskInfo object.
 This affects streaming particularly badly due to the rate of job/stage/task 
 generation. 
 For a solution, see how this dedup is done for StorageLevel:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226
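
A minimal sketch of that dedup pattern applied to hostnames (illustrative only; the real fix would hook into TaskMetrics/TaskInfo deserialization):

{code}
import java.util.concurrent.ConcurrentHashMap

object HostnameCache {
  private val cache = new ConcurrentHashMap[String, String]()

  // Return a canonical instance for `host`, so thousands of TaskInfo/TaskMetrics
  // objects end up sharing one String per distinct hostname.
  def canonicalize(host: String): String = {
    val prior = cache.putIfAbsent(host, host)
    if (prior == null) host else prior
  }
}
{code}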
  






[jira] [Updated] (SPARK-6390) Add MatrixUDT in PySpark

2015-03-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6390:
-
Component/s: PySpark

 Add MatrixUDT in PySpark
 

 Key: SPARK-6390
 URL: https://issues.apache.org/jira/browse/SPARK-6390
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng

 After SPARK-6309, we should support MatrixUDT in PySpark too.






[jira] [Created] (SPARK-6390) Add MatrixUDT in PySpark

2015-03-17 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6390:


 Summary: Add MatrixUDT in PySpark
 Key: SPARK-6390
 URL: https://issues.apache.org/jira/browse/SPARK-6390
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng


After SPARK-6309, we should support MatrixUDT in PySpark too.






[jira] [Commented] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365897#comment-14365897
 ] 

Apache Spark commented on SPARK-5836:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/5074

 Highlight in Spark documentation that by default Spark does not delete its 
 temporary files
 --

 Key: SPARK-5836
 URL: https://issues.apache.org/jira/browse/SPARK-5836
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Tomasz Dudziak

 We recently learnt the hard way (in a prod system) that Spark by default does 
 not delete its temporary files until it is stopped. Within a relatively short 
 time span of heavy Spark use, the disk of our prod machine filled up 
 completely because of the multiple shuffle files written to it. We think there 
 should be better documentation around the fact that after a job is finished 
 it leaves a lot of rubbish behind, so that this does not come as a surprise.
 Probably a good place to highlight that fact would be the documentation of the 
 {{spark.local.dir}} property, which controls where Spark temporary files are 
 written. 






[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)

2015-03-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365918#comment-14365918
 ] 

Xiangrui Meng commented on SPARK-6192:
--

[~MechCoder] Please be a little (but not too) specific in the proposal. For 
example, you should mention Python in the title of the proposal, which sets the 
theme of the project. Scala/Java will definitely be involved, but the goal is 
to have better coverage of MLlib's Python API. This also helps reviewers 
understand the scope of the proposal and rate it.

You should also mention in the proposal that if the features are implemented by 
others, we will create new tasks within the theme of the project. So it is good 
for both MLlib and GSoC.

 Enhance MLlib's Python API (GSoC 2015)
 --

 Key: SPARK-6192
 URL: https://issues.apache.org/jira/browse/SPARK-6192
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
  Labels: gsoc, gsoc2015, mentor

 This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme 
 is to enhance MLlib's Python API, to make it on par with the Scala/Java API. 
 The main tasks are:
 1. For all models in MLlib, provide save/load method. This also
 includes save/load in Scala.
 2. Python API for evaluation metrics.
 3. Python API for streaming ML algorithms.
 4. Python API for distributed linear algebra.
 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use
 customized serialization, making MLLibPythonAPI hard to maintain. It
 would be nice to use the DataFrames for serialization.
 I'll link the JIRAs for each of the tasks.
 Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. 
 The TODO list will be dynamic based on the backlog.






[jira] [Issue Comment Deleted] (SPARK-6116) Making DataFrame API non-experimental

2015-03-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6116:
---
Comment: was deleted

(was: User 'azagrebin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5069)

 Making DataFrame API non-experimental
 -

 Key: SPARK-6116
 URL: https://issues.apache.org/jira/browse/SPARK-6116
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin

 An umbrella ticket to track improvements and changes needed to make DataFrame 
 API non-experimental.






[jira] [Updated] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples

2015-03-17 Thread Marko Bonaci (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marko Bonaci updated SPARK-6370:

  Priority: Minor  (was: Major)
Issue Type: Documentation  (was: Bug)

 RDD sampling with replacement intermittently yields incorrect number of 
 samples
 ---

 Key: SPARK-6370
 URL: https://issues.apache.org/jira/browse/SPARK-6370
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
Reporter: Marko Bonaci
Priority: Minor
  Labels: PoissonSampler, sample, sampler

 Here's the repl output:
 {code:java}
 scala> uniqueIds.collect
 res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 
 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample 
 at <console>:27
 scala> swr.count
 res17: Long = 16
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample 
 at <console>:27
 scala> swr.count
 res18: Long = 8
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample 
 at <console>:27
 scala> swr.count
 res19: Long = 18
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample 
 at <console>:27
 scala> swr.count
 res20: Long = 15
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample 
 at <console>:27
 scala> swr.count
 res21: Long = 11
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample 
 at <console>:27
 scala> swr.count
 res22: Long = 10
 {code}






[jira] [Commented] (SPARK-5954) Add topByKey to pair RDDs

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366081#comment-14366081
 ] 

Apache Spark commented on SPARK-5954:
-

User 'coderxiang' has created a pull request for this issue:
https://github.com/apache/spark/pull/5075

 Add topByKey to pair RDDs
 -

 Key: SPARK-5954
 URL: https://issues.apache.org/jira/browse/SPARK-5954
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Xiangrui Meng
Assignee: Shuo Xiang

 `topByKey(num: Int): RDD[(K, V)]` finds the top-k values for each key in a 
 pair RDD. This is used, e.g., in computing top recommendations. We can use 
 the Guava implementation of finding top-k from an iterator. See also 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Utils.scala.






[jira] [Commented] (SPARK-5955) Add checkpointInterval to ALS

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366127#comment-14366127
 ] 

Apache Spark commented on SPARK-5955:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5076

 Add checkpointInterval to ALS
 -

 Key: SPARK-5955
 URL: https://issues.apache.org/jira/browse/SPARK-5955
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We should add checkpoint interval to ALS to prevent the following:
 1. storing large shuffle files
 2. stack overflow (SPARK-1106)
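
What this would look like in use, as a sketch: setCheckpointInterval is the setter this ticket adds (so the name is assumed), and checkpointing only takes effect once a checkpoint directory is set.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.ALS

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("als-checkpointing"))
sc.setCheckpointDir("/tmp/als-checkpoints")  // required, otherwise the interval is a no-op

val als = new ALS()
  .setRank(50)
  .setIterations(15)
  .setCheckpointInterval(10)  // checkpoint the factor RDDs every 10 iterations
{code}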






[jira] [Updated] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope

2015-03-17 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6382:
-
Description: 
Currently the scope of UDF registration is global. It's unsuitable for 
libraries that's built on top of DataFrame, as many operations has to be done 
by registering a UDF first.

Please provide a way for binding temporary UDFs.

e.g.

{code}
withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => m2 
++ m2),
...) {
  sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
}
{code}

Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes 
more sense.

Jianshi


  was:
Currently the scope of UDF registration is global. It's unsuitable for 
libraries that's built on top of DataFrame, as many operations has to done by 
registering a UDF first.

Please provide a way for binding temporary UDFs.

e.g.

{code}
withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => m2 
++ m2),
...) {
  sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
}
{code}

Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes 
more sense.

Jianshi



 withUDF(...) {...} for supporting temporary UDF definitions in the scope
 

 Key: SPARK-6382
 URL: https://issues.apache.org/jira/browse/SPARK-6382
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang

 Currently the scope of UDF registration is global. It's unsuitable for 
 libraries that's built on top of DataFrame, as many operations has to be done 
 by registering a UDF first.
 Please provide a way for binding temporary UDFs.
 e.g.
 {code}
 withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => 
 m2 ++ m2),
 ...) {
   sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
 }
 {code}
 Also UDF registry is a mutable Hashmap, refactoring it to a immutable one 
 makes more sense.
 Jianshi
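
A Spark-independent sketch of the scoping semantics being requested; the real feature would need SQLContext's UDF registry to support removal, which the current mutable HashMap does not expose, and an immutable map (as suggested above) makes the save/restore trivial:

{code}
object ScopedUdfRegistry {
  // Immutable map, swapped atomically, as the ticket suggests.
  private var udfs = Map.empty[String, AnyRef]

  def withUDF[T](bindings: (String, AnyRef)*)(body: => T): T = {
    val saved = udfs
    udfs = udfs ++ bindings            // bindings visible only inside the block
    try body finally { udfs = saved }  // outer scope restored even on exceptions
  }

  def lookup(name: String): Option[AnyRef] = udfs.get(name)
}
{code}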






[jira] [Updated] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope

2015-03-17 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6382:
-
Description: 
Currently the scope of UDF registration is global. It's unsuitable for 
libraries that are built on top of DataFrame, as many operations has to be done 
by registering a UDF first.

Please provide a way for binding temporary UDFs.

e.g.

{code}
withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => m2 
++ m2),
...) {
  sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
}
{code}

Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes 
more sense.

Jianshi


  was:
Currently the scope of UDF registration is global. It's unsuitable for 
libraries that's built on top of DataFrame, as many operations has to be done 
by registering a UDF first.

Please provide a way for binding temporary UDFs.

e.g.

{code}
withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => m2 
++ m2),
...) {
  sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
}
{code}

Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes 
more sense.

Jianshi



 withUDF(...) {...} for supporting temporary UDF definitions in the scope
 

 Key: SPARK-6382
 URL: https://issues.apache.org/jira/browse/SPARK-6382
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang

 Currently the scope of UDF registration is global. It's unsuitable for 
 libraries that are built on top of DataFrame, as many operations has to be 
 done by registering a UDF first.
 Please provide a way for binding temporary UDFs.
 e.g.
 {code}
 withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => 
 m2 ++ m2),
 ...) {
   sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
 }
 {code}
 Also UDF registry is a mutable Hashmap, refactoring it to a immutable one 
 makes more sense.
 Jianshi






[jira] [Updated] (SPARK-6381) add Apriori algorithm to MLLib

2015-03-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6381:
-
 Target Version/s:   (was: 1.4.0)
Affects Version/s: (was: 1.3.1)

 add Apriori algorithm to MLLib
 --

 Key: SPARK-6381
 URL: https://issues.apache.org/jira/browse/SPARK-6381
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: zhangyouhua

 [~mengxr]
 There are many algorithms for association rule mining, for example FP-Growth, 
 Apriori and so on. These are classic machine learning algorithms and are very 
 useful in big data mining. The FP-Growth implementation in Spark 1.3 can handle 
 very large data sets, but it needs to build an FP-Tree before mining frequent 
 items. So when the transaction data is smaller, the data is sparse, and 
 minSupport is larger, we can select the Apriori algorithm instead.
 How can the Apriori algorithm be parallelized?
 1. Generate frequent items by filtering the input data using a minimal support 
 level:
   private def genFreqItems[Item: ClassTag](data: RDD[Array[Item]], minCount: 
 Long, partitioner: Partitioner): Array[Item]
 2. Generate frequent itemsets by building Apriori candidates; the extraction is 
 done on each partition.
  2.1 Create the candidate set from kFreqItems and k:
  private def createCandidateSet[Item: ClassTag](kFreqItems: 
 Array[(Array[Item], Long)], k: Int)
  2.2 Create the next kFreqItems from the candidate set:
  private def scanDataSet[Item: ClassTag](dataSet: 
 RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): 
 RDD[(Array[Item], Long)]
  2.3 Filter the data set by the candidate set.
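
A compact, illustrative sketch of the parallel loop described above (the proposal itself splits it into genFreqItems / createCandidateSet / scanDataSet as listed):

{code}
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def aprioriSketch(data: RDD[Set[String]], minCount: Long): Seq[(Set[String], Long)] = {
  // Step 1: frequent 1-item sets.
  var freq = data.flatMap(_.map(item => (Set(item), 1L)))
    .reduceByKey(_ + _)
    .filter(_._2 >= minCount)
    .collect().toSeq
  val results = ArrayBuffer(freq: _*)
  var k = 2
  while (freq.nonEmpty) {
    // Step 2.1: candidate k-item sets built from the frequent (k-1)-item sets.
    val candidates = (for {
      (a, _) <- freq
      (b, _) <- freq
      c = a ++ b
      if c.size == k
    } yield c).distinct
    val bc = data.sparkContext.broadcast(candidates)
    // Steps 2.2 / 2.3: one scan of the data set counts and filters the candidates.
    freq = data.flatMap(tx => bc.value.filter(_.subsetOf(tx)).map((_, 1L)))
      .reduceByKey(_ + _)
      .filter(_._2 >= minCount)
      .collect().toSeq
    results ++= freq
    k += 1
  }
  results
}
{code}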






[jira] [Updated] (SPARK-6381) add Apriori algorithm to MLLib

2015-03-17 Thread zhangyouhua (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangyouhua updated SPARK-6381:
---
 Target Version/s: 1.3.1
Affects Version/s: 1.3.0
Fix Version/s: 1.3.1

 add Apriori algorithm to MLLib
 --

 Key: SPARK-6381
 URL: https://issues.apache.org/jira/browse/SPARK-6381
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: zhangyouhua
 Fix For: 1.3.1


 [~mengxr]
 There are many algorithms for association rule mining, for example FP-Growth, 
 Apriori and so on. These are classic machine learning algorithms and are very 
 useful in big data mining. The FP-Growth implementation in Spark 1.3 can handle 
 very large data sets, but it needs to build an FP-Tree before mining frequent 
 items. So when the transaction data is smaller, the data is sparse, and 
 minSupport is larger, we can select the Apriori algorithm instead.
 How can the Apriori algorithm be parallelized?
 1. Generate frequent items by filtering the input data using a minimal support 
 level:
   private def genFreqItems[Item: ClassTag](data: RDD[Array[Item]], minCount: 
 Long, partitioner: Partitioner): Array[Item]
 2. Generate frequent itemsets by building Apriori candidates; the extraction is 
 done on each partition.
  2.1 Create the candidate set from kFreqItems and k:
  private def createCandidateSet[Item: ClassTag](kFreqItems: 
 Array[(Array[Item], Long)], k: Int)
  2.2 Create the next kFreqItems from the candidate set:
  private def scanDataSet[Item: ClassTag](dataSet: 
 RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): 
 RDD[(Array[Item], Long)]
  2.3 Filter the data set by the candidate set.






[jira] [Commented] (SPARK-6381) add Apriori algorithm to MLLib

2015-03-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364847#comment-14364847
 ] 

Sean Owen commented on SPARK-6381:
--

See SPARK-2432. I don't think Apriori is as useful or as widely used as FP-growth.

 add Apriori algorithm to MLLib
 --

 Key: SPARK-6381
 URL: https://issues.apache.org/jira/browse/SPARK-6381
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: zhangyouhua

 [~mengxr]
 There are many association rule mining algorithms, for example FP-Growth and 
 Apriori. These are classic machine learning algorithms and are very useful in 
 big data mining. The FP-Growth implementation in Spark 1.3 can handle very 
 large data sets, but it needs to build an FP-tree before mining frequent 
 items. So when the transaction data is smaller, the data is sparse, and 
 minSupport is higher, we can select the Apriori algorithm.
 How is the Apriori algorithm parallelized?
 1. Generate frequent items by filtering the input data using the minimal 
 support level.
   private def genFreqItems[Item: ClassTag](data: RDD[Array[Item]], minCount: 
 Long, partitioner: Partitioner): Array[Item]
 2. Generate frequent itemsets by applying Apriori; the extraction is done on 
 each partition.
  2.1 Create the candidate set from kFreqItems and k.
  private def createCandidateSet[Item: ClassTag](kFreqItems: 
 Array[(Array[Item], Long)], k: Int)
  2.2 Create kFreqItems from the candidate set.
  private def scanDataSet[Item: ClassTag](dataSet: 
 RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): 
 RDD[(Array[Item], Long)]
  2.3 Filter the data set by the candidate set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions

2015-03-17 Thread Nick Bruun (JIRA)
Nick Bruun created SPARK-6385:
-

 Summary: ISO 8601 timestamp parsing does not support arbitrary 
precision second fractions
 Key: SPARK-6385
 URL: https://issues.apache.org/jira/browse/SPARK-6385
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Nick Bruun
Priority: Minor


The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does 
not support arbitrary precision fractions of seconds, only millisecond 
precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while 
{{2015-02-02T00:00:07.9000GMT-00:00}} will fail.

I'm willing to implement a fix, but pointers on the direction would be 
appreciated.
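
For what it's worth, a minimal sketch of one possible direction, assuming the goal is 
only to tolerate extra fraction digits rather than to actually honor sub-millisecond 
precision (this is my assumption, not a statement of how the fix should work): truncate 
the fractional-second field to millisecond precision before handing the string to the 
existing millisecond-based parser.

{code:scala}
// Sketch only: trim an ISO 8601 fractional-second field to at most three digits so a
// millisecond-precision parser can accept it. Real sub-millisecond support would need
// a different approach.
val raw = "2015-02-02T00:00:07.9000GMT-00:00"
val normalized = raw.replaceAll("""\.(\d{3})\d+""", ".$1")
// normalized == "2015-02-02T00:00:07.900GMT-00:00"
{code}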



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples

2015-03-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364859#comment-14364859
 ] 

Sean Owen commented on SPARK-6370:
--

Ah. The docs don't explain this behavior indeed. {{fraction}} does also imply 
to me that it's the size of the sample as a fraction of the total. In fact it's 
the probability that each element is chosen when without replacement, and an 
expected number of times each element is chosen when with replacement.

I think it needs a doc update. Would you like to open a PR to elaborate the 
javadoc / scaladoc / Python doc of all of the sample methods? wouldn't hurt to 
doc the {{RandomSampler}} subclasses too.
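
As a standalone illustration of why the returned size varies (a sketch of the semantics, 
not Spark's actual PoissonSampler code): with replacement, each element contributes a 
Poisson(fraction)-distributed number of copies, so only the expected total equals 
fraction * n.

{code:scala}
import scala.util.Random

// Knuth's method for one Poisson draw; fine for a small lambda such as 0.5.
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0
  var p = 1.0
  do { k += 1; p *= rng.nextDouble() } while (p > limit)
  k - 1
}

val n = 29          // size of uniqueIds in the report below
val fraction = 0.5
val rng = new Random()
val sampleSize = (1 to n).map(_ => poisson(fraction, rng)).sum
// sampleSize varies from run to run; its expectation is fraction * n = 14.5
{code}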

 RDD sampling with replacement intermittently yields incorrect number of 
 samples
 ---

 Key: SPARK-6370
 URL: https://issues.apache.org/jira/browse/SPARK-6370
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
Reporter: Marko Bonaci
  Labels: PoissonSampler, sample, sampler

 Here's the repl output:
 {code:java}
 scala> uniqueIds.collect
 res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 
 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample 
 at <console>:27
 scala> swr.count
 res17: Long = 16
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample 
 at <console>:27
 scala> swr.count
 res18: Long = 8
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample 
 at <console>:27
 scala> swr.count
 res19: Long = 18
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample 
 at <console>:27
 scala> swr.count
 res20: Long = 15
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample 
 at <console>:27
 scala> swr.count
 res21: Long = 11
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample 
 at <console>:27
 scala> swr.count
 res22: Long = 10
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4590) Early investigation of parameter server

2015-03-17 Thread Qiping Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364803#comment-14364803
 ] 

Qiping Li commented on SPARK-4590:
--

Hi, [~rezazadeh], thanks for your investigation. It is very helpful to have 
parameter server integrated into Spark. How is it going with this issue now?

 Early investigation of parameter server
 ---

 Key: SPARK-4590
 URL: https://issues.apache.org/jira/browse/SPARK-4590
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 In the current implementation of GLM solvers, we save intermediate models 
 on the driver node and update them through broadcast and aggregation. Even with 
 torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond 
 ~10 million features. This JIRA is for investigating the parameter server 
 approach, including algorithm, infrastructure, and dependencies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6381) add Apriori algorithm to MLLib

2015-03-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6381.
--
   Resolution: Duplicate
Fix Version/s: (was: 1.4.0)

(Don't set fix version, and 1.3.1 does not exist.)
Search JIRA first please. This was already implemented in SPARK-4001 as 
FP-growth. See also SPARK-2432.

 add Apriori algorithm to MLLib
 --

 Key: SPARK-6381
 URL: https://issues.apache.org/jira/browse/SPARK-6381
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: zhangyouhua

 [~mengxr]
 There are many association rule mining algorithms, for example FP-Growth and 
 Apriori. These are classic machine learning algorithms and are very useful in 
 big data mining. The FP-Growth implementation in Spark 1.3 can handle very 
 large data sets, but it needs to build an FP-tree before mining frequent 
 items. So when the transaction data is smaller, the data is sparse, and 
 minSupport is higher, we can select the Apriori algorithm.
 How is the Apriori algorithm parallelized?
 1. Generate frequent items by filtering the input data using the minimal 
 support level.
   private def genFreqItems[Item: ClassTag](data: RDD[Array[Item]], minCount: 
 Long, partitioner: Partitioner): Array[Item]
 2. Generate frequent itemsets by applying Apriori; the extraction is done on 
 each partition.
  2.1 Create the candidate set from kFreqItems and k.
  private def createCandidateSet[Item: ClassTag](kFreqItems: 
 Array[(Array[Item], Long)], k: Int)
  2.2 Create kFreqItems from the candidate set.
  private def scanDataSet[Item: ClassTag](dataSet: 
 RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): 
 RDD[(Array[Item], Long)]
  2.3 Filter the data set by the candidate set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6383) Few examples on Dataframe operation give compiler errors

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364825#comment-14364825
 ] 

Apache Spark commented on SPARK-6383:
-

User 'tijoparacka' has created a pull request for this issue:
https://github.com/apache/spark/pull/5068

 Few examples on Dataframe operation give compiler errors 
 -

 Key: SPARK-6383
 URL: https://issues.apache.org/jira/browse/SPARK-6383
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial

 The statements below give compiler errors:
 a) the select method does not accept a mix of String and Column arguments
 df.select("name", df("age") + 1).show() // the String needs to be converted to a Column
 b) the filter should be based on the age column, not on the name column
 df.filter(df("name") > 21).show()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders

2015-03-17 Thread Rex Xiong (JIRA)
Rex Xiong created SPARK-6384:


 Summary: saveAsParquet doesn't clean up attempt_* folders
 Key: SPARK-6384
 URL: https://issues.apache.org/jira/browse/SPARK-6384
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Rex Xiong


After calling SchemaRDD.saveAsParquet, it runs well and successfully generates the 
*.parquet, _SUCCESS, _common_metadata and _metadata files.
But sometimes there are also attempt_* folders (e.g. 
attempt_201503170229_0006_r_06_736, attempt_201503170229_0006_r_000404_416) 
under the same output folder; each contains one parquet file and seems to be a 
working temp folder.
It happens even though the _SUCCESS file was created.
In this situation, SparkSQL throws an exception when loading this parquet folder:
Error: java.io.FileNotFoundException: Path is not a file: 
../attempt_201503170229_0006_r_06_736
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja
va:69)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja
va:55)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
UpdateTimes(FSNamesystem.java:1728)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
Int(FSNamesystem.java:1671)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
(FSNamesystem.java:1651)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
(FSNamesystem.java:1625)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLoca
tions(NameNodeRpcServer.java:503)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTra
nslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:32
2)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Cl
ientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.cal
l(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma
tion.java:1594)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) (state=,co
de=0)

I'm not sure whether it's a Spark bug or a Parquet bug.
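
In case it is useful to anyone hitting this before a proper fix, here is a hedged 
workaround sketch (my own suggestion, not a proposed Spark change; the output path is 
hypothetical): remove leftover attempt_* directories once the _SUCCESS marker exists, 
using the Hadoop FileSystem API.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Workaround sketch only: delete stray attempt_* directories under a parquet output
// path after a successful write, so later reads do not trip over them.
val out = new Path("/data/output.parquet")   // hypothetical output path
val fs = FileSystem.get(new Configuration())
if (fs.exists(new Path(out, "_SUCCESS"))) {
  fs.listStatus(out)
    .filter(s => s.isDirectory && s.getPath.getName.startsWith("attempt_"))
    .foreach(s => fs.delete(s.getPath, true))
}
{code}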



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders

2015-03-17 Thread Rex Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rex Xiong updated SPARK-6384:
-
Description: 
After calling SchemaRDD.saveAsParquet, it runs well and successfully generates the 
*.parquet, _SUCCESS, _common_metadata and _metadata files.
But sometimes there are also attempt_* folders (e.g. 
attempt_201503170229_0006_r_06_736, attempt_201503170229_0006_r_000404_416) 
under the same output folder; each contains one parquet file and seems to be a 
working temp folder.
It happens even though the _SUCCESS file was created.
In this situation, SparkSQL (Hive table) throws an exception when loading this 
parquet folder:
Error: java.io.FileNotFoundException: Path is not a file: 
../attempt_201503170229_0006_r_06_736
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja
va:69)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja
va:55)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
UpdateTimes(FSNamesystem.java:1728)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
Int(FSNamesystem.java:1671)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
(FSNamesystem.java:1651)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
(FSNamesystem.java:1625)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLoca
tions(NameNodeRpcServer.java:503)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTra
nslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:32
2)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Cl
ientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.cal
l(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma
tion.java:1594)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) (state=,co
de=0)

I'm not sure whether it's a Spark bug or a Parquet bug.

  was:
After calling SchemaRDD.saveAsParquet, it runs well and successfully generates the 
*.parquet, _SUCCESS, _common_metadata and _metadata files.
But sometimes there are also attempt_* folders (e.g. 
attempt_201503170229_0006_r_06_736, attempt_201503170229_0006_r_000404_416) 
under the same output folder; each contains one parquet file and seems to be a 
working temp folder.
It happens even though the _SUCCESS file was created.
In this situation, SparkSQL throws an exception when loading this parquet folder:
Error: java.io.FileNotFoundException: Path is not a file: 
../attempt_201503170229_0006_r_06_736
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja
va:69)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja
va:55)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
UpdateTimes(FSNamesystem.java:1728)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
Int(FSNamesystem.java:1671)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
(FSNamesystem.java:1651)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
(FSNamesystem.java:1625)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLoca
tions(NameNodeRpcServer.java:503)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTra
nslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:32
2)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Cl
ientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.cal
l(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma
tion.java:1594)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) (state=,co
de=0)

I'm not sure whether it's a Spark bug or a Parquet bug.


 saveAsParquet doesn't clean up attempt_* folders
 

[jira] [Created] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope

2015-03-17 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-6382:


 Summary: withUDF(...) {...} for supporting temporary UDF 
definitions in the scope
 Key: SPARK-6382
 URL: https://issues.apache.org/jira/browse/SPARK-6382
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang


Currently the scope of UDF registration is global. That is unsuitable for 
libraries built on top of DataFrame, as many operations have to be done by 
registering a UDF first.

Please provide a way to bind temporary UDFs.

e.g.

{code}
withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => m1 ++ m2),
        ...) {
  sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
}
{code}

Also, the UDF registry is a mutable HashMap; refactoring it into an immutable one 
would make more sense.

Jianshi
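
One possible shape for such a helper, sketched on top of the current public registration 
API (the name withUDF and the overall structure are assumptions; the cleanup step is only 
indicative, since today's registry offers no way to drop a temporary function, which is 
exactly the gap this ticket describes):

{code:scala}
import org.apache.spark.sql.SQLContext

// Hypothetical scoped-UDF helper built on sqlContext.udf.register.
def withUDF[T](sqlContext: SQLContext)(register: SQLContext => Unit)(body: => T): T = {
  register(sqlContext)          // bind the temporary UDFs
  try body
  finally {
    // dropping the registrations would go here once the registry supports it
  }
}

// usage, mirroring the example above (merge_map is hypothetical):
// withUDF(sqlContext) { ctx =>
//   ctx.udf.register("merge_map",
//     (m1: Map[String, Double], m2: Map[String, Double]) => m1 ++ m2)
// } {
//   sqlContext.sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id")
// }
{code}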




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6381) add Apriori algorithm to MLLib

2015-03-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6381:
-
Target Version/s:   (was: 1.3.1)
   Fix Version/s: (was: 1.3.1)

@zhangyouhua please do not set fix version, as this is not 'fixed', but a 
duplicate. Please don't set target version either as it is resolved.

 add Apriori algorithm to MLLib
 --

 Key: SPARK-6381
 URL: https://issues.apache.org/jira/browse/SPARK-6381
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: zhangyouhua

 [~mengxr]
 There are many association rule mining algorithms, for example FP-Growth and 
 Apriori. These are classic machine learning algorithms and are very useful in 
 big data mining. The FP-Growth implementation in Spark 1.3 can handle very 
 large data sets, but it needs to build an FP-tree before mining frequent 
 items. So when the transaction data is smaller, the data is sparse, and 
 minSupport is higher, we can select the Apriori algorithm.
 How is the Apriori algorithm parallelized?
 1. Generate frequent items by filtering the input data using the minimal 
 support level.
   private def genFreqItems[Item: ClassTag](data: RDD[Array[Item]], minCount: 
 Long, partitioner: Partitioner): Array[Item]
 2. Generate frequent itemsets by applying Apriori; the extraction is done on 
 each partition.
  2.1 Create the candidate set from kFreqItems and k.
  private def createCandidateSet[Item: ClassTag](kFreqItems: 
 Array[(Array[Item], Long)], k: Int)
  2.2 Create kFreqItems from the candidate set.
  private def scanDataSet[Item: ClassTag](dataSet: 
 RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): 
 RDD[(Array[Item], Long)]
  2.3 Filter the data set by the candidate set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples

2015-03-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6370:
-
Comment: was deleted

(was: What's the bug? Each element is sampled with probability 0.5. I think the
expected size is 14 but not all samples would be that size.

)

 RDD sampling with replacement intermittently yields incorrect number of 
 samples
 ---

 Key: SPARK-6370
 URL: https://issues.apache.org/jira/browse/SPARK-6370
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
Reporter: Marko Bonaci
  Labels: PoissonSampler, sample, sampler

 Here's the repl output:
 {code:java}
 scala> uniqueIds.collect
 res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 
 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample 
 at <console>:27
 scala> swr.count
 res17: Long = 16
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample 
 at <console>:27
 scala> swr.count
 res18: Long = 8
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample 
 at <console>:27
 scala> swr.count
 res19: Long = 18
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample 
 at <console>:27
 scala> swr.count
 res20: Long = 15
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample 
 at <console>:27
 scala> swr.count
 res21: Long = 11
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample 
 at <console>:27
 scala> swr.count
 res22: Long = 10
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6377) Set the number of shuffle partitions automatically based on the size of input tables and the reduce-side operation.

2015-03-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364901#comment-14364901
 ] 

Sean Owen commented on SPARK-6377:
--

I think this is the same basic suggestion as SPARK-4630?

 Set the number of shuffle partitions automatically based on the size of input 
 tables and the reduce-side operation.
 ---

 Key: SPARK-6377
 URL: https://issues.apache.org/jira/browse/SPARK-6377
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai

 It will be helpful to automatically set the number of shuffle partitions 
 based on the size of input tables and the operation at the reduce side for an 
 Exchange operator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6381) add Apriori algorithm to MLLib

2015-03-17 Thread zhangyouhua (JIRA)
zhangyouhua created SPARK-6381:
--

 Summary: add Apriori algorithm to MLLib
 Key: SPARK-6381
 URL: https://issues.apache.org/jira/browse/SPARK-6381
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.1
Reporter: zhangyouhua
 Fix For: 1.4.0


[~mengxr]
There are many association rule mining algorithms, for example FP-Growth and 
Apriori. These are classic machine learning algorithms and are very useful in 
big data mining.

The FP-Growth implementation in Spark 1.3 can handle very large data sets, but 
it needs to build an FP-tree before mining frequent items.

So when the transaction data is smaller, the data is sparse, and minSupport is 
higher, we can select the Apriori algorithm.

How is the Apriori algorithm parallelized?
1. Generate frequent items by filtering the input data using the minimal 
support level.
  private def genFreqItems[Item: ClassTag](data: RDD[Array[Item]], minCount: 
Long, partitioner: Partitioner): Array[Item]
2. Generate frequent itemsets by applying Apriori; the extraction is done on 
each partition.
 2.1 Create the candidate set from kFreqItems and k (see the sketch after this 
 list).
 private def createCandidateSet[Item: ClassTag](kFreqItems: 
Array[(Array[Item], Long)], k: Int)
 2.2 Create kFreqItems from the candidate set.
 private def scanDataSet[Item: ClassTag](dataSet: 
RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): 
RDD[(Array[Item], Long)]
 2.3 Filter the data set by the candidate set.
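
To make step 2.1 concrete, here is a minimal local sketch of the classic Apriori 
candidate-generation (join and prune) step on plain Scala collections; it is 
illustrative only and not the proposed RDD implementation.

{code:scala}
// Join (k-1)-item frequent itemsets that share their first k-2 items, then keep only
// candidates whose (k-1)-element subsets are all frequent (the Apriori prune step).
def genCandidates(freqItemsets: Seq[List[String]], k: Int): Seq[List[String]] = {
  val prev = freqItemsets.map(_.sorted).toSet
  val joined = for {
    a <- prev.toSeq
    b <- prev.toSeq
    if a.take(k - 2) == b.take(k - 2) && a.last < b.last
  } yield a :+ b.last
  joined.filter(c => c.combinations(k - 1).forall(prev.contains))
}

// e.g. genCandidates(Seq(List("A"), List("B"), List("C")), k = 2)
// yields List(A, B), List(A, C) and List(B, C)
{code}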



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6383) Few examples on Dataframe operation give compiler errors

2015-03-17 Thread Tijo Thomas (JIRA)
Tijo Thomas created SPARK-6383:
--

 Summary: Few examples on Dataframe operation give compiler errors 
 Key: SPARK-6383
 URL: https://issues.apache.org/jira/browse/SPARK-6383
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial


The statements below give compiler errors:
a) the select method does not accept a mix of String and Column arguments
df.select("name", df("age") + 1).show() // the String needs to be converted to a Column

b) the filter should be based on the age column, not on the name column
df.filter(df("name") > 21).show()
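
For reference, forms of the two snippets that do compile, assuming the spark-shell's 
predefined sqlContext and a registered people table with name and age columns as in the 
programming-guide example:

{code:scala}
// Both select arguments are Columns, and the filter is on the age column.
val df = sqlContext.table("people")
df.select(df("name"), df("age") + 1).show()
df.filter(df("age") > 21).show()
{code}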



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples

2015-03-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364859#comment-14364859
 ] 

Sean Owen edited comment on SPARK-6370 at 3/17/15 10:09 AM:


Ah. The docs don't explain this behavior indeed. {{fraction}} does also imply 
to me that it's the size of the sample as a fraction of the total. In fact it's 
the probability that each element is chosen when without replacement, and an 
expected number of times each element is chosen when with replacement. EDIT: 
... which TBC in both cases also means that the *expected* size of the sample 
is the given fraction of the input size.

I think it needs a doc update. Would you like to open a PR to elaborate the 
javadoc / scaladoc / Python doc of all of the sample methods? wouldn't hurt to 
doc the {{RandomSampler}} subclasses too.


was (Author: srowen):
Ah. The docs don't explain this behavior indeed. {{fraction}} does also imply 
to me that it's the size of the sample as a fraction of the total. In fact it's 
the probability that each element is chosen when without replacement, and an 
expected number of times each element is chosen when with replacement.

I think it needs a doc update. Would you like to open a PR to elaborate the 
javadoc / scaladoc / Python doc of all of the sample methods? wouldn't hurt to 
doc the {{RandomSampler}} subclasses too.

 RDD sampling with replacement intermittently yields incorrect number of 
 samples
 ---

 Key: SPARK-6370
 URL: https://issues.apache.org/jira/browse/SPARK-6370
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
Reporter: Marko Bonaci
  Labels: PoissonSampler, sample, sampler

 Here's the repl output:
 {code:java}
 scala> uniqueIds.collect
 res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 
 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample 
 at <console>:27
 scala> swr.count
 res17: Long = 16
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample 
 at <console>:27
 scala> swr.count
 res18: Long = 8
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample 
 at <console>:27
 scala> swr.count
 res19: Long = 18
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample 
 at <console>:27
 scala> swr.count
 res20: Long = 15
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample 
 at <console>:27
 scala> swr.count
 res21: Long = 11
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample 
 at <console>:27
 scala> swr.count
 res22: Long = 10
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders

2015-03-17 Thread Rex Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rex Xiong updated SPARK-6384:
-
Affects Version/s: 1.2.1

 saveAsParquet doesn't clean up attempt_* folders
 

 Key: SPARK-6384
 URL: https://issues.apache.org/jira/browse/SPARK-6384
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Rex Xiong

 After calling SchemaRDD.saveAsParquet, it runs well and successfully generates 
 the *.parquet, _SUCCESS, _common_metadata and _metadata files.
 But sometimes there are also attempt_* folders (e.g. 
 attempt_201503170229_0006_r_06_736, 
 attempt_201503170229_0006_r_000404_416) under the same output folder; each 
 contains one parquet file and seems to be a working temp folder.
 It happens even though the _SUCCESS file was created.
 In this situation, SparkSQL throws an exception when loading this parquet folder:
 Error: java.io.FileNotFoundException: Path is not a file: 
 ../attempt_201503170229_0006_r_06_736
 at 
 org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja
 va:69)
 at 
 org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja
 va:55)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
 UpdateTimes(FSNamesystem.java:1728)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
 Int(FSNamesystem.java:1671)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
 (FSNamesystem.java:1651)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
 (FSNamesystem.java:1625)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLoca
 tions(NameNodeRpcServer.java:503)
 at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTra
 nslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:32
 2)
 at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Cl
 ientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.cal
 l(ProtobufRpcEngine.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma
 tion.java:1594)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) 
 (state=,co
 de=0)
 I'm not sure whether it's a Spark bug or a Parquet bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6349) Add probability estimates in SVMModel predict result

2015-03-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364888#comment-14364888
 ] 

Sean Owen commented on SPARK-6349:
--

One is just a monotonic function of the other. It doesn't change the problem at 
all in that respect.

 Add probability estimates in SVMModel predict result
 

 Key: SPARK-6349
 URL: https://issues.apache.org/jira/browse/SPARK-6349
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
   Original Estimate: 168h
  Remaining Estimate: 168h

 In SVMModel, the predictPoint method outputs either the raw margin (threshold 
 not set) or a 1/0 label (threshold set). 
 When SVM is used as a classifier, it is hard to find a good threshold, and the 
 raw margin is hard to interpret. 
 When I use SVM on the 
 dataset https://www.kaggle.com/c/avazu-ctr-prediction/data, train on the 
 first day's data (ignoring the id/device_id/device_ip fields; all remaining 
 fields are treated as categorical variables and converted to sparse features 
 before SVM) and predict on the same data with the threshold cleared, the 
 predictions are all negative. I have to set the threshold to -1 to get a 
 reasonable confusion matrix.
 So I suggest providing probability estimates in SVMModel's predictions, as in 
 libSVM (Platt's probabilistic output for binary SVM).
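
For context, a standalone sketch of the Platt-style mapping being suggested (illustrative 
only; in libSVM the sigmoid parameters A and B are fitted by maximizing the likelihood of 
held-out (margin, label) pairs, which is omitted here):

{code:scala}
// Platt scaling: map a raw SVM margin to a probability through a fitted sigmoid.
// The parameters a and b are placeholders; in practice they are estimated from data.
def plattProbability(margin: Double, a: Double, b: Double): Double =
  1.0 / (1.0 + math.exp(a * margin + b))

// with a = -1.0 and b = 0.0, a strongly positive margin maps close to 1.0
val p = plattProbability(2.5, a = -1.0, b = 0.0)   // about 0.92
{code}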



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-03-17 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364813#comment-14364813
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

[~pwendell] Hi, obviously there's nobody looking into that issue. Could you 
please clarify, assign, or whatever it takes to get this issue handled in a 
future version of spark?

That'd be great! Thanks!!!

:)

 Shuffle write increases
 ---

 Key: SPARK-5081
 URL: https://issues.apache.org/jira/browse/SPARK-5081
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung
Priority: Critical
 Attachments: Spark_Debug.pdf, diff.txt


 The shuffle write size shown in the Spark web UI is very different when I 
 execute the same Spark job with the same input data on Spark 1.1 and Spark 1.2. 
 At the sortBy stage, the shuffle write size is 98.1MB in Spark 1.1 but 146.9MB 
 in Spark 1.2. 
 I set the spark.shuffle.manager option to hash because its default value 
 changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
 This can increase disk I/O overhead dramatically as the input file gets bigger 
 and causes the jobs to take more time to complete. 
 In the case of about 100GB of input, for example, the shuffle write size is 
 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
 spark 1.1
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |9|saveAsTextFile| |1169.4KB| |
 |12|combineByKey| |1265.4KB|1275.0KB|
 |6|sortByKey| |1276.5KB| |
 |8|mapPartitions| |91.0MB|1383.1KB|
 |4|apply| |89.4MB| |
 |5|sortBy|155.6MB| |98.1MB|
 |3|sortBy|155.6MB| | |
 |1|collect| |2.1MB| |
 |2|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |
 spark 1.2
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |12|saveAsTextFile| |1170.2KB| |
 |11|combineByKey| |1264.5KB|1275.0KB|
 |8|sortByKey| |1273.6KB| |
 |7|mapPartitions| |134.5MB|1383.1KB|
 |5|zipWithIndex| |132.5MB| |
 |4|sortBy|155.6MB| |146.9MB|
 |3|sortBy|155.6MB| | |
 |2|collect| |2.0MB| |
 |1|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5332) Efficient way to deal with ExecutorLost

2015-03-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5332.
--
Resolution: Won't Fix

 Efficient way to deal with ExecutorLost
 ---

 Key: SPARK-5332
 URL: https://issues.apache.org/jira/browse/SPARK-5332
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Liang-Chi Hsieh
Priority: Minor

 Currently, the handling of a lost executor in DAGScheduler 
 (handleExecutorLost) does not look efficient. This PR tries to add a bit of 
 extra information to the Stage class to improve that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6386) add association rule mining algorithm to MLLib

2015-03-17 Thread zhangyouhua (JIRA)
zhangyouhua created SPARK-6386:
--

 Summary: add association rule mining algorithm to MLLib
 Key: SPARK-6386
 URL: https://issues.apache.org/jira/browse/SPARK-6386
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: zhangyouhua


[~mengxr]
Association rule mining finds rules between frequent items, given a transaction 
data set, a minSupport and a minConf. We can use the FPGrowth algorithm to mine 
frequent pattern items, but frequent itemsets alone do not explain the 
relationships between the items, so we should add an association rule algorithm.

For example:
data set:
A B C
A C
A D
B E F
minSupport: 0.5
minConf: 0.5

the frequent items -> support
A -> 0.75
B -> 0.5
C -> 0.5
A C -> 0.5

use the supports to calculate the rule confidence:
A -> C: support{A C} / support{A} = 0.67
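
A tiny local sketch of the confidence computation in the worked example above (plain 
collections, not a proposed MLlib API):

{code:scala}
// Support is the fraction of transactions containing the itemset; the confidence of
// A -> C is support(A, C) / support(A), matching the numbers above.
val transactions = Seq(Set("A", "B", "C"), Set("A", "C"), Set("A", "D"), Set("B", "E", "F"))

def support(items: Set[String]): Double =
  transactions.count(t => items.subsetOf(t)).toDouble / transactions.size

val confAtoC = support(Set("A", "C")) / support(Set("A"))   // 0.5 / 0.75 = 0.67
{code}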




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6117) describe function for summary statistics

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365781#comment-14365781
 ] 

Apache Spark commented on SPARK-6117:
-

User 'azagrebin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5073

 describe function for summary statistics
 

 Key: SPARK-6117
 URL: https://issues.apache.org/jira/browse/SPARK-6117
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: starter

 DataFrame.describe should return a DataFrame with summary statistics. 
 {code}
 def describe(cols: String*): DataFrame
 {code}
 If cols is empty, then run describe on all numeric columns.
 The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and 
 n + 1 columns. The 1st column is the name of the aggregate function, and the 
 next n columns are the numeric columns of interest in the input DataFrame.
 Similar to Pandas (but removing percentile since accurate percentiles are too 
 expensive to compute for Big Data)
 {code}
 In [19]: df.describe()
 Out[19]: 
   A B C D
 count  6.00  6.00  6.00  6.00
 mean   0.073711 -0.431125 -0.687758 -0.233103
 std0.843157  0.922818  0.779887  0.973118
 min   -0.861849 -2.104569 -1.509059 -1.135632
 max1.212112  0.567020  0.276232  1.071804
 {code}
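
A rough sketch of the underlying aggregation (an assumption about one possible 
implementation, not the final API): compute count/mean/stddev/min/max for the requested 
columns in a single agg pass. Reshaping the result into the 5-row layout above and 
handling an empty cols list are left out, and since there is, to my knowledge, no 
built-in stddev aggregate in 1.3, it is composed from averages here (population form).

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: one row of aggregates per call; describe() would pivot this into 5 rows.
def describeSketch(df: DataFrame, cols: String*): DataFrame = {
  val aggs = cols.flatMap { c =>
    val col = df(c)
    Seq(
      count(col).as(s"count($c)"),
      avg(col).as(s"mean($c)"),
      sqrt(avg(col * col) - avg(col) * avg(col)).as(s"stddev($c)"),  // population stddev
      min(col).as(s"min($c)"),
      max(col).as(s"max($c)"))
  }
  df.agg(aggs.head, aggs.tail: _*)
}
{code}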



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6336) LBFGS should document what convergenceTol means

2015-03-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6336:
-
Target Version/s: 1.4.0, 1.3.1  (was: 1.4.0)

 LBFGS should document what convergenceTol means
 ---

 Key: SPARK-6336
 URL: https://issues.apache.org/jira/browse/SPARK-6336
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Kai Sasaki
Priority: Trivial
 Fix For: 1.4.0, 1.3.1


 LBFGS uses Breeze's LBFGS, which uses relative convergence tolerance.  We 
 should document that convergenceTol is relative and ensure in a unit test 
 that this behavior does not change in Breeze without us realizing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6388) Spark 1.3 + Hadoop 2.6 Can't work on Java 8_40

2015-03-17 Thread Michael Malak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365758#comment-14365758
 ] 

Michael Malak commented on SPARK-6388:
--

Isn't it Hadoop 2.7 that is supposed to provide Java 8 compatibility?

 Spark 1.3 + Hadoop 2.6 Can't work on Java 8_40
 --

 Key: SPARK-6388
 URL: https://issues.apache.org/jira/browse/SPARK-6388
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Submit, YARN
Affects Versions: 1.3.0
 Environment: 1. Linux version 3.16.0-30-generic (buildd@komainu) (gcc 
 version 4.9.1 (Ubuntu 4.9.1-16ubuntu6) ) #40-Ubuntu SMP Mon Jan 12 22:06:37 
 UTC 2015
 2. Oracle Java 8 update 40  for Linux X64
 3. Scala 2.10.5
 4. Hadoop 2.6 (pre-build version)
Reporter: John
   Original Estimate: 24h
  Remaining Estimate: 24h

 I built Apache Spark 1.3 manually.
 ---
 JAVA_HOME=PATH_TO_JAVA8
 mvn clean package -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests
 ---
 Something goes wrong; Akka always tells me:
 ---
 15/03/17 21:28:10 WARN remote.ReliableDeliverySupervisor: Association with 
 remote system [akka.tcp://sparkYarnAM@Server2:42161] has failed, address is 
 now gated for [5000] ms. Reason is: [Disassociated].
 ---
 I built another version of Spark 1.3 + Hadoop 2.6 under Java 7.
 Everything works well.
 Logs
 ---
 15/03/17 21:27:06 INFO spark.SparkContext: Running Spark version 1.3.0
 15/03/17 21:27:07 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 15/03/17 21:27:08 INFO spark.SecurityManager: Changing view Servers to: hduser
 15/03/17 21:27:08 INFO spark.SecurityManager: Changing modify Servers to: 
 hduser
 15/03/17 21:27:08 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui Servers disabled; users with view permissions: Set(hduser); 
 users with modify permissions: Set(hduser)
 15/03/17 21:27:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/03/17 21:27:08 INFO Remoting: Starting remoting
 15/03/17 21:27:09 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkDriver@Server3:37951]
 15/03/17 21:27:09 INFO util.Utils: Successfully started service 'sparkDriver' 
 on port 37951.
 15/03/17 21:27:09 INFO spark.SparkEnv: Registering MapOutputTracker
 15/03/17 21:27:09 INFO spark.SparkEnv: Registering BlockManagerMaster
 15/03/17 21:27:09 INFO storage.DiskBlockManager: Created local directory at 
 /tmp/spark-0db692bb-cd02-40c8-a8f0-3813c6da18e2/blockmgr-a1d0ad23-ab76-4177-80a0-a6f982a64d80
 15/03/17 21:27:09 INFO storage.MemoryStore: MemoryStore started with capacity 
 265.1 MB
 15/03/17 21:27:09 INFO spark.HttpFileServer: HTTP File server directory is 
 /tmp/spark-502ef3f8-b8cd-45cf-b1df-97df297cdb35/httpd-6303e24d-4b2b-4614-bb1d-74e8d331189b
 15/03/17 21:27:09 INFO spark.HttpServer: Starting HTTP Server
 15/03/17 21:27:09 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/03/17 21:27:10 INFO server.AbstractConnector: Started 
 SocketConnector@0.0.0.0:48000
 15/03/17 21:27:10 INFO util.Utils: Successfully started service 'HTTP file 
 server' on port 48000.
 15/03/17 21:27:10 INFO spark.SparkEnv: Registering OutputCommitCoordinator
 15/03/17 21:27:10 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/03/17 21:27:10 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:4040
 15/03/17 21:27:10 INFO util.Utils: Successfully started service 'SparkUI' on 
 port 4040.
 15/03/17 21:27:10 INFO ui.SparkUI: Started SparkUI at http://Server3:4040
 15/03/17 21:27:10 INFO spark.SparkContext: Added JAR 
 file:/home/hduser/spark-java2.jar at 
 http://192.168.11.42:48000/jars/spark-java2.jar with timestamp 1426598830307
 15/03/17 21:27:10 INFO client.RMProxy: Connecting to ResourceManager at 
 Server3/192.168.11.42:8050
 15/03/17 21:27:11 INFO yarn.Client: Requesting a new application from cluster 
 with 3 NodeManagers
 15/03/17 21:27:11 INFO yarn.Client: Verifying our application has not 
 requested more than the maximum memory capability of the cluster (8192 MB per 
 container)
 15/03/17 21:27:11 INFO yarn.Client: Will allocate AM container, with 896 MB 
 memory including 384 MB overhead
 15/03/17 21:27:11 INFO yarn.Client: Setting up container launch context for 
 our AM
 15/03/17 21:27:11 INFO yarn.Client: Preparing resources for our AM container
 15/03/17 21:27:12 INFO yarn.Client: Uploading resource 
 file:/home/hduser/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.6.0.jar
  - 
 hdfs://Server3:9000/user/hduser/.sparkStaging/application_1426595477608_0002/spark-assembly-1.3.0-hadoop2.6.0.jar
 15/03/17 21:27:21 INFO yarn.Client: Setting up the launch environment for our 
 

[jira] [Closed] (SPARK-6358) Spark-submit error when using PYSPARK_PYTHON enviromnental variable

2015-03-17 Thread dustin davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dustin davidson closed SPARK-6358.
--
Resolution: Invalid

Found a workaround. The issue was with group permissions on the '/home' 
directory. Not necessarily a Spark bug.

 Spark-submit error when using PYSPARK_PYTHON enviromnental variable
 ---

 Key: SPARK-6358
 URL: https://issues.apache.org/jira/browse/SPARK-6358
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: Using CDH5.3 with Spark 1.2.0
Reporter: dustin davidson

 When using spark-submit, the PYSPARK_PYTHON setting causes an error. I can 
 run the pyspark REPL with PYSPARK_PYTHON set, but spark-submit does not 
 work. 
 Error received when running spark-submit:
 15/03/12 15:25:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
 hadoop-manager): java.io.IOException: Cannot run program 
 /home/hadmin/testenv/bin/python: error=13, Permission denied
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.io.IOException: error=13, Permission denied
   at java.lang.UNIXProcess.forkAndExec(Native Method)
   at java.lang.UNIXProcess.init(UNIXProcess.java:186)
   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
   ... 13 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365735#comment-14365735
 ] 

Yin Huai commented on SPARK-6250:
-

Update: It seems this problem affects struct fields when a field name uses a 
reserved type keyword as the name. Top-level columns are fine.

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5313) Create simple framework for highlighting changes introduced in a PR

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365561#comment-14365561
 ] 

Apache Spark commented on SPARK-5313:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/5072

 Create simple framework for highlighting changes introduced in a PR
 ---

 Key: SPARK-5313
 URL: https://issues.apache.org/jira/browse/SPARK-5313
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 For any given PR, we may want to run a bunch of checks along the following 
 lines: 
 * Show property X of {{master}}
 * Show the same property X of this PR
 * Call out any differences on the GitHub page
 It might be helpful to write a simple function that takes any check -- itself 
 represented as a function -- as input, runs the check on master and the PR, 
 and outputs the diff.
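
A minimal sketch of that shape (hypothetical names; the real tooling would presumably 
live in the dev/ scripts rather than in Scala, this only illustrates the higher-order 
structure):

{code:scala}
// Run the same check against master and the PR ref, and return both results so the
// caller can render any difference on the GitHub page.
def compareCheck[A](check: String => A)(masterRef: String, prRef: String): (A, A) =
  (check(masterRef), check(prRef))

// e.g. with a hypothetical countPublicClasses: String => Int
// val (onMaster, onPr) = compareCheck(countPublicClasses)("master", "pr-head")
{code}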



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6363) make scala 2.11 default language

2015-03-17 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365726#comment-14365726
 ] 

Jianshi Huang commented on SPARK-6363:
--

My two cents.

The only modules that are not working for 2.11 are thriftserver and 
kafka-streaming; as for Kafka, it only started to support 2.11 from 0.8.2 and 
Spark's kafka module depends on 0.8.0.

It should just be a matter of bumping the Kafka version to make it ready for 
2.11. AFAIK, the 0.8.2 client works well with earlier 0.8.x servers.

And we need to fix the thriftserver build issue.

Jianshi

 make scala 2.11 default language
 

 Key: SPARK-6363
 URL: https://issues.apache.org/jira/browse/SPARK-6363
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: antonkulaga
Priority: Minor
  Labels: scala

 Mostly all libraries have already moved to 2.11 and many are starting to drop 
 2.10 support, so it would be better if the Spark binaries were built with Scala 
 2.11 by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6204) GenerateProjection's equals should check length equality

2015-03-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-6204.
---
Resolution: Won't Fix

 GenerateProjection's equals should check length equality
 

 Key: SPARK-6204
 URL: https://issues.apache.org/jira/browse/SPARK-6204
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh

 GenerateProjection's equals now only checks column equality. It should also 
 check length equality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6392) [SQL] class not found exception thrown when `add jar` is used in the Spark CLI

2015-03-17 Thread jeanlyn (JIRA)
jeanlyn created SPARK-6392:
--

 Summary: [SQL] class not found exception thrown when `add jar` is used 
in the Spark CLI
 Key: SPARK-6392
 URL: https://issues.apache.org/jira/browse/SPARK-6392
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn
Priority: Minor


When we use the Spark CLI to add a jar dynamically, we get a 
*java.lang.ClassNotFoundException* when we use a class from that jar to create 
a UDF. For example:
{noformat}
spark-sql> add jar /home/jeanlyn/hello.jar;
spark-sql> create temporary function hello as 'hello';
spark-sql> select hello(name) from person;
Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): 
java.lang.ClassNotFoundException: hello
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5651) Support 'create db.table' in HiveContext

2015-03-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5651.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4427
[https://github.com/apache/spark/pull/4427]

 Support 'create db.table' in HiveContext
 

 Key: SPARK-5651
 URL: https://issues.apache.org/jira/browse/SPARK-5651
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yadong Qi
 Fix For: 1.4.0


 The current Spark version only supports ```create table 
 table_in_database_creation.test1 as select * from src limit 1;``` in 
 HiveContext.
 This patch adds support for ```create table 
 `table_in_database_creation.test2` as select * from src limit 1;``` in 
 HiveContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6366) In Python API, the default save mode for save and saveAsTable should be error instead of append.

2015-03-17 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6366:
--
Fix Version/s: (was: 1.3.0)
   1.3.1

 In Python API, the default save mode for save and saveAsTable should be 
 error instead of append.
 

 Key: SPARK-6366
 URL: https://issues.apache.org/jira/browse/SPARK-6366
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.3.1


 If a user wants to append data, he/she should explicitly specify the save 
 mode. Also, in Scala and Java, the default save mode is ErrorIfExists.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5404) Statistic of Logical Plan is too aggresive

2015-03-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5404.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4914
[https://github.com/apache/spark/pull/4914]

 Statistic of Logical Plan is too aggresive
 --

 Key: SPARK-5404
 URL: https://issues.apache.org/jira/browse/SPARK-5404
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
 Fix For: 1.4.0


 The statistics of a logical plan are quite helpful when doing optimizations 
 like join reordering; however, the default algorithm is too aggressive, which 
 can easily lead to a totally wrong join order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6247) Certain self joins cannot be analyzed

2015-03-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6247.
-
   Resolution: Fixed
Fix Version/s: 1.3.1

Issue resolved by pull request 5062
[https://github.com/apache/spark/pull/5062]

 Certain self joins cannot be analyzed
 -

 Key: SPARK-6247
 URL: https://issues.apache.org/jira/browse/SPARK-6247
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.3.1


 When you try the following code
 {code}
 val df =
   (1 to 10)
     .map(i => (i, i.toDouble, i.toLong, i.toString, i.toString))
     .toDF("intCol", "doubleCol", "longCol", "stringCol1", "stringCol2")
 df.registerTempTable("test")
 sql(
   """
     |SELECT x.stringCol2, avg(y.intCol), sum(x.doubleCol)
     |FROM test x JOIN test y ON (x.stringCol1 = y.stringCol1)
     |GROUP BY x.stringCol2
   """.stripMargin).explain()
 {code}
 The following exception will be thrown.
 {code}
 [info]   java.util.NoSuchElementException: next on empty iterator
 [info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
 [info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
 [info]   at 
 scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
 [info]   at scala.collection.IterableLike$class.head(IterableLike.scala:91)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47)
 [info]   at 
 scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
 [info]   at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:197)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
 [info]   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 [info]   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 [info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 [info]   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 [info]   at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 [info]   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 [info]   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 [info]   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 [info]   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 [info]   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:197)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:196)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
 [info]   at 
 scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
 [info]   at scala.collection.immutable.List.foldLeft(List.scala:84)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
 [info]   at scala.collection.immutable.List.foreach(List.scala:318)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:1071)
 [info]   at 
 

[jira] [Commented] (SPARK-6392) [SQL]class not found exception thows when `add jar` use spark cli

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366506#comment-14366506
 ] 

Apache Spark commented on SPARK-6392:
-

User 'jeanlyn' has created a pull request for this issue:
https://github.com/apache/spark/pull/5079

 [SQL]class not found exception thows when `add jar` use spark cli 
 --

 Key: SPARK-6392
 URL: https://issues.apache.org/jira/browse/SPARK-6392
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn
Priority: Minor

 When we use the Spark CLI to add a jar dynamically, we get a 
 *java.lang.ClassNotFoundException* when we use a class from that jar to create 
 a UDF. For example:
 {noformat}
 spark-sql> add jar /home/jeanlyn/hello.jar;
 spark-sql> create temporary function hello as 'hello';
 spark-sql> select hello(name) from person;
 Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most 
 recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): 
 java.lang.ClassNotFoundException: hello
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6393) Extra RPC to the AM during killExecutor invocation

2015-03-17 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-6393:
-

 Summary: Extra RPC to the AM during killExecutor invocation
 Key: SPARK-6393
 URL: https://issues.apache.org/jira/browse/SPARK-6393
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.3.1
Reporter: Sandy Ryza


This was introduced by SPARK-6325



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6200) Support dialect in SQL

2015-03-17 Thread haiyang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366569#comment-14366569
 ] 

haiyang commented on SPARK-6200:


Thank you for your comments.

The core idea of this implementation is to provide two interfaces, {{Dialect}} 
and {{DialectManager}}. Every dialect that wants to plug in its own parsing must 
implement the {{Dialect}} interface. {{DialectManager}} manages all registered 
{{Dialect}}s; there is a default implementation of this interface, and we could 
even open an API that lets users provide their own {{DialectManager}}. (A rough 
sketch of these two interfaces follows after the list below.) If you cannot 
agree with this core idea, I will close this.

As for your other questions, I think I misunderstood the original design intent, 
but that is only a problem with the {{DefaultDialectManager}} implementation. At 
a high level, in other words, it comes down to which {{DialectManager}} 
implementation is used.

We can achieve the following goals with only small code changes:

1. Always parse Spark SQL's own DDL first:
   a. add a DDLParser field to {{DefaultDialectManager}}
   b. change its parse method to: {{ddlParser(sqlText, 
false).getOrElse(currentDialect.parse(sqlText))}}

2. Switch dialects through the existing {{SET spark.sql.dialect}} API:
   drop the curDialect field in {{DefaultDialectManager}} and use 
{{sqlContext.conf.dialect}} to read and switch the dialect.

3. Drop the dialect commands to keep the code simpler.
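
For illustration, a rough Scala sketch of the two proposed interfaces. Only the names {{Dialect}} and {{DialectManager}} come from the comment above; the method signatures and the {{LogicalPlan}} return type are assumptions, not the actual proposal or patch.

{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical shape of the proposal: each dialect knows how to parse SQL text,
// and the manager keeps a registry plus the logic for picking the active dialect.
trait Dialect {
  def name: String
  def parse(sqlText: String): LogicalPlan
}

trait DialectManager {
  def register(dialect: Dialect): Unit
  def dialect(name: String): Option[Dialect]

  // A default implementation could try Spark SQL's own DDL parser first and read
  // the active dialect from spark.sql.dialect, as described in points 1 and 2 above.
  def parse(sqlText: String): LogicalPlan
}
{code}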

 Support dialect in SQL
 --

 Key: SPARK-6200
 URL: https://issues.apache.org/jira/browse/SPARK-6200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: haiyang

 Created a new dialect manager; supports dialect commands, adding new dialects 
 via SQL statements, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6366) In Python API, the default save mode for save and saveAsTable should be error instead of append.

2015-03-17 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6366.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 5053
[https://github.com/apache/spark/pull/5053]

 In Python API, the default save mode for save and saveAsTable should be 
 error instead of append.
 

 Key: SPARK-6366
 URL: https://issues.apache.org/jira/browse/SPARK-6366
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.3.0


 If a user wants to append data, he/she should explicitly specify the save 
 mode. Also, in Scala and Java, the default save mode is ErrorIfExists.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6383) Few examples on Dataframe operation give compiler errors

2015-03-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6383.

   Resolution: Fixed
Fix Version/s: 1.3.0

 Few examples on Dataframe operation give compiler errors 
 -

 Key: SPARK-6383
 URL: https://issues.apache.org/jira/browse/SPARK-6383
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial
 Fix For: 1.3.0


 The below statements give compiler errors:
 a) the select method does not accept (String, Column) 
 df.select("name", df("age") + 1).show() // Need to convert String to Column
 b) Filtering should be based on the "age" column, not on the "name" column
 df.filter(df("name") > 21).show()
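
For reference, the corrected forms of those two doc examples, sketched under the assumption of a {{sqlContext}} and the people.json file from the Spark SQL programming guide:

{code}
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

// select: pass Column expressions rather than mixing a bare String with a Column.
df.select(df("name"), df("age") + 1).show()

// filter: the comparison belongs on the "age" column, not "name".
df.filter(df("age") > 21).show()
{code}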



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6383) Few examples on Dataframe operation give compiler errors

2015-03-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6383:
---
Fix Version/s: (was: 1.3.0)
   1.3.1
   1.4.0

 Few examples on Dataframe operation give compiler errors 
 -

 Key: SPARK-6383
 URL: https://issues.apache.org/jira/browse/SPARK-6383
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial
  Labels: DataFrame
 Fix For: 1.4.0, 1.3.1


 The below statements give compiler errors:
 a) the select method does not accept (String, Column) 
 df.select("name", df("age") + 1).show() // Need to convert String to Column
 b) Filtering should be based on the "age" column, not on the "name" column
 df.filter(df("name") > 21).show()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6383) Few examples on Dataframe operation give compiler errors

2015-03-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6383:
---
Labels: DataFrame  (was: )

 Few examples on Dataframe operation give compiler errors 
 -

 Key: SPARK-6383
 URL: https://issues.apache.org/jira/browse/SPARK-6383
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial
  Labels: DataFrame
 Fix For: 1.4.0, 1.3.1


 The below statements give compiler errors:
 a) the select method does not accept (String, Column) 
 df.select("name", df("age") + 1).show() // Need to convert String to Column
 b) Filtering should be based on the "age" column, not on the "name" column
 df.filter(df("name") > 21).show()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5908) Hive udtf with single alias should be resolved correctly

2015-03-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5908.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4692
[https://github.com/apache/spark/pull/4692]

 Hive udtf with single alias should be resolved correctly
 

 Key: SPARK-5908
 URL: https://issues.apache.org/jira/browse/SPARK-5908
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh
 Fix For: 1.4.0


 ResolveUdtfsAlias in hiveUdfs only considers a HiveGenericUdtf with 
 multiple aliases. When only a single alias is used with HiveGenericUdtf, the 
 alias does not work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6356) Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366551#comment-14366551
 ] 

Apache Spark commented on SPARK-6356:
-

User 'watermen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5080

 Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext
 --

 Key: SPARK-6356
 URL: https://issues.apache.org/jira/browse/SPARK-6356
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yadong Qi

 Support for the expressions below:
 ```
 GROUP BY expression list WITH ROLLUP
 GROUP BY expression list WITH CUBE
 GROUP BY expression list GROUPING SETS(expression list2)
 ```
 and
 ```
 GROUP BY ROLLUP(expression list)
 GROUP BY CUBE(expression list)
 GROUP BY GROUPING SETS(expression list)
 ```
 and
 ```
 GROUPING (expression list)
 ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6330) newParquetRelation gets incorrect FileSystem

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366394#comment-14366394
 ] 

Apache Spark commented on SPARK-6330:
-

User 'ypcat' has created a pull request for this issue:
https://github.com/apache/spark/pull/5039

 newParquetRelation gets incorrect FileSystem
 

 Key: SPARK-6330
 URL: https://issues.apache.org/jira/browse/SPARK-6330
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Volodymyr Lyubinets
Assignee: Volodymyr Lyubinets
 Fix For: 1.4.0, 1.3.1


 Here's a snippet from newParquet.scala:
 def refresh(): Unit = {
   val fs = FileSystem.get(sparkContext.hadoopConfiguration)
   // Support either reading a collection of raw Parquet part-files, or a
   // collection of folders containing Parquet files (e.g. partitioned Parquet table).
   val baseStatuses = paths.distinct.map { p =>
     val qualified = fs.makeQualified(new Path(p))
     if (!fs.exists(qualified) && maybeSchema.isDefined) {
       fs.mkdirs(qualified)
       prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration)
     }
     fs.getFileStatus(qualified)
   }.toArray
 If we are running this locally and path points to S3, fs would be incorrect. 
 A fix is to construct fs for each file separately.
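
A hedged sketch of that suggested fix: resolve the FileSystem from each path's own scheme (Hadoop's Path.getFileSystem) instead of once from the SparkContext configuration, so s3:// and hdfs:// paths each get the right implementation. The helper name is illustrative, not the actual patch.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

def qualifiedStatuses(paths: Seq[String], conf: Configuration): Array[FileStatus] =
  paths.distinct.map { p =>
    val rawPath = new Path(p)
    val fs = rawPath.getFileSystem(conf) // FileSystem chosen by this path's scheme
    fs.getFileStatus(fs.makeQualified(rawPath))
  }.toArray
{code}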



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366237#comment-14366237
 ] 

Apache Spark commented on SPARK-5654:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/5077

 Integrate SparkR into Apache Spark
 --

 Key: SPARK-5654
 URL: https://issues.apache.org/jira/browse/SPARK-5654
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra
Reporter: Shivaram Venkataraman

 The SparkR project [1] provides a light-weight frontend to launch Spark jobs 
 from R. The project was started at the AMPLab around a year ago and has been 
 incubated as its own project to make sure it can be easily merged into 
 upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s 
 goals are similar to PySpark and shares a similar design pattern as described 
 in our meetup talk[2], Spark Summit presentation[3].
 Integrating SparkR into the Apache project will enable R users to use Spark 
 out of the box and given R’s large user base, it will help the Spark project 
 reach more users.  Additionally, work in progress features like providing R 
 integration with ML Pipelines and Dataframes can be better achieved by 
 development in a unified code base.
 SparkR is available under the Apache 2.0 License and does not have any 
 external dependencies other than requiring users to have R and Java installed 
 on their machines.  SparkR’s developers come from many organizations 
 including UC Berkeley, Alteryx, Intel and we will support future development, 
 maintenance after the integration.
 [1] https://github.com/amplab-extras/SparkR-pkg
 [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
 [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366269#comment-14366269
 ] 

Apache Spark commented on SPARK-6250:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5078

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366284#comment-14366284
 ] 

Yin Huai commented on SPARK-6250:
-

[~nitay] Can you try https://github.com/apache/spark/pull/5078 and see if it 
works for your case?

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6248) LocalRelation needs to implement statistics

2015-03-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6248:

Target Version/s: 1.3.1  (was: 1.4.0)

 LocalRelation needs to implement statistics
 ---

 Key: SPARK-6248
 URL: https://issues.apache.org/jira/browse/SPARK-6248
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai

 We need to implement statistics for LocalRelation. Otherwise, we cannot join 
 a LocalRelation with other tables. The following exception will be thrown.
 {code}
 java.lang.UnsupportedOperationException: LeafNode LocalRelation must 
 implement statistics.
 {code}
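
One possible shape of a fix, sketched here only as an assumption (the actual patch may differ): estimate sizeInBytes for a LocalRelation from its schema and row count so the planner no longer throws.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.Statistics

// Hypothetical helper: a crude per-row size from the attributes' default sizes,
// multiplied by the number of local rows; LocalRelation's own fields would be passed in.
def estimateLocalRelationStats(output: Seq[Attribute], data: Seq[Row]): Statistics =
  Statistics(sizeInBytes = BigInt(output.map(_.dataType.defaultSize).sum) * data.length)
{code}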



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6248) LocalRelation needs to implement statistics

2015-03-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366333#comment-14366333
 ] 

Yin Huai commented on SPARK-6248:
-

https://github.com/apache/spark/pull/5062 will also fix it.

 LocalRelation needs to implement statistics
 ---

 Key: SPARK-6248
 URL: https://issues.apache.org/jira/browse/SPARK-6248
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai

 We need to implement statistics for LocalRelation. Otherwise, we cannot join 
 a LocalRelation with other tables. The following exception will be thrown.
 {code}
 java.lang.UnsupportedOperationException: LeafNode LocalRelation must 
 implement statistics.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11

2015-03-17 Thread Olivier Toupin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366295#comment-14366295
 ] 

Olivier Toupin commented on SPARK-6154:
---

I made an imperfect fix for hive-thriftserver.

https://github.com/oliviertoupin/spark/compare/v1.3.0...oliviertoupin:olivier-sql-cli-2.11

To test:

Download 1.3.0.
dev/change-version-to-2.11.sh
mvn -Dscala-2.11 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive 
-Phive-thriftserver -DskipTests clean package

This basically shades jline for the Spark REPL, and shades the Hive jline version 
for the SQL CLI, to avoid conflicts.

For me everything seems to work fine, but feel free to test further.

One problem: when running spark-shell I get this error:

Failed to created JLineReader: java.lang.NoClassDefFoundError: 
jline/console/completer/Completer
Falling back to SimpleReader.

This prevents autocompletion from working.

So it might not be a mergeable patch for now, but it might be helpful for people 
that, like us, want to run 2.11 and spark-sql. You lose autocompletion (were you 
using it?), but both components work with 2.11.

 Support Kafka, JDBC in Scala 2.11
 -

 Key: SPARK-6154
 URL: https://issues.apache.org/jira/browse/SPARK-6154
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
Reporter: Jianshi Huang

 Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation 
 failed when -Phive-thriftserver is enabled.
 [info] Compiling 9 Scala sources to 
 /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25: object ConsoleReader is not a member of package jline
 [error] import jline.{ConsoleReader, History}
 [error]        ^
 [warn] Class jline.Completor not found - continuing with a stub.
 [warn] Class jline.ConsoleReader not found - continuing with a stub.
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165: not found: type ConsoleReader
 [error] val reader = new ConsoleReader()
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6248) LocalRelation needs to implement statistics

2015-03-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6248:

Description: 
We need to implement statistics for LocalRelation. Otherwise, we cannot join a 
LocalRelation with other tables. The following exception will be thrown.

{code}
java.lang.UnsupportedOperationException: LeafNode LocalRelation must implement 
statistics.
{code}


  was:
We need to implement statistics for LocalRelation. Otherwise, we cannot join a 
LocalRelation with other tables. The following exception will be thrown.

{code}
java.lang.UnsupportedOperationException: LeafNode LocalRelation must implement 
statistics.
{code]


 LocalRelation needs to implement statistics
 ---

 Key: SPARK-6248
 URL: https://issues.apache.org/jira/browse/SPARK-6248
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai

 We need to implement statistics for LocalRelation. Otherwise, we cannot join 
 a LocalRelation with other tables. The following exception will be thrown.
 {code}
 java.lang.UnsupportedOperationException: LeafNode LocalRelation must 
 implement statistics.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6391) Update Tachyon version compatibility documentation

2015-03-17 Thread Calvin Jia (JIRA)
Calvin Jia created SPARK-6391:
-

 Summary: Update Tachyon version compatibility documentation
 Key: SPARK-6391
 URL: https://issues.apache.org/jira/browse/SPARK-6391
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Reporter: Calvin Jia


Tachyon v0.6 has an API change in the client; it would be helpful to document 
Tachyon-Spark compatibility across versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation

2015-03-17 Thread Haoyuan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyuan Li updated SPARK-6391:
--
Component/s: (was: Spark Core)
 Documentation

 Update Tachyon version compatibility documentation
 --

 Key: SPARK-6391
 URL: https://issues.apache.org/jira/browse/SPARK-6391
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Calvin Jia

 Tachyon v0.6 has an API change in the client; it would be helpful to document 
 Tachyon-Spark compatibility across versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation

2015-03-17 Thread Haoyuan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyuan Li updated SPARK-6391:
--
Fix Version/s: 1.4.0

 Update Tachyon version compatibility documentation
 --

 Key: SPARK-6391
 URL: https://issues.apache.org/jira/browse/SPARK-6391
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Calvin Jia
 Fix For: 1.4.0


 Tachyon v0.6 has an API change in the client; it would be helpful to document 
 Tachyon-Spark compatibility across versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation

2015-03-17 Thread Haoyuan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyuan Li updated SPARK-6391:
--
Affects Version/s: 1.3.0

 Update Tachyon version compatibility documentation
 --

 Key: SPARK-6391
 URL: https://issues.apache.org/jira/browse/SPARK-6391
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Calvin Jia
 Fix For: 1.4.0


 Tachyon v0.6 has an API change in the client; it would be helpful to document 
 Tachyon-Spark compatibility across versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366021#comment-14366021
 ] 

Xiangrui Meng commented on SPARK-6334:
--

A couple of suggestions before SPARK-5955 is implemented:

1. Upgrade to Spark 1.3. ALS received a new implementation in 1.3, where the 
shuffle size is reduced.
2. Use a smaller number of blocks, even if you have more CPU cores (see the 
sketch below). There is a trade-off between communication and computation, and 
with k = 50 I think the communication still dominates.
3. Minor: build Spark with -Pnetlib-lgpl to include native BLAS/LAPACK 
libraries.
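
A hedged Scala sketch of suggestion 2 combined with checkpointing, mirroring the pyspark snippet quoted in the description; the checkpoint path and the block count of 64 are illustrative values, not recommendations for this specific cluster.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def trainWithFewerBlocks(sc: SparkContext, ratings: RDD[Rating]) = {
  sc.setCheckpointDir("/tmp/als-checkpoints")
  // rank = 50, 15 iterations, lambda = 0.1, alpha = 40, but an explicit, modest
  // number of blocks instead of one block per core (blocks = -1).
  ALS.trainImplicit(ratings, 50, 15, 0.1, 64, 40.0)
}
{code}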

 spark-local dir not getting cleared during ALS
 --

 Key: SPARK-6334
 URL: https://issues.apache.org/jira/browse/SPARK-6334
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Antony Mayi
 Attachments: als-diskusage.png


 when running a bigger ALS training job, spark spills loads of temp data into the 
 local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running 
 on YARN from cdh 5.3.2), eventually causing all the disks of all nodes to run 
 out of space (in my case I have 12TB of available disk capacity before 
 kicking off the ALS but it all gets used, and yarn kills the containers when 
 reaching 90%).
 even with all the recommended options (configuring checkpointing and forcing GC 
 when possible) it still doesn't get cleared.
 here is my (pseudo)code (pyspark):
 {code}
 sc.setCheckpointDir('/tmp')
 training = 
 sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
 model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
 sc._jvm.System.gc()
 {code}
 the training RDD has about 3.5 billion items (~60GB on disk). after about 
 6 hours the ALS will consume all 12TB of disk space in the local-dir and 
 get killed. my cluster has 192 cores and 1.5TB RAM, and for this task I am using 
 37 executors of 4 cores/28+4GB RAM each.
 this is the graph of disk consumption pattern showing the space being all 
 eaten from 7% to 90% during the ALS (90% is when YARN kills the container):
 !als-diskusage.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4590) Early investigation of parameter server

2015-03-17 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365581#comment-14365581
 ] 

Reza Zadeh commented on SPARK-4590:
---

Hi Qiping, We are waiting for IndexedRDD to be optimized and merged into Spark 
master.


 Early investigation of parameter server
 ---

 Key: SPARK-4590
 URL: https://issues.apache.org/jira/browse/SPARK-4590
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 In the current implementation of GLM solvers, we save intermediate models 
 on the driver node and update them through broadcast and aggregation. Even with 
 torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond 
 ~10 million features. This JIRA is for investigating the parameter server 
 approach, including algorithm, infrastructure, and dependencies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3351) Yarn YarnRMClientImpl.shutdown can be called before register - NPE

2015-03-17 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves closed SPARK-3351.

Resolution: Invalid

 Yarn YarnRMClientImpl.shutdown can be called before register - NPE
 --

 Key: SPARK-3351
 URL: https://issues.apache.org/jira/browse/SPARK-3351
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves

 If the SparkContext exits while it's in 
 applicationmaster.waitForSparkContextInitialized, then the 
 YarnRMClientImpl.shutdown can be called before register and you get a null 
 pointer exception on the uihistoryAddress.
 14/09/02 18:59:21 INFO ApplicationMaster: Finishing ApplicationMaster with 
 FAILED (diag message: Timed out waiting for SparkContext.)
 Exception in thread main java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.proto.YarnServiceProtos$FinishApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:2312)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.FinishApplicationMasterRequestPBImpl.setTrackingUrl(FinishApplicationMasterRequestPBImpl.java:121)
   at 
 org.apache.spark.deploy.yarn.YarnRMClientImpl.shutdown(YarnRMClientImpl.scala:73)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.finish(ApplicationMaster.scala:140)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:178)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:113)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3351) Yarn YarnRMClientImpl.shutdown can be called before register - NPE

2015-03-17 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365607#comment-14365607
 ] 

Thomas Graves commented on SPARK-3351:
--

ok lets close this and reopen if we see it again.

 Yarn YarnRMClientImpl.shutdown can be called before register - NPE
 --

 Key: SPARK-3351
 URL: https://issues.apache.org/jira/browse/SPARK-3351
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves

 If the SparkContext exits while it's in 
 applicationmaster.waitForSparkContextInitialized, then the 
 YarnRMClientImpl.shutdown can be called before register and you get a null 
 pointer exception on the uihistoryAddress.
 14/09/02 18:59:21 INFO ApplicationMaster: Finishing ApplicationMaster with 
 FAILED (diag message: Timed out waiting for SparkContext.)
 Exception in thread main java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.proto.YarnServiceProtos$FinishApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:2312)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.FinishApplicationMasterRequestPBImpl.setTrackingUrl(FinishApplicationMasterRequestPBImpl.java:121)
   at 
 org.apache.spark.deploy.yarn.YarnRMClientImpl.shutdown(YarnRMClientImpl.scala:73)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.finish(ApplicationMaster.scala:140)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:178)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:113)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6336) LBFGS should document what convergenceTol means

2015-03-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6336:
-
Fix Version/s: (was: 1.3.0)
   1.3.1
   1.4.0

 LBFGS should document what convergenceTol means
 ---

 Key: SPARK-6336
 URL: https://issues.apache.org/jira/browse/SPARK-6336
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Trivial
 Fix For: 1.4.0, 1.3.1


 LBFGS uses Breeze's LBFGS, which uses relative convergence tolerance.  We 
 should document that convergenceTol is relative and ensure in a unit test 
 that this behavior does not change in Breeze without us realizing it.
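
For illustration, a short sketch of the setter being documented (MLlib's LBFGS in org.apache.spark.mllib.optimization); the gradient/updater choice and the 1e-4 value are arbitrary examples, not recommendations.

{code}
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

// convergenceTol here is a *relative* tolerance: it is passed through to Breeze's
// LBFGS, which stops when the relative improvement falls below this threshold.
val optimizer = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
  .setConvergenceTol(1e-4)
  .setRegParam(0.01)
{code}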



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6336) LBFGS should document what convergenceTol means

2015-03-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6336.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 5033
[https://github.com/apache/spark/pull/5033]

 LBFGS should document what convergenceTol means
 ---

 Key: SPARK-6336
 URL: https://issues.apache.org/jira/browse/SPARK-6336
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Trivial
 Fix For: 1.3.0


 LBFGS uses Breeze's LBFGS, which uses relative convergence tolerance.  We 
 should document that convergenceTol is relative and ensure in a unit test 
 that this behavior does not change in Breeze without us realizing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6336) LBFGS should document what convergenceTol means

2015-03-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6336:
-
Assignee: Kai Sasaki

 LBFGS should document what convergenceTol means
 ---

 Key: SPARK-6336
 URL: https://issues.apache.org/jira/browse/SPARK-6336
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Kai Sasaki
Priority: Trivial
 Fix For: 1.4.0, 1.3.1


 LBFGS uses Breeze's LBFGS, which uses relative convergence tolerance.  We 
 should document that convergenceTol is relative and ensure in a unit test 
 that this behavior does not change in Breeze without us realizing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4416) Support Mesos framework authentication

2015-03-17 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366011#comment-14366011
 ] 

Adam B commented on SPARK-4416:
---

Can close as duplicate of SPARK-6284

 Support Mesos framework authentication
 --

 Key: SPARK-4416
 URL: https://issues.apache.org/jira/browse/SPARK-4416
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen

 Support Mesos framework authentication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5955) Add checkpointInterval to ALS

2015-03-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-5955:


Assignee: Xiangrui Meng

 Add checkpointInterval to ALS
 -

 Key: SPARK-5955
 URL: https://issues.apache.org/jira/browse/SPARK-5955
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We should add checkpoint interval to ALS to prevent the following:
 1. storing large shuffle files
 2. stack overflow (SPARK-1106)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365895#comment-14365895
 ] 

Apache Spark commented on SPARK-5750:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/5074

 Document that ordering of elements in shuffled partitions is not 
 deterministic across runs
 --

 Key: SPARK-5750
 URL: https://issues.apache.org/jira/browse/SPARK-5750
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Josh Rosen

 The ordering of elements in shuffled partitions is not deterministic across 
 runs.  For instance, consider the following example:
 {code}
 val largeFiles = sc.textFile(...)
 val airlines = largeFiles.repartition(2000).cache()
 println(airlines.first)
 {code}
 If this code is run twice, then each run will output a different result.  
 There is non-determinism in the shuffle read code that accounts for this:
 Spark's shuffle read path processes blocks as soon as they are fetched. Spark 
 uses 
 [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala]
  to fetch shuffle data from mappers.  In this code, requests for multiple 
 blocks from the same host are batched together, so nondeterminism in where 
 tasks are run means that the set of requests can vary across runs.  In 
 addition, there's an [explicit 
 call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256]
  to randomize the order of the batched fetch requests.  As a result, shuffle 
 operations cannot be guaranteed to produce the same ordering of the elements 
 in their partitions.
 Therefore, Spark should update its docs to clarify that the ordering of 
 elements in shuffle RDDs' partitions is non-deterministic.  Note, however, 
 that the _set_ of elements in each partition will be deterministic: if we 
 used {{mapPartitions}} to sort each partition, then the {{first()}} call 
 above would produce a deterministic result.
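
A hedged sketch of that workaround: sorting inside each partition makes the set-level determinism visible to first(), even though the fetch order stays randomized. `largeFiles` is the RDD[String] from the snippet above.

{code}
val airlines = largeFiles
  .repartition(2000)
  .mapPartitions(iter => iter.toArray.sorted.iterator)
  .cache()

// Deterministic across runs, because each partition's contents are now ordered.
println(airlines.first())
{code}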



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6226) Support model save/load in Python's KMeans

2015-03-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6226.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5049
[https://github.com/apache/spark/pull/5049]

 Support model save/load in Python's KMeans
 --

 Key: SPARK-6226
 URL: https://issues.apache.org/jira/browse/SPARK-6226
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Xusen Yin
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365896#comment-14365896
 ] 

Apache Spark commented on SPARK-3441:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/5074

 Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style 
 shuffle
 ---

 Key: SPARK-3441
 URL: https://issues.apache.org/jira/browse/SPARK-3441
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Patrick Wendell
Assignee: Sandy Ryza

 I think it would be good to say something like this in the doc for 
 repartitionAndSortWithinPartitions, and maybe also add it to the doc for groupBy:
 {code}
 This can be used to enact a Hadoop Style shuffle along with a call to 
 mapPartitions, e.g.:
rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...)
 {code}
 It might also be nice to add a version that doesn't take a partitioner and/or 
 to mention this in the groupBy javadoc. I guess it depends a bit whether we 
 consider this to be an API we want people to use more widely or whether we 
 just consider it a narrow stable API mostly for Hive-on-Spark. If we want 
 people to consider this API when porting workloads from Hadoop, then it might 
 be worth documenting better.
 What do you think [~rxin] and [~matei]?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


