[jira] [Commented] (SPARK-7324) Add DataFrame.dropDuplicates

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525686#comment-14525686
 ] 

Apache Spark commented on SPARK-7324:
-

User 'kaka1992' has created a pull request for this issue:
https://github.com/apache/spark/pull/5870

 Add DataFrame.dropDuplicates
 

 Key: SPARK-7324
 URL: https://issues.apache.org/jira/browse/SPARK-7324
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 Similar to 
 http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
 def dropDuplicates(): DataFrame
 def dropDuplicates(subset: Seq[String]): DataFrame
 We can turn this into groupBy(cols).agg(first(...))
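 A minimal sketch of that groupBy(cols).agg(first(...)) rewrite (the helper name is hypothetical; column aliasing and the case where the subset covers all columns are ignored):
 {code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.first

// Keep one row per distinct combination of the subset columns,
// carrying along the first value seen for every other column.
def dropDuplicatesViaGroupBy(df: DataFrame, subset: Seq[String]): DataFrame = {
  val otherCols = df.columns.filterNot(subset.contains)
  df.groupBy(subset.map(df.apply): _*)
    .agg(first(otherCols.head), otherCols.tail.map(c => first(c)): _*)
}
 {code}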






[jira] [Assigned] (SPARK-7324) Add DataFrame.dropDuplicates

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7324:
---

Assignee: Apache Spark

 Add DataFrame.dropDuplicates
 

 Key: SPARK-7324
 URL: https://issues.apache.org/jira/browse/SPARK-7324
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 Similar to 
 http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
 def dropDuplicates(): DataFrame
 def dropDuplicates(subset: Seq[String]): DataFrame
 We can turn this into groupBy(cols).agg(first(...))






[jira] [Created] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time

2015-05-03 Thread Wesley Miao (JIRA)
Wesley Miao created SPARK-7326:
--

 Summary: Performing window() on a WindowedDStream doesn't work all 
the time
 Key: SPARK-7326
 URL: https://issues.apache.org/jira/browse/SPARK-7326
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Wesley Miao


Someone reported similar issues before but got no response.
http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html

I hit the same issue recently; it can be reproduced in 1.3.1 with the following piece of code:

def main(args: Array[String]) {

  val batchInterval = 1234
  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")

  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {
    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)

  (1 to 300) foreach { i =>
    rddQueue.enqueue(createRDD(i))
  }

  val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5,
    Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)

  val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15,
    Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}

Here we have two windowed DStream instances, l1 and l2.

l1 is the DStream that results from performing window() on the source DStream, with both window and slide duration set to 5 times the batch interval of the source stream.

l2 is the DStream that results from performing window() on l1, with both window and slide duration set to 3 times l1's batch interval, i.e. 15 times that of the source stream.

In the output of this simple streaming app, I can only see data printed from l1 and no data printed from l2.

Digging into the source code, I found the problem most likely resides in the DStream.slice() implementation, shown below.

  def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
    if (!isInitialized) {
      throw new SparkException(this + " has not been initialized")
    }
    if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
      logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
        + slideDuration + ")")
    }
    if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
      logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration ("
        + slideDuration + ")")
    }
    val alignedToTime = toTime.floor(slideDuration, zeroTime)
    val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

    logInfo("Slicing from " + fromTime + " to " + toTime +
      " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")

    alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
      if (time >= zeroTime) getOrCompute(time) else None
    })
  }

Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, which makes the isTimeValid check fail for all the remaining computation.

The fix would be to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:

  def floor(that: Duration, zeroTime: Time): Time = {
val t = that.milliseconds
new Time(((this.millis - zeroTime.milliseconds) / t) * t + 
zeroTime.milliseconds)
  } 

And then change the DStream.slice to call this new floor function by passing in 
its zeroTime.

val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

This way the alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not actually 0.
 







[jira] [Assigned] (SPARK-7324) Add DataFrame.dropDuplicates

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7324:
---

Assignee: (was: Apache Spark)

 Add DataFrame.dropDuplicates
 

 Key: SPARK-7324
 URL: https://issues.apache.org/jira/browse/SPARK-7324
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 Similar to 
 http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
 def dropDuplicates(): DataFrame
 def dropDuplicates(subset: Seq[String]): DataFrame
 We can turn this into groupBy(cols).agg(first(...))






[jira] [Updated] (SPARK-6026) Eliminate the bypassMergeThreshold parameter and associated hash-ish shuffle within the Sort shuffle code

2015-05-03 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6026:
--
Component/s: Shuffle

 Eliminate the bypassMergeThreshold parameter and associated hash-ish shuffle 
 within the Sort shuffle code
 -

 Key: SPARK-6026
 URL: https://issues.apache.org/jira/browse/SPARK-6026
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.3.0
Reporter: Kay Ousterhout

 The bypassMergeThreshold parameter (and associated use of a hash-ish shuffle 
 when the number of partitions is less than this) is basically a workaround 
 for SparkSQL, because the fact that the sort-based shuffle stores 
 non-serialized objects is a deal-breaker for SparkSQL, which re-uses objects. 
  Once the sort-based shuffle is changed to store serialized objects, we 
 should never be secretly doing hash-ish shuffle even when the user has 
 specified to use sort-based shuffle (because of its otherwise worse 
 performance).
 [~rxin][~adav], masters of shuffle, it would be helpful to get agreement from 
 you on this proposal (and also a sanity check that I've correctly 
 characterized the issue).







[jira] [Commented] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones

2015-05-03 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526291#comment-14526291
 ] 

Xiangrui Meng commented on SPARK-6411:
--

[~airhorns] I'm testing Spark with the Pyrolite master branch. With your patch, it is possible to create DataFrames with datetimes that contain tzinfo. However, I found two issues:

1. A tz-unaware date (or maybe datetime) object becomes tz-aware after a round 
trip, which makes `test_apply_schema` in `sql/tests.py` fail.
2. The tzinfo does not remain the same after a round trip.

Test code:
{code}
def test_datetime_with_timezone(self):
    """
    SPARK-6411
    """
    try:
        import pytz
        has_pytz = True
    except:
        has_pytz = False

    if has_pytz:
        tz = pytz.timezone('America/Los_Angeles')
        date = datetime.datetime(2014, 7, 8, 11, 10, tzinfo=tz)
        first = self.sqlCtx.createDataFrame([(date,)]).first()[0]
        self.assertEqual(date, first)
{code}

Output:
{code}
==
FAIL: test_datetime_with_timezone (__main__.SQLTests)
--
Traceback (most recent call last):
  File "/Users/meng/src/spark/python/pyspark/sql/tests.py", line 536, in test_datetime_with_timezone
    self.assertEqual(date, first)
AssertionError: datetime.datetime(2014, 7, 8, 11, 10, tzinfo=<DstTzInfo 'America/Los_Angeles' PST-1 day, 16:00:00 STD>) != datetime.datetime(2014, 7, 8, 11, 10, tzinfo=<DstTzInfo 'America/Los_Angeles' PDT-1 day, 17:00:00 DST>)
{code}

I will check the conversion for date/datetime. It would be really helpful if 
you could provide insights.

 PySpark DataFrames can't be created if any datetimes have timezones
 ---

 Key: SPARK-6411
 URL: https://issues.apache.org/jira/browse/SPARK-6411
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
Reporter: Harry Brundage
Assignee: Xiangrui Meng

 I am unable to create a DataFrame with PySpark if any of the {{datetime}} 
 objects that pass through the conversion process have a {{tzinfo}} property 
 set. 
 This works fine:
 {code}
 In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 
 10),)]).toDF().collect()
 Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))]
 {code}
 As expected, the tuple's schema is inferred as having one anonymous column 
 with a datetime field, and the datetime round-trips through to the Java side 
 and back, via Python deserialization, into Python land upon {{collect}}. This, 
 however:
 {code}
 In [5]: from dateutil.tz import tzutc
 In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, 
 tzinfo=tzutc()),)]).toDF().collect()
 {code}
 explodes with
 {code}
 Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid 
 pickle data for datetime; expected 1 or 7 args, got 2
   at 
 net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69)
   at 
 net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32)
   at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
   at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
   at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
   at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
   at 
 org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154)
   at 
 org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

[jira] [Closed] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions

2015-05-03 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly closed SPARK-4184.
---
Resolution: Duplicate

We'll incorporate the changes incrementally.

 Improve Spark Streaming documentation to address commonly-asked questions 
 --

 Key: SPARK-4184
 URL: https://issues.apache.org/jira/browse/SPARK-4184
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Reporter: Chris Fregly
  Labels: documentation, streaming

 Improve Streaming documentation including API descriptions, 
 concurrency/thread safety, fault tolerance, replication, checkpointing, 
 scalability, resource allocation and utilization, back pressure, and 
 monitoring.
 Also, add a section to the Kinesis streaming guide describing how to use IAM 
 roles with the Spark Kinesis Receiver.






[jira] [Updated] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library

2015-05-03 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6654:

Priority: Major  (was: Blocker)
Target Version/s: 1.5.0  (was: 1.4.0)

 Update Kinesis Streaming impls (both KCL-based and Direct) to use latest 
 aws-java-sdk and kinesis-client-library
 

 Key: SPARK-6654
 URL: https://issues.apache.org/jira/browse/SPARK-6654
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly








[jira] [Resolved] (SPARK-6907) Create an isolated classloader for the Hive Client.

2015-05-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6907.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5851
[https://github.com/apache/spark/pull/5851]

 Create an isolated classloader for the Hive Client.
 ---

 Key: SPARK-6907
 URL: https://issues.apache.org/jira/browse/SPARK-6907
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
 Fix For: 1.4.0









[jira] [Resolved] (SPARK-7302) SPARK building documentation still mentions building for yarn 0.23

2015-05-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7302.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Sean Owen

Resolved by https://github.com/apache/spark/pull/5863

 SPARK building documentation still mentions building for yarn 0.23
 --

 Key: SPARK-7302
 URL: https://issues.apache.org/jira/browse/SPARK-7302
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.3.1
Reporter: Thomas Graves
Assignee: Sean Owen
 Fix For: 1.4.0


 as of SPARK-3445 we deprecated using hadoop 0.23.  It looks like the building 
 documentation still references it though. We should remove that.






[jira] [Commented] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time

2015-05-03 Thread Wesley Miao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526062#comment-14526062
 ] 

Wesley Miao commented on SPARK-7326:


What I'd like to achieve is multi-level aggregation over a source logging stream. For example, say the source logging stream has a 1-second batch interval. The first-level aggregation would be every 1 minute, i.e. 60 intervals of the source stream. The second level would be every 1 hour, and the third level every 1 day. We can add more levels if we want.

What I hope is that at each level we do reduceByKey(_ + _) against the immediately preceding level, instead of always aggregating against the source stream: level 3's reduceByKey is based on level 2's result, level 2 on level 1, and level 1 on the source stream.

I would expect this approach to be more efficient than always reducing over the source stream, particularly for the higher-level (daily and weekly) aggregations.
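For concreteness, here is a minimal sketch of the chained windows I have in mind, assuming a source DStream `logs` of (word, count) pairs with a 1-second batch interval (Spark Streaming only ships Milliseconds/Seconds/Minutes duration helpers, so the larger windows are written in minutes):

{code}
import org.apache.spark.streaming.Minutes

// Each level windows and reduces over the previous level, not over the source stream.
val perMinute = logs.window(Minutes(1), Minutes(1)).reduceByKey(_ + _)
val perHour   = perMinute.window(Minutes(60), Minutes(60)).reduceByKey(_ + _)
val perDay    = perHour.window(Minutes(60 * 24), Minutes(60 * 24)).reduceByKey(_ + _)
{code}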

 Performing window() on a WindowedDStream doesn't work all the time
 --

 Key: SPARK-7326
 URL: https://issues.apache.org/jira/browse/SPARK-7326
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Wesley Miao


[jira] [Assigned] (SPARK-6908) Refactor existing code to use the isolated client

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6908:
---

Assignee: (was: Apache Spark)

 Refactor existing code to use the isolated client
 -

 Key: SPARK-6908
 URL: https://issues.apache.org/jira/browse/SPARK-6908
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust








[jira] [Commented] (SPARK-6908) Refactor existing code to use the isolated client

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526104#comment-14526104
 ] 

Apache Spark commented on SPARK-6908:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/5876

 Refactor existing code to use the isolated client
 -

 Key: SPARK-6908
 URL: https://issues.apache.org/jira/browse/SPARK-6908
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust








[jira] [Assigned] (SPARK-6908) Refactor existing code to use the isolated client

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6908:
---

Assignee: Apache Spark

 Refactor existing code to use the isolated client
 -

 Key: SPARK-6908
 URL: https://issues.apache.org/jira/browse/SPARK-6908
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Apache Spark








[jira] [Updated] (SPARK-7302) SPARK building documentation still mentions building for yarn 0.23

2015-05-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7302:
-
Priority: Minor  (was: Major)

 SPARK building documentation still mentions building for yarn 0.23
 --

 Key: SPARK-7302
 URL: https://issues.apache.org/jira/browse/SPARK-7302
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.3.1
Reporter: Thomas Graves
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.4.0


 as of SPARK-3445 we deprecated using hadoop 0.23.  It looks like the building 
 documentation still references it though. We should remove that.






[jira] [Commented] (SPARK-6873) Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements

2015-05-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526019#comment-14526019
 ] 

Sean Owen commented on SPARK-6873:
--

Sorry for the late reply, missed this. Yes, HashMap ordering did change in Java 
8. Hm, so it sounds like the test can't pass on both Java 7 and 8 at the same 
time then, since the golden output would fail to match one or the other?

Or are you saying there's a way to ignore this part of the output?

Otherwise we can track this for a later version like Spark 2.x where we may 
require Java 8.
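If we did want to ignore that part of the output, one option would be to normalize the synonym list before comparing. A sketch only (normalizeSynonyms is a hypothetical helper, not existing HiveComparisonTest code), applied to both the golden and the actual output:

{code}
// Sort the comma-separated synonym list so Java 7 and Java 8 orderings compare equal.
def normalizeSynonyms(line: String): String =
  if (line.startsWith("Synonyms: ")) {
    val names = line.stripPrefix("Synonyms: ").split(",").map(_.trim).sorted
    "Synonyms: " + names.mkString(", ")
  } else {
    line
  }
{code}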

 Some Hive-Catalyst comparison tests fail due to unimportant order of some 
 printed elements
 --

 Key: SPARK-6873
 URL: https://issues.apache.org/jira/browse/SPARK-6873
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.3.1
Reporter: Sean Owen
Assignee: Cheng Lian
Priority: Minor

 As I mentioned, I've been seeing 4 test failures in the Hive tests for a while, 
 and they still affect master. I think it's a superficial problem that only turns 
 up when running on Java 8, but it would probably be an easy fix and is worth fixing.
 Specifically, here are the four tests and the bit that fails the comparison, 
 below. I tried to diagnose this but had trouble even finding where some of this 
 output is produced, like the list of synonyms.
 {code}
 - show_tblproperties *** FAILED ***
   Results do not match for show_tblproperties:
 ...
   !== HIVE - 2 row(s) ==   == CATALYST - 2 row(s) ==
   !tmptruebar bar value
   !barbar value   tmp true (HiveComparisonTest.scala:391)
 {code}
 {code}
 - show_create_table_serde *** FAILED ***
   Results do not match for show_create_table_serde:
 ...
WITH SERDEPROPERTIES (  WITH 
 SERDEPROPERTIES ( 
   !  'serialization.format'='$', 
 'field.delim'=',', 
   !  'field.delim'=',')  
 'serialization.format'='$')
 {code}
 {code}
 - udf_std *** FAILED ***
   Results do not match for udf_std:
 ...
   !== HIVE - 2 row(s) == == CATALYST 
 - 2 row(s) ==
std(x) - Returns the standard deviation of a set of numbers   std(x) - 
 Returns the standard deviation of a set of numbers
   !Synonyms: stddev_pop, stddev  Synonyms: 
 stddev, stddev_pop (HiveComparisonTest.scala:391)
 {code}
 {code}
 - udf_stddev *** FAILED ***
   Results do not match for udf_stddev:
 ...
   !== HIVE - 2 row(s) ==== 
 CATALYST - 2 row(s) ==
stddev(x) - Returns the standard deviation of a set of numbers   stddev(x) 
 - Returns the standard deviation of a set of numbers
   !Synonyms: stddev_pop, stdSynonyms: 
 std, stddev_pop (HiveComparisonTest.scala:391)
 {code}






[jira] [Commented] (SPARK-7013) Add unit test for spark.ml StandardScaler

2015-05-03 Thread Glenn Weidner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526042#comment-14526042
 ] 

Glenn Weidner commented on SPARK-7013:
--

I would like to work on this. I've started looking into the differences between spark.ml StandardScaler and mllib StandardScaler, including mllib's existing test StandardScalerSuite.

 Add unit test for spark.ml StandardScaler
 -

 Key: SPARK-7013
 URL: https://issues.apache.org/jira/browse/SPARK-7013
 Project: Spark
  Issue Type: Test
  Components: ML
Affects Versions: 1.3.1
Reporter: Joseph K. Bradley
Priority: Minor








[jira] [Commented] (SPARK-7275) Make LogicalRelation public

2015-05-03 Thread Glenn Weidner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526098#comment-14526098
 ] 

Glenn Weidner commented on SPARK-7275:
--

I checked the history of the file org.apache.spark.sql.sources.LogicalRelation and it looks like LogicalRelation has always been declared private[sql]. Can you provide more details as to how this makes it harder to work with full logical plans from third-party packages?

 Make LogicalRelation public
 ---

 Key: SPARK-7275
 URL: https://issues.apache.org/jira/browse/SPARK-7275
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Santiago M. Mola
Priority: Minor

 It seems LogicalRelation is the only part of the LogicalPlan that is not 
 public. This makes it harder to work with full logical plans from third party 
 packages.






[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-05-03 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526153#comment-14526153
 ] 

Sen Fang commented on SPARK-2336:
-

Hey Longbao, great to hear from you. To the best of my understanding, in the paper I cited above they distribute the input points by pushing them through the top tree (figure 4). Of course, for a precise result this means we would need to backtrack, which isn't ideal. What they propose is to use a buffer boundary like a spill tree's. However, unlike a spill tree, here you push an input target to both children if it falls within the buffer zone, because the top tree was built as a metric tree (they explain that using a spill tree as the top tree has a high memory penalty). So every input might end up in multiple subtrees, and you need a reduceByKey at the end to keep the top K neighbors.
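
Concretely, the final merge step I have in mind looks something like the sketch below (hypothetical types; not code from either of our implementations):

{code}
import org.apache.spark.rdd.RDD

// Hypothetical neighbour record: id of a candidate point and its distance to the query point.
case class Neighbor(id: Long, distance: Double)

// Merge candidate lists found in different subtrees for the same query point,
// keeping only the k closest.
def mergeTopK(k: Int)(a: Seq[Neighbor], b: Seq[Neighbor]): Seq[Neighbor] =
  (a ++ b).sortBy(_.distance).take(k)

// candidates: one (queryId, candidateNeighbors) entry per subtree that was searched.
def combineAcrossSubtrees(candidates: RDD[(Long, Seq[Neighbor])], k: Int): RDD[(Long, Seq[Neighbor])] =
  candidates.reduceByKey(mergeTopK(k))
{code}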

Is your implementation available somewhere? I'm having a hard time finding time to finish my implementation this month. It would be great if we could eventually compare our implementations, validate, and benchmark them.

 Approximate k-NN Models for MLLib
 -

 Key: SPARK-2336
 URL: https://issues.apache.org/jira/browse/SPARK-2336
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Brian Gawalt
Priority: Minor
  Labels: clustering, features

 After tackling the general k-Nearest Neighbor model as per 
 https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
 also offer approximate k-Nearest Neighbor. A promising approach would involve 
 building a kd-tree variant within from each partition, a la
 http://www.autonlab.org/autonweb/14714.html?branch=1language=2
 This could offer a simple non-linear ML model that can label new data with 
 much lower latency than the plain-vanilla kNN versions.






[jira] [Updated] (SPARK-6943) Graphically show RDD's included in a stage

2015-05-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6943:
-
Attachment: (was: with-stack-trace.png)

 Graphically show RDD's included in a stage
 --

 Key: SPARK-6943
 URL: https://issues.apache.org/jira/browse/SPARK-6943
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Patrick Wendell
Assignee: Andrew Or
 Attachments: DAGvisualizationintheSparkWebUI.pdf, job-page.png, 
 stage-page.png









[jira] [Updated] (SPARK-6943) Graphically show RDD's included in a stage

2015-05-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6943:
-
Attachment: (was: with-closures.png)

 Graphically show RDD's included in a stage
 --

 Key: SPARK-6943
 URL: https://issues.apache.org/jira/browse/SPARK-6943
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Patrick Wendell
Assignee: Andrew Or
 Attachments: DAGvisualizationintheSparkWebUI.pdf, with-stack-trace.png









[jira] [Assigned] (SPARK-7330) JDBC RDD could lead to NPE when the date field is null

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7330:
---

Assignee: Apache Spark

 JDBC RDD could lead to NPE when the date field is null
 --

 Key: SPARK-7330
 URL: https://issues.apache.org/jira/browse/SPARK-7330
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang
Assignee: Apache Spark

 This happens because we call DateUtils.fromDate(rs.getDate(xx)) whether the 
 value is null or not.
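 A minimal sketch of the kind of null guard needed (the helper here is hypothetical, and DateUtils.fromDate, mentioned above, is assumed to be in scope):
 {code}
import java.sql.ResultSet

// Return null (instead of throwing an NPE) when the DATE column is NULL.
def readDateColumn(rs: ResultSet, pos: Int): Any = {
  val date = rs.getDate(pos)
  if (date != null) DateUtils.fromDate(date) else null
}
 {code}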






[jira] [Commented] (SPARK-6943) Graphically show RDD's included in a stage

2015-05-03 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526181#comment-14526181
 ] 

Andrew Or commented on SPARK-6943:
--

Hi [~kayousterhout] I updated the patch to include both the job view and the 
stage view. Please have a look.

 Graphically show RDD's included in a stage
 --

 Key: SPARK-6943
 URL: https://issues.apache.org/jira/browse/SPARK-6943
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Patrick Wendell
Assignee: Andrew Or
 Attachments: DAGvisualizationintheSparkWebUI.pdf, job-page.png, 
 stage-page.png









[jira] [Assigned] (SPARK-3524) remove workaround to pickle array of float for Pyrolite

2015-05-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-3524:


Assignee: Xiangrui Meng

 remove workaround to pickle array of float for Pyrolite
 ---

 Key: SPARK-3524
 URL: https://issues.apache.org/jira/browse/SPARK-3524
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Xiangrui Meng

 After Pyrolite release a new version with PR 
 https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround 
 introduced in PR https://github.com/apache/spark/pull/2365






[jira] [Updated] (SPARK-3524) remove workaround to pickle array of float for Pyrolite

2015-05-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3524:
-
Target Version/s: 1.4.0  (was: 1.2.0)

 remove workaround to pickle array of float for Pyrolite
 ---

 Key: SPARK-3524
 URL: https://issues.apache.org/jira/browse/SPARK-3524
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Xiangrui Meng

 After Pyrolite release a new version with PR 
 https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround 
 introduced in PR https://github.com/apache/spark/pull/2365






[jira] [Commented] (SPARK-7327) DataFrame show() method doesn't like empty dataframes

2015-05-03 Thread Olivier Girardot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526231#comment-14526231
 ] 

Olivier Girardot commented on SPARK-7327:
-

ok thx

 DataFrame show() method doesn't like empty dataframes
 -

 Key: SPARK-7327
 URL: https://issues.apache.org/jira/browse/SPARK-7327
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Olivier Girardot
Priority: Minor

 For an empty DataFrame (for example, after a filter), any call to show() ends 
 up with:
 {code}
 java.util.MissingFormatWidthException: -0s
   at java.util.Formatter$FormatSpecifier.checkGeneral(Formatter.java:2906)
   at java.util.Formatter$FormatSpecifier.init(Formatter.java:2680)
   at java.util.Formatter.parse(Formatter.java:2528)
   at java.util.Formatter.format(Formatter.java:2469)
   at java.util.Formatter.format(Formatter.java:2423)
   at java.lang.String.format(String.java:2790)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:200)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:199)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:199)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:198)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:198)
   at org.apache.spark.sql.DataFrame.show(DataFrame.scala:314)
   at org.apache.spark.sql.DataFrame.show(DataFrame.scala:320)
 {code}
 If no one takes it by next Friday, I'll fix it; the problem seems to come 
 from the colWidths method:
 {code}
// Compute the width of each column
val colWidths = Array.fill(numCols)(0)
for (row <- rows) {
  for ((cell, i) <- row.zipWithIndex) {
    colWidths(i) = math.max(colWidths(i), cell.length)
  }
}
 {code}
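 One possible guard, purely as an illustration (not necessarily the actual fix): never let a column width stay at 0, since a zero width is what produces the rejected %-0s format string:
 {code}
// Compute the width of each column, starting at 1 so the width can never be 0.
val colWidths = Array.fill(numCols)(1)
for (row <- rows) {
  for ((cell, i) <- row.zipWithIndex) {
    colWidths(i) = math.max(colWidths(i), cell.length)
  }
}
 {code}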






[jira] [Created] (SPARK-7332) RpcCallContext.sender has a different name from the original sender's name

2015-05-03 Thread Qiping Li (JIRA)
Qiping Li created SPARK-7332:


 Summary: RpcCallContext.sender has a different name from the 
original sender's name
 Key: SPARK-7332
 URL: https://issues.apache.org/jira/browse/SPARK-7332
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Qiping Li
Priority: Critical
 Fix For: 1.3.2, 1.4.0


In the function {{receiveAndReply}} of {{RpcEndpoint}}, we get the sender of the received message through {{context.sender}}. But this doesn't work because we don't get the right {{RpcEndpointRef}}: its name is different from the original sender's name, so the path is different.

Here is the code to test it:

{code}
case class Greeting(who: String)

class GreetingActor(override val rpcEnv: RpcEnv) extends RpcEndpoint with Logging {
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case Greeting(who) =>
      logInfo("Hello " + who)
      logInfo(s"${context.sender.name}")
  }
}

class ToSend(override val rpcEnv: RpcEnv, greeting: RpcEndpointRef) extends RpcEndpoint with Logging {
  override def onStart(): Unit = {
    logInfo(s"${self.name}")
    greeting.ask(Greeting("Charlie Parker"))
  }
}

object RpcEndpointNameTest {
  def main(args: Array[String]): Unit = {
    val actorSystemName = "driver"
    val conf = new SparkConf
    val rpcEnv = RpcEnv.create(actorSystemName, "localhost", 0, conf, new SecurityManager(conf))
    val greeter = rpcEnv.setupEndpoint("greeter", new GreetingActor(rpcEnv))
    rpcEnv.setupEndpoint("toSend", new ToSend(rpcEnv, greeter))
  }
}
{code}

The result was:
{code}
toSend
Hello Charlie Parker
$a
{code}

I tested the same scenario using Akka with the following code:

{code}
case class Greeting(who: String)

class GreetingActor extends Actor with ActorLogging {
  def receive = {
    case Greeting(who) =>
      println("Hello " + who)
      println(s"${sender.path} ${sender.path.name}")
  }
}

class ToSend(greeting: ActorRef) extends Actor with ActorLogging {
  override def preStart(): Unit = {
    println(s"${self.path} ${self.path.name}")
    greeting ! Greeting("Charlie Parker")
  }

  def receive = {
    case _ =>
      log.info("here")
  }
}

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("MySystem")
    val greeter = system.actorOf(Props[GreetingActor], name = "greeter")
    println(s"${greeter.path} ${greeter.path.name}")
    val system2 = ActorSystem("MySystem2")
    system2.actorOf(Props(classOf[ToSend], greeter), name = "toSend_2")
  }
}
{code}

And the result was:

{code}
akka://MySystem/user/greeter greeter
akka://MySystem2/user/toSend_2 toSend_2
Hello Charlie Parker
akka://MySystem2/user/toSend_2 toSend_2
{code}






[jira] [Updated] (SPARK-6602) Replace direct use of Akka with Spark RPC interface

2015-05-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6602:
---
Priority: Critical  (was: Major)

 Replace direct use of Akka with Spark RPC interface
 ---

 Key: SPARK-6602
 URL: https://issues.apache.org/jira/browse/SPARK-6602
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu
Priority: Critical








[jira] [Updated] (SPARK-5293) Enable Spark user applications to use different versions of Akka

2015-05-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5293:
---
Target Version/s: 1.6.0

 Enable Spark user applications to use different versions of Akka
 

 Key: SPARK-5293
 URL: https://issues.apache.org/jira/browse/SPARK-5293
 Project: Spark
  Issue Type: Umbrella
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Reynold Xin
Assignee: Shixiong Zhu

 A lot of Spark user applications are using (or want to use) Akka. Akka as a 
 whole can contribute great architectural simplicity and uniformity. However, 
 because Spark depends on Akka, it is not possible for users to rely on 
 different versions, and we have received many requests in the past asking for 
 help about this specific issue. For example, Spark Streaming might be used as 
 the receiver of Akka messages - but our dependency on Akka requires the 
 upstream Akka actors to also use the identical version of Akka.
 Since our usage of Akka is limited (mainly for RPC and single-threaded event 
 loop), we can replace it with alternative RPC implementations and a common 
 event loop in Spark.






[jira] [Updated] (SPARK-6602) Replace direct use of Akka with Spark RPC interface

2015-05-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6602:
---
Target Version/s: 1.5.0  (was: 1.4.0)

 Replace direct use of Akka with Spark RPC interface
 ---

 Key: SPARK-6602
 URL: https://issues.apache.org/jira/browse/SPARK-6602
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu








[jira] [Assigned] (SPARK-6944) Mechanism to associate generic operator scope with RDD's

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6944:
---

Assignee: Andrew Or  (was: Apache Spark)

 Mechanism to associate generic operator scope with RDD's
 

 Key: SPARK-6944
 URL: https://issues.apache.org/jira/browse/SPARK-6944
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Patrick Wendell
Assignee: Andrew Or








[jira] [Created] (SPARK-7330) JDBC RDD could lead to NPE when the date field is null

2015-05-03 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-7330:
--

 Summary: JDBC RDD could lead to NPE when the date field is null
 Key: SPARK-7330
 URL: https://issues.apache.org/jira/browse/SPARK-7330
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang


This happens because we call DateUtils.fromDate(rs.getDate(xx)) whether the value is null or not.






[jira] [Commented] (SPARK-6944) Mechanism to associate generic operator scope with RDD's

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526178#comment-14526178
 ] 

Apache Spark commented on SPARK-6944:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/5729

 Mechanism to associate generic operator scope with RDD's
 

 Key: SPARK-6944
 URL: https://issues.apache.org/jira/browse/SPARK-6944
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Patrick Wendell
Assignee: Andrew Or








[jira] [Assigned] (SPARK-7113) Add the direct stream related information to the streaming listener and web UI

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7113:
---

Assignee: Apache Spark

 Add the direct stream related information to the streaming listener and web UI
 --

 Key: SPARK-7113
 URL: https://issues.apache.org/jira/browse/SPARK-7113
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Saisai Shao
Assignee: Apache Spark
 Fix For: 1.4.0









[jira] [Commented] (SPARK-7113) Add the direct stream related information to the streaming listener and web UI

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526202#comment-14526202
 ] 

Apache Spark commented on SPARK-7113:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/5879

 Add the direct stream related information to the streaming listener and web UI
 --

 Key: SPARK-7113
 URL: https://issues.apache.org/jira/browse/SPARK-7113
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Saisai Shao
 Fix For: 1.4.0









[jira] [Assigned] (SPARK-7113) Add the direct stream related information to the streaming listener and web UI

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7113:
---

Assignee: (was: Apache Spark)

 Add the direct stream related information to the streaming listener and web UI
 --

 Key: SPARK-7113
 URL: https://issues.apache.org/jira/browse/SPARK-7113
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Saisai Shao
 Fix For: 1.4.0









[jira] [Assigned] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7331:
---

Assignee: Apache Spark

 Create HiveConf per application instead of per query in HiveQl.scala
 

 Key: SPARK-7331
 URL: https://issues.apache.org/jira/browse/SPARK-7331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Nitin Goyal
Assignee: Apache Spark
Priority: Minor

 A new HiveConf is created per query in getAst method in HiveQl.scala
   def getAst(sql: String): ASTNode = {
 /*
  * Context has to be passed in hive0.13.1.
  * Otherwise, there will be Null pointer exception,
  * when retrieving properties form HiveConf.
  */
 val hContext = new Context(new HiveConf())
 val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, 
 hContext))
 hContext.clear()
 node
   }
 Creating hiveConf adds a minimum of 90ms delay per query. So moving its 
 creation in Object such that it gets initialised once.
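 A minimal sketch of the proposed change (assuming a single HiveConf can safely be shared across queries; the surrounding object is abbreviated):
 {code}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.Context
import org.apache.hadoop.hive.ql.parse.{ASTNode, ParseDriver, ParseUtils}

object HiveQl {
  // Created lazily, once per application, instead of once per getAst call.
  private lazy val hiveConf = new HiveConf()

  def getAst(sql: String): ASTNode = {
    // Context still has to be passed for hive 0.13.1 to avoid an NPE when
    // retrieving properties from HiveConf.
    val hContext = new Context(hiveConf)
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }
}
 {code}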






[jira] [Assigned] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7331:
---

Assignee: (was: Apache Spark)

 Create HiveConf per application instead of per query in HiveQl.scala
 

 Key: SPARK-7331
 URL: https://issues.apache.org/jira/browse/SPARK-7331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Nitin Goyal
Priority: Minor

 A new HiveConf is created per query in getAst method in HiveQl.scala
   def getAst(sql: String): ASTNode = {
 /*
  * Context has to be passed in hive0.13.1.
  * Otherwise, there will be Null pointer exception,
  * when retrieving properties form HiveConf.
  */
 val hContext = new Context(new HiveConf())
 val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, 
 hContext))
 hContext.clear()
 node
   }
 Creating hiveConf adds a minimum of 90ms delay per query. So moving its 
 creation in Object such that it gets initialised once.






[jira] [Commented] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526223#comment-14526223
 ] 

Apache Spark commented on SPARK-7331:
-

User 'nitin2goyal' has created a pull request for this issue:
https://github.com/apache/spark/pull/5880

 Create HiveConf per application instead of per query in HiveQl.scala
 

 Key: SPARK-7331
 URL: https://issues.apache.org/jira/browse/SPARK-7331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Nitin Goyal
Priority: Minor

 A new HiveConf is created per query in getAst method in HiveQl.scala
   def getAst(sql: String): ASTNode = {
 /*
  * Context has to be passed in hive0.13.1.
  * Otherwise, there will be Null pointer exception,
  * when retrieving properties form HiveConf.
  */
 val hContext = new Context(new HiveConf())
 val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, 
 hContext))
 hContext.clear()
 node
   }
 Creating hiveConf adds a minimum of 90ms delay per query. So moving its 
 creation in Object such that it gets initialised once.






[jira] [Updated] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala

2015-05-03 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-7331:
---
Description: 
A new HiveConf is created per query in getAst method in HiveQl.scala

  def getAst(sql: String): ASTNode = {

/*
 * Context has to be passed in hive0.13.1.
 * Otherwise, there will be Null pointer exception,
 * when retrieving properties form HiveConf.
 */

val hContext = new Context(new HiveConf())

val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, 
hContext))

hContext.clear()

node

  }

Creating hiveConf adds a minimum of 90ms delay per query. So moving its 
creation in Object such that it gets created once.

  was:
A new HiveConf is created per query in getAst method in HiveQl.scala

  def getAst(sql: String): ASTNode = {

/*
 * Context has to be passed in hive0.13.1.
 * Otherwise, there will be Null pointer exception,
 * when retrieving properties form HiveConf.
 */

val hContext = new Context(new HiveConf())

val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, 
hContext))

hContext.clear()

node

  }

Creating hiveConf adds a minimum of 90ms delay per query. So moving its 
creation in Object such that it gets initialised once.


 Create HiveConf per application instead of per query in HiveQl.scala
 

 Key: SPARK-7331
 URL: https://issues.apache.org/jira/browse/SPARK-7331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Nitin Goyal
Priority: Minor

 A new HiveConf is created per query in the getAst method in HiveQl.scala:
   def getAst(sql: String): ASTNode = {
     /*
      * Context has to be passed in Hive 0.13.1.
      * Otherwise there will be a NullPointerException
      * when retrieving properties from HiveConf.
      */
     val hContext = new Context(new HiveConf())
     val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
     hContext.clear()
     node
   }
 Creating a HiveConf adds a minimum of 90 ms of delay per query, so its creation should be moved into an object so that it is initialised only once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7275) Make LogicalRelation public

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7275:
---

Assignee: (was: Apache Spark)

 Make LogicalRelation public
 ---

 Key: SPARK-7275
 URL: https://issues.apache.org/jira/browse/SPARK-7275
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Santiago M. Mola
Priority: Minor

 It seems LogicalRelation is the only part of the LogicalPlan that is not 
 public. This makes it harder to work with full logical plans from third party 
 packages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7275) Make LogicalRelation public

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526227#comment-14526227
 ] 

Apache Spark commented on SPARK-7275:
-

User 'gweidner' has created a pull request for this issue:
https://github.com/apache/spark/pull/5881

 Make LogicalRelation public
 ---

 Key: SPARK-7275
 URL: https://issues.apache.org/jira/browse/SPARK-7275
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Santiago M. Mola
Priority: Minor

 It seems LogicalRelation is the only part of the LogicalPlan that is not 
 public. This makes it harder to work with full logical plans from third party 
 packages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7275) Make LogicalRelation public

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7275:
---

Assignee: Apache Spark

 Make LogicalRelation public
 ---

 Key: SPARK-7275
 URL: https://issues.apache.org/jira/browse/SPARK-7275
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Santiago M. Mola
Assignee: Apache Spark
Priority: Minor

 It seems LogicalRelation is the only part of the LogicalPlan that is not 
 public. This makes it harder to work with full logical plans from third party 
 packages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7322) Add DataFrame DSL for window function support

2015-05-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526166#comment-14526166
 ] 

Reynold Xin commented on SPARK-7322:


Yup.


 Add DataFrame DSL for window function support
 -

 Key: SPARK-7322
 URL: https://issues.apache.org/jira/browse/SPARK-7322
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 A good reference implementation: 
 http://www.jooq.org/doc/3.6/manual-single-page/#window-functions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7330) JDBC RDD could lead to NPE when the date field is null

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526180#comment-14526180
 ] 

Apache Spark commented on SPARK-7330:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/5877

 JDBC RDD could lead to NPE when the date field is null
 --

 Key: SPARK-7330
 URL: https://issues.apache.org/jira/browse/SPARK-7330
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang

 Because we call DateUtils.fromDate(rs.getDate(xx)) regardless of whether the value is null, a null date column leads to a NullPointerException.
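 A minimal sketch of the null guard this implies (illustrative only, not the actual patch; DateUtils is the internal helper named above and is assumed to be in scope):
 {code}
 import java.sql.ResultSet

 // rs.getDate returns null for a SQL NULL, so check before converting.
 def readDateColumn(rs: ResultSet, pos: Int): Any = {
   val date = rs.getDate(pos)
   if (date != null) DateUtils.fromDate(date) else null
 }
 {code}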



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7330) JDBC RDD could lead to NPE when the date field is null

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7330:
---

Assignee: (was: Apache Spark)

 JDBC RDD could lead to NPE when the date field is null
 --

 Key: SPARK-7330
 URL: https://issues.apache.org/jira/browse/SPARK-7330
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang

 Because we call DateUtils.fromDate(rs.getDate(xx)) regardless of whether the value is null, a null date column leads to a NullPointerException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6944) Mechanism to associate generic operator scope with RDD's

2015-05-03 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526179#comment-14526179
 ] 

Andrew Or commented on SPARK-6944:
--

https://github.com/apache/spark/pull/5729

 Mechanism to associate generic operator scope with RDD's
 

 Key: SPARK-6944
 URL: https://issues.apache.org/jira/browse/SPARK-6944
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Patrick Wendell
Assignee: Andrew Or





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6944) Mechanism to associate generic operator scope with RDD's

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6944:
---

Assignee: Apache Spark  (was: Andrew Or)

 Mechanism to associate generic operator scope with RDD's
 

 Key: SPARK-6944
 URL: https://issues.apache.org/jira/browse/SPARK-6944
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Patrick Wendell
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6944) Mechanism to associate generic operator scope with RDD's

2015-05-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6944:
-
Comment: was deleted

(was: https://github.com/apache/spark/pull/5729)

 Mechanism to associate generic operator scope with RDD's
 

 Key: SPARK-6944
 URL: https://issues.apache.org/jira/browse/SPARK-6944
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Patrick Wendell
Assignee: Andrew Or





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala

2015-05-03 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-7331:
---
Description: 
A new HiveConf is created per query in getAst method in HiveQl.scala

  def getAst(sql: String): ASTNode = {

/*
 * Context has to be passed in hive0.13.1.
 * Otherwise, there will be Null pointer exception,
 * when retrieving properties from HiveConf.
 */

val hContext = new Context(new HiveConf())

val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, 
hContext))

hContext.clear()

node

  }

Creating hiveConf adds a minimum of 90ms delay per query. So moving its 
creation in Object such that it gets initialised once.

  was:
A new HiveConf is created per query in getAst method in HiveQl.scala

  def getAst(sql: String): ASTNode = {
/*
 * Context has to be passed in hive0.13.1.
 * Otherwise, there will be Null pointer exception,
 * when retrieving properties from HiveConf.
 */
val hContext = new Context(new HiveConf())
val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, 
hContext))
hContext.clear()
node
  }

Creating hiveConf adds a minimum of 90ms delay per query. So moving its 
creation in Object such that it gets initialised once.


 Create HiveConf per application instead of per query in HiveQl.scala
 

 Key: SPARK-7331
 URL: https://issues.apache.org/jira/browse/SPARK-7331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Nitin Goyal
Priority: Minor

 A new HiveConf is created per query in the getAst method in HiveQl.scala:
   def getAst(sql: String): ASTNode = {
     /*
      * Context has to be passed in Hive 0.13.1.
      * Otherwise there will be a NullPointerException
      * when retrieving properties from HiveConf.
      */
     val hContext = new Context(new HiveConf())
     val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
     hContext.clear()
     node
   }
 Creating a HiveConf adds a minimum of 90 ms of delay per query, so its creation should be moved into an object so that it is initialised only once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala

2015-05-03 Thread Nitin Goyal (JIRA)
Nitin Goyal created SPARK-7331:
--

 Summary: Create HiveConf per application instead of per query in 
HiveQl.scala
 Key: SPARK-7331
 URL: https://issues.apache.org/jira/browse/SPARK-7331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.2.0
Reporter: Nitin Goyal
Priority: Minor


A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in Hive 0.13.1.
     * Otherwise there will be a NullPointerException
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating a HiveConf adds a minimum of 90 ms of delay per query, so its creation should be moved into an object so that it is initialised only once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3524) remove workaround to pickle array of float for Pyrolite

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526229#comment-14526229
 ] 

Apache Spark commented on SPARK-3524:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5850

 remove workaround to pickle array of float for Pyrolite
 ---

 Key: SPARK-3524
 URL: https://issues.apache.org/jira/browse/SPARK-3524
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Xiangrui Meng

 Once Pyrolite releases a new version with PR https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround introduced in PR https://github.com/apache/spark/pull/2365



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3524) remove workaround to pickle array of float for Pyrolite

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3524:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

 remove workaround to pickle array of float for Pyrolite
 ---

 Key: SPARK-3524
 URL: https://issues.apache.org/jira/browse/SPARK-3524
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Apache Spark

 Once Pyrolite releases a new version with PR https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround introduced in PR https://github.com/apache/spark/pull/2365



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3524) remove workaround to pickle array of float for Pyrolite

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3524:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

 remove workaround to pickle array of float for Pyrolite
 ---

 Key: SPARK-3524
 URL: https://issues.apache.org/jira/browse/SPARK-3524
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Xiangrui Meng

 Once Pyrolite releases a new version with PR https://github.com/irmen/Pyrolite/pull/11, we should remove the workaround introduced in PR https://github.com/apache/spark/pull/2365



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7241) Pearson correlation for DataFrames

2015-05-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7241.

   Resolution: Fixed
Fix Version/s: 1.4.0

 Pearson correlation for DataFrames
 --

 Key: SPARK-7241
 URL: https://issues.apache.org/jira/browse/SPARK-7241
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Xiangrui Meng
Assignee: Burak Yavuz
 Fix For: 1.4.0


 This JIRA is for computing the Pearson linear correlation for two numerical 
 columns in a DataFrame. The method `corr` should live under `df.stat`:
 {code}
 df.stat.corr(col1, col2, method="pearson"): Double
 {code}
 `method` will be used when we add other correlations.
 Similar to SPARK-7240, UDAF will be added later.
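 For reference, a plain-Scala sketch of what Pearson correlation computes for two numeric columns, i.e. corr(x, y) = cov(x, y) / (stddev(x) * stddev(y)); this is just the formula, not the DataFrame implementation:
 {code}
 def pearson(x: Seq[Double], y: Seq[Double]): Double = {
   require(x.length == y.length && x.nonEmpty, "columns must be non-empty and of equal length")
   val n = x.length
   val meanX = x.sum / n
   val meanY = y.sum / n
   val cov  = x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
   val varX = x.map(a => (a - meanX) * (a - meanX)).sum
   val varY = y.map(b => (b - meanY) * (b - meanY)).sum
   cov / math.sqrt(varX * varY)
 }

 // pearson(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0)) == 1.0 (perfectly correlated)
 {code}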



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6411:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

 PySpark DataFrames can't be created if any datetimes have timezones
 ---

 Key: SPARK-6411
 URL: https://issues.apache.org/jira/browse/SPARK-6411
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
Reporter: Harry Brundage
Assignee: Apache Spark

 I am unable to create a DataFrame with PySpark if any of the {{datetime}} 
 objects that pass through the conversion process have a {{tzinfo}} property 
 set. 
 This works fine:
 {code}
 In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 
 10),)]).toDF().collect()
 Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))]
 {code}
 as expected, the tuple's schema is inferred as having one anonymous column 
 with a datetime field, and the datetime roundtrips through to the Java side 
 python deserialization and then back into python land upon {{collect}}. This 
 however:
 {code}
 In [5]: from dateutil.tz import tzutc
 In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, 
 tzinfo=tzutc()),)]).toDF().collect()
 {code}
 explodes with
 {code}
 Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid 
 pickle data for datetime; expected 1 or 7 args, got 2
   at 
 net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69)
   at 
 net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32)
   at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
   at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
   at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
   at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
   at 
 org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154)
   at 
 org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:114)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:114)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:114)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211)
   

[jira] [Commented] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526244#comment-14526244
 ] 

Apache Spark commented on SPARK-6411:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5850

 PySpark DataFrames can't be created if any datetimes have timezones
 ---

 Key: SPARK-6411
 URL: https://issues.apache.org/jira/browse/SPARK-6411
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
Reporter: Harry Brundage
Assignee: Xiangrui Meng

 I am unable to create a DataFrame with PySpark if any of the {{datetime}} 
 objects that pass through the conversion process have a {{tzinfo}} property 
 set. 
 This works fine:
 {code}
 In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 
 10),)]).toDF().collect()
 Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))]
 {code}
 as expected, the tuple's schema is inferred as having one anonymous column 
 with a datetime field, and the datetime roundtrips through to the Java side 
 python deserialization and then back into python land upon {{collect}}. This 
 however:
 {code}
 In [5]: from dateutil.tz import tzutc
 In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, 
 tzinfo=tzutc()),)]).toDF().collect()
 {code}
 explodes with
 {code}
 Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid 
 pickle data for datetime; expected 1 or 7 args, got 2
   at 
 net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69)
   at 
 net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32)
   at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
   at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
   at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
   at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
   at 
 org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154)
   at 
 org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:114)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:114)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:114)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Driver stacktrace:
   at 
 

[jira] [Assigned] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6411:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

 PySpark DataFrames can't be created if any datetimes have timezones
 ---

 Key: SPARK-6411
 URL: https://issues.apache.org/jira/browse/SPARK-6411
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
Reporter: Harry Brundage
Assignee: Xiangrui Meng

 I am unable to create a DataFrame with PySpark if any of the {{datetime}} 
 objects that pass through the conversion process have a {{tzinfo}} property 
 set. 
 This works fine:
 {code}
 In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 
 10),)]).toDF().collect()
 Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))]
 {code}
 as expected, the tuple's schema is inferred as having one anonymous column 
 with a datetime field, and the datetime roundtrips through to the Java side 
 python deserialization and then back into python land upon {{collect}}. This 
 however:
 {code}
 In [5]: from dateutil.tz import tzutc
 In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, 
 tzinfo=tzutc()),)]).toDF().collect()
 {code}
 explodes with
 {code}
 Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid 
 pickle data for datetime; expected 1 or 7 args, got 2
   at 
 net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69)
   at 
 net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32)
   at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
   at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
   at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
   at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
   at 
 org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154)
   at 
 org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:114)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:114)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:114)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211)
   

[jira] [Assigned] (SPARK-6411) PySpark DataFrames can't be created if any datetimes have timezones

2015-05-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-6411:


Assignee: Xiangrui Meng  (was: Davies Liu)

 PySpark DataFrames can't be created if any datetimes have timezones
 ---

 Key: SPARK-6411
 URL: https://issues.apache.org/jira/browse/SPARK-6411
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
Reporter: Harry Brundage
Assignee: Xiangrui Meng

 I am unable to create a DataFrame with PySpark if any of the {{datetime}} 
 objects that pass through the conversion process have a {{tzinfo}} property 
 set. 
 This works fine:
 {code}
 In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 
 10),)]).toDF().collect()
 Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))]
 {code}
 as expected, the tuple's schema is inferred as having one anonymous column 
 with a datetime field, and the datetime roundtrips through to the Java side 
 python deserialization and then back into python land upon {{collect}}. This 
 however:
 {code}
 In [5]: from dateutil.tz import tzutc
 In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, 
 tzinfo=tzutc()),)]).toDF().collect()
 {code}
 explodes with
 {code}
 Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid 
 pickle data for datetime; expected 1 or 7 args, got 2
   at 
 net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69)
   at 
 net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32)
   at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
   at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
   at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
   at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
   at 
 org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154)
   at 
 org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:114)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:114)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at 
 org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:114)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211)
   

[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526261#comment-14526261
 ] 

Apache Spark commented on SPARK-6514:
-

User 'cfregly' has created a pull request for this issue:
https://github.com/apache/spark/pull/5882

 For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
 the Kinesis stream itself  
 

 Key: SPARK-6514
 URL: https://issues.apache.org/jira/browse/SPARK-6514
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Chris Fregly

 Context: I started the original Kinesis implementation with KCL 1.0 (not supported), then finished it on KCL 1.1 (supported) without realizing that it is supported.
 We should also upgrade to the latest Kinesis Client Library (KCL), which I believe is currently v1.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526262#comment-14526262
 ] 

Apache Spark commented on SPARK-5960:
-

User 'cfregly' has created a pull request for this issue:
https://github.com/apache/spark/pull/5882

 Allow AWS credentials to be passed to KinesisUtils.createStream()
 -

 Key: SPARK-5960
 URL: https://issues.apache.org/jira/browse/SPARK-5960
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly
Assignee: Chris Fregly
Priority: Blocker

 While IAM roles are preferable, we're seeing a lot of cases where we need to 
 pass AWS credentials when creating the KinesisReceiver.
 Notes:
 * Make sure we don't log the credentials anywhere
 * Maintain compatibility with existing KinesisReceiver-based code.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6514:
---

Assignee: Apache Spark

 For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
 the Kinesis stream itself  
 

 Key: SPARK-6514
 URL: https://issues.apache.org/jira/browse/SPARK-6514
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Chris Fregly
Assignee: Apache Spark

 Context: I started the original Kinesis implementation with KCL 1.0 (not supported), then finished it on KCL 1.1 (supported) without realizing that it is supported.
 We should also upgrade to the latest Kinesis Client Library (KCL), which I believe is currently v1.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6656:
---

Assignee: Apache Spark

 Allow the application name to be passed in versus pulling from 
 SparkContext.getAppName() 
 -

 Key: SPARK-6656
 URL: https://issues.apache.org/jira/browse/SPARK-6656
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly
Assignee: Apache Spark

 This is useful for the scenario where Kinesis Spark Streaming is being invoked from the Spark shell; in that case, the application name in the SparkContext is pre-set to Spark Shell.
 This isn't a common or recommended use case, but it's best to make this configurable outside of SparkContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526263#comment-14526263
 ] 

Apache Spark commented on SPARK-6656:
-

User 'cfregly' has created a pull request for this issue:
https://github.com/apache/spark/pull/5882

 Allow the application name to be passed in versus pulling from 
 SparkContext.getAppName() 
 -

 Key: SPARK-6656
 URL: https://issues.apache.org/jira/browse/SPARK-6656
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly

 This is useful for the scenario where Kinesis Spark Streaming is being invoked from the Spark shell; in that case, the application name in the SparkContext is pre-set to Spark Shell.
 This isn't a common or recommended use case, but it's best to make this configurable outside of SparkContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6514:
---

Assignee: (was: Apache Spark)

 For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
 the Kinesis stream itself  
 

 Key: SPARK-6514
 URL: https://issues.apache.org/jira/browse/SPARK-6514
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Chris Fregly

 Context: I started the original Kinesis implementation with KCL 1.0 (not supported), then finished it on KCL 1.1 (supported) without realizing that it is supported.
 We should also upgrade to the latest Kinesis Client Library (KCL), which I believe is currently v1.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6656:
---

Assignee: (was: Apache Spark)

 Allow the application name to be passed in versus pulling from 
 SparkContext.getAppName() 
 -

 Key: SPARK-6656
 URL: https://issues.apache.org/jira/browse/SPARK-6656
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly

 This is useful for the scenario where Kinesis Spark Streaming is being invoked from the Spark shell; in that case, the application name in the SparkContext is pre-set to Spark Shell.
 This isn't a common or recommended use case, but it's best to make this configurable outside of SparkContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2015-05-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6028:
---
Priority: Critical  (was: Major)
Target Version/s: 1.5.0

 Provide an alternative RPC implementation based on the network transport 
 module
 ---

 Key: SPARK-6028
 URL: https://issues.apache.org/jira/browse/SPARK-6028
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Priority: Critical

 The network transport module implements a low-level RPC interface. We can build a new RPC implementation on top of it to replace Akka's.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6280) Remove Akka systemName from Spark

2015-05-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6280:
---
Target Version/s: 1.5.0  (was: 1.4.0)

 Remove Akka systemName from Spark
 -

 Key: SPARK-6280
 URL: https://issues.apache.org/jira/browse/SPARK-6280
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Shixiong Zhu

 `systemName` is an Akka concept. An RPC implementation does not need to support it.
 We can hard-code the system name in Spark and hide it in the internal Akka RPC implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7329) Use itertools.product in ParamGridBuilder

2015-05-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7329.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5873
[https://github.com/apache/spark/pull/5873]

 Use itertools.product in ParamGridBuilder
 -

 Key: SPARK-7329
 URL: https://issues.apache.org/jira/browse/SPARK-7329
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
 Fix For: 1.4.0


 justinuang suggested the following on 
 https://github.com/apache/spark/pull/5601:
 {code}
 [dict(zip(self._param_grid.keys(), prod)) for prod in 
 itertools.product(*self._param_grid.values())]
 {code}
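 A Scala rendering of the same idea, for illustration only (the PySpark fix itself uses the itertools.product one-liner above): build the cross product of a parameter grid, producing one map per parameter combination.
 {code}
 // Hypothetical grid; the parameter names are only for illustration.
 val grid = Map("maxIter" -> Seq(10, 100), "regParam" -> Seq(0.1, 0.01))

 val combos: Seq[Map[String, Any]] =
   grid.foldLeft(Seq(Map.empty[String, Any])) { case (acc, (name, values)) =>
     for (m <- acc; v <- values) yield m + (name -> v)
   }
 // combos.size == 4, e.g. Map(maxIter -> 10, regParam -> 0.1), ...
 {code}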



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6943) Graphically show RDD's included in a stage

2015-05-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6943:
-
Attachment: job-page.png
stage-page.png

 Graphically show RDD's included in a stage
 --

 Key: SPARK-6943
 URL: https://issues.apache.org/jira/browse/SPARK-6943
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Patrick Wendell
Assignee: Andrew Or
 Attachments: DAGvisualizationintheSparkWebUI.pdf, job-page.png, 
 stage-page.png






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7327) DataFrame show() method doesn't like empty dataframes

2015-05-03 Thread Chen Song (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526191#comment-14526191
 ] 

Chen Song commented on SPARK-7327:
--

I am working on upgrading the show function. I'll run the empty DataFrame test and fix it.

 DataFrame show() method doesn't like empty dataframes
 -

 Key: SPARK-7327
 URL: https://issues.apache.org/jira/browse/SPARK-7327
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Olivier Girardot
Priority: Minor

 For an empty DataFrame (for example, after a filter), any call to show() ends up with:
 {code}
 java.util.MissingFormatWidthException: -0s
   at java.util.Formatter$FormatSpecifier.checkGeneral(Formatter.java:2906)
   at java.util.Formatter$FormatSpecifier.init(Formatter.java:2680)
   at java.util.Formatter.parse(Formatter.java:2528)
   at java.util.Formatter.format(Formatter.java:2469)
   at java.util.Formatter.format(Formatter.java:2423)
   at java.lang.String.format(String.java:2790)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:200)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:199)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:199)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:198)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:198)
   at org.apache.spark.sql.DataFrame.show(DataFrame.scala:314)
   at org.apache.spark.sql.DataFrame.show(DataFrame.scala:320)
 {code}
 If no one takes it by next Friday, I'll fix it; the problem seems to come from the colWidths method:
 {code}
 // Compute the width of each column
 val colWidths = Array.fill(numCols)(0)
 for (row <- rows) {
   for ((cell, i) <- row.zipWithIndex) {
     colWidths(i) = math.max(colWidths(i), cell.length)
   }
 }
 {code}
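 A minimal sketch of one way to avoid the zero-width format specifier (illustrative only, not necessarily the actual fix; the schema and rows below are hypothetical): seed each column width with the column name length so an empty DataFrame never yields "%-0s".
 {code}
 val fieldNames: Seq[String] = Seq("id", "name")   // hypothetical schema
 val rows: Seq[Seq[String]] = Seq.empty            // the empty-DataFrame case

 // Start from the header widths (at least 1) instead of 0.
 val colWidths = Array.tabulate(fieldNames.length)(i => math.max(fieldNames(i).length, 1))
 for (row <- rows; (cell, i) <- row.zipWithIndex) {
   colWidths(i) = math.max(colWidths(i), cell.length)
 }
 // colWidths is now Array(2, 4) even though there are no data rows, so the
 // later "%-" + width + "s" format string is always valid.
 {code}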



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7024) Improve performance of function containsStar

2015-05-03 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi closed SPARK-7024.

Resolution: Not A Problem

 Improve performance of function containsStar
 

 Key: SPARK-7024
 URL: https://issues.apache.org/jira/browse/SPARK-7024
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Yadong Qi





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6873) Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements

2015-05-03 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526218#comment-14526218
 ] 

Cheng Lian commented on SPARK-6873:
---

The UDF synonym lines are easy to filter out. The table properties lines can be fixed by sorting the results. The SerDe lines can be problematic; maybe we can ignore that case for now. If we plan to add a Java 8 Jenkins builder, I can try to find some time to work on this after the 1.4 release. Otherwise it's not that harmful for now. The time-consuming part is regenerating all the golden files, since we need to change the golden file generation and comparison strategies.
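A rough sketch of the kind of normalization described above (illustrative only, not the actual HiveComparisonTest change): drop the UDF "Synonyms:" lines and sort the remaining lines so ordering differences no longer fail the comparison.
{code}
def normalizeAnswer(answer: Seq[String]): Seq[String] =
  answer.filterNot(_.trim.startsWith("Synonyms:")).sorted
{code}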

 Some Hive-Catalyst comparison tests fail due to unimportant order of some 
 printed elements
 --

 Key: SPARK-6873
 URL: https://issues.apache.org/jira/browse/SPARK-6873
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.3.1
Reporter: Sean Owen
Assignee: Cheng Lian
Priority: Minor

 As I mentioned, I've been seeing 4 test failures in the Hive tests for a while, and they still affect master. I think it's a superficial problem that only turns up when running on Java 8, but it would probably be an easy and worthwhile fix.
 Specifically, here are the four tests and the part of the comparison that fails, below. I tried to diagnose this but had trouble even finding where some of this occurs, like the list of synonyms?
 {code}
 - show_tblproperties *** FAILED ***
   Results do not match for show_tblproperties:
 ...
   !== HIVE - 2 row(s) ==   == CATALYST - 2 row(s) ==
   !tmptruebar bar value
   !barbar value   tmp true (HiveComparisonTest.scala:391)
 {code}
 {code}
 - show_create_table_serde *** FAILED ***
   Results do not match for show_create_table_serde:
 ...
WITH SERDEPROPERTIES (  WITH 
 SERDEPROPERTIES ( 
   !  'serialization.format'='$', 
 'field.delim'=',', 
   !  'field.delim'=',')  
 'serialization.format'='$')
 {code}
 {code}
 - udf_std *** FAILED ***
   Results do not match for udf_std:
 ...
   !== HIVE - 2 row(s) == == CATALYST 
 - 2 row(s) ==
std(x) - Returns the standard deviation of a set of numbers   std(x) - 
 Returns the standard deviation of a set of numbers
   !Synonyms: stddev_pop, stddev  Synonyms: 
 stddev, stddev_pop (HiveComparisonTest.scala:391)
 {code}
 {code}
 - udf_stddev *** FAILED ***
   Results do not match for udf_stddev:
 ...
   !== HIVE - 2 row(s) ==== 
 CATALYST - 2 row(s) ==
stddev(x) - Returns the standard deviation of a set of numbers   stddev(x) 
 - Returns the standard deviation of a set of numbers
   !Synonyms: stddev_pop, stdSynonyms: 
 std, stddev_pop (HiveComparisonTest.scala:391)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time

2015-05-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526278#comment-14526278
 ] 

Sean Owen commented on SPARK-7326:
--

Makes sense, just interested in whether this was simplifiable, but it is beside 
the point.

 Performing window() on a WindowedDStream doesn't work all the time
 --

 Key: SPARK-7326
 URL: https://issues.apache.org/jira/browse/SPARK-7326
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Wesley Miao

 Someone reported similar issues before but got no response.
 http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html
 And I met the same issue recently and it can be reproduced in 1.3.1 by the 
 following piece of code:
 def main(args: Array[String]) {
   val batchInterval = 1234
   val sparkConf = new SparkConf()
     .setAppName("WindowOnWindowedDStream")
     .setMaster("local[2]")
   val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
   ssc.checkpoint("checkpoint")
   def createRDD(i: Int): RDD[(String, Int)] = {
     val count = 1000
     val rawLogs = (1 to count).map { _ =>
       val word = "word" + Random.nextInt.abs % 5
       (word, 1)
     }
     ssc.sparkContext.parallelize(rawLogs)
   }
   val rddQueue = mutable.Queue[RDD[(String, Int)]]()
   val rawLogStream = ssc.queueStream(rddQueue)
   (1 to 300) foreach { i =>
     rddQueue.enqueue(createRDD(i))
   }
   val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5, Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)
   val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15, Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)
   l1.print()
   l2.print()
   ssc.start()
   ssc.awaitTermination()
 }
 Here we have two windowed DStream instances, l1 and l2.
 l1 is the DStream that results from performing window() on the source DStream, with both window and sliding duration 5 times the batch interval of the source stream.
 l2 is the DStream that results from performing window() on l1, with both window and sliding duration 3 times l1's batch interval, which is 15 times that of the source stream.
 From the output of this simple streaming app, I can only see data printed from l1 and no data printed from l2.
 Diving into the source code, I found that the problem most likely resides in the DStream.slice() implementation, shown below.
   def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
     if (!isInitialized) {
       throw new SparkException(this + " has not been initialized")
     }
     if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
       logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
         + slideDuration + ")")
     }
     if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
       logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration ("
         + slideDuration + ")")
     }
     val alignedToTime = toTime.floor(slideDuration, zeroTime)
     val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
     logInfo("Slicing from " + fromTime + " to " + toTime +
       " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
     alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
       if (time >= zeroTime) getOrCompute(time) else None
     })
   }
 Here after performing floor() on both fromTime and toTime, the result 
 (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be 
 multiple of the slidingDuration, thus making isTimeValid() check failed for 
 all the remaining computation.
 The fix would be to add a new floor() function in Time.scala to respect the 
 zeroTime while performing the floor :
   def floor(that: Duration, zeroTime: Time): Time = {
 val t = that.milliseconds
 new Time(((this.millis - zeroTime.milliseconds) / t) * t + 
 zeroTime.milliseconds)
   } 
 And then change the DStream.slice to call this new floor function by passing 
 in its zeroTime.
 val alignedToTime = toTime.floor(slideDuration, zeroTime)
 val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
 This way the alignedToTime and alignedFromTime are *really* aligned in 
 respect to zeroTime whose value is not really a 0.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time

2015-05-03 Thread Wesley Miao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wesley Miao updated SPARK-7326:
---
Description: 
Someone reported similar issues before but got no response.
http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html

And I met the same issue recently and it can be reproduced in 1.3.1 by the 
following piece of code:

def main(args: Array[String]) {

  val batchInterval = 1234
  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")

  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {

    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)

  (1 to 300) foreach { i =>
    rddQueue.enqueue(createRDD(i))
  }

  val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5,
    Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)

  val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15,
    Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}

Here we have two windowed DStream instances, l1 and l2.

l1 is the DStream obtained by performing window() on the source DStream, with both the window and sliding duration set to 5 times the batch interval of the source stream.

l2 is the DStream obtained by performing window() on l1, with both the window and sliding duration set to 3 times l1's batch interval, i.e. 15 times that of the source stream.

From the output of this simple streaming app, I can only see data printed from l1 and nothing printed from l2.

Digging into the source code, I found that the problem most likely resides in the DStream.slice() implementation, shown below.

  def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
    if (!isInitialized) {
      throw new SparkException(this + " has not been initialized")
    }
    if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
      logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
        + slideDuration + ")")
    }
    if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
      logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration ("
        + slideDuration + ")")
    }
    val alignedToTime = toTime.floor(slideDuration, zeroTime)
    val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

    logInfo("Slicing from " + fromTime + " to " + toTime +
      " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")

    alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
      if (time >= zeroTime) getOrCompute(time) else None
    })
  }

Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, thus making the isTimeValid() check fail for all the remaining computation.

The fix would be to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:

  def floor(that: Duration, zeroTime: Time): Time = {
    val t = that.milliseconds
    new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
  }

Then change DStream.slice() to call this new floor function, passing in its zeroTime:

val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

This way alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not necessarily 0.
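
To make the alignment issue concrete, here is a small standalone sketch (editorial illustration only; the object and method names are made up and this is not Spark's Time class) comparing a floor taken from epoch 0 with the zeroTime-aware floor proposed above:

{code}
// Standalone illustration: flooring from epoch 0 vs. flooring relative to the
// stream's zeroTime. Names are hypothetical.
object FloorAlignment {

  // What a plain floor does: align to a multiple of `slide` counted from 0.
  def floorFromEpoch(millis: Long, slide: Long): Long =
    (millis / slide) * slide

  // The proposed zeroTime-aware floor: align to zeroTime + k * slide.
  def floorFromZeroTime(millis: Long, slide: Long, zero: Long): Long =
    ((millis - zero) / slide) * slide + zero

  def main(args: Array[String]): Unit = {
    val batch = 1234L              // batch interval in ms, as in the repro
    val slide = batch * 15         // l2's window/slide duration
    val zero  = 1430668800123L     // an arbitrary stream start time, not a multiple of slide
    val to    = zero + 37 * batch  // some toTime handed to slice()

    val a1 = floorFromEpoch(to, slide)
    val a2 = floorFromZeroTime(to, slide, zero)

    // (a1 - zero) is generally not a multiple of slide, so that time is rejected;
    // (a2 - zero) always is.
    println(s"epoch floor:    (a1 - zero) % slide = ${(a1 - zero) % slide}")
    println(s"zeroTime floor: (a2 - zero) % slide = ${(a2 - zero) % slide}")
  }
}
{code}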
 


  was:
Someone reported similar issues before but got no response.
http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html

And I met the same issue recently and it can be reproduced in 1.3.1 by the 
following piece of code:

def main(args: Array[String]) {

val batchInterval = 1234
val sparkConf = new SparkConf()
  .setAppName(WindowOnWindowedDStream)
  .setMaster(local[2])

val ssc =  new StreamingContext(sparkConf, 
Milliseconds(batchInterval.toInt))
ssc.checkpoint(checkpoint)

def createRDD(i: Int) : RDD[(String, Int)] = {

  val count = 1000
  val rawLogs = (1 to count).map{ _ =
val word = word + Random.nextInt.abs % 5
(word, 1)
  }
  ssc.sparkContext.parallelize(rawLogs)
}

val rddQueue = mutable.Queue[RDD[(String, Int)]]()
val rawLogStream = ssc.queueStream(rddQueue)

(1 to 300) foreach { i =
  rddQueue.enqueue(createRDD(i))
}

val l1 = 

[jira] [Updated] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time

2015-05-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7326:
-
Description: 
Someone reported similar issues before but got no response.
http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html

And I met the same issue recently and it can be reproduced in 1.3.1 by the 
following piece of code:

{code}
def main(args: Array[String]) {

  val batchInterval = 1234
  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")

  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {

    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)

  (1 to 300) foreach { i =>
    rddQueue.enqueue(createRDD(i))
  }

  val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5,
    Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)

  val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15,
    Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}
{code}

Here we have two windowed DStream instances, l1 and l2.

l1 is the DStream obtained by performing window() on the source DStream, with both the window and sliding duration set to 5 times the batch interval of the source stream.

l2 is the DStream obtained by performing window() on l1, with both the window and sliding duration set to 3 times l1's batch interval, i.e. 15 times that of the source stream.

From the output of this simple streaming app, I can only see data printed from l1 and nothing printed from l2.

Digging into the source code, I found that the problem most likely resides in the DStream.slice() implementation, shown below.

{code}
  def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
    if (!isInitialized) {
      throw new SparkException(this + " has not been initialized")
    }
    if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
      logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
        + slideDuration + ")")
    }
    if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
      logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration ("
        + slideDuration + ")")
    }
    val alignedToTime = toTime.floor(slideDuration, zeroTime)
    val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

    logInfo("Slicing from " + fromTime + " to " + toTime +
      " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")

    alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
      if (time >= zeroTime) getOrCompute(time) else None
    })
  }
{code}

Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, thus making the isTimeValid() check fail for all the remaining computation.

The fix would be to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:

{code}
  def floor(that: Duration, zeroTime: Time): Time = {
    val t = that.milliseconds
    new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
  }
{code}

Then change DStream.slice() to call this new floor function, passing in its zeroTime:

{code}
val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
{code}

This way alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not necessarily 0.
 


  was:
Someone reported similar issues before but got no response.
http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html

And I met the same issue recently and it can be reproduced in 1.3.1 by the 
following piece of code:

def main(args: Array[String]) {

val batchInterval = 1234
val sparkConf = new SparkConf()
  .setAppName(WindowOnWindowedDStream)
  .setMaster(local[2])

val ssc =  new StreamingContext(sparkConf, 
Milliseconds(batchInterval.toInt))
ssc.checkpoint(checkpoint)

def createRDD(i: Int) : RDD[(String, Int)] = {

  val count = 1000
  val rawLogs = (1 to count).map{ _ =
val word = word + Random.nextInt.abs % 5
(word, 1)
  }
  ssc.sparkContext.parallelize(rawLogs)
}

val rddQueue = mutable.Queue[RDD[(String, Int)]]()
val rawLogStream = ssc.queueStream(rddQueue)

(1 to 300) foreach { i =
  

[jira] [Updated] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time

2015-05-03 Thread Wesley Miao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wesley Miao updated SPARK-7326:
---
Description: 
Someone reported similar issues before but got no response.
http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html

And I met the same issue recently and it can be reproduced in 1.3.1 by the 
following piece of code:

def main(args: Array[String]) {

  val batchInterval = 1234
  val sparkConf = new SparkConf()
    .setAppName("WindowOnWindowedDStream")
    .setMaster("local[2]")

  val ssc = new StreamingContext(sparkConf, Milliseconds(batchInterval.toInt))
  ssc.checkpoint("checkpoint")

  def createRDD(i: Int): RDD[(String, Int)] = {

    val count = 1000
    val rawLogs = (1 to count).map { _ =>
      val word = "word" + Random.nextInt.abs % 5
      (word, 1)
    }
    ssc.sparkContext.parallelize(rawLogs)
  }

  val rddQueue = mutable.Queue[RDD[(String, Int)]]()
  val rawLogStream = ssc.queueStream(rddQueue)

  (1 to 300) foreach { i =>
    rddQueue.enqueue(createRDD(i))
  }

  val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5,
    Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)

  val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15,
    Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)

  l1.print()
  l2.print()

  ssc.start()
  ssc.awaitTermination()
}

Here we have two windowed DStream instances, l1 and l2.

l1 is the DStream obtained by performing window() on the source DStream, with both the window and sliding duration set to 5 times the batch interval of the source stream.

l2 is the DStream obtained by performing window() on l1, with both the window and sliding duration set to 3 times l1's batch interval, i.e. 15 times that of the source stream.

From the output of this simple streaming app, I can only see data printed from l1 and nothing printed from l2.

Digging into the source code, I found that the problem most likely resides in the DStream.slice() implementation, shown below.

  def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
    if (!isInitialized) {
      throw new SparkException(this + " has not been initialized")
    }
    if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
      logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
        + slideDuration + ")")
    }
    if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
      logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration ("
        + slideDuration + ")")
    }
    val alignedToTime = toTime.floor(slideDuration, zeroTime)
    val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

    logInfo("Slicing from " + fromTime + " to " + toTime +
      " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")

    alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
      if (time >= zeroTime) getOrCompute(time) else None
    })
  }

Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, thus making the isTimeValid() check fail for all the remaining computation.

The fix would be to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:

  def floor(that: Duration, zeroTime: Time): Time = {
    val t = that.milliseconds
    new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
  }

Then change DStream.slice() to call this new floor function, passing in its zeroTime:

val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)

This way alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not necessarily 0.
 


  was:
Someone reported similar issues before but got no response.
http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html

And I met the same issue recently and it can be reproduced in 1.3.1 by the 
following piece of code:

{code}
def main(args: Array[String]) {

val batchInterval = 1234
val sparkConf = new SparkConf()
  .setAppName(WindowOnWindowedDStream)
  .setMaster(local[2])

val ssc =  new StreamingContext(sparkConf, 
Milliseconds(batchInterval.toInt))
ssc.checkpoint(checkpoint)

def createRDD(i: Int) : RDD[(String, Int)] = {

  val count = 1000
  val rawLogs = (1 to count).map{ _ =
val word = word + Random.nextInt.abs % 5
(word, 1)
  }
  ssc.sparkContext.parallelize(rawLogs)
}

val rddQueue = mutable.Queue[RDD[(String, Int)]]()
val rawLogStream = ssc.queueStream(rddQueue)

(1 to 300) foreach { i =
  rddQueue.enqueue(createRDD(i))
}

val l1 = 

[jira] [Updated] (SPARK-7249) Updated Hadoop dependencies due to inconsistency in the versions

2015-05-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7249:
-
 Priority: Blocker  (was: Minor)
 Target Version/s: 1.4.0
Affects Version/s: 1.3.1
 Shepherd:   (was: Antonio Parcero)
   Labels:   (was: Build HADOOP dependencies upgrade)

I'm marking this a blocker simply because I think it's not only useful to get this in before the next minor release, but actually kind of important, as the Hadoop 2.2 default we have in the build today is incomplete.

I'm not against pushing this back or toning it down, but I want it on the radar for a look before releasing 1.4.0.

  Updated Hadoop dependencies due to inconsistency in the versions
 -

 Key: SPARK-7249
 URL: https://issues.apache.org/jira/browse/SPARK-7249
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 1.3.1
 Environment: Ubuntu 14.04. Apache Mesos in cluster mode with HDFS 
 from cloudera 2.5.0-cdh5.3.3.
Reporter: Favio Vázquez
Priority: Blocker

 Updated Hadoop dependencies due to inconsistency in the versions. Now the 
 global properties are the ones used by the hadoop-2.2 profile, and the 
 profile was set to empty but kept for backwards compatibility reasons.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525808#comment-14525808
 ] 

Apache Spark commented on SPARK-7326:
-

User 'wesleymiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/5871

 Performing window() on a WindowedDStream doesn't work all the time
 --

 Key: SPARK-7326
 URL: https://issues.apache.org/jira/browse/SPARK-7326
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Wesley Miao

 Someone reported similar issues before but got no response.
 http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html
 And I met the same issue recently and it can be reproduced in 1.3.1 by the 
 following piece of code:
 def main(args: Array[String]) {
 val batchInterval = 1234
 val sparkConf = new SparkConf()
   .setAppName(WindowOnWindowedDStream)
   .setMaster(local[2])
 val ssc =  new StreamingContext(sparkConf, 
 Milliseconds(batchInterval.toInt))
 ssc.checkpoint(checkpoint)
 def createRDD(i: Int) : RDD[(String, Int)] = {
   val count = 1000
   val rawLogs = (1 to count).map{ _ =
 val word = word + Random.nextInt.abs % 5
 (word, 1)
   }
   ssc.sparkContext.parallelize(rawLogs)
 }
 val rddQueue = mutable.Queue[RDD[(String, Int)]]()
 val rawLogStream = ssc.queueStream(rddQueue)
 (1 to 300) foreach { i =
   rddQueue.enqueue(createRDD(i))
 }
 val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5, 
 Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)
 val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15, 
 Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)
 l1.print()
 l2.print()
 ssc.start()
 ssc.awaitTermination()
 }
 Here we have two windowed DStream instance l1 and l2. 
 l1 is the result DStream by performing a window() on the source DStream with 
 both window and sliding duration 5 times the batch internal of the source 
 stream.
 l2 is the result DStream by performing a window() on l1, with both window and 
 sliding duration 3 times l1's batch interval, which is 15 times of the source 
 stream.
 From the output of this simple streaming app, I can only see print data 
 output from l1 and no data printed from l2.
 Diving into the source code, I found the problem may most likely reside in 
 DStream.slice() implementation, as shown below.
   def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
 if (!isInitialized) {
   throw new SparkException(this +  has not been initialized)
 }
 if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
   logWarning(fromTime ( + fromTime + ) is not a multiple of 
 slideDuration (
 + slideDuration + ))
 }
 if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
   logWarning(toTime ( + fromTime + ) is not a multiple of 
 slideDuration (
 + slideDuration + ))
 }
 val alignedToTime = toTime.floor(slideDuration, zeroTime)
 val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
 logInfo(Slicing from  + fromTime +  to  + toTime +
(aligned to  + alignedFromTime +  and  + alignedToTime + ))
 alignedFromTime.to(alignedToTime, slideDuration).flatMap(time = {
   if (time = zeroTime) getOrCompute(time) else None
 })
   }
 Here after performing floor() on both fromTime and toTime, the result 
 (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be 
 multiple of the slidingDuration, thus making isTimeValid() check failed for 
 all the remaining computation.
 The fix would be to add a new floor() function in Time.scala to respect the 
 zeroTime while performing the floor :
   def floor(that: Duration, zeroTime: Time): Time = {
 val t = that.milliseconds
 new Time(((this.millis - zeroTime.milliseconds) / t) * t + 
 zeroTime.milliseconds)
   } 
 And then change the DStream.slice to call this new floor function by passing 
 in its zeroTime.
 val alignedToTime = toTime.floor(slideDuration, zeroTime)
 val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
 This way the alignedToTime and alignedFromTime are *really* aligned in 
 respect to zeroTime whose value is not really a 0.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1437) Jenkins should build with Java 6

2015-05-03 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525761#comment-14525761
 ] 

Steve Loughran commented on SPARK-1437:
---

It would be good for the pull request test runs to use Java 6 too; I've already had to purge some Java 7-isms that didn't get picked up.

 Jenkins should build with Java 6
 

 Key: SPARK-1437
 URL: https://issues.apache.org/jira/browse/SPARK-1437
 Project: Spark
  Issue Type: Bug
  Components: Build, Project Infra
Affects Versions: 0.9.0
Reporter: Sean Owen
Priority: Minor
  Labels: javac, jenkins
 Attachments: Screen Shot 2014-04-07 at 22.53.56.png


 Apologies if this was already on someone's to-do list, but I wanted to track 
 this, as it bit two commits in the last few weeks.
 Spark is intended to work with Java 6, and so compiles with source/target 
 1.6. Java 7 can correctly enforce Java 6 language rules and emit Java 6 
 bytecode. However, unless otherwise configured with -bootclasspath, javac 
 will use its own (Java 7) library classes. This means code that uses classes 
 in Java 7 will be allowed to compile, but the result will fail when run on 
 Java 6.
 This is why you get warnings like ...
 Using /usr/java/jdk1.7.0_51 as default JAVA_HOME.
 ...
 [warn] warning: [options] bootstrap class path not set in conjunction with 
 -source 1.6
 The solution is just to tell Jenkins to use Java 6. This may be stating the 
 obvious, but it should just be a setting under Configure for 
 SparkPullRequestBuilder. In our Jenkinses, JDK 6/7/8 are set up; if it's not 
 an option already I'm guessing it's not too hard to get Java 6 configured on 
 the Amplab machines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7326) Performing window() on a WindowedDStream doesn't work all the time

2015-05-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525815#comment-14525815
 ] 

Sean Owen commented on SPARK-7326:
--

Out of curiosity, why do you have a window + slide of 5x the batch duration? That is simply the same as making an underlying stream with 5x the batch duration. Maybe this is just a toy example; I know it's not directly relevant to what you're reporting.

 Performing window() on a WindowedDStream doesn't work all the time
 --

 Key: SPARK-7326
 URL: https://issues.apache.org/jira/browse/SPARK-7326
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Wesley Miao

 Someone reported similar issues before but got no response.
 http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-td466.html
 And I met the same issue recently and it can be reproduced in 1.3.1 by the 
 following piece of code:
 def main(args: Array[String]) {
 val batchInterval = 1234
 val sparkConf = new SparkConf()
   .setAppName(WindowOnWindowedDStream)
   .setMaster(local[2])
 val ssc =  new StreamingContext(sparkConf, 
 Milliseconds(batchInterval.toInt))
 ssc.checkpoint(checkpoint)
 def createRDD(i: Int) : RDD[(String, Int)] = {
   val count = 1000
   val rawLogs = (1 to count).map{ _ =
 val word = word + Random.nextInt.abs % 5
 (word, 1)
   }
   ssc.sparkContext.parallelize(rawLogs)
 }
 val rddQueue = mutable.Queue[RDD[(String, Int)]]()
 val rawLogStream = ssc.queueStream(rddQueue)
 (1 to 300) foreach { i =
   rddQueue.enqueue(createRDD(i))
 }
 val l1 = rawLogStream.window(Milliseconds(batchInterval.toInt) * 5, 
 Milliseconds(batchInterval.toInt) * 5).reduceByKey(_ + _)
 val l2 = l1.window(Milliseconds(batchInterval.toInt) * 15, 
 Milliseconds(batchInterval.toInt) * 15).reduceByKey(_ + _)
 l1.print()
 l2.print()
 ssc.start()
 ssc.awaitTermination()
 }
 Here we have two windowed DStream instance l1 and l2. 
 l1 is the result DStream by performing a window() on the source DStream with 
 both window and sliding duration 5 times the batch internal of the source 
 stream.
 l2 is the result DStream by performing a window() on l1, with both window and 
 sliding duration 3 times l1's batch interval, which is 15 times of the source 
 stream.
 From the output of this simple streaming app, I can only see print data 
 output from l1 and no data printed from l2.
 Diving into the source code, I found the problem may most likely reside in 
 DStream.slice() implementation, as shown below.
   def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
 if (!isInitialized) {
   throw new SparkException(this +  has not been initialized)
 }
 if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
   logWarning(fromTime ( + fromTime + ) is not a multiple of 
 slideDuration (
 + slideDuration + ))
 }
 if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
   logWarning(toTime ( + fromTime + ) is not a multiple of 
 slideDuration (
 + slideDuration + ))
 }
 val alignedToTime = toTime.floor(slideDuration, zeroTime)
 val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
 logInfo(Slicing from  + fromTime +  to  + toTime +
(aligned to  + alignedFromTime +  and  + alignedToTime + ))
 alignedFromTime.to(alignedToTime, slideDuration).flatMap(time = {
   if (time = zeroTime) getOrCompute(time) else None
 })
   }
 Here after performing floor() on both fromTime and toTime, the result 
 (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be 
 multiple of the slidingDuration, thus making isTimeValid() check failed for 
 all the remaining computation.
 The fix would be to add a new floor() function in Time.scala to respect the 
 zeroTime while performing the floor :
   def floor(that: Duration, zeroTime: Time): Time = {
 val t = that.milliseconds
 new Time(((this.millis - zeroTime.milliseconds) / t) * t + 
 zeroTime.milliseconds)
   } 
 And then change the DStream.slice to call this new floor function by passing 
 in its zeroTime.
 val alignedToTime = toTime.floor(slideDuration, zeroTime)
 val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
 This way the alignedToTime and alignedFromTime are *really* aligned in 
 respect to zeroTime whose value is not really a 0.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-03 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525903#comment-14525903
 ] 

Josh Rosen commented on SPARK-4105:
---

While working on some new shuffle code, I managed to trigger a 
{{FAILED_TO_UNCOMPRESS(5)}} error by trying to decompress data which was not 
compressed or which was compressed with the wrong compression codec. This is 
kind of a long shot, but I wonder if there's a rarely-hit branch in the 
sort-shuffle write path that doesn't properly wrap an output stream for 
compression (or that does so with the wrong compression codec).
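
For reference, this failure mode is easy to reproduce outside Spark: handing the Snappy decompressor bytes that were never Snappy-compressed raises the same error. A minimal editorial sketch, assuming only snappy-java on the classpath (an illustration, not the actual shuffle code path):

{code}
import org.xerial.snappy.Snappy

object FailedToUncompressRepro {
  def main(args: Array[String]): Unit = {
    // Valid API call, invalid input: these bytes are not Snappy-compressed.
    val notCompressed = "plain, uncompressed bytes".getBytes("UTF-8")
    try {
      Snappy.uncompress(notCompressed)
    } catch {
      // Typically surfaces as java.io.IOException: FAILED_TO_UNCOMPRESS(5)
      case e: java.io.IOException => println(e.getMessage)
    }
  }
}
{code}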

 FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
 shuffle
 -

 Key: SPARK-4105
 URL: https://issues.apache.org/jira/browse/SPARK-4105
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker

 We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
 shuffle read.  Here's a sample stacktrace from an executor:
 {code}
 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
 33053)
 java.io.IOException: FAILED_TO_UNCOMPRESS(5)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
   at 
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
   at 
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
   at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
   at 
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
   at 
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
   at 
 

[jira] [Updated] (SPARK-4106) Shuffle write and spill to disk metrics are incorrect

2015-05-03 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4106:
--
Component/s: Shuffle

 Shuffle write and spill to disk metrics are incorrect
 -

 Key: SPARK-4106
 URL: https://issues.apache.org/jira/browse/SPARK-4106
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Priority: Critical

 I have encountered a job which has some disk spilled (memory), but the disk 
 spilled (disk) is 0, as is the shuffle write. If I switch to hash-based 
 shuffle, where there happens to be no disk spilling, then the shuffle write 
 is correct.
 I can get more info on a workload to repro this situation, but perhaps that 
 state of events is sufficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7329) Use itertools.product in ParamGridBuilder

2015-05-03 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7329:


 Summary: Use itertools.product in ParamGridBuilder
 Key: SPARK-7329
 URL: https://issues.apache.org/jira/browse/SPARK-7329
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor


justinuang suggested the following on https://github.com/apache/spark/pull/5601:

{code}
[dict(zip(self._param_grid.keys(), prod)) for prod in 
itertools.product(*self._param_grid.values())]
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7022) PySpark is missing ParamGridBuilder

2015-05-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7022.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5601
[https://github.com/apache/spark/pull/5601]

 PySpark is missing ParamGridBuilder
 ---

 Key: SPARK-7022
 URL: https://issues.apache.org/jira/browse/SPARK-7022
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Omede Firouz
Assignee: Omede Firouz
Priority: Critical
 Fix For: 1.4.0


 PySpark is missing the entirety of ML.Tuning (see: 
 https://issues.apache.org/jira/browse/SPARK-6940)
 This is a subticket specifically to track the ParamGridBuilder. The 
 CrossValidator will be dealt with in a followup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7329) Use itertools.product in ParamGridBuilder

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525962#comment-14525962
 ] 

Apache Spark commented on SPARK-7329:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5873

 Use itertools.product in ParamGridBuilder
 -

 Key: SPARK-7329
 URL: https://issues.apache.org/jira/browse/SPARK-7329
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor

 justinuang suggested the following on 
 https://github.com/apache/spark/pull/5601:
 {code}
 [dict(zip(self._param_grid.keys(), prod)) for prod in 
 itertools.product(*self._param_grid.values())]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7329) Use itertools.product in ParamGridBuilder

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7329:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

 Use itertools.product in ParamGridBuilder
 -

 Key: SPARK-7329
 URL: https://issues.apache.org/jira/browse/SPARK-7329
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor

 justinuang suggested the following on 
 https://github.com/apache/spark/pull/5601:
 {code}
 [dict(zip(self._param_grid.keys(), prod)) for prod in 
 itertools.product(*self._param_grid.values())]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4112) Have a reserved copy of Sorter/SortDataFormat

2015-05-03 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4112:
--
Component/s: Shuffle

 Have a reserved copy of Sorter/SortDataFormat
 -

 Key: SPARK-4112
 URL: https://issues.apache.org/jira/browse/SPARK-4112
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Xiangrui Meng

 After SPARK-4084, developers can use Sorter with their own SortDataFormat. 
 However, if there are multiple subclasses of SortDataFormat instantiated, JIT 
 won't inline the methods in SortDataFormat and virtual method table lookup is 
 slow. One solution could be making two copies of the code and reserve one for 
 shuffle only, and expose the other to developers.
 Before we do that, we should compare the performance with/without JIT and 
 check whether it is worth the extra code complexity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5392) Shuffle spill size is shown as negative

2015-05-03 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5392:
--
Component/s: Shuffle

 Shuffle spill size is shown as negative
 ---

 Key: SPARK-5392
 URL: https://issues.apache.org/jira/browse/SPARK-5392
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Web UI
Affects Versions: 1.2.0
Reporter: Sven Krasser
Priority: Minor
 Attachments: Screen Shot 2015-01-23 at 5.13.55 PM.png


 The Shuffle Spill (Memory) metric on the Stage Detail Web UI shows as 
 negative for some executors (e.g. -2097152.0 B), see attached screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7328) Add missing items to pyspark.mllib.linalg.Vectors

2015-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525921#comment-14525921
 ] 

Apache Spark commented on SPARK-7328:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/5872

 Add missing items to pyspark.mllib.linalg.Vectors
 -

 Key: SPARK-7328
 URL: https://issues.apache.org/jira/browse/SPARK-7328
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Manoj Kumar

 Add
 1. Class methods squared_dist, dot
 2. toString
 3. parse
 4. norm
 5. numNonzeros
 6. copy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7328) Add missing items to pyspark.mllib.linalg.Vectors

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7328:
---

Assignee: Apache Spark

 Add missing items to pyspark.mllib.linalg.Vectors
 -

 Key: SPARK-7328
 URL: https://issues.apache.org/jira/browse/SPARK-7328
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Assignee: Apache Spark

 Add
 1. Class methods squared_dist, dot
 2. toString
 3. parse
 4. norm
 5. numNonzeros
 6. copy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7329) Use itertools.product in ParamGridBuilder

2015-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7329:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

 Use itertools.product in ParamGridBuilder
 -

 Key: SPARK-7329
 URL: https://issues.apache.org/jira/browse/SPARK-7329
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark
Priority: Minor

 justinuang suggested the following on 
 https://github.com/apache/spark/pull/5601:
 {code}
 [dict(zip(self._param_grid.keys(), prod)) for prod in 
 itertools.product(*self._param_grid.values())]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6980) Akka timeout exceptions indicate which conf controls them

2015-05-03 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525927#comment-14525927
 ] 

Bryan Cutler commented on SPARK-6980:
-

I added another commit to the PR and some basic unit tests.

 Akka timeout exceptions indicate which conf controls them
 -

 Key: SPARK-6980
 URL: https://issues.apache.org/jira/browse/SPARK-6980
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Imran Rashid
Assignee: Harsh Gupta
Priority: Minor
  Labels: starter
 Attachments: Spark-6980-Test.scala


 If you hit one of the akka timeouts, you just get an exception like
 {code}
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 {code}
 The exception doesn't indicate how to change the timeout, though there is 
 usually (always?) a corresponding setting in {{SparkConf}}. It would be 
 nice if the exception included the relevant setting.
 I think this should be pretty easy to do -- we just need to create something 
 like a {{NamedTimeout}}. It would have its own {{await}} method that catches 
 the akka timeout and throws its own exception.
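
A minimal editorial sketch of what such a wrapper could look like (hypothetical names, not the code in the PR), assuming the caller knows which SparkConf key governs the timeout:

{code}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Awaitable}
import scala.concurrent.duration.FiniteDuration

// Hypothetical sketch: a timeout that remembers the conf key controlling it,
// so the rethrown exception tells the user what to tune.
case class NamedTimeout(duration: FiniteDuration, confKey: String) {
  def awaitResult[T](awaitable: Awaitable[T]): T =
    try {
      Await.result(awaitable, duration)
    } catch {
      case _: TimeoutException =>
        throw new TimeoutException(
          s"Futures timed out after [$duration]. " +
          s"This timeout is controlled by $confKey")
    }
}
{code}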



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5989) Model import/export for LDAModel

2015-05-03 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525931#comment-14525931
 ] 

Manoj Kumar commented on SPARK-5989:


Oh, just saw this, but there are a few other PRs of mine in line for review. I'll get back to this after they are done. :)

 Model import/export for LDAModel
 

 Key: SPARK-5989
 URL: https://issues.apache.org/jira/browse/SPARK-5989
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 Add save/load for LDAModel and its local and distributed variants.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org