[jira] [Commented] (SPARK-9460) Avoid byte array allocation in StringPrefixComparator

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647305#comment-14647305
 ] 

Apache Spark commented on SPARK-9460:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7789

 Avoid byte array allocation in StringPrefixComparator
 -

 Key: SPARK-9460
 URL: https://issues.apache.org/jira/browse/SPARK-9460
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0


 StringPrefixComparator converts the long values back to byte arrays in order 
 to compare them. We should be able to optimize this to compare the longs 
 directly, rather than turning the longs into byte arrays and comparing them 
 byte by byte. 
 {code}
 public int compare(long aPrefix, long bPrefix) {
   // TODO: can be done more efficiently
   byte[] a = Longs.toByteArray(aPrefix);
   byte[] b = Longs.toByteArray(bPrefix);
   for (int i = 0; i < 8; i++) {
     int c = UnsignedBytes.compare(a[i], b[i]);
     if (c != 0) return c;
   }
   return 0;
 }
 {code}
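 For illustration, here is a minimal Scala sketch of the direct comparison (assuming the 
 prefixes should be ordered as unsigned values, which is what the byte-by-byte 
 UnsignedBytes loop above effectively does):
 {code}
 // Sketch only: compare two longs as unsigned values without allocating byte
 // arrays, by flipping the sign bit so that signed ordering matches unsigned
 // ordering.
 object DirectPrefixCompare {
   def compareUnsigned(aPrefix: Long, bPrefix: Long): Int = {
     val a = aPrefix ^ Long.MinValue  // flip sign bit
     val b = bPrefix ^ Long.MinValue
     if (a < b) -1 else if (a > b) 1 else 0
   }
 }
 {code}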



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9335) Kinesis test hits rate limit

2015-07-30 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9335.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Kinesis test hits rate limit
 

 Key: SPARK-9335
 URL: https://issues.apache.org/jira/browse/SPARK-9335
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Tests
Reporter: Patrick Wendell
Assignee: Tathagata Das
Priority: Critical
 Fix For: 1.5.0


 This test is failing many pull request builds because of rate limits:
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38396/testReport/org.apache.spark.streaming.kinesis/KinesisBackedBlockRDDSuite/_It_is_not_a_test_/
 I disabled the test. I wonder if it's better not to have this test run by 
 default, since it's a bit brittle to depend on an external system like this 
 (if Kinesis goes down, for instance, it will block all development).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-07-30 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647313#comment-14647313
 ] 

Jeff Zhang commented on SPARK-8167:
---

[~mcheah] What's the status of this ticket? I don't think a blocking RPC call 
is a good idea. I think we could just send an executor-preempted message to the 
driver when the container is preempted, and let the driver decrease 
numTaskAttemptFails. Although we lose some consistency here, at least we could 
avoid job failures due to preemption. And I think there's some gap between two 
consecutive failed task attempts; very likely the driver will have received 
the executor-preempted message within that gap. Thoughts?

 Tasks that fail due to YARN preemption can cause job failure
 

 Key: SPARK-8167
 URL: https://issues.apache.org/jira/browse/SPARK-8167
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 1.3.1
Reporter: Patrick Woody
Assignee: Matt Cheah
Priority: Blocker

 Tasks that are running on preempted executors will count as FAILED with an 
 ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if 
 a large resource shift is occurring, and the tasks get scheduled to executors 
 that immediately get preempted as well.
 The current workaround is to set spark.task.maxFailures very high, but that 
 can delay the detection of true failures. We should ideally differentiate these 
 task statuses so that preemptions don't count towards the failure limit.
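 For reference, a minimal sketch of the workaround mentioned above; the value 
 used here is purely illustrative, not a recommendation:
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 // Workaround sketch: raise the task failure limit so preempted tasks are less
 // likely to exhaust it before the job succeeds.
 val conf = new SparkConf()
   .setAppName("preemption-workaround")
   .set("spark.task.maxFailures", "100")  // illustrative value only
 val sc = new SparkContext(conf)
 {code}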



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9470) Java API function interface cleanup

2015-07-30 Thread Rahul Kavale (JIRA)
Rahul Kavale created SPARK-9470:
---

 Summary: Java API function interface cleanup
 Key: SPARK-9470
 URL: https://issues.apache.org/jira/browse/SPARK-9470
 Project: Spark
  Issue Type: Improvement
Reporter: Rahul Kavale
Priority: Trivial


Hi guys,
I was exploring the Spark codebase and came across the Java API function 
interfaces. The interfaces declare the 'call' method as 'public', which is 
redundant because interface methods are implicitly public.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6873) Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements

2015-07-30 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647322#comment-14647322
 ] 

Pete Robbins commented on SPARK-6873:
-

We've been trying to get a clean build/test using Java 8 and we still see these 
errors, so I think this is still a problem.

It looks like the Catalyst output changes from Java 7 to Java 8. Is the 
ordering supposed to be defined for this or is the ordering really unimportant?

 Some Hive-Catalyst comparison tests fail due to unimportant order of some 
 printed elements
 --

 Key: SPARK-6873
 URL: https://issues.apache.org/jira/browse/SPARK-6873
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.3.1
Reporter: Sean Owen
Assignee: Cheng Lian
Priority: Minor

 As I mentioned, I've been seeing 4 test failures in the Hive tests for a while, 
 and it still affects master. I think it's a superficial problem that only turns 
 up when running on Java 8, but it would probably be an easy fix and is worth 
 fixing.
 Specifically, here are four tests and the bit that fails the comparison, 
 below. I tried to diagnose this but had trouble even finding where some of 
 this occurs, like the list of synonyms?
 {code}
 - show_tblproperties *** FAILED ***
   Results do not match for show_tblproperties:
 ...
   !== HIVE - 2 row(s) ==      == CATALYST - 2 row(s) ==
   !tmp  true                  bar  bar value
   !bar  bar value             tmp  true (HiveComparisonTest.scala:391)
 {code}
 {code}
 - show_create_table_serde *** FAILED ***
   Results do not match for show_create_table_serde:
 ...
    WITH SERDEPROPERTIES (             WITH SERDEPROPERTIES (
   !  'serialization.format'='$',        'field.delim'=',',
   !  'field.delim'=',')                 'serialization.format'='$')
 {code}
 {code}
 - udf_std *** FAILED ***
   Results do not match for udf_std:
 ...
   !== HIVE - 2 row(s) ==                                        == CATALYST - 2 row(s) ==
    std(x) - Returns the standard deviation of a set of numbers   std(x) - Returns the standard deviation of a set of numbers
   !Synonyms: stddev_pop, stddev                                  Synonyms: stddev, stddev_pop (HiveComparisonTest.scala:391)
 {code}
 {code}
 - udf_stddev *** FAILED ***
   Results do not match for udf_stddev:
 ...
   !== HIVE - 2 row(s) ==                                           == CATALYST - 2 row(s) ==
    stddev(x) - Returns the standard deviation of a set of numbers   stddev(x) - Returns the standard deviation of a set of numbers
   !Synonyms: stddev_pop, std                                        Synonyms: std, stddev_pop (HiveComparisonTest.scala:391)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9408) Refactor mllib/linalg.py to mllib/linalg

2015-07-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647862#comment-14647862
 ] 

Xiangrui Meng commented on SPARK-9408:
--

If we want to have distributed matrix API in Python, this is required.

 Refactor mllib/linalg.py to mllib/linalg
 

 Key: SPARK-9408
 URL: https://issues.apache.org/jira/browse/SPARK-9408
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Assignee: Manoj Kumar

 We need to refactor mllib/linalg.py to mllib/linalg so that the project 
 structure is similar to that of Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7583) User guide update for RegexTokenizer

2015-07-30 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647904#comment-14647904
 ] 

yuhao yang commented on SPARK-7583:
---

I'd like to take a crack at this if it's still needed.

 User guide update for RegexTokenizer
 

 Key: SPARK-7583
 URL: https://issues.apache.org/jira/browse/SPARK-7583
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.
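 For context, a minimal spark.ml usage sketch of the transformer this guide 
 section would document; the column names and the input DataFrame 
 (sentenceDataFrame) are placeholders:
 {code}
 import org.apache.spark.ml.feature.RegexTokenizer

 // Split the "text" column on non-word characters; the tokens go to "words".
 val tokenizer = new RegexTokenizer()
   .setInputCol("text")
   .setOutputCol("words")
   .setPattern("\\W")  // by default the pattern is treated as the delimiter

 val tokenized = tokenizer.transform(sentenceDataFrame)
 {code}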



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9277.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7794
[https://github.com/apache/spark/pull/7794]

 SparseVector constructor must throw an error when declared number of elements 
 less than array length
 

 Key: SPARK-9277
 URL: https://issues.apache.org/jira/browse/SPARK-9277
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Andrey Vykhodtsev
Priority: Minor
  Labels: starter
 Fix For: 1.5.0

 Attachments: SparseVector test.html, SparseVector test.ipynb


 I found that one can create a SparseVector inconsistently, and it leads to 
 a Java error at runtime, for example when training LogisticRegressionWithSGD.
 Here is the test case:
 In [2]:
 sc.version
 Out[2]:
 u'1.3.1'
 In [13]:
 from pyspark.mllib.linalg import SparseVector
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import LogisticRegressionWithSGD
 In [3]:
 x =  SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
 In [10]:
 l = LabeledPoint(0, x)
 In [12]:
 r = sc.parallelize([l])
 In [14]:
 m = LogisticRegressionWithSGD.train(r)
 Error:
 Py4JJavaError: An error occurred while calling 
 o86.trainLogisticRegressionModelWithSGD.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 
 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
 Attached is the notebook with the scenario and the full message
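 A sketch of the kind of validation the constructor could perform; the helper 
 name and message below are hypothetical, not the actual fix:
 {code}
 // Hypothetical check: every index must fit within the declared size, so the
 // error surfaces at construction time rather than deep inside training.
 def validateSparse(size: Int, indices: Array[Int]): Unit = {
   require(indices.forall(i => i >= 0 && i < size),
     s"Sparse vector indices must be in [0, $size), got ${indices.mkString(", ")}")
 }

 // e.g. validateSparse(2, Array(1, 2, 3, 4, 5)) fails fast with an
 // IllegalArgumentException, matching the inconsistent vector above.
 {code}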



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9461) Possibly slightly flaky PySpark StreamingLinearRegressionWithTests

2015-07-30 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647884#comment-14647884
 ] 

Jeremy Freeman commented on SPARK-9461:
---

Interesting, the lack of failures on KMeans is consistent with the completion 
idea, because those tests use extremely toy data (< 10 data points) whereas the 
regression ones use hundreds of test data points.

In this (and similar) lines:
https://github.com/apache/spark/blob/master/python/pyspark/mllib/tests.py#L1161

there's a parameter `end_time` that's the time to wait in seconds for it to 
complete. Looking across these tests, the value fluctuates (5, 10, 15, 20) 
suggesting that it was hand-tuned, possibly tailored to a local test 
environment. Bumping that number up for any of the tests showing occasional 
errors might fix it, though that's a little ad-hoc.

I think things are more robust on the Scala side because there's a full-blown 
streaming test class that lets test jobs either run to completion, or until a 
max timeout 
(https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala).
 So there's just one test-wide parameter, the max timeout, and we could safely 
set that pretty high without wasting time.
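As a rough sketch of that idea, a generic poll-until-condition-or-max-timeout helper 
could look like this (names and timings are made up, not the actual TestSuiteBase API):

{code}
// Generic sketch: wait until a condition holds or a max timeout expires.
def eventuallyWithin(maxTimeoutMs: Long, pollMs: Long = 100)(condition: => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + maxTimeoutMs
  while (System.currentTimeMillis() < deadline) {
    if (condition) return true
    Thread.sleep(pollMs)
  }
  condition  // one last check at the deadline
}
{code}

A test can then use a single, generous max timeout without wasting time, since the 
helper returns as soon as the condition is met.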

 Possibly slightly flaky PySpark StreamingLinearRegressionWithTests
 --

 Key: SPARK-9461
 URL: https://issues.apache.org/jira/browse/SPARK-9461
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.5.0
Reporter: Joseph K. Bradley
Assignee: Jeremy Freeman

 [~freeman-lab]
 Check out this failure: 
 [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38913/consoleFull]
 It should be deterministic, but do you think it's just slight variations 
 caused by the Python version?  Or do you think it's something odd going on 
 with streaming?  This is the only time I've seen this happen, but I'll post 
 again if I see it more.
 Test failure message:
 {code}
 ==
 FAIL: test_parameter_accuracy (__main__.StreamingLinearRegressionWithTests)
 Test that coefs are predicted accurately by fitting on toy data.
 --
 Traceback (most recent call last):
   File 
 /home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py,
  line 1282, in test_parameter_accuracy
 slr.latestModel().weights.array, [10., 10.], 1)
   File 
 /home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py,
  line 1257, in assertArrayAlmostEqual
 self.assertAlmostEqual(i, j, dec)
 AssertionError: 9.4243238731093655 != 9.3216175551722014 within 1 places
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark

2015-07-30 Thread K S Sreenivasa Raghavan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647893#comment-14647893
 ] 

K S Sreenivasa Raghavan commented on SPARK-6227:


Hi,
I have worked with PySpark in EdX courses. The course coordinators distributed 
a Spark VM to all participants of the course. I am interested in developing 
this package, and I have also learnt Scala. I have a few questions:

1. Could you give me the proper steps to install Spark on my Ubuntu desktop? I 
have no idea how to modify the Spark code inside the VM, and the methods I found 
via Google search failed.
2. For PySpark, where should we write/modify the code?


 PCA and SVD for PySpark
 ---

 Key: SPARK-6227
 URL: https://issues.apache.org/jira/browse/SPARK-6227
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.2.1
Reporter: Julien Amelot

 The Dimensionality Reduction techniques are not available via Python (Scala + 
 Java only).
 * Principal component analysis (PCA)
 * Singular value decomposition (SVD)
 Doc:
 http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html
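 For reference, the existing Scala API that a PySpark wrapper would expose looks 
 roughly like this (assuming a SparkContext named sc; k = 2 is only an example):
 {code}
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.linalg.distributed.RowMatrix

 // Example data: an RDD of dense vectors wrapped in a RowMatrix.
 val rows = sc.parallelize(Seq(
   Vectors.dense(1.0, 2.0, 3.0),
   Vectors.dense(4.0, 5.0, 6.0),
   Vectors.dense(7.0, 8.0, 9.0)))
 val mat = new RowMatrix(rows)

 // PCA: top-2 principal components as a local matrix.
 val pc = mat.computePrincipalComponents(2)

 // SVD: top-2 singular values, with the left singular vectors materialized.
 val svd = mat.computeSVD(2, computeU = true)
 {code}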



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2015-07-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2089.
--
Resolution: Won't Fix

 With YARN, preferredNodeLocalityData isn't honored 
 ---

 Key: SPARK-2089
 URL: https://issues.apache.org/jira/browse/SPARK-2089
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical

 When running in YARN cluster mode, apps can pass preferred locality data when 
 constructing a Spark context that will dictate where to request executor 
 containers.
 This is currently broken because of a race condition. The Spark-YARN code 
 runs the user class and waits for it to start up a SparkContext. During its 
 initialization, the SparkContext will create a YarnClusterScheduler, which 
 notifies a monitor in the Spark-YARN code that the SparkContext is ready. The 
 Spark-YARN code then immediately fetches the preferredNodeLocationData from the 
 SparkContext and uses it to start requesting containers.
 But in the SparkContext constructor that takes the preferredNodeLocationData, 
 setting preferredNodeLocationData comes after the rest of the initialization, 
 so, if the Spark-YARN code comes around quickly enough after being notified, 
 the data that's fetched is the empty, unset version. This occurred during all 
 of my runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8175) date/time function: from_unixtime

2015-07-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8175.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7644
[https://github.com/apache/spark/pull/7644]

 date/time function: from_unixtime
 -

 Key: SPARK-8175
 URL: https://issues.apache.org/jira/browse/SPARK-8175
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Adrian Wang
 Fix For: 1.5.0


 from_unixtime(bigint unixtime[, string format]): string
 Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a 
 string representing the timestamp of that moment in the current system time 
 zone in the format of 1970-01-01 00:00:00.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
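 For illustration, a minimal usage sketch once the function is available 
 (assuming a SQLContext named sqlContext):
 {code}
 // Epoch second 0 rendered in the session time zone, first with the default
 // pattern and then with an explicit one.
 sqlContext.sql("SELECT from_unixtime(0)").show()
 sqlContext.sql("SELECT from_unixtime(0, 'yyyy-MM-dd')").show()
 {code}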



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8174) date/time function: unix_timestamp

2015-07-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8174.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7644
[https://github.com/apache/spark/pull/7644]

 date/time function: unix_timestamp
 --

 Key: SPARK-8174
 URL: https://issues.apache.org/jira/browse/SPARK-8174
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Adrian Wang
Priority: Blocker
 Fix For: 1.5.0


 3 variants:
 {code}
 unix_timestamp(): long
 Gets current Unix timestamp in seconds.
 unix_timestamp(string|date): long
 Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in 
 seconds), using the default timezone and the default locale, return 0 if 
 fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801
 unix_timestamp(string date, string pattern): long
 Convert time string with given pattern (see 
 [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) 
 to Unix time stamp (in seconds), return 0 if fail: 
 unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
 {code}
 See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4449) specify port range in spark

2015-07-30 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648090#comment-14648090
 ] 

Neelesh Srinivas Salian commented on SPARK-4449:


I would like to pick this up and work on it.

Could you please assign the JIRA to me?

Thank you.


 specify port range in spark
 ---

 Key: SPARK-4449
 URL: https://issues.apache.org/jira/browse/SPARK-4449
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Fei Wang
Priority: Minor

  In some cases, we need to specify the port range used in Spark.
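 For context, a sketch of the fixed-port settings that already exist and that a 
 range option would generalize; the range-style property mentioned in the comment 
 below is hypothetical:
 {code}
 import org.apache.spark.SparkConf

 // Today individual ports can be pinned one by one.
 val conf = new SparkConf()
   .set("spark.driver.port", "40000")
   .set("spark.blockManager.port", "40010")

 // The proposal would allow a min/max range instead of fixed values,
 // e.g. a hypothetical "spark.port.range"-style setting.
 {code}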



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9231) DistributedLDAModel method for top topics per document

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9231:
---

Assignee: Apache Spark

 DistributedLDAModel method for top topics per document
 --

 Key: SPARK-9231
 URL: https://issues.apache.org/jira/browse/SPARK-9231
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Apache Spark
Priority: Minor
   Original Estimate: 48h
  Remaining Estimate: 48h

 Helper method in DistributedLDAModel of this form:
 {code}
 /**
  * For each document, return the top k weighted topics for that document.
  * @return RDD of (doc ID, topic indices, topic weights)
  */
 def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
 {code}
 I believe the above method signature will be Java-friendly.
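 A rough sketch of how such a helper could be expressed in terms of the existing 
 topicDistributions RDD of (doc ID, topic distribution); the implementation below 
 is illustrative, not the actual patch:
 {code}
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.rdd.RDD

 // Take the k largest topic weights per document.
 def topTopicsPerDocument(topicDistributions: RDD[(Long, Vector)],
                          k: Int): RDD[(Long, Array[Int], Array[Double])] = {
   topicDistributions.map { case (docId, dist) =>
     val topK = dist.toArray.zipWithIndex.sortBy(-_._1).take(k)
     (docId, topK.map(_._2), topK.map(_._1))
   }
 }
 {code}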



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9231) DistributedLDAModel method for top topics per document

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9231:
---

Assignee: (was: Apache Spark)

 DistributedLDAModel method for top topics per document
 --

 Key: SPARK-9231
 URL: https://issues.apache.org/jira/browse/SPARK-9231
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor
   Original Estimate: 48h
  Remaining Estimate: 48h

 Helper method in DistributedLDAModel of this form:
 {code}
 /**
  * For each document, return the top k weighted topics for that document.
  * @return RDD of (doc ID, topic indices, topic weights)
  */
 def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
 {code}
 I believe the above method signature will be Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9428) Add test cases for null inputs for expression unit tests

2015-07-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9428.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7748
[https://github.com/apache/spark/pull/7748]

 Add test cases for null inputs for expression unit tests
 

 Key: SPARK-9428
 URL: https://issues.apache.org/jira/browse/SPARK-9428
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yijie Shen
Assignee: Yijie Shen
Priority: Blocker
 Fix For: 1.5.0


 We need to audit expression unit tests to make sure we pass in null inputs to 
 test null behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8786) Create a wrapper for BinaryType

2015-07-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647248#comment-14647248
 ] 

Takeshi Yamamuro edited comment on SPARK-8786 at 7/30/15 6:28 AM:
--

Sorry for the confusion; the current master branch of Spark does the following:

{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val schema = StructType(StructField("x", BinaryType, nullable = false) :: Nil)
val data = sc.parallelize(Row(Array[Byte](1.toByte)) :: 
Row(Array[Byte](1.toByte)) :: Row(Array[Byte](2.toByte)) :: Nil)
val df = sqlContext.createDataFrame(data, schema)

df.registerTempTable("test")
sqlContext.sql("SELECT DISTINCT x FROM test").show()

+---+
|  x|
+---+
|[1]|
|[2]|
+---+
{code}


was (Author: maropu):
Sorry for the confusion; the current master branch of Spark does the following:

```
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val schema = StructType(StructField("x", BinaryType, nullable = false) :: Nil)
val data = sc.parallelize(Row(Array[Byte](1.toByte)) :: 
Row(Array[Byte](1.toByte)) :: Row(Array[Byte](2.toByte)) :: Nil)
val df = sqlContext.createDataFrame(data, schema)

df.registerTempTable("test")
sqlContext.sql("SELECT DISTINCT x FROM test").show()

+---+
|  x|
+---+
|[1]|
|[2]|
+---+
```

 Create a wrapper for BinaryType
 ---

 Key: SPARK-8786
 URL: https://issues.apache.org/jira/browse/SPARK-8786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu

 The hashCode and equals() of Array[Byte] don't check the byte contents; we 
 should create a wrapper (internally) that does.
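 A minimal sketch of such a wrapper (purely illustrative; the actual class may 
 differ):
 {code}
 import java.util.Arrays

 // Gives Array[Byte] value-based equality and hashing, so it behaves correctly
 // as a key in hash-based operators.
 class ByteArrayWrapper(val bytes: Array[Byte]) {
   override def hashCode(): Int = Arrays.hashCode(bytes)
   override def equals(other: Any): Boolean = other match {
     case that: ByteArrayWrapper => Arrays.equals(bytes, that.bytes)
     case _ => false
   }
 }
 {code}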



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9464) Add property-based tests for UTF8String

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9464:
---

Assignee: Apache Spark

 Add property-based tests for UTF8String
 ---

 Key: SPARK-9464
 URL: https://issues.apache.org/jira/browse/SPARK-9464
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Josh Rosen
Assignee: Apache Spark
Priority: Critical

 UTF8String is a class that can benefit from ScalaCheck-style property checks. 
 Let's add these.
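 As one example of the kind of property meant here, a sketch assuming UTF8String 
 lives in org.apache.spark.unsafe.types:
 {code}
 import org.scalacheck.Prop.forAll
 import org.apache.spark.unsafe.types.UTF8String

 // Round trip: converting a String to UTF8String and back preserves it.
 val roundTrip = forAll { (s: String) =>
   UTF8String.fromString(s).toString == s
 }

 // Concatenation preserves total byte length.
 val concatLength = forAll { (a: String, b: String) =>
   UTF8String.fromString(a + b).numBytes ==
     UTF8String.fromString(a).numBytes + UTF8String.fromString(b).numBytes
 }
 {code}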



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9464) Add property-based tests for UTF8String

2015-07-30 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9464:
--
Assignee: (was: Josh Rosen)

 Add property-based tests for UTF8String
 ---

 Key: SPARK-9464
 URL: https://issues.apache.org/jira/browse/SPARK-9464
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Josh Rosen

 UTF8String is a class that can benefit from ScalaCheck-style property checks. 
 Let's add these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9464) Add property-based tests for UTF8String

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9464:
---

Assignee: (was: Apache Spark)

 Add property-based tests for UTF8String
 ---

 Key: SPARK-9464
 URL: https://issues.apache.org/jira/browse/SPARK-9464
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Josh Rosen
Priority: Critical

 UTF8String is a class that can benefit from ScalaCheck-style property checks. 
 Let's add these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9464) Add property-based tests for UTF8String

2015-07-30 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647252#comment-14647252
 ] 

Josh Rosen commented on SPARK-9464:
---

Unassigning myself since I don't have time to work on this in the short 
term. Feel free to use my WIP PR as a starting point.

 Add property-based tests for UTF8String
 ---

 Key: SPARK-9464
 URL: https://issues.apache.org/jira/browse/SPARK-9464
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Josh Rosen

 UTF8String is a class that can benefit from ScalaCheck-style property checks. 
 Let's add these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9464) Add property-based tests for UTF8String

2015-07-30 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9464:
--
Target Version/s: 1.5.0
Priority: Critical  (was: Major)

 Add property-based tests for UTF8String
 ---

 Key: SPARK-9464
 URL: https://issues.apache.org/jira/browse/SPARK-9464
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Josh Rosen
Priority: Critical

 UTF8String is a class that can benefit from ScalaCheck-style property checks. 
 Let's add these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7264) SparkR API for parallel functions

2015-07-30 Thread Rick Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647268#comment-14647268
 ] 

Rick Moritz commented on SPARK-7264:


I've also added a bit of commentary.

 SparkR API for parallel functions
 -

 Key: SPARK-7264
 URL: https://issues.apache.org/jira/browse/SPARK-7264
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman

 This is a JIRA to discuss design proposals for enabling parallel R 
 computation in SparkR without exposing the entire RDD API. 
 The rationale for this is that the RDD API has a number of low level 
 functions and we would like to expose a more light-weight API that is both 
 friendly to R users and easy to maintain.
 http://goo.gl/GLHKZI has a first cut design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9231) DistributedLDAModel method for top topics per document

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647233#comment-14647233
 ] 

Apache Spark commented on SPARK-9231:
-

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/7785

 DistributedLDAModel method for top topics per document
 --

 Key: SPARK-9231
 URL: https://issues.apache.org/jira/browse/SPARK-9231
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor
   Original Estimate: 48h
  Remaining Estimate: 48h

 Helper method in DistributedLDAModel of this form:
 {code}
 /**
  * For each document, return the top k weighted topics for that document.
  * @return RDD of (doc ID, topic indices, topic weights)
  */
 def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
 {code}
 I believe the above method signature will be Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8005) Support INPUT__FILE__NAME virtual column

2015-07-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8005.

   Resolution: Fixed
 Assignee: Joseph Batchik
Fix Version/s: 1.5.0

 Support INPUT__FILE__NAME virtual column
 

 Key: SPARK-8005
 URL: https://issues.apache.org/jira/browse/SPARK-8005
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Joseph Batchik
 Fix For: 1.5.0


 INPUT__FILE__NAME: input file name.
 One way to do this is through a thread-local variable in SqlNewHadoopRDD.scala, 
 and to read that thread-local variable in an expression (similar to the 
 SparkPartitionID expression).
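 A bare-bones sketch of the thread-local idea; the object and method names below 
 are hypothetical, not the actual Spark classes:
 {code}
 // The record-reading code sets the current input file name into a thread-local,
 // and an expression reads it back when evaluated on the same task thread.
 object InputFileNameHolder {
   private val current = new ThreadLocal[String] {
     override def initialValue(): String = ""
   }
   def set(name: String): Unit = current.set(name)
   def get(): String = current.get()
 }

 // The RDD iterator would call InputFileNameHolder.set(path) before reading a
 // split, and the INPUT__FILE__NAME expression would return
 // InputFileNameHolder.get() at evaluation time.
 {code}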



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8007.

Resolution: Won't Fix

These are now just functions.


 Support resolving virtual columns in DataFrames
 ---

 Key: SPARK-8007
 URL: https://issues.apache.org/jira/browse/SPARK-8007
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Joseph Batchik

 Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to the 
 SparkPartitionID expression.
 A cool use case is to understand physical data skew:
 {code}
 df.groupBy("SPARK__PARTITION__ID").count()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9396) Spark yarn allocator does not call removeContainerRequest for allocated Container requests, resulting in bloated ask[] toYarn RM.

2015-07-30 Thread prakhar jauhari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647258#comment-14647258
 ] 

prakhar jauhari commented on SPARK-9396:


Can you please assign this issue to me? I am adding a PR.

 Spark yarn allocator does not call removeContainerRequest for allocated 
 Container requests, resulting in bloated ask[] toYarn RM.
 ---

 Key: SPARK-9396
 URL: https://issues.apache.org/jira/browse/SPARK-9396
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
 Environment: Spark-1.2.1 on hadoop-yarn-2.4.0 cluster. All servers in 
 cluster running Linux version 2.6.32.
Reporter: prakhar jauhari

 Note: the attached logs contain log lines that I added (on the Spark YARN 
 allocator side and the YARN client side) for debugging purposes.
 My Spark job is configured for 2 executors; on killing 1 executor, the ask is 
 for 3!
 On killing an executor - resource request logs:
 *Killed container: ask for 3 containers instead of 1*
 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: Will allocate 1 executor 
 containers, each with 2432 MB memory including 384 MB overhead
 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: numExecutors: 1
 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: host preferences is empty
 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: Container request (host: 
 Any, priority: 1, capability: memory:2432, vCores:4
 15/07/15 10:49:01 INFO impl.AMRMClientImpl: prakhar : AMRMClientImpl : 
 allocate: this.ask = [{Priority: 1, Capability: memory:2432, vCores:4, # 
 Containers: 3, Location: *, Relax Locality: true}]
 15/07/15 10:49:01 INFO impl.AMRMClientImpl: prakhar : AMRMClientImpl : 
 allocate: allocateRequest = ask { priority{ priority: 1 } resource_name: * 
 capability { memory: 2432 virtual_cores: 4 } num_containers: 3 
 relax_locality: true } blacklist_request { } response_id: 354 progress: 0.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8002) Support virtual columns in SQL and DataFrames

2015-07-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8002.

   Resolution: Fixed
 Assignee: Reynold Xin
Fix Version/s: 1.5.0

We ended up just creating functions to support these.


 Support virtual columns in SQL and DataFrames
 -

 Key: SPARK-8002
 URL: https://issues.apache.org/jira/browse/SPARK-8002
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6319) DISTINCT doesn't work for binary type

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647289#comment-14647289
 ] 

Apache Spark commented on SPARK-6319:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/7787

 DISTINCT doesn't work for binary type
 -

 Key: SPARK-6319
 URL: https://issues.apache.org/jira/browse/SPARK-6319
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
Reporter: Cheng Lian
Priority: Critical

 Spark shell session for reproduction:
 {noformat}
 scala> import sqlContext.implicits._
 scala> import org.apache.spark.sql.types._
 scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" 
 cast BinaryType).distinct.show()
 ...
 CAST(c, BinaryType)
 [B@43f13160
 [B@5018b648
 [B@3be22500
 [B@476fc8a1
 {noformat}
 Spark SQL uses plain byte arrays to represent binary values. However, arrays 
 are compared by reference rather than by value. On the other hand, the 
 DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check 
 for duplicated values. These two facts together cause the problem.
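 A small illustration of the reference-vs-content equality difference:
 {code}
 // Plain byte arrays use reference equality and identity hash codes, so two
 // arrays with identical contents look distinct to a HashSet.
 val a = Array[Byte](1, 2, 3)
 val b = Array[Byte](1, 2, 3)

 a == b                         // false: compared by reference
 a.sameElements(b)              // true: compared by content
 java.util.Arrays.equals(a, b)  // true: the content comparison a wrapper would use
 {code}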



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6319) DISTINCT doesn't work for binary type

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6319:
---

Assignee: (was: Apache Spark)

 DISTINCT doesn't work for binary type
 -

 Key: SPARK-6319
 URL: https://issues.apache.org/jira/browse/SPARK-6319
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
Reporter: Cheng Lian
Priority: Critical

 Spark shell session for reproduction:
 {noformat}
 scala> import sqlContext.implicits._
 scala> import org.apache.spark.sql.types._
 scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" 
 cast BinaryType).distinct.show()
 ...
 CAST(c, BinaryType)
 [B@43f13160
 [B@5018b648
 [B@3be22500
 [B@476fc8a1
 {noformat}
 Spark SQL uses plain byte arrays to represent binary values. However, arrays 
 are compared by reference rather than by value. On the other hand, the 
 DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check 
 for duplicated values. These two facts together cause the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6319) DISTINCT doesn't work for binary type

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6319:
---

Assignee: Apache Spark

 DISTINCT doesn't work for binary type
 -

 Key: SPARK-6319
 URL: https://issues.apache.org/jira/browse/SPARK-6319
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
Reporter: Cheng Lian
Assignee: Apache Spark
Priority: Critical

 Spark shell session for reproduction:
 {noformat}
 scala> import sqlContext.implicits._
 scala> import org.apache.spark.sql.types._
 scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" 
 cast BinaryType).distinct.show()
 ...
 CAST(c, BinaryType)
 [B@43f13160
 [B@5018b648
 [B@3be22500
 [B@476fc8a1
 {noformat}
 Spark SQL uses plain byte arrays to represent binary values. However, arrays 
 are compared by reference rather than by value. On the other hand, the 
 DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check 
 for duplicated values. These two facts together cause the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9239) HiveUDAF support for AggregateFunction2

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9239:
---

Assignee: Apache Spark

 HiveUDAF support for AggregateFunction2
 ---

 Key: SPARK-9239
 URL: https://issues.apache.org/jira/browse/SPARK-9239
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Apache Spark
Priority: Blocker

 We need to build a wrapper for Hive UDAFs on top of AggregateFunction2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9239) HiveUDAF support for AggregateFunction2

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9239:
---

Assignee: (was: Apache Spark)

 HiveUDAF support for AggregateFunction2
 ---

 Key: SPARK-9239
 URL: https://issues.apache.org/jira/browse/SPARK-9239
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Blocker

 We need to build a wrapper for Hive UDAFs on top of AggregateFunction2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9239) HiveUDAF support for AggregateFunction2

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647291#comment-14647291
 ] 

Apache Spark commented on SPARK-9239:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/7788

 HiveUDAF support for AggregateFunction2
 ---

 Key: SPARK-9239
 URL: https://issues.apache.org/jira/browse/SPARK-9239
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Blocker

 We need to build a wrapper for Hive UDAFs on top of AggregateFunction2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).

2015-07-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647848#comment-14647848
 ] 

Sean Owen commented on SPARK-9477:
--

It's not part of Spark or supported by the project; IMHO, no, it would not belong 
in the Spark docs. Standalone/YARN/Mesos are directly supported by code within 
Spark.

I think the closest thing Spark has to that is the "Powered By" wiki page, which 
lists third-party projects/products/services related to Spark: 
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

 Adding IBM Platform Application Service Controller into Spark documentation 
 as a supported Cluster Manager (beside Yarn and Mesos). 
 

 Key: SPARK-9477
 URL: https://issues.apache.org/jira/browse/SPARK-9477
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.1
Reporter: Stacy Pedersen
Priority: Minor
 Fix For: 1.4.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9277:
-
Assignee: Sean Owen

 SparseVector constructor must throw an error when declared number of elements 
 less than array length
 

 Key: SPARK-9277
 URL: https://issues.apache.org/jira/browse/SPARK-9277
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Andrey Vykhodtsev
Assignee: Sean Owen
Priority: Minor
  Labels: starter
 Fix For: 1.5.0

 Attachments: SparseVector test.html, SparseVector test.ipynb


 I found that one can create a SparseVector inconsistently, and it leads to 
 a Java error at runtime, for example when training LogisticRegressionWithSGD.
 Here is the test case:
 In [2]:
 sc.version
 Out[2]:
 u'1.3.1'
 In [13]:
 from pyspark.mllib.linalg import SparseVector
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import LogisticRegressionWithSGD
 In [3]:
 x =  SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
 In [10]:
 l = LabeledPoint(0, x)
 In [12]:
 r = sc.parallelize([l])
 In [14]:
 m = LogisticRegressionWithSGD.train(r)
 Error:
 Py4JJavaError: An error occurred while calling 
 o86.trainLogisticRegressionModelWithSGD.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 
 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
 Attached is the notebook with the scenario and the full message



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9248) Closing curly-braces should always be on their own line

2015-07-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-9248:
-
Assignee: Yu Ishikawa

 Closing curly-braces should always be on their own line
 ---

 Key: SPARK-9248
 URL: https://issues.apache.org/jira/browse/SPARK-9248
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa
Assignee: Yu Ishikawa
Priority: Minor
 Fix For: 1.5.0


 Closing curly-braces should always be on their own line
 For example,
 {noformat}
 inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always 
 be on their own line, unless it's followed by an else.
   }, error = function(err) {
   ^
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9248) Closing curly-braces should always be on their own line

2015-07-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9248.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7795
[https://github.com/apache/spark/pull/7795]

 Closing curly-braces should always be on their own line
 ---

 Key: SPARK-9248
 URL: https://issues.apache.org/jira/browse/SPARK-9248
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa
Priority: Minor
 Fix For: 1.5.0


 Closing curly-braces should always be on their own line
 For example,
 {noformat}
 inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always 
 be on their own line, unless it's followed by an else.
   }, error = function(err) {
   ^
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).

2015-07-30 Thread Stacy Pedersen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647886#comment-14647886
 ] 

Stacy Pedersen commented on SPARK-9477:
---

Fair enough, how about a link at the bottom of the supplemental Spark projects 
page for now? 
https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects ? 
I think this is a good start. Also we don't need code in Spark to support 
integration with Platform EGO for resource management.

 Adding IBM Platform Application Service Controller into Spark documentation 
 as a supported Cluster Manager (beside Yarn and Mesos). 
 

 Key: SPARK-9477
 URL: https://issues.apache.org/jira/browse/SPARK-9477
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.1
Reporter: Stacy Pedersen
Priority: Minor
 Fix For: 1.4.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).

2015-07-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647910#comment-14647910
 ] 

Sean Owen commented on SPARK-9477:
--

Seems reasonable to me -- anybody else have an opinion? If not, I'll update the 
wiki after a day or two.

 Adding IBM Platform Application Service Controller into Spark documentation 
 as a supported Cluster Manager (beside Yarn and Mesos). 
 

 Key: SPARK-9477
 URL: https://issues.apache.org/jira/browse/SPARK-9477
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.1
Reporter: Stacy Pedersen
Priority: Minor
 Fix For: 1.4.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9225) LDASuite needs unit tests for empty documents

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9225.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7620
[https://github.com/apache/spark/pull/7620]

 LDASuite needs unit tests for empty documents
 -

 Key: SPARK-9225
 URL: https://issues.apache.org/jira/browse/SPARK-9225
 Project: Spark
  Issue Type: Test
  Components: MLlib
Reporter: Feynman Liang
Assignee: Meihua Wu
Priority: Minor
  Labels: starter
 Fix For: 1.5.0


 We need to add a unit test to {{LDASuite}} which checks that empty documents 
 are handled appropriately without crashing. This would require defining an 
 empty corpus within {{LDASuite}} and adding tests for the available LDA 
 optimizers (currently EM and Online). Note that only {{SparseVector}}s can be 
 empty.
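 A sketch of what such an empty corpus could look like (assuming a SparkContext 
 named sc; the wiring into LDASuite is omitted):
 {code}
 import org.apache.spark.mllib.linalg.{Vector, Vectors}

 // An "empty" document is a sparse term-count vector with no non-zero entries.
 val vocabSize = 10
 val emptyCorpus = sc.parallelize(Seq[(Long, Vector)](
   (0L, Vectors.sparse(vocabSize, Seq())),
   (1L, Vectors.sparse(vocabSize, Seq()))))

 // The test would then run both the EM and Online optimizers over this corpus
 // and assert that training completes without crashing.
 {code}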



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9480) Create an map abstract class MapData and a default implementation backed by 2 ArrayData

2015-07-30 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-9480:
--

 Summary: Create an map abstract class MapData and a default 
implementation backed by 2 ArrayData
 Key: SPARK-9480
 URL: https://issues.apache.org/jira/browse/SPARK-9480
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9480) Create an map abstract class MapData and a default implementation backed by 2 ArrayData

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9480:
---

Assignee: (was: Apache Spark)

 Create an map abstract class MapData and a default implementation backed by 2 
 ArrayData
 ---

 Key: SPARK-9480
 URL: https://issues.apache.org/jira/browse/SPARK-9480
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9480) Create an map abstract class MapData and a default implementation backed by 2 ArrayData

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647888#comment-14647888
 ] 

Apache Spark commented on SPARK-9480:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7799

 Create an map abstract class MapData and a default implementation backed by 2 
 ArrayData
 ---

 Key: SPARK-9480
 URL: https://issues.apache.org/jira/browse/SPARK-9480
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9480) Create a map abstract class MapData and a default implementation backed by 2 ArrayData

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9480:
---

Assignee: Apache Spark

 Create a map abstract class MapData and a default implementation backed by 2 
 ArrayData
 ---

 Key: SPARK-9480
 URL: https://issues.apache.org/jira/browse/SPARK-9480
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9077) Improve error message for decision trees when numExamples < maxCategoriesPerFeature

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647937#comment-14647937
 ] 

Apache Spark commented on SPARK-9077:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7800

 Improve error message for decision trees when numExamples < maxCategoriesPerFeature
 ---

 Key: SPARK-9077
 URL: https://issues.apache.org/jira/browse/SPARK-9077
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter
   Original Estimate: 48h
  Remaining Estimate: 48h

 See [SPARK-9075]'s discussion for details.  We should improve the current 
 error message to recommend that the user remove the high-arity categorical 
 features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9077) Improve error message for decision trees when numExamples < maxCategoriesPerFeature

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9077:
---

Assignee: (was: Apache Spark)

 Improve error message for decision trees when numExamples < maxCategoriesPerFeature
 ---

 Key: SPARK-9077
 URL: https://issues.apache.org/jira/browse/SPARK-9077
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter
   Original Estimate: 48h
  Remaining Estimate: 48h

 See [SPARK-9075]'s discussion for details.  We should improve the current 
 error message to recommend that the user remove the high-arity categorical 
 features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9077) Improve error message for decision trees when numExamples < maxCategoriesPerFeature

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9077:
---

Assignee: Apache Spark

 Improve error message for decision trees when numExamples < maxCategoriesPerFeature
 ---

 Key: SPARK-9077
 URL: https://issues.apache.org/jira/browse/SPARK-9077
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Apache Spark
Priority: Trivial
  Labels: starter
   Original Estimate: 48h
  Remaining Estimate: 48h

 See [SPARK-9075]'s discussion for details.  We should improve the current 
 error message to recommend that the user remove the high-arity categorical 
 features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9390) Create an array abstract class ArrayData and a default implementation backed by Array[Object]

2015-07-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9390.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Create an array abstract class ArrayData and a default implementation backed 
 by Array[Object]
 -

 Key: SPARK-9390
 URL: https://issues.apache.org/jira/browse/SPARK-9390
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Wenchen Fan
 Fix For: 1.5.0


 {code}
 interface ArrayData extends SpecializedGetters {
   int numElements();
   int sizeInBytes();
 }
 {code}
 We should also add to SpecializedGetters a method to get an array, i.e.
 {code}
 interface SpecializedGetters {
   ...
   ArrayData getArray(int ordinal);
   ...
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9267) Remove highly unnecessary accumulators stringify methods

2015-07-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9267.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7678
[https://github.com/apache/spark/pull/7678]

 Remove highly unnecessary accumulators stringify methods
 

 Key: SPARK-9267
 URL: https://issues.apache.org/jira/browse/SPARK-9267
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Priority: Trivial
 Fix For: 1.5.0


 {code}
 def stringifyPartialValue(partialValue: Any): String = "%s".format(partialValue)
 def stringifyValue(value: Any): String = "%s".format(value)
 {code}
 These are only used in 1 place (DAGScheduler). The level of indirection 
 actually makes the code harder to read without an editor. We should just 
 inline them...
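 A minimal sketch of the inlined form at the single call site (the variable names 
 below are illustrative, since the surrounding DAGScheduler code is elided):
 {code}
 // Before: indirection through the helper methods
 //   logInfo("Accumulator value: " + Accumulators.stringifyValue(value))
 // After: inline the trivial formatting where it is used
 val value: Any = 42
 val str = "%s".format(value)   // identical output, no extra indirection
 println(str)
 {code}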



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9267) Remove highly unnecessary accumulators stringify methods

2015-07-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9267:
-
Assignee: François Garillot

 Remove highly unnecessary accumulators stringify methods
 

 Key: SPARK-9267
 URL: https://issues.apache.org/jira/browse/SPARK-9267
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: François Garillot
Priority: Trivial
 Fix For: 1.5.0


 {code}
 def stringifyPartialValue(partialValue: Any): String = "%s".format(partialValue)
 def stringifyValue(value: Any): String = "%s".format(value)
 {code}
 These are only used in 1 place (DAGScheduler). The level of indirection 
 actually makes the code harder to read without an editor. We should just 
 inline them...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6486) Add BlockMatrix in PySpark

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6486:
-
Assignee: Mike Dusenberry

 Add BlockMatrix in PySpark
 --

 Key: SPARK-6486
 URL: https://issues.apache.org/jira/browse/SPARK-6486
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Mike Dusenberry

 We should add BlockMatrix to PySpark. Internally, we can use DataFrames and 
 MatrixUDT for serialization. This JIRA should contain conversions between 
 IndexedRowMatrix/CoordinateMatrix to block matrices. But this does NOT cover 
 linear algebra operations of block matrices.
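 For reference, a short sketch of the existing Scala conversions that the Python 
 API would mirror, assuming a {{SparkContext}} named {{sc}} is in scope:
 {code}
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

 val rows = sc.parallelize(Seq(
   IndexedRow(0L, Vectors.dense(1.0, 2.0)),
   IndexedRow(1L, Vectors.dense(3.0, 4.0))))

 // Convert an IndexedRowMatrix to a BlockMatrix (CoordinateMatrix has the same method).
 val blockMatrix = new IndexedRowMatrix(rows).toBlockMatrix()
 blockMatrix.validate()   // sanity-check block dimensions and indices
 {code}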



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9489) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange

2015-07-30 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9489:
-

 Summary: Remove compatibleWith, meetsRequirements, and 
needsAnySort checks from Exchange
 Key: SPARK-9489
 URL: https://issues.apache.org/jira/browse/SPARK-9489
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


While reviewing [~yhuai]'s patch for SPARK-2205, I noticed that Exchange's 
{{compatible}} check may be incorrectly returning {{false}} in many cases.  As 
far as I know, this is not actually a problem because the {{compatible}}, 
{{meetsRequirements}}, and {{needsAnySort}} checks are serving only as 
short-circuit performance optimizations that are not necessary for correctness.

In order to reduce code complexity, I think that we should remove these checks 
and unconditionally rewrite the operator's children.  This should be safe 
because we rewrite the tree in a single bottom-up pass.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9493) Chain logistic regression with isotonic regression under the pipeline API

2015-07-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-9493:


 Summary: Chain logistic regression with isotonic regression under 
the pipeline API
 Key: SPARK-9493
 URL: https://issues.apache.org/jira/browse/SPARK-9493
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


One use case of isotonic regression is to calibrate the probabilities output by 
logistic regression. We should make this easier in the pipeline API.
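A hedged sketch of what the chained pipeline might look like; whether IsotonicRegression 
can consume the probability column directly like this is an assumption for illustration, 
not the settled design:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.regression.IsotonicRegression

val lr = new LogisticRegression()
val calibrator = new IsotonicRegression()
  .setFeaturesCol("probability")   // assumption: calibrate on LR's probability output
  .setFeatureIndex(1)              // assumption: index of P(label = 1) in the vector
  .setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(lr, calibrator))
// val calibratedModel = pipeline.fit(training)
{code}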



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9408) Refactor mllib/linalg.py to mllib/linalg

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9408.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7746
[https://github.com/apache/spark/pull/7746]

 Refactor mllib/linalg.py to mllib/linalg
 

 Key: SPARK-9408
 URL: https://issues.apache.org/jira/browse/SPARK-9408
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Assignee: Manoj Kumar
 Fix For: 1.5.0


 We need to refactor mllib/linalg.py to mllib/linalg so that the project 
 structure is similar to that of Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9361) Refactor new aggregation code to reduce the times of checking compatibility

2015-07-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9361:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-4366

 Refactor new aggregation code to reduce the times of checking compatibility
 ---

 Key: SPARK-9361
 URL: https://issues.apache.org/jira/browse/SPARK-9361
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Liang-Chi Hsieh

 Currently, we call aggregate.Utils.tryConvert in many places to check if the 
 logical.aggregate can be run with the new aggregation code. But it looks like 
 aggregate.Utils.tryConvert costs much time to run. We should only call 
 tryConvert once, keep its value in logical.aggregate, and reuse it.
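 A self-contained sketch of the "compute once, reuse" pattern being proposed, using 
 stand-in types (the names below are illustrative, not the actual Catalyst classes):
 {code}
 object TryConvertOnce {
   final case class ConvertedPlan(description: String)

   // Stand-in for the expensive aggregate.Utils.tryConvert check.
   def tryConvert(plan: String): Option[ConvertedPlan] = {
     Thread.sleep(10)
     if (plan.nonEmpty) Some(ConvertedPlan(s"converted: $plan")) else None
   }

   final class Aggregate(val plan: String) {
     // Cached: the expensive conversion runs at most once per logical node.
     lazy val newAggregation: Option[ConvertedPlan] = tryConvert(plan)
   }

   def main(args: Array[String]): Unit = {
     val agg = new Aggregate("sum(x) GROUP BY y")
     // Every caller reuses the cached value instead of re-running tryConvert.
     println(agg.newAggregation)
     println(agg.newAggregation)
   }
 }
 {code}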



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9361) Refactor new aggregation code to reduce the times of checking compatibility

2015-07-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9361.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7677
[https://github.com/apache/spark/pull/7677]

 Refactor new aggregation code to reduce the times of checking compatibility
 ---

 Key: SPARK-9361
 URL: https://issues.apache.org/jira/browse/SPARK-9361
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Liang-Chi Hsieh
 Fix For: 1.5.0


 Currently, we call aggregate.Utils.tryConvert in many places to check if the 
 logical.aggregate can be run with the new aggregation code. But it looks like 
 aggregate.Utils.tryConvert costs much time to run. We should only call 
 tryConvert once, keep its value in logical.aggregate, and reuse it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9361) Refactor new aggregation code to reduce the times of checking compatibility

2015-07-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9361:

Assignee: Liang-Chi Hsieh

 Refactor new aggregation code to reduce the times of checking compatibility
 ---

 Key: SPARK-9361
 URL: https://issues.apache.org/jira/browse/SPARK-9361
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
 Fix For: 1.5.0


 Currently, we call aggregate.Utils.tryConvert in many places to check if the 
 logical.aggregate can be run with the new aggregation code. But it looks like 
 aggregate.Utils.tryConvert costs much time to run. We should only call 
 tryConvert once, keep its value in logical.aggregate, and reuse it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN

2015-07-30 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-8297.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

 Scheduler backend is not notified in case node fails in YARN
 

 Key: SPARK-8297
 URL: https://issues.apache.org/jira/browse/SPARK-8297
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0
 Environment: Spark on yarn - both client and cluster mode.
Reporter: Mridul Muralidharan
Assignee: Mridul Muralidharan
Priority: Critical
 Fix For: 1.5.0


 When a node crashes, YARN detects the failure and notifies Spark - but this 
 information is not propagated to the scheduler backend (unlike in Mesos mode, for 
 example).
 It results in repeated re-execution of stages (due to FetchFailedException on 
 the shuffle side), finally resulting in application failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9388) Make log messages in ExecutorRunnable more readable

2015-07-30 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-9388.
---
   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.5.0

 Make log messages in ExecutorRunnable more readable
 ---

 Key: SPARK-9388
 URL: https://issues.apache.org/jira/browse/SPARK-9388
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Trivial
 Fix For: 1.5.0


 There are a couple of debug messages printed in ExecutorRunnable containing 
 information about the container being started. They're printed all in one 
 line, which makes them - especially the one containing the process's 
 environment - hard to read.
 We should make them nicer (like the similar one printed by Client.scala).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9437) SizeEstimator overflows for primitive arrays

2015-07-30 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648025#comment-14648025
 ] 

Shivaram Venkataraman commented on SPARK-9437:
--

Resolved by https://github.com/apache/spark/pull/7750

 SizeEstimator overflows for primitive arrays
 

 Key: SPARK-9437
 URL: https://issues.apache.org/jira/browse/SPARK-9437
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Imran Rashid
Assignee: Imran Rashid
Priority: Minor
 Fix For: 1.5.0


 {{SizeEstimator}} can overflow when dealing with large primitive arrays, e.g. if 
 you have an {{Array[Double]}} of size 1 << 28.  This means that when you try 
 to broadcast a large primitive array, you get:
 {noformat}
 java.lang.IllegalArgumentException: requirement failed: sizeInBytes was 
 negative: -2147483608
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815)
at 
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
 ...
 {noformat}
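 A minimal, self-contained illustration of the {{Int}} overflow described above (this 
 is not the actual {{SizeEstimator}} code, which also accounts for object headers):
 {code}
 object SizeOverflowSketch {
   def main(args: Array[String]): Unit = {
     val numElements = 1 << 28          // 268,435,456 doubles
     val bytesPerDouble = 8
     val asInt: Int   = numElements * bytesPerDouble          // overflows to a negative Int
     val asLong: Long = numElements.toLong * bytesPerDouble   // correct: 2147483648
     println(s"Int arithmetic:  $asInt")
     println(s"Long arithmetic: $asLong")
   }
 }
 {code}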



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9437) SizeEstimator overflows for primitive arrays

2015-07-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9437.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 SizeEstimator overflows for primitive arrays
 

 Key: SPARK-9437
 URL: https://issues.apache.org/jira/browse/SPARK-9437
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Imran Rashid
Assignee: Imran Rashid
Priority: Minor
 Fix For: 1.5.0


 {{SizeEstimator}} can overflow when dealing with large primitive arrays, e.g. if 
 you have an {{Array[Double]}} of size 1 << 28.  This means that when you try 
 to broadcast a large primitive array, you get:
 {noformat}
 java.lang.IllegalArgumentException: requirement failed: sizeInBytes was 
 negative: -2147483608
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815)
at 
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
 ...
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9481:
---

Assignee: (was: Apache Spark)

 LocalLDAModel logLikelihood
 ---

 Key: SPARK-9481
 URL: https://issues.apache.org/jira/browse/SPARK-9481
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Feynman Liang
Priority: Trivial

 We already have a variational {{bound}} method so we should provide a public 
 {{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-9481:
-
Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-5572)

 LocalLDAModel logLikelihood
 ---

 Key: SPARK-9481
 URL: https://issues.apache.org/jira/browse/SPARK-9481
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Feynman Liang
Priority: Trivial

 We already have a variational {{bound}} method so we should provide a public 
 {{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647991#comment-14647991
 ] 

Feynman Liang commented on SPARK-9481:
--

Working on this

 LocalLDAModel logLikelihood
 ---

 Key: SPARK-9481
 URL: https://issues.apache.org/jira/browse/SPARK-9481
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Feynman Liang
Priority: Trivial

 We already have a variational {{bound}} method so we should provide a public 
 {{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-9481:


 Summary: LocalLDAModel logLikelihood
 Key: SPARK-9481
 URL: https://issues.apache.org/jira/browse/SPARK-9481
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Feynman Liang
Priority: Trivial


We already have a variational {{bound}} method so we should provide a public 
{{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8850) Turn unsafe mode on by default

2015-07-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8850.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Turn unsafe mode on by default
 --

 Key: SPARK-8850
 URL: https://issues.apache.org/jira/browse/SPARK-8850
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Reynold Xin
Assignee: Josh Rosen
 Fix For: 1.5.0


 Let's turn unsafe on and see what bugs we find in preparation for 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648039#comment-14648039
 ] 

Apache Spark commented on SPARK-9481:
-

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7801

 LocalLDAModel logLikelihood
 ---

 Key: SPARK-9481
 URL: https://issues.apache.org/jira/browse/SPARK-9481
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Feynman Liang
Priority: Trivial

 We already have a variational {{bound}} method so we should provide a public 
 {{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9481:
---

Assignee: Apache Spark

 LocalLDAModel logLikelihood
 ---

 Key: SPARK-9481
 URL: https://issues.apache.org/jira/browse/SPARK-9481
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Feynman Liang
Assignee: Apache Spark
Priority: Trivial

 We already have a variational {{bound}} method so we should provide a public 
 {{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-9282) Filter on Spark DataFrame with multiple columns

2015-07-30 Thread Sandeep Pal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal reopened SPARK-9282:


On using '&' instead of 'and', the following error occurs:

Py4JError Traceback (most recent call last)
<ipython-input-8-b3101afeeb7a> in <module>()
----> 1 df1.filter(df1.age > 21 & df1.age < 45).show(10)

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py in 
_(self, other)
999 def _(self, other):
   1000 jc = other._jc if isinstance(other, Column) else other
-> 1001 njc = getattr(self._jc, name)(jc)
   1002 return Column(njc)
   1003 _.__doc__ = doc

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
302 raise Py4JError(
303 'An error occurred while calling {0}{1}{2}. 
Trace:\n{3}\n'.
--> 304 format(target_id, '.', name, value))
305 else:
306 raise Py4JError(

Py4JError: An error occurred while calling o83.and. Trace:
py4j.Py4JException: Method and([class java.lang.Integer]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

 Filter on Spark DataFrame with multiple columns
 ---

 Key: SPARK-9282
 URL: https://issues.apache.org/jira/browse/SPARK-9282
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Spark Shell, SQL
Affects Versions: 1.3.0
 Environment: CDH 5.0 on CentOS6
Reporter: Sandeep Pal

 Filter on dataframe does not work if we have more than one column inside the 
 filter. Nonetheless, it works on an RDD.
 Following is the example:
 df1.show()
 age coolid depid empname
 23  7  1 sandeep
 21  8  2 john   
 24  9  1 cena   
 45  12 3 bob
 20  7  4 tanay  
 12  8  5 gaurav 
 df1.filter(df1.age > 21 and df1.age < 45).show(10)
 23  7  1 sandeep
 21  8  2 john   -
 24  9  1 cena   
 20  7  4 tanay -
 12  8  5 gaurav   --



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8497) Graph Clique(Complete Connected Sub-graph) Discovery Algorithm

2015-07-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648126#comment-14648126
 ] 

Xiangrui Meng commented on SPARK-8497:
--

Please provide the algorithm you want to implement, which should be based on 
some published work for correctness. I don't know how to handle the exponential 
growth of the number of cliques. For example, if we have a clique of size 40, there 
will be (40 choose 20) cliques of size 20, which is more than 100 billion.
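For reference, a quick self-contained check of that combinatorial claim:
{code}
object CliqueCountCheck {
  // C(n, k) computed incrementally; each intermediate value is itself a binomial
  // coefficient, so the division is always exact.
  def choose(n: Int, k: Int): BigInt =
    (1 to k).foldLeft(BigInt(1))((acc, i) => acc * (n - k + i) / i)

  def main(args: Array[String]): Unit =
    println(choose(40, 20))   // 137846528820, i.e. roughly 1.4e11 sub-cliques
}
{code}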

 Graph Clique(Complete Connected Sub-graph) Discovery Algorithm
 --

 Key: SPARK-8497
 URL: https://issues.apache.org/jira/browse/SPARK-8497
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, ML, MLlib, Spark Core
Reporter: Fan Jiang
Assignee: Fan Jiang
  Labels: features
   Original Estimate: 72h
  Remaining Estimate: 72h

 In recent years, the social network industry has had high demand for Complete 
 Connected Sub-Graph Discovery, and so has Telecom. Similar to the connection 
 graph from Twitter, the calls and other activities in the telecom world form a 
 huge social graph, and due to the nature of the communication method, it 
 shows the strongest inter-person relationships; graph-based analysis will 
 reveal tremendous value from telecom connections. 
 We need an algorithm in Spark to figure out ALL the strongest completely 
 connected sub-graphs (so called Cliques here) for EVERY person in the network, 
 which will be one of the starting points for understanding users' social 
 behaviour. 
 In Huawei, we have many real-world use cases that involve telecom social 
 graphs of tens of billions of edges and hundreds of millions of vertices, and 
 the cliques will also be in the tens of millions. The graph will be a 
 fast-changing one, which means we need to analyse the graph pattern very often 
 (one result per day/week for a moving time window which spans multiple months). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9463) Expose model coefficients with names in SparkR RFormula

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9463:
-
Assignee: Eric Liang

 Expose model coefficients with names in SparkR RFormula
 ---

 Key: SPARK-9463
 URL: https://issues.apache.org/jira/browse/SPARK-9463
 Project: Spark
  Issue Type: Improvement
  Components: ML, SparkR
Reporter: Eric Liang
Assignee: Eric Liang

 Currently you cannot retrieve model statistics from the R side; we should at 
 least allow showing the coefficients for 1.5.
 Design doc from umbrella task: 
 https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9483) UTF8String.getPrefix only works in little-endian order

2015-07-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9483:
--

 Summary: UTF8String.getPrefix only works in little-endian order
 Key: SPARK-9483
 URL: https://issues.apache.org/jira/browse/SPARK-9483
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6805) MLlib + SparkR integration for 1.5

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6805:
-
Summary: MLlib + SparkR integration for 1.5  (was: ML Pipeline API in 
SparkR)

 MLlib + SparkR integration for 1.5
 --

 Key: SPARK-6805
 URL: https://issues.apache.org/jira/browse/SPARK-6805
 Project: Spark
  Issue Type: Umbrella
  Components: ML, SparkR
Reporter: Xiangrui Meng
Priority: Critical

 SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
 in SparkR. The implementation should be similar to the pipeline API 
 implementation in Python.
 For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
 basic support for R formula and elastic-net regularization. The design doc 
 can be viewed at 
 https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6805) MLlib + SparkR integration for 1.5

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6805:
-
Assignee: Eric Liang

 MLlib + SparkR integration for 1.5
 --

 Key: SPARK-6805
 URL: https://issues.apache.org/jira/browse/SPARK-6805
 Project: Spark
  Issue Type: Umbrella
  Components: ML, SparkR
Reporter: Xiangrui Meng
Assignee: Eric Liang
Priority: Critical

 SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
 in SparkR. The implementation should be similar to the pipeline API 
 implementation in Python.
 For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
 basic support for R formula and elastic-net regularization. The design doc 
 can be viewed at 
 https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9463) Expose model coefficients with names in SparkR RFormula

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9463:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-6805

 Expose model coefficients with names in SparkR RFormula
 ---

 Key: SPARK-9463
 URL: https://issues.apache.org/jira/browse/SPARK-9463
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SparkR
Reporter: Eric Liang
Assignee: Eric Liang

 Currently you cannot retrieve model statistics from the R side; we should at 
 least allow showing the coefficients for 1.5.
 Design doc from umbrella task: 
 https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6805) MLlib + SparkR integration for 1.5

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6805:
-
Description: 
--SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
in SparkR. The implementation should be similar to the pipeline API 
implementation in Python.--

We limited the scope of this JIRA to MLlib + SparkR integration for 1.5.

For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
basic support for R formula and elastic-net regularization. The design doc can 
be viewed at 
https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing

  was:
~~SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
in SparkR. The implementation should be similar to the pipeline API 
implementation in Python.~~

We limited the scope of this JIRA to MLlib + SparkR integration for 1.5.

For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
basic support for R formula and elastic-net regularization. The design doc can 
be viewed at 
https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing


 MLlib + SparkR integration for 1.5
 --

 Key: SPARK-6805
 URL: https://issues.apache.org/jira/browse/SPARK-6805
 Project: Spark
  Issue Type: Umbrella
  Components: ML, SparkR
Reporter: Xiangrui Meng
Assignee: Eric Liang
Priority: Critical

 --SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
 in SparkR. The implementation should be similar to the pipeline API 
 implementation in Python.--
 We limited the scope of this JIRA to MLlib + SparkR integration for 1.5.
 For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
 basic support for R formula and elastic-net regularization. The design doc 
 can be viewed at 
 https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9471) Multilayer perceptron

2015-07-30 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-9471:
---

 Summary: Multilayer perceptron 
 Key: SPARK-9471
 URL: https://issues.apache.org/jira/browse/SPARK-9471
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


Implement Multilayer Perceptron for Spark ML. Requirements:
1) ML pipelines interface (see the sketch after this list)
2) Extensible internal interface for further development of artificial neural 
networks for ML
3) Efficient and scalable: use vectors and BLAS
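A hedged sketch of how requirement 1), the ML pipelines interface, might look for 
such an estimator (the class and parameter names below are assumptions for 
illustration, not a settled API):
{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 8, 3))   // input size, one hidden layer, number of classes
  .setMaxIter(100)
// val model = mlp.fit(training)   // training: DataFrame with "features" and "label"
{code}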



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9471) Multilayer perceptron

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9471:
---

Assignee: Apache Spark

 Multilayer perceptron 
 --

 Key: SPARK-9471
 URL: https://issues.apache.org/jira/browse/SPARK-9471
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Assignee: Apache Spark
 Fix For: 1.4.0

   Original Estimate: 8,736h
  Remaining Estimate: 8,736h

 Implement Multilayer Perceptron for Spark ML. Requirements:
 1) ML pipelines interface
 2) Extensible internal interface for further development of artificial neural 
 networks for ML
 3) Efficient and scalable: use vectors and BLAS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9471) Multilayer perceptron

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647438#comment-14647438
 ] 

Apache Spark commented on SPARK-9471:
-

User 'avulanov' has created a pull request for this issue:
https://github.com/apache/spark/pull/7621

 Multilayer perceptron 
 --

 Key: SPARK-9471
 URL: https://issues.apache.org/jira/browse/SPARK-9471
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0

   Original Estimate: 8,736h
  Remaining Estimate: 8,736h

 Implement Multilayer Perceptron for Spark ML. Requirements:
 1) ML pipelines interface
 2) Extensible internal interface for further development of artificial neural 
 networks for ML
 3) Efficient and scalable: use vectors and BLAS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9471) Multilayer perceptron

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9471:
---

Assignee: (was: Apache Spark)

 Multilayer perceptron 
 --

 Key: SPARK-9471
 URL: https://issues.apache.org/jira/browse/SPARK-9471
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0

   Original Estimate: 8,736h
  Remaining Estimate: 8,736h

 Implement Multilayer Perceptron for Spark ML. Requirements:
 1) ML pipelines interface
 2) Extensible internal interface for further development of artificial neural 
 networks for ML
 3) Efficient and scalable: use vectors and BLAS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8622) Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath

2015-07-30 Thread Baswaraj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647440#comment-14647440
 ] 

Baswaraj commented on SPARK-8622:
-

Any update on this ?

 Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor 
 classpath
 --

 Key: SPARK-8622
 URL: https://issues.apache.org/jira/browse/SPARK-8622
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.3.1, 1.4.0
Reporter: Baswaraj

 I ran into an issue where the executor is not able to pick up my configs/functions 
 from my custom jar in standalone (client/cluster) deploy mode. I have used the 
 spark-submit --jars option to specify all my jars and configs to be used by 
 executors.
 All these files are placed in the working directory of the executor, but not on 
 the executor classpath.  Also, the executor working directory is not on the 
 executor classpath.
 I am expecting the executor to find all files specified in the spark-submit --jars 
 option.
 In Spark 1.3.0 the executor working directory is on the executor classpath, so 
 the app runs successfully.
 To successfully run my application with Spark 1.3.1+, I have to use the 
 following option (conf/spark-defaults.conf):
 spark.executor.extraClassPath   .
 Please advise.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2015-07-30 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647441#comment-14647441
 ] 

Imran Rashid commented on SPARK-3644:
-

[~zxzxy1988]  The test is here 
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala

It references test files here: 
https://github.com/apache/spark/tree/master/core/src/test/resources/HistoryServerExpectations

 REST API for Spark application info (jobs / stages / tasks / storage info)
 --

 Key: SPARK-3644
 URL: https://issues.apache.org/jira/browse/SPARK-3644
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Web UI
Reporter: Josh Rosen
Assignee: Imran Rashid
 Fix For: 1.4.0


 This JIRA is a forum to draft a design proposal for a REST interface for 
 accessing information about Spark applications, such as job / stage / task / 
 storage status.
 There have been a number of proposals to serve JSON representations of the 
 information displayed in Spark's web UI.  Given that we might redesign the 
 pages of the web UI (and possibly re-implement the UI as a client of a REST 
 API), the API endpoints and their responses should be independent of what we 
 choose to display on particular web UI pages / layouts.
 Let's start a discussion of what a good REST API would look like from 
 first-principles.  We can discuss what urls / endpoints expose access to 
 data, how our JSON responses will be formatted, how fields will be named, how 
 the API will be documented and tested, etc.
 Some links for inspiration:
 https://developer.github.com/v3/
 http://developer.netflix.com/docs/REST_API_Reference
 https://helloreverb.com/developers/swagger



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9470) Java API function interface cleanup

2015-07-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647443#comment-14647443
 ] 

Sean Owen commented on SPARK-9470:
--

I feel pretty strongly this is not worth it. I'd like to close shortly unless 
there is a strenuous objection.

 Java API function interface cleanup
 ---

 Key: SPARK-9470
 URL: https://issues.apache.org/jira/browse/SPARK-9470
 Project: Spark
  Issue Type: Improvement
Reporter: Rahul Kavale
Priority: Trivial

 Hi guys,
 I was exploring Spark codebase, and came across the Java API function 
 interfaces. The interfaces have the 'call' method as 'public' which is 
 redundant.
 https://github.com/apache/spark/pull/7790



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4492) Exception when following SimpleApp tutorial java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil

2015-07-30 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647446#comment-14647446
 ] 

sam commented on SPARK-4492:


I imagine building a fat jar for running with `java -cp` is possible, but I 
have never managed to get it to work. It would be great if upon each release of 
Spark, an example build file could be provided.

 Exception when following SimpleApp tutorial java.lang.ClassNotFoundException: 
 org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
 --

 Key: SPARK-4492
 URL: https://issues.apache.org/jira/browse/SPARK-4492
 Project: Spark
  Issue Type: Bug
Reporter: sam

 When I follow the example here 
 https://spark.apache.org/docs/1.0.2/quick-start.html and run with java -cp 
 my.jar my.main.Class with master set to yarn-client I get the below 
 exception.
  Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
   at com.barclays.SimpleApp$.main(SimpleApp.scala:11)
   at com.barclays.SimpleApp.main(SimpleApp.scala)
 Caused by: org.apache.spark.SparkException: Unable to load YARN support
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:106)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:101)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   ... 3 more
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
   at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:169)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:102)
   ... 5 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9470) Java API function interface cleanup

2015-07-30 Thread Rahul Kavale (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647457#comment-14647457
 ] 

Rahul Kavale edited comment on SPARK-9470 at 7/30/15 10:49 AM:
---

Hi Sean, I just wanted to do this small cleanup which felt like obviously 
redundant code to me.


was (Author: rahulkavale):
Hi Sean, I just wanted to do this small cleanup which felt like obviously 
redundant code for me.

 Java API function interface cleanup
 ---

 Key: SPARK-9470
 URL: https://issues.apache.org/jira/browse/SPARK-9470
 Project: Spark
  Issue Type: Improvement
Reporter: Rahul Kavale
Priority: Trivial

 Hi guys,
 I was exploring Spark codebase, and came across the Java API function 
 interfaces. The interfaces have the 'call' method as 'public' which is 
 redundant.
 https://github.com/apache/spark/pull/7790



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9470) Java API function interface cleanup

2015-07-30 Thread Rahul Kavale (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647457#comment-14647457
 ] 

Rahul Kavale commented on SPARK-9470:
-

Hi Sean, I just wanted to do this small cleanup which felt like obviously 
redundant code for me.

 Java API function interface cleanup
 ---

 Key: SPARK-9470
 URL: https://issues.apache.org/jira/browse/SPARK-9470
 Project: Spark
  Issue Type: Improvement
Reporter: Rahul Kavale
Priority: Trivial

 Hi guys,
 I was exploring Spark codebase, and came across the Java API function 
 interfaces. The interfaces have the 'call' method as 'public' which is 
 redundant.
 https://github.com/apache/spark/pull/7790



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9470) Java API function interface cleanup

2015-07-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647473#comment-14647473
 ] 

Sean Owen commented on SPARK-9470:
--

[~rahulkavale] see my comment on the PR. I also follow this convention, but, 
many people don't because it's not obviously redundant -- some would argue that 
interface methods should _always_ be marked {{public}} because they are always 
implicitly {{public}} and removing the access modifier makes it look to those 
who don't know the difference in behavior that these are package-private.

 Java API function interface cleanup
 ---

 Key: SPARK-9470
 URL: https://issues.apache.org/jira/browse/SPARK-9470
 Project: Spark
  Issue Type: Improvement
Reporter: Rahul Kavale
Priority: Trivial

 Hi guys,
 I was exploring Spark codebase, and came across the Java API function 
 interfaces. The interfaces have the 'call' method as 'public' which is 
 redundant.
 https://github.com/apache/spark/pull/7790



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9359) Support IntervalType for Parquet

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9359:
---

Assignee: Apache Spark

 Support IntervalType for Parquet
 

 Key: SPARK-9359
 URL: https://issues.apache.org/jira/browse/SPARK-9359
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Apache Spark

 SPARK-8753 introduced {{IntervalType}} which corresponds to Parquet 
 {{INTERVAL}} logical type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9359) Support IntervalType for Parquet

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647479#comment-14647479
 ] 

Apache Spark commented on SPARK-9359:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/7793

 Support IntervalType for Parquet
 

 Key: SPARK-9359
 URL: https://issues.apache.org/jira/browse/SPARK-9359
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian

 SPARK-8753 introduced {{IntervalType}} which corresponds to Parquet 
 {{INTERVAL}} logical type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9359) Support IntervalType for Parquet

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9359:
---

Assignee: (was: Apache Spark)

 Support IntervalType for Parquet
 

 Key: SPARK-9359
 URL: https://issues.apache.org/jira/browse/SPARK-9359
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian

 SPARK-8753 introduced {{IntervalType}} which corresponds to Parquet 
 {{INTERVAL}} logical type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2015-07-30 Thread Thomas Demoor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647406#comment-14647406
 ] 

Thomas Demoor commented on SPARK-7481:
--

Pulled the aws-upgrade out of HADOOP-11684 to a separate issue HADOOP-12269. 
Only uses aws-sdk-s3-1.10.6 instead of the entire sdk.

 Add Hadoop 2.6+ profile to pull in object store FS accessors
 

 Key: SPARK-7481
 URL: https://issues.apache.org/jira/browse/SPARK-7481
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.1
Reporter: Steve Loughran

 To keep the s3n classpath right, and to add s3a, swift & azure, the dependencies 
 of Spark in a 2.6+ profile need to add the relevant object store packages 
 (hadoop-aws, hadoop-openstack, hadoop-azure).
 This adds more stuff to the client bundle, but will mean a single Spark 
 package can talk to all of the stores.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8838) Add config to enable/disable merging part-files when merging parquet schema

2015-07-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-8838.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7238
[https://github.com/apache/spark/pull/7238]

 Add config to enable/disable merging part-files when merging parquet schema
 ---

 Key: SPARK-8838
 URL: https://issues.apache.org/jira/browse/SPARK-8838
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
 Fix For: 1.5.0


 Currently, all part-files are merged when merging the Parquet schema. However, 
 when there are many part-files and we can be sure that they all have the same 
 schema as their summary file, merging every part-file is unnecessary. This 
 change provides a configuration option to disable merging part-files when 
 merging the Parquet schema.
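
A rough PySpark sketch of how such a switch would be used. The configuration key below is an assumption made only for illustration; see pull request 7238 for the actual name and default.

{code}
# Sketch, not the actual patch: trust the summary file and skip per-part-file
# footers when merging the Parquet schema. The key
# "spark.sql.parquet.respectSummaryFiles" is an assumption for illustration;
# check PR 7238 for the real configuration name.
sqlContext.setConf("spark.sql.parquet.respectSummaryFiles", "true")
df = sqlContext.read.parquet("/data/partitioned_table")  # path is illustrative
{code}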



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-07-30 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647432#comment-14647432
 ] 

Robin East commented on SPARK-5692:
---

Hi, the description includes the sentence 'We may want to discuss whether we 
want to be compatible with the original Word2Vec model storage format.' Was 
this ever discussed? I can't see anything in the comment stream for this JIRA. 
Is there any interest in adding functionality to import Word2Vec models from 
the original binary format (e.g. the 300-million-word Google News model)?
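
For reference, a hedged sketch of what reading the original C word2vec binary format involves: a header line with the vocabulary and vector sizes, then each word terminated by a space and followed by packed little-endian float32 values. This is not existing Spark functionality, just an outline of the format the comment refers to.

{code}
# Sketch of a reader for the original word2vec binary format (not Spark code).
import numpy as np

def load_word2vec_binary(path):
    vectors = {}
    with open(path, "rb") as f:
        vocab_size, vector_size = map(int, f.readline().split())
        for _ in range(vocab_size):
            # Read the word: bytes up to a space, skipping any newline left
            # over from the previous entry.
            word = b""
            while True:
                ch = f.read(1)
                if not ch:
                    raise EOFError("unexpected end of file")
                if ch == b" ":
                    break
                if ch != b"\n":
                    word += ch
            vec = np.frombuffer(f.read(4 * vector_size), dtype=np.float32)
            vectors[word.decode("utf-8", errors="ignore")] = vec
    return vectors
{code}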

 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
 Fix For: 1.4.0


 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length

2015-07-30 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647380#comment-14647380
 ] 

Manoj Kumar commented on SPARK-9277:


I will not have access to a development environment till Saturday. Feel free to 
fix it. Thanks.

 SparseVector constructor must throw an error when declared number of elements 
 less than array length
 

 Key: SPARK-9277
 URL: https://issues.apache.org/jira/browse/SPARK-9277
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Andrey Vykhodtsev
Priority: Minor
  Labels: starter
 Attachments: SparseVector test.html, SparseVector test.ipynb


 I found that one can create a SparseVector inconsistently, and it will lead to 
 a Java error at runtime, for example when training LogisticRegressionWithSGD.
 Here is the test case:
 In [2]:
 sc.version
 Out[2]:
 u'1.3.1'
 In [13]:
 from pyspark.mllib.linalg import SparseVector
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import LogisticRegressionWithSGD
 In [3]:
 x =  SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
 In [10]:
 l = LabeledPoint(0, x)
 In [12]:
 r = sc.parallelize([l])
 In [14]:
 m = LogisticRegressionWithSGD.train(r)
 Error:
 Py4JJavaError: An error occurred while calling 
 o86.trainLogisticRegressionModelWithSGD.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 
 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
 Attached is the notebook with the scenario and the full message
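
A hedged sketch of the kind of check the constructor could perform so that an out-of-range index fails fast instead of surfacing later as an ArrayIndexOutOfBoundsException. The names here are illustrative only; this is not the actual Spark patch.

{code}
# Illustrative validation only, not the real fix.
def validate_sparse(size, index_value_pairs):
    indices = sorted(index_value_pairs)
    if len(indices) > size:
        raise ValueError(
            "declared size %d is smaller than the number of entries %d"
            % (size, len(indices)))
    if indices and (indices[0] < 0 or indices[-1] >= size):
        raise ValueError(
            "index out of range for declared size %d: %s" % (size, indices))
    return indices

# With the example from the description, this would raise immediately:
# validate_sparse(2, {1: 1, 2: 2, 3: 3, 4: 4, 5: 5})
{code}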



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9470) Java API function interface cleanup

2015-07-30 Thread Rahul Kavale (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647487#comment-14647487
 ] 

Rahul Kavale commented on SPARK-9470:
-

[~srowen] I see your point; I just felt that if we all know methods in an interface 
are implicitly public, there is little value in marking them 'public' 
explicitly. Anyway, thanks for your comment. Closing the issue.

 Java API function interface cleanup
 ---

 Key: SPARK-9470
 URL: https://issues.apache.org/jira/browse/SPARK-9470
 Project: Spark
  Issue Type: Improvement
Reporter: Rahul Kavale
Priority: Trivial

 Hi guys,
 I was exploring the Spark codebase and came across the Java API function 
 interfaces. The interfaces declare the 'call' method as 'public', which is 
 redundant.
 https://github.com/apache/spark/pull/7790



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9470) Java API function interface cleanup

2015-07-30 Thread Rahul Kavale (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rahul Kavale closed SPARK-9470.
---
Resolution: Won't Fix

 Java API function interface cleanup
 ---

 Key: SPARK-9470
 URL: https://issues.apache.org/jira/browse/SPARK-9470
 Project: Spark
  Issue Type: Improvement
Reporter: Rahul Kavale
Priority: Trivial

 Hi guys,
 I was exploring the Spark codebase and came across the Java API function 
 interfaces. The interfaces declare the 'call' method as 'public', which is 
 redundant.
 https://github.com/apache/spark/pull/7790



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9277:
---

Assignee: Apache Spark

 SparseVector constructor must throw an error when declared number of elements 
 less than array length
 

 Key: SPARK-9277
 URL: https://issues.apache.org/jira/browse/SPARK-9277
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Andrey Vykhodtsev
Assignee: Apache Spark
Priority: Minor
  Labels: starter
 Attachments: SparseVector test.html, SparseVector test.ipynb


 I found that one can create a SparseVector inconsistently, and it will lead to 
 a Java error at runtime, for example when training LogisticRegressionWithSGD.
 Here is the test case:
 In [2]:
 sc.version
 Out[2]:
 u'1.3.1'
 In [13]:
 from pyspark.mllib.linalg import SparseVector
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import LogisticRegressionWithSGD
 In [3]:
 x =  SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
 In [10]:
 l = LabeledPoint(0, x)
 In [12]:
 r = sc.parallelize([l])
 In [14]:
 m = LogisticRegressionWithSGD.train(r)
 Error:
 Py4JJavaError: An error occurred while calling 
 o86.trainLogisticRegressionModelWithSGD.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 
 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
 Attached is the notebook with the scenario and the full message



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647490#comment-14647490
 ] 

Apache Spark commented on SPARK-9277:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7794

 SparseVector constructor must throw an error when declared number of elements 
 less than array length
 

 Key: SPARK-9277
 URL: https://issues.apache.org/jira/browse/SPARK-9277
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Andrey Vykhodtsev
Priority: Minor
  Labels: starter
 Attachments: SparseVector test.html, SparseVector test.ipynb


 I found that one can create a SparseVector inconsistently, and it will lead to 
 a Java error at runtime, for example when training LogisticRegressionWithSGD.
 Here is the test case:
 In [2]:
 sc.version
 Out[2]:
 u'1.3.1'
 In [13]:
 from pyspark.mllib.linalg import SparseVector
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import LogisticRegressionWithSGD
 In [3]:
 x =  SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
 In [10]:
 l = LabeledPoint(0, x)
 In [12]:
 r = sc.parallelize([l])
 In [14]:
 m = LogisticRegressionWithSGD.train(r)
 Error:
 Py4JJavaError: An error occurred while calling 
 o86.trainLogisticRegressionModelWithSGD.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 
 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
 Attached is the notebook with the scenario and the full message



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


