[jira] [Commented] (SPARK-16685) audit release docs are ambiguous

2016-07-24 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391168#comment-15391168
 ] 

Patrick Wendell commented on SPARK-16685:
-

These scripts are pretty old and I'm not sure if anyone still uses them. I had 
written them a while back as sanity tests for some release builds. Today, those 
things are tested broadly by the community so I think this has become 
redundant. [~rxin] are these still used? If not, it might be good to remove 
them from the source repo.

> audit release docs are ambiguous
> 
>
> Key: SPARK-16685
> URL: https://issues.apache.org/jira/browse/SPARK-16685
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.6.2
>Reporter: jay vyas
>Priority: Minor
>
> The dev/audit-release tooling is ambiguous.
> - Should it run against a real cluster? If so, when?
> - What should be in the release repo? Just jars? Tarballs? (I assume jars 
> because it's a .ivy, but not sure.)
> - https://github.com/apache/spark/tree/master/dev/audit-release



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-13855.
-
   Resolution: Fixed
Fix Version/s: 1.6.1

> Spark 1.6.1 artifacts not found in S3 bucket / direct download
> --
>
> Key: SPARK-13855
> URL: https://issues.apache.org/jira/browse/SPARK-13855
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: production
>Reporter: Sandesh Deshmane
>Assignee: Patrick Wendell
> Fix For: 1.6.1
>
>
> Getting below error while deploying spark on EC2 with version 1.6.1
> [timing] scala init:  00h 00m 12s
> Initializing spark
> --2016-03-14 07:05:30--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... 
> connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-03-14 07:05:30 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> Checked s3 bucket spark-related-packages and noticed that no spark 1.6.1 
> present
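As a quick sanity check for this class of failure, the package URL can be probed before launching spark-ec2. This is only an illustrative sketch (it is not part of the spark-ec2 tooling); the bucket, version, and package name are taken from the log above:

{code}
# Probe an S3-hosted Spark package before launching a cluster (Python 3).
import urllib.error
import urllib.request

url = ("http://s3.amazonaws.com/spark-related-packages/"
       "spark-1.6.1-bin-hadoop2.4.tgz")

try:
    with urllib.request.urlopen(urllib.request.Request(url, method="HEAD")) as resp:
        print(url, "->", resp.status)   # 200 means the tarball is available
except urllib.error.HTTPError as err:
    print(url, "->", err.code)          # 404 reproduces the failure reported here
{code}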






[jira] [Commented] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-16 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196901#comment-15196901
 ] 

Patrick Wendell commented on SPARK-13855:
-

I've uploaded the artifacts, thanks.

> Spark 1.6.1 artifacts not found in S3 bucket / direct download
> --
>
> Key: SPARK-13855
> URL: https://issues.apache.org/jira/browse/SPARK-13855
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: production
>Reporter: Sandesh Deshmane
>Assignee: Patrick Wendell
> Fix For: 1.6.1
>
>
> Getting below error while deploying spark on EC2 with version 1.6.1
> [timing] scala init:  00h 00m 12s
> Initializing spark
> --2016-03-14 07:05:30--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... 
> connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-03-14 07:05:30 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> Checked s3 bucket spark-related-packages and noticed that no spark 1.6.1 
> present






[jira] [Assigned] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reassigned SPARK-13855:
---

Assignee: Patrick Wendell  (was: Michael Armbrust)

> Spark 1.6.1 artifacts not found in S3 bucket / direct download
> --
>
> Key: SPARK-13855
> URL: https://issues.apache.org/jira/browse/SPARK-13855
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: production
>Reporter: Sandesh Deshmane
>Assignee: Patrick Wendell
>
> Getting below error while deploying spark on EC2 with version 1.6.1
> [timing] scala init:  00h 00m 12s
> Initializing spark
> --2016-03-14 07:05:30--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... 
> connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-03-14 07:05:30 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> Checked s3 bucket spark-related-packages and noticed that no spark 1.6.1 
> present






[jira] [Updated] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2015-12-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-12148:

Priority: Major  (was: Critical)

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SparkR
>Reporter: Michael Lawrence
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?






[jira] [Updated] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2015-12-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-12148:

Priority: Critical  (was: Minor)

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Wish
>  Components: R, SparkR
>Reporter: Michael Lawrence
>Priority: Critical
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?






[jira] [Updated] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2015-12-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-12148:

Issue Type: Improvement  (was: Wish)

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SparkR
>Reporter: Michael Lawrence
>Priority: Critical
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?






[jira] [Updated] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-12110:

Description: 
I am using spark-1.5.1-bin-hadoop2.6. I used 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
spark-env to use python3. I can not run the tokenizer sample code. Is there a 
work around?

Kind regards

Andy

{code}
/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
658 raise Exception("You must build Spark with Hive. "
659 "Export 'SPARK_HIVE=true' and run "
--> 660 "build/sbt assembly", e)
661 
662 def _get_hive_ctx(self):

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
build/sbt assembly", Py4JJavaError('An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))




http://spark.apache.org/docs/latest/ml-features.html#tokenizer

from pyspark.ml.feature import Tokenizer, RegexTokenizer

sentenceDataFrame = sqlContext.createDataFrame([
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsDataFrame = tokenizer.transform(sentenceDataFrame)
for words_label in wordsDataFrame.select("words", "label").take(3):
  print(words_label)

---
Py4JJavaError Traceback (most recent call last)
/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
654 if not hasattr(self, '_scala_HiveContext'):
--> 655 self._scala_HiveContext = self._get_hive_ctx()
656 return self._scala_HiveContext

/root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
662 def _get_hive_ctx(self):
--> 663 return self._jvm.HiveContext(self._jsc.sc())
664 

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
700 return_value = get_return_value(answer, self._gateway_client, 
None,
--> 701 self._fqn)
702 

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 35 try:
---> 36 return f(*a, **kw)
 37 except py4j.protocol.Py4JJavaError as e:

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:

Py4JJavaError: An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.io.IOException: Filesystem closed
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at 
org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
at 
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
at 
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:323)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1057)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:554)
at 
org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)
at 
org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 15 more


During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)
 in 

[jira] [Updated] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-12110:

Component/s: (was: ML)
 (was: SQL)
 (was: PySpark)
 EC2

> spark-1.5.1-bin-hadoop2.6;  pyspark.ml.feature  Exception: ("You must build 
> Spark with Hive 
> 
>
> Key: SPARK-12110
> URL: https://issues.apache.org/jira/browse/SPARK-12110
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.1
> Environment: cluster created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>Reporter: Andrew Davidson
>
> I am using spark-1.5.1-bin-hadoop2.6. I used 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
> spark-env to use python3. I can not run the tokenizer sample code. Is there a 
> work around?
> Kind regards
> Andy
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 658 raise Exception("You must build Spark with Hive. "
> 659 "Export 'SPARK_HIVE=true' and run "
> --> 660 "build/sbt assembly", e)
> 661 
> 662 def _get_hive_ctx(self):
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError('An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))
> http://spark.apache.org/docs/latest/ml-features.html#tokenizer
> from pyspark.ml.feature import Tokenizer, RegexTokenizer
> sentenceDataFrame = sqlContext.createDataFrame([
>   (0, "Hi I heard about Spark"),
>   (1, "I wish Java could use case classes"),
>   (2, "Logistic,regression,models,are,neat")
> ], ["label", "sentence"])
> tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
> wordsDataFrame = tokenizer.transform(sentenceDataFrame)
> for words_label in wordsDataFrame.select("words", "label").take(3):
>   print(words_label)
> ---
> Py4JJavaError Traceback (most recent call last)
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 654 if not hasattr(self, '_scala_HiveContext'):
> --> 655 self._scala_HiveContext = self._get_hive_ctx()
> 656 return self._scala_HiveContext
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 662 def _get_hive_ctx(self):
> --> 663 return self._jvm.HiveContext(self._jsc.sc())
> 664 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 700 return_value = get_return_value(answer, self._gateway_client, 
> None,
> --> 701 self._fqn)
> 702 
> /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
>   at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:214)
>   at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
>   at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: 

[jira] [Commented] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036960#comment-15036960
 ] 

Patrick Wendell commented on SPARK-12110:
-

Hey Andrew, could you show exactly the command you are running to run this 
example? Also, if you simply download Spark 1.5.1 and run the same command 
locally rather than in your modified EC2 cluster, does it work?
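One way to narrow this down from the PySpark shell is to check whether the deployed assembly has Hive support at all before running the tokenizer example. The following is a minimal diagnostic sketch against the 1.5-era API (sc is the SparkContext the shell provides; the SQLContext fallback is only for comparison, since the tokenizer example does not actually need Hive):

{code}
# Diagnose whether the running assembly was built with Hive (PySpark 1.5.x API).
from pyspark.sql import SQLContext, HiveContext

try:
    sqlContext = HiveContext(sc)       # construction is lazy in 1.5.x
    sqlContext.tables().collect()      # forces the JVM HiveContext to initialize
    print("HiveContext works")
except Exception as e:
    print("HiveContext unavailable, falling back to SQLContext:", e)
    sqlContext = SQLContext(sc)        # Tokenizer/createDataFrame still work here
{code}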

> spark-1.5.1-bin-hadoop2.6;  pyspark.ml.feature  Exception: ("You must build 
> Spark with Hive 
> 
>
> Key: SPARK-12110
> URL: https://issues.apache.org/jira/browse/SPARK-12110
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.1
> Environment: cluster created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>Reporter: Andrew Davidson
>
> I am using spark-1.5.1-bin-hadoop2.6. I used 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
> spark-env to use python3. I can not run the tokenizer sample code. Is there a 
> work around?
> Kind regards
> Andy
> {code}
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 658 raise Exception("You must build Spark with Hive. "
> 659 "Export 'SPARK_HIVE=true' and run "
> --> 660 "build/sbt assembly", e)
> 661 
> 662 def _get_hive_ctx(self):
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError('An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))
> http://spark.apache.org/docs/latest/ml-features.html#tokenizer
> from pyspark.ml.feature import Tokenizer, RegexTokenizer
> sentenceDataFrame = sqlContext.createDataFrame([
>   (0, "Hi I heard about Spark"),
>   (1, "I wish Java could use case classes"),
>   (2, "Logistic,regression,models,are,neat")
> ], ["label", "sentence"])
> tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
> wordsDataFrame = tokenizer.transform(sentenceDataFrame)
> for words_label in wordsDataFrame.select("words", "label").take(3):
>   print(words_label)
> ---
> Py4JJavaError Traceback (most recent call last)
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 654 if not hasattr(self, '_scala_HiveContext'):
> --> 655 self._scala_HiveContext = self._get_hive_ctx()
> 656 return self._scala_HiveContext
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 662 def _get_hive_ctx(self):
> --> 663 return self._jvm.HiveContext(self._jsc.sc())
> 664 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 700 return_value = get_return_value(answer, self._gateway_client, 
> None,
> --> 701 self._fqn)
> 702 
> /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
>   at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:214)
>   at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
>   at 

[jira] [Commented] (SPARK-11903) Deprecate make-distribution.sh --skip-java-test

2015-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021511#comment-15021511
 ] 

Patrick Wendell commented on SPARK-11903:
-

I think it's simply dead code. SKIP_JAVA_TEST related to a check we did 
regarding whether Java 6 was being used instead of Java 7. It doesn't have 
anything to do with unit tests. Spark now requires Java 7, so the test has been 
removed, but the parser still handles that variable. It was just an omission 
not deleted as part of SPARK-7733 
(https://github.com/apache/spark/commit/e84815dc333a69368a48e0152f02934980768a14)
 /cc [~srowen].

> Deprecate make-distribution.sh --skip-java-test
> ---
>
> Key: SPARK-11903
> URL: https://issues.apache.org/jira/browse/SPARK-11903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not 
> appear to be 
> used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73],
>  and tests are [always 
> skipped|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L170].
>  Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
> than [this 
> one|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
> If this option is not needed, we should deprecate and eventually remove it.






[jira] [Comment Edited] (SPARK-11903) Deprecate make-distribution.sh --skip-java-test

2015-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021511#comment-15021511
 ] 

Patrick Wendell edited comment on SPARK-11903 at 11/23/15 4:29 AM:
---

I think it's simply dead code that should be deleted. SKIP_JAVA_TEST related to 
a check we did regarding whether Java 6 was being used instead of Java 7. It 
doesn't have anything to do with unit tests. Spark now requires Java 7, so the 
test has been removed, but the parser still handles that variable. It was just 
an omission not deleted as part of SPARK-7733 
(https://github.com/apache/spark/commit/e84815dc333a69368a48e0152f02934980768a14)
 /cc [~srowen].


was (Author: pwendell):
I think it's simply dead code. SKIP_JAVA_TEST related to a check we did 
regarding whether Java 6 was being used instead of Java 7. It doesn't have 
anything to do with unit tests. Spark now requires Java 7, so the test has been 
removed, but the parser still handles that variable. It was just an omission 
not deleted as part of SPARK-7733 
(https://github.com/apache/spark/commit/e84815dc333a69368a48e0152f02934980768a14)
 /cc [~srowen].

> Deprecate make-distribution.sh --skip-java-test
> ---
>
> Key: SPARK-11903
> URL: https://issues.apache.org/jira/browse/SPARK-11903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not 
> appear to be 
> used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73],
>  and tests are [always 
> skipped|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L170].
>  Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
> than [this 
> one|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
> If this option is not needed, we should deprecate and eventually remove it.






[jira] [Commented] (SPARK-11326) Support for authentication and encryption in standalone mode

2015-11-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997448#comment-14997448
 ] 

Patrick Wendell commented on SPARK-11326:
-

There are a few related conversations here:

1. The feature set and goals of the standalone scheduler. The main goal of that 
scheduler is to make it easy for people to download and run Spark with minimal 
extra dependencies. The main difference between the standalone mode and other 
schedulers is that we aren't providing support for scheduling frameworks other 
than Spark (and likely never will). Beyond that, features are added on a 
case-by-case basis depending on whether there is sufficient commitment from the 
maintainers to support the feature long term.

2. Security in non-YARN modes. I would actually like to see better support for 
security in other modes of Spark, the main reason being to support the large 
number of users who are not inside Hadoop deployments. BTW, I think the existing 
security architecture of Spark makes this possible, because the concern of 
distributing a shared secret is largely decoupled from the specific security 
mechanism. But we haven't really exposed public hooks for injecting secrets. 
There is also the question of secure job submission, which is addressed in this 
JIRA. This needs some thought and probably makes sense to discuss in the Spark 
1.7 timeframe.

Overall I think some broader questions need to be answered, and it's something 
perhaps we can discuss once 1.6 is out the door as we think about 1.7.
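For context, the existing mechanism referred to above is the single shared secret configured through spark.authenticate and spark.authenticate.secret; today the same value has to be set on the master, the workers, and every application. A minimal sketch of that current setup from an application's side (the master URL and secret value are placeholders):

{code}
# Current standalone-mode security: one shared secret for all components.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://master-host:7077")                # placeholder master URL
        .setAppName("auth-example")
        .set("spark.authenticate", "true")
        .set("spark.authenticate.secret", "shared-secret"))   # must match master and workers

sc = SparkContext(conf=conf)
{code}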

> Support for authentication and encryption in standalone mode
> 
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components, for all network connections 
> need to use the same secure token if they want to have any security ensured. 
> This ticket is intended to split the communication in standalone mode to make 
> it more like in Yarn mode - application internal communication and scheduler 
> communication.
> Such refactoring will allow for the scheduler (master, workers) to use a 
> distinct secret, which will remain unknown for the users. Similarly, it will 
> allow for better security in applications, because each application will be 
> able to use a distinct secret as well. 
> By providing SASL authentication/encryption for connections between a client 
> (Client or AppClient) and Spark Master, it becomes possible to introduce 
> pluggable authentication for the standalone deployment mode.
> h3.Improvements introduced by this patch
> This patch introduces the following changes:
> * Spark driver or submission client do not have to use the same secret as 
> workers use to communicate with Master
> * Master is able to authenticate individual clients with the following rules:
> ** When connecting to the master, the client needs to specify 
> {{spark.authenticate.secret}} which is an authentication token for the user 
> specified by {{spark.authenticate.user}} ({{sparkSaslUser}} by default)
> ** Master configuration may include additional 
> {{spark.authenticate.secrets.}} entries for specifying 
> authentication token for particular users or 
> {{spark.authenticate.authenticatorClass}} which specify an implementation of 
> external credentials provider (which is able to retrieve the authentication 
> token for a given user).
> ** Workers authenticate with Master as default user {{sparkSaslUser}}. 
> * The authorization rules are as follows:
> ** A regular user is able to manage only his own application (the application 
> which he submitted)
> ** A regular user is not able to register or manage workers
> ** Spark default user {{sparkSaslUser}} can manage all the applications
> h3.User facing changes when running application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations env variable will overwrite conf variable if present. 
> - In all situations when a user has to pass a secret, it is better (safer) to 
> do this through env variable
> - In work modes with multiple secrets we assume encrypted communication 
> between client and master, between driver and master, between master and 
> workers
> 
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is running locally
> - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> 

[jira] [Resolved] (SPARK-11236) Upgrade Tachyon dependency to 0.8.0

2015-11-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-11236.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Upgrade Tachyon dependency to 0.8.0
> ---
>
> Key: SPARK-11236
> URL: https://issues.apache.org/jira/browse/SPARK-11236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Calvin Jia
> Fix For: 1.6.0
>
>
> Update the tachyon-client dependency from 0.7.1 to 0.8.0. There are no new 
> dependencies added or Spark facing APIs changed.






[jira] [Updated] (SPARK-11236) Upgrade Tachyon dependency to 0.8.0

2015-11-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11236:

Assignee: Calvin Jia

> Upgrade Tachyon dependency to 0.8.0
> ---
>
> Key: SPARK-11236
> URL: https://issues.apache.org/jira/browse/SPARK-11236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Calvin Jia
>Assignee: Calvin Jia
> Fix For: 1.6.0
>
>
> Update the tachyon-client dependency from 0.7.1 to 0.8.0. There are no new 
> dependencies added or Spark facing APIs changed.






[jira] [Updated] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11446:

Target Version/s: 1.6.0

> Spark 1.6 release notes
> ---
>
> Key: SPARK-11446
> URL: https://issues.apache.org/jira/browse/SPARK-11446
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> This is a staging location where we can keep track of changes that need to be 
> documented in the release notes.






[jira] [Created] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-11446:
---

 Summary: Spark 1.6 release notes
 Key: SPARK-11446
 URL: https://issues.apache.org/jira/browse/SPARK-11446
 Project: Spark
  Issue Type: Task
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Michael Armbrust
Priority: Critical


This is a staging location where we can keep track of changes that need to be 
documented in the release notes.






[jira] [Commented] (SPARK-11238) SparkR: Documentation change for merge function

2015-11-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984646#comment-14984646
 ] 

Patrick Wendell commented on SPARK-11238:
-

I created SPARK-11446 and linked it here.

> SparkR: Documentation change for merge function
> ---
>
> Key: SPARK-11238
> URL: https://issues.apache.org/jira/browse/SPARK-11238
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>  Labels: releasenotes
>
> As discussed in pull request: https://github.com/apache/spark/pull/9012, the 
> signature of the merge function will be changed, therefore documentation 
> change is required.






[jira] [Commented] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984776#comment-14984776
 ] 

Patrick Wendell commented on SPARK-11446:
-

I think this is redundant with the "releasenotes" tag so I am closing it.

> Spark 1.6 release notes
> ---
>
> Key: SPARK-11446
> URL: https://issues.apache.org/jira/browse/SPARK-11446
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> This is a staging location where we can keep track of changes that need to be 
> documented in the release notes.






[jira] [Closed] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell closed SPARK-11446.
---
Resolution: Invalid

> Spark 1.6 release notes
> ---
>
> Key: SPARK-11446
> URL: https://issues.apache.org/jira/browse/SPARK-11446
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> This is a staging location where we can keep track of changes that need to be 
> documented in the release notes.






[jira] [Commented] (SPARK-11305) Remove Third-Party Hadoop Distributions Doc Page

2015-10-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973493#comment-14973493
 ] 

Patrick Wendell commented on SPARK-11305:
-

/cc [~srowen] for his thoughts.

> Remove Third-Party Hadoop Distributions Doc Page
> 
>
> Key: SPARK-11305
> URL: https://issues.apache.org/jira/browse/SPARK-11305
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Priority: Critical
>
> There is a fairly old page in our docs that contains a bunch of assorted 
> information regarding running Spark on Hadoop clusters. I think this page 
> should be removed and merged into other parts of the docs because the 
> information is largely redundant and somewhat outdated.
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
> There are four sections:
> 1. Compile time Hadoop version - this information I think can be removed in 
> favor of that on the "building spark" page. These days most "advanced users" 
> are building without bundling Hadoop, so I'm not sure giving them a bunch of 
> different Hadoop versions sends the right message.
> 2. Linking against Hadoop - this doesn't seem to add much beyond what is in 
> the programming guide.
> 3. Where to run Spark - redundant with the hardware provisioning guide.
> 4. Inheriting cluster configurations - I think this would be better as a 
> section at the end of the configuration page. 






[jira] [Created] (SPARK-11305) Remove Third-Party Hadoop Distributions Doc Page

2015-10-25 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-11305:
---

 Summary: Remove Third-Party Hadoop Distributions Doc Page
 Key: SPARK-11305
 URL: https://issues.apache.org/jira/browse/SPARK-11305
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Priority: Critical


There is a fairly old page in our docs that contains a bunch of assorted 
information regarding running Spark on Hadoop clusters. I think this page 
should be removed and merged into other parts of the docs because the 
information is largely redundant and somewhat outdated.

http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html

There are four sections:

1. Compile time Hadoop version - this information I think can be removed in 
favor of that on the "building spark" page. These days most "advanced users" 
are building without bundling Hadoop, so I'm not sure giving them a bunch of 
different Hadoop versions sends the right message.

2. Linking against Hadoop - this doesn't seem to add much beyond what is in the 
programming guide.

3. Where to run Spark - redundant with the hardware provisioning guide.

4. Inheriting cluster configurations - I think this would be better as a 
section at the end of the configuration page. 






[jira] [Commented] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973510#comment-14973510
 ] 

Patrick Wendell commented on SPARK-10971:
-

Reynold has sent out the vote email based on the original fix. Since that vote 
is likely to pass, this patch will probably be in 1.5.3.

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Sun Rui
> Fix For: 1.5.3, 1.6.0
>
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes it's in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode






[jira] [Comment Edited] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973510#comment-14973510
 ] 

Patrick Wendell edited comment on SPARK-10971 at 10/26/15 12:02 AM:


Reynold has sent out the vote email based on the tagged commit. Since that vote 
is likely to pass, this patch will probably be in 1.5.3.


was (Author: pwendell):
Reynold has sent out the vote email based on the original fix. Since that vote 
is likely to pass, this patch will probably be in 1.5.3.

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Sun Rui
> Fix For: 1.5.3, 1.6.0
>
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes it's in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode






[jira] [Updated] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10971:

Fix Version/s: (was: 1.5.2)
   1.5.3

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Sun Rui
> Fix For: 1.5.3, 1.6.0
>
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes it's in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode






[jira] [Assigned] (SPARK-11070) Remove older releases on dist.apache.org

2015-10-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reassigned SPARK-11070:
---

Assignee: Patrick Wendell

> Remove older releases on dist.apache.org
> 
>
> Key: SPARK-11070
> URL: https://issues.apache.org/jira/browse/SPARK-11070
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Sean Owen
>Assignee: Patrick Wendell
>Priority: Trivial
> Attachments: SPARK-11070.patch
>
>
> dist.apache.org should be periodically cleaned up such that it only includes 
> the latest releases in each active minor release branch. This is to reduce 
> load on mirrors. It can probably lose the 1.2.x releases at this point. In 
> total this would clean out 6 of the 9 releases currently mirrored at 
> https://dist.apache.org/repos/dist/release/spark/ 
> All releases are always archived at archive.apache.org and continue to be 
> available. The JS behind spark.apache.org/downloads.html needs to be updated 
> to point at archive.apache.org for older releases, then.
> There won't be a pull request for this as it's strictly an update to the site 
> hosted in SVN, and the files hosted by Apache.
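As a rough illustration of the downloads.html change described above: pruned releases are served from archive.apache.org while current ones stay on dist.apache.org and the mirrors. This is only a sketch of the URL selection; the set of "current" versions is hypothetical:

{code}
# Sketch: choose a download base depending on whether a release is still mirrored.
CURRENT_RELEASES = {"1.3.1", "1.4.1", "1.5.1"}   # hypothetical: latest of each active branch

def package_url(version, package="bin-hadoop2.6"):
    base = ("https://dist.apache.org/repos/dist/release/spark"   # still mirrored
            if version in CURRENT_RELEASES
            else "https://archive.apache.org/dist/spark")        # permanent archive
    return "%s/spark-%s/spark-%s-%s.tgz" % (base, version, version, package)

print(package_url("1.5.1"))   # served from the release area
print(package_url("1.2.2"))   # served from archive.apache.org
{code}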






[jira] [Commented] (SPARK-11070) Remove older releases on dist.apache.org

2015-10-16 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961515#comment-14961515
 ] 

Patrick Wendell commented on SPARK-11070:
-

I removed them - I did leave 1.5.0 for now, but we can remove it in a bit - 
just because 1.5.1 is so new.

{code}
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.1.1 -m "Remving Spark 1.1.1 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.2.1 -m "Remving Spark 1.2.1 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.2.2 -m "Remving Spark 1.2.2 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.3.0 -m "Remving Spark 1.3.0 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.4.0 -m "Remving Spark 1.4.0 release"
{code}

> Remove older releases on dist.apache.org
> 
>
> Key: SPARK-11070
> URL: https://issues.apache.org/jira/browse/SPARK-11070
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Sean Owen
>Assignee: Patrick Wendell
>Priority: Trivial
> Attachments: SPARK-11070.patch
>
>
> dist.apache.org should be periodically cleaned up such that it only includes 
> the latest releases in each active minor release branch. This is to reduce 
> load on mirrors. It can probably lose the 1.2.x releases at this point. In 
> total this would clean out 6 of the 9 releases currently mirrored at 
> https://dist.apache.org/repos/dist/release/spark/ 
> All releases are always archived at archive.apache.org and continue to be 
> available. The JS behind spark.apache.org/downloads.html needs to be updated 
> to point at archive.apache.org for older releases, then.
> There won't be a pull request for this as it's strictly an update to the site 
> hosted in SVN, and the files hosted by Apache.






[jira] [Resolved] (SPARK-11070) Remove older releases on dist.apache.org

2015-10-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-11070.
-
Resolution: Fixed

> Remove older releases on dist.apache.org
> 
>
> Key: SPARK-11070
> URL: https://issues.apache.org/jira/browse/SPARK-11070
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Sean Owen
>Assignee: Patrick Wendell
>Priority: Trivial
> Attachments: SPARK-11070.patch
>
>
> dist.apache.org should be periodically cleaned up such that it only includes 
> the latest releases in each active minor release branch. This is to reduce 
> load on mirrors. It can probably lose the 1.2.x releases at this point. In 
> total this would clean out 6 of the 9 releases currently mirrored at 
> https://dist.apache.org/repos/dist/release/spark/ 
> All releases are always archived at archive.apache.org and continue to be 
> available. The JS behind spark.apache.org/downloads.html needs to be updated 
> to point at archive.apache.org for older releases, then.
> There won't be a pull request for this as it's strictly an update to the site 
> hosted in SVN, and the files hosted by Apache.






[jira] [Updated] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment

2015-10-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10877:

Assignee: Davies Liu

> Assertions fail straightforward DataFrame job due to word alignment
> ---
>
> Key: SPARK-10877
> URL: https://issues.apache.org/jira/browse/SPARK-10877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>Assignee: Davies Liu
> Attachments: SparkFilterByKeyTest.scala
>
>
> I have some code that I’m running in a unit test suite, but the code I’m 
> running is failing with an assertion error.
> I have translated the JUnit test that was failing, to a Scala script that I 
> will attach to the ticket. The assertion error is the following:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: 
> lengthInBytes must be a multiple of 8 (word-aligned)
> at 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247)
> at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> However, it turns out that this code actually works normally and computes the 
> correct result if assertions are turned off.
> I traced the code and found that when hashUnsafeWords was called, it was 
> given a byte-length of 12, which clearly is not a multiple of 8. However, the 
> job seems to compute correctly regardless of this fact. Of course, I can’t 
> just disable assertions for my unit test though.
> A few things we need to understand:
> 1. Why is the lengthInBytes of size 12?
> 2. Is it actually a problem that the byte length is not word-aligned? If so, 
> how should we fix the byte length? If it's not a problem, why is the 
> assertion flagging a false negative?
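For reference, the assertion only checks that the buffer length is a whole number of 8-byte words, which is why a 12-byte length trips it even though the job computes the right result. A minimal sketch of the rule and of the usual round-up-to-word padding (illustrative only, not the Spark code path):

{code}
# The word-alignment rule behind the assertion: lengthInBytes % 8 == 0.
def is_word_aligned(length_in_bytes):
    return length_in_bytes % 8 == 0

def round_up_to_word(length_in_bytes):
    # pad to the next multiple of 8 bytes, as word-aligned writers typically do
    return (length_in_bytes + 7) & ~7

assert not is_word_aligned(12)      # the length reported in this ticket
assert round_up_to_word(12) == 16
{code}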






[jira] [Updated] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11110:

Assignee: Jakob Odersky

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-11110
> URL: https://issues.apache.org/jira/browse/SPARK-11110
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Jakob Odersky
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).






[jira] [Updated] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11110:

Priority: Critical  (was: Major)

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-11110
> URL: https://issues.apache.org/jira/browse/SPARK-11110
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Jakob Odersky
>Priority: Critical
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-11110:
---

 Summary: Scala 2.11 build fails due to compiler errors
 Key: SPARK-11110
 URL: https://issues.apache.org/jira/browse/SPARK-11110
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell


Right now the 2.11 build is failing due to compiler errors in SBT (though not 
in Maven). I have updated our 2.11 compile test harness to catch this.

https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull

{code}
[error] 
/home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
 no valid targets for annotation on value conf - it is discarded unused. You 
may specify targets with meta-annotations, e.g. @(transient @param)
[error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
[error] 
{code}

This is one error, but there may be others past this point (the compile fails 
fast).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11115) IPv6 regression

2015-10-14 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958078#comment-14958078
 ] 

Patrick Wendell edited comment on SPARK-11115 at 10/15/15 12:38 AM:


The title of this says "Regression" - did it regress from a previous version? I 
am going to update the title, let me know if there is any issue.


was (Author: pwendell):
The title of this says "Regression" - did it regression from a previous 
version? I am going to update the title, let me know if there is any issue.

> IPv6 regression
> ---
>
> Key: SPARK-11115
> URL: https://issues.apache.org/jira/browse/SPARK-11115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
>Reporter: Thomas Dudziak
>Priority: Critical
>
> When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:
> 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
>   at 
> org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> Looking at the code in question, it seems that the code will only work for 
> IPv4 as it assumes ':' can't be part of the hostname (which it clearly can 
> for IPv6 addresses).
> Instead, the code should probably use Guava's HostAndPort class, i.e.:
>   def checkHost(host: String, message: String = "") {
> assert(!HostAndPort.fromString(host).hasPort, message)
>   }
>   def checkHostPort(hostPort: String, message: String = "") {
> assert(HostAndPort.fromString(hostPort).hasPort, message)
>   }
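> A small standalone sketch (assuming Guava on the classpath) of how HostAndPort copes 
> with IPv6 literals where splitting on ':' does not:
> {code}
> import com.google.common.net.HostAndPort
>
> // Bracketed IPv6 literal with a port: parsed as host + port.
> val withPort = HostAndPort.fromString("[2001:db8::1]:7077")
> assert(withPort.hasPort && withPort.getPort == 7077)
>
> // Bare IPv6 literal: treated as a host with no port, despite containing ':'.
> val hostOnly = HostAndPort.fromString("2001:db8::1")
> assert(!hostOnly.hasPort)
> {code}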



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11081) Shade Jersey dependency to work around the compatibility issue with Jersey2

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11081:

Component/s: Build

> Shade Jersey dependency to work around the compatibility issue with Jersey2
> ---
>
> Key: SPARK-11081
> URL: https://issues.apache.org/jira/browse/SPARK-11081
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Reporter: Mingyu Kim
>
> As seen from this thread 
> (https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62yD8H3=2KVMiFs7NZjn929oJ133JkPLrNEj=vrx-d2...@mail.gmail.com%3E),
>  Spark is incompatible with Jersey 2 especially when Spark is embedded in an 
> application running with Jersey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11092) Add source URLs to API documentation.

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11092:

Assignee: Jakob Odersky

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Assignee: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11115) Host verification is not correct for IPv6

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11115:

Summary: Host verification is not correct for IPv6  (was: IPv6 regression)

> Host verification is not correct for IPv6
> -
>
> Key: SPARK-11115
> URL: https://issues.apache.org/jira/browse/SPARK-11115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
>Reporter: Thomas Dudziak
>Priority: Critical
>
> When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:
> 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
>   at 
> org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> Looking at the code in question, it seems that the code will only work for 
> IPv4 as it assumes ':' can't be part of the hostname (which it clearly can 
> for IPv6 addresses).
> Instead, the code should probably use Guava's HostAndPort class, i.e.:
>   def checkHost(host: String, message: String = "") {
> assert(!HostAndPort.fromString(host).hasPort, message)
>   }
>   def checkHostPort(hostPort: String, message: String = "") {
> assert(HostAndPort.fromString(hostPort).hasPort, message)
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11115) IPv6 regression

2015-10-14 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958078#comment-14958078
 ] 

Patrick Wendell commented on SPARK-11115:
-

The title of this says "Regression" - did it regress from a previous 
version? I am going to update the title, let me know if there is any issue.

> IPv6 regression
> ---
>
> Key: SPARK-11115
> URL: https://issues.apache.org/jira/browse/SPARK-11115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
>Reporter: Thomas Dudziak
>Priority: Critical
>
> When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:
> 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
>   at 
> org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> Looking at the code in question, it seems that the code will only work for 
> IPv4 as it assumes ':' can't be part of the hostname (which it clearly can 
> for IPv6 addresses).
> Instead, the code should probably use Guava's HostAndPort class, i.e.:
>   def checkHost(host: String, message: String = "") {
> assert(!HostAndPort.fromString(host).hasPort, message)
>   }
>   def checkHostPort(hostPort: String, message: String = "") {
> assert(HostAndPort.fromString(hostPort).hasPort, message)
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11006) Rename NullColumnAccess as NullColumnAccessor

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11006:

Component/s: SQL

> Rename NullColumnAccess as NullColumnAccessor
> -
>
> Key: SPARK-11006
> URL: https://issues.apache.org/jira/browse/SPARK-11006
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Trivial
> Fix For: 1.6.0
>
>
> In sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala 
> , NullColumnAccess should be renamed as NullColumnAccessor so that the same 
> convention is adhered to for the accessors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11111) Fast null-safe join

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11111:

Component/s: SQL

> Fast null-safe join
> ---
>
> Key: SPARK-11111
> URL: https://issues.apache.org/jira/browse/SPARK-11111
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Today, null safe joins are executed with a Cartesian product.
> {code}
> scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain
> == Physical Plan ==
> TungstenProject [i#2,j#3,i#7,j#8]
>  Filter (i#2 <=> i#7)
>   CartesianProduct
>LocalTableScan [i#2,j#3], [[1,1]]
>LocalTableScan [i#7,j#8], [[1,1]]
> {code}
> One option is to add this rewrite to the optimizer:
> {code}
> select * 
> from t a 
> join t b 
>   on coalesce(a.i, ) = coalesce(b.i, ) AND (a.i <=> b.i)
> {code}
> Acceptance criteria: joins with only null safe equality should not result in 
> a Cartesian product.
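> A sketch of the rewritten query with a purely illustrative sentinel of 0 for the 
> coalesce keys (any value works, since the null-safe predicate still decides the 
> final match; assumes a SQLContext named sqlContext, as in the spark-shell):
> {code}
> // The coalesce expressions give the planner equi-join keys to hash/sort on,
> // while (a.i <=> b.i) keeps the original null-safe semantics.
> val rewritten = sqlContext.sql(
>   """select *
>     |from t a
>     |join t b
>     |  on coalesce(a.i, 0) = coalesce(b.i, 0) and (a.i <=> b.i)""".stripMargin)
> rewritten.explain()
> {code}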



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11056) Improve documentation on how to build Spark efficiently

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11056:

Component/s: Documentation

> Improve documentation on how to build Spark efficiently
> ---
>
> Key: SPARK-11056
> URL: https://issues.apache.org/jira/browse/SPARK-11056
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.5.2, 1.6.0
>
>
> Slow build times are a common pain point for new Spark developers.  We should 
> improve the main documentation on building Spark to describe how to make 
> building Spark less painful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6230) Provide authentication and encryption for Spark's RPC

2015-10-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954523#comment-14954523
 ] 

Patrick Wendell commented on SPARK-6230:


Should we update Spark's documentation to explain this? I think at present it 
only discusses encrypted RPC via akka. But this will be the new recommended way 
to encrypt RPC.

> Provide authentication and encryption for Spark's RPC
> -
>
> Key: SPARK-6230
> URL: https://issues.apache.org/jira/browse/SPARK-6230
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Marcelo Vanzin
>
> Make sure the RPC layer used by Spark supports the auth and encryption 
> features of the network/common module.
> This kinda ignores akka; adding support for SASL to akka, while possible, 
> seems to be at odds with the direction being taken in Spark, so let's 
> restrict this to the new RPC layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (FLINK-2699) Flink is filling Spark JIRA with incorrect PR links

2015-09-17 Thread Patrick Wendell (JIRA)
Patrick Wendell created FLINK-2699:
--

 Summary: Flink is filling Spark JIRA with incorrect PR links
 Key: FLINK-2699
 URL: https://issues.apache.org/jira/browse/FLINK-2699
 Project: Flink
  Issue Type: Bug
Reporter: Patrick Wendell
Priority: Blocker


I think you guys are using our script for synchronizing JIRA. However, you 
didn't adjust the target JIRA identifier so it is still posting to Spark. In 
the past few hours we've seen a lot of random Flink pull requests being linked 
on the Spark JIRA. This is obviously not desirable for us since they are 
different projects.

The JIRA links are being created by the user "Maximilian Michels" ([~mxm]).

https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mxm

I saw these as recently as 5 hours ago - but if you've fixed it already go 
ahead and close this. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2699) Flink is filling Spark JIRA with incorrect PR links

2015-09-17 Thread Patrick Wendell (JIRA)
Patrick Wendell created FLINK-2699:
--

 Summary: Flink is filling Spark JIRA with incorrect PR links
 Key: FLINK-2699
 URL: https://issues.apache.org/jira/browse/FLINK-2699
 Project: Flink
  Issue Type: Bug
Reporter: Patrick Wendell
Priority: Blocker


I think you guys are using our script for synchronizing JIRA. However, you 
didn't adjust the target JIRA identifier so it is still posting to Spark. In 
the past few hours we've seen a lot of random Flink pull requests being linked 
on the Spark JIRA. This is obviously not desirable for us since they are 
different projects.

The JIRA links are being created by the user "Maximilian Michels" ([~mxm]).

https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mxm

I saw these as recently as 5 hours ago - but if you've fixed it already go 
ahead and close this. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2699) Flink is filling Spark JIRA with incorrect PR links

2015-09-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated FLINK-2699:
---
Description: 
I think you guys are using our script for synchronizing JIRA. However, you 
didn't adjust the target JIRA identifier so it is still posting to Spark. In 
the past few hours we've seen a lot of random Flink pull requests being linked 
on the Spark JIRA. This is obviously not desirable for us since they are 
different projects.

The JIRA links are being created by the user "Maximilian Michels" ([~mxm]).

https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mxm

I saw these as recently as 5 hours ago. There are around 23 links that were 
created - if you could go ahead and remove them that would be useful. Thanks!

  was:
I think you guys are using our script for synchronizing JIRA. However, you 
didn't adjust the target JIRA identifier so it is still posting to Spark. In 
the past few hours we've seen a lot of random Flink pull requests being linked 
on the Spark JIRA. This is obviously not desirable for us since they are 
different projects.

The JIRA links are being created by the user "Maximilian Michels" ([~mxm]).

https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mxm

I saw these as recently as 5 hours ago - but if you've fixed it already go 
ahead and close this. Thanks.


> Flink is filling Spark JIRA with incorrect PR links
> ---
>
> Key: FLINK-2699
> URL: https://issues.apache.org/jira/browse/FLINK-2699
> Project: Flink
>  Issue Type: Bug
>Reporter: Patrick Wendell
>Priority: Blocker
>
> I think you guys are using our script for synchronizing JIRA. However, you 
> didn't adjust the target JIRA identifier so it is still posting to Spark. In 
> the past few hours we've seen a lot of random Flink pull requests being 
> linked on the Spark JIRA. This is obviously not desirable for us since they 
> are different projects.
> The JIRA links are being created by the user "Maximilian Michels" ([~mxm]).
> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mxm
> I saw these as recently as 5 hours ago. There are around 23 links that were 
> created - if you could go ahead and remove them that would be useful. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2699) Flink is filling Spark JIRA with incorrect PR links

2015-09-17 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804497#comment-14804497
 ] 

Patrick Wendell commented on FLINK-2699:


Great - thanks for cleaning this up. No worries.

> Flink is filling Spark JIRA with incorrect PR links
> ---
>
> Key: FLINK-2699
> URL: https://issues.apache.org/jira/browse/FLINK-2699
> Project: Flink
>  Issue Type: Bug
>Reporter: Patrick Wendell
>Priority: Blocker
>
> I think you guys are using our script for synchronizing JIRA. However, you 
> didn't adjust the target JIRA identifier so it is still posting to Spark. In 
> the past few hours we've seen a lot of random Flink pull requests being 
> linked on the Spark JIRA. This is obviously not desirable for us since they 
> are different projects.
> The JIRA links are being created by the user "Maximilian Michels" ([~mxm]).
> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mxm
> I saw these as recently as 5 hours ago. There are around 23 links that were 
> created - if you could go ahead and remove them that would be useful. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-10650) Spark docs include test and other extra classes

2015-09-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10650:

Description: 
In 1.5.0 there are some extra classes in the Spark docs - including a bunch of 
test classes. We need to figure out what commit introduced those and fix it. 
The obvious things like genJavadoc version have not changed.

http://spark.apache.org/docs/1.4.1/api/java/org/apache/spark/streaming/ [before]
http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/ [after]


> Spark docs include test and other extra classes
> ---
>
> Key: SPARK-10650
> URL: https://issues.apache.org/jira/browse/SPARK-10650
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> In 1.5.0 there are some extra classes in the Spark docs - including a bunch 
> of test classes. We need to figure out what commit introduced those and fix 
> it. The obvious things like genJavadoc version have not changed.
> http://spark.apache.org/docs/1.4.1/api/java/org/apache/spark/streaming/ 
> [before]
> http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/ 
> [after]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10650) Spark docs include test and other extra classes

2015-09-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10650:

Priority: Critical  (was: Major)

> Spark docs include test and other extra classes
> ---
>
> Key: SPARK-10650
> URL: https://issues.apache.org/jira/browse/SPARK-10650
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>Priority: Critical
>
> In 1.5.0 there are some extra classes in the Spark docs - including a bunch 
> of test classes. We need to figure out what commit introduced those and fix 
> it. The obvious things like genJavadoc version have not changed.
> http://spark.apache.org/docs/1.4.1/api/java/org/apache/spark/streaming/ 
> [before]
> http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/ 
> [after]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10650) Spark docs include test and other extra classes

2015-09-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10650:

Affects Version/s: 1.5.0

> Spark docs include test and other extra classes
> ---
>
> Key: SPARK-10650
> URL: https://issues.apache.org/jira/browse/SPARK-10650
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10650) Spark docs include test and other extra classes

2015-09-16 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-10650:
---

 Summary: Spark docs include test and other extra classes
 Key: SPARK-10650
 URL: https://issues.apache.org/jira/browse/SPARK-10650
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Andrew Or






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10650) Spark docs include test and other extra classes

2015-09-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10650:

Target Version/s: 1.5.1

> Spark docs include test and other extra classes
> ---
>
> Key: SPARK-10650
> URL: https://issues.apache.org/jira/browse/SPARK-10650
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>Priority: Critical
>
> In 1.5.0 there are some extra classes in the Spark docs - including a bunch 
> of test classes. We need to figure out what commit introduced those and fix 
> it. The obvious things like genJavadoc version have not changed.
> http://spark.apache.org/docs/1.4.1/api/java/org/apache/spark/streaming/ 
> [before]
> http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/ 
> [after]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6942) Umbrella: UI Visualizations for Core and Dataframes

2015-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6942:
---
Assignee: Andrew Or  (was: Patrick Wendell)

> Umbrella: UI Visualizations for Core and Dataframes 
> 
>
> Key: SPARK-6942
> URL: https://issues.apache.org/jira/browse/SPARK-6942
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, SQL, Web UI
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 1.5.0
>
>
> This is an umbrella issue for the assorted visualization proposals for 
> Spark's UI. The scope will likely cover Spark 1.4 and 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2015-09-15 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-10620:
---

 Summary: Look into whether accumulator mechanism can replace 
TaskMetrics
 Key: SPARK-10620
 URL: https://issues.apache.org/jira/browse/SPARK-10620
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or


This task is simply to explore whether the internal representation used by 
TaskMetrics could be performed by using accumulators rather than having two 
separate mechanisms. Note that we need to continue to preserve the existing 
"Task Metric" data structures that are exposed to users through event logs etc. 
The question is can we use a single internal codepath and perhaps make this 
easier to extend in the future.

I think there are a few things to look into:
- How do the semantics of accumulators on stage retries differ from aggregate 
TaskMetrics for a stage? Could we implement clearer retry semantics for 
internal accumulators to allow them to be the same - for instance, zeroing 
accumulator values if a stage is retried (see discussion here: SPARK-10042).
- Are there metrics that do not fit well into the accumulator model, or would 
be difficult to update as an accumulator.
- If we expose metrics through accumulators in the future rather than 
continuing to add fields to TaskMetrics, what is the best way to coerce 
compatibility?
- Is it worth it to do this, or is the consolidation too complicated to justify?
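
A toy sketch of the general direction, using only the public accumulator API as a 
stand-in for a metric (an illustration, not the proposed internal design):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object MetricAsAccumulator {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("metric-as-accumulator"))
    // A named accumulator playing the role a TaskMetrics field plays today:
    // incremented on executors, merged by the driver, visible in the UI because it is named.
    val recordsRead = sc.accumulator(0L, "recordsRead")
    sc.parallelize(1 to 1000, numSlices = 4).foreach(_ => recordsRead += 1L)
    println(s"recordsRead = ${recordsRead.value}")
    sc.stop()
  }
}
{code}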



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2015-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10620:

Description: 
This task is simply to explore whether the internal representation used by 
TaskMetrics could be performed by using accumulators rather than having two 
separate mechanisms. Note that we need to continue to preserve the existing 
"Task Metric" data structures that are exposed to users through event logs etc. 
The question is can we use a single internal codepath and perhaps make this 
easier to extend in the future.

I think a full exploration would answer the following questions:
- How do the semantics of accumulators on stage retries differ from aggregate 
TaskMetrics for a stage? Could we implement clearer retry semantics for 
internal accumulators to allow them to be the same - for instance, zeroing 
accumulator values if a stage is retried (see discussion here: SPARK-10042).
- Are there metrics that do not fit well into the accumulator model, or would 
be difficult to update as an accumulator.
- If we expose metrics through accumulators in the future rather than 
continuing to add fields to TaskMetrics, what is the best way to coerce 
compatibility?
- Are there any other considerations?
- Is it worth it to do this, or is the consolidation too complicated to justify?

  was:
This task is simply to explore whether the internal representation used by 
TaskMetrics could be performed by using accumulators rather than having two 
separate mechanisms. Note that we need to continue to preserve the existing 
"Task Metric" data structures that are exposed to users through event logs etc. 
The question is can we use a single internal codepath and perhaps make this 
easier to extend in the future.

I think there are a few things to look into:
- How do the semantics of accumulators on stage retries differ from aggregate 
TaskMetrics for a stage? Could we implement clearer retry semantics for 
internal accumulators to allow them to be the same - for instance, zeroing 
accumulator values if a stage is retried (see discussion here: SPARK-10042).
- Are there metrics that do not fit well into the accumulator model, or would 
be difficult to update as an accumulator.
- If we expose metrics through accumulators in the future rather than 
continuing to add fields to TaskMetrics, what is the best way to coerce 
compatibility?
- Is it worth it to do this, or is the consolidation too complicated to justify?


> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2015-09-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745690#comment-14745690
 ] 

Patrick Wendell commented on SPARK-10620:
-

/cc [~imranr] and [~srowen] for any comments. In my mind the goal here is just 
to produce some design thoughts and not to actually do it (at this point).

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10511) Source releases should not include maven jars

2015-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10511:

Assignee: Luciano Resende

> Source releases should not include maven jars
> -
>
> Key: SPARK-10511
> URL: https://issues.apache.org/jira/browse/SPARK-10511
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Patrick Wendell
>Assignee: Luciano Resende
>Priority: Blocker
>
> I noticed our source jars seemed really big for 1.5.0. At least one 
> contributing factor is that, likely due to some change in the release script, 
> the maven jars are being bundled in with the source code in our build 
> directory. This runs afoul of the ASF policy on binaries in source releases - 
> we should fix it in 1.5.1.
> The issue (I think) is that we might invoke maven to compute the version 
> between when we checkout Spark from github and when we package the source 
> file. I think it could be fixed by simply clearing out the build/ directory 
> after that statement runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10623) turning on predicate pushdown throws nonsuch element exception when RDD is empty

2015-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10623:

Component/s: SQL

> turning on predicate pushdown throws nonsuch element exception when RDD is 
> empty 
> -
>
> Key: SPARK-10623
> URL: https://issues.apache.org/jira/browse/SPARK-10623
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ram Sriharsha
>Assignee: Zhan Zhang
>
> Turning on predicate pushdown for ORC datasources results in a 
> NoSuchElementException:
> scala> val df = sqlContext.sql("SELECT name FROM people WHERE age < 15")
> df: org.apache.spark.sql.DataFrame = [name: string]
> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> scala> df.explain
> == Physical Plan ==
> java.util.NoSuchElementException
> Disabling the pushdown makes things work again:
> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false")
> scala> df.explain
> == Physical Plan ==
> Project [name#6]
>  Filter (age#7 < 15)
>   Scan 
> OrcRelation[file:/home/mydir/spark-1.5.0-SNAPSHOT/test/people][name#6,age#7]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10601) Spark SQL - Support for MINUS

2015-09-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10601:

Component/s: SQL

> Spark SQL - Support for MINUS
> -
>
> Key: SPARK-10601
> URL: https://issues.apache.org/jira/browse/SPARK-10601
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Richard Garris
>
> Spark SQL does not currently support SQL MINUS.
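> Until then, one workaround sketch is the equivalent EXCEPT semantics already exposed 
> by the DataFrame API (table names below are hypothetical; assumes a SQLContext named 
> sqlContext):
> {code}
> // MINUS keeps the rows of the first relation that do not appear in the second,
> // which is what DataFrame.except computes.
> val aMinusB = sqlContext.table("table_a").except(sqlContext.table("table_b"))
> {code}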



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10600) SparkSQL - Support for Not Exists in a Correlated Subquery

2015-09-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10600:

Component/s: SQL

> SparkSQL - Support for Not Exists in a Correlated Subquery
> --
>
> Key: SPARK-10600
> URL: https://issues.apache.org/jira/browse/SPARK-10600
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Richard Garris
>
> Spark SQL currently does not support NOT EXISTS clauses, e.g.
> SELECT * FROM TABLE_A WHERE NOT EXISTS (SELECT 1 FROM TABLE_B WHERE 
> TABLE_B.id = TABLE_A.id)
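> In the meantime, the same predicate can usually be rewritten as a left outer join 
> that keeps only unmatched rows; a sketch (assumes a SQLContext named sqlContext):
> {code}
> val notExists = sqlContext.sql(
>   """SELECT a.*
>     |FROM TABLE_A a
>     |LEFT OUTER JOIN TABLE_B b ON b.id = a.id
>     |WHERE b.id IS NULL""".stripMargin)
> {code}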



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10576) Move .java files out of src/main/scala

2015-09-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742280#comment-14742280
 ] 

Patrick Wendell commented on SPARK-10576:
-

FWIW - seems to me like moving them into /java makes sense. If we are going to 
have src/main/scala and src/main/java, might as well use them correctly. What 
do you think, [~rxin]?

> Move .java files out of src/main/scala
> --
>
> Key: SPARK-10576
> URL: https://issues.apache.org/jira/browse/SPARK-10576
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Priority: Minor
>
> (I suppose I'm really asking for an opinion on this, rather than asserting it 
> must be done, but seems worthwhile. CC [~rxin] and [~pwendell])
> As pointed out on the mailing list, there are some Java files in the Scala 
> source tree:
> {code}
> ./bagel/src/main/scala/org/apache/spark/bagel/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/AlphaComponent.java
> ./core/src/main/scala/org/apache/spark/annotation/DeveloperApi.java
> ./core/src/main/scala/org/apache/spark/annotation/Experimental.java
> ./core/src/main/scala/org/apache/spark/annotation/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/Private.java
> ./core/src/main/scala/org/apache/spark/api/java/package-info.java
> ./core/src/main/scala/org/apache/spark/broadcast/package-info.java
> ./core/src/main/scala/org/apache/spark/executor/package-info.java
> ./core/src/main/scala/org/apache/spark/io/package-info.java
> ./core/src/main/scala/org/apache/spark/rdd/package-info.java
> ./core/src/main/scala/org/apache/spark/scheduler/package-info.java
> ./core/src/main/scala/org/apache/spark/serializer/package-info.java
> ./core/src/main/scala/org/apache/spark/util/package-info.java
> ./core/src/main/scala/org/apache/spark/util/random/package-info.java
> ./external/flume/src/main/scala/org/apache/spark/streaming/flume/package-info.java
> ./external/kafka/src/main/scala/org/apache/spark/streaming/kafka/package-info.java
> ./external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/package-info.java
> ./external/twitter/src/main/scala/org/apache/spark/streaming/twitter/package-info.java
> ./external/zeromq/src/main/scala/org/apache/spark/streaming/zeromq/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/impl/EdgeActiveness.java
> ./graphx/src/main/scala/org/apache/spark/graphx/lib/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java
> ./graphx/src/main/scala/org/apache/spark/graphx/util/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/package-info.java
> ./mllib/src/main/scala/org/apache/spark/mllib/package-info.java
> ./sql/catalyst/src/main/scala/org/apache/spark/sql/types/SQLUserDefinedType.java
> ./sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/api/java/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/dstream/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/StreamingContextState.java
> {code}
> It happens to work since the Scala compiler plugin is handling both.
> On its face, they should be in the Java source tree. I'm trying to figure out 
> if there are good reasons they have to be in this less intuitive location.
> I might try moving them just to see.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10511) Source releases should not include maven jars

2015-09-09 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-10511:
---

 Summary: Source releases should not include maven jars
 Key: SPARK-10511
 URL: https://issues.apache.org/jira/browse/SPARK-10511
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Patrick Wendell
Priority: Blocker


I noticed our source jars seemed really big for 1.5.0. At least one 
contributing factor is that, likely due to some change in the release script, 
the maven jars are being bundled in with the source code in our build 
directory. This runs afoul of the ASF policy on binaries in source releases - 
we should fix it in 1.5.1.

The issue (I think) is that we might invoke maven to compute the version 
between when we checkout Spark from github and when we package the source file. 
I think it could be fixed by simply clearing out the build/ directory after 
that statement runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4

2015-08-31 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723792#comment-14723792
 ] 

Patrick Wendell commented on SPARK-10374:
-

Hey Matt,

I think the only thing that could have influenced you is that we changed our 
default advertised akka dependency. We used to advertise an older version of 
akka that shaded protobuf. What happens if you manually coerce that version of 
akka in your application?

Spark itself doesn't directly use protobuf. But some of our dependencies do, 
including both akka and Hadoop. My guess is that you are now in a situation 
where you can't reconcile the akka and hadoop protobuf versions and make them 
both happy. This would be consistent with the changes we made in 1.5 in 
SPARK-7042.

The fix would be to exclude all com.typesafe.akka artifacts from Spark and 
manually add org.spark-project.akka to your build.
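
In sbt terms the shape of that change is roughly the following (a sketch only; the 
reporter's build is Gradle, and the exact coordinates should be checked against the 
Spark 1.4.x poms):

{code}
// build.sbt sketch: drop the com.typesafe.akka artifacts Spark 1.5 advertises and
// pull in the protobuf-shaded org.spark-project.akka build instead.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.5.0")
    .excludeAll(ExclusionRule(organization = "com.typesafe.akka")),
  "org.spark-project.akka" %% "akka-remote" % "2.3.4-spark"
)
{code}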

However, since you didn't post a full stack trace, I can't know for sure 
whether it is akka that complains when you try to fix the protobuf version at 
2.4.

> Spark-core 1.5.0-RC2 can create version conflicts with apps depending on 
> protobuf-2.4
> -
>
> Key: SPARK-10374
> URL: https://issues.apache.org/jira/browse/SPARK-10374
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>
> My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that 
> depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. 
> When I run the driver application, I can hit the following error:
> {code}
> … java.lang.UnsupportedOperationException: This is 
> supposed to be overridden by subclasses.
> at 
> com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
> at 
> com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
> {code}
> This application used to work when pulling in Spark 1.4.1 dependencies, and 
> thus this is a regression.
> I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark 
> 1.4.1-backed project, it shows that dependency resolution pulls in Protobuf 
> 2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark 
> modules. It appears that Spark used to shade its protobuf dependencies and 
> hence Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However 
> when I ran dependencyInsight again against Spark 1.5 and it looks like 
> protobuf is no longer shaded from the Spark module.
> 1.4.1 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.4.0a
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.4.1
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.4.1
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.4.1
> |   \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> org.spark-project.protobuf:protobuf-java:2.5.0-spark
> \--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark
>  \--- org.apache.spark:spark-core_2.10:1.4.1
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.4.1
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.4.1
>\--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> {code}
> 1.5.0-rc2 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.5.0 (conflict resolution)
> \--- com.typesafe.akka:akka-remote_2.10:2.3.11
>  \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
>\--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
> |   \--- 

[jira] [Commented] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2015-08-31 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723844#comment-14723844
 ] 

Patrick Wendell commented on SPARK-10359:
-

The approach in SPARK-4123 was a bit different, but there is some overlap. We 
ended up reverting that patch because it wasn't working consistently. I'll 
close that one as a dup of this one.

> Enumerate Spark's dependencies in a file and diff against it for new pull 
> requests 
> ---
>
> Key: SPARK-10359
> URL: https://issues.apache.org/jira/browse/SPARK-10359
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>
> Sometimes when we have dependency changes it can be pretty unclear what 
> transitive set of things are changing. If we enumerate all of the 
> dependencies and put them in a source file in the repo, we can make it so 
> that it is very explicit what is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4123) Show dependency changes in pull requests

2015-08-31 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4123.

Resolution: Duplicate

I've proposed a slightly different approach in SPARK-10359, so I'm closing this 
since there is high overlap.

> Show dependency changes in pull requests
> 
>
> Key: SPARK-4123
> URL: https://issues.apache.org/jira/browse/SPARK-4123
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Brennon York
>Priority: Critical
>
> We should inspect the classpath of Spark's assembly jar for every pull 
> request. This only takes a few seconds in Maven and it will help weed out 
> dependency changes from the master branch. Ideally we'd post any dependency 
> changes in the pull request message.
> {code}
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
> $ git checkout apache/master
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
> $ diff my-classpath master-classpath
> < chill-java-0.3.6.jar
> < chill_2.10-0.3.6.jar
> ---
> > chill-java-0.5.0.jar
> > chill_2.10-0.5.0.jar
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9545) Run Maven tests in pull request builder if title has [test-maven] in it

2015-08-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-9545:
---
Summary: Run Maven tests in pull request builder if title has 
[test-maven] in it  (was: Run Maven tests in pull request builder if title 
has [maven-test] in it)

 Run Maven tests in pull request builder if title has [test-maven] in it
 -

 Key: SPARK-9545
 URL: https://issues.apache.org/jira/browse/SPARK-9545
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.6.0


 We have infrastructure now in the build tooling for running maven tests, but 
 it's not actually used anywhere. With a very minor change we can support 
 running maven tests if the pull request title has maven-test in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9547) Allow testing pull requests with different Hadoop versions

2015-08-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-9547.

   Resolution: Fixed
Fix Version/s: 1.6.0

 Allow testing pull requests with different Hadoop versions
 --

 Key: SPARK-9547
 URL: https://issues.apache.org/jira/browse/SPARK-9547
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.6.0


 Similar to SPARK-9545 we should allow testing different Hadoop profiles in 
 the PRB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9545) Run Maven tests in pull request builder if title has [maven-test] in it

2015-08-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-9545.

   Resolution: Fixed
Fix Version/s: 1.6.0

 Run Maven tests in pull request builder if title has [maven-test] in it
 -

 Key: SPARK-9545
 URL: https://issues.apache.org/jira/browse/SPARK-9545
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.6.0


 We have infrastructure now in the build tooling for running maven tests, but 
 it's not actually used anywhere. With a very minor change we can support 
 running maven tests if the pull request title has maven-test in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2015-08-30 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-10359:
---

 Summary: Enumerate Spark's dependencies in a file and diff against 
it for new pull requests 
 Key: SPARK-10359
 URL: https://issues.apache.org/jira/browse/SPARK-10359
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell


Sometimes when we have dependency changes it can be pretty unclear what 
transitive set of things are changing. If we enumerate all of the dependencies 
and put them in a source file in the repo, we can make it so that it is very 
explicit what is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7726) Maven Install Breaks When Upgrading Scala 2.11.2--[2.11.3 or higher]

2015-08-10 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680885#comment-14680885
 ] 

Patrick Wendell commented on SPARK-7726:


[~srowen] [~dragos] This is cropping up again when trying to create a release 
candidate for Spark 1.5:

https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Release-All-Java7/26/console

 Maven Install Breaks When Upgrading Scala 2.11.2--[2.11.3 or higher]
 -

 Key: SPARK-7726
 URL: https://issues.apache.org/jira/browse/SPARK-7726
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Iulian Dragos
Priority: Blocker
 Fix For: 1.4.0


 This one took a long time to track down. The Maven install phase is part of 
 our release process. It runs the scala:doc target to generate doc jars. 
 Between Scala 2.11.2 and Scala 2.11.3, the behavior of this plugin changed in 
 a way that breaks our build. In both cases, it returned an error (there has 
 been a long running error here that we've always ignored), however in 2.11.3 
 that error became fatal and failed the entire build process. The upgrade 
 occurred in SPARK-7092. Here is a simple reproduction:
 {code}
 ./dev/change-version-to-2.11.sh
 mvn clean install -pl network/common -pl network/shuffle -DskipTests 
 -Dscala-2.11
 {code} 
 This command exits success when Spark is at Scala 2.11.2 and fails with 
 2.11.3 or higher. In either case an error is printed:
 {code}
 [INFO] 
 [INFO] --- scala-maven-plugin:3.2.0:doc-jar (attach-scaladocs) @ 
 spark-network-shuffle_2.11 ---
 /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:56:
  error: not found: type Type
   protected Type type() { return Type.UPLOAD_BLOCK; }
 ^
 /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:37:
  error: not found: type Type
   protected Type type() { return Type.STREAM_HANDLE; }
 ^
 /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44:
  error: not found: type Type
   protected Type type() { return Type.REGISTER_EXECUTOR; }
 ^
 /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40:
  error: not found: type Type
   protected Type type() { return Type.OPEN_BLOCKS; }
 ^
 model contains 22 documentable templates
 four errors found
 {code}
 Ideally we'd just dig in and fix this error. Unfortunately it's a very 
 confusing error and I have no idea why it is appearing. I'd propose reverting 
 SPARK-7092 in the mean time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2015-08-06 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660796#comment-14660796
 ] 

Patrick Wendell commented on SPARK-1517:


Hey Ryan,

IIRC - the Apache snapshot repository won't let us publish binaries that do not 
have SNAPSHOT in the version number. The reason is it expects to see 
timestamped snapshots so its garbage collection mechanism can work. We could 
look at adding sha1 hashes, before SNAPSHOT, but I think there is some chance 
this would break their cleanup.

In terms of posting more binaries - I can look at whether Databricks or 
Berkeley might be able to donate S3 resources for this, but it would have to be 
clearly maintained by those organizations and not branded as official Apache 
releases or anything like that.

 Publish nightly snapshots of documentation, maven artifacts, and binary builds
 --

 Key: SPARK-1517
 URL: https://issues.apache.org/jira/browse/SPARK-1517
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Critical

 Should be pretty easy to do with Jenkins. The only thing I can think of that 
 would be tricky is to set up credentials so that jenkins can publish this 
 stuff somewhere on apache infra.
 Ideally we don't want to have to put a private key on every jenkins box 
 (since they are otherwise pretty stateless). One idea is to encrypt these 
 credentials with a passphrase and post them somewhere publicly visible. Then 
 the jenkins build can download the credentials provided we set a passphrase 
 in an environment variable in jenkins. There may be simpler solutions as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2015-08-06 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660420#comment-14660420
 ] 

Patrick Wendell commented on SPARK-1517:


Hey Ryan,

For the maven snapshot releases - unfortunately we are constrained by maven's 
own SNAPSHOT version format which doesn't allow encoding anything other than 
the timestamp. It's just not supported in their SNAPSHOT mechanism. However, 
one thing we could see is whether we can align the timestamp with the time of 
the actual spark commit, rather than the time of publication of the SNAPSHOT 
release. I'm not sure if maven lets you provide a custom timestamp when 
publishing. If we had that feature users could look at the Spark commit log and 
do some manual association.

For the binaries, the reason why the same commit appears multiple times is that 
we do the build every four hours and always publish the latest one even if it's 
a duplicate. However, this could be modified pretty easily to just avoid 
double-publishing the same commit if there hasn't been any code change. Maybe 
create a JIRA for this?

In terms of how many older versions are available, the scripts we use for this 
have a tunable retention window. Right now I'm only keeping the last 4 builds, 
we could probably extend it to something like 10 builds. However, at some point 
I'm likely to blow out of space in my ASF user account. Since the binaries are 
quite large, I don't think at least using ASF infrastructure it's feasible to 
keep all past builds. We have 3000 commits in a typical Spark release, and it's 
a few gigs for each binary build.
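
As a hedged sketch of the skip-duplicate-publishing idea (file names and layout here are made up, not the actual packaging scripts), the check could record the last published commit and bail out when HEAD has not moved:

{code}
import java.io.{File, PrintWriter}
import scala.io.Source
import scala.sys.process._
import scala.util.Try

// Hypothetical sketch only: skip a nightly publish when the commit is unchanged.
object MaybePublishNightly {
  def main(args: Array[String]): Unit = {
    val head = Seq("git", "rev-parse", "HEAD").!!.trim
    val marker = new File(".last-published-commit")   // made-up state file
    val last = Try(Source.fromFile(marker).mkString.trim).getOrElse("")
    if (head == last) {
      println(s"Commit $head already published; skipping this cycle.")
    } else {
      println(s"Publishing nightly build for commit $head ...")
      // ... invoke the packaging/upload scripts here ...
      val out = new PrintWriter(marker)
      try out.print(head) finally out.close()
    }
  }
}
{code}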

 Publish nightly snapshots of documentation, maven artifacts, and binary builds
 --

 Key: SPARK-1517
 URL: https://issues.apache.org/jira/browse/SPARK-1517
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Critical

 Should be pretty easy to do with Jenkins. The only thing I can think of that 
 would be tricky is to set up credentials so that jenkins can publish this 
 stuff somewhere on apache infra.
 Ideally we don't want to have to put a private key on every jenkins box 
 (since they are otherwise pretty stateless). One idea is to encrypt these 
 credentials with a passphrase and post them somewhere publicly visible. Then 
 the jenkins build can download the credentials provided we set a passphrase 
 in an environment variable in jenkins. There may be simpler solutions as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9547) Allow testing pull requests with different Hadoop versions

2015-08-02 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-9547:
--

 Summary: Allow testing pull requests with different Hadoop versions
 Key: SPARK-9547
 URL: https://issues.apache.org/jira/browse/SPARK-9547
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell


Similar to SPARK-9545 we should allow testing different Hadoop profiles in the 
PRB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9545) Run Maven tests in pull request builder if title has [maven-test] in it

2015-08-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-9545:
---
Issue Type: Improvement  (was: Bug)

 Run Maven tests in pull request builder if title has [maven-test] in it
 -

 Key: SPARK-9545
 URL: https://issues.apache.org/jira/browse/SPARK-9545
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 We have infrastructure now in the build tooling for running maven tests, but 
 it's not actually used anywhere. With a very minor change we can support 
 running maven tests if the pull request title has maven-test in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9545) Run Maven tests in pull request builder if title has [maven-test] in it

2015-08-02 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-9545:
--

 Summary: Run Maven tests in pull request builder if title has 
[maven-test] in it
 Key: SPARK-9545
 URL: https://issues.apache.org/jira/browse/SPARK-9545
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell


We have infrastructure now in the build tooling for running maven tests, but 
it's not actually used anywhere. With a very minor change we can support 
running maven tests if the pull request title has maven-test in it.
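
Purely as a hedged sketch of the idea (the real check would live in the Jenkins pull request builder scripts, not in Spark itself), the title check could look roughly like this, assuming the builder exposes the pull request title in an environment variable such as ghprbPullTitle:

{code}
// Hypothetical sketch: pick the build tool based on a tag in the PR title.
// The env var name ghprbPullTitle is an assumption about the Jenkins GitHub
// PR builder plugin; adjust to whatever the harness actually provides.
object SelectBuildTool {
  def main(args: Array[String]): Unit = {
    val title = sys.env.getOrElse("ghprbPullTitle", "")
    val runMavenTests = title.toLowerCase.contains("[maven-test]")
    if (runMavenTests) println("PR title requests Maven: run the Maven-based tests")
    else println("Default path: run the sbt-based test suite")
  }
}
{code}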



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9423) Why do every other spark comiter keep suggesting to use spark-submit script

2015-07-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-9423.

Resolution: Invalid

 Why do every other spark comiter keep suggesting to use spark-submit script
 ---

 Key: SPARK-9423
 URL: https://issues.apache.org/jira/browse/SPARK-9423
 Project: Spark
  Issue Type: Question
  Components: Deploy
Affects Versions: 1.3.1
Reporter: nirav patel

 I see that on the Spark forum and Stack Overflow people keep suggesting the 
 spark-submit.sh script as the way (the only way) to launch Spark jobs. Are we 
 still living in a monolithic application-server world where I need to run 
 startup.sh? What if the Spark application is a long-running context that serves 
 multiple requests? What if users just don't want to use a script? They want to 
 embed Spark as a service in their application. 
 Please STOP suggesting that users use the spark-submit script as an alternative. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9423) Why do every other spark comiter keep suggesting to use spark-submit script

2015-07-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645495#comment-14645495
 ] 

Patrick Wendell commented on SPARK-9423:


This is not a valid issue for JIRA (we use JIRA for project bugs and feature 
tracking). Please send an email to the spark-users list. Thanks.

 Why do every other spark comiter keep suggesting to use spark-submit script
 ---

 Key: SPARK-9423
 URL: https://issues.apache.org/jira/browse/SPARK-9423
 Project: Spark
  Issue Type: Question
  Components: Deploy
Affects Versions: 1.3.1
Reporter: nirav patel

 I see that on the Spark forum and Stack Overflow people keep suggesting the 
 spark-submit.sh script as the way (the only way) to launch Spark jobs. Are we 
 still living in a monolithic application-server world where I need to run 
 startup.sh? What if the Spark application is a long-running context that serves 
 multiple requests? What if users just don't want to use a script? They want to 
 embed Spark as a service in their application. 
 Please STOP suggesting that users use the spark-submit script as an alternative. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9304) Improve backwards compatibility of SPARK-8401

2015-07-24 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-9304:
--

 Summary: Improve backwards compatibility of SPARK-8401
 Key: SPARK-9304
 URL: https://issues.apache.org/jira/browse/SPARK-9304
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Michael Allman
Priority: Critical


In SPARK-8401 a backwards incompatible change was made to the scala 2.11 build 
process. It would be good to add scripts with the older names to avoid breaking 
compatibility for harnesses or other automated builds that build for Scala 
2.11. Each can just be a one-line shell script with a comment explaining it is 
for backwards compatibility purposes.

/cc [~srowen]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector

2015-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8703:
---
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-8521

 Add CountVectorizer as a ml transformer to convert document to words count 
 vector
 -

 Key: SPARK-8703
 URL: https://issues.apache.org/jira/browse/SPARK-8703
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: yuhao yang
Assignee: yuhao yang
 Fix For: 1.5.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 Converts a text document to a sparse vector of token counts. Similar to 
 http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
 I can further add an estimator to extract vocabulary from corpus if that's 
 appropriate.
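
 For later readers, a minimal usage sketch of the resulting transformer might look like the following, assuming Spark 1.5+ with the ml.feature.CountVectorizer estimator; the column names and toy data are arbitrary:

 {code}
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: fit a vocabulary on tokenized documents and produce
// sparse term-count vectors.
object CountVectorizerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cv-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a"))
    )).toDF("id", "words")

    val model = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .fit(df)

    model.transform(df).select("id", "features").show()
    sc.stop()
  }
}
 {code}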



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8564) Add the Python API for Kinesis

2015-07-23 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8564:
---
Target Version/s: 1.5.0

 Add the Python API for Kinesis
 --

 Key: SPARK-8564
 URL: https://issues.apache.org/jira/browse/SPARK-8564
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Shixiong Zhu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7920) Make MLlib ChiSqSelector Serializable ( Fix Related Documentation Example).

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7920:
---
Labels:   (was: spark.tc)

 Make MLlib ChiSqSelector Serializable ( Fix Related Documentation Example).
 

 Key: SPARK-7920
 URL: https://issues.apache.org/jira/browse/SPARK-7920
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1, 1.4.0
Reporter: Mike Dusenberry
Assignee: Mike Dusenberry
Priority: Minor
 Fix For: 1.4.0


 The MLlib ChiSqSelector class is not serializable, and so the example in the 
 ChiSqSelector documentation fails.  Also, that example is missing the import 
 of ChiSqSelector.  ChiSqSelector should just extend Serializable.
 Steps:
 1. Locate the MLlib ChiSqSelector documentation example.
 2. Fix the example by adding an import statement for ChiSqSelector.
 3. Attempt to run - notice that it will fail due to ChiSqSelector not being 
 serializable. 
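
 As a hedged sketch (not the exact documentation text), the corrected example would look roughly like this once the import is added and ChiSqSelector extends Serializable; the toy data set is a stand-in:

 {code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.ChiSqSelector   // the import the old example was missing
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hedged sketch of the doc example: keep the single most predictive feature.
object ChiSqSelectorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chisq-sketch").setMaster("local[2]"))
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 2.0)),
      LabeledPoint(1.0, Vectors.dense(3.0, 0.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(3.0, 2.0, 0.0))
    ))
    val selector = new ChiSqSelector(1)
    val model = selector.fit(data)
    // The closure below ships MLlib feature-selection objects to the executors,
    // which is where the serialization issue showed up.
    val filtered = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))
    filtered.collect().foreach(println)
    sc.stop()
  }
}
 {code}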



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8927) Doc format wrong for some config descriptions

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8927:
---
Labels:   (was: spark.tc)

 Doc format wrong for some config descriptions
 -

 Key: SPARK-8927
 URL: https://issues.apache.org/jira/browse/SPARK-8927
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Jon Alter
Assignee: Jon Alter
Priority: Trivial
 Fix For: 1.4.2, 1.5.0


 In the docs, a couple of configuration descriptions (under Network) are not 
 inside <td></td> tags and are being displayed immediately under the section 
 title instead of in their row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7985) Remove fittingParamMap references. Update ML Doc Estimator, Transformer, and Param examples.

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7985:
---
Labels:   (was: spark.tc)

 Remove fittingParamMap references. Update ML Doc Estimator, Transformer, 
 and Param examples.
 

 Key: SPARK-7985
 URL: https://issues.apache.org/jira/browse/SPARK-7985
 Project: Spark
  Issue Type: Bug
  Components: Documentation, ML
Reporter: Mike Dusenberry
Assignee: Mike Dusenberry
Priority: Minor
 Fix For: 1.4.0


 Update ML Doc's Estimator, Transformer, and Param Scala  Java examples to 
 use model.extractParamMap instead of model.fittingParamMap, which no longer 
 exists.  Remove all other references to fittingParamMap throughout Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7969) Drop method on Dataframes should handle Column

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7969:
---
Labels:   (was: spark.tc)

 Drop method on Dataframes should handle Column
 --

 Key: SPARK-7969
 URL: https://issues.apache.org/jira/browse/SPARK-7969
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Olivier Girardot
Assignee: Mike Dusenberry
Priority: Minor
 Fix For: 1.4.1, 1.5.0


 For now the drop method available on DataFrame (since Spark 1.4.0) only accepts 
 a column name (as a string); it should also accept a Column as input.
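
 In other words, the goal is for both calls in the hedged sketch below to work against the DataFrame API; the example data and column names are arbitrary:

 {code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: drop a column by name and by Column reference.
object DropColumnSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("drop-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")

    df.drop("name").printSchema()     // existing behaviour: drop by column name
    df.drop(df("name")).printSchema() // the requested overload: drop by Column
    sc.stop()
  }
}
 {code}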



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7830) ML doc cleanup: logreg, classification link

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7830:
---
Labels:   (was: spark.tc)

 ML doc cleanup: logreg, classification link
 ---

 Key: SPARK-7830
 URL: https://issues.apache.org/jira/browse/SPARK-7830
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Reporter: Mike Dusenberry
Assignee: Mike Dusenberry
Priority: Trivial
 Fix For: 1.4.0


 Add logistic regression to the list of Multiclass Classification Supported 
 Methods in the MLlib Classification and Regression documentation, and fix 
 related broken link.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8343) Improve the Spark Streaming Guides

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8343:
---
Labels:   (was: spark.tc)

 Improve the Spark Streaming Guides
 --

 Key: SPARK-8343
 URL: https://issues.apache.org/jira/browse/SPARK-8343
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Streaming
Reporter: Mike Dusenberry
Assignee: Mike Dusenberry
Priority: Minor
 Fix For: 1.4.1, 1.5.0


 Improve the Spark Streaming Guides by fixing broken links, rewording 
 confusing sections, fixing typos, adding missing words, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7977) Disallow println

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7977:
---
Labels: starter  (was: spark.tc starter)

 Disallow println
 

 Key: SPARK-7977
 URL: https://issues.apache.org/jira/browse/SPARK-7977
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Jon Alter
  Labels: starter
 Fix For: 1.5.0


 Very often we see pull requests that added println from debugging, but the 
 author forgot to remove it before code review.
 We can use the regex checker to disallow println. For legitimate use of 
 println, we can then disable the rule where they are used.
 Add to scalastyle-config.xml file:
 {code}
   <check customId="println" level="error"
 class="org.scalastyle.scalariform.TokenChecker" enabled="true">
 <parameters><parameter name="regex">^println$</parameter></parameters>
 <customMessage><![CDATA[Are you sure you want to println? If yes, wrap
 the code block with
   // scalastyle:off println
   println(...)
   // scalastyle:on println]]></customMessage>
   </check>
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8570) Improve MLlib Local Matrix Documentation.

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8570:
---
Labels:   (was: spark.tc)

 Improve MLlib Local Matrix Documentation.
 -

 Key: SPARK-8570
 URL: https://issues.apache.org/jira/browse/SPARK-8570
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Reporter: Mike Dusenberry
Assignee: Mike Dusenberry
Priority: Minor
 Fix For: 1.5.0


 Update the MLlib Data Types Local Matrix documentation as follows:
 -Include information on sparse matrices.
 -Add sparse matrix examples to the existing Scala and Java examples (a rough 
 sketch follows below).
 -Add Python examples for both dense and sparse matrices (currently no Python 
 examples exist for the Local Matrix section).
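
 As a rough illustration of the dense and sparse examples mentioned above, mirroring the mllib.linalg.Matrices factory methods (the values are arbitrary):

 {code}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Hedged sketch of the Data Types examples: a dense and a sparse local matrix.
object LocalMatrixSketch {
  def main(args: Array[String]): Unit = {
    // 3x2 dense matrix ((1,2), (3,4), (5,6)), stored in column-major order
    val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

    // 3x2 sparse matrix in CSC form with entries (0,0)=9, (2,1)=6, (1,1)=8
    val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0))

    println(dm)
    println(sm)
  }
}
 {code}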



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7883) Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation.

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7883:
---
Labels:   (was: spark.tc)

 Fixing broken trainImplicit example in MLlib Collaborative Filtering 
 documentation.
 ---

 Key: SPARK-7883
 URL: https://issues.apache.org/jira/browse/SPARK-7883
 Project: Spark
  Issue Type: Bug
  Components: Documentation, MLlib
Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.0
Reporter: Mike Dusenberry
Assignee: Mike Dusenberry
Priority: Trivial
 Fix For: 1.0.3, 1.1.2, 1.2.3, 1.3.2, 1.4.0


 The trainImplicit Scala example near the end of the MLlib Collaborative 
 Filtering documentation refers to an ALS.trainImplicit function signature 
 that does not exist.  Rather than add an extra function, let's just fix the 
 example.
 Currently, the example refers to a function that would have the following 
 signature: 
 def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, alpha: 
 Double) : MatrixFactorizationModel
 Instead, let's change the example to refer to this function, which does exist 
 (notice the addition of the lambda parameter):
 def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: 
 Double, alpha: Double) : MatrixFactorizationModel
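
 Concretely, a call against the existing five-argument signature would look something like the hedged sketch below, with toy data; lambda is the regularization parameter the old example left out:

 {code}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: train an implicit-feedback ALS model with the 5-argument overload.
object TrainImplicitSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("als-sketch").setMaster("local[2]"))
    val ratings = sc.parallelize(Seq(
      Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 2.0), Rating(2, 3, 4.0)
    ))
    // arguments: ratings, rank, iterations, lambda, alpha
    val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 1.0)
    println(model.predict(2, 2))
    sc.stop()
  }
}
 {code}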



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7426) spark.ml AttributeFactory.fromStructField should allow other NumericTypes

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7426:
---
Labels:   (was: spark.tc)

 spark.ml AttributeFactory.fromStructField should allow other NumericTypes
 -

 Key: SPARK-7426
 URL: https://issues.apache.org/jira/browse/SPARK-7426
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Mike Dusenberry
Priority: Minor
 Fix For: 1.5.0


 It currently only supports DoubleType, but it should support others, at least 
 for fromStructField (importing into ML attribute format, rather than 
 exporting).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8639) Instructions for executing jekyll in docs/README.md could be slightly more clear, typo in docs/api.md

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8639:
---
Labels:   (was: spark.tc)

 Instructions for executing jekyll in docs/README.md could be slightly more 
 clear, typo in docs/api.md
 -

 Key: SPARK-8639
 URL: https://issues.apache.org/jira/browse/SPARK-8639
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Rosstin Murphy
Assignee: Rosstin Murphy
Priority: Trivial
 Fix For: 1.4.1, 1.5.0


 In docs/README.md, the text states around line 31:
 "Execute 'jekyll' from the 'docs/' directory. Compiling the site with Jekyll 
 will create a directory called '_site' containing index.html as well as the 
 rest of the compiled files."
 It might be more clear if we said:
 "Execute 'jekyll build' from the 'docs/' directory to compile the site. 
 Compiling the site with Jekyll will create a directory called '_site' 
 containing index.html as well as the rest of the compiled files."
 In docs/api.md, "Here you can API docs for Spark and its submodules."
 should be something like: "Here you can read API docs for Spark and its 
 submodules."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7357) Improving HBaseTest example

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7357:
---
Labels:   (was: spark.tc)

 Improving HBaseTest example
 ---

 Key: SPARK-7357
 URL: https://issues.apache.org/jira/browse/SPARK-7357
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.3.1
Reporter: Jihong MA
Assignee: Jihong MA
Priority: Minor
 Fix For: 1.5.0

   Original Estimate: 2m
  Remaining Estimate: 2m

 Minor improvement to the HBaseTest example: when HBase-related configurations, 
 e.g. zookeeper quorum, zookeeper client port, or zookeeper.znode.parent, are 
 not set to the default (localhost:2181), the connection to ZooKeeper might 
 hang, as shown in the following stack:
 15/03/26 18:31:20 INFO zookeeper.ZooKeeper: Initiating client connection, 
 connectString=xxx.xxx.xxx:2181 sessionTimeout=9 
 watcher=hconnection-0x322a4437, quorum=xxx.xxx.xxx:2181, baseZNode=/hbase
 15/03/26 18:31:21 INFO zookeeper.ClientCnxn: Opening socket connection to 
 server 9.30.94.121:2181. Will not attempt to authenticate using SASL (unknown 
 error)
 15/03/26 18:31:21 INFO zookeeper.ClientCnxn: Socket connection established to 
 xxx.xxx.xxx/9.30.94.121:2181, initiating session
 15/03/26 18:31:21 INFO zookeeper.ClientCnxn: Session establishment complete 
 on server xxx.xxx.xxx/9.30.94.121:2181, sessionid = 0x14c53cd311e004b, 
 negotiated timeout = 4
 15/03/26 18:31:21 INFO client.ZooKeeperRegistry: ClusterId read in ZooKeeper 
 is null
 This is because hbase-site.xml is not placed on the Spark classpath. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8746:
---
Labels: documentation test  (was: documentation spark.tc test)

 Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
 --

 Key: SPARK-8746
 URL: https://issues.apache.org/jira/browse/SPARK-8746
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Christian Kadner
Assignee: Christian Kadner
Priority: Trivial
  Labels: documentation, test
 Fix For: 1.4.1, 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) 
 describes how to generate golden answer files for new hive comparison test 
 cases. However the download link for the Hive 0.13.1 jars points to 
 https://hive.apache.org/downloads.html but none of the linked mirror sites 
 still has the 0.13.1 version.
 We need to update the link to 
 https://archive.apache.org/dist/hive/hive-0.13.1/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6485:
---
Labels:   (was: spark.tc)

 Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
 --

 Key: SPARK-6485
 URL: https://issues.apache.org/jira/browse/SPARK-6485
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng

 We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in 
 PySpark. Internally, we can use DataFrames for serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7744) Distributed matrix section in MLlib Data Types documentation should be reordered.

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7744:
---
Labels:   (was: spark.tc)

 Distributed matrix section in MLlib Data Types documentation should be 
 reordered.
 -

 Key: SPARK-7744
 URL: https://issues.apache.org/jira/browse/SPARK-7744
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Reporter: Mike Dusenberry
Assignee: Mike Dusenberry
Priority: Minor
 Fix For: 1.3.2, 1.4.0


 The documentation for BlockMatrix should come after RowMatrix, 
 IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the latter 
 three types, and RowMatrix is considered the basic distributed matrix.  
 This will improve the comprehensibility of the Distributed matrix section, 
 especially for new readers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6785:
---
Labels:   (was: spark.tc)

 DateUtils can not handle date before 1970/01/01 correctly
 -

 Key: SPARK-6785
 URL: https://issues.apache.org/jira/browse/SPARK-6785
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Christian Kadner
 Fix For: 1.5.0


 {code}
 scala> val d = new Date(100)
 d: java.sql.Date = 1969-12-31
 scala> DateUtils.toJavaDate(DateUtils.fromJavaDate(d))
 res1: java.sql.Date = 1970-01-01
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5562) LDA should handle empty documents

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5562:
---
Labels: starter  (was: spark.tc starter)

 LDA should handle empty documents
 -

 Key: SPARK-5562
 URL: https://issues.apache.org/jira/browse/SPARK-5562
 Project: Spark
  Issue Type: Test
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Alok Singh
Priority: Minor
  Labels: starter
 Fix For: 1.5.0

   Original Estimate: 96h
  Remaining Estimate: 96h

 Latent Dirichlet Allocation (LDA) could easily be given empty documents when 
 people select a small vocabulary.  We should check to make sure it is robust 
 to empty documents.
 This will hopefully take the form of a unit test, but may require modifying 
 the LDA implementation.
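
 A hedged sketch of what such a check could exercise (not the actual unit test): build a tiny corpus in which one document is an all-zero sparse vector and confirm that LDA still runs.

 {code}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: corpus of (docId, termCountVector) pairs where doc 2 is empty.
object LdaEmptyDocSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lda-sketch").setMaster("local[2]"))
    val vocabSize = 4
    val corpus = sc.parallelize(Seq(
      (0L, Vectors.dense(1.0, 0.0, 2.0, 0.0)),
      (1L, Vectors.dense(0.0, 3.0, 0.0, 1.0)),
      (2L, Vectors.sparse(vocabSize, Seq()))   // the empty document
    ))
    val model = new LDA().setK(2).setMaxIterations(5).run(corpus)
    println(s"Learned ${model.k} topics over ${model.vocabSize} terms.")
    sc.stop()
  }
}
 {code}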



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7265) Improving documentation for Spark SQL Hive support

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7265:
---
Labels:   (was: spark.tc)

 Improving documentation for Spark SQL Hive support 
 ---

 Key: SPARK-7265
 URL: https://issues.apache.org/jira/browse/SPARK-7265
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.1
Reporter: Jihong MA
Assignee: Jihong MA
Priority: Trivial
 Fix For: 1.5.0


 Miscellaneous documentation improvements for Spark SQL Hive support and YARN 
 cluster deployment. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2859) Update url of Kryo project in related docs

2015-07-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2859:
---
Labels:   (was: spark.tc)

 Update url of Kryo project in related docs
 --

 Key: SPARK-2859
 URL: https://issues.apache.org/jira/browse/SPARK-2859
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Guancheng Chen
Assignee: Guancheng Chen
Priority: Trivial
 Fix For: 1.0.3, 1.1.0


 The Kryo project has been migrated from Google Code to GitHub, hence we need 
 to update its URL in related docs such as tuning.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-07-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1403.

  Resolution: Fixed
Target Version/s:   (was: 1.5.0)

Hey All,

This issue should remain fixed. [~mandoskippy] I think you are just running 
into a different issue that is also in some way related to classloading.

Can you open a new JIRA for your issue, paste in the stack trace and give as 
much information as possible without the environment? Thanks!

 Spark on Mesos does not set Thread's context class loader
 -

 Key: SPARK-1403
 URL: https://issues.apache.org/jira/browse/SPARK-1403
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.3.0, 1.4.0
 Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
 Fix For: 1.0.0


 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark 
 executor on mesos slave throws a  java.lang.ClassNotFoundException for 
 org.apache.spark.serializer.JavaSerializer.
 The lengthy discussion is here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-07-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625739#comment-14625739
 ] 

Patrick Wendell edited comment on SPARK-1403 at 7/14/15 2:59 AM:
-

Hey All,

This issue should remain fixed. [~mandoskippy] I think you are just running 
into a different issue that is also in some way related to classloading.

Can you open a new JIRA for your issue, paste in the stack trace and give as 
much information as possible about the environment? Thanks!


was (Author: pwendell):
Hey All,

This issue should remain fixed. [~mandoskippy] I think you are just running 
into a different issue that is also in some way related to classloading.

Can you open a new JIRA for your issue, paste in the stack trace and give as 
much information as possible without the environment? Thanks!

 Spark on Mesos does not set Thread's context class loader
 -

 Key: SPARK-1403
 URL: https://issues.apache.org/jira/browse/SPARK-1403
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.3.0, 1.4.0
 Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
 Fix For: 1.0.0


 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark 
 executor on mesos slave throws a  java.lang.ClassNotFoundException for 
 org.apache.spark.serializer.JavaSerializer.
 The lengthy discussion is here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2015-07-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624086#comment-14624086
 ] 

Patrick Wendell commented on SPARK-2089:


Yeah - we can open it again later if someone who maintains this code is wanting 
to work on this feature. I just want to have this JIRA reflect the current 
status (i.e. for 5 versions there hasn't been any action in Spark) which is 
that it is not actively being fixed and make sure the documentation correctly 
reflects what we have now, to discourage the use of a feature that does not 
work.

 With YARN, preferredNodeLocalityData isn't honored 
 ---

 Key: SPARK-2089
 URL: https://issues.apache.org/jira/browse/SPARK-2089
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical

 When running in YARN cluster mode, apps can pass preferred locality data when 
 constructing a Spark context that will dictate where to request executor 
 containers.
 This is currently broken because of a race condition.  The Spark-YARN code 
 runs the user class and waits for it to start up a SparkContext.  During its 
 initialization, the SparkContext will create a YarnClusterScheduler, which 
 notifies a monitor in the Spark-YARN code that it is ready.  The Spark-YARN code then 
 immediately fetches the preferredNodeLocationData from the SparkContext and 
 uses it to start requesting containers.
 But in the SparkContext constructor that takes the preferredNodeLocationData, 
 setting preferredNodeLocationData comes after the rest of the initialization, 
 so, if the Spark-YARN code comes around quickly enough after being notified, 
 the data that's fetched is the empty, unset version. This occurred during all 
 of my runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


