[jira] [Updated] (SPARK-6707) Mesos Scheduler should allow the user to specify constraints based on slave attributes

2015-04-04 Thread Ankur Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Chauhan updated SPARK-6707:
-
Description: 
Currently, the Mesos scheduler only looks at the `cpu` and `mem` resources when 
trying to determine the usability of a resource offer from a Mesos slave node. 
It may be preferable for the user to be able to ensure that Spark jobs are 
only started on a certain set of nodes (based on attributes). 

For example, if the user sets a property, say `spark.mesos.constraints`, 
to `tachyon=true;us-east-1=false`, then resource offers will be checked against 
both of these constraints, and only offers that satisfy them will be accepted 
to start new executors.
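
A minimal sketch of what such a check could look like, assuming a simple `key=value;key=value` parsing of the property and string-valued offer attributes (the helper names below are illustrative, not actual scheduler code):

{code}
// Illustrative only: parse "tachyon=true;us-east-1=false" into a map and
// accept an offer only if every constraint matches an offer attribute.
// Assumes a well-formed constraint string.
def parseConstraints(spec: String): Map[String, String] =
  spec.split(";").filter(_.nonEmpty).map { kv =>
    val Array(key, value) = kv.split("=", 2)
    key -> value
  }.toMap

def offerSatisfies(offerAttributes: Map[String, String],
                   constraints: Map[String, String]): Boolean =
  constraints.forall { case (key, value) =>
    offerAttributes.get(key).exists(_ == value)
  }
{code}

Offers passing such a check would then go through the existing cpu and memory checks before executors are launched.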

> Mesos Scheduler should allow the user to specify constraints based on slave 
> attributes
> --
>
> Key: SPARK-6707
> URL: https://issues.apache.org/jira/browse/SPARK-6707
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Scheduler
>Affects Versions: 1.3.0
>Reporter: Ankur Chauhan
>  Labels: mesos, scheduler
>
> Currently, the Mesos scheduler only looks at the `cpu` and `mem` resources 
> when trying to determine the usability of a resource offer from a Mesos 
> slave node. It may be preferable for the user to be able to ensure that 
> Spark jobs are only started on a certain set of nodes (based on attributes). 
> For example, if the user sets a property, say `spark.mesos.constraints`, to 
> `tachyon=true;us-east-1=false`, then resource offers will be checked against 
> both of these constraints, and only offers that satisfy them will be 
> accepted to start new executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6707) Mesos Scheduler should allow the user to specify constraints based on slave attributes

2015-04-04 Thread Ankur Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Chauhan updated SPARK-6707:
-
Description: 
Currently, the Mesos scheduler only looks at the `cpu` and `mem` resources when 
trying to determine the usability of a resource offer from a Mesos slave node. 
It may be preferable for the user to be able to ensure that Spark jobs are 
only started on a certain set of nodes (based on attributes). 

For example, if the user sets a property, say 
{code}spark.mesos.constraints{code}, to 
{code}tachyon=true;us-east-1=false{code}, then resource offers will be checked 
against both of these constraints, and only offers that satisfy them will be 
accepted to start new executors.

  was:
Currently, the Mesos scheduler only looks at the `cpu` and `mem` resources when 
trying to determine the usability of a resource offer from a Mesos slave node. 
It may be preferable for the user to be able to ensure that Spark jobs are 
only started on a certain set of nodes (based on attributes). 

For example, if the user sets a property, say `spark.mesos.constraints`, 
to `tachyon=true;us-east-1=false`, then resource offers will be checked against 
both of these constraints, and only offers that satisfy them will be accepted 
to start new executors.


> Mesos Scheduler should allow the user to specify constraints based on slave 
> attributes
> --
>
> Key: SPARK-6707
> URL: https://issues.apache.org/jira/browse/SPARK-6707
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Scheduler
>Affects Versions: 1.3.0
>Reporter: Ankur Chauhan
>  Labels: mesos, scheduler
>
> Currently, the Mesos scheduler only looks at the `cpu` and `mem` resources 
> when trying to determine the usability of a resource offer from a Mesos 
> slave node. It may be preferable for the user to be able to ensure that 
> Spark jobs are only started on a certain set of nodes (based on attributes). 
> For example, if the user sets a property, say 
> {code}spark.mesos.constraints{code}, to 
> {code}tachyon=true;us-east-1=false{code}, then resource offers will be 
> checked against both of these constraints, and only offers that satisfy 
> them will be accepted to start new executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6707) Mesos Scheduler should allow the user to specify constraints based on slave attributes

2015-04-04 Thread Ankur Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Chauhan updated SPARK-6707:
-
Description: 
Currently, the Mesos scheduler only looks at the 'cpu' and 'mem' resources when 
trying to determine the usability of a resource offer from a Mesos slave node. 
It may be preferable for the user to be able to ensure that Spark jobs are 
only started on a certain set of nodes (based on attributes). 

For example, if the user sets a property, say 
{code}spark.mesos.constraints{code}, to 
{code}tachyon=true;us-east-1=false{code}, then resource offers will be checked 
against both of these constraints, and only offers that satisfy them will be 
accepted to start new executors.

  was:
Currently, the Mesos scheduler only looks at the `cpu` and `mem` resources when 
trying to determine the usability of a resource offer from a Mesos slave node. 
It may be preferable for the user to be able to ensure that Spark jobs are 
only started on a certain set of nodes (based on attributes). 

For example, if the user sets a property, say 
{code}spark.mesos.constraints{code}, to 
{code}tachyon=true;us-east-1=false{code}, then resource offers will be checked 
against both of these constraints, and only offers that satisfy them will be 
accepted to start new executors.


> Mesos Scheduler should allow the user to specify constraints based on slave 
> attributes
> --
>
> Key: SPARK-6707
> URL: https://issues.apache.org/jira/browse/SPARK-6707
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Scheduler
>Affects Versions: 1.3.0
>Reporter: Ankur Chauhan
>  Labels: mesos, scheduler
>
> Currently, the Mesos scheduler only looks at the 'cpu' and 'mem' resources 
> when trying to determine the usability of a resource offer from a Mesos 
> slave node. It may be preferable for the user to be able to ensure that 
> Spark jobs are only started on a certain set of nodes (based on attributes). 
> For example, if the user sets a property, say 
> {code}spark.mesos.constraints{code}, to 
> {code}tachyon=true;us-east-1=false{code}, then resource offers will be 
> checked against both of these constraints, and only offers that satisfy 
> them will be accepted to start new executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5242) "ec2/spark_ec2.py lauch" does not work with VPC if no public DNS or IP is available

2015-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5242:
---

Assignee: (was: Apache Spark)

> "ec2/spark_ec2.py lauch" does not work with VPC if no public DNS or IP is 
> available
> ---
>
> Key: SPARK-5242
> URL: https://issues.apache.org/jira/browse/SPARK-5242
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Vladimir Grigor
>  Labels: easyfix
>
> How to reproduce: a user starting a cluster in a VPC has to wait forever:
> {code}
> ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
> --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
> --subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
> Setting up security groups...
> Searching for existing cluster SparkByScript...
> Spark AMI: ami-1ae0166d
> Launching instances...
> Launched 1 slaves in eu-west-1a, regid = r-e70c5502
> Launched master in eu-west-1a, regid = r-bf0f565a
> Waiting for cluster to enter 'ssh-ready' state..{forever}
> {code}
> The problem is that the current code makes the wrong assumption that a VPC 
> instance has a public_dns_name or public ip_address. In fact, it is more 
> common for a VPC instance to have only a private_ip_address.
> The bug is already fixed in my fork; I am going to submit a pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5242) "ec2/spark_ec2.py lauch" does not work with VPC if no public DNS or IP is available

2015-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5242:
---

Assignee: Apache Spark

> "ec2/spark_ec2.py lauch" does not work with VPC if no public DNS or IP is 
> available
> ---
>
> Key: SPARK-5242
> URL: https://issues.apache.org/jira/browse/SPARK-5242
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Vladimir Grigor
>Assignee: Apache Spark
>  Labels: easyfix
>
> How to reproduce: a user starting a cluster in a VPC has to wait forever:
> {code}
> ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
> --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
> --subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
> Setting up security groups...
> Searching for existing cluster SparkByScript...
> Spark AMI: ami-1ae0166d
> Launching instances...
> Launched 1 slaves in eu-west-1a, regid = r-e70c5502
> Launched master in eu-west-1a, regid = r-bf0f565a
> Waiting for cluster to enter 'ssh-ready' state..{forever}
> {code}
> The problem is that the current code makes the wrong assumption that a VPC 
> instance has a public_dns_name or public ip_address. In fact, it is more 
> common for a VPC instance to have only a private_ip_address.
> The bug is already fixed in my fork; I am going to submit a pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6706) kmeans|| hangs for a long time if both k and vector dimension are large

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395598#comment-14395598
 ] 

Sean Owen commented on SPARK-6706:
--

Hey Xi, I don't think this makes for a good JIRA as it does not sound like 
you've investigated what is happening. For example you can look at exactly what 
step is executing in the UI, and look at the source to know what is being 
computed. It's not clear whether you know it is stuck or simply still 
executing, or whether it's your RDDs that are being computed. Typically it's 
best to reproduce it locally vs master if at all possible. Although providing 
code is good, the whole code dump doesn't narrow it down.

> kmeans|| hangs for a long time if both k and vector dimension are large
> ---
>
> Key: SPARK-6706
> URL: https://issues.apache.org/jira/browse/SPARK-6706
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.1, 1.3.0
> Environment: Windows 64bit, Linux 64bit
>Reporter: Xi Shen
>Assignee: Xiangrui Meng
>  Labels: performance
> Attachments: kmeans-debug.7z
>
>
> When doing k-means clustering with the "kmeans||" algorithm (the default), 
> the algorithm hangs at some "collect" step for a long time.
> Settings:
> - k above 100
> - feature dimension about 360
> - total data size is about 100 MB
> The issue was first noticed with Spark 1.2.1; I tested with both local and 
> cluster mode. On Spark 1.3.0, I can also reproduce this issue in local 
> mode. **However, I do not have a 1.3.0 cluster environment to test with.**



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395599#comment-14395599
 ] 

Sean Owen commented on SPARK-3625:
--

Why would this change be necessary in order to use checkpointing? see the 
discussion above.

> In some cases, the RDD.checkpoint does not work
> ---
>
> Key: SPARK-3625
> URL: https://issues.apache.org/jira/browse/SPARK-3625
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> The code to reproduce:
> {code}
> sc.setCheckpointDir(checkpointDir)
> val c = sc.parallelize((1 to 1000)).map(_ + 1)
> c.count
> val dep = c.dependencies.head.rdd
> c.checkpoint()
> c.count
> assert(dep != c.dependencies.head.rdd)
> {code}
> This limit is too strict, which makes it difficult to implement SPARK-3623.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395609#comment-14395609
 ] 

Sean Owen commented on SPARK-6435:
--

I feel like it's worth fixing, cryptic or not. Try reproducing with the quotes? 
And if it fails, let's go with the PR? It seems reasonable and is confined to 
the Windows support.

> spark-shell --jars option does not add all jars to classpath
> 
>
> Key: SPARK-6435
> URL: https://issues.apache.org/jira/browse/SPARK-6435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Affects Versions: 1.3.0
> Environment: Win64
>Reporter: vijay
>
> Not all jars supplied via the --jars option will be added to the driver (and 
> presumably executor) classpath.  The first jar(s) will be added, but not all.
> To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
> then try to import a class from the last jar.  This fails.  A simple 
> reproducer: 
> Create a bunch of dummy jars:
> jar cfM jar1.jar log.txt
> jar cfM jar2.jar log.txt
> jar cfM jar3.jar log.txt
> jar cfM jar4.jar log.txt
> Start the spark-shell with the dummy jars and guava at the end:
> %SPARK_HOME%\bin\spark-shell --master local --jars 
> jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
> In the shell, try importing from guava; you'll get an error:
> {code}
> scala> import com.google.common.base.Strings
> :19: error: object Strings is not a member of package 
> com.google.common.base
>import com.google.common.base.Strings
>   ^
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work

2015-04-04 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395613#comment-14395613
 ] 

Guoqiang Li commented on SPARK-3625:


Sometimes, when calling RDD.checkpoint, we cannot determine the need for it 
before any job has been executed on this RDD, just like 
[PeriodicGraphCheckpointer|https://github.com/apache/spark/blob/branch-1.3/mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointer.scala].

> In some cases, the RDD.checkpoint does not work
> ---
>
> Key: SPARK-3625
> URL: https://issues.apache.org/jira/browse/SPARK-3625
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> The code to reproduce:
> {code}
> sc.setCheckpointDir(checkpointDir)
> val c = sc.parallelize((1 to 1000)).map(_ + 1)
> c.count
> val dep = c.dependencies.head.rdd
> c.checkpoint()
> c.count
> assert(dep != c.dependencies.head.rdd)
> {code}
> This limit is too strict, which makes it difficult to implement SPARK-3623.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395614#comment-14395614
 ] 

Sean Owen commented on SPARK-6646:
--

I feel like people aren't taking this seriously. What do you think this is, 
some kind of joke?




_OK can we resolve this one? :) _

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6646.

Resolution: Later

Alright -- given the size of the task, I am not sure if I have enough cycles to 
do it at the moment. Let's revisit next year on April 1st.


> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6706) kmeans|| hangs for a long time if both k and vector dimension are large

2015-04-04 Thread Xi Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395622#comment-14395622
 ] 

Xi Shen commented on SPARK-6706:


I know it is more like a user report than a technical report. But I am not 
familiar with the Spark code, and I am currently busy with my studies. I am 
happy to look deeper into this issue, but it may not happen very soon.

As for your question, **It's not clear whether you know it is stuck or simply 
still executing**: I can confirm it is still executing. I observe that one of 
my CPUs is constantly kept busy by a Java process, and if the *k* value is not 
very large, say 500, the job can finish after a long time.

> kmeans|| hangs for a long time if both k and vector dimension are large
> ---
>
> Key: SPARK-6706
> URL: https://issues.apache.org/jira/browse/SPARK-6706
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.1, 1.3.0
> Environment: Windows 64bit, Linux 64bit
>Reporter: Xi Shen
>Assignee: Xiangrui Meng
>  Labels: performance
> Attachments: kmeans-debug.7z
>
>
> When doing k-means clustering with the "kmeans||" algorithm (the default), 
> the algorithm hangs at some "collect" step for a long time.
> Settings:
> - k above 100
> - feature dimension about 360
> - total data size is about 100 MB
> The issue was first noticed with Spark 1.2.1; I tested with both local and 
> cluster mode. On Spark 1.3.0, I can also reproduce this issue in local 
> mode. **However, I do not have a 1.3.0 cluster environment to test with.**



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6548) Adding stddev to DataFrame functions

2015-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395625#comment-14395625
 ] 

Apache Spark commented on SPARK-6548:
-

User 'dreamquster' has created a pull request for this issue:
https://github.com/apache/spark/pull/5357

> Adding stddev to DataFrame functions
> 
>
> Key: SPARK-6548
> URL: https://issues.apache.org/jira/browse/SPARK-6548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: DataFrame, starter
>
> Add it to the list of aggregate functions:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> Also add it to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
> We can either add a Stddev Catalyst expression, or just compute it using 
> existing functions like here: 
> https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776
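
Until a dedicated aggregate exists, a hedged sketch of the second option: compute a population standard deviation with the aggregates that are already available, via the identity stddev = sqrt(E[x^2] - E[x]^2). The DataFrame {{df}} and column name {{x}} below are assumptions for illustration:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Compute E[x^2] and E[x] with the existing avg aggregate, then finish the
// population stddev on the driver. `df` and column "x" are illustrative.
val Row(meanOfSquares: Double, mean: Double) =
  df.agg(avg(col("x") * col("x")), avg(col("x"))).first()

val stddev = math.sqrt(meanOfSquares - mean * mean)
{code}

A built-in Stddev expression would do the same in a single aggregate and avoid the driver-side step.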



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6706) kmeans|| hangs for a long time if both k and vector dimension are large

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395639#comment-14395639
 ] 

Sean Owen commented on SPARK-6706:
--

I tried your code locally vs master with k=1000 (you say >100, but it works at 
500, so I tried 1000), which you can do by building Spark and running the 
shell. I don't see it stuck in any {{collect()}} stage; those complete quickly. 
But, the driver does bog down for a long long time in {{LocalKMeans}}:

{code}
at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:121)
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:104)
at 
org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:311)
at 
org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:522)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:496)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:490)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.GenSeqViewLike$Sliced$class.foreach(GenSeqViewLike.scala:42)
at 
scala.collection.mutable.IndexedSeqView$$anon$2.foreach(IndexedSeqView.scala:80)
at 
org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:490)
at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:513)
at 
org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1$$anonfun$3.apply(LocalKMeans.scala:53)
at 
org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1$$anonfun$3.apply(LocalKMeans.scala:52)
at 
scala.collection.GenTraversableViewLike$Mapped$$anonfun$foreach$2.apply(GenTraversableViewLike.scala:81)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at 
scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at 
scala.collection.GenTraversableViewLike$Mapped$class.foreach(GenTraversableViewLike.scala:80)
at scala.collection.SeqViewLike$$anon$3.foreach(SeqViewLike.scala:78)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at 
scala.collection.SeqViewLike$AbstractTransformed.foldLeft(SeqViewLike.scala:43)
at scala.collection.TraversableOnce$class.sum(TraversableOnce.scala:203)
at 
scala.collection.SeqViewLike$AbstractTransformed.sum(SeqViewLike.scala:43)
at 
org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1.apply$mcVI$sp(LocalKMeans.scala:54)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:49)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$22.apply(KMeans.scala:396)
at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$22.apply(KMeans.scala:393)
{code}

I think this is what Derrick was getting at in SPARK-3220, that this bit 
doesn't scale.
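
For reference, a minimal local reproduction along these lines, using synthetic data that roughly matches the settings reported in the ticket (the dimension and row count below are assumptions), can be run in spark-shell:

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Synthetic ~360-dimensional points; sc is the shell's SparkContext.
val dim = 360
val data = sc.parallelize(1 to 100000, 8).map { _ =>
  Vectors.dense(Array.fill(dim)(scala.util.Random.nextDouble()))
}.cache()

// The default initialization mode is k-means||; with a large k the driver
// spends its time in LocalKMeans.kMeansPlusPlus, matching the stack trace above.
val model = KMeans.train(data, 1000, 10)  // k = 1000, maxIterations = 10
{code}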

> kmeans|| hangs for a long time if both k and vector dimension are large
> ---
>
> Key: SPARK-6706
> URL: https://issues.apache.org/jira/browse/SPARK-6706
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.1, 1.3.0
> Environment: Windows 64bit, Linux 64bit
>Reporter: Xi Shen
>Assignee: Xiangrui Meng
>  Labels: performance
> Attachments: kmeans-debug.7z
>
>
> When doing k-means clustering with the "kmeans||" algorithm (the default), 
> the algorithm hangs at some "collect" step for a long time.
> Settings:
> - k above 100
> - feature dimension about 360
> - total data size is about 100 MB
> The issue was first noticed with Spark 1.2.1; I tested with both local and 
> cluster mode. On Spark 1.3.0, I can also reproduce this issue in local 
> mode. **However, I do not have a 1.3.0 cluster environment to test with.**



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6706) kmeans|| hangs for a long time if both k and vector dimension are large

2015-04-04 Thread Xi Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Shen updated SPARK-6706:
---
Description: 
When doing k-means clustering with the "kmeans||" algorithm (the default), 
the algorithm finishes some {{collect()}} jobs, then the *driver* hangs 
for a long time.

Settings:

- k above 100
- feature dimension about 360
- total data size is about 100 MB

The issue was first noticed with Spark 1.2.1; I tested with both local and 
cluster mode. On Spark 1.3.0, I can also reproduce this issue in local 
mode. **However, I do not have a 1.3.0 cluster environment to test with.**

  was:
When doing k-means clustering with the "kmeans||" algorithm (the default), 
the algorithm hangs at some "collect" step for a long time.

Settings:

- k above 100
- feature dimension about 360
- total data size is about 100 MB

The issue was first noticed with Spark 1.2.1; I tested with both local and 
cluster mode. On Spark 1.3.0, I can also reproduce this issue in local 
mode. **However, I do not have a 1.3.0 cluster environment to test with.**


> kmeans|| hangs for a long time if both k and vector dimension are large
> ---
>
> Key: SPARK-6706
> URL: https://issues.apache.org/jira/browse/SPARK-6706
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.1, 1.3.0
> Environment: Windows 64bit, Linux 64bit
>Reporter: Xi Shen
>Assignee: Xiangrui Meng
>  Labels: performance
> Attachments: kmeans-debug.7z
>
>
> When doing k-means clustering with the "kmeans||" algorithm (the default), 
> the algorithm finishes some {{collect()}} jobs, then the *driver* hangs 
> for a long time.
> Settings:
> - k above 100
> - feature dimension about 360
> - total data size is about 100 MB
> The issue was first noticed with Spark 1.2.1; I tested with both local and 
> cluster mode. On Spark 1.3.0, I can also reproduce this issue in local 
> mode. **However, I do not have a 1.3.0 cluster environment to test with.**



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6706) kmeans|| hangs for a long time if both k and vector dimension are large

2015-04-04 Thread Xi Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Shen updated SPARK-6706:
---
Comment: was deleted

(was: Yes, the {{collect()}} jobs finished, then it hangs at the driver. Your 
words are more accurate.

But I don't observe this behavior with the *random initialization* of k-means. 
I think it is because the *kmeans||* algorithm has a more complex 
initialization algorithm.)

> kmeans|| hangs for a long time if both k and vector dimension are large
> ---
>
> Key: SPARK-6706
> URL: https://issues.apache.org/jira/browse/SPARK-6706
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.1, 1.3.0
> Environment: Windows 64bit, Linux 64bit
>Reporter: Xi Shen
>Assignee: Xiangrui Meng
>  Labels: performance
> Attachments: kmeans-debug.7z
>
>
> When doing k-means clustering with the "kmeans||" algorithm (the default), 
> the algorithm finishes some {{collect()}} jobs, then the *driver* hangs 
> for a long time.
> Settings:
> - k above 100
> - feature dimension about 360
> - total data size is about 100 MB
> The issue was first noticed with Spark 1.2.1; I tested with both local and 
> cluster mode. On Spark 1.3.0, I can also reproduce this issue in local 
> mode. **However, I do not have a 1.3.0 cluster environment to test with.**



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6706) kmeans|| hangs for a long time if both k and vector dimension are large

2015-04-04 Thread Xi Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395641#comment-14395641
 ] 

Xi Shen commented on SPARK-6706:


Yes, the {{collect()}} jobs finished, then it hangs at the driver. Your words 
are more accurate.

But I don't observe this behavior with the *random initialization* of k-means. 
I think it is because the *kmeans||* algorithm has a more complex 
initialization algorithm.
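
For completeness, the random initialization mentioned above can be selected through the {{KMeans}} builder; a minimal sketch, reusing a vector RDD such as the {{data}} from the reproduction sketch earlier in this thread:

{code}
import org.apache.spark.mllib.clustering.KMeans

// Random initialization skips the k-means|| init step (and its LocalKMeans
// pass on the driver); `data` is an RDD[Vector] as in the sketch above.
val model = new KMeans()
  .setK(500)
  .setMaxIterations(10)
  .setInitializationMode(KMeans.RANDOM)
  .run(data)
{code}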

> kmeans|| hangs for a long time if both k and vector dimension are large
> ---
>
> Key: SPARK-6706
> URL: https://issues.apache.org/jira/browse/SPARK-6706
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.1, 1.3.0
> Environment: Windows 64bit, Linux 64bit
>Reporter: Xi Shen
>Assignee: Xiangrui Meng
>  Labels: performance
> Attachments: kmeans-debug.7z
>
>
> When doing k-means clustering with the "kmeans||" algorithm (the default), 
> the algorithm finishes some {{collect()}} jobs, then the *driver* hangs 
> for a long time.
> Settings:
> - k above 100
> - feature dimension about 360
> - total data size is about 100 MB
> The issue was first noticed with Spark 1.2.1; I tested with both local and 
> cluster mode. On Spark 1.3.0, I can also reproduce this issue in local 
> mode. **However, I do not have a 1.3.0 cluster environment to test with.**



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6706) kmeans|| hangs for a long time if both k and vector dimension are large

2015-04-04 Thread Xi Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395642#comment-14395642
 ] 

Xi Shen commented on SPARK-6706:


Yes, the {{collect()}} jobs finished, then it hangs at the driver. Your words 
are more accurate.

But I don't observe this behavior with the *random initialization* of k-means. 
I think it is because the *kmeans||* algorithm has a more complex 
initialization algorithm.

> kmeans|| hangs for a long time if both k and vector dimension are large
> ---
>
> Key: SPARK-6706
> URL: https://issues.apache.org/jira/browse/SPARK-6706
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.1, 1.3.0
> Environment: Windows 64bit, Linux 64bit
>Reporter: Xi Shen
>Assignee: Xiangrui Meng
>  Labels: performance
> Attachments: kmeans-debug.7z
>
>
> When doing k-means clustering with the "kmeans||" algorithm (the default), 
> the algorithm finishes some {{collect()}} jobs, then the *driver* hangs 
> for a long time.
> Settings:
> - k above 100
> - feature dimension about 360
> - total data size is about 100 MB
> The issue was first noticed with Spark 1.2.1; I tested with both local and 
> cluster mode. On Spark 1.3.0, I can also reproduce this issue in local 
> mode. **However, I do not have a 1.3.0 cluster environment to test with.**



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395643#comment-14395643
 ] 

Apache Spark commented on SPARK-6489:
-

User 'dreamquster' has created a pull request for this issue:
https://github.com/apache/spark/pull/5358

> Optimize lateral view with explode to not read unnecessary columns
> --
>
> Key: SPARK-6489
> URL: https://issues.apache.org/jira/browse/SPARK-6489
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Konstantin Shaposhnikov
>  Labels: starter
>
> Currently a query with "lateral view explode(...)" results in an execution 
> plan that reads all columns of the underlying RDD.
> E.g. given that the *ppl* table is a DataFrame created from the Person case 
> class:
> {code}
> case class Person(val name: String, val age: Int, val data: Array[Int])
> {code}
> the following SQL:
> {code}
> select name, sum(d) from ppl lateral view explode(data) d as d group by name
> {code}
> executes as follows:
> {noformat}
> == Physical Plan ==
> Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L]
>  Exchange (HashPartitioning [name#0], 200)
>   Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS 
> PartialSum#38L]
>Project [name#0,d#21]
> Generate explode(data#2), true, false
>  InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation 
> [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), 
> (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at 
> ExistingRDD.scala:35), Some(ppl))
> {noformat}
> Note that *age* column is not needed to produce the output but it is still 
> read from the underlying RDD.
> A sample program to demonstrate the issue:
> {code}
> case class Person(val name: String, val age: Int, val data: Array[Int])
> object ExplodeDemo extends App {
>   val ppl = Array(
> Person("A", 20, Array(10, 12, 19)),
> Person("B", 25, Array(7, 8, 4)),
> Person("C", 19, Array(12, 4, 232)))
>   
>   val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
>   val sc = new SparkContext(conf)
>   val sqlCtx = new HiveContext(sc)
>   import sqlCtx.implicits._
>   val df = sc.makeRDD(ppl).toDF
>   df.registerTempTable("ppl")
> sqlCtx.cacheTable("ppl") // cache the table, otherwise ExistingRDD, which 
> does not support column pruning, will be used
>   val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) 
> d as d group by name")
>   s.explain(true)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files

2015-04-04 Thread Harut Martirosyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395645#comment-14395645
 ] 

Harut Martirosyan commented on SPARK-3928:
--

This stopped working after refactoring/merge.

> Support wildcard matches on Parquet files
> -
>
> Key: SPARK-3928
> URL: https://issues.apache.org/jira/browse/SPARK-3928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Reporter: Nicholas Chammas
>Priority: Minor
> Fix For: 1.3.0
>
>
> {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
> {{2014-\?\?-\?\?}}. 
> It would be nice if {{SparkContext.parquetFile()}} did the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-04-04 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-4036:
--
Attachment: CRF_design.1.pdf

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6242) Support replace (drop) column for parquet table

2015-04-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6242.
---
Resolution: Duplicate

> Support replace (drop) column for parquet table
> ---
>
> Key: SPARK-6242
> URL: https://issues.apache.org/jira/browse/SPARK-6242
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: chirag aggarwal
>Assignee: Cheng Lian
>
> SPARK-5528 provides an easy way to support adding columns to Parquet tables. 
> This is done by using the native Parquet capability of merging the schema 
> from all the part-files and _common_metadata files.
> But if someone wants to drop a column from a Parquet table, this still 
> does not work. This happens because the merged schema will still show the 
> dropped column, while the column is no longer in the metastore. So the 
> schemas obtained from the two sources do not match, and hence any subsequent 
> query on this table fails.
> Instead of checking for an exact match between the two schemas, Spark should 
> only check whether the schema obtained from the metastore is a subset of the 
> Parquet merged schema. If this check passes, then the columns present in the 
> metastore should be allowed to be referenced in the query.
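
A hedged sketch of the relaxed check described above, comparing a metastore schema against the Parquet-merged schema (a standalone helper for illustration only, not the actual Spark code path):

{code}
import org.apache.spark.sql.types.StructType

// Accept the table if every metastore column appears, with the same data type,
// in the merged Parquet schema, instead of requiring an exact schema match.
def metastoreIsSubsetOf(metastoreSchema: StructType,
                        mergedParquetSchema: StructType): Boolean =
  metastoreSchema.fields.forall { field =>
    mergedParquetSchema.fields.exists { candidate =>
      candidate.name == field.name && candidate.dataType == field.dataType
    }
  }
{code}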



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6262) Python MLlib API missing items: Statistics

2015-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6262:
---

Assignee: Apache Spark  (was: Kai Sasaki)

> Python MLlib API missing items: Statistics
> --
>
> Key: SPARK-6262
> URL: https://issues.apache.org/jira/browse/SPARK-6262
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> This JIRA lists items missing in the Python API for this sub-package of MLlib.
> This list may be incomplete, so please check again when sending a PR to add 
> these features to the Python API.
> Also, please check for major disparities between documentation; some parts of 
> the Python API are less well-documented than their Scala counterparts.  Some 
> items may be listed in the umbrella JIRA linked to this task.
> MultivariateStatisticalSummary
> * normL1
> * normL2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6262) Python MLlib API missing items: Statistics

2015-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395721#comment-14395721
 ] 

Apache Spark commented on SPARK-6262:
-

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/5359

> Python MLlib API missing items: Statistics
> --
>
> Key: SPARK-6262
> URL: https://issues.apache.org/jira/browse/SPARK-6262
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Kai Sasaki
>
> This JIRA lists items missing in the Python API for this sub-package of MLlib.
> This list may be incomplete, so please check again when sending a PR to add 
> these features to the Python API.
> Also, please check for major disparities between documentation; some parts of 
> the Python API are less well-documented than their Scala counterparts.  Some 
> items may be listed in the umbrella JIRA linked to this task.
> MultivariateStatisticalSummary
> * normL1
> * normL2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6262) Python MLlib API missing items: Statistics

2015-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6262:
---

Assignee: Kai Sasaki  (was: Apache Spark)

> Python MLlib API missing items: Statistics
> --
>
> Key: SPARK-6262
> URL: https://issues.apache.org/jira/browse/SPARK-6262
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Kai Sasaki
>
> This JIRA lists items missing in the Python API for this sub-package of MLlib.
> This list may be incomplete, so please check again when sending a PR to add 
> these features to the Python API.
> Also, please check for major disparities between documentation; some parts of 
> the Python API are less well-documented than their Scala counterparts.  Some 
> items may be listed in the umbrella JIRA linked to this task.
> MultivariateStatisticalSummary
> * normL1
> * normL2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6708) Using Hive UDTF may throw ClassNotFoundException

2015-04-04 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6708:
-

 Summary: Using Hive UDTF may throw ClassNotFoundException
 Key: SPARK-6708
 URL: https://issues.apache.org/jira/browse/SPARK-6708
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.2.1, 1.1.1
Reporter: Cheng Lian


Spark shell session for reproducing this issue:
{code}
import sqlContext._

sql("create table t1 (str string)")
sql("select v.va from t1 lateral view json_tuple(str, 'a') v as 
va").queryExecution.analyzed
{code}
Exception thrown:
{noformat}
java.lang.ClassNotFoundException: json_tuple
at 
scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:148)
at 
org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:274)
at 
org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:274)
at 
org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:280)
at 
org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:280)
at 
org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:285)
at 
org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:285)
at 
org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:291)
at 
org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60)
at 
org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$2.apply(basicOperators.scala:60)
at 
org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$2.apply(basicOperators.scala:60)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.sql.catalyst.plans.logical.Generate.generatorOutput(basicOperators.scala:60)
at 
org.apache.spark.sql.catalyst.plans.logical.Generate.output(basicOperators.scala:70)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:117)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:117)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:117)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2$$anonfun$11.apply(Analyzer.scala:292)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2$$anonfun$11.apply(Analyzer.scala:292)
at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:292)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:252)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:252)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:251)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
at scala.collection.Iterator$$

[jira] [Commented] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-04-04 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395766#comment-14395766
 ] 

Cheng Lian commented on SPARK-6587:
---

JSON needs this kind of schema inference because JSON is weakly typed. The JSON 
sample you provided is actually considered dirty data rather than OO-like 
"polymorphism", so the type reconciliation in the case of JSON is designed to 
deal with dirty data. Scala case classes are already well typed, so there 
shouldn't be that kind of dirty, conflicting data.

I think the thing you're looking for is actually [union types in 
Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-UnionTypes],
 which unfortunately is not supported in Spark SQL yet.

> Inferring schema for case class hierarchy fails with mysterious message
> ---
>
> Key: SPARK-6587
> URL: https://issues.apache.org/jira/browse/SPARK-6587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: At least Windows 8, Scala 2.11.2.  
>Reporter: Spiro Michaylov
>
> (Don't know if this is a functionality bug, error reporting bug or an RFE ...)
> I define the following hierarchy:
> {code}
> private abstract class MyHolder
> private case class StringHolder(s: String) extends MyHolder
> private case class IntHolder(i: Int) extends MyHolder
> private case class BooleanHolder(b: Boolean) extends MyHolder
> {code}
> and a top level case class:
> {code}
> private case class Thing(key: Integer, foo: MyHolder)
> {code}
> When I try to convert it:
> {code}
> val things = Seq(
>   Thing(1, IntHolder(42)),
>   Thing(2, StringHolder("hello")),
>   Thing(3, BooleanHolder(false))
> )
> val thingsDF = sc.parallelize(things, 4).toDF()
> thingsDF.registerTempTable("things")
> val all = sqlContext.sql("SELECT * from things")
> {code}
> I get the following stack trace:
> {noformat}
> Exception in thread "main" scala.MatchError: 
> sql.CaseClassSchemaProblem.MyHolder (of class 
> scala.reflect.internal.Types$ClassNoArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
>   at scala.collection.immutable.List.map(List.scala:276)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
>   at 
> org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
>   at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
>   at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {noformat}
> I wrote this to answer [a question on 
> StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
>  which uses a much simpler approach and suffers the same problem.
> Looking at what seems to me to be the [relevant unit test 
> suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala]
>  I see that this case is not covered.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395763#comment-14395763
 ] 

Sean Owen commented on SPARK-6569:
--

Let's make it info level then.
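
The unsubstituted ${part.fromOffset} in the logged text suggests the message string is simply missing Scala's s interpolator; a minimal standalone illustration of that effect (the OffsetRange case class here is only a stand-in for the real one):

{code}
// Without the leading `s`, Scala keeps "${...}" literally in the string, which
// is exactly what the warning above shows. Stand-in type for illustration.
case class OffsetRange(fromOffset: Long)
val part = OffsetRange(42L)

val missingInterpolator = "Beginning offset ${part.fromOffset} is the same as ending offset"
val interpolated        = s"Beginning offset ${part.fromOffset} is the same as ending offset"
// missingInterpolator keeps the literal placeholder; interpolated contains 42.
{code}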

> Kafka directInputStream logs what appear to be incorrect warnings
> -
>
> Key: SPARK-6569
> URL: https://issues.apache.org/jira/browse/SPARK-6569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
> Environment: Spark 1.3.0
>Reporter: Platon Potapov
>Priority: Minor
>
> During what appears to be normal operation of streaming from a Kafka topic, 
> the following log records are observed, logged periodically:
> {code}
> [Stage 391:==>  (3 + 0) / 
> 4]
> 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
> same as ending offset skipping raw 0
> 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
> same as ending offset skipping raw 0
> 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
> same as ending offset skipping raw 0
> {code}
> * the part.fromOffset placeholder is not correctly substituted with a value
> * does this condition really mandate a warning being logged?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Commented] (SPARK-6695) Add an external iterator: a hadoop-like output collector

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395773#comment-14395773
 ] 

Sean Owen commented on SPARK-6695:
--

The problem with this in the general case is that it raises questions like: where 
are you allowed to spill, and how much? The spilled data has to be cleaned up, but 
how do you know when it can be, since evaluation happens at some future point? It 
also gets much slower, which may not solve much.

(As an aside for this particular function, it makes me think that your settings 
aren't causing it to do much sampling at all. I think the partial products this 
returns for each row are intended to be pretty sparse. If you're running out of 
memory then that is likely the problem?)

I think that in general, a function that uses a large amount of interim memory 
is going to get into trouble in Spark, and a bunch of I/O just pushes the problem 
around. For example, it might be possible here to decompose the overall flatMap 
over an iterator of rows into a flatMap of a flatMapping of each row, where each 
inner flatMap emits partial products from just one element of the row. I think 
you'd get a much lower peak memory usage for free, but I haven't thought it 
through 100%.
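
A rough sketch of that decomposition, with stand-in names for the per-element work (not the reporter's actual code):

{code}
import org.apache.spark.rdd.RDD

// Stand-in for the real per-element computation.
def partialProductsForElement(row: Array[Double], i: Int): Iterator[Double] =
  row.iterator.map(_ * row(i))

// Decomposed version: flatMap over rows, then flatMap over the elements of each
// row, emitting partial products lazily instead of materializing one big buffer
// per row (or per partition).
def partialProducts(rows: RDD[Array[Double]]): RDD[Double] =
  rows.flatMap { row =>
    row.indices.iterator.flatMap(i => partialProductsForElement(row, i))
  }
{code}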

> Add an external iterator: a hadoop-like output collector
> 
>
> Key: SPARK-6695
> URL: https://issues.apache.org/jira/browse/SPARK-6695
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: uncleGen
>
> In practical use, we often need to create a big iterator, which may be too big 
> in memory usage or too long in array size. On the one hand, it leads to too 
> much memory consumption. On the other hand, one `Array` may not hold all the 
> elements, as Java array indices are of type 'int' (4 bytes or 32 bits). So, 
> IMHO, we could provide a `collector`, which has a buffer (100MB or some other 
> size) and can spill data to disk. The use case may look like:
> {code: borderStyle=solid}
>rdd.mapPartition { it => 
>   ...
>   val collector = new ExternalCollector()
>   collector.collect(a)
>   ...
>   collector.iterator
>   }
>
> {code}
> I have done some related work, and I need your opinions. Thanks!
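
For illustration only, a minimal sketch of such a collector (not the proposed implementation): it buffers elements up to a fixed count, spills the buffer to a temp file via Java serialization, and replays spilled elements before the in-memory tail. Error handling and cleanup are omitted.

{code}
import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

class ExternalCollector[T <: Serializable](maxInMemory: Int = 100000) {
  private val buffer = new ArrayBuffer[T]()
  private var spillFile: Option[File] = None
  private var spillOut: Option[ObjectOutputStream] = None
  private var spilledCount = 0

  def collect(elem: T): Unit = {
    buffer += elem
    if (buffer.size >= maxInMemory) spill()
  }

  private def spill(): Unit = {
    val out = spillOut.getOrElse {
      val f = File.createTempFile("external-collector", ".bin")
      f.deleteOnExit()
      val o = new ObjectOutputStream(new FileOutputStream(f))
      spillFile = Some(f); spillOut = Some(o)
      o
    }
    buffer.foreach(out.writeObject)
    out.reset()               // keep the serializer's handle table from growing
    spilledCount += buffer.size
    buffer.clear()
  }

  def iterator: Iterator[T] = {
    spillOut.foreach(_.flush())
    val spilled = spillFile match {
      case Some(f) =>
        val in = new ObjectInputStream(new FileInputStream(f))
        Iterator.fill(spilledCount)(in.readObject().asInstanceOf[T])
      case None => Iterator.empty
    }
    spilled ++ buffer.iterator     // replay disk first, then the in-memory tail
  }
}
{code}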



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-04-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6587:
--
Comment: was deleted

(was: JSON needs this kind of schema inference because JSON is weakly typed. 
The JSON sample you provided is actually considered dirty data rather than 
OO-like "polymorphism", so the type reconciliation in the case of JSON is 
designed to deal with dirty data. Scala case classes are already well typed, so 
there shouldn't be this kind of dirty, conflicting data.

I think the thing you're looking for is actually [union types in 
Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-UnionTypes],
 which unfortunately is not supported in Spark SQL yet.)

> Inferring schema for case class hierarchy fails with mysterious message
> ---
>
> Key: SPARK-6587
> URL: https://issues.apache.org/jira/browse/SPARK-6587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: At least Windows 8, Scala 2.11.2.  
>Reporter: Spiro Michaylov
>
> (Don't know if this is a functionality bug, error reporting bug or an RFE ...)
> I define the following hierarchy:
> {code}
> private abstract class MyHolder
> private case class StringHolder(s: String) extends MyHolder
> private case class IntHolder(i: Int) extends MyHolder
> private case class BooleanHolder(b: Boolean) extends MyHolder
> {code}
> and a top level case class:
> {code}
> private case class Thing(key: Integer, foo: MyHolder)
> {code}
> When I try to convert it:
> {code}
> val things = Seq(
>   Thing(1, IntHolder(42)),
>   Thing(2, StringHolder("hello")),
>   Thing(3, BooleanHolder(false))
> )
> val thingsDF = sc.parallelize(things, 4).toDF()
> thingsDF.registerTempTable("things")
> val all = sqlContext.sql("SELECT * from things")
> {code}
> I get the following stack trace:
> {noformat}
> Exception in thread "main" scala.MatchError: 
> sql.CaseClassSchemaProblem.MyHolder (of class 
> scala.reflect.internal.Types$ClassNoArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
>   at scala.collection.immutable.List.map(List.scala:276)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
>   at 
> org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
>   at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
>   at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {noformat}
> I wrote this to answer [a question on 
> StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
>  which uses a much simpler approach and suffers the same problem.
> Looking at what seems to me to be the [relevant unit test 
> suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala]
>  I see that this case is not covered.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-04-04 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395765#comment-14395765
 ] 

Cheng Lian commented on SPARK-6587:
---

JSON needs this kind of schema inference because JSON is weakly typed. The JSON 
sample you provided is actually considered dirty data rather than OO-like 
"polymorphism", so the type reconciliation in the case of JSON is designed to 
deal with dirty data. Scala case classes are already well typed, so there 
shouldn't be this kind of dirty, conflicting data.

I think the thing you're looking for is actually [union types in 
Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-UnionTypes],
 which unfortunately is not supported in Spark SQL yet.

> Inferring schema for case class hierarchy fails with mysterious message
> ---
>
> Key: SPARK-6587
> URL: https://issues.apache.org/jira/browse/SPARK-6587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: At least Windows 8, Scala 2.11.2.  
>Reporter: Spiro Michaylov
>
> (Don't know if this is a functionality bug, error reporting bug or an RFE ...)
> I define the following hierarchy:
> {code}
> private abstract class MyHolder
> private case class StringHolder(s: String) extends MyHolder
> private case class IntHolder(i: Int) extends MyHolder
> private case class BooleanHolder(b: Boolean) extends MyHolder
> {code}
> and a top level case class:
> {code}
> private case class Thing(key: Integer, foo: MyHolder)
> {code}
> When I try to convert it:
> {code}
> val things = Seq(
>   Thing(1, IntHolder(42)),
>   Thing(2, StringHolder("hello")),
>   Thing(3, BooleanHolder(false))
> )
> val thingsDF = sc.parallelize(things, 4).toDF()
> thingsDF.registerTempTable("things")
> val all = sqlContext.sql("SELECT * from things")
> {code}
> I get the following stack trace:
> {noformat}
> Exception in thread "main" scala.MatchError: 
> sql.CaseClassSchemaProblem.MyHolder (of class 
> scala.reflect.internal.Types$ClassNoArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
>   at scala.collection.immutable.List.map(List.scala:276)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
>   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
>   at 
> org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
>   at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
>   at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {noformat}
> I wrote this to answer [a question on 
> StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
>  which uses a much simpler approach and suffers the same problem.
> Looking at what seems to me to be the [relevant unit test 
> suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala]
>  I see that this case is not covered.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6704) integrate SparkR docs build tool into Spark doc build

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6704:
-
Component/s: SparkR

(I went ahead and made a SparkR component in JIRA.)

> integrate SparkR docs build tool into Spark doc build
> -
>
> Key: SPARK-6704
> URL: https://issues.apache.org/jira/browse/SPARK-6704
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Davies Liu
>Priority: Blocker
>
> We should integrate the SparkR docs build tool into Spark one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5654) Integrate SparkR into Apache Spark

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5654:
-
Component/s: (was: Project Infra)
 SparkR

> Integrate SparkR into Apache Spark
> --
>
> Key: SPARK-5654
> URL: https://issues.apache.org/jira/browse/SPARK-5654
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> The SparkR project [1] provides a light-weight frontend to launch Spark jobs 
> from R. The project was started at the AMPLab around a year ago and has been 
> incubated as its own project to make sure it can be easily merged into 
> upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s 
> goals are similar to PySpark's, and it shares a similar design pattern, as 
> described in our meetup talk [2] and Spark Summit presentation [3].
> Integrating SparkR into the Apache project will enable R users to use Spark 
> out of the box and given R’s large user base, it will help the Spark project 
> reach more users.  Additionally, work-in-progress features like providing R 
> integration with ML Pipelines and DataFrames can be better achieved by 
> development in a unified code base.
> SparkR is available under the Apache 2.0 License and does not have any 
> external dependencies other than requiring users to have R and Java installed 
> on their machines.  SparkR’s developers come from many organizations, 
> including UC Berkeley, Alteryx, and Intel, and we will support future 
> development and maintenance after the integration.
> [1] https://github.com/amplab-extras/SparkR-pkg
> [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
> [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6622) Spark SQL cannot communicate with Hive meta store

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6622:
-
Component/s: (was: Spark Submit)
 SQL

> Spark SQL cannot communicate with Hive meta store
> -
>
> Key: SPARK-6622
> URL: https://issues.apache.org/jira/browse/SPARK-6622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Deepak Kumar V
>  Labels: Hive
> Attachments: exception.txt
>
>
> I have multiple tables (among them dw_bid) that are created through Apache 
> Hive.  I have data in Avro on HDFS that I want to join with the dw_bid table; 
> this join needs to be done using Spark SQL.  
> Spark SQL is unable to communicate with the Apache Hive metastore and fails 
> with the exception:
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
> connection to the given database. JDBC url = 
> jdbc:mysql://hostname.vip.company.com:3306/HDB, username = hiveuser. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: No suitable driver found for 
> jdbc:mysql://hostname.vip.company.com:3306/HDB
>   at java.sql.DriverManager.getConnection(DriverManager.java:596)
> Spark Submit Command
> ./bin/spark-submit -v --master yarn-cluster --driver-class-path 
> /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
>  --jars 
> /apache/hadoop/lib/hadoop-lzo-0.6.0.jar,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.35-bin.jar,/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/conf/hive-site.xml
>  --num-executors 1 --driver-memory 4g --driver-java-options 
> "-XX:MaxPermSize=2G" --executor-memory 2g --executor-cores 1 --queue 
> hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp 
> spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 
> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro 
> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
> MySQL Java Connector versions tried:
> mysql-connector-java-5.0.8-bin.jar (Picked from Apache Hive installation lib 
> folder)
> mysql-connector-java-5.1.34.jar
> mysql-connector-java-5.1.35.jar
> Spark Version: 1.3.0 - Prebuilt for Hadoop 2.4.x 
> (http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz)
> $ hive --version
> Hive 0.13.0.2.1.3.6-2
> Subversion 
> git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6
>  -r 87da9430050fb9cc429d79d95626d26ea382b96c



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395780#comment-14395780
 ] 

Sean Owen commented on SPARK-6399:
--

Hey Marcelo, where would you like to document this? Maybe in the programming 
guide? I agree that this is, at best, an issue to document.

> Code compiled against 1.3.0 may not run against older Spark versions
> 
>
> Key: SPARK-6399
> URL: https://issues.apache.org/jira/browse/SPARK-6399
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 1.3.0
>Reporter: Marcelo Vanzin
>
> Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're 
> easier to use. The problem is that scalac now generates code that will not 
> run on older Spark versions if those conversions are used.
> Basically, even if you explicitly import {{SparkContext._}}, scalac will 
> generate references to the new methods in the {{RDD}} object instead. So the 
> compiled code will reference code that doesn't exist in older versions of 
> Spark.
> You can work around this by explicitly calling the methods in the 
> {{SparkContext}} object, although that's a little ugly.
> We should at least document this limitation (if there's no way to fix it), 
> since I believe forwards compatibility in the API was also a goal.
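
As a hedged sketch of the workaround mentioned above (explicitly calling the conversion methods on the {{SparkContext}} object rather than relying on the implicits), something along these lines avoids emitting references to the 1.3-only methods on the {{RDD}} companion object; the exact conversion to call depends on the RDD type involved:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Call the conversion explicitly so scalac references
// SparkContext.rddToPairRDDFunctions (present in older releases) instead of
// the new implicit on the RDD companion object.
def wordCount(words: RDD[String]): RDD[(String, Int)] =
  SparkContext.rddToPairRDDFunctions(words.map(w => (w, 1))).reduceByKey(_ + _)
{code}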



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6061.
--
Resolution: Duplicate

This reduces to the same underlying issue as in SPARK-3276, which is being able 
to configure {{MIN_REMEMBER_DURATION}}.

> File source dstream can not include the old file which timestamp is before 
> the system time
> --
>
> Key: SPARK-6061
> URL: https://issues.apache.org/jira/browse/SPARK-6061
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.1
>Reporter: Jack Hu
>  Labels: FileSourceDStream, OlderFiles, Streaming
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The file source dstream (StreamingContext.fileStream) has a property named 
> "newFilesOnly" to include old files. It worked fine in 1.1.0 but is broken in 
> 1.2.1: the older files are always ignored no matter what value is set.  
> Here is the simple reproduce code:
> https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb
> The reason is that the "modTimeIgnoreThreshold" in 
> FileInputDStream::findNewFiles is set to a time close to the system time (the 
> Spark Streaming clock time), so files older than this time are ignored. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3276) Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3276:
-
Summary: Provide a API to specify MIN_REMEMBER_DURATION for files to 
consider as input in streaming  (was: Provide a API to specify whether the old 
files need to be ignored in file input text DStream)

See SPARK-6061. I'm going to hijack this slightly to suggest that the real 
issue is not being able to choose to see only new files, which you already can, 
but controlling {{MIN_REMEMBER_DURATION}}, which is currently hard-coded.

> Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input 
> in streaming
> --
>
> Key: SPARK-3276
> URL: https://issues.apache.org/jira/browse/SPARK-3276
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Jack Hu
>Priority: Minor
>
> Currently, there is only one API, textFileStream in StreamingContext, to 
> create a text file dstream, and it always ignores old files. Sometimes the old 
> files are still useful.
> We need an API to let the user choose whether the old files should be ignored 
> or not.
> The API currently in StreamingContext:
> def textFileStream(directory: String): DStream[String] = {
> fileStream[LongWritable, Text, 
> TextInputFormat](directory).map(_._2.toString)
>   }
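
For reference, the existing way to include old files is the more general fileStream overload and its newFilesOnly flag (a minimal sketch; how far back old files are picked up is still limited by the hard-coded MIN_REMEMBER_DURATION discussed in the comment above):

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Like textFileStream, but with newFilesOnly = false so files already present
// in the directory are eligible as well.
def textFileStreamIncludingOld(ssc: StreamingContext, directory: String): DStream[String] =
  ssc.fileStream[LongWritable, Text, TextInputFormat](
    directory, (_: Path) => true, newFilesOnly = false).map(_._2.toString)
{code}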



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5261.
--
Resolution: Duplicate

> In some cases ,The value of word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36)
> {code}
> The average absolute value of the word's vector representation is 60731.8
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(1)
> {code}
> The average  absolute value of the word's vector representation is 0.13889



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5033) Spark 1.1.0/1.1.1/1.2.0 can't run well in HDP on Windows

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5033.
--
Resolution: Duplicate

> Spark 1.1.0/1.1.1/1.2.0 can't run well in HDP on Windows
> 
>
> Key: SPARK-5033
> URL: https://issues.apache.org/jira/browse/SPARK-5033
> Project: Spark
>  Issue Type: Bug
>  Components: Windows, YARN
>Affects Versions: 1.1.0, 1.2.0
> Environment: HDInsight 3.1 in Azure
>Reporter: Rice
>  Labels: easyfix
>
> After installation, when I run .\bin\spark-shell --master yarn, YarnClient 
> reports an error when running commands like the following:
> %JAVA_HOME%/bin/java -server -cp 
> %CLASSPATH%;C:\hdp\spark-1.1.1\lib\spark-assembly-1.1.1-hadoop2.4.0.jar 
> -Xmx512m -Djava.io.tmpdir=%PWD%/tmp 
> '-Dspark.tachyonStore.folderName=spark-919783cd-bdf7-4e6b-86bf-011244e4a49f' 
> '-Dspark.yarn.secondary.jars=' 
> '-Dspark.repl.class.uri=http://192.168.0.13:12972' 
> '-Dspark.driver.host=HOME-HYPERVS' '-Dspark.driver.appUIHistoryAddress=' 
> '-Dspark.app.name=Spark shell' 
> '-Dspark.driver.appUIAddress=HOME-HYPERVS:4040' '-Dspark.jars=' 
> '-Dspark.fileserver.uri=http://192.168.0.13:12992' 
> '-Dspark.master=yarn-client' '-Dspark.driver.port=12988' 
> org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar  null  
> --arg  'HOME-HYPERVS:12988' --executor-memory 1024 --executor-cores 1 
> --num-executors  2 
> It fails because single quotes are used instead of double quotes. The 
> following method in YarnSparkHadoopUtil.scala needs to be modified:
> def escapeForShell(arg: String): String = {
> if (arg != null) {
>   val escaped = new StringBuilder("'")
>   for (i <- 0 to arg.length() - 1) {
> arg.charAt(i) match {
>   case '$' => escaped.append("\\$")
>   case '"' => escaped.append("\\\"")
>   case '\'' => escaped.append("'\\''")
>   case c => escaped.append(c)
> }
>   }
>   escaped.append("'").toString()
> } else {
>   arg
> }
>   }
> After changing the single quotes to double quotes, the command works.
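
A sketch of the double-quoted variant the report suggests (illustrative only, not the actual patch; the complete quoting rules of the target shell still need care):

{code}
// Wrap the argument in double quotes and escape the characters that are
// special inside a double-quoted string. A sketch of the reporter's idea,
// not a full treatment of Windows cmd quoting.
def escapeForShellDoubleQuoted(arg: String): String = {
  if (arg != null) {
    val escaped = new StringBuilder("\"")
    arg.foreach {
      case '"'  => escaped.append("\\\"")
      case '$'  => escaped.append("\\$")
      case '\\' => escaped.append("\\\\")
      case c    => escaped.append(c)
    }
    escaped.append("\"").toString()
  } else {
    arg
  }
}
{code}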



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6709) SparkSQL cannot parse sql correctly when the table contains "count" column.

2015-04-04 Thread Patrick Liu (JIRA)
Patrick Liu created SPARK-6709:
--

 Summary: SparkSQL cannot parse sql correctly when the table 
contains "count" column.
 Key: SPARK-6709
 URL: https://issues.apache.org/jira/browse/SPARK-6709
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.2.1, 1.2.0, 1.1.1, 1.1.0
Reporter: Patrick Liu


bin/spark-shell
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlContext.implicits._
scala> case class what(id: Int, count: Int)
scala> val whats = sc.parallelize( 0 to 10).map(x => what(x, x*10)).toDF()
scala> whats.registerTempTable("whats")
scala> sqlContext.sql("select * from whats where count < 20").collect

Error Log:
scala> sqlContext.sql("select * from whats where count < 20").collect
java.lang.RuntimeException: [1.33] failure: ``('' expected but `<' found

select * from whats where count < 20
^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:915)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
at $iwC$$iwC$$iwC.<init>(<console>:44)
at $iwC$$iwC.<init>(<console>:46)
at $iwC.<init>(<console>:48)
at <init>(<console>:50)
at .<init>(<console>:54)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
at 
org.apache.spark.repl.SparkILoop.interpre
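
A possible workaround until the parser handles this (a sketch, assuming the whats DataFrame from the reproduction above) is to express the filter through the DataFrame API, which never parses the word "count" as the aggregate keyword:

{code}
// Filter on the "count" column without going through the SQL text parser.
whats.filter(whats("count") < 20).collect()
{code}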

[jira] [Resolved] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4900.
--
Resolution: Fixed

Given the lack of follow-up, I think this is as resolved as we're going to make it.

> MLlib SingularValueDecomposition ARPACK IllegalStateException 
> --
>
> Key: SPARK-4900
> URL: https://issues.apache.org/jira/browse/SPARK-4900
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.1, 1.2.0, 1.2.1
> Environment: Ubuntu 1410, Java HotSpot(TM) 64-Bit Server VM (build 
> 25.25-b02, mixed mode)
> spark local mode
>Reporter: Mike Beyer
>Assignee: Sean Owen
> Fix For: 1.3.0
>
>
> java.lang.reflect.InvocationTargetException
> ...
> Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 
> Please refer ARPACK user guide for error message.
> at 
> org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171)
>   ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4900:
-
Description: 
java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 
Please refer ARPACK user guide for error message.
at 
org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171)
...

  was:

java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 
Please refer ARPACK user guide for error message.
at 
org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171)
...

   Priority: Minor  (was: Major)

> MLlib SingularValueDecomposition ARPACK IllegalStateException 
> --
>
> Key: SPARK-4900
> URL: https://issues.apache.org/jira/browse/SPARK-4900
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.1, 1.2.0, 1.2.1
> Environment: Ubuntu 1410, Java HotSpot(TM) 64-Bit Server VM (build 
> 25.25-b02, mixed mode)
> spark local mode
>Reporter: Mike Beyer
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.3.0
>
>
> java.lang.reflect.InvocationTargetException
> ...
> Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 
> Please refer ARPACK user guide for error message.
> at 
> org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171)
>   ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4830) Spark Streaming Java Application : java.lang.ClassNotFoundException

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4830.
--
Resolution: Duplicate

Yes, I strongly suspect this was resolved by SPARK-4660, since it would correct 
the classloader used to resolve the classes in this case as well via 
{{JavaDeserializationStream}}. I don't see any other likely reason that the 
user class would not be found. I think we can reopen if there's evidence it 
still persists in 1.3+. Given the lack of activity I doubt a different next 
step is otherwise coming.

> Spark Streaming Java Application : java.lang.ClassNotFoundException
> ---
>
> Key: SPARK-4830
> URL: https://issues.apache.org/jira/browse/SPARK-4830
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Mykhaylo Telizhyn
>
> h4. Application Overview:
>   
>We have a Spark Streaming application that consumes messages from 
> RabbitMQ and processes them. When generating hundreds of events on RabbitMQ 
> and running our application on a Spark standalone cluster, we see some 
> {{java.lang.ClassNotFoundException}} exceptions in the log. 
> Our domain model is simple POJO that represents RabbitMQ events we want to 
> consume and contains some custom properties we are interested in: 
> {code:title=com.xxx.Event.java|borderStyle=solid}
> public class Event implements java.io.Externalizable {
> 
> // custom properties
> // custom implementation of writeExternal(), readExternal() 
> methods
> }
> {code}
> We have implemented a custom Spark Streaming receiver that just 
> receives messages from a RabbitMQ queue by means of a custom consumer (see 
> _"Receiving messages by subscription"_ at 
> https://www.rabbitmq.com/api-guide.html), converts them to our custom domain 
> event objects ({{com.xxx.Event}}), and stores them in Spark memory:
> {code:title=RabbitMQReceiver.java|borderStyle=solid}
> byte[] body = // data received from Rabbit using custom consumer
> Event event = new Event(body);
> store(event)  // store into Spark  
> {code}
> The main program is simple; it just sets up the Spark streaming context:
> {code:title=Application.java|borderStyle=solid}
> SparkConf sparkConf = new 
> SparkConf().setAppName(APPLICATION_NAME);
> 
> sparkConf.setJars(SparkContext.jarOfClass(Application.class).toList());  
> JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, 
> new Duration(BATCH_DURATION_MS));
> {code}
> Initialize input streams:
> {code:title=Application.java|borderStyle=solid}
> ReceiverInputDStream stream = // create input stream from 
> RabbitMQ
> JavaReceiverInputDStream events = new 
> JavaReceiverInputDStream(stream, classTag(Event.class));
> {code}
> Process events:
> {code:title=Application.java|borderStyle=solid}
> events.foreachRDD(
> rdd -> {
> rdd.foreachPartition(
> partition -> {
>  
> // process partition
> }
> }
> })
> 
> ssc.start();
> ssc.awaitTermination();
> {code}
> h4. Application submission:
> 
> The application is packaged as a single fat jar file using the Maven Shade 
> plugin (http://maven.apache.org/plugins/maven-shade-plugin/). It is compiled 
> against Spark version _1.1.0_.   
> We run our application on a Spark _1.1.0_ standalone cluster 
> that consists of a driver host, a master host, and two worker hosts. We submit 
> the application from the driver host.
> 
> On one of the workers we see {{java.lang.ClassNotFoundException}} 
> exceptions:   
> {panel:title=app.log|borderStyle=dashed|borderColor=#ccc|titleBGColor=#e3e4e1|bgColor=#f0f8ff}
> 14/11/27 10:27:10 ERROR BlockManagerWorker: Exception handling buffer message
> java.lang.ClassNotFoundException: com.xxx.Event
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:344)
> at 
> org.apache.spark.serializer.

[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395793#comment-14395793
 ] 

Sean Owen commented on SPARK-3553:
--

Checking through old issues --  I know this logic has been updated since 1.0 
and fixed in changes like SPARK-4518 and SPARK-2362. Any chance you know 
whether it is still an issue? It would not surprise me if it's fixed.

Otherwise, do you know if the file modification times were changed by your 
process?
Debug log output would help too, since the debug messages explain the logic of 
what gets kept.

> Spark Streaming app streams files that have already been streamed in an 
> endless loop
> 
>
> Key: SPARK-3553
> URL: https://issues.apache.org/jira/browse/SPARK-3553
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.1
> Environment: Ec2 cluster - YARN
>Reporter: Ezequiel Bella
>  Labels: S3, Streaming, YARN
>
> We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node 
> and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB 
> of RAM each.
> The app streams from a directory in S3 which is constantly being written; 
> this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, 
> TextInputFormat](Settings.S3RequestsHost  , (f:Path)=> true, true )
> The purpose of using fileStream instead of textFileStream is to customize the 
> way that spark handles existing files when the process starts. We want to 
> process just the new files that are added after the process launched and omit 
> the existing ones. We configured a batch duration of 10 seconds.
> The process goes fine while we add a small number of files to s3, let's say 4 
> or 5. We can see in the streaming UI how the stages are executed successfully 
> in the executors, one for each file that is processed. But when we try to add 
> a larger number of files, we face a strange behavior; the application starts 
> streaming files that have already been streamed. 
> For example, I add 20 files to s3. The files are processed in 3 batches. The 
> first batch processes 7 files, the second 8 and the third 5. No more files 
> are added to S3 at this point, but spark start repeating these phases 
> endlessly with the same files.
> Any thoughts what can be causing this?
> Regards,
> Easyb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3413) Spark Blocked due to Executor lost in FIFO MODE

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3413.
--
Resolution: Cannot Reproduce

> Spark Blocked due to Executor lost in FIFO MODE
> ---
>
> Key: SPARK-3413
> URL: https://issues.apache.org/jira/browse/SPARK-3413
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.2
>Reporter: Patrick Liu
>
> I run Spark on YARN.
> The Spark scheduler is running in FIFO mode.
> I have 80 worker instances set up. However, as time passes, some workers will 
> be lost (killed by the JVM on OOM, etc.).
> But some tasks will still be assigned to those executors, and obviously those 
> tasks will never finish.
> Then the stage will not finish, so the later stages will be blocked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema

2015-04-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6607.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5263
[https://github.com/apache/spark/pull/5263]

> Aggregation attribute name including special chars '(' and ')' should be 
> replaced before generating Parquet schema
> --
>
> Key: SPARK-6607
> URL: https://issues.apache.org/jira/browse/SPARK-6607
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1, 1.2.1, 1.3.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.4.0
>
>
> '(' and ')' are special characters used in Parquet schema for type 
> annotation. When we run an aggregation query, we will obtain attribute name 
> such as "MAX(a)".
> If we directly store the generated DataFrame as a Parquet file, it causes a 
> failure when reading and parsing the stored schema string.
> Several methods could be adopted to solve this. This PR uses the simplest one: 
> just replace attribute names before generating the Parquet schema based on 
> these attributes.
> Another possible method might be modifying all aggregation expression names 
> from "func(column)" to "func[column]".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2698) RDD pages shows negative bytes remaining for some executors

2015-04-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2698:
-
Priority: Minor  (was: Major)

This might have been fixed at some point since 1.0 though I don't see an 
obvious candidate. The problem is that "memUsed" exceeds "maxMem" in 
{{StorageStatus}}. It doesn't look like mere rounding error since the 
difference is >700MB out of 9.6GB. 

It's easy to make {{memRemaining}} never return a negative value, which on the 
one hand is good defensive programming, but might be hiding a race condition or 
other logic error.

Any thoughts on whether that would do more good than harm?
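
For concreteness, the defensive clamp being weighed here is a one-liner; a minimal sketch against stand-in fields, not the actual {{StorageStatus}} code:

{code}
// memRemaining is clamped so it never goes negative, at the cost of possibly
// masking whatever lets memUsed exceed maxMem in the first place.
case class StorageStatusSketch(maxMem: Long, memUsed: Long) {
  def memRemaining: Long = math.max(maxMem - memUsed, 0L)
}
{code}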

> RDD pages shows negative bytes remaining for some executors
> ---
>
> Key: SPARK-2698
> URL: https://issues.apache.org/jira/browse/SPARK-2698
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Hossein Falaki
>Priority: Minor
> Attachments: spark ui.png
>
>
> The RDD page shows negative bytes remaining for some executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-975) Spark Replay Debugger

2015-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-975:
--

Assignee: (was: Apache Spark)

> Spark Replay Debugger
> -
>
> Key: SPARK-975
> URL: https://issues.apache.org/jira/browse/SPARK-975
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Cheng Lian
>  Labels: arthur, debugger
> Attachments: IMG_20140722_184149.jpg, RDD DAG.png
>
>
> The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical 
> report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf].
> [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur 
> Dave|https://github.com/ankurdave], is an old implementation of the Spark 
> debugger, which demonstrated both the elegance and power behind the RDD 
> abstraction.  Unfortunately, the corresponding GitHub branch was not merged 
> into the master branch and had stopped 2 years ago.  For more information 
> about Arthur, please refer to [the Spark Debugger Wiki 
> page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub 
> repository.
> As a useful tool for Spark application debugging and analysis, it would be 
> nice to have a complete Spark debugger.  In 
> [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new 
> implementation of the Spark debugger, the Spark Replay Debugger (SRD).
> [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview 
> for discussion.  In the current version, I only implemented features that can 
> illustrate the basic mechanisms.  There are still features that appeared in 
> Arthur but are missing in SRD, such as checksum-based nondeterminism detection 
> and single-task debugging with a conventional debugger (like {{jdb}}).  
> However, these features can easily be built on top of the current SRD 
> framework.  To minimize code review effort, I intentionally didn't include 
> them in the current version.
> Attached is the visualization of the MLlib ALS application (with 1 iteration) 
> generated by SRD.  For more information, please refer to [the SRD overview 
> document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-975) Spark Replay Debugger

2015-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-975:
--

Assignee: Apache Spark

> Spark Replay Debugger
> -
>
> Key: SPARK-975
> URL: https://issues.apache.org/jira/browse/SPARK-975
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>  Labels: arthur, debugger
> Attachments: IMG_20140722_184149.jpg, RDD DAG.png
>
>
> The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical 
> report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf].
> [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur 
> Dave|https://github.com/ankurdave], is an old implementation of the Spark 
> debugger, which demonstrated both the elegance and power behind the RDD 
> abstraction.  Unfortunately, the corresponding GitHub branch was not merged 
> into the master branch and had stopped 2 years ago.  For more information 
> about Arthur, please refer to [the Spark Debugger Wiki 
> page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub 
> repository.
> As a useful tool for Spark application debugging and analysis, it would be 
> nice to have a complete Spark debugger.  In 
> [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new 
> implementation of the Spark debugger, the Spark Replay Debugger (SRD).
> [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview 
> for discussion.  In the current version, I only implemented features that can 
> illustrate the basic mechanisms.  There are still features that appeared in 
> Arthur but are missing in SRD, such as checksum-based nondeterminism detection 
> and single-task debugging with a conventional debugger (like {{jdb}}).  
> However, these features can easily be built on top of the current SRD 
> framework.  To minimize code review effort, I intentionally didn't include 
> them in the current version.
> Attached is the visualization of the MLlib ALS application (with 1 iteration) 
> generated by SRD.  For more information, please refer to [the SRD overview 
> document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6710) Wrong initial bias in GraphX SVDPlusPlus

2015-04-04 Thread Michael Malak (JIRA)
Michael Malak created SPARK-6710:


 Summary: Wrong initial bias in GraphX SVDPlusPlus
 Key: SPARK-6710
 URL: https://issues.apache.org/jira/browse/SPARK-6710
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Michael Malak


In the initialization portion of GraphX SVDPlusPlus, the initialization of 
biases appears to be incorrect. Specifically, in line 
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96
 
instead of 
(vd._1, vd._2, msg.get._2 / msg.get._1, 1.0 / scala.math.sqrt(msg.get._1)) 
it should probably be 
(vd._1, vd._2, msg.get._2 / msg.get._1 - u, 1.0 / scala.math.sqrt(msg.get._1)) 

That is, the biases bu and bi (both represented as the third component of the 
Tuple4[] above, depending on whether the vertex is a user or an item), 
described in equation (1) of the Koren paper, are supposed to be small offsets 
to the mean (represented by the variable u, signifying the Greek letter mu) to 
account for peculiarities of individual users and items. 

Initializing these biases to the wrong values should theoretically not matter 
given enough iterations of the algorithm, but some quick empirical testing shows 
it has trouble converging at all, even after orders of magnitude more 
iterations. 

This perhaps could be the source of previously reported trouble with 
SVDPlusPlus. 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-SVDPlusPlus-problem-td12885.html
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6548) Adding stddev to DataFrame functions

2015-04-04 Thread Harsh Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395876#comment-14395876
 ] 

Harsh Gupta commented on SPARK-6548:


Is this available to be worked on? The assignee shows as unassigned.

> Adding stddev to DataFrame functions
> 
>
> Key: SPARK-6548
> URL: https://issues.apache.org/jira/browse/SPARK-6548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: DataFrame, starter
>
> Add it to the list of aggregate functions:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> Also add it to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
> We can either add a Stddev Catalyst expression, or just compute it using 
> existing functions like here: 
> https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776
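
Pending a dedicated Catalyst expression, the "compute it using existing functions" route can be sketched with the aggregates already in functions.scala, using the population-variance identity E[x^2] - E[x]^2 (a sketch, assuming avg and sqrt are available in org.apache.spark.sql.functions on the targeted version):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, sqrt}

// Population standard deviation of a numeric column, built from existing
// aggregate and scalar functions.
def stddevOf(df: DataFrame, col: String): DataFrame =
  df.agg(sqrt(avg(df(col) * df(col)) - avg(df(col)) * avg(df(col))))
{code}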



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Different Languages

2015-04-04 Thread Harsh Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395888#comment-14395888
 ] 

Harsh Gupta commented on SPARK-3580:


Hi. Can I take it up?

> Add Consistent Method To Get Number of RDD Partitions Across Different 
> Languages
> 
>
> Key: SPARK-3580
> URL: https://issues.apache.org/jira/browse/SPARK-3580
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Pat McDonough
>  Labels: starter
>
> Programmatically retrieving the number of partitions is not consistent 
> between python and scala. A consistent method should be defined and made 
> public across both languages.
> RDD.partitions.size is also used quite frequently throughout the internal 
> code, so that might be worth refactoring as well once the new method is 
> available.
> What we have today is below.
> In Scala:
> {code}
> scala> someRDD.partitions.size
> res0: Int = 30
> {code}
> In Python:
> {code}
> In [2]: someRDD.getNumPartitions()
> Out[2]: 30
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6661) Python type errors should print type, not object

2015-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6661:
---

Assignee: Apache Spark

> Python type errors should print type, not object
> 
>
> Key: SPARK-6661
> URL: https://issues.apache.org/jira/browse/SPARK-6661
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> In MLlib PySpark, we sometimes test the type of an object and print an error 
> if the object is of the wrong type.  E.g.:
> [https://github.com/apache/spark/blob/f084c5de14eb10a6aba82a39e03e7877926ebb9e/python/pyspark/mllib/regression.py#L173]
> These checks should print the type, not the actual object.  E.g., if the 
> object cannot be converted to a string, then the check linked above will give 
> a warning like this:
> {code}
> TypeError: not all arguments converted during string formatting
> {code}
> ...which is weird for the user.
> There may be other places in the codebase where this is an issue, so we need 
> to check through and verify.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6661) Python type errors should print type, not object

2015-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395930#comment-14395930
 ] 

Apache Spark commented on SPARK-6661:
-

User '31z4' has created a pull request for this issue:
https://github.com/apache/spark/pull/5361

> Python type errors should print type, not object
> 
>
> Key: SPARK-6661
> URL: https://issues.apache.org/jira/browse/SPARK-6661
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In MLlib PySpark, we sometimes test the type of an object and print an error 
> if the object is of the wrong type.  E.g.:
> [https://github.com/apache/spark/blob/f084c5de14eb10a6aba82a39e03e7877926ebb9e/python/pyspark/mllib/regression.py#L173]
> These checks should print the type, not the actual object.  E.g., if the 
> object cannot be converted to a string, then the check linked above will give 
> a warning like this:
> {code}
> TypeError: not all arguments converted during string formatting
> {code}
> ...which is weird for the user.
> There may be other places in the codebase where this is an issue, so we need 
> to check through and verify.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6661) Python type errors should print type, not object

2015-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6661:
---

Assignee: (was: Apache Spark)

> Python type errors should print type, not object
> 
>
> Key: SPARK-6661
> URL: https://issues.apache.org/jira/browse/SPARK-6661
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In MLlib PySpark, we sometimes test the type of an object and print an error 
> if the object is of the wrong type.  E.g.:
> [https://github.com/apache/spark/blob/f084c5de14eb10a6aba82a39e03e7877926ebb9e/python/pyspark/mllib/regression.py#L173]
> These checks should print the type, not the actual object.  E.g., if the 
> object cannot be converted to a string, then the check linked above will give 
> a warning like this:
> {code}
> TypeError: not all arguments converted during string formatting
> {code}
> ...which is weird for the user.
> There may be other places in the codebase where this is an issue, so we need 
> to check through and verify.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml

2015-04-04 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395932#comment-14395932
 ] 

Patrick Wendell commented on SPARK-6676:


[~srowen] This is such a common source of confusion for users. Do you think we 
should just add 2.5 and 2.6 profiles and note internally that they are 
duplicates of 2.4? The maintenance cost there is pretty marginal, and it might 
be a better user experience, since this is something people clearly stumble 
over regularly.

> Add hadoop 2.4+ for profiles in POM.xml
> ---
>
> Key: SPARK-6676
> URL: https://issues.apache.org/jira/browse/SPARK-6676
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Priority: Minor
>
> support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6703:
---
Description: 
Right now it is difficult to write a Spark application in a way that can be run 
independently and also be composed with other Spark applications in an 
environment such as the JobServer, notebook servers, etc., where there is a 
shared SparkContext.

It would be nice to provide a rendez-vous point so that applications can learn 
whether a SparkContext already exists before creating one.

The most simple/surgical way I see to do this is to have an optional static 
SparkContext singleton that people can retrieve as follows:

{code}
val sc = SparkContext.getOrCreate(conf = new SparkConf())
{code}

And you could also have a setter where some outer framework/server can set it 
for use by multiple downstream applications.

A more advanced version of this would have some named registry or something, 
but since we only support a single SparkContext in one JVM at this point 
anyways, this seems sufficient and much simpler. Another advanced option would 
be to allow plugging in some other notion of configuration you'd pass when 
retrieving an existing context.


  was:
Right now it is difficult to write a Spark application in a way that can be run 
independently and also be composed with other Spark applications in an 
environment such as the JobServer, notebook servers, etc., where there is a 
shared SparkContext.

It would be nice to have a way to write an application where you can "get or 
create" a SparkContext and have some standard type of synchronization point 
application authors can access. The most simple/surgical way I see to do this 
is to have an optional static SparkContext singleton that people can 
retrieve as follows:

{code}
val sc = SparkContext.getOrCreate(conf = new SparkConf())
{code}

And you could also have a setter where some outer framework/server can set it 
for use by multiple downstream applications.

A more advanced version of this would have some named registry or something, 
but since we only support a single SparkContext in one JVM at this point 
anyways, this seems sufficient and much simpler. Another advanced option would 
be to allow plugging in some other notion of configuration you'd pass when 
retrieving an existing context.



> Provide a way to discover existing SparkContext's
> -
>
> Key: SPARK-6703
> URL: https://issues.apache.org/jira/browse/SPARK-6703
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Patrick Wendell
>
> Right now it is difficult to write a Spark application in a way that can be 
> run independently and also be composed with other Spark applications in an 
> environment such as the JobServer, notebook servers, etc., where there is a 
> shared SparkContext.
> It would be nice to provide a rendez-vous point so that applications can 
> learn whether a SparkContext already exists before creating one.
> The most simple/surgical way I see to do this is to have an optional static 
> SparkContext singleton that people can retrieve as follows:
> {code}
> val sc = SparkContext.getOrCreate(conf = new SparkConf())
> {code}
> And you could also have a setter where some outer framework/server can set it 
> for use by multiple downstream applications.
> A more advanced version of this would have some named registry or something, 
> but since we only support a single SparkContext in one JVM at this point 
> anyways, this seems sufficient and much simpler. Another advanced option 
> would be to allow plugging in some other notion of configuration you'd pass 
> when retrieving an existing context.
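
As a rough illustration of the singleton-plus-setter idea (names like {{SparkContextRegistry}} and {{setActiveContext}} are invented here, not an agreed API):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: an optional static "active" SparkContext. An outer
// framework/server can publish the context it created; downstream
// applications reuse it if present, otherwise create their own.
object SparkContextRegistry {
  @volatile private var active: Option[SparkContext] = None

  def setActiveContext(sc: SparkContext): Unit = synchronized {
    active = Some(sc)
  }

  def getOrCreate(conf: SparkConf): SparkContext = synchronized {
    active.getOrElse {
      val sc = new SparkContext(conf)
      active = Some(sc)
      sc
    }
  }
}
{code}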



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395948#comment-14395948
 ] 

Sean Owen commented on SPARK-6676:
--

I'd prefer to start by just improving the documentation to be clear about this, 
both in the site docs and in comments in the build file, rather than copying and 
pasting the profiles. For example, 
http://spark.apache.org/docs/latest/building-spark.html should say this profile 
is for "2.4.x+", and comments in the pom.xml can clarify to anyone looking there 
what the profile is for. I think few people need to build Spark for themselves.
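
In other words, under that documentation fix, newer Hadoop versions would reuse the existing profile with an explicit {{hadoop.version}}. A hedged example command (the exact flags are an assumption and depend on the Spark version being built):

{code}
# Illustrative only: build against Hadoop 2.6 by reusing the hadoop-2.4 profile
# and overriding hadoop.version, rather than adding new profiles.
mvn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package
{code}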

> Add hadoop 2.4+ for profiles in POM.xml
> ---
>
> Key: SPARK-6676
> URL: https://issues.apache.org/jira/browse/SPARK-6676
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Priority: Minor
>
> support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6703:
-
Target Version/s: 1.4.0

> Provide a way to discover existing SparkContext's
> -
>
> Key: SPARK-6703
> URL: https://issues.apache.org/jira/browse/SPARK-6703
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>
> Right now it is difficult to write a Spark application in a way that can be 
> run independently and also be composed with other Spark applications in an 
> environment such as the JobServer, notebook servers, etc., where there is a 
> shared SparkContext.
> It would be nice to provide a rendez-vous point so that applications can 
> learn whether a SparkContext already exists before creating one.
> The most simple/surgical way I see to do this is to have an optional static 
> SparkContext singleton that people can retrieve as follows:
> {code}
> val sc = SparkContext.getOrCreate(conf = new SparkConf())
> {code}
> And you could also have a setter where some outer framework/server can set it 
> for use by multiple downstream applications.
> A more advanced version of this would have some named registry or something, 
> but since we only support a single SparkContext in one JVM at this point 
> anyways, this seems sufficient and much simpler. Another advanced option 
> would be to allow plugging in some other notion of configuration you'd pass 
> when retrieving an existing context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6703:
-
Affects Version/s: 1.3.0

> Provide a way to discover existing SparkContext's
> -
>
> Key: SPARK-6703
> URL: https://issues.apache.org/jira/browse/SPARK-6703
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>
> Right now it is difficult to write a Spark application in a way that can be 
> run independently and also be composed with other Spark applications in an 
> environment such as the JobServer, notebook servers, etc., where there is a 
> shared SparkContext.
> It would be nice to provide a rendez-vous point so that applications can 
> learn whether a SparkContext already exists before creating one.
> The most simple/surgical way I see to do this is to have an optional static 
> SparkContext singleton that people can retrieve as follows:
> {code}
> val sc = SparkContext.getOrCreate(conf = new SparkConf())
> {code}
> And you could also have a setter where some outer framework/server can set it 
> for use by multiple downstream applications.
> A more advanced version of this would have some named registry or something, 
> but since we only support a single SparkContext in one JVM at this point 
> anyways, this seems sufficient and much simpler. Another advanced option 
> would be to allow plugging in some other notion of configuration you'd pass 
> when retrieving an existing context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-5261) In some cases, the value of word's vector representation is too big

2015-04-04 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li reopened SPARK-5261:


[~srowen] SPARK-5261 and SPARK-4846 are not the same problem. This is an 
algorithm error. The resulting vector is incorrect.

> In some cases, the value of word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36)
> {code}
> The average absolute value of the word's vector representation is 60731.8
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(1)
> {code}
> The average  absolute value of the word's vector representation is 0.13889



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6699) PySpark Access Denied error in Windows seen only in ver 1.3

2015-04-04 Thread RoCm (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396040#comment-14396040
 ] 

RoCm commented on SPARK-6699:
-

I don't think this is a problem caused by a missing numpy module.
I didn't think pyspark ever needed a numpy dependency, but correct me if I'm wrong.

To confirm this was not caused by missing numpy, I switched my PYTHONPATH to a 
different Python installation (which had numpy).

Running pyspark with the new Python interpreter also gives me the same error.

From the traceback, the following statement in java_gateway.py is causing this:
proc = Popen(command, stdin=PIPE, env=env)


C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin>pyspark
Running python with 
PYTHONPATH=C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python;
Python 2.7.8 |Anaconda 2.1.0 (64-bit)| (default, Jul  2 2014, 15:12:11) [MSC 
v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
Traceback (most recent call last):
  File 
"C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python\pyspark\shell.py",
 line 50, in 
sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File 
"C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\context.py",
 line 108, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
  File 
"C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\context.py",
 line 222, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
  File 
"C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\java_gateway.py",
 line 65, in launch_gateway
proc = Popen(command, stdin=PIPE, env=env)
  File "C:\Users\roXYZ\Anaconda\lib\subprocess.py", line 710, in __init__
errread, errwrite)
  File "C:\Users\roXYZ\Anaconda\lib\subprocess.py", line 958, in _execute_child
startupinfo)
WindowsError: [Error 5] Access is denied


> PySpark Access Denied error in Windows seen only in ver 1.3
> ---
>
> Key: SPARK-6699
> URL: https://issues.apache.org/jira/browse/SPARK-6699
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: Windows 8.1 x64
> Windows 7 SP1 x64
>Reporter: RoCm
>
> Downloaded version 1.3 and tried to run pyspark
> I hit this error and am unable to proceed (I tried versions 1.2 and 1.1, which work fine)
> Pasting the error logs below
> C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin>pyspark
> Running python with 
> PYTHONPATH=C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python\lib\py4j-0.8.2.1-src.zip;
> C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python;
> Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on 
> win32
> Type "help", "copyright", "credits" or "license" for more information.
> No module named numpy
> Traceback (most recent call last): File 
> "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python\pyspark\shell.py",
>  line 50, in 
> sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
> File 
> "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\context.py",
>  line 108, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
> File 
> "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\context.py",
>  line 222, in _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>  File 
> "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\java_gateway.py",
>  line 65, in launch_gateway
> proc = Popen(command, stdin=PIPE, env=env)
>   File "C:\Python27\lib\subprocess.py", line 710, in __init__errread, 
> errwrite)
>   File "C:\Python27\lib\subprocess.py", line 958, in _execute_child
> startupinfo)
> WindowsError: [Error 5] Access is denied



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6711) Support parallelized online matrix factorization for Collaborative Filtering

2015-04-04 Thread Chunnan Yao (JIRA)
Chunnan Yao created SPARK-6711:
--

 Summary: Support parallelized online matrix factorization for 
Collaborative Filtering 
 Key: SPARK-6711
 URL: https://issues.apache.org/jira/browse/SPARK-6711
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Chunnan Yao


Online Collaborative Filtering (CF) has been widely used and studied. Re-training 
a CF model from scratch every time new data comes in is very inefficient 
(http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model).
However, in the Spark community we see little discussion about collaborative 
filtering on streaming data. Given streaming k-means, streaming logistic 
regression, and the ongoing incremental model training of the Naive Bayes 
classifier (SPARK-4144), we think it is worthwhile to add streaming 
collaborative filtering support to MLlib.

We have been considering this issue during the past week. We plan to 
follow this paper 
(https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on 
SGD instead of ALS, which is easier to handle on streaming data.

Fortunately, the authors of this paper have implemented their algorithm as a 
GitHub project, based on Storm:
https://github.com/MrChrisJohnson/CollabStream
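
For reference, a hedged sketch of the per-rating SGD update such an approach builds on (plain Scala, independent of any Spark or Storm API; {{sgdStep}}, the learning rate, and the regularization constant are illustrative):

{code}
// One SGD step for matrix factorization: given rating r for (user u, item i),
// move the latent vectors to reduce squared error, with L2 regularization.
def sgdStep(u: Array[Double], v: Array[Double], r: Double,
            lr: Double, lambda: Double): Unit = {
  val pred = (u, v).zipped.map(_ * _).sum
  val err = r - pred
  for (k <- u.indices) {
    val uk = u(k)
    u(k) += lr * (err * v(k) - lambda * uk)
    v(k) += lr * (err * uk - lambda * v(k))
  }
}
{code}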




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6712) Allow lowering the log level in YARN client while keeping AM tracking URL printed

2015-04-04 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created SPARK-6712:


 Summary: Allow lowering the log level in YARN client while keeping AM 
tracking URL printed
 Key: SPARK-6712
 URL: https://issues.apache.org/jira/browse/SPARK-6712
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Priority: Trivial


In YARN mode, log messages are quite verbose in interactive shells 
(spark-shell, spark-sql, pyspark), and they sometimes mingle with shell 
prompts. In fact, it's very easy to tone it down via {{log4j.properties}}, but 
the problem is that the AM tracking URL is not printed if I do that.

It would be nice if I could keep the AM tracking URL while disabling the other 
INFO messages that don't matter to most end users.
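
For context, the kind of {{log4j.properties}} tweak being described looks roughly like the snippet below (illustrative; the logger names are assumptions and may differ across Spark versions), and the gap is that there is no setting that keeps only the AM tracking URL:

{code}
# Quiet Spark's INFO chatter in the interactive shells...
log4j.rootCategory=WARN, console
# ...keeping the YARN client logger at INFO still prints the tracking URL,
# but also every periodic application report, which is what this issue
# would like to avoid.
log4j.logger.org.apache.spark.deploy.yarn.Client=INFO
{code}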



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6712) Allow lowering the log level in YARN client while keeping AM tracking URL printed

2015-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6712:
---

Assignee: Apache Spark

> Allow lowering the log level in YARN client while keeping AM tracking URL printed
> --
>
> Key: SPARK-6712
> URL: https://issues.apache.org/jira/browse/SPARK-6712
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Cheolsoo Park
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: yarn
>
> In YARN mode, log messages are quite verbose in interactive shells 
> (spark-shell, spark-sql, pyspark), and they sometimes mingle with shell 
> prompts. In fact, it's very easy to tone it down via {{log4j.properties}}, 
> but the problem is that the AM tracking URL is not printed if I do that.
> It would be nice if I could keep the AM tracking URL while disabling the 
> other INFO messages that don't matter to most end users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6712) Allow lowering the log level in YARN client while keeping AM tracking URL printed

2015-04-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396094#comment-14396094
 ] 

Apache Spark commented on SPARK-6712:
-

User 'piaozhexiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/5362

> Allow lowering the log level in YARN client while keeping AM tracking URL printed
> --
>
> Key: SPARK-6712
> URL: https://issues.apache.org/jira/browse/SPARK-6712
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Cheolsoo Park
>Priority: Trivial
>  Labels: yarn
>
> In YARN mode, log messages are quite verbose in interactive shells 
> (spark-shell, spark-sql, pyspark), and they sometimes mingle with shell 
> prompts. In fact, it's very easy to tone it down via {{log4j.properties}}, 
> but the problem is that the AM tracking URL is not printed if I do that.
> It would be nice if I could keep the AM tracking URL while disabling the 
> other INFO messages that don't matter to most end users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6712) Allow lowering the log level in YARN client while keeping AM tracking URL printed

2015-04-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6712:
---

Assignee: (was: Apache Spark)

> Allow lowering the log level in YARN client while keeping AM tracking URL printed
> --
>
> Key: SPARK-6712
> URL: https://issues.apache.org/jira/browse/SPARK-6712
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Cheolsoo Park
>Priority: Trivial
>  Labels: yarn
>
> In YARN mode, log messages are quite verbose in interactive shells 
> (spark-shell, spark-sql, pyspark), and they sometimes mingle with shell 
> prompts. In fact, it's very easy to tone it down via {{log4j.properties}}, 
> but the problem is that the AM tracking URL is not printed if I do that.
> It would be nice if I could keep the AM tracking URL while disabling the 
> other INFO messages that don't matter to most end users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5261) In some cases, the value of word's vector representation is too big

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396112#comment-14396112
 ] 

Sean Owen commented on SPARK-5261:
--

I think they both come down to a minCount that is too low. If you're going to 
reopen, can you please follow up on the request to try that, or provide your 
data set? I don't think it's actionable if there's no follow-up.

> In some cases, the value of word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36)
> {code}
> The average absolute value of the word's vector representation is 60731.8
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(1)
> {code}
> The average  absolute value of the word's vector representation is 0.13889



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6699) PySpark Access Denied error in Windows seen only in ver 1.3

2015-04-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396116#comment-14396116
 ] 

Sean Owen commented on SPARK-6699:
--

numpy is required: http://spark.apache.org/docs/latest/mllib-guide.html
This seems like a local env issue: spark-submit.cmd may not be accessible to 
the user you are running as. Can you check that?

> PySpark Access Denied error in Windows seen only in ver 1.3
> ---
>
> Key: SPARK-6699
> URL: https://issues.apache.org/jira/browse/SPARK-6699
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: Windows 8.1 x64
> Windows 7 SP1 x64
>Reporter: RoCm
>
> Downloaded version 1.3 and tried to run pyspark
> I hit this error and am unable to proceed (I tried versions 1.2 and 1.1, which work fine)
> Pasting the error logs below
> C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin>pyspark
> Running python with 
> PYTHONPATH=C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python\lib\py4j-0.8.2.1-src.zip;
> C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python;
> Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on 
> win32
> Type "help", "copyright", "credits" or "license" for more information.
> No module named numpy
> Traceback (most recent call last): File 
> "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python\pyspark\shell.py",
>  line 50, in 
> sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
> File 
> "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\context.py",
>  line 108, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
> File 
> "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\context.py",
>  line 222, in _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>  File 
> "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\java_gateway.py",
>  line 65, in launch_gateway
> proc = Popen(command, stdin=PIPE, env=env)
>   File "C:\Python27\lib\subprocess.py", line 710, in __init__errread, 
> errwrite)
>   File "C:\Python27\lib\subprocess.py", line 958, in _execute_child
> startupinfo)
> WindowsError: [Error 5] Access is denied



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org