[jira] [Updated] (SPARK-1681) Handle hive support correctly in ./make-distribution

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1681:
---

Description: 
When Hive support is enabled we should copy the datanucleus jars to the 
packaged distribution. The simplest way would be to create a lib_managed folder 
in the final distribution so that the compute-classpath script searches in 
exactly the same way whether or not it's a release.

A slightly nicer solution is to put the jars inside of `/lib` and have some 
fancier check for the jar location in the compute-classpath script.

We should also document how to run Spark SQL on YARN when Hive support is 
enabled, in particular how to add the necessary jars to spark-submit.

  was:
When Hive support is enabled we should copy the datanucleus jars to the 
packaged distribution. The simplest way would be to create a lib_managed folder 
in the final distribution so that the compute-classpath script searches in 
exactly the same way whether or not it's a release.

A slightly nicer solution is to put the jars inside of `/lib` and have some 
fancier check for the jar location in the compute-classpath script.


> Handle hive support correctly in ./make-distribution
> 
>
> Key: SPARK-1681
> URL: https://issues.apache.org/jira/browse/SPARK-1681
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When Hive support is enabled we should copy the datanucleus jars to the 
> packaged distribution. The simplest way would be to create a lib_managed 
> folder in the final distribution so that the compute-classpath script 
> searches in exactly the same way whether or not it's a release.
> A slightly nicer solution is to put the jars inside of `/lib` and have some 
> fancier check for the jar location in the compute-classpath script.
> We should also document how to run Spark SQL on YARN when Hive support is 
> enabled, in particular how to add the necessary jars to spark-submit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1681) Handle hive support correctly in ./make-distribution

2014-04-29 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1681:
--

 Summary: Handle hive support correctly in ./make-distribution
 Key: SPARK-1681
 URL: https://issues.apache.org/jira/browse/SPARK-1681
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


When Hive support is enabled we should copy the datanucleus jars to the 
packaged distribution. The simplest way would be to create a lib_managed folder 
in the final distribution so that the compute-classpath script searches in 
exactly the same way whether or not it's a release.

A slightly nicer solution is to put the jars inside of `/lib` and have some 
fancier check for the jar location in the compute-classpath script.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1680) Clean up use of setExecutorEnvs in SparkConf

2014-04-29 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1680:
--

 Summary: Clean up use of setExecutorEnvs in SparkConf 
 Key: SPARK-1680
 URL: https://issues.apache.org/jira/browse/SPARK-1680
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


Setting executor envs via SparkConf (setExecutorEnvs) was added in 0.9.0, but the 
config change removed propagation of these values to executors. We should make 
one of two decisions:

1. Don't allow setting arbitrary executor envs in standalone mode.
2. Document this option, respect the env variables when launching a job, and 
consolidate it with SPARK_YARN_USER_ENV (the API in question is sketched below).
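
For reference, here is a usage sketch of the SparkConf API in question 
(illustration only; whether these values actually reach executors in standalone 
mode is what this issue is about):

{code}
import org.apache.spark.SparkConf

// Sketch: setting executor environment variables through SparkConf.
// Propagating these values to executors in standalone mode is exactly
// the behaviour this issue asks us to either support or drop.
val conf = new SparkConf()
  .setAppName("executor-env-example")
  .setExecutorEnv("MY_LIBRARY_PATH", "/opt/native/lib")
{code}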



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7

2014-04-29 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985219#comment-13985219
 ] 

Patrick Wendell commented on SPARK-922:
---

This is no longer a blocker now that we've downgraded the python dependency, 
but would still be nice to have.

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Josh Rosen
> Fix For: 1.1.0
>
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-922) Update Spark AMI to Python 2.7

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-922:
--

Priority: Major  (was: Blocker)

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Josh Rosen
> Fix For: 1.1.0
>
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-922) Update Spark AMI to Python 2.7

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-922:
--

Fix Version/s: (was: 1.0.0)
   1.1.0

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1466) Pyspark doesn't check if gateway process launches correctly

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1466:
---

Fix Version/s: (was: 1.0.0)
   1.0.1

> Pyspark doesn't check if gateway process launches correctly
> ---
>
> Key: SPARK-1466
> URL: https://issues.apache.org/jira/browse/SPARK-1466
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Blocker
> Fix For: 1.0.1
>
>
> If the gateway process fails to start correctly (e.g., because JAVA_HOME 
> isn't set correctly, there's no Spark jar, etc.), right now pyspark fails 
> because of a very difficult-to-understand error, where we try to parse stdout 
> to get the port where Spark started and there's nothing there.  We should 
> properly catch the error, print it to the user, and exit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1626) Update Spark YARN docs to use spark-submit

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1626.


Resolution: Duplicate

> Update Spark YARN docs to use spark-submit
> --
>
> Key: SPARK-1626
> URL: https://issues.apache.org/jira/browse/SPARK-1626
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1492) running-on-yarn doc should use spark-submit script for examples

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1492:
---

Priority: Blocker  (was: Major)

> running-on-yarn doc should use spark-submit script for examples
> ---
>
> Key: SPARK-1492
> URL: https://issues.apache.org/jira/browse/SPARK-1492
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Sandy Ryza
>Priority: Blocker
>
> the spark-class script puts out lots of warnings telling users to use the 
> spark-submit script with the new options.  We should update the 
> running-on-yarn.md docs to have examples that use the spark-submit script rather 
> than spark-class. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1626) Update Spark YARN docs to use spark-submit

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1626:
---

Assignee: Sandy Ryza  (was: Patrick Wendell)

> Update Spark YARN docs to use spark-submit
> --
>
> Key: SPARK-1626
> URL: https://issues.apache.org/jira/browse/SPARK-1626
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1004) PySpark on YARN

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1004.


Resolution: Fixed

Issue resolved by pull request 30
[https://github.com/apache/spark/pull/30]

> PySpark on YARN
> ---
>
> Key: SPARK-1004
> URL: https://issues.apache.org/jira/browse/SPARK-1004
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Josh Rosen
>Assignee: Sandy Ryza
>Priority: Blocker
> Fix For: 1.0.0
>
>
> This is for tracking progress on supporting YARN in PySpark.
> We might be able to use {{yarn-client}} mode 
> (https://spark.incubator.apache.org/docs/latest/running-on-yarn.html#launch-spark-application-with-yarn-client-mode).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1679) In-Memory compression needs to be configurable.

2014-04-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-1679:
---

 Summary: In-Memory compression needs to be configurable.
 Key: SPARK-1679
 URL: https://issues.apache.org/jira/browse/SPARK-1679
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.0.0


Since we are still finding bugs in the compression code, I think we should make 
it configurable in SparkConf and turn it off by default for the 1.0 release.
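
From the user side, the switch could look roughly like the sketch below; the 
property name here is hypothetical and only meant to illustrate the idea of a 
SparkConf flag:

{code}
import org.apache.spark.SparkConf

// Hypothetical key name for illustration only; the final setting may differ.
val conf = new SparkConf()
  .set("spark.sql.inMemoryColumnarStorage.compressed", "false")
{code}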



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1678) Compression loses repeated values.

2014-04-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-1678:
---

 Summary: Compression loses repeated values.
 Key: SPARK-1678
 URL: https://issues.apache.org/jira/browse/SPARK-1678
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.0.0


Here's a test case:

{code}
  test("all the same strings") {
    sparkContext.parallelize(1 to 1000)
      .map(_ => StringData("test"))
      .registerAsTable("test1000")
    assert(sql("SELECT * FROM test1000").count() === 1000)
    cacheTable("test1000")
    assert(sql("SELECT * FROM test1000").count() === 1000)
  }
{code}

First assert passes, second one fails.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1677) Allow users to avoid Hadoop output checks if desired

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1677:
---

Issue Type: Improvement  (was: Bug)

> Allow users to avoid Hadoop output checks if desired
> 
>
> Key: SPARK-1677
> URL: https://issues.apache.org/jira/browse/SPARK-1677
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>
> For compatibility with older versions of Spark it would be nice to have an 
> option `spark.hadoop.validateOutputSpecs` and a description "If set to true, 
> validates the output specification used in saveAsHadoopFile and other 
> variants. This can be disabled to silence exceptions due to pre-existing 
> output directories."
> This would just wrap the checking done in this PR:
> https://issues.apache.org/jira/browse/SPARK-1100
> https://github.com/apache/spark/pull/11
> By first checking the spark conf.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1677) Allow users to avoid Hadoop output checks if desired

2014-04-29 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1677:
--

 Summary: Allow users to avoid Hadoop output checks if desired
 Key: SPARK-1677
 URL: https://issues.apache.org/jira/browse/SPARK-1677
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Assignee: Patrick Wendell


For compatibility with older versions of Spark it would be nice to have an 
option `spark.hadoop.validateOutputSpecs` and a description "If set to true, 
validates the output specification used in saveAsHadoopFile and other variants. 
This can be disabled to silence exceptions due to pre-existing output 
directories."

This would just wrap the checking done in this PR:
https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

By first checking the spark conf.
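
A rough sketch of the idea, as a hypothetical helper around the old mapred API 
(not the actual Spark patch):

{code}
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkConf

// Hypothetical helper: only run the Hadoop output-spec validation when the
// proposed flag is true (the default). checkOutputSpecs throws if, for
// example, the output directory already exists.
def maybeValidateOutputSpecs(conf: SparkConf, jobConf: JobConf): Unit = {
  if (conf.getBoolean("spark.hadoop.validateOutputSpecs", true)) {
    jobConf.getOutputFormat.checkOutputSpecs(null, jobConf)
  }
}
{code}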



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1677) Allow users to avoid Hadoop output checks if desired

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1677:
---

Description: 
For compatibility with older versions of Spark it would be nice to have an 
option `spark.hadoop.validateOutputSpecs` (default true) and a description "If 
set to true, validates the output specification used in saveAsHadoopFile and 
other variants. This can be disabled to silence exceptions due to pre-existing 
output directories."

This would just wrap the checking done in this PR:
https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

By first checking the spark conf.

  was:
For compatibility with older versions of Spark it would be nice to have an 
option `spark.hadoop.validateOutputSpecs` and a description "If set to true, 
validates the output specification used in saveAsHadoopFile and other variants. 
This can be disabled to silence exceptions due to pre-existing output 
directories."

This would just wrap the checking done in this PR:
https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

By first checking the spark conf.


> Allow users to avoid Hadoop output checks if desired
> 
>
> Key: SPARK-1677
> URL: https://issues.apache.org/jira/browse/SPARK-1677
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>
> For compatibility with older versions of Spark it would be nice to have an 
> option `spark.hadoop.validateOutputSpecs` (default true) and a description 
> "If set to true, validates the output specification used in saveAsHadoopFile 
> and other variants. This can be disabled to silence exceptions due to 
> pre-existing output directories."
> This would just wrap the checking done in this PR:
> https://issues.apache.org/jira/browse/SPARK-1100
> https://github.com/apache/spark/pull/11
> By first checking the spark conf.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1676) HDFS FileSystems continually pile up in the FS cache

2014-04-29 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-1676:
-

 Summary: HDFS FileSystems continually pile up in the FS cache
 Key: SPARK-1676
 URL: https://issues.apache.org/jira/browse/SPARK-1676
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 0.9.1
Reporter: Aaron Davidson
Priority: Critical


Due to HDFS-3545, FileSystem.get() always produces (and caches) a new 
FileSystem when provided with a new UserGroupInformation (UGI), even if the UGI 
represents the same user as another UGI. This causes a buildup of FileSystem 
objects at an alarming rate, often one per task for something like 
sc.textFile(). The bug is especially hard-hitting for NativeS3FileSystem, which 
also maintains an open connection to S3, clogging up the system file handles.

The bug was introduced in https://github.com/apache/spark/pull/29, where doAs 
was made the default behavior.

A fix is not forthcoming for the general case, as UGIs do not cache well, but 
this problem can lead to Spark clusters entering a failed state and requiring 
that executors be restarted.
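
A small sketch of the behaviour described above, using plain Hadoop client APIs 
(illustration only): two UGIs created for the same user are distinct cache keys, 
so each FileSystem.get call made under a fresh UGI caches another FileSystem.

{code}
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

val hadoopConf = new Configuration()

def getFsAs(user: String): FileSystem = {
  // A fresh UGI each time, as happens once per task when doAs is the default.
  val ugi = UserGroupInformation.createRemoteUser(user)
  ugi.doAs(new PrivilegedExceptionAction[FileSystem] {
    override def run(): FileSystem = FileSystem.get(hadoopConf)
  })
}

// Same user, but two distinct UGI instances: with HDFS-3545 unfixed these are
// two different cached FileSystem objects, one more per call.
val fs1 = getFsAs("alice")
val fs2 = getFsAs("alice")
assert(!(fs1 eq fs2))
{code}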



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1675) Make clear whether computePrincipalComponents centers data

2014-04-29 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-1675:
-

 Summary: Make clear whether computePrincipalComponents centers data
 Key: SPARK-1675
 URL: https://issues.apache.org/jira/browse/SPARK-1675
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1661) the result of querying table created with RegexSerDe is all null

2014-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1661.
-

Resolution: Won't Fix

> the result of querying table created with RegexSerDe is all null
> 
>
> Key: SPARK-1661
> URL: https://issues.apache.org/jira/browse/SPARK-1661
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 0.9.0
> Environment: linux 2.6.32-358.el6.x86_64,Hive 12.0,shark 0.9.0,Hadoop 
> 2.2.0
>Reporter: likunjian
>  Labels: HQL, hadoop, hive, regex, shark
> Attachments: log.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The result of querying a table created with RegexSerDe is all NULL.
> When I query a table created with 
> org.apache.hadoop.hive.contrib.serde2.RegexSerDe through Shark, the columns in the 
> result are all NULL.
> select * from access_log where logdate='2014-04-28' limit 10;
> OK
> ip  hosttimemethod  request protocolstatus  size
> referer cookieuid   requesttime session httpxrequestedwith  agent 
>   upstreamresponsetimelogdate
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> Time taken: 4.362 seconds
> my regex is
>  ^([^ ]*) [^ ]* ([^ ]*) \\[([^\]]*)\\] \"([^ ]*) ([^ ]*) ([^ ]*)\" (-|[0-9]*) 
> (-|[0-9]*) \"(\.\+\?|-)\" ([^ ]*) ([^ ]*) ([^ ]*) \"(\.\+\?|-)\" 
> \"(\.\+\?|-)\" \"(\.\+\?|-)\"$
> nginx log example:
> 42.49.44.61 - www..comm [20/Apr/2014:23:58:03 +0800] "GET /x/296837 
> HTTP/1.1" 200 3871 "http://www.x.com/x/296837"; - 0.015 
> 63hbb4om2cvtjs0f7d969n1uf4 "com.x.browser" "Mozilla/5.0 (Linux; U; x 
> 4.1.2; zh-cn; ZTE N919 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) 
> Version/4.0 Mobile Safari/534.30" "0.015"
> 111.121.176.149 - www..comm [20/Apr/2014:23:58:03 +0800] "GET 
> /x/264904 HTTP/1.1" 200 3827 
> "http://m.baidu.com/s?from=2001a&bd_page_type=1&word=%E8%8E%B2%E8%97%95%E6%80%8E%E6%A0%B7%E5%8D%A4%E6%89%8D%E5%A5%BD%E5%90%83";
>  - 0.015 ft7tr4b06b23ub9lnugdf4gcq3 "-" "Mozilla/5.0 (Linux; U; x 4.1.2; 
> zh-CN; 8190Q Build/JZO54K) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 
> UCBrowser/9.5.2.394 U3/0.8.0 Mobile Safari/533.1" "0.015"
> 222.209.97.169 - www..comm [20/Apr/2014:23:58:04 +0800] "GET / HTTP/1.1" 
> 200 3188 "http://m.idea123.cn/food.html"; - 0.014 - "-" "Lenovo S890/S100 
> Linux/3.0.13 x/4.0.3 Release/12.12.2011 Browser/AppleWebKit534.30 
> Profile/MIDP-2.0 Configuration/CLDC-1.1 Mobile Safari/534.30" "0.014"
> 59.36.84.241 - www..comm [20/Apr/2014:23:58:05 +0800] "GET 
> /app/x/topic/view.php?id=138555 HTTP/1.1" 200 3151 "-" - 0.009 - "-" 
> "Mozilla/5.0 (Linux; U; x 2.3.7; zh-cn; TD500 Build/GWK74) 
> AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30" 
> "0.009"
> 113.242.39.81 - www..comm [20/Apr/2014:23:58:07 +0800] "GET /x/419691 
> HTTP/1.1" 200 4174 "http://www..comm/x/all/308?p=3"; - 0.013 
> 1n579ukg1gho7i7mr3q8ic8j97 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 
> 10_5_7; en-us) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 
> Safari/530.17; 360browser(securitypay,securityinstalled); 
> 360(x,uppayplugin); 360 Aphone Browser (5.3.1)" "0.013"
> Very strange: when I execute the same query in Hive, the result is normal. I really do not 
> understand. :-(
> OK
> ip  hosttimemethod  request protocolstatus  size
> referer cookieuid   requesttime session httpxrequestedw

[jira] [Commented] (SPARK-1661) the result of querying table created with RegexSerDe is all null

2014-04-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985055#comment-13985055
 ] 

Michael Armbrust commented on SPARK-1661:
-

Thanks for your report.  This JIRA is for reporting bugs with Spark and its 
components.  Shark is a separate project and issues with older versions of 
Shark should probably be filed on the Shark issue tracker.

However, I did add a test to make sure the RegexSerDe works with Spark 
SQL (which is a nearly from-scratch rewrite of Shark that will be included in 
the 1.0 release of Spark as an alpha component). If you find you are still 
having problems with Spark SQL, please reopen this issue.

New spark tests: https://github.com/apache/spark/pull/595

> the result of querying table created with RegexSerDe is all null
> 
>
> Key: SPARK-1661
> URL: https://issues.apache.org/jira/browse/SPARK-1661
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 0.9.0
> Environment: linux 2.6.32-358.el6.x86_64,Hive 12.0,shark 0.9.0,Hadoop 
> 2.2.0
>Reporter: likunjian
>  Labels: HQL, hadoop, hive, regex, shark
> Attachments: log.txt
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The result of querying a table created with RegexSerDe is all NULL.
> When I query a table created with 
> org.apache.hadoop.hive.contrib.serde2.RegexSerDe through Shark, the columns in the 
> result are all NULL.
> select * from access_log where logdate='2014-04-28' limit 10;
> OK
> ip  hosttimemethod  request protocolstatus  size
> referer cookieuid   requesttime session httpxrequestedwith  agent 
>   upstreamresponsetimelogdate
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL  
>   NULLNULLNULLNULLNULL2014-04-28
> Time taken: 4.362 seconds
> my regex is
>  ^([^ ]*) [^ ]* ([^ ]*) \\[([^\]]*)\\] \"([^ ]*) ([^ ]*) ([^ ]*)\" (-|[0-9]*) 
> (-|[0-9]*) \"(\.\+\?|-)\" ([^ ]*) ([^ ]*) ([^ ]*) \"(\.\+\?|-)\" 
> \"(\.\+\?|-)\" \"(\.\+\?|-)\"$
> nginx log example:
> 42.49.44.61 - www..comm [20/Apr/2014:23:58:03 +0800] "GET /x/296837 
> HTTP/1.1" 200 3871 "http://www.x.com/x/296837"; - 0.015 
> 63hbb4om2cvtjs0f7d969n1uf4 "com.x.browser" "Mozilla/5.0 (Linux; U; x 
> 4.1.2; zh-cn; ZTE N919 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) 
> Version/4.0 Mobile Safari/534.30" "0.015"
> 111.121.176.149 - www..comm [20/Apr/2014:23:58:03 +0800] "GET 
> /x/264904 HTTP/1.1" 200 3827 
> "http://m.baidu.com/s?from=2001a&bd_page_type=1&word=%E8%8E%B2%E8%97%95%E6%80%8E%E6%A0%B7%E5%8D%A4%E6%89%8D%E5%A5%BD%E5%90%83";
>  - 0.015 ft7tr4b06b23ub9lnugdf4gcq3 "-" "Mozilla/5.0 (Linux; U; x 4.1.2; 
> zh-CN; 8190Q Build/JZO54K) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 
> UCBrowser/9.5.2.394 U3/0.8.0 Mobile Safari/533.1" "0.015"
> 222.209.97.169 - www..comm [20/Apr/2014:23:58:04 +0800] "GET / HTTP/1.1" 
> 200 3188 "http://m.idea123.cn/food.html"; - 0.014 - "-" "Lenovo S890/S100 
> Linux/3.0.13 x/4.0.3 Release/12.12.2011 Browser/AppleWebKit534.30 
> Profile/MIDP-2.0 Configuration/CLDC-1.1 Mobile Safari/534.30" "0.014"
> 59.36.84.241 - www..comm [20/Apr/2014:23:58:05 +0800] "GET 
> /app/x/topic/view.php?id=138555 HTTP/1.1" 200 3151 "-" - 0.009 - "-" 
> "Mozilla/5.0 (Linux; U; x 2.3.7; zh-cn; TD500 Build/GWK74) 
> AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30" 
> "0.009"
> 113.242.39.81 - www..comm [20/Apr/2014:23:58:07 +0800] "GET 

[jira] [Resolved] (SPARK-1674) Interrupted system call error in pyspark's RDD.pipe

2014-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1674.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

> Interrupted system call error in pyspark's RDD.pipe
> ---
>
> Key: SPARK-1674
> URL: https://issues.apache.org/jira/browse/SPARK-1674
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.0.0
>
>
> RDD.pipe's doctest throws interrupted system call exception on Mac. It can be 
> fixed by wrapping pipe.stdout.readline in an iterator.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-544) Provide a Configuration class in addition to system properties

2014-04-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-544.
-

   Resolution: Fixed
Fix Version/s: 0.9.0

> Provide a Configuration class in addition to system properties
> --
>
> Key: SPARK-544
> URL: https://issues.apache.org/jira/browse/SPARK-544
> Project: Spark
>  Issue Type: New Feature
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 0.9.0
>
>
> This is a much better option for people who want to connect to multiple Spark 
> clusters in the same program, and for unit tests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-544) Provide a Configuration class in addition to system properties

2014-04-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-544:
---

Assignee: Matei Zaharia  (was: Evan Chan)

> Provide a Configuration class in addition to system properties
> --
>
> Key: SPARK-544
> URL: https://issues.apache.org/jira/browse/SPARK-544
> Project: Spark
>  Issue Type: New Feature
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 0.9.0
>
>
> This is a much better option for people who want to connect to multiple Spark 
> clusters in the same program, and for unit tests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1268) Adding XOR and AND-NOT operations to spark.util.collection.BitSet

2014-04-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1268.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

> Adding XOR and AND-NOT operations to spark.util.collection.BitSet
> -
>
> Key: SPARK-1268
> URL: https://issues.apache.org/jira/browse/SPARK-1268
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Petko Nikolov
>Priority: Minor
>  Labels: starter
> Fix For: 1.0.0
>
>
> The BitSet collection is missing some important bit-wise operations. Symmetric 
> difference (xor) in particular is useful for computing some distance metrics 
> (e.g. Hamming); difference (and-not) is useful as well.
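
For illustration of why these operations matter (using java.util.BitSet here 
rather than spark.util.collection.BitSet, which is the class actually gaining 
them): Hamming distance is just the cardinality of the symmetric difference.

{code}
import java.util.BitSet

// Hamming distance between two bit vectors = popcount of their XOR.
def hammingDistance(a: BitSet, b: BitSet): Int = {
  val diff = a.clone().asInstanceOf[BitSet]
  diff.xor(b)          // symmetric difference
  diff.cardinality()
}

// Difference (and-not): bits set in a but not in b.
def difference(a: BitSet, b: BitSet): BitSet = {
  val result = a.clone().asInstanceOf[BitSet]
  result.andNot(b)
  result
}
{code}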



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-615) Add mapPartitionsWithIndex() to the Java API

2014-04-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-615.
-

   Resolution: Fixed
Fix Version/s: 1.0.0

> Add mapPartitionsWithIndex() to the Java API
> 
>
> Key: SPARK-615
> URL: https://issues.apache.org/jira/browse/SPARK-615
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Affects Versions: 0.6.0
>Reporter: Josh Rosen
>Assignee: Holden Karau
>Priority: Minor
>  Labels: Starter
> Fix For: 1.0.0
>
>
> We should add {{mapPartitionsWithIndex()}} to the Java API.
> What should the interface for this look like?  We could require the user to 
> pass in a {{FlatMapFunction[(Int, Iterator[T])]}}, but this requires them to 
> unpack the tuple from Java.  It would be nice if the UDF had a signature like 
> {{f(int partition, Iterator[T] iterator)}}, but this will require defining a 
> new set of {{Function}} classes.
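
For reference, a usage sketch of the Scala API that the Java method would mirror 
(assuming an existing SparkContext {{sc}}); the open question above is how to 
expose the (partition index, iterator) pair as cleanly from Java:

{code}
// The function receives the partition index together with an iterator over
// that partition's elements.
val tagged = sc.parallelize(1 to 100, 4).mapPartitionsWithIndex { (partition, iter) =>
  iter.map(x => (partition, x))
}
{code}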



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError

2014-04-29 Thread Vlad Frolov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985007#comment-13985007
 ] 

Vlad Frolov commented on SPARK-1394:


[~idanzalz] Unfortunately, that only helped avoid one of the exceptions, so I 
commented out the signal binding in PySpark and these crashes went away. I hope 
this will be fixed somehow in the next Spark release.

> calling system.platform on worker raises IOError
> 
>
> Key: SPARK-1394
> URL: https://issues.apache.org/jira/browse/SPARK-1394
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0
> Environment: Tested on Ubuntu and Linux, local and remote master, 
> python 2.7.*
>Reporter: Idan Zalzberg
>  Labels: pyspark
>
> A simple program that calls system.platform() on the worker fails most of the 
> time (it works some times but very rarely).
> This is critical since many libraries call that method (e.g. boto).
> Here is the trace of the attempt to call that method:
> $ /usr/local/spark/bin/pyspark
> Python 2.7.3 (default, Feb 27 2014, 20:00:17)
> [GCC 4.6.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback 
> address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1)
> 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started
> 14/04/02 18:18:38 INFO Remoting: Starting remoting
> 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://spark@10.33.102.46:36640]
> 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: 
> [akka.tcp://spark@10.33.102.46:36640]
> 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster
> 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-local-20140402181839-919f
> 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 
> MB.
> 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id 
> = ConnectionManagerId(10.33.102.46,43357)
> 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager
> 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
> block manager 10.33.102.46:43357 with 294.6 MB RAM
> 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager
> 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
> 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at 
> http://10.33.102.46:51803
> 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker
> 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is 
> /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b
> 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
> 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at 
> http://10.33.102.46:4040
> 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 0.9.0
>   /_/
> Using Python version 2.7.3 (default, Feb 27 2014 20:00:17)
> Spark context available as sc.
> >>> import platform
> >>> sc.parallelize([1]).map(lambda x : platform.system()).collect()
> 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at :1
> 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at :1) with 1 
> output partitions (allowLocal=false)
> 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at 
> :1)
> 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List()
> 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List()
> 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at 
> collect at :1), which has no missing parents
> 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
> (PythonRDD[1] at collect at :1)
> 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
> 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on 
> executor localhost: localhost (PROCESS_LOCAL)
> 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 
> 12 ms
> 14/04/02 18:19:17 INFO Executor: Running task ID 0
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/usr/local/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 182, in 
> dump_stream
> self.seriali

[jira] [Commented] (SPARK-1674) Interrupted system call error in pyspark's RDD.pipe

2014-04-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985000#comment-13985000
 ] 

Xiangrui Meng commented on SPARK-1674:
--

PR: https://github.com/apache/spark/pull/594

> Interrupted system call error in pyspark's RDD.pipe
> ---
>
> Key: SPARK-1674
> URL: https://issues.apache.org/jira/browse/SPARK-1674
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> RDD.pipe's doctest throws interrupted system call exception on Mac. It can be 
> fixed by wrapping pipe.stdout.readline in an iterator.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1373) Compression for In-Memory Columnar storage

2014-04-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984997#comment-13984997
 ] 

Michael Armbrust commented on SPARK-1373:
-

Note that the code is in a large part adapted from Shark, including: 
https://github.com/amplab/shark/blob/master/src/test/scala/shark/memstore2/column/CompressionAlgorithmSuite.scala

> Compression for In-Memory Columnar storage
> --
>
> Key: SPARK-1373
> URL: https://issues.apache.org/jira/browse/SPARK-1373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1674) Interrupted system call error in pyspark's RDD.pipe

2014-04-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1674:


 Summary: Interrupted system call error in pyspark's RDD.pipe
 Key: SPARK-1674
 URL: https://issues.apache.org/jira/browse/SPARK-1674
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


RDD.pipe's doctest throws interrupted system call exception on Mac. It can be 
fixed by wrapping pipe.stdout.readline in an iterator.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1673) GLMNET implementation in Spark

2014-04-29 Thread Sung Chung (JIRA)
Sung Chung created SPARK-1673:
-

 Summary: GLMNET implementation in Spark
 Key: SPARK-1673
 URL: https://issues.apache.org/jira/browse/SPARK-1673
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Sung Chung


This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, Rob 
Tibshirani.

http://www.jstatsoft.org/v33/i01/paper

It's a straightforward implementation of the Coordinate-Descent based L1/L2 
regularized linear models, including Linear/Logistic/Multinomial regressions.
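
As a tiny illustration of the core building block (not the contributed 
implementation itself): coordinate descent for these models repeatedly applies 
the soft-thresholding operator from the paper.

{code}
// Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0),
// the per-coordinate update used in L1-regularized coordinate descent.
def softThreshold(z: Double, gamma: Double): Double =
  math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

// e.g. softThreshold(0.7, 0.2) == 0.5 and softThreshold(-0.1, 0.2) == 0.0
{code}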



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1672) Support separate partitioners (and numbers of partitions) for users and products

2014-04-29 Thread Tor Myklebust (JIRA)
Tor Myklebust created SPARK-1672:


 Summary: Support separate partitioners (and numbers of partitions) 
for users and products
 Key: SPARK-1672
 URL: https://issues.apache.org/jira/browse/SPARK-1672
 Project: Spark
  Issue Type: Improvement
Reporter: Tor Myklebust
Priority: Minor


The user ought to be able to specify a partitioning of his data if he knows a 
good one.  It's convenient to have separate partitioners for users and products 
so that no strange mapping step needs to happen.

It may also be reasonable to partition the users and products into different 
numbers of partitions (for instance, to balance memory requirements) if the 
dataset is tall, thin, and very sparse.
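
A sketch of the idea at the RDD level, assuming {{ratings: RDD[Rating]}} and an 
existing SparkContext (illustrative only, not the proposed ALS API): key the 
same data by user and by product, giving each side its own partitioner and 
partition count.

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.Rating

// More partitions on the user side than the product side, e.g. for a tall,
// thin, sparse dataset.
val userPartitioner = new HashPartitioner(200)
val productPartitioner = new HashPartitioner(50)

val byUser = ratings.map(r => (r.user, r)).partitionBy(userPartitioner)
val byProduct = ratings.map(r => (r.product, r)).partitionBy(productPartitioner)
{code}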



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1672) Support separate partitioners (and numbers of partitions) for users and products

2014-04-29 Thread Tor Myklebust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tor Myklebust updated SPARK-1672:
-

Component/s: MLlib

> Support separate partitioners (and numbers of partitions) for users and 
> products
> 
>
> Key: SPARK-1672
> URL: https://issues.apache.org/jira/browse/SPARK-1672
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Tor Myklebust
>Priority: Minor
>
> The user ought to be able to specify a partitioning of his data if he knows a 
> good one.  It's convenient to have separate partitioners for users and 
> products so that no strange mapping step needs to happen.
> It may also be reasonable to partition the users and products into different 
> numbers of partitions (for instance, to balance memory requirements) if the 
> dataset is tall, thin, and very sparse.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1671) Cached tables should follow write-through policy

2014-04-29 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-1671:
-

 Summary: Cached tables should follow write-through policy
 Key: SPARK-1671
 URL: https://issues.apache.org/jira/browse/SPARK-1671
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Cheng Lian


Writing (insert / load) to a cached table causes cache inconsistency, and the 
user has to unpersist and cache the whole table again.

The write-through policy may be implemented with {{RDD.union}}.
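
A bare-bones illustration of the union idea with plain RDDs (not Spark SQL's 
actual caching internals), assuming an existing SparkContext {{sc}}:

{code}
// cachedRows stands in for the RDD backing a cached table; newRows stands in
// for freshly inserted / loaded rows.
val cachedRows = sc.parallelize(Seq("r1", "r2", "r3")).cache()
val newRows = sc.parallelize(Seq("r4", "r5"))

// Write-through: reads of the table now go through the union of both, instead
// of unpersisting and re-caching everything.
val tableRows = cachedRows.union(newRows)
{code}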



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984950#comment-13984950
 ] 

Hari Shreedharan commented on SPARK-1645:
-

The first one is not exactly accurate, though it explains the idea. The second 
is what I suggest.

As a first step, we do what is currently done: the receiver stores the data 
locally and acknowledges, so reliability does not improve yet. Later we can make 
the improvement, for all receivers, that the data is persisted all the 
way to the driver (by adding a new API like storeReliably or something). 

We would have to do a two-step Poll-ACK process. The initial poll would 
create a new request, added to the ones pending commit in the sink. 
Once the receiver has written the data (for now in the current way, later reliably), 
it sends out an ACK for the request id, which causes the request to be 
committed so Flume can remove the events. If the receiver does not send the 
ACK, the sink can have a scheduled thread kick in (the timeout can be 
specified in the Flume config), roll back, and make the data available again 
(Flume already has the capability to make uncommitted transactions available again 
if that agent fails). A rough sketch of this protocol is below.
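
Here is a hypothetical interface sketch of that two-step Poll-ACK exchange; the 
names are invented for illustration and are not Flume's or Spark's actual API:

{code}
// Hypothetical shapes only, to make the protocol described above concrete.
case class EventBatch(requestId: String, events: Seq[Array[Byte]])

trait PollableFlumeSink {
  // The receiver pulls a batch; the sink marks the Flume transaction as
  // pending commit.
  def poll(maxEvents: Int): EventBatch

  // The receiver has stored the data (for now locally, later reliably) and
  // ACKs; the sink commits so Flume can remove the events from its channel.
  def ack(requestId: String): Unit

  // No ACK before the configured timeout: the sink rolls back and the events
  // become available again.
  def nack(requestId: String): Unit
}
{code}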

> Improve Spark Streaming compatibility with Flume
> 
>
> Key: SPARK-1645
> URL: https://issues.apache.org/jira/browse/SPARK-1645
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibilty:
> * If a Spark worker goes down, it needs to be restarted on the same node, 
> or else Flume cannot send data to it. We can fix this by adding a Flume receiver 
> that polls Flume, and a Flume sink that supports this.
> * Receiver sends acks to Flume before the driver knows about the data. The 
> new receiver should also handle this case.
> * Data loss when driver goes down - This is true for any streaming ingest, 
> not just Flume. I will file a separate jira for this and we should work on it 
> there. This is a longer term project and requires considerable development 
> work.
> I intend to start working on these soon. Any input is appreciated. (It'd be 
> great if someone can add me as a contributor on jira, so I can assign the 
> jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1670) PySpark Fails to Create SparkContext Due To Debugging Options in conf/java-opts

2014-04-29 Thread Pat McDonough (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984948#comment-13984948
 ] 

Pat McDonough commented on SPARK-1670:
--

FYI [~ahirreddy] [~matei], here's the pyspark issue I was talking to you guys 
about

> PySpark Fails to Create SparkContext Due To Debugging Options in 
> conf/java-opts
> ---
>
> Key: SPARK-1670
> URL: https://issues.apache.org/jira/browse/SPARK-1670
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
> Environment: pats-air:spark pat$ IPYTHON=1 bin/pyspark
> Python 2.7.5 (default, Aug 25 2013, 00:04:04) 
> ...
> IPython 1.1.0
> ...
> Spark version 1.0.0-SNAPSHOT
> Using Python version 2.7.5 (default, Aug 25 2013 00:04:04)
>Reporter: Pat McDonough
>
> When JVM debugging options are in conf/java-opts, it causes pyspark to fail 
> when creating the SparkContext. The java-opts file looks like the following:
> {code}-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
> {code}
> Here's the error:
> {code}---
> ValueErrorTraceback (most recent call last)
> /Library/Python/2.7/site-packages/IPython/utils/py3compat.pyc in 
> execfile(fname, *where)
> 202 else:
> 203 filename = fname
> --> 204 __builtin__.execfile(filename, *where)
> /Users/pat/Projects/spark/python/pyspark/shell.py in ()
>  41 SparkContext.setSystemProperty("spark.executor.uri", 
> os.environ["SPARK_EXECUTOR_URI"])
>  42 
> ---> 43 sc = SparkContext(os.environ.get("MASTER", "local[*]"), 
> "PySparkShell", pyFiles=add_files)
>  44 
>  45 print("""Welcome to
> /Users/pat/Projects/spark/python/pyspark/context.pyc in __init__(self, 
> master, appName, sparkHome, pyFiles, environment, batchSize, serializer, 
> conf, gateway)
>  92 tempNamedTuple = namedtuple("Callsite", "function file 
> linenum")
>  93 self._callsite = tempNamedTuple(function=None, file=None, 
> linenum=None)
> ---> 94 SparkContext._ensure_initialized(self, gateway=gateway)
>  95 
>  96 self.environment = environment or {}
> /Users/pat/Projects/spark/python/pyspark/context.pyc in 
> _ensure_initialized(cls, instance, gateway)
> 172 with SparkContext._lock:
> 173 if not SparkContext._gateway:
> --> 174 SparkContext._gateway = gateway or launch_gateway()
> 175 SparkContext._jvm = SparkContext._gateway.jvm
> 176 SparkContext._writeToFile = 
> SparkContext._jvm.PythonRDD.writeToFile
> /Users/pat/Projects/spark/python/pyspark/java_gateway.pyc in launch_gateway()
>  44 proc = Popen(command, stdout=PIPE, stdin=PIPE)
>  45 # Determine which ephemeral port the server started on:
> ---> 46 port = int(proc.stdout.readline())
>  47 # Create a thread to echo output from the GatewayServer, which is 
> required
>  48 # for Java log output to show up:
> ValueError: invalid literal for int() with base 10: 'Listening for transport 
> dt_socket at address: 5005\n'
> {code}
> Note that when you use JVM debugging, the very first line of output (e.g. 
> when running spark-shell) looks like this:
> {code}Listening for transport dt_socket at address: 5005{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1670) PySpark Fails to Create SparkContext Due To Debugging Options in conf/java-opts

2014-04-29 Thread Pat McDonough (JIRA)
Pat McDonough created SPARK-1670:


 Summary: PySpark Fails to Create SparkContext Due To Debugging 
Options in conf/java-opts
 Key: SPARK-1670
 URL: https://issues.apache.org/jira/browse/SPARK-1670
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
 Environment: pats-air:spark pat$ IPYTHON=1 bin/pyspark
Python 2.7.5 (default, Aug 25 2013, 00:04:04) 
...
IPython 1.1.0
...
Spark version 1.0.0-SNAPSHOT

Using Python version 2.7.5 (default, Aug 25 2013 00:04:04)
Reporter: Pat McDonough


When JVM debugging options are in conf/java-opts, it causes pyspark to fail 
when creating the SparkContext. The java-opts file looks like the following:
{code}-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
{code}
Here's the error:
{code}---
ValueErrorTraceback (most recent call last)
/Library/Python/2.7/site-packages/IPython/utils/py3compat.pyc in 
execfile(fname, *where)
202 else:
203 filename = fname
--> 204 __builtin__.execfile(filename, *where)

/Users/pat/Projects/spark/python/pyspark/shell.py in ()
 41 SparkContext.setSystemProperty("spark.executor.uri", 
os.environ["SPARK_EXECUTOR_URI"])
 42 
---> 43 sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", 
pyFiles=add_files)
 44 
 45 print("""Welcome to

/Users/pat/Projects/spark/python/pyspark/context.pyc in __init__(self, master, 
appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway)
 92 tempNamedTuple = namedtuple("Callsite", "function file 
linenum")
 93 self._callsite = tempNamedTuple(function=None, file=None, 
linenum=None)
---> 94 SparkContext._ensure_initialized(self, gateway=gateway)
 95 
 96 self.environment = environment or {}

/Users/pat/Projects/spark/python/pyspark/context.pyc in 
_ensure_initialized(cls, instance, gateway)
172 with SparkContext._lock:
173 if not SparkContext._gateway:
--> 174 SparkContext._gateway = gateway or launch_gateway()
175 SparkContext._jvm = SparkContext._gateway.jvm
176 SparkContext._writeToFile = 
SparkContext._jvm.PythonRDD.writeToFile

/Users/pat/Projects/spark/python/pyspark/java_gateway.pyc in launch_gateway()
 44 proc = Popen(command, stdout=PIPE, stdin=PIPE)
 45 # Determine which ephemeral port the server started on:
---> 46 port = int(proc.stdout.readline())
 47 # Create a thread to echo output from the GatewayServer, which is 
required
 48 # for Java log output to show up:

ValueError: invalid literal for int() with base 10: 'Listening for transport 
dt_socket at address: 5005\n'
{code}

Note that when you use JVM debugging, the very first line of output (e.g. when 
running spark-shell) looks like this:
{code}Listening for transport dt_socket at address: 5005{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1669) SQLContext.cacheTable() should be idempotent

2014-04-29 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-1669:
-

 Summary: SQLContext.cacheTable() should be idempotent
 Key: SPARK-1669
 URL: https://issues.apache.org/jira/browse/SPARK-1669
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Cheng Lian


Calling {{cacheTable()}} on some table {{t}} multiple times causes table {{t}} 
to be cached multiple times. This behavior differs from {{RDD.cache()}}, 
which is idempotent.

We can check whether a table is already cached by checking:

# whether the structure of the underlying logical plan of the table matches 
the pattern {{Subquery(\_, SparkLogicalPlan(inMem @ 
InMemoryColumnarTableScan(_, _)))}}
# whether {{inMem.cachedColumnBuffers.getStorageLevel.useMemory}} is true



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984927#comment-13984927
 ] 

Tathagata Das commented on SPARK-1645:
--

Ah, I think I get it now. So instead of the default push-based approach as it is 
now (where a sink is running with the receiver), you simply want to make it 
pull-based. 

So if the current situation is this 

!http://i.imgur.com/m8oiOwl.png?1!  

you propose this  

!http://i.imgur.com/N6Ee1cb.png?1!

Right?
Assuming that is right, it does make things very convenient for Spark Streaming's 
receivers. However, what does it mean for reliable receiving? When the receiver 
pulls the data from the source, will it acknowledge the source only once 
Spark acknowledges that it has reliably saved the data?


> Improve Spark Streaming compatibility with Flume
> 
>
> Key: SPARK-1645
> URL: https://issues.apache.org/jira/browse/SPARK-1645
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibilty:
> * If a Spark worker goes down, it needs to be restarted on the same node, 
> or else Flume cannot send data to it. We can fix this by adding a Flume receiver 
> that polls Flume, and a Flume sink that supports this.
> * Receiver sends acks to Flume before the driver knows about the data. The 
> new receiver should also handle this case.
> * Data loss when driver goes down - This is true for any streaming ingest, 
> not just Flume. I will file a separate jira for this and we should work on it 
> there. This is a longer term project and requires considerable development 
> work.
> I intend to start working on these soon. Any input is appreciated. (It'd be 
> great if someone can add me as a contributor on jira, so I can assign the 
> jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1515) Specialized ColumnTypes for Array, Map and Struct

2014-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1515:


Labels: compression  (was: )

> Specialized ColumnTypes for Array, Map and Struct
> -
>
> Key: SPARK-1515
> URL: https://issues.apache.org/jira/browse/SPARK-1515
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Cheng Lian
>  Labels: compression
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1513) Specialized ColumnType for Timestamp

2014-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1513:


Labels: compression  (was: )

> Specialized ColumnType for Timestamp
> 
>
> Key: SPARK-1513
> URL: https://issues.apache.org/jira/browse/SPARK-1513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Cheng Lian
>  Labels: compression
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1512) improve spark sql to support table with more than 22 fields

2014-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1512.
-

Resolution: Fixed

> improve spark sql to support table with more than 22 fields
> ---
>
> Key: SPARK-1512
> URL: https://issues.apache.org/jira/browse/SPARK-1512
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: wangfei
> Fix For: 1.0.0
>
>
> Spark SQL uses case classes to define a table, so the 22-field limit on case 
> classes means Spark SQL cannot support wide (more than 22 fields) tables. Wide 
> tables are common in many cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1512) improve spark sql to support table with more than 22 fields

2014-04-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984922#comment-13984922
 ] 

Michael Armbrust commented on SPARK-1512:
-

Now that we have updated the docs to talk about creating custom Product 
classes, I'm going to mark this as resolved.
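
A small sketch of that workaround, assuming the usual pattern of implementing 
Product directly instead of using a case class (which sidesteps the 22-field 
limit on Scala 2.10 case classes):

{code}
// A row class that implements Product directly, so it is not limited to the
// 22 fields allowed for case classes.
class WideRecord(val values: IndexedSeq[Any]) extends Product with Serializable {
  override def productArity: Int = values.length
  override def productElement(n: Int): Any = values(n)
  override def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
}
{code}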

> improve spark sql to support table with more than 22 fields
> ---
>
> Key: SPARK-1512
> URL: https://issues.apache.org/jira/browse/SPARK-1512
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: wangfei
> Fix For: 1.0.0
>
>
> Spark SQL uses case classes to define a table, so the 22-field limit on case 
> classes means Spark SQL cannot support wide (more than 22 fields) tables. Wide 
> tables are common in many cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1610) Cast from BooleanType to NumericType should use exact type value.

2014-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1610.
-

   Resolution: Fixed
Fix Version/s: 1.0.0

> Cast from BooleanType to NumericType should use exact type value.
> -
>
> Key: SPARK-1610
> URL: https://issues.apache.org/jira/browse/SPARK-1610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
> Fix For: 1.0.0
>
>
> Casts from BooleanType to NumericType all use an Int value.
> This causes a ClassCastException when the cast value is used by a following 
> evaluation, as in the code below:
> {quote}
> scala> import org.apache.spark.sql.catalyst._
> import org.apache.spark.sql.catalyst._
> scala> import types._
> import types._
> scala> import expressions._
> import expressions._
> scala> Add(Cast(Literal(true), ShortType), Literal(1.toShort)).eval()
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> java.lang.Short
>   at scala.runtime.BoxesRunTime.unboxToShort(BoxesRunTime.java:102)
>   at scala.math.Numeric$ShortIsIntegral$.plus(Numeric.scala:72)
>   at 
> org.apache.spark.sql.catalyst.expressions.Add$$anonfun$eval$2.apply(arithmetic.scala:58)
>   at 
> org.apache.spark.sql.catalyst.expressions.Add$$anonfun$eval$2.apply(arithmetic.scala:58)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:114)
>   at 
> org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:58)
>   at .(:17)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
>   at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
>   at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
>   at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
>   at 
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)
>   at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717)
>   at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581)
>   at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588)
>   at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591)
>   at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882)
>   at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
>   at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837)
>   at 
> scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:83)
>   at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96)
>   at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105)
>   at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)
> {quote}
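
For illustration only (this is not the Catalyst code): the intended behaviour is 
that a Boolean cast to ShortType yields an actual Short, which the Numeric[Short] 
instance from the stack trace above can then unbox:

{code}
// The cast should produce a value whose runtime type matches the target type.
def castBooleanToShort(b: Boolean): Short = if (b) 1.toShort else 0.toShort

// Numeric[Short].plus (scala.math.Numeric$ShortIsIntegral) now unboxes a real Short.
val result = implicitly[Numeric[Short]].plus(castBooleanToShort(true), 1.toShort)  // 2
{code}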



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1608) Cast.nullable should be true when cast from StringType to NumericType/TimestampType

2014-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1608.
-

   Resolution: Fixed
Fix Version/s: 1.0.0

> Cast.nullable should be true when cast from StringType to 
> NumericType/TimestampType
> ---
>
> Key: SPARK-1608
> URL: https://issues.apache.org/jira/browse/SPARK-1608
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
> Fix For: 1.0.0
>
>
> Cast.nullable should be true when casting from StringType to NumericType or 
> TimestampType, because if the StringType expression holds an illegal number 
> string or an illegal timestamp string, the cast value becomes null.
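
A small illustration (not the Catalyst implementation) of why the result has to be 
nullable:

{code}
// An illegal number string cannot be cast, so the result is null (None here),
// which is why Cast.nullable must be true for String -> numeric/timestamp casts.
def castStringToInt(s: String): Option[Int] =
  try Some(s.trim.toInt) catch { case _: NumberFormatException => None }

castStringToInt("123")  // Some(123)
castStringToInt("abc")  // None, i.e. null in SQL terms
{code}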



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values

2014-04-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984909#comment-13984909
 ] 

Michael Armbrust commented on SPARK-1649:
-

Oh, I see.  I forgot that we would also need this inside of ArrayType.  Also, 
for MapType it seems like it only matters for the value, not the key, as I'm not 
sure we would allow null keys.

This is something we need to consider. However, I think I'm going to change the 
title to something less prescriptive.  Could we, for now, just say that null 
values are not supported in arrays in Parquet files?
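
To make the discussion concrete, one possible shape (the names and defaults below 
are just a sketch, not a settled design) would attach nullability to the element 
and value positions rather than to map keys:

{code}
// Sketch only: nullability is recorded on the element/value position, not on the key.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean = true) extends DataType
case class MapType(keyType: DataType, valueType: DataType,
                   valueContainsNull: Boolean = true) extends DataType

// An array of ints that never contains null, and a map whose values may be null:
val a = ArrayType(IntegerType, containsNull = false)
val m = MapType(StringType, IntegerType)
{code}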

> Figure out Nullability semantics for Array elements and Map values
> --
>
> Key: SPARK-1649
> URL: https://issues.apache.org/jira/browse/SPARK-1649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Andre Schumacher
>Priority: Critical
>
> For the underlying storage layer it would simplify things such as schema 
> conversions and predicate filter determination to record in the data type 
> itself whether a column can be nullable. So the DataType type could look like 
> this:
> abstract class DataType(nullable: Boolean = true)
> Concrete subclasses could then override the nullable val. Mostly this could 
> be left as the default, but when types are contained in nested types one 
> could optimize for, e.g., arrays with elements that are nullable and those 
> that are not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values

2014-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1649:


Summary: Figure out Nullability semantics for Array elements and Map values 
 (was: DataType should contain nullable bit)

> Figure out Nullability semantics for Array elements and Map values
> --
>
> Key: SPARK-1649
> URL: https://issues.apache.org/jira/browse/SPARK-1649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Andre Schumacher
>Priority: Critical
>
> For the underlying storage layer it would simplify things such as schema 
> conversions and predicate filter determination to record in the data type 
> itself whether a column can be nullable. So the DataType type could look like 
> this:
> abstract class DataType(nullable: Boolean = true)
> Concrete subclasses could then override the nullable val. Mostly this could 
> be left as the default, but when types are contained in nested types one 
> could optimize for, e.g., arrays with elements that are nullable and those 
> that are not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1667) Should re-fetch when intermediate data for shuffle is lost

2014-04-29 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984900#comment-13984900
 ] 

Kousuke Saruta commented on SPARK-1667:
---

Now I'm trying to address this issue.

> Should re-fetch when intermediate data for shuffle is lost
> --
>
> Key: SPARK-1667
> URL: https://issues.apache.org/jira/browse/SPARK-1667
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.0.0
>Reporter: Kousuke Saruta
>
> I hit a case where a re-fetch should have occurred but did not.
> When the intermediate data used for shuffle (the physical file of the 
> intermediate data on the local file system) is lost from an executor, a 
> FileNotFoundException was thrown and no re-fetch occurred.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984816#comment-13984816
 ] 

Hari Shreedharan commented on SPARK-1645:
-

No, the Flume source and sink reside within the same JVM 
(http://flume.apache.org/FlumeUserGuide.html#architecture). So the receiver 
polls the Flume sink running on a different node (the node that runs the Flume 
agent pushing the data). If the node running the receiver goes down, then 
another worker starts up and reads from the same Flume agent. If the Flume 
agent goes down, the receiver keeps polling and fails to get data until the 
agent is back up. 

> Improve Spark Streaming compatibility with Flume
> 
>
> Key: SPARK-1645
> URL: https://issues.apache.org/jira/browse/SPARK-1645
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibility:
> * If a Spark worker goes down, it needs to be restarted on the same node, 
> else Flume cannot send data to it. We can fix this by adding a Flume receiver 
> that polls Flume, and a Flume sink that supports this.
> * Receiver sends acks to Flume before the driver knows about the data. The 
> new receiver should also handle this case.
> * Data loss when driver goes down - This is true for any streaming ingest, 
> not just Flume. I will file a separate jira for this and we should work on it 
> there. This is a longer term project and requires considerable development 
> work.
> I intend to start working on these soon. Any input is appreciated. (It'd be 
> great if someone can add me as a contributor on jira, so I can assign the 
> jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984800#comment-13984800
 ] 

Tathagata Das commented on SPARK-1645:
--

But this does not solve the scenario where the whole worker running the 
receiver dies. If the worker dies, then the receiver and the sink are all gone, 
and the Flume source has nowhere to send the data, doesn't it?

As far as I understand, the only way to deal with a worker failure is to 
configure a pool of workers as sinks. If one of the sinks doesn't work because 
its worker failed, the second sink on the second worker can still receive 
data. Am I missing something?

> Improve Spark Streaming compatibility with Flume
> 
>
> Key: SPARK-1645
> URL: https://issues.apache.org/jira/browse/SPARK-1645
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibility:
> * If a Spark worker goes down, it needs to be restarted on the same node, 
> else Flume cannot send data to it. We can fix this by adding a Flume receiver 
> that polls Flume, and a Flume sink that supports this.
> * Receiver sends acks to Flume before the driver knows about the data. The 
> new receiver should also handle this case.
> * Data loss when driver goes down - This is true for any streaming ingest, 
> not just Flume. I will file a separate jira for this and we should work on it 
> there. This is a longer term project and requires considerable development 
> work.
> I intend to start working on these soon. Any input is appreciated. (It'd be 
> great if someone can add me as a contributor on jira, so I can assign the 
> jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984780#comment-13984780
 ] 

Hari Shreedharan commented on SPARK-1645:
-

No, the sink would run inside the Flume agent that Spark is receiving data 
from. (A sink is a Flume component that pushes data out - this is managed by 
Flume.) Basically, this sink pulls data from the Flume agent's buffer when the 
Spark receiver polls it. If the receiver dies and restarts, then as long as the 
receiver knows which agent to poll, it will be able to get the data. This 
solves the case where Flume is pushing data to a receiver which may have died 
and restarted elsewhere, since Spark now polls Flume.

> Improve Spark Streaming compatibility with Flume
> 
>
> Key: SPARK-1645
> URL: https://issues.apache.org/jira/browse/SPARK-1645
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibility:
> * If a Spark worker goes down, it needs to be restarted on the same node, 
> else Flume cannot send data to it. We can fix this by adding a Flume receiver 
> that polls Flume, and a Flume sink that supports this.
> * Receiver sends acks to Flume before the driver knows about the data. The 
> new receiver should also handle this case.
> * Data loss when driver goes down - This is true for any streaming ingest, 
> not just Flume. I will file a separate jira for this and we should work on it 
> there. This is a longer term project and requires considerable development 
> work.
> I intend to start working on these soon. Any input is appreciated. (It'd be 
> great if someone can add me as a contributor on jira, so I can assign the 
> jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984768#comment-13984768
 ] 

Tathagata Das commented on SPARK-1645:
--

Let me understand this. Is this sink going to run as a separate process outside 
the Spark executor? If it is running as a thread in the same executor process 
as the receiver, then that is no better than what we have now, as it will fail 
when the executor fails. So I am guessing it will be a process outside the 
executor. Doesn't that introduce the headache of managing that process 
separately? And what happens when the whole worker node dies? 

> Improve Spark Streaming compatibility with Flume
> 
>
> Key: SPARK-1645
> URL: https://issues.apache.org/jira/browse/SPARK-1645
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibility:
> * If a Spark worker goes down, it needs to be restarted on the same node, 
> else Flume cannot send data to it. We can fix this by adding a Flume receiver 
> that polls Flume, and a Flume sink that supports this.
> * Receiver sends acks to Flume before the driver knows about the data. The 
> new receiver should also handle this case.
> * Data loss when driver goes down - This is true for any streaming ingest, 
> not just Flume. I will file a separate jira for this and we should work on it 
> there. This is a longer term project and requires considerable development 
> work.
> I intend to start working on these soon. Any input is appreciated. (It'd be 
> great if someone can add me as a contributor on jira, so I can assign the 
> jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984734#comment-13984734
 ] 

Hari Shreedharan commented on SPARK-1645:
-

Yes, so I have a rough design for that in mind. The idea is to add a sink which 
plugs into Flume, which gets polled by the Spark receiver. That way, even if 
the node on which the worker is running fails, the receiver on another node can 
poll the sink and pull data. From the Flume point of view, the sink does not 
"conform" to the definition of standard sinks (all Flume sinks are push only), 
but it can be written such that we don't lose data. Later, if/when Flume adds 
support for pollable sinks, this sink can be ported.
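
To make the shape of that design concrete, here is a rough, hypothetical sketch of 
the receiver side; the PollingClient trait stands in for the yet-to-be-designed sink 
transport, and none of these names come from an actual implementation:

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical transport to the proposed Flume sink; poll() pulls a batch, ack() confirms it.
trait PollingClient extends Serializable {
  def poll(): Seq[String]
  def ack(): Unit
}

// A receiver that pulls from the sink instead of having Flume push to it, so it can be
// restarted on any worker as long as it knows which agent to poll.
class PollingFlumeReceiver(client: PollingClient)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("flume-polling-thread") {
      override def run(): Unit = {
        while (!isStopped()) {
          val batch = client.poll()
          if (batch.nonEmpty) {
            store(batch.iterator)  // hand the batch to Spark first...
            client.ack()           // ...then acknowledge, so data is not lost on failure
          }
        }
      }
    }.start()
  }

  def onStop(): Unit = { }  // the polling thread exits once isStopped() returns true
}
{code}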

> Improve Spark Streaming compatibility with Flume
> 
>
> Key: SPARK-1645
> URL: https://issues.apache.org/jira/browse/SPARK-1645
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibility:
> * If a Spark worker goes down, it needs to be restarted on the same node, 
> else Flume cannot send data to it. We can fix this by adding a Flume receiver 
> that polls Flume, and a Flume sink that supports this.
> * Receiver sends acks to Flume before the driver knows about the data. The 
> new receiver should also handle this case.
> * Data loss when driver goes down - This is true for any streaming ingest, 
> not just Flume. I will file a separate jira for this and we should work on it 
> there. This is a longer term project and requires considerable development 
> work.
> I intend to start working on these soon. Any input is appreciated. (It'd be 
> great if someone can add me as a contributor on jira, so I can assign the 
> jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1569) Spark on Yarn, authentication broken by pr299

2014-04-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984713#comment-13984713
 ] 

Thomas Graves edited comment on SPARK-1569 at 4/29/14 8:00 PM:
---

{quote}
@tgravescs ah I see, you're right. I think I assumed incorrectly that the 
executor launcher would bundle up the options and send them over, but I don't 
actually see that happening anywhere. So this part of the code is actually not 
used:

https://github.com/apache/spark/blob/df6d81425bf3b8830988288069f6863de873aee2/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L328

What happens is the executor is just getting its configuration from the driver 
when the executor launches. And that works in most cases except for security, 
which it needs to know about before connecting. Is that right?
{quote}

That is correct. It needs it before connecting.  The code you reference handles 
adding it for the application master but not the executors. It looks like we 
need similar code in ExecutorRunnableUtil.prepareCommand. 

Yes, before we only had SPARK_JAVA_OPTS, and that got put as -D options on the 
command line, so it was always set correctly when the executors launched.



was (Author: tgraves):

@tgravescs ah I see, you're right. I think I assumed incorrectly that the 
executor launcher would bundle up the options and send them over, but I don't 
actually see that happening anywhere. So this part of the code is actually not 
used:

https://github.com/apache/spark/blob/df6d81425bf3b8830988288069f6863de873aee2/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L328

What happens is the executor is just getting its configuration from the driver 
when the executor launches. And that works in most cases except for security, 
which it needs to know about before connecting. Is that right?


That is correct. It needs it before connecting.  The code you reference handles 
adding it for the application master but not the executors. It looks like we 
need similar code in ExecutorRunnableUtil.prepareCommand. 

Yes, before we only had SPARK_JAVA_OPTS, and that got put as -D options on the 
command line, so it was always set correctly when the executors launched.


> Spark on Yarn, authentication broken by pr299
> -
>
> Key: SPARK-1569
> URL: https://issues.apache.org/jira/browse/SPARK-1569
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> https://github.com/apache/spark/pull/299 changed the way configuration was 
> done and passed to the executors.  This breaks use of authentication as the 
> executor needs to know that authentication is enabled before connecting to 
> the driver.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1569) Spark on Yarn, authentication broken by pr299

2014-04-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984713#comment-13984713
 ] 

Thomas Graves commented on SPARK-1569:
--


@tgravescs ah I see, you're right. I think I assumed incorrectly that the 
executor launcher would bundle up the options and send them over, but I don't 
actually see that happening anywhere. So this part of the code is actually not 
used:

https://github.com/apache/spark/blob/df6d81425bf3b8830988288069f6863de873aee2/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L328

What happens is the executor is just getting its configuration from the driver 
when the executor launches. And that works in most cases except for security, 
which it needs to know about before connecting. Is that right?


That is correct. It needs it before connecting.  The code you reference handles 
adding it for the application master but not the executors. It looks like we 
need similar code in ExecutorRunnableUtil.prepareCommand. 

Yes, before we only had SPARK_JAVA_OPTS, and that got put as -D options on the 
command line, so it was always set correctly when the executors launched.
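
As a rough sketch of the idea (not the actual patch), the executor launch command 
built in prepareCommand would need to carry the security settings as -D options, 
along the lines of:

{code}
import org.apache.spark.SparkConf

// Hypothetical helper: build the extra -D options an executor launch command needs so
// that authentication settings are known before the executor connects back to the driver.
def securityJavaOpts(conf: SparkConf): Seq[String] =
  Seq("spark.authenticate", "spark.authenticate.secret")
    .flatMap(key => conf.getOption(key).map(value => s"-D$key=$value"))
{code}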


> Spark on Yarn, authentication broken by pr299
> -
>
> Key: SPARK-1569
> URL: https://issues.apache.org/jira/browse/SPARK-1569
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> https://github.com/apache/spark/pull/299 changed the way configuration was 
> done and passed to the executors.  This breaks use of authentication as the 
> executor needs to know that authentication is enabled before connecting to 
> the driver.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1588) SPARK_JAVA_OPTS and SPARK_YARN_USER_ENV are not getting propagated

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1588.


   Resolution: Fixed
Fix Version/s: 1.0.0

> SPARK_JAVA_OPTS and SPARK_YARN_USER_ENV are not getting propagated
> --
>
> Key: SPARK-1588
> URL: https://issues.apache.org/jira/browse/SPARK-1588
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Mridul Muralidharan
>Assignee: Sandy Ryza
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We could use SPARK_JAVA_OPTS to pass JAVA_OPTS to be used in the master.
> This is no longer working in the current master branch.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1668) Add implicit preference as an option to examples/MovieLensALS

2014-04-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1668:


 Summary: Add implicit preference as an option to 
examples/MovieLensALS
 Key: SPARK-1668
 URL: https://issues.apache.org/jira/browse/SPARK-1668
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Priority: Minor


Add --implicitPrefs as a command-line option to the example app MovieLensALS 
under examples/. For evaluation, we should map ratings to the range [0, 1] and 
compare them with the predictions. It would be better if we also added 
unobserved ratings (assumed to be negatives) to the evaluation.
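
As a rough standalone sketch of the proposed evaluation (not the example app itself; 
the data and parameters below are placeholders):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ImplicitEvalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "ImplicitEvalSketch")
    // Placeholder ratings; the real example would load MovieLens data.
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 4.0), Rating(1, 20, 1.0), Rating(2, 10, 5.0), Rating(2, 30, 2.0)))

    val model = ALS.trainImplicit(ratings, 8, 5, 0.01, 1.0)

    // Map observed ratings into [0, 1] and compare with predictions clamped to the same range.
    val targets = ratings.map(r => ((r.user, r.product), if (r.rating > 0) 1.0 else 0.0))
    val predictions = model.predict(ratings.map(r => (r.user, r.product)))
      .map(p => ((p.user, p.product), math.max(0.0, math.min(1.0, p.rating))))

    val rmse = math.sqrt(
      targets.join(predictions).values.map { case (t, p) => (t - p) * (t - p) }.mean())
    println(s"RMSE against mapped [0, 1] ratings: $rmse")
    sc.stop()
  }
}
{code}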



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984654#comment-13984654
 ] 

Tathagata Das commented on SPARK-1645:
--

Yes, we will keep you posted. 

Though one thing that is reasonably independent is to add the ability for Flume 
receivers to be launched on multiple workers, such that one can act as a 
standby when the primary receiver fails. 

> Improve Spark Streaming compatibility with Flume
> 
>
> Key: SPARK-1645
> URL: https://issues.apache.org/jira/browse/SPARK-1645
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibility:
> * If a Spark worker goes down, it needs to be restarted on the same node, 
> else Flume cannot send data to it. We can fix this by adding a Flume receiver 
> that polls Flume, and a Flume sink that supports this.
> * Receiver sends acks to Flume before the driver knows about the data. The 
> new receiver should also handle this case.
> * Data loss when driver goes down - This is true for any streaming ingest, 
> not just Flume. I will file a separate jira for this and we should work on it 
> there. This is a longer term project and requires considerable development 
> work.
> I intend to start working on these soon. Any input is appreciated. (It'd be 
> great if someone can add me as a contributor on jira, so I can assign the 
> jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1100) saveAsTextFile shouldn't clobber by default

2014-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1100:
---

Assignee: Patrick Wendell  (was: Patrick Cogan)

> saveAsTextFile shouldn't clobber by default
> ---
>
> Key: SPARK-1100
> URL: https://issues.apache.org/jira/browse/SPARK-1100
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 0.9.0
>Reporter: Diana Carroll
>Assignee: Patrick Wendell
> Fix For: 1.0.0
>
>
> If I call rdd.saveAsTextFile with an existing directory, it will cheerfully 
> and silently overwrite the files in there.  This is bad enough if it means 
> I've accidentally blown away the results of a job that might have taken 
> minutes or hours to run.  But it's worse if the second job happens to have 
> fewer partitions than the first...in that case, my output directory now 
> contains some "part" files from the earlier job, and some "part" files from 
> the later job.  The only way to know the difference is timestamp.
> I wonder if Spark's saveAsTextFile shouldn't work more like Hadoop MapReduce 
> which insists that the output directory not exist before the job starts.  
> Similarly HDFS won't overwrite files by default.  Perhaps there could be an 
> optional argument for saveAsTextFile that indicates if it should delete the 
> existing directory before starting.  (I can't see any time I'd want to allow 
> writing to an existing directory with data already in it.  Would the mix of 
> output from different tasks ever be desirable?)
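
A minimal sketch of the kind of guard being asked for, written as a hypothetical 
helper rather than an existing API:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Hypothetical helper: refuse to write if the output directory already exists,
// mirroring Hadoop MapReduce's behaviour, instead of silently mixing "part" files.
def saveIfAbsent[T](rdd: RDD[T], output: String): Unit = {
  val path = new Path(output)
  val fs = FileSystem.get(path.toUri, new Configuration())
  require(!fs.exists(path), s"Output directory $output already exists")
  rdd.saveAsTextFile(output)
}
{code}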



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1667) Should re-fetch when intermediate data for shuffle is lost

2014-04-29 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-1667:
-

 Summary: Should re-fetch when intermediate data for shuffle is lost
 Key: SPARK-1667
 URL: https://issues.apache.org/jira/browse/SPARK-1667
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.0.0
Reporter: Kousuke Saruta


I hit a case where a re-fetch should have occurred but did not.
When the intermediate data used for shuffle (the physical file of the 
intermediate data on the local file system) is lost from an executor, a 
FileNotFoundException was thrown and no re-fetch occurred.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1569) Spark on Yarn, authentication broken by pr299

2014-04-29 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-1569:
-

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-1652

> Spark on Yarn, authentication broken by pr299
> -
>
> Key: SPARK-1569
> URL: https://issues.apache.org/jira/browse/SPARK-1569
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> https://github.com/apache/spark/pull/299 changed the way configuration was 
> done and passed to the executors.  This breaks use of authentication as the 
> executor needs to know that authentication is enabled before connecting to 
> the driver.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1625) Ensure all legacy YARN options are supported with spark-submit

2014-04-29 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves closed SPARK-1625.


Resolution: Fixed

> Ensure all legacy YARN options are supported with spark-submit
> --
>
> Key: SPARK-1625
> URL: https://issues.apache.org/jira/browse/SPARK-1625
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1625) Ensure all legacy YARN options are supported with spark-submit

2014-04-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984519#comment-13984519
 ] 

Thomas Graves commented on SPARK-1625:
--

I'll create a separate jira for that.

> Ensure all legacy YARN options are supported with spark-submit
> --
>
> Key: SPARK-1625
> URL: https://issues.apache.org/jira/browse/SPARK-1625
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1664) spark-submit --name doesn't work in yarn-client mode

2014-04-29 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-1664:
-

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-1652

> spark-submit --name doesn't work in yarn-client mode
> 
>
> Key: SPARK-1664
> URL: https://issues.apache.org/jira/browse/SPARK-1664
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> When using spark-submit in yarn-client mode, the --name option doesn't 
> properly set the application name in the ResourceManager UI.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1666) document examples

2014-04-29 Thread Diana Carroll (JIRA)
Diana Carroll created SPARK-1666:


 Summary: document examples
 Key: SPARK-1666
 URL: https://issues.apache.org/jira/browse/SPARK-1666
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 0.9.1
Reporter: Diana Carroll


It would be great if there were some guidance about what the example code 
shipped with Spark (under $SPARKHOME/examples and $SPARKHOME/python/examples) 
does and how to run it.  Perhaps a comment block at the beginning explaining 
what the code accomplishes and what parameters it takes.  Also, if there are 
sample datasets on which the example is designed to run, please point to those.

(As an example, look at kmeans.py, which takes a file argument but has no hint 
about what sort of data is in the file or what format the data should be in.)
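
For instance, a header comment of roughly this shape at the top of each example 
could work; the name, parameters, and data format below are purely illustrative:

{code}
/**
 * KMeansExample: clusters points read from a text file.
 *
 * Usage: KMeansExample <inputFile> <k> <iterations>
 *   inputFile  - text file with one point per line, as space-separated doubles,
 *                e.g. "1.0 2.5 3.1"
 *   k          - number of clusters
 *   iterations - maximum number of iterations
 *
 * A pointer to a small sample dataset suitable for this example would go here.
 */
object KMeansExample
{code}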



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1665) add a config to replace SPARK_YARN_USER_ENV

2014-04-29 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-1665:


 Summary: add a config to replace SPARK_YARN_USER_ENV
 Key: SPARK-1665
 URL: https://issues.apache.org/jira/browse/SPARK-1665
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves


We should add a config to replace the env variable SPARK_YARN_USER_ENV.  If it 
makes sense, we should make it generic to all of Spark.  If it doesn't, then we 
should at least have a YARN-specific config so we aren't using environment 
variables anymore.
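
For illustration, the difference would roughly look like the following; the config 
key below is a placeholder, not a proposed or existing name:

{code}
import org.apache.spark.SparkConf

// Today: export SPARK_YARN_USER_ENV="FOO=bar,BAZ=qux" before launching.
// With a config instead (the key name is purely illustrative):
val conf = new SparkConf()
  .set("spark.yarn.user.env.FOO", "bar")
  .set("spark.yarn.user.env.BAZ", "qux")
{code}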



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1664) spark-submit --name doesn't work in yarn-client mode

2014-04-29 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-1664:
-

Priority: Blocker  (was: Major)

> spark-submit --name doesn't work in yarn-client mode
> 
>
> Key: SPARK-1664
> URL: https://issues.apache.org/jira/browse/SPARK-1664
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> When using spark-submit in yarn-client mode, the --name option doesn't 
> properly set the application name in the ResourceManager UI.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1664) spark-submit --name doesn't work in yarn-client mode

2014-04-29 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-1664:


 Summary: spark-submit --name doesn't work in yarn-client mode
 Key: SPARK-1664
 URL: https://issues.apache.org/jira/browse/SPARK-1664
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves


When using spark-submit in yarn-client mode, the --name option doesn't properly 
set the application name in the ResourceManager UI.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1663) Spark Streaming docs code has several small errors

2014-04-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984466#comment-13984466
 ] 

Sean Owen commented on SPARK-1663:
--

PR: https://github.com/apache/spark/pull/589

> Spark Streaming docs code has several small errors
> --
>
> Key: SPARK-1663
> URL: https://issues.apache.org/jira/browse/SPARK-1663
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 0.9.1
>Reporter: Sean Owen
>Priority: Minor
>  Labels: streaming
>
> The changes are easiest to elaborate in the PR, which I will open shortly.
> Those changes raised a few little questions about the API too.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1663) Spark Streaming docs code has several small errors

2014-04-29 Thread Sean Owen (JIRA)
Sean Owen created SPARK-1663:


 Summary: Spark Streaming docs code has several small errors
 Key: SPARK-1663
 URL: https://issues.apache.org/jira/browse/SPARK-1663
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 0.9.1
Reporter: Sean Owen
Priority: Minor


The changes are easiest to elaborate in the PR, which I will open shortly.

Those changes raised a few little questions about the API too.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1557) Set permissions on event log files/directories

2014-04-29 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-1557.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

> Set permissions on event log files/directories
> --
>
> Key: SPARK-1557
> URL: https://issues.apache.org/jira/browse/SPARK-1557
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 1.0.0
>
>
> We should set the permissions on the event log directories and files so that 
> access is restricted to only those users who own them, while still allowing a 
> super user to read them so that they can be displayed by the history server 
> in a multi-tenant secure environment. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1639) Some tidying of Spark on YARN code

2014-04-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984318#comment-13984318
 ] 

Thomas Graves commented on SPARK-1639:
--

https://github.com/apache/spark/pull/561

> Some tidying of Spark on YARN code
> --
>
> Key: SPARK-1639
> URL: https://issues.apache.org/jira/browse/SPARK-1639
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 0.9.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> I found a few places where we can consolidate duplicate methods, fix typos, 
> add comments, and make what's going on more clear.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-1625) Ensure all legacy YARN options are supported with spark-submit

2014-04-29 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reopened SPARK-1625:
--


These aren't the only things broken. One big issue is that authentication isn't 
being passed properly anymore.  Unless that was fixed under another jira?

> Ensure all legacy YARN options are supported with spark-submit
> --
>
> Key: SPARK-1625
> URL: https://issues.apache.org/jira/browse/SPARK-1625
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1662) PySpark fails if python class is used as a data container

2014-04-29 Thread Chandan Kumar (JIRA)
Chandan Kumar created SPARK-1662:


 Summary: PySpark fails if python class is used as a data container
 Key: SPARK-1662
 URL: https://issues.apache.org/jira/browse/SPARK-1662
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
 Environment: Ubuntu 14, Python 2.7.6
Reporter: Chandan Kumar
Priority: Minor


PySpark fails if RDD operations are performed on data encapsulated in Python 
objects (a rare use case where plain Python objects are used as data containers 
instead of regular dicts or tuples).

I have written a small piece of code to reproduce the bug:
https://gist.github.com/nrchandan/11394440






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1644) hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") throw an exception

2014-04-29 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984114#comment-13984114
 ] 

Guoqiang Li commented on SPARK-1644:


{code}
# When Hive support is needed, Datanucleus jars must be included on the classpath.
# Datanucleus jars do not work if only included in the uber jar as plugin.xml metadata is lost.
# Both sbt and maven will populate "lib_managed/jars/" with the datanucleus jars when Spark is
# built with Hive, so first check if the datanucleus jars exist, and then ensure the current Spark
# assembly is built for Hive, before actually populating the CLASSPATH with the jars.
# Note that this check order is faster (by up to half a second) in the case where Hive is not used.
num_datanucleus_jars=$(ls "$FWDIR"/lib_managed/jars/ 2>/dev/null | grep "datanucleus-.*\\.jar" | wc -l)
if [ $num_datanucleus_jars -gt 0 ]; then
  AN_ASSEMBLY_JAR=${ASSEMBLY_JAR:-$DEPS_ASSEMBLY_JAR}
  num_hive_files=$(jar tvf "$AN_ASSEMBLY_JAR" org/apache/hadoop/hive/ql/exec 2>/dev/null | wc -l)
  if [ $num_hive_files -gt 0 ]; then
    echo "Spark assembly has been built with Hive, including Datanucleus jars on classpath" 1>&2
    DATANUCLEUSJARS=$(echo "$FWDIR/lib_managed/jars"/datanucleus-*.jar | tr " " :)
    CLASSPATH=$CLASSPATH:$DATANUCLEUSJARS
  fi
fi
{code}
This only adds the files under lib_managed/jars/ to the CLASSPATH, so it does 
not work when the current directory is a dist (packaged distribution) directory.
 


>  hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") throw an 
> exception
> -
>
> Key: SPARK-1644
> URL: https://issues.apache.org/jira/browse/SPARK-1644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Guoqiang Li 
>Assignee: Guoqiang Li
> Attachments: spark.log
>
>
> cat conf/hive-site.xml
> {code:xml}
> <configuration>
>   <property>
>     <name>javax.jdo.option.ConnectionURL</name>
>     <value>jdbc:postgresql://bj-java-hugedata1:7432/hive</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionDriverName</name>
>     <value>org.postgresql.Driver</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionUserName</name>
>     <value>hive</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionPassword</name>
>     <value>passwd</value>
>   </property>
>   <property>
>     <name>hive.metastore.local</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>hive.metastore.warehouse.dir</name>
>     <value>hdfs://host:8020/user/hive/warehouse</value>
>   </property>
> </configuration>
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1629) Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS`

2014-04-29 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li reassigned SPARK-1629:
--

Assignee: Guoqiang Li

> Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS` 
> 
>
> Key: SPARK-1629
> URL: https://issues.apache.org/jira/browse/SPARK-1629
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Guoqiang Li 
>Assignee: Guoqiang Li
>Priority: Minor
>
> Right now we use this but don't depend on it explicitly (which is wrong). We 
> should probably just inline this function and remove the need to add a 
> dependency.
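
A minimal sketch of the inlined check, which is essentially what 
SystemUtils.IS_OS_WINDOWS evaluates under the hood:

{code}
// Inline replacement for commons-lang's SystemUtils.IS_OS_WINDOWS:
// the "os.name" system property starts with "Windows" on Windows JVMs.
val isWindows: Boolean =
  Option(System.getProperty("os.name")).exists(_.startsWith("Windows"))
{code}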



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1644) hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") throw an exception

2014-04-29 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li reassigned SPARK-1644:
--

Assignee: Guoqiang Li

>  hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") throw an 
> exception
> -
>
> Key: SPARK-1644
> URL: https://issues.apache.org/jira/browse/SPARK-1644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Guoqiang Li 
>Assignee: Guoqiang Li
> Attachments: spark.log
>
>
> cat conf/hive-site.xml
> {code:xml}
> <configuration>
>   <property>
>     <name>javax.jdo.option.ConnectionURL</name>
>     <value>jdbc:postgresql://bj-java-hugedata1:7432/hive</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionDriverName</name>
>     <value>org.postgresql.Driver</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionUserName</name>
>     <value>hive</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionPassword</name>
>     <value>passwd</value>
>   </property>
>   <property>
>     <name>hive.metastore.local</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>hive.metastore.warehouse.dir</name>
>     <value>hdfs://host:8020/user/hive/warehouse</value>
>   </property>
> </configuration>
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1636) Move main methods to examples

2014-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1636.
--

   Resolution: Done
Fix Version/s: 1.0.0

> Move main methods to examples
> -
>
> Key: SPARK-1636
> URL: https://issues.apache.org/jira/browse/SPARK-1636
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.0.0
>
>
> Move the main methods to examples and make them compatible with spark-submit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)