[jira] [Commented] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction

2015-06-17 Thread Sebastian Alfers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589388#comment-14589388
 ] 

Sebastian Alfers commented on SPARK-7334:
-

I implemented RP as a transformer so that the model can be serialized and re-used 
later.
Also, the actual implementation of RP is separated out and can (theoretically) be 
used in LSH.

I implemented RP as a stand-alone method, as a replacement for / comparison with PCA.
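
For context, here is a minimal plain-Scala sketch of the projection step I mean (the actual patch builds on MLlib's matrix and transformer APIs; the names here are illustrative only): a dense Gaussian projection matrix scaled by 1/sqrt(k), which approximately preserves pairwise distances.

{code}
import scala.util.Random

// Sketch only: map d-dimensional vectors to k dimensions by multiplying with a
// random Gaussian matrix scaled by 1/sqrt(k) (Johnson-Lindenstrauss style).
def randomProjectionMatrix(d: Int, k: Int, seed: Long = 42L): Array[Array[Double]] = {
  val rnd = new Random(seed)
  Array.fill(k, d)(rnd.nextGaussian() / math.sqrt(k))
}

def project(matrix: Array[Array[Double]], v: Array[Double]): Array[Double] =
  matrix.map(row => row.zip(v).map { case (m, x) => m * x }.sum)
{code}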

> Implement RandomProjection for Dimensionality Reduction
> ---
>
> Key: SPARK-7334
> URL: https://issues.apache.org/jira/browse/SPARK-7334
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Sebastian Alfers
>Priority: Minor
>
> Implement RandomProjection (RP) for dimensionality reduction
> RP is a popular approach to reduce the amount of data while preserving a 
> reasonable amount of information (pairwise distances) of your data [1][2]
> - [1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf
> - [2] 
> http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf
> I compared different implementations of that algorithm:
> - https://github.com/sebastian-alfers/random-projection-python






[jira] [Created] (SPARK-8402) DP means clustering

2015-06-17 Thread Meethu Mathew (JIRA)
Meethu Mathew created SPARK-8402:


 Summary: DP means clustering 
 Key: SPARK-8402
 URL: https://issues.apache.org/jira/browse/SPARK-8402
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Meethu Mathew


At present, all the clustering algorithms in MLlib require the number of 
clusters to be specified in advance. 
The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model 
that allows for flexible clustering of data without having to specify the number 
of clusters a priori. 
DP means is a non-parametric clustering algorithm that uses a scale parameter 
'lambda' to control the creation of new clusters ["Revisiting k-means: New 
Algorithms via Bayesian Nonparametrics" by Brian Kulis and Michael I. Jordan].

We have followed the distributed implementation of DP means which has been 
proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" 
by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
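
As a rough, single-machine sketch of the assignment rule driven by lambda (not the distributed algorithm from the paper), a point whose squared distance to every existing center exceeds lambda opens a new cluster:

{code}
import scala.collection.mutable.ArrayBuffer

// Sketch of the DP-means assignment step (Kulis & Jordan, Algorithm 1).
// lambda is the scale parameter controlling creation of new clusters.
def assignCenters(points: Seq[Array[Double]], lambda: Double): ArrayBuffer[Array[Double]] = {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  val centers = ArrayBuffer[Array[Double]]()
  for (p <- points) {
    // If the squared distance to every existing center exceeds lambda,
    // the point becomes a new cluster center.
    if (centers.isEmpty || centers.map(sqDist(p, _)).min > lambda) {
      centers += p.clone()
    }
  }
  // A full implementation iterates re-assignment and center updates to convergence.
  centers
}
{code}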






[jira] [Commented] (SPARK-8402) DP means clustering

2015-06-17 Thread Meethu Mathew (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589392#comment-14589392
 ] 

Meethu Mathew commented on SPARK-8402:
--

Could anyone please assign this ticket to me?

> DP means clustering 
> 
>
> Key: SPARK-8402
> URL: https://issues.apache.org/jira/browse/SPARK-8402
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Meethu Mathew
>  Labels: features
>
> At present, all the clustering algorithms in MLlib require the number of 
> clusters to be specified in advance. 
> The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model 
> that allows for flexible clustering of data without having to specify the number 
> of clusters a priori. 
> DP means is a non-parametric clustering algorithm that uses a scale parameter 
> 'lambda' to control the creation of new clusters ["Revisiting k-means: New 
> Algorithms via Bayesian Nonparametrics" by Brian Kulis and Michael I. Jordan].
> We have followed the distributed implementation of DP means which has been 
> proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" 
> by Xinghao Pan, Evan R. Sparks, Andre Wibisono.






[jira] [Commented] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-06-17 Thread Tao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589398#comment-14589398
 ] 

Tao Wang commented on SPARK-6882:
-

Did you try to set hive.server2.thrift.sasl.qop to "auth-conf"?

> Spark ThriftServer2 Kerberos failed encountering 
> java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
> are: [auth-int, auth-conf, auth]
> 
>
> Key: SPARK-6882
> URL: https://issues.apache.org/jira/browse/SPARK-6882
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1, 1.3.0
> Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled
> * Apache Hive 0.13.1
> * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
> * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
>Reporter: Andrew Lee
>
> When Kerberos is enabled, I get the following exceptions. 
> {code}
> 2015-03-13 18:26:05,363 ERROR 
> org.apache.hive.service.cli.thrift.ThriftCLIService 
> (ThriftBinaryCLIService.java:run(93)) - Error: 
> java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
> are: [auth-int, auth-conf, auth]
> {code}
> I tried it in
> * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
> * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
> with
> * Apache Hive 0.13.1
> * Apache Hadoop 2.4.1
> Build command
> {code}
> mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver 
> -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests 
> install
> {code}
> When starting Spark ThriftServer in {{yarn-client}} mode, the command to 
> start thriftserver looks like this
> {code}
> ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf 
> hive.server2.thrift.bind.host=$(hostname) --master yarn-client
> {code}
> {{hostname}} points to the current hostname of the machine I'm using.
> Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1)
> {code}
> 2015-03-13 18:26:05,363 ERROR 
> org.apache.hive.service.cli.thrift.ThriftCLIService 
> (ThriftBinaryCLIService.java:run(93)) - Error: 
> java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
> are: [auth-int, auth-conf, auth]
> at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
> at 
> org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
> at 
> org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
> at 
> org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
> at java.lang.Thread.run(Thread.java:744)
> {code}
> I'm wondering if this is due to the same problem described in HIVE-8154 and 
> HIVE-7620, i.e. an older code base for the Spark ThriftServer?
> Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to 
> run against a Kerberos cluster (Apache Hadoop 2.4.1).
> My hive-site.xml for spark/conf looks like the following.
> The Kerberos keytab and TGT are configured correctly and I'm able to connect to 
> the metastore, but the subsequent steps fail due to the exception.
> {code}
> <property>
>   <name>hive.semantic.analyzer.factory.impl</name>
>   <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
> </property>
> <property>
>   <name>hive.metastore.execute.setugi</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.stats.autogather</name>
>   <value>false</value>
> </property>
> <property>
>   <name>hive.session.history.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.querylog.location</name>
>   <value>/tmp/home/hive/log/${user.name}</value>
> </property>
> <property>
>   <name>hive.exec.local.scratchdir</name>
>   <value>/tmp/hive/scratch/${user.name}</value>
> </property>
> <property>
>   <name>hive.metastore.uris</name>
>   <value>thrift://somehostname:9083</value>
> </property>
> 
> <property>
>   <name>hive.server2.authentication</name>
>   <value>KERBEROS</value>
> </property>
> <property>
>   <name>hive.server2.authentication.kerberos.principal</name>
>   <value>***</value>
> </property>
> <property>
>   <name>hive.server2.authentication.kerberos.keytab</name>
>   <value>***</value>
> </property>
> <property>
>   <name>hive.server2.thrift.sasl.qop</name>
>   <value>auth</value>
>   <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
> </property>
> <property>
>   <name>hive.server2.enable.impersonation</name>
>   <description>Enable user impersonation for HiveServer2</description>
>   <value>true</value>
> </property>
> 
> <property>
>   <name>hive.metastore.sasl.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.kerberos.keytab.file</name>
>   <value>***</value>
> </property>
> <property>
>   <name>hive.metastore.kerberos.principal</name>
>   <value>***</value>
> </property>
> <property>
>   <name>hive.metastore.cache.pinobjtypes</name>
>   <value>Table,Database,Type,FieldSchema,Order</value>
> </property>
> <property>
>   <name>hdfs_sentinel_file</name>
>   <value>***</value>
> </property>
> <property>
>   <name>hive.metastore.warehouse.dir</name>
>   <value>/hive</value>
> </property>
> <property>
>   <name>hive.metastore.client.socket.timeout</name>
>   <value>600</value>
> </property>
> <property>
>   <name>hive.warehouse.subdir.inherit.perms</name>
>   <value>true</value>
> </property>
> {code}
> Here, I'm attaching more detailed logs from Spark 1.3 rc1.
> {code}
> 2015-04-13 16:37:20,688 INFO  org.apache.hadoop.security.UserGroupInformation 
> (UserGroupInformation.java:loginUserFromKeytab(893)) - Login successful for 
> us

[jira] [Created] (SPARK-8403) Pruner partition won't effective when udf exit in sql predicates

2015-06-17 Thread Hong Shen (JIRA)
Hong Shen created SPARK-8403:


 Summary: Pruner partition won't effective when udf exit in sql 
predicates
 Key: SPARK-8403
 URL: https://issues.apache.org/jira/browse/SPARK-8403
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Hong Shen


When udf exit in sql predicates, pruner partition won't effective.
Here is the sql,
{code}
select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where 
r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
{code}
When run on hive, it will only scan data in partition 20150615, but if run on 
spark sql, it will scan the whole table fromt_dw_qqlive_209026.








[jira] [Updated] (SPARK-8403) Pruner partition won't effective when udf exit in sql predicates

2015-06-17 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-8403:
-
Description: 
When udf exit in sql predicates, pruner partition won't effective.
Here is the sql,
{code}
select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where 
r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
{code}
When run on hive, it will only scan data in partition 20150615, but if run on 
spark sql, it will scan the whole table t_dw_qqlive_209026.



  was:
When udf exit in sql predicates, pruner partition won't effective.
Here is the sql,
{code}
select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where 
r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
{code}
When run on hive, it will only scan data in partition 20150615, but if run on 
spark sql, it will scan the whole table from t_dw_qqlive_209026.




> Pruner partition won't effective when udf exit in sql predicates
> 
>
> Key: SPARK-8403
> URL: https://issues.apache.org/jira/browse/SPARK-8403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hong Shen
>
> When udf exit in sql predicates, pruner partition won't effective.
> Here is the sql,
> {code}
> select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r 
> where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
> {code}
> When run on hive, it will only scan data in partition 20150615, but if run on 
> spark sql, it will scan the whole table t_dw_qqlive_209026.






[jira] [Updated] (SPARK-8403) Pruner partition won't effective when udf exit in sql predicates

2015-06-17 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-8403:
-
Description: 
When udf exit in sql predicates, pruner partition won't effective.
Here is the sql,
{code}
select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where 
r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
{code}
When run on hive, it will only scan data in partition 20150615, but if run on 
spark sql, it will scan the whole table from t_dw_qqlive_209026.



  was:
When udf exit in sql predicates, pruner partition won't effective.
Here is the sql,
{code}
select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where 
r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
{code}
When run on hive, it will only scan data in partition 20150615, but if run on 
spark sql, it will scan the whole table fromt_dw_qqlive_209026.




> Pruner partition won't effective when udf exit in sql predicates
> 
>
> Key: SPARK-8403
> URL: https://issues.apache.org/jira/browse/SPARK-8403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hong Shen
>
> When udf exit in sql predicates, pruner partition won't effective.
> Here is the sql,
> {code}
> select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r 
> where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
> {code}
> When run on hive, it will only scan data in partition 20150615, but if run on 
> spark sql, it will scan the whole table from t_dw_qqlive_209026.






[jira] [Updated] (SPARK-7667) MLlib Python API consistency check

2015-06-17 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-7667:
---
Description: 
Check and ensure that the MLlib Python API (class/method/parameter) is consistent with 
Scala.

The following APIs are not consistent:
* class
* method
** recommendation.MatrixFactorizationModel.predictAll()
* parameter
** feature.StandardScaler.fit()
** many transform() functions of the feature module

  was:
Check and ensure that the MLlib Python API (class/method/parameter) is consistent with 
Scala.

The following APIs are not consistent:
* class
* method
* parameter
** feature.StandardScaler.fit()
** many transform() functions of the feature module


> MLlib Python API consistency check
> --
>
> Key: SPARK-7667
> URL: https://issues.apache.org/jira/browse/SPARK-7667
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Yanbo Liang
>
> Check and ensure that the MLlib Python API (class/method/parameter) is consistent 
> with Scala.
> The following APIs are not consistent:
> * class
> * method
> ** recommendation.MatrixFactorizationModel.predictAll()
> * parameter
> ** feature.StandardScaler.fit()
> ** many transform() functions of the feature module






[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems

2015-06-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589408#comment-14589408
 ] 

Sean Owen commented on SPARK-7009:
--

1.4 should be built with Java 6, so shouldn't exhibit this problem. I haven't 
heard of problems using pyspark unless you have Java 7 in the mix here 
somewhere. However I don't know what will happen for 1.5 since it requires Java 
7. Are later versions of Python able to read large jars? Maybe those become 
required. Really not my area though.

> Build assembly JAR via ant to avoid zip64 problems
> --
>
> Key: SPARK-7009
> URL: https://issues.apache.org/jira/browse/SPARK-7009
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.0
> Environment: Java 7+
>Reporter: Steve Loughran
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> SPARK-1911 shows the problem that JDK7+ uses zip64 to build large JARs, a 
> format incompatible with Java and pyspark.
> Provided the total number of .class files+resources is <64K, ant can be used 
> to make the final JAR instead, perhaps by unzipping the maven-generated JAR 
> then rezipping it with zip64=never, before publishing the artifact via maven.
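
A small, hypothetical helper for checking the precondition mentioned in the description above, i.e. whether the assembly stays under the 64K-entry limit beyond which zip64 kicks in (the jar path shown is illustrative only):

{code}
import java.util.zip.ZipFile
import scala.collection.JavaConverters._

// Counts the entries (.class files + resources) in a jar. The classic zip
// format tops out at 65535 entries; beyond that, zip64 is required.
def entryCount(jarPath: String): Int = {
  val zip = new ZipFile(jarPath)
  try zip.entries().asScala.size
  finally zip.close()
}

// e.g. entryCount("assembly/target/spark-assembly.jar") < 65535
{code}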






[jira] [Assigned] (SPARK-8203) conditional functions: greatest

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8203:
---

Assignee: (was: Apache Spark)

> conditional functions: greatest
> ---
>
> Key: SPARK-8203
> URL: https://issues.apache.org/jira/browse/SPARK-8203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> greatest(T v1, T v2, ...): T
> Returns the greatest value of the list of values (as of Hive 1.1.0).
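
For reference, a minimal Scala sketch of the semantics described above (a variadic max over the argument list); the actual Spark SQL expression, including null handling and type coercion, is what this ticket tracks:

{code}
// Sketch of greatest(): return the largest of the supplied values.
// The SQL expression's null/type-coercion behavior is not modeled here.
def greatest[T](values: T*)(implicit ord: Ordering[T]): T = values.max

greatest(3, 7, 5)        // 7
greatest("a", "c", "b")  // "c"
{code}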






[jira] [Commented] (SPARK-8203) conditional functions: greatest

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589415#comment-14589415
 ] 

Apache Spark commented on SPARK-8203:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6851

> conditional functions: greatest
> ---
>
> Key: SPARK-8203
> URL: https://issues.apache.org/jira/browse/SPARK-8203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> greatest(T v1, T v2, ...): T
> Returns the greatest value of the list of values (as of Hive 1.1.0).






[jira] [Assigned] (SPARK-8204) conditional function: least

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8204:
---

Assignee: Apache Spark

> conditional function: least
> ---
>
> Key: SPARK-8204
> URL: https://issues.apache.org/jira/browse/SPARK-8204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> least(T v1, T v2, ...): T
> Returns the least value of the list of values (as of Hive 1.1.0).






[jira] [Assigned] (SPARK-8204) conditional function: least

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8204:
---

Assignee: (was: Apache Spark)

> conditional function: least
> ---
>
> Key: SPARK-8204
> URL: https://issues.apache.org/jira/browse/SPARK-8204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> least(T v1, T v2, ...): T
> Returns the least value of the list of values (as of Hive 1.1.0).






[jira] [Assigned] (SPARK-8203) conditional functions: greatest

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8203:
---

Assignee: Apache Spark

> conditional functions: greatest
> ---
>
> Key: SPARK-8203
> URL: https://issues.apache.org/jira/browse/SPARK-8203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> greatest(T v1, T v2, ...): T
> Returns the greatest value of the list of values (as of Hive 1.1.0).






[jira] [Commented] (SPARK-8204) conditional function: least

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589416#comment-14589416
 ] 

Apache Spark commented on SPARK-8204:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6851

> conditional function: least
> ---
>
> Key: SPARK-8204
> URL: https://issues.apache.org/jira/browse/SPARK-8204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> least(T v1, T v2, ...): T
> Returns the least value of the list of values (as of Hive 1.1.0).






[jira] [Created] (SPARK-8404) Use thread-safe collections to make KafkaStreamSuite tests more reliable

2015-06-17 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-8404:
---

 Summary: Use thread-safe collections to make KafkaStreamSuite 
tests more reliable
 Key: SPARK-8404
 URL: https://issues.apache.org/jira/browse/SPARK-8404
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Tests
Reporter: Shixiong Zhu


Fix the non-thread-safe code in KafkaStreamSuite, DirectKafkaStreamSuite, 
JavaKafkaStreamSuite and JavaDirectKafkaStreamSuite.
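
As a hypothetical illustration of the kind of change involved (the actual fields and suites are in the fix itself), a buffer that a receiver thread appends to while the test thread reads can be replaced with a concurrent collection:

{code}
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._

// Before: a plain mutable buffer shared between the streaming callback thread
// and the test thread is not thread-safe.
//   val result = new scala.collection.mutable.ArrayBuffer[String]()
// After: a lock-free queue that is safe for concurrent append and iteration.
val result = new ConcurrentLinkedQueue[String]()

result.add("record-1")            // called from the streaming callback
val seen = result.asScala.toSeq   // read from the test's assertions
{code}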






[jira] [Commented] (SPARK-8404) Use thread-safe collections to make KafkaStreamSuite tests more reliable

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589439#comment-14589439
 ] 

Apache Spark commented on SPARK-8404:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/6852

> Use thread-safe collections to make KafkaStreamSuite tests more reliable
> 
>
> Key: SPARK-8404
> URL: https://issues.apache.org/jira/browse/SPARK-8404
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Tests
>Reporter: Shixiong Zhu
>
> Fix the non-thread-safe code in KafkaStreamSuite, DirectKafkaStreamSuite, 
> JavaKafkaStreamSuite and JavaDirectKafkaStreamSuite.






[jira] [Assigned] (SPARK-8404) Use thread-safe collections to make KafkaStreamSuite tests more reliable

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8404:
---

Assignee: Apache Spark

> Use thread-safe collections to make KafkaStreamSuite tests more reliable
> 
>
> Key: SPARK-8404
> URL: https://issues.apache.org/jira/browse/SPARK-8404
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Tests
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Fix the non-thread-safe code in KafkaStreamSuite, DirectKafkaStreamSuite, 
> JavaKafkaStreamSuite and JavaDirectKafkaStreamSuite.






[jira] [Created] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled

2015-06-17 Thread Carson Wang (JIRA)
Carson Wang created SPARK-8405:
--

 Summary: Show executor logs on Web UI when Yarn log aggregation is 
enabled
 Key: SPARK-8405
 URL: https://issues.apache.org/jira/browse/SPARK-8405
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang


When running a Spark application in Yarn mode with Yarn log aggregation 
enabled, the customer is not able to view executor logs on the history server Web 
UI. The only way for the customer to view the logs is through the Yarn command 
"yarn logs -applicationId <app ID>".

A screenshot of the error is attached. When you click an executor's log link 
on the Spark history server, you'll see the error if Yarn log aggregation is 
enabled. The log URL redirects the user to the node manager's UI. This works if the 
logs are located on that node. But since log aggregation is enabled, the local 
logs are deleted once log aggregation is completed. 

The logs should be available through the web UIs, just like for other Hadoop 
components such as MapReduce. For security reasons, end users may not be able to 
log into the nodes and run the yarn logs -applicationId command. The web UIs 
can be made viewable and exposed through the firewall if necessary.







[jira] [Assigned] (SPARK-8404) Use thread-safe collections to make KafkaStreamSuite tests more reliable

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8404:
---

Assignee: (was: Apache Spark)

> Use thread-safe collections to make KafkaStreamSuite tests more reliable
> 
>
> Key: SPARK-8404
> URL: https://issues.apache.org/jira/browse/SPARK-8404
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Tests
>Reporter: Shixiong Zhu
>
> Fix the non-thread-safe code in KafkaStreamSuite, DirectKafkaStreamSuite, 
> JavaKafkaStreamSuite and JavaDirectKafkaStreamSuite.






[jira] [Updated] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled

2015-06-17 Thread Carson Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carson Wang updated SPARK-8405:
---
Attachment: SparkLogError.png

> Show executor logs on Web UI when Yarn log aggregation is enabled
> -
>
> Key: SPARK-8405
> URL: https://issues.apache.org/jira/browse/SPARK-8405
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Carson Wang
> Attachments: SparkLogError.png
>
>
> When running a Spark application in Yarn mode with Yarn log aggregation 
> enabled, the customer is not able to view executor logs on the history server Web 
> UI. The only way for the customer to view the logs is through the Yarn command 
> "yarn logs -applicationId <app ID>".
> A screenshot of the error is attached. When you click an executor's log link 
> on the Spark history server, you'll see the error if Yarn log aggregation is 
> enabled. The log URL redirects the user to the node manager's UI. This works if 
> the logs are located on that node. But since log aggregation is enabled, the 
> local logs are deleted once log aggregation is completed. 
> The logs should be available through the web UIs, just like for other Hadoop 
> components such as MapReduce. For security reasons, end users may not be able to 
> log into the nodes and run the yarn logs -applicationId command. The web UIs 
> can be made viewable and exposed through the firewall if necessary.






[jira] [Created] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-8406:
-

 Summary: Race condition when writing Parquet files
 Key: SPARK-8406
 URL: https://issues.apache.org/jira/browse/SPARK-8406
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker


To support appending, the Parquet data source tries to find out the max ID of 
part-files in the destination directory (the <id> in the output file name 
"part-r-<id>.gz.parquet") at the beginning of the write job. In 1.3.0, this 
step happens on driver side before any files are written. However, in 1.4.0, 
this is moved to task side. Thus, for tasks scheduled later, they may see wrong 
max ID generated by newly written files by other finished tasks within the same 
job. This actually causes a race condition. In most cases, this only causes 
nonconsecutive IDs in output file names. But when the DataFrame contains 
thousands of RDD partitions, it's likely that two tasks may choose the same ID, 
thus one of them gets overwritten by the other.

The data loss situation is not quite easy to reproduce. But the following Spark 
shell snippet can reproduce nonconsecutive output file IDs:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
"16" can be replaced with any integer that is greater than the default 
parallelism on your machine (usually it means core number, on my machine it's 
8).
{noformat}
-rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
/user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-1.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-2.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-3.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-4.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-5.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-6.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-7.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-8.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup352 2015-06-17 00:06 
/user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00024.gz.parquet
{noformat}
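
A tiny, hypothetical illustration of the race (this is not Spark's writer code; the names are made up): when two tasks each derive "max existing part ID + 1" from the same directory snapshot, they pick the same ID and one output file overwrites the other.

{code}
// Illustration only: two tasks computing the next part ID from the same
// directory listing will collide.
def nextPartId(existing: Seq[String]): Int = {
  val ids = existing.flatMap(name => "part-r-(\\d+)".r.findFirstMatchIn(name).map(_.group(1).toInt))
  if (ids.isEmpty) 1 else ids.max + 1
}

val snapshot = Seq("part-r-00001.gz.parquet", "part-r-00002.gz.parquet")
val idForTaskA = nextPartId(snapshot)  // 3
val idForTaskB = nextPartId(snapshot)  // 3 -- same ID, so one file clobbers the other
{code}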






[jira] [Created] (SPARK-8407) complex type constructors: struct and named_struct

2015-06-17 Thread Yijie Shen (JIRA)
Yijie Shen created SPARK-8407:
-

 Summary: complex type constructors: struct and named_struct
 Key: SPARK-8407
 URL: https://issues.apache.org/jira/browse/SPARK-8407
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yijie Shen









[jira] [Updated] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8406:
--
Description: 
To support appending, the Parquet data source tries to find out the max ID of 
part-files in the destination directory (the <id> in the output file name 
"part-r-<id>.gz.parquet") at the beginning of the write job. In 1.3.0, this 
step happens on driver side before any files are written. However, in 1.4.0, 
this is moved to task side. Thus, for tasks scheduled later, they may see wrong 
max ID generated by newly written files by other finished tasks within the same 
job. This actually causes a race condition. In most cases, this only causes 
nonconsecutive IDs in output file names. But when the DataFrame contains 
thousands of RDD partitions, it's likely that two tasks may choose the same ID, 
thus one of them gets overwritten by the other.

The data loss situation is not quite easy to reproduce. But the following Spark 
shell snippet can reproduce nonconsecutive output file IDs:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
"16" can be replaced with any integer that is greater than the default 
parallelism on your machine (usually it means core number, on my machine it's 
8).
{noformat}
-rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
/user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-1.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-2.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-3.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-4.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-5.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-6.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-7.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-8.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00017.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00018.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00019.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00020.gz.parquet
-rw-r--r--   3 lian supergroup352 2015-06-17 00:06 
/user/lian/foo/part-r-00021.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00022.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00023.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-00024.gz.parquet
{noformat}
Notice that the newly added ORC data source doesn't suffer from this issue, because 
it uses both the task ID and {{System.currentTimeMillis()}} to generate the output 
file name.

  was:
To support appending, the Parquet data source tries to find out the max ID of 
part-files in the destination directory (the <id> in the output file name 
"part-r-<id>.gz.parquet") at the beginning of the write job. In 1.3.0, this 
step happens on driver side before any files are written. However, in 1.4.0, 
this is moved to task side. Thus, for tasks scheduled later, they may see wrong 
max ID generated by newly written files by other finished tasks within the same 
job. This actually causes a race condition. In most cases, this only causes 
nonconsecutive IDs in output file names. But when the DataFrame contains 
thousands of RDD partitions, it's likely that two tasks may choose the same ID, 
thus one of them gets overwritten by the other.

The data loss situation is not quite easy to reproduce. But the following Spark 
shell snippet can reproduce nonconsecutive output file IDs:
{code}
sqlContext.range(0, 128).repartition(16).write.mode("overwrite").parquet("foo")
{code}
"16" can be replaced with any integer that is greater than the default 
parallelism on your machine (usually it means core number, on my machine it's 
8).
{noformat}
-rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
/user/lian/foo/_SUCCESS
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-1.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-2.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-3.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-4.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo/part-r-5.gz.parquet
-rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
/user/lian/foo

[jira] [Updated] (SPARK-8407) complex type constructors: struct and named_struct

2015-06-17 Thread Yijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yijie Shen updated SPARK-8407:
--
Description: 
struct(val1, val2, val3, ...)
Creates a struct with the given field values. Struct field names will be col1, 
col2, 

named_struct(name1, val1, name2, val2, ...)
Creates a struct with the given field names and values. (As of Hive 0.8.0.)

> complex type constructors: struct and named_struct
> --
>
> Key: SPARK-8407
> URL: https://issues.apache.org/jira/browse/SPARK-8407
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yijie Shen
>
> struct(val1, val2, val3, ...)
> Creates a struct with the given field values. Struct field names will be 
> col1, col2, 
> named_struct(name1, val1, name2, val2, ...)
> Creates a struct with the given field names and values. (As of Hive 0.8.0.)






[jira] [Commented] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled

2015-06-17 Thread Carson Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589475#comment-14589475
 ] 

Carson Wang commented on SPARK-8405:


I have some work in progress, and here is the approach I was using.
1. If Yarn log aggregation is enabled, we update each executor's log URL on the 
history server. The new URL points to a newly added log page hosted on the history 
server. These URLs are passed the same way other URLs are passed, so we have 
enough information such as the container ID, appOwner, etc.
2. The log page reads the aggregated logs from HDFS using the Yarn APIs. 

This is transparent to end users. If Yarn log aggregation is not enabled, 
nothing changes. If it is enabled, the end user will be able to click the 
executor's log link and view the logs on the Web UI. 

Are there any concerns regarding reading the aggregated logs from HDFS? The 
MapReduce history server also reads the aggregated logs from HDFS to show the 
logs, so I suppose it is OK for the Spark history server to read them as well.

> Show executor logs on Web UI when Yarn log aggregation is enabled
> -
>
> Key: SPARK-8405
> URL: https://issues.apache.org/jira/browse/SPARK-8405
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Carson Wang
> Attachments: SparkLogError.png
>
>
> When running Spark application in Yarn mode and Yarn log aggregation is 
> enabled, customer is not able to view executor logs on the history server Web 
> UI. The only way for customer to view the logs is through the Yarn command 
> "yarn logs -applicationId ".
> An screenshot of the error is attached. When you click an executor’s log link 
> on the Spark history server, you’ll see the error if Yarn log aggregation is 
> enabled. The log URL redirects user to the node manager’s UI. This works if 
> the logs are located on that node. But since log aggregation is enabled, the 
> local logs are deleted once log aggregation is completed. 
> The logs should be available through the web UIs just like other Hadoop 
> components like MapReduce. For security reasons, end users may not be able to 
> log into the nodes and run the yarn logs -applicationId command. The web UIs 
> can be viewable and exposed through the firewall if necessary.






[jira] [Commented] (SPARK-8407) complex type constructors: struct and named_struct

2015-06-17 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589473#comment-14589473
 ] 

Yijie Shen commented on SPARK-8407:
---

I will take this one.

> complex type constructors: struct and named_struct
> --
>
> Key: SPARK-8407
> URL: https://issues.apache.org/jira/browse/SPARK-8407
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yijie Shen
>
> struct(val1, val2, val3, ...)
> Creates a struct with the given field values. Struct field names will be 
> col1, col2, 
> named_struct(name1, val1, name2, val2, ...)
> Creates a struct with the given field names and values. (As of Hive 0.8.0.)






[jira] [Commented] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException

2015-06-17 Thread Jaromir Vanek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589483#comment-14589483
 ] 

Jaromir Vanek commented on SPARK-8393:
--

{{awaitTerminationOrTimeout}} also throws {{InterruptedException}} in case the 
thread is interrupted; the only difference is the "timeout" feature.

> JavaStreamingContext#awaitTermination() throws non-declared 
> InterruptedException
> 
>
> Key: SPARK-8393
> URL: https://issues.apache.org/jira/browse/SPARK-8393
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1
>Reporter: Jaromir Vanek
>Priority: Trivial
>
> Call to {{JavaStreamingContext#awaitTermination()}} can throw 
> {{InterruptedException}} which cannot be caught easily in Java because it's 
> not declared in {{@throws(classOf[InterruptedException])}} annotation.
> This {{InterruptedException}} comes originally from {{ContextWaiter}} where 
> Java {{ReentrantLock}} is used.






[jira] [Commented] (SPARK-7088) [REGRESSION] Spark 1.3.1 breaks analysis of third-party logical plans

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589484#comment-14589484
 ] 

Apache Spark commented on SPARK-7088:
-

User 'smola' has created a pull request for this issue:
https://github.com/apache/spark/pull/6853

> [REGRESSION] Spark 1.3.1 breaks analysis of third-party logical plans
> -
>
> Key: SPARK-7088
> URL: https://issues.apache.org/jira/browse/SPARK-7088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Santiago M. Mola
>Priority: Critical
>  Labels: regression
>
> We're using some custom logical plans. We are now migrating from Spark 1.3.0 
> to 1.3.1 and found a few incompatible API changes. All of them seem to be in 
> internal code, so we understand that. But now the ResolveReferences rule, 
> which used to work with third-party logical plans, just does not work, without 
> any workaround that I'm aware of other than copying the 
> ResolveReferences rule and using it with our own fix.
> The change in question is this section of code:
> {code}
> }.headOption.getOrElse { // Only handle first case, others will be 
> fixed on the next pass.
>   sys.error(
> s"""
>   |Failure when resolving conflicting references in Join:
>   |$plan
>   |
>   |Conflicting attributes: ${conflictingAttributes.mkString(",")}
>   """.stripMargin)
> }
> {code}
> Which causes the following error on analysis:
> {code}
> Failure when resolving conflicting references in Join:
> 'Project ['l.name,'r.name,'FUNC1('l.node,'r.node) AS 
> c2#37,'FUNC2('l.node,'r.node) AS c3#38,'FUNC3('r.node,'l.node) AS c4#39]
>  'Join Inner, None
>   Subquery l
>Subquery h
> Project [name#12,node#36]
>  CustomPlan H, u, (p#13L = s#14L), [ord#15 ASC], IS NULL p#13L, node#36
>   Subquery v
>Subquery h_src
> LogicalRDD [name#12,p#13L,s#14L,ord#15], MapPartitionsRDD[1] at 
> mapPartitions at ExistingRDD.scala:37
>   Subquery r
>Subquery h
> Project [name#40,node#36]
>  CustomPlan H, u, (p#41L = s#42L), [ord#43 ASC], IS NULL pred#41L, node#36
>   Subquery v
>Subquery h_src
> LogicalRDD [name#40,p#41L,s#42L,ord#43], MapPartitionsRDD[1] at 
> mapPartitions at ExistingRDD.scala:37
> {code}






[jira] [Assigned] (SPARK-7088) [REGRESSION] Spark 1.3.1 breaks analysis of third-party logical plans

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7088:
---

Assignee: (was: Apache Spark)

> [REGRESSION] Spark 1.3.1 breaks analysis of third-party logical plans
> -
>
> Key: SPARK-7088
> URL: https://issues.apache.org/jira/browse/SPARK-7088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Santiago M. Mola
>Priority: Critical
>  Labels: regression
>
> We're using some custom logical plans. We are now migrating from Spark 1.3.0 
> to 1.3.1 and found a few incompatible API changes. All of them seem to be in 
> internal code, so we understand that. But now the ResolveReferences rule, 
> which used to work with third-party logical plans, just does not work, without 
> any workaround that I'm aware of other than copying the 
> ResolveReferences rule and using it with our own fix.
> The change in question is this section of code:
> {code}
> }.headOption.getOrElse { // Only handle first case, others will be 
> fixed on the next pass.
>   sys.error(
> s"""
>   |Failure when resolving conflicting references in Join:
>   |$plan
>   |
>   |Conflicting attributes: ${conflictingAttributes.mkString(",")}
>   """.stripMargin)
> }
> {code}
> Which causes the following error on analysis:
> {code}
> Failure when resolving conflicting references in Join:
> 'Project ['l.name,'r.name,'FUNC1('l.node,'r.node) AS 
> c2#37,'FUNC2('l.node,'r.node) AS c3#38,'FUNC3('r.node,'l.node) AS c4#39]
>  'Join Inner, None
>   Subquery l
>Subquery h
> Project [name#12,node#36]
>  CustomPlan H, u, (p#13L = s#14L), [ord#15 ASC], IS NULL p#13L, node#36
>   Subquery v
>Subquery h_src
> LogicalRDD [name#12,p#13L,s#14L,ord#15], MapPartitionsRDD[1] at 
> mapPartitions at ExistingRDD.scala:37
>   Subquery r
>Subquery h
> Project [name#40,node#36]
>  CustomPlan H, u, (p#41L = s#42L), [ord#43 ASC], IS NULL pred#41L, node#36
>   Subquery v
>Subquery h_src
> LogicalRDD [name#40,p#41L,s#42L,ord#43], MapPartitionsRDD[1] at 
> mapPartitions at ExistingRDD.scala:37
> {code}






[jira] [Assigned] (SPARK-7088) [REGRESSION] Spark 1.3.1 breaks analysis of third-party logical plans

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7088:
---

Assignee: Apache Spark

> [REGRESSION] Spark 1.3.1 breaks analysis of third-party logical plans
> -
>
> Key: SPARK-7088
> URL: https://issues.apache.org/jira/browse/SPARK-7088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Santiago M. Mola
>Assignee: Apache Spark
>Priority: Critical
>  Labels: regression
>
> We're using some custom logical plans. We are now migrating from Spark 1.3.0 
> to 1.3.1 and found a few incompatible API changes. All of them seem to be in 
> internal code, so we understand that. But now the ResolveReferences rule, 
> which used to work with third-party logical plans, just does not work, without 
> any workaround that I'm aware of other than copying the 
> ResolveReferences rule and using it with our own fix.
> The change in question is this section of code:
> {code}
> }.headOption.getOrElse { // Only handle first case, others will be 
> fixed on the next pass.
>   sys.error(
> s"""
>   |Failure when resolving conflicting references in Join:
>   |$plan
>   |
>   |Conflicting attributes: ${conflictingAttributes.mkString(",")}
>   """.stripMargin)
> }
> {code}
> Which causes the following error on analysis:
> {code}
> Failure when resolving conflicting references in Join:
> 'Project ['l.name,'r.name,'FUNC1('l.node,'r.node) AS 
> c2#37,'FUNC2('l.node,'r.node) AS c3#38,'FUNC3('r.node,'l.node) AS c4#39]
>  'Join Inner, None
>   Subquery l
>Subquery h
> Project [name#12,node#36]
>  CustomPlan H, u, (p#13L = s#14L), [ord#15 ASC], IS NULL p#13L, node#36
>   Subquery v
>Subquery h_src
> LogicalRDD [name#12,p#13L,s#14L,ord#15], MapPartitionsRDD[1] at 
> mapPartitions at ExistingRDD.scala:37
>   Subquery r
>Subquery h
> Project [name#40,node#36]
>  CustomPlan H, u, (p#41L = s#42L), [ord#43 ASC], IS NULL pred#41L, node#36
>   Subquery v
>Subquery h_src
> LogicalRDD [name#40,p#41L,s#42L,ord#43], MapPartitionsRDD[1] at 
> mapPartitions at ExistingRDD.scala:37
> {code}






[jira] [Updated] (SPARK-8403) Pruner partition won't effective when partition field and fieldSchema exit in sql predicate

2015-06-17 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-8403:
-
Summary: Pruner partition won't effective when partition field and 
fieldSchema exit in sql predicate  (was: Pruner partition won't effective when 
udf exit in sql predicates)

> Pruner partition won't effective when partition field and fieldSchema exit in 
> sql predicate
> ---
>
> Key: SPARK-8403
> URL: https://issues.apache.org/jira/browse/SPARK-8403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hong Shen
>
> When udf exit in sql predicates, pruner partition won't effective.
> Here is the sql,
> {code}
> select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r 
> where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
> {code}
> When run on hive, it will only scan data in partition 20150615, but if run on 
> spark sql, it will scan the whole table t_dw_qqlive_209026.






[jira] [Updated] (SPARK-8403) Pruner partition won't effective when partition field and fieldSchema exit in sql predicate

2015-06-17 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-8403:
-
Description: 
When partition field and fieldSchema exist in sql predicates, pruner partition 
won't effective.
Here is the sql,
{code}
select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where 
r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
{code}
When run on hive, it will only scan data in partition 20150615, but if run on 
spark sql, it will scan the whole table t_dw_qqlive_209026.



  was:
When udf exit in sql predicates, pruner partition won't effective.
Here is the sql,
{code}
select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where 
r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
{code}
When run on hive, it will only scan data in partition 20150615, but if run on 
spark sql, it will scan the whole table t_dw_qqlive_209026.




> Pruner partition won't effective when partition field and fieldSchema exit in 
> sql predicate
> ---
>
> Key: SPARK-8403
> URL: https://issues.apache.org/jira/browse/SPARK-8403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hong Shen
>
> When partition field and fieldSchema exist in sql predicates, pruner 
> partition won't effective.
> Here is the sql,
> {code}
> select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r 
> where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
> {code}
> When run on hive, it will only scan data in partition 20150615, but if run on 
> spark sql, it will scan the whole table t_dw_qqlive_209026.






[jira] [Updated] (SPARK-8403) Pruner partition won't effective when partition field and fieldSchema exist in sql predicate

2015-06-17 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-8403:
-
Summary: Pruner partition won't effective when partition field and 
fieldSchema exist in sql predicate  (was: Pruner partition won't effective when 
partition field and fieldSchema exit in sql predicate)

> Pruner partition won't effective when partition field and fieldSchema exist 
> in sql predicate
> 
>
> Key: SPARK-8403
> URL: https://issues.apache.org/jira/browse/SPARK-8403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hong Shen
>
> When partition field and fieldSchema exist in sql predicates, pruner 
> partition won't effective.
> Here is the sql,
> {code}
> select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r 
> where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
> {code}
> When run on hive, it will only scan data in partition 20150615, but if run on 
> spark sql, it will scan the whole table t_dw_qqlive_209026.






[jira] [Updated] (SPARK-8403) Pruner partition won't effective when partition field and fieldSchema exist in sql predicate

2015-06-17 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-8403:
-
Description: 
When partition field and fieldSchema exist in sql predicates, pruner partition 
won't effective.
Here is the sql,
{code}
select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where 
r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
{code}
Table t_dw_qqlive_209026  is partition by imp_date, itimestamp is a 
fieldSchema in t_dw_qqlive_209026.
When run on hive, it will only scan data in partition 20150615, but if run on 
spark sql, it will scan the whole table t_dw_qqlive_209026.



  was:
When partition field and fieldSchema exist in sql predicates, pruner partition 
won't effective.
Here is the sql,
{code}
select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where 
r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
{code}
When run on hive, it will only scan data in partition 20150615, but if run on 
spark sql, it will scan the whole table t_dw_qqlive_209026.




> Pruner partition won't effective when partition field and fieldSchema exist 
> in sql predicate
> 
>
> Key: SPARK-8403
> URL: https://issues.apache.org/jira/browse/SPARK-8403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hong Shen
>
> When partition field and fieldSchema exist in sql predicates, pruner 
> partition won't effective.
> Here is the sql,
> {code}
> select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r 
> where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16)
> {code}
> Table t_dw_qqlive_209026  is partition by imp_date, itimestamp is a 
> fieldSchema in t_dw_qqlive_209026.
> When run on hive, it will only scan data in partition 20150615, but if run on 
> spark sql, it will scan the whole table t_dw_qqlive_209026.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8335) DecisionTreeModel.predict() return type not convenient!

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8335:
---

Assignee: Apache Spark

> DecisionTreeModel.predict() return type not convenient!
> ---
>
> Key: SPARK-8335
> URL: https://issues.apache.org/jira/browse/SPARK-8335
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Sebastian Walz
>Assignee: Apache Spark
>Priority: Minor
>  Labels: easyfix, machine_learning
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> org.apache.spark.mllib.tree.model.DecisionTreeModel has a predict method:
> def predict(features: JavaRDD[Vector]): JavaRDD[Double]
> The problem here is the generic type of the return type JavaRDD[Double], 
> because it is a scala.Double and I would expect a java.lang.Double (to be 
> convenient, e.g., with 
> org.apache.spark.mllib.classification.ClassificationModel).
> I wanted to extend the DecisionTreeModel, use it only for binary 
> classification, and implement the trait 
> org.apache.spark.mllib.classification.ClassificationModel. But that is not 
> possible, because the ClassificationModel already defines the predict method, 
> but with a return type of JavaRDD[java.lang.Double].
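
As an aside for readers, here is a small, self-contained Scala sketch added here for illustration (Box stands in for JavaRDD, and the type names are simplified paraphrases, not the MLlib sources) of why one class cannot provide both predict signatures at once:
{code}
// Self-contained illustration of the clash (Box stands in for JavaRDD; the
// real types live in MLlib and are only paraphrased here).
class Box[T](val value: T)

class TreeModel {
  // mirrors DecisionTreeModel.predict: element type is scala.Double
  def predict(features: Box[Array[Double]]): Box[Double] = new Box(0.0)
}

trait Classifier {
  // mirrors ClassificationModel.predict: element type is java.lang.Double
  def predict(features: Box[Array[Double]]): Box[java.lang.Double]
}

// The line below does not compile: the inherited predict returns Box[Double],
// which does not implement Classifier's Box[java.lang.Double] version, and a
// second overload is impossible because both erase to predict(Box): Box.
// class MyModel extends TreeModel with Classifier
{code}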



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8335) DecisionTreeModel.predict() return type not convenient!

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8335:
---

Assignee: (was: Apache Spark)

> DecisionTreeModel.predict() return type not convenient!
> ---
>
> Key: SPARK-8335
> URL: https://issues.apache.org/jira/browse/SPARK-8335
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Sebastian Walz
>Priority: Minor
>  Labels: easyfix, machine_learning
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> org.apache.spark.mllib.tree.model.DecisionTreeModel has a predict method:
> def predict(features: JavaRDD[Vector]): JavaRDD[Double]
> The problem here is the generic type of the return type JavaRDD[Double], 
> because it is a scala.Double and I would expect a java.lang.Double (to be 
> convenient, e.g., with 
> org.apache.spark.mllib.classification.ClassificationModel).
> I wanted to extend the DecisionTreeModel, use it only for binary 
> classification, and implement the trait 
> org.apache.spark.mllib.classification.ClassificationModel. But that is not 
> possible, because the ClassificationModel already defines the predict method, 
> but with a return type of JavaRDD[java.lang.Double].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8335) DecisionTreeModel.predict() return type not convenient!

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589492#comment-14589492
 ] 

Apache Spark commented on SPARK-8335:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6854

> DecisionTreeModel.predict() return type not convenient!
> ---
>
> Key: SPARK-8335
> URL: https://issues.apache.org/jira/browse/SPARK-8335
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Sebastian Walz
>Priority: Minor
>  Labels: easyfix, machine_learning
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> org.apache.spark.mllib.tree.model.DecisionTreeModel has a predict method:
> def predict(features: JavaRDD[Vector]): JavaRDD[Double]
> The problem here is the generic type of the return type JavaRDD[Double], 
> because it is a scala.Double and I would expect a java.lang.Double (to be 
> convenient, e.g., with 
> org.apache.spark.mllib.classification.ClassificationModel).
> I wanted to extend the DecisionTreeModel, use it only for binary 
> classification, and implement the trait 
> org.apache.spark.mllib.classification.ClassificationModel. But that is not 
> possible, because the ClassificationModel already defines the predict method, 
> but with a return type of JavaRDD[java.lang.Double].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8395) spark-submit documentation is incorrect

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589493#comment-14589493
 ] 

Apache Spark commented on SPARK-8395:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6855

> spark-submit documentation is incorrect
> ---
>
> Key: SPARK-8395
> URL: https://issues.apache.org/jira/browse/SPARK-8395
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Dev Lakhani
>Priority: Minor
>
> Using a fresh checkout of 1.4.0-bin-hadoop2.6
> if you run 
> ./start-slave.sh  1 spark://localhost:7077
> you get
> failed to launch org.apache.spark.deploy.worker.Worker:
>  Default is conf/spark-defaults.conf.
>   15/06/16 13:11:08 INFO Utils: Shutdown hook called
> it seems the worker number is not being accepted, as described here:
> https://spark.apache.org/docs/latest/spark-standalone.html
> The documentation says:
> ./sbin/start-slave.sh  
> but the start-slave.sh script states:
> usage="Usage: start-slave.sh  where  is 
> like spark://localhost:7077"
> I have checked for similar issues using :
> https://issues.apache.org/jira/browse/SPARK-6552?jql=text%20~%20%22start-slave%22
> and found nothing similar so am raising this as an issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8395) spark-submit documentation is incorrect

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8395:
---

Assignee: (was: Apache Spark)

> spark-submit documentation is incorrect
> ---
>
> Key: SPARK-8395
> URL: https://issues.apache.org/jira/browse/SPARK-8395
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Dev Lakhani
>Priority: Minor
>
> Using a fresh checkout of 1.4.0-bin-hadoop2.6
> if you run 
> ./start-slave.sh  1 spark://localhost:7077
> you get
> failed to launch org.apache.spark.deploy.worker.Worker:
>  Default is conf/spark-defaults.conf.
>   15/06/16 13:11:08 INFO Utils: Shutdown hook called
> it seems the worker number is not being accepted, as described here:
> https://spark.apache.org/docs/latest/spark-standalone.html
> The documentation says:
> ./sbin/start-slave.sh  
> but the start-slave.sh script states:
> usage="Usage: start-slave.sh  where  is 
> like spark://localhost:7077"
> I have checked for similar issues using :
> https://issues.apache.org/jira/browse/SPARK-6552?jql=text%20~%20%22start-slave%22
> and found nothing similar so am raising this as an issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8395) spark-submit documentation is incorrect

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8395:
---

Assignee: Apache Spark

> spark-submit documentation is incorrect
> ---
>
> Key: SPARK-8395
> URL: https://issues.apache.org/jira/browse/SPARK-8395
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Dev Lakhani
>Assignee: Apache Spark
>Priority: Minor
>
> Using a fresh checkout of 1.4.0-bin-hadoop2.6
> if you run 
> ./start-slave.sh  1 spark://localhost:7077
> you get
> failed to launch org.apache.spark.deploy.worker.Worker:
>  Default is conf/spark-defaults.conf.
>   15/06/16 13:11:08 INFO Utils: Shutdown hook called
> it seems the worker number is not being accepted, as described here:
> https://spark.apache.org/docs/latest/spark-standalone.html
> The documentation says:
> ./sbin/start-slave.sh  
> but the start-slave.sh script states:
> usage="Usage: start-slave.sh  where  is 
> like spark://localhost:7077"
> I have checked for similar issues using :
> https://issues.apache.org/jira/browse/SPARK-6552?jql=text%20~%20%22start-slave%22
> and found nothing similar so am raising this as an issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Nathan McCarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589498#comment-14589498
 ] 

Nathan McCarthy commented on SPARK-8406:


This is hitting us hard. Let me know if there is anything we can do to help on 
this end with contributing a fix or testing. 

FYI, here are the details from the mailing list. 

 ---

When trying to save a data frame with 569610608 rows: 

  dfc.write.format("parquet").save(“/data/map_parquet_file")

We get random results between runs. Caching the data frame in memory makes no 
difference. It looks like the write-out misses some of the RDD partitions. We 
have an RDD with 6750 partitions. When we write out, we get fewer files than 
the number of partitions. When reading the data back in and running a count, we 
get a smaller number of rows. 

I’ve tried counting the rows in several different ways. All return the same 
result, 560214031 rows, missing about 9.4 million rows (roughly 1.6%).

  qc.read.parquet("/data/map_parquet_file").count
  qc.read.parquet("/data/map_parquet_file").rdd.count
  qc.read.parquet("/data/map_parquet_file").mapPartitions{itr => var c = 0; 
itr.foreach(_ => c = c + 1); Seq(c).toIterator }.reduce(_ + _)

Looking at the files on HDFS, there are 6643 .parquet files, i.e. 107 missing 
partitions (roughly 1.6%). 

Then writing out the same cached DF again to a new file gives 6717 files on 
hdfs (about 33 files missing or 0.5%);

  dfc.write.parquet("/data/map_parquet_file_2")

And we get 566670107 rows back (about 3million missing ~0.5%); 

  qc.read.parquet("/data/map_parquet_file_2").count

Writing the same df out to json produces the expected number (6750) of output 
files and returns the right number of rows, 569610608. 

  dfc.write.format("json").save("/data/map_parquet_file_3")
  qc.read.format("json").load("/data/map_parquet_file_3").count

One thing to note is that the parquet part files on HDFS do not have the normal 
sequential part numbers seen for the json output, or for parquet output in 
Spark 1.3.

part-r-06151.gz.parquet  part-r-118401.gz.parquet  part-r-146249.gz.parquet  
part-r-196755.gz.parquet  part-r-35811.gz.parquet   part-r-55628.gz.parquet  
part-r-73497.gz.parquet  part-r-97237.gz.parquet
part-r-06161.gz.parquet  part-r-118406.gz.parquet  part-r-146254.gz.parquet  
part-r-196763.gz.parquet  part-r-35826.gz.parquet   part-r-55647.gz.parquet  
part-r-73500.gz.parquet  _SUCCESS

We are using MapR 4.0.2 for hdfs.
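
As a quick sanity check (a sketch added here, not part of the original report; it assumes a spark-shell session where dfc and sc are defined, and reuses the output path from above), one can compare the number of source partitions with the number of part-files actually written:
{code}
// Sanity-check sketch (not from the report): compare the source partition
// count with the number of part-files on HDFS after the write. Assumes a
// spark-shell session where `dfc` and `sc` exist, and reuses the path above.
import org.apache.hadoop.fs.{FileSystem, Path}

val expectedParts = dfc.rdd.partitions.length
val fs = FileSystem.get(sc.hadoopConfiguration)
val writtenParts = fs.listStatus(new Path("/data/map_parquet_file"))
  .count(_.getPath.getName.endsWith(".parquet"))
println(s"expected $expectedParts part-files, found $writtenParts")
{code}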

> Race condition when writing Parquet files
> -
>
> Key: SPARK-8406
> URL: https://issues.apache.org/jira/browse/SPARK-8406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> To support appending, the Parquet data source tries to find out the max ID of 
> part-files in the destination directory (the <id> in the output file name 
> "part-r-<id>.gz.parquet") at the beginning of the write job. In 1.3.0, this 
> step happens on driver side before any files are written. However, in 1.4.0, 
> this is moved to task side. Thus, for tasks scheduled later, they may see 
> wrong max ID generated by newly written files by other finished tasks within 
> the same job. This actually causes a race condition. In most cases, this only 
> causes nonconsecutive IDs in output file names. But when the DataFrame 
> contains thousands of RDD partitions, it's likely that two tasks may choose 
> the same ID, thus one of them gets overwritten by the other.
> The data loss situation is not quite easy to reproduce. But the following 
> Spark shell snippet can reproduce nonconsecutive output file IDs:
> {code}
> sqlContext.range(0, 
> 128).repartition(16).write.mode("overwrite").parquet("foo")
> {code}
> "16" can be replaced with any integer that is greater than the default 
> parallelism on your machine (usually it means core number, on my machine it's 
> 8).
> {noformat}
> -rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
> /user/lian/foo/_SUCCESS
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-1.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-2.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-3.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-4.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-5.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-6.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-7.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/par

[jira] [Resolved] (SPARK-8309) OpenHashMap doesn't work with more than 12M items

2015-06-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8309.
--
   Resolution: Fixed
Fix Version/s: 1.4.1
   1.5.0
   1.3.2

Issue resolved by pull request 6763
[https://github.com/apache/spark/pull/6763]

> OpenHashMap doesn't work with more than 12M items
> -
>
> Key: SPARK-8309
> URL: https://issues.apache.org/jira/browse/SPARK-8309
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Vyacheslav Baranov
> Fix For: 1.3.2, 1.5.0, 1.4.1
>
>
> The problem might be demonstrated with the following testcase:
> {code}
>   test("support for more than 12M items") {
> val cnt = 1200 // 12M
> val map = new OpenHashMap[Int, Int](cnt)
> for (i <- 0 until cnt) {
>   map(i) = 1
> }
> val numInvalidValues = map.iterator.count(_._2 == 0)
> assertResult(0)(numInvalidValues)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8309) OpenHashMap doesn't work with more than 12M items

2015-06-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8309:
-
Assignee: Vyacheslav Baranov

> OpenHashMap doesn't work with more than 12M items
> -
>
> Key: SPARK-8309
> URL: https://issues.apache.org/jira/browse/SPARK-8309
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Vyacheslav Baranov
>Assignee: Vyacheslav Baranov
> Fix For: 1.3.2, 1.4.1, 1.5.0
>
>
> The problem might be demonstrated with the following testcase:
> {code}
>   test("support for more than 12M items") {
> val cnt = 1200 // 12M
> val map = new OpenHashMap[Int, Int](cnt)
> for (i <- 0 until cnt) {
>   map(i) = 1
> }
> val numInvalidValues = map.iterator.count(_._2 == 0)
> assertResult(0)(numInvalidValues)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8309) OpenHashMap doesn't work with more than 12M items

2015-06-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8309:
-
 Priority: Critical  (was: Major)
Affects Version/s: 1.3.1

> OpenHashMap doesn't work with more than 12M items
> -
>
> Key: SPARK-8309
> URL: https://issues.apache.org/jira/browse/SPARK-8309
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Vyacheslav Baranov
>Assignee: Vyacheslav Baranov
>Priority: Critical
> Fix For: 1.3.2, 1.4.1, 1.5.0
>
>
> The problem might be demonstrated with the following testcase:
> {code}
>   test("support for more than 12M items") {
> val cnt = 1200 // 12M
> val map = new OpenHashMap[Int, Int](cnt)
> for (i <- 0 until cnt) {
>   map(i) = 1
> }
> val numInvalidValues = map.iterator.count(_._2 == 0)
> assertResult(0)(numInvalidValues)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7667) MLlib Python API consistency check

2015-06-17 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-7667:
---
Description: 
Check and ensure the MLlib Python API(class/method/parameter) consistent with 
Scala.

The following APIs are not consistent:
* class
* method
** recommendation.MatrixFactorizationModel.predictAll() (Because it's a public 
API, so not change it)
* parameter
** feature.StandardScaler.fit()
** many transform() function of feature module

  was:
Check and ensure the MLlib Python API(class/method/parameter) consistent with 
Scala.

The following APIs are not consistent:
* class
* method
** recommendation.MatrixFactorizationModel.predictAll()
* parameter
** feature.StandardScaler.fit()
** many transform() function of feature module


> MLlib Python API consistency check
> --
>
> Key: SPARK-7667
> URL: https://issues.apache.org/jira/browse/SPARK-7667
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Yanbo Liang
>
> Check and ensure the MLlib Python API(class/method/parameter) consistent with 
> Scala.
> The following APIs are not consistent:
> * class
> * method
> ** recommendation.MatrixFactorizationModel.predictAll() (Because it's a 
> public API, so not change it)
> * parameter
> ** feature.StandardScaler.fit()
> ** many transform() function of feature module
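
For reference, here is a small Scala sketch added for readers (it paraphrases the MLlib 1.4 signatures being matched; check the Scala sources for the authoritative definitions) of the Scala-side fit/transform shapes the Python API is compared against:
{code}
// Reference sketch (paraphrased signatures, assuming MLlib 1.4; check the
// Scala sources for the authoritative definitions).
import org.apache.spark.mllib.feature.{StandardScaler, StandardScalerModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def fitExample(scaler: StandardScaler, data: RDD[Vector]): StandardScalerModel =
  scaler.fit(data)       // the Python fit() is expected to take the same kind of input

def transformExample(model: StandardScalerModel, v: Vector): Vector =
  model.transform(v)     // and the Python transform() should mirror this parameter
{code}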



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7667) MLlib Python API consistency check

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7667:
---

Assignee: Apache Spark

> MLlib Python API consistency check
> --
>
> Key: SPARK-7667
> URL: https://issues.apache.org/jira/browse/SPARK-7667
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Check and ensure the MLlib Python API(class/method/parameter) consistent with 
> Scala.
> The following APIs are not consistent:
> * class
> * method
> ** recommendation.MatrixFactorizationModel.predictAll()
> * parameter
> ** feature.StandardScaler.fit()
> ** many transform() function of feature module



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7667) MLlib Python API consistency check

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7667:
---

Assignee: (was: Apache Spark)

> MLlib Python API consistency check
> --
>
> Key: SPARK-7667
> URL: https://issues.apache.org/jira/browse/SPARK-7667
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Yanbo Liang
>
> Check and ensure the MLlib Python API(class/method/parameter) consistent with 
> Scala.
> The following APIs are not consistent:
> * class
> * method
> ** recommendation.MatrixFactorizationModel.predictAll()
> * parameter
> ** feature.StandardScaler.fit()
> ** many transform() function of feature module



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7667) MLlib Python API consistency check

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589565#comment-14589565
 ] 

Apache Spark commented on SPARK-7667:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6856

> MLlib Python API consistency check
> --
>
> Key: SPARK-7667
> URL: https://issues.apache.org/jira/browse/SPARK-7667
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Yanbo Liang
>
> Check and ensure the MLlib Python API(class/method/parameter) consistent with 
> Scala.
> The following APIs are not consistent:
> * class
> * method
> ** recommendation.MatrixFactorizationModel.predictAll()
> * parameter
> ** feature.StandardScaler.fit()
> ** many transform() function of feature module



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7667) MLlib Python API consistency check

2015-06-17 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589587#comment-14589587
 ] 

Yanbo Liang commented on SPARK-7667:


[~josephkb] I have finished the check and listed the major inconsistencies in 
this JIRA.

> MLlib Python API consistency check
> --
>
> Key: SPARK-7667
> URL: https://issues.apache.org/jira/browse/SPARK-7667
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Yanbo Liang
>
> Check and ensure the MLlib Python API(class/method/parameter) consistent with 
> Scala.
> The following APIs are not consistent:
> * class
> * method
> ** recommendation.MatrixFactorizationModel.predictAll() (Because it's a 
> public API, so not change it)
> * parameter
> ** feature.StandardScaler.fit()
> ** many transform() function of feature module



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8406) Race condition when writing Parquet files

2015-06-17 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589592#comment-14589592
 ] 

Cheng Lian commented on SPARK-8406:
---

An example task completion and scheduling order that causes overwriting:

# Writing a DataFrame with 4 RDD partitions to an empty directory.
# Task 1 and task 2 get scheduled, while task 3 and task 4 are queued.  Both 
task 1 and task 2 find current max part number to be 0 (because destination 
directory is empty).
# Task 1 finishes, generates {{part-r-1.gz.parquet}}. Current max part 
number becomes 1.
# Task 4 gets scheduled, decides to write to {{part-r-5.gz.parquet}} (5 = 
current max part number + task ID), but hasn't started writing the file yet.
# Task 2 finishes, generates {{part-r-2.gz.parquet}}. Current max part 
number becomes 2.
# Task 3 gets scheduled, also decides to write to {{part-r-5.gz.parquet}}, 
since task 4 hasn't started writing its output file and task 3 finds that the 
current max part number is still 2.
# Task 4 finishes writing {{part-r-5.gz.parquet}}
# Task 3 finishes writing {{part-r-5.gz.parquet}}
# Output of task 4 is overwritten.
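
The issue description notes that the ORC data source avoids this by deriving file names from the task ID plus a timestamp; a minimal sketch of that idea, added here for illustration (it is not the actual Spark code), looks like:
{code}
// Illustration only (not the actual Spark code): a collision-resistant naming
// scheme along the lines of the ORC data source mentioned in the description.
// The name depends only on values each task knows locally, so two concurrent
// tasks cannot pick the same file name regardless of scheduling order.
def outputFileName(taskId: Int): String = {
  val stamp = System.currentTimeMillis()  // assumed unique enough per write job
  f"part-r-$taskId%05d-$stamp.gz.parquet"
}
{code}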

> Race condition when writing Parquet files
> -
>
> Key: SPARK-8406
> URL: https://issues.apache.org/jira/browse/SPARK-8406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> To support appending, the Parquet data source tries to find out the max ID of 
> part-files in the destination directory (the <id> in the output file name 
> "part-r-<id>.gz.parquet") at the beginning of the write job. In 1.3.0, this 
> step happens on driver side before any files are written. However, in 1.4.0, 
> this is moved to task side. Thus, for tasks scheduled later, they may see 
> wrong max ID generated by newly written files by other finished tasks within 
> the same job. This actually causes a race condition. In most cases, this only 
> causes nonconsecutive IDs in output file names. But when the DataFrame 
> contains thousands of RDD partitions, it's likely that two tasks may choose 
> the same ID, thus one of them gets overwritten by the other.
> The data loss situation is not quite easy to reproduce. But the following 
> Spark shell snippet can reproduce nonconsecutive output file IDs:
> {code}
> sqlContext.range(0, 
> 128).repartition(16).write.mode("overwrite").parquet("foo")
> {code}
> "16" can be replaced with any integer that is greater than the default 
> parallelism on your machine (usually it means core number, on my machine it's 
> 8).
> {noformat}
> -rw-r--r--   3 lian supergroup  0 2015-06-17 00:06 
> /user/lian/foo/_SUCCESS
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-1.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-2.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-3.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-4.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-5.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-6.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-7.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-8.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-00017.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-00018.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-00019.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-00020.gz.parquet
> -rw-r--r--   3 lian supergroup352 2015-06-17 00:06 
> /user/lian/foo/part-r-00021.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-00022.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-00023.gz.parquet
> -rw-r--r--   3 lian supergroup353 2015-06-17 00:06 
> /user/lian/foo/part-r-00024.gz.parquet
> {noformat}
> Notice that the newly added ORC data source doesn't suffer from this issue 
> because it uses both the task ID and {{System.currentTimeMillis()}} to 
> generate the output file name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-8333) Spark failed to delete temp directory created by HiveContext

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589596#comment-14589596
 ] 

Apache Spark commented on SPARK-8333:
-

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/6858

> Spark failed to delete temp directory created by HiveContext
> 
>
> Key: SPARK-8333
> URL: https://issues.apache.org/jira/browse/SPARK-8333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Windows7 64bit
>Reporter: sheng
>Priority: Minor
>  Labels: Hive, metastore, sparksql
>
> Spark 1.4.0 failed to stop SparkContext.
> {code:title=LocalHiveTest.scala|borderStyle=solid}
>  val sc = new SparkContext("local", "local-hive-test", new SparkConf())
>  val hc = Utils.createHiveContext(sc)
>  ... // execute some HiveQL statements
>  sc.stop()
> {code}
> sc.stop() failed to execute, it threw the following exception:
> {quote}
> 15/06/13 03:19:06 INFO Utils: Shutdown hook called
> 15/06/13 03:19:06 INFO Utils: Deleting directory 
> C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
> 15/06/13 03:19:06 ERROR Utils: Exception while deleting Spark temp dir: 
> C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
> java.io.IOException: Failed to delete: 
> C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:963)
>   at 
> org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:204)
>   at 
> org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:201)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at org.apache.spark.util.Utils$$anonfun$1.apply$mcV$sp(Utils.scala:201)
>   at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2292)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2262)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2262)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2262)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2244)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> {quote}
> It seems this bug was introduced by SPARK-6907, where a local hive metastore 
> is created in a temp directory. The problem is that the local hive metastore 
> is not shut down correctly. At the end of the application, if 
> SparkContext.stop() is called, it tries to delete the temp directory, which is 
> still in use by the local hive metastore, and throws an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8320) Add example in streaming programming guide that shows union of multiple input streams

2015-06-17 Thread Nirman Narang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589612#comment-14589612
 ] 

Nirman Narang commented on SPARK-8320:
--

This seems interesting; I would like to solve this issue.


> Add example in streaming programming guide that shows union of multiple input 
> streams
> -
>
> Key: SPARK-8320
> URL: https://issues.apache.org/jira/browse/SPARK-8320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> The section on "Level of Parallelism in Data Receiving" has a Scala and a 
> Java example for union of multiple input streams. A python example should be 
> added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8391) showDagViz throws OutOfMemoryError, cause the whole jobPage dies

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8391:
---

Assignee: Apache Spark

> showDagViz throws OutOfMemoryError, cause the whole jobPage dies
> 
>
> Key: SPARK-8391
> URL: https://issues.apache.org/jira/browse/SPARK-8391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: meiyoula
>Assignee: Apache Spark
>
> When the job is big and has many DAG nodes and edges, showDagViz throws an 
> error, and then the whole jobPage fails to render. I think that's unsuitable: 
> a single page element shouldn't take down the whole page.
> Below is the exception stack trace:
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> at java.lang.StringBuilder.append(StringBuilder.java:136)
> at 
> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:207)
> at 
> org.apache.spark.ui.scope.RDDOperationGraph$$anonfun$makeDotFile$1.apply(RDDOperationGraph.scala:171)
> at 
> org.apache.spark.ui.scope.RDDOperationGraph$$anonfun$makeDotFile$1.apply(RDDOperationGraph.scala:171)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
> at 
> org.apache.spark.ui.scope.RDDOperationGraph$.makeDotFile(RDDOperationGraph.scala:171)
> at 
> org.apache.spark.ui.UIUtils$$anonfun$showDagViz$1.apply(UIUtils.scala:389)
> at 
> org.apache.spark.ui.UIUtils$$anonfun$showDagViz$1.apply(UIUtils.scala:385)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.ui.UIUtils$.showDagViz(UIUtils.scala:385)
> at org.apache.spark.ui.UIUtils$.showDagVizForJob(UIUtils.scala:363)
> at org.apache.spark.ui.jobs.JobPage.render(JobPage.scala:317)
> at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
> at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
> at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:75)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
> at 
> org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
> at 
> org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496)
> at 
> com.huawei.spark.web.filter.SessionTimeoutFilter.doFilter(SessionTimeoutFilter.java:80)
> at 
> org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
> at 
> org.jasig.cas.client.util.HttpServletRequestWrapperFilter.doFilter(HttpServletRequestWrapperFilter.java:75
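
One possible mitigation, sketched below purely for illustration (the helper name is hypothetical and this is not the fix in the pull request), is to guard the rendering so that a failure degrades to an empty element instead of failing the whole page:
{code}
// Illustration only (hypothetical helper, not the fix in the pull request):
// render the DAG visualization defensively so that a failure (including an
// OutOfMemoryError on a huge graph) yields an empty element instead of
// taking down the whole job page.
def safeDagViz(render: () => String): String =
  try {
    render()
  } catch {
    case _: Throwable => ""  // OutOfMemoryError is an Error, so catch Throwable
  }
{code}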



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8391) showDagViz throws OutOfMemoryError, cause the whole jobPage dies

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8391:
---

Assignee: (was: Apache Spark)

> showDagViz throws OutOfMemoryError, cause the whole jobPage dies
> 
>
> Key: SPARK-8391
> URL: https://issues.apache.org/jira/browse/SPARK-8391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: meiyoula
>
> When the job is big and has many DAG nodes and edges, showDagViz throws an 
> error, and then the whole jobPage fails to render. I think that's unsuitable: 
> a single page element shouldn't take down the whole page.
> Below is the exception stack trace:
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> at java.lang.StringBuilder.append(StringBuilder.java:136)
> at 
> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:207)
> at 
> org.apache.spark.ui.scope.RDDOperationGraph$$anonfun$makeDotFile$1.apply(RDDOperationGraph.scala:171)
> at 
> org.apache.spark.ui.scope.RDDOperationGraph$$anonfun$makeDotFile$1.apply(RDDOperationGraph.scala:171)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
> at 
> org.apache.spark.ui.scope.RDDOperationGraph$.makeDotFile(RDDOperationGraph.scala:171)
> at 
> org.apache.spark.ui.UIUtils$$anonfun$showDagViz$1.apply(UIUtils.scala:389)
> at 
> org.apache.spark.ui.UIUtils$$anonfun$showDagViz$1.apply(UIUtils.scala:385)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.ui.UIUtils$.showDagViz(UIUtils.scala:385)
> at org.apache.spark.ui.UIUtils$.showDagVizForJob(UIUtils.scala:363)
> at org.apache.spark.ui.jobs.JobPage.render(JobPage.scala:317)
> at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
> at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
> at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:75)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
> at 
> org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
> at 
> org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496)
> at 
> com.huawei.spark.web.filter.SessionTimeoutFilter.doFilter(SessionTimeoutFilter.java:80)
> at 
> org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
> at 
> org.jasig.cas.client.util.HttpServletRequestWrapperFilter.doFilter(HttpServletRequestWrapperFilter.java:75



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8391) showDagViz throws OutOfMemoryError, cause the whole jobPage dies

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589614#comment-14589614
 ] 

Apache Spark commented on SPARK-8391:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/6859

> showDagViz throws OutOfMemoryError, cause the whole jobPage dies
> 
>
> Key: SPARK-8391
> URL: https://issues.apache.org/jira/browse/SPARK-8391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: meiyoula
>
> When the job is big and has many DAG nodes and edges, showDagViz throws an 
> error, and then the whole jobPage fails to render. I think that's unsuitable: 
> a single page element shouldn't take down the whole page.
> Below is the exception stack trace:
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> at java.lang.StringBuilder.append(StringBuilder.java:136)
> at 
> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:207)
> at 
> org.apache.spark.ui.scope.RDDOperationGraph$$anonfun$makeDotFile$1.apply(RDDOperationGraph.scala:171)
> at 
> org.apache.spark.ui.scope.RDDOperationGraph$$anonfun$makeDotFile$1.apply(RDDOperationGraph.scala:171)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
> at 
> org.apache.spark.ui.scope.RDDOperationGraph$.makeDotFile(RDDOperationGraph.scala:171)
> at 
> org.apache.spark.ui.UIUtils$$anonfun$showDagViz$1.apply(UIUtils.scala:389)
> at 
> org.apache.spark.ui.UIUtils$$anonfun$showDagViz$1.apply(UIUtils.scala:385)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.ui.UIUtils$.showDagViz(UIUtils.scala:385)
> at org.apache.spark.ui.UIUtils$.showDagVizForJob(UIUtils.scala:363)
> at org.apache.spark.ui.jobs.JobPage.render(JobPage.scala:317)
> at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
> at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79)
> at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:75)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
> at 
> org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
> at 
> org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496)
> at 
> com.huawei.spark.web.filter.SessionTimeoutFilter.doFilter(SessionTimeoutFilter.java:80)
> at 
> org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
> at 
> org.jasig.cas.client.util.HttpServletRequestWrapperFilter.doFilter(HttpServletRequestWrapperFilter.java:75



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8408) Python OR operator is not considered while creating a column of boolean type

2015-06-17 Thread JIRA
Felix Maximilian Möller created SPARK-8408:
--

 Summary: Python OR operator is not considered while creating a 
column of boolean type
 Key: SPARK-8408
 URL: https://issues.apache.org/jira/browse/SPARK-8408
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: OSX Apache Spark 1.4.0
Reporter: Felix Maximilian Möller
Priority: Minor



Given
=

.. code:: python

d = [{'name': 'Alice', 'age': 1},{'name': 'Bob', 'age': 2}]
person_df = sqlContext.createDataFrame(d)
When


.. code:: python

person_df.filter(person_df.age==1 or person_df.age==2).collect()
Expected


[Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]

Actual
==

[Row(age=1, name=u'Alice')]

While
=

.. code:: python

person_df.filter("age = 1 or age = 2").collect()
yields the correct result:

[Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8408) Python OR operator is not considered while creating a column of boolean type

2015-06-17 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-8408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Maximilian Möller updated SPARK-8408:
---
Attachment: bug_report.ipynb.json

IPython notebook with the code that reflects the bug.

> Python OR operator is not considered while creating a column of boolean type
> 
>
> Key: SPARK-8408
> URL: https://issues.apache.org/jira/browse/SPARK-8408
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
> Environment: OSX Apache Spark 1.4.0
>Reporter: Felix Maximilian Möller
>Priority: Minor
> Attachments: bug_report.ipynb.json
>
>
> Given
> =
> .. code:: python
> d = [{'name': 'Alice', 'age': 1},{'name': 'Bob', 'age': 2}]
> person_df = sqlContext.createDataFrame(d)
> When
> 
> .. code:: python
> person_df.filter(person_df.age==1 or person_df.age==2).collect()
> Expected
> 
> [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]
> Actual
> ==
> [Row(age=1, name=u'Alice')]
> While
> =
> .. code:: python
> person_df.filter("age = 1 or age = 2").collect()
> yields the correct result:
> [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8408) Python OR operator is not considered while creating a column of boolean type

2015-06-17 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-8408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Maximilian Möller updated SPARK-8408:
---
Description: 
h3. Given

{code}
d = [{'name': 'Alice', 'age': 1},{'name': 'Bob', 'age': 2}]
person_df = sqlContext.createDataFrame(d)
{code}
h3. When
{code}
person_df.filter(person_df.age==1 or person_df.age==2).collect()
{code}
h3. Expected

[Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]

h3. Actual

[Row(age=1, name=u'Alice')]

h3. While
{code}
person_df.filter("age = 1 or age = 2").collect()
{code}
yields the correct result:

[Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]


  was:

Given
=

.. code:: python

d = [{'name': 'Alice', 'age': 1},{'name': 'Bob', 'age': 2}]
person_df = sqlContext.createDataFrame(d)
When


.. code:: python

person_df.filter(person_df.age==1 or person_df.age==2).collect()
Expected


[Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]

Actual
==

[Row(age=1, name=u'Alice')]

While
=

.. code:: python

person_df.filter("age = 1 or age = 2").collect()
yields the correct result:

[Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]



> Python OR operator is not considered while creating a column of boolean type
> 
>
> Key: SPARK-8408
> URL: https://issues.apache.org/jira/browse/SPARK-8408
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
> Environment: OSX Apache Spark 1.4.0
>Reporter: Felix Maximilian Möller
>Priority: Minor
> Attachments: bug_report.ipynb.json
>
>
> h3. Given
> {code}
> d = [{'name': 'Alice', 'age': 1},{'name': 'Bob', 'age': 2}]
> person_df = sqlContext.createDataFrame(d)
> {code}
> h3. When
> {code}
> person_df.filter(person_df.age==1 or person_df.age==2).collect()
> {code}
> h3. Expected
> [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]
> h3. Actual
> [Row(age=1, name=u'Alice')]
> h3. While
> {code}
> person_df.filter("age = 1 or age = 2").collect()
> {code}
> yields the correct result:
> [Row(age=1, name=u'Alice'), Row(age=2, name=u'Bob')]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems

2015-06-17 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589659#comment-14589659
 ] 

Steve Loughran commented on SPARK-7009:
---

The issue isn't that python can't read large JARs, it's that it can't read the 
header for large JARs

> Build assembly JAR via ant to avoid zip64 problems
> --
>
> Key: SPARK-7009
> URL: https://issues.apache.org/jira/browse/SPARK-7009
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.0
> Environment: Java 7+
>Reporter: Steve Loughran
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a 
> format incompatible with Java and pyspark.
> Provided the total number of .class files+resources is <64K, ant can be used 
> to make the final JAR instead, perhaps by unzipping the maven-generated JAR 
> then rezipping it with zip64=never, before publishing the artifact via maven.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8409) i cant able to read .csv files using read.df() in sparkR of spark 1.4 for eg.) mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", header="false")

2015-06-17 Thread Arun (JIRA)
Arun created SPARK-8409:
---

 Summary:  i cant able to read .csv files using read.df() in sparkR 
of spark 1.4 for eg.) mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", 
source="json", header="false") 
 Key: SPARK-8409
 URL: https://issues.apache.org/jira/browse/SPARK-8409
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.0
 Environment: sparkR API
Reporter: Arun
Priority: Critical


Hi, 
In SparkR shell, I invoke: 
> mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> header="false") 
I have tried various file types (csv, txt); all fail. 

Response: "ERROR RBackendHandler: load on 1 failed" 
The whole response is below: 
15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
curMem=0, maxMem=278302556 
15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 173.4 KB, free 265.2 MB) 
15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
curMem=177600, maxMem=278302556 
15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 16.2 KB, free 265.2 MB) 
15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
NativeMethodAccessorImpl.java:-2 
15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded. 
15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
java.lang.reflect.InvocationTargetException 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:606) 
at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
 
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
 
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
 
at java.lang.Thread.run(Thread.java:745) 
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json 
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
 
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) 
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) 
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
 

[jira] [Commented] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error

2015-06-17 Thread John Omernik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589663#comment-14589663
 ] 

John Omernik commented on SPARK-7819:
-

Yin -

This occurred with my 1.4.0 release (the official release, not an earlier one). I 
had to put the prefixes in my spark-defaults.conf for it to work. Is this the 
official way to work around the problem, or will something change in the code so 
that the prefixes are no longer required?

Thanks
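
(For reference, the workaround described above would look roughly like the following 
in spark-defaults.conf. The property name and the prefix list are my assumptions, 
since the comment does not say which prefixes were used; treat this as a sketch of 
the kind of setting meant, not the confirmed fix.)

{noformat}
# Classes matching these prefixes are shared between Spark's root classloader and
# the isolated Hive client loader, so a class that loads a native library is only
# loaded once.  "com.mapr" is an assumed prefix for the MapR client classes.
spark.sql.hive.metastore.sharedPrefixes  com.mysql.jdbc,org.postgresql,com.mapr
{noformat}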



> Isolated Hive Client Loader appears to cause Native Library 
> libMapRClient.4.0.2-mapr.so already loaded in another classloader error
> ---
>
> Key: SPARK-7819
> URL: https://issues.apache.org/jira/browse/SPARK-7819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Fi
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.4.0
>
> Attachments: invalidClassException.log, stacktrace.txt, test.py
>
>
> In reference to the pull request: https://github.com/apache/spark/pull/5876
> I have been running the Spark 1.3 branch for some time with no major hiccups, 
> and recently switched to the Spark 1.4 branch.
> I build my spark distribution with the following build command:
> {noformat}
> make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive 
> -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver
> {noformat}
> When running a python script containing a series of smoke tests I use to 
> validate the build, I encountered an error under the following conditions:
> * start a spark context
> * start a hive context
> * run any hive query
> * stop the spark context
> * start a second spark context
> * run any hive query
> ** ERROR
> From what I can tell, the Isolated Class Loader is hitting a MapR class that 
> is loading its native library (presumably as part of a static initializer).
> Unfortunately, the JVM prohibits this the second time around.
> I would think that shutting down the SparkContext would clear out any 
> leftover state in the JVM, so I'm surprised that this would even be a problem.
> Note: all other smoke tests we are running pass fine.
> I will attach the stacktrace and a python script reproducing the issue (at 
> least for my environment and build).
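
The reproduction sequence above, sketched in Scala (the attached reproducer is a 
Python script; this is only an equivalent outline with a placeholder query, not the 
attached test.py):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("hive-client-reload-repro")

// First context: the first Hive query goes through the isolated client loader,
// which ends up loading the MapR native library.
val sc1 = new SparkContext(conf)
new HiveContext(sc1).sql("SHOW TABLES").collect()
sc1.stop()

// Second context: the same native library is loaded again from a fresh
// classloader, which the JVM rejects with the error in the summary.
val sc2 = new SparkContext(conf)
new HiveContext(sc2).sql("SHOW TABLES").collect()
sc2.stop()
{code}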



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8409) In windows cant able to read .csv files using read.df() in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-ha

2015-06-17 Thread Arun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun updated SPARK-8409:

Summary:  In windows cant able to read .csv files using read.df() in sparkR 
of spark 1.4 for eg.) df_1<- read.df(sqlContext, 
"E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
 source = "csv")  (was:  i cant able to read .csv files using read.df() in 
sparkR of spark 1.4 for eg.) mydf<-read.df(sqlContext, 
"/home/esten/ami/usaf.json", source="json", header="false") )

>  In windows cant able to read .csv files using read.df() in sparkR of spark 
> 1.4 for eg.) df_1<- read.df(sqlContext, 
> "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
>  source = "csv")
> 
>
> Key: SPARK-8409
> URL: https://issues.apache.org/jira/browse/SPARK-8409
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
> Environment: sparkR API
>Reporter: Arun
>Priority: Critical
>  Labels: build
>
> Hi, 
> In SparkR shell, I invoke: 
> > mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> > header="false") 
> I have tried various filetypes (csv, txt), all fail.   
> RESPONSE: "ERROR RBackendHandler: load on 1 failed" 
> BELOW THE WHOLE RESPONSE: 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
> curMem=0, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 173.4 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
> curMem=177600, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.2 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
> 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
> NativeMethodAccessorImpl.java:-2 
> 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded. 
> 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
> java.lang.reflect.InvocationTargetException 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:606) 
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
>  
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>  
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
> 

[jira] [Updated] (SPARK-8409) In windows cant able to read .csv or .json files using read.df() in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4

2015-06-17 Thread Arun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun updated SPARK-8409:

Summary:  In windows cant able to read .csv or .json files using read.df() 
in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, 
"E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
 source = "csv")  (was:  In windows cant able to read .csv files using 
read.df() in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, 
"E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
 source = "csv"))

>  In windows cant able to read .csv or .json files using read.df() in sparkR 
> of spark 1.4 for eg.) df_1<- read.df(sqlContext, 
> "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
>  source = "csv")
> -
>
> Key: SPARK-8409
> URL: https://issues.apache.org/jira/browse/SPARK-8409
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
> Environment: sparkR API
>Reporter: Arun
>Priority: Critical
>  Labels: build
>
> Hi, 
> In SparkR shell, I invoke: 
> > mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> > header="false") 
> I have tried various filetypes (csv, txt), all fail.   
> RESPONSE: "ERROR RBackendHandler: load on 1 failed" 
> BELOW THE WHOLE RESPONSE: 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
> curMem=0, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 173.4 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
> curMem=177600, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.2 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
> 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
> NativeMethodAccessorImpl.java:-2 
> 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded. 
> 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
> java.lang.reflect.InvocationTargetException 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:606) 
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
>  
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>  
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys

[jira] [Updated] (SPARK-8409) In windows cant able to read .csv or .json files using read.df() in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4

2015-06-17 Thread Arun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun updated SPARK-8409:

Description: 
Hi, 
In SparkR shell, I invoke: 
> mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> header="false") 
I have tried various filetypes (csv, txt), all fail.   

RESPONSE: "ERROR RBackendHandler: load on 1 failed" 
BELOW THE WHOLE RESPONSE: 
15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
curMem=0, maxMem=278302556 
15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 173.4 KB, free 265.2 MB) 
15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
curMem=177600, maxMem=278302556 
15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 16.2 KB, free 265.2 MB) 
15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
NativeMethodAccessorImpl.java:-2 
15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded. 
15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
java.lang.reflect.InvocationTargetException 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:606) 
at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
 
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
 
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
 
at java.lang.Thread.run(Thread.java:745) 
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json 
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
 
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) 
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) 
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.s

[jira] [Updated] (SPARK-8409) In windows cant able to read .csv or .json files using read.df() in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4

2015-06-17 Thread Arun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun updated SPARK-8409:

Description: 
Hi, 
In SparkR shell, I invoke: 
> mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> header="false") 
I have tried various filetypes (csv, txt), all fail.   

RESPONSE: "ERROR RBackendHandler: load on 1 failed" 
BELOW THE WHOLE RESPONSE: 
15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
curMem=0, maxMem=278302556 
15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 173.4 KB, free 265.2 MB) 
15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
curMem=177600, maxMem=278302556 
15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 16.2 KB, free 265.2 MB) 
15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
NativeMethodAccessorImpl.java:-2 
15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded. 
15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
java.lang.reflect.InvocationTargetException 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:606) 
at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
 
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
 
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
 
at java.lang.Thread.run(Thread.java:745) 
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json 
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
 
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) 
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) 
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.s

[jira] [Commented] (SPARK-6653) New configuration property to specify port for sparkYarnAM actor system

2015-06-17 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589699#comment-14589699
 ] 

Jeff Zhang commented on SPARK-6653:
---

Although this is already committed, would it be better to specify a port range?
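
As an illustration of what a port-range option could look like (purely a hypothetical 
sketch; neither the range format nor the helper below is part of the committed 
change):

{code}
import java.net.ServerSocket
import scala.util.Try

// Hypothetical: resolve a range such as "49152-49200" to the first port that can
// be bound; the result would then be handed to AkkaUtils.createActorSystem in
// place of a single fixed port.
def firstFreePort(range: String): Option[Int] = {
  val Array(lo, hi) = range.split("-").map(_.trim.toInt)
  (lo to hi).find { port =>
    Try(new ServerSocket(port).close()).isSuccess  // bind succeeds only if the port is free
  }
}
{code}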



> New configuration property to specify port for sparkYarnAM actor system
> ---
>
> Key: SPARK-6653
> URL: https://issues.apache.org/jira/browse/SPARK-6653
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.1
> Environment: Spark On Yarn
>Reporter: Manoj Samel
>Assignee: Shekhar Bansal
>Priority: Minor
> Fix For: 1.4.0
>
>
> In the 1.3.0 code line, the sparkYarnAM actor system is started on a random port. 
> See org.apache.spark.deploy.yarn ApplicationMaster.scala:282
> actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName, 
> 0, conf = sparkConf, securityManager = securityMgr)._1
> This may be an issue when the ports between the Spark client and the YARN cluster 
> are limited by a firewall and not all ports are open between them.
> The proposal is to introduce a new property, spark.am.actor.port, and change the 
> code to
> val port = sparkConf.getInt("spark.am.actor.port", 0)
> actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", 
> Utils.localHostName, port,
>   conf = sparkConf, securityManager = securityMgr)._1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-06-17 Thread John Omernik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589736#comment-14589736
 ] 

John Omernik commented on SPARK-1403:
-

This is still occurring in 1.4.0.

I've attempted to look through the code to determine what may have changed, but 
the class-loading code has shifted around quite a bit, and I could not pinpoint 
when the change happened or which update broke it again.  If there is anything I 
can do to help troubleshoot, please advise. 



> Spark on Mesos does not set Thread's context class loader
> -
>
> Key: SPARK-1403
> URL: https://issues.apache.org/jira/browse/SPARK-1403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.3.0, 1.4.0
> Environment: ubuntu 12.04 on vagrant
>Reporter: Bharath Bhushan
>Priority: Blocker
> Fix For: 1.0.0
>
>
> I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark 
> executor on mesos slave throws a  java.lang.ClassNotFoundException for 
> org.apache.spark.serializer.JavaSerializer.
> The lengthy discussion is here: 
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7688) PySpark + ipython throws port out of range exception

2015-06-17 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589747#comment-14589747
 ] 

Manoj Kumar commented on SPARK-7688:


I faced the same issue with a very old version of IPython. It went away when I 
upgraded it.

A note should be made somewhere, so that it is useful for other people.

> PySpark + ipython throws port out of range exception
> 
>
> Key: SPARK-7688
> URL: https://issues.apache.org/jira/browse/SPARK-7688
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Davies Liu
>Priority: Critical
>
> Saw this error when I set PYSPARK_PYTHON to "ipython" and ran `bin/pyspark`:
> {code}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: port out of 
> range:458964785
>   at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
>   at java.net.InetSocketAddress.(InetSocketAddress.java:185)
>   at java.net.Socket.(Socket.java:241)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
>   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:72)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> My env (OS X 10.10):
> {code}
>  ipython
> Python 2.7.9 |Anaconda 2.2.0 (x86_64)| (default, Dec 15 2014, 10:37:34)
> Type "copyright", "credits" or "license" for more information.
> IPython 3.0.0 -- An enhanced Interactive Python.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8410) Hive VersionsSuite RuntimeException

2015-06-17 Thread Josiah Samuel Sathiadass (JIRA)
Josiah Samuel Sathiadass created SPARK-8410:
---

 Summary: Hive VersionsSuite RuntimeException
 Key: SPARK-8410
 URL: https://issues.apache.org/jira/browse/SPARK-8410
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 1.4.0, 1.3.1
 Environment: IBM Power system - P7
running Ubuntu 14.04LE
with IBM JDK version 1.7.0
Reporter: Josiah Samuel Sathiadass
Priority: Minor


While testing Spark Project Hive, there are RuntimeExceptions as follows,

VersionsSuite:
- success sanity check *** FAILED ***
  java.lang.RuntimeException: [download failed: 
org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: 
org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar, download failed: 
asm#asm;3.2!asm.jar]
  at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
  at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
  at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
  at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:38)
  at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.org$apache$spark$sql$hive$client$IsolatedClientLoader$$downloadVersion(IsolatedClientLoader.scala:61)
  at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44)
  at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44)
  at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
  at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
  at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:44)
  ...

The tests are executed with the following set of options,

build/mvn --pl sql/hive --fail-never -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 
test

Adding the following dependencies to the "spark/sql/hive/pom.xml" file solves 
this issue:

<dependency>
  <groupId>org.jboss.netty</groupId>
  <artifactId>netty</artifactId>
  <version>3.2.2.Final</version>
  <scope>test</scope>
</dependency>

<dependency>
  <groupId>org.codehaus.groovy</groupId>
  <artifactId>groovy-all</artifactId>
  <version>2.1.6</version>
  <scope>test</scope>
</dependency>

<dependency>
  <groupId>asm</groupId>
  <artifactId>asm</artifactId>
  <version>3.2</version>
  <scope>test</scope>
</dependency>

The question is: is this the correct way to fix this RuntimeException?
If yes, can a pull request fix this issue permanently?
If not, suggestions please.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer

2015-06-17 Thread Olivier Girardot (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Girardot updated SPARK-8332:

Priority: Critical  (was: Major)

> NoSuchMethodError: 
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
> --
>
> Key: SPARK-8332
> URL: https://issues.apache.org/jira/browse/SPARK-8332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: spark 1.4 & hadoop 2.3.0-cdh5.0.0
>Reporter: Tao Li
>Priority: Critical
>  Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson
>
> I compiled the new Spark 1.4.0 version. But when I run a simple WordCount demo, it 
> throws NoSuchMethodError "java.lang.NoSuchMethodError: 
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer". 
> I found the default "fasterxml.jackson.version" is 2.4.4. Is there anything 
> wrong, or a conflict with the Jackson version? Or could some project Maven 
> dependency pull in the wrong Jackson version?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer

2015-06-17 Thread Olivier Girardot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589903#comment-14589903
 ] 

Olivier Girardot commented on SPARK-8332:
-

I have the same issue on a CDH5 cluster

> NoSuchMethodError: 
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
> --
>
> Key: SPARK-8332
> URL: https://issues.apache.org/jira/browse/SPARK-8332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: spark 1.4 & hadoop 2.3.0-cdh5.0.0
>Reporter: Tao Li
>Priority: Critical
>  Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson
>
> I compiled the new Spark 1.4.0 version. But when I run a simple WordCount demo, it 
> throws NoSuchMethodError "java.lang.NoSuchMethodError: 
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer". 
> I found the default "fasterxml.jackson.version" is 2.4.4. Is there anything 
> wrong, or a conflict with the Jackson version? Or could some project Maven 
> dependency pull in the wrong Jackson version?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer

2015-06-17 Thread Olivier Girardot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589903#comment-14589903
 ] 

Olivier Girardot edited comment on SPARK-8332 at 6/17/15 3:12 PM:
--

I have the same issue on a CDH5 cluster trying to use spark-submit


was (Author: ogirardot):
I have the same issue on a CDH5 cluster

> NoSuchMethodError: 
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
> --
>
> Key: SPARK-8332
> URL: https://issues.apache.org/jira/browse/SPARK-8332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: spark 1.4 & hadoop 2.3.0-cdh5.0.0
>Reporter: Tao Li
>Priority: Critical
>  Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson
>
> I compiled the new Spark 1.4.0 version. But when I run a simple WordCount demo, it 
> throws NoSuchMethodError "java.lang.NoSuchMethodError: 
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer". 
> I found the default "fasterxml.jackson.version" is 2.4.4. Is there anything 
> wrong, or a conflict with the Jackson version? Or could some project Maven 
> dependency pull in the wrong Jackson version?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-17 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-8409:
-
Summary:  In windows cant able to read .csv or .json files using read.df()  
(was:  In windows cant able to read .csv or .json files using read.df() in 
sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, 
"E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
 source = "csv"))

>  In windows cant able to read .csv or .json files using read.df()
> -
>
> Key: SPARK-8409
> URL: https://issues.apache.org/jira/browse/SPARK-8409
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
> Environment: sparkR API
>Reporter: Arun
>Priority: Critical
>  Labels: build
>
> Hi, 
> In SparkR shell, I invoke: 
> > mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> > header="false") 
> I have tried various filetypes (csv, txt), all fail.   
> RESPONSE: "ERROR RBackendHandler: load on 1 failed" 
> BELOW THE WHOLE RESPONSE: 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
> curMem=0, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 173.4 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
> curMem=177600, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.2 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
> 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
> NativeMethodAccessorImpl.java:-2 
> 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded. 
> 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
> java.lang.reflect.InvocationTargetException 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:606) 
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
>  
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>  
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>  
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>  
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: hdfs://smalldata13.hdp:8020/home/esten

[jira] [Updated] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-17 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-8409:
-
Description: 
Hi, 
In SparkR shell, I invoke: 
> mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> header="false") 
I have tried various filetypes (csv, txt), all fail.   

 in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, 
"E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
 source = "csv")

RESPONSE: "ERROR RBackendHandler: load on 1 failed" 
BELOW THE WHOLE RESPONSE: 
15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
curMem=0, maxMem=278302556 
15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 173.4 KB, free 265.2 MB) 
15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
curMem=177600, maxMem=278302556 
15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 16.2 KB, free 265.2 MB) 
15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
NativeMethodAccessorImpl.java:-2 
15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded. 
15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
java.lang.reflect.InvocationTargetException 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:606) 
at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
 
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
 
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
 
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
 
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
 
at java.lang.Thread.run(Thread.java:745) 
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json 
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
 
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) 
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) 
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) 
at org.apache.spark.rdd.RDD$$anonfu

[jira] [Commented] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589906#comment-14589906
 ] 

Shivaram Venkataraman commented on SPARK-8409:
--

The error here is that the file it is looking for, 
'hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json', is not found. Please 
copy the file out to HDFS for it to work correctly.

To use CSV you will need to add the Spark CSV package to SparkR; 
https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85#file-dataframe_example-r-L6
 has some instructions for this.
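
To make both points concrete, a minimal sketch in Scala (the SparkR calls are 
analogous); the HDFS paths are placeholders, and the spark-csv coordinates below are 
an assumption about which version of the Databricks package to pull in:

{code}
// Assumes a spark-shell session where sqlContext is predefined.

// JSON works out of the box once the file actually exists on HDFS.
val json = sqlContext.read.json("hdfs:///path/to/usaf.json")

// CSV needs the external spark-csv data source on the classpath,
// e.g. a shell started with: --packages com.databricks:spark-csv_2.10:1.0.3
val csv = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///path/to/nycflights13.csv")
{code}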

>  In windows cant able to read .csv or .json files using read.df()
> -
>
> Key: SPARK-8409
> URL: https://issues.apache.org/jira/browse/SPARK-8409
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
> Environment: sparkR API
>Reporter: Arun
>Priority: Critical
>  Labels: build
>
> Hi, 
> In SparkR shell, I invoke: 
> > mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> > header="false") 
> I have tried various filetypes (csv, txt), all fail.   
>  in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, 
> "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
>  source = "csv")
> RESPONSE: "ERROR RBackendHandler: load on 1 failed" 
> BELOW THE WHOLE RESPONSE: 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
> curMem=0, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 173.4 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
> curMem=177600, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.2 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
> 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
> NativeMethodAccessorImpl.java:-2 
> 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded. 
> 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
> java.lang.reflect.InvocationTargetException 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:606) 
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
>  
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>  
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>  
> at 
> io.netty.u

[jira] [Updated] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems

2015-06-17 Thread Harry Brundage (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harry Brundage updated SPARK-7009:
--
Attachment: check_spark_python.sh

Script for checking to see if a spark release artifact jar can be imported by 
Python for use as spark.yarn.jar

> Build assembly JAR via ant to avoid zip64 problems
> --
>
> Key: SPARK-7009
> URL: https://issues.apache.org/jira/browse/SPARK-7009
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.0
> Environment: Java 7+
>Reporter: Steve Loughran
> Attachments: check_spark_python.sh
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a 
> format incompatible with Java and pyspark.
> Provided the total number of .class files+resources is <64K, ant can be used 
> to make the final JAR instead, perhaps by unzipping the maven-generated JAR 
> then rezipping it with zip64=never, before publishing the artifact via maven.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8411) No space left on device

2015-06-17 Thread Mukund Sudarshan (JIRA)
Mukund Sudarshan created SPARK-8411:
---

 Summary: No space left on device
 Key: SPARK-8411
 URL: https://issues.apache.org/jira/browse/SPARK-8411
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Mukund Sudarshan


com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on 
device

This is the error I get when trying to run a program on my cluster. It doesn't 
occur when I run it locally, however. My cluster is certainly not out of space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems

2015-06-17 Thread Harry Brundage (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589913#comment-14589913
 ] 

Harry Brundage commented on SPARK-7009:
---

[~sowen] I wrote a script to test that we can all run to prove if it's 
importable or not and check my assumptions. Both the 1.4 hadoop-2.6 and 
hadoop-provided artifacts fail to be imported by python, whereas the 1.3.1 
hadoop-2.6 artifact works fine for me on my machine. Script attached.

If this is the case for other, then we're in a spot of trouble, no? 

> Build assembly JAR via ant to avoid zip64 problems
> --
>
> Key: SPARK-7009
> URL: https://issues.apache.org/jira/browse/SPARK-7009
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.0
> Environment: Java 7+
>Reporter: Steve Loughran
> Attachments: check_spark_python.sh
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a 
> format incompatible with Java and pyspark.
> Provided the total number of .class files+resources is <64K, ant can be used 
> to make the final JAR instead, perhaps by unzipping the maven-generated JAR 
> then rezipping it with zip64=never, before publishing the artifact via maven.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems

2015-06-17 Thread Harry Brundage (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589913#comment-14589913
 ] 

Harry Brundage edited comment on SPARK-7009 at 6/17/15 3:22 PM:


[~sowen] I wrote a script to test that we can all run to prove if it's 
importable or not and check my assumptions. Both the 1.4 hadoop-2.6 and 
hadoop-provided artifacts fail to be imported by python, whereas the 1.3.1 
hadoop-2.6 artifact works fine for me on my machine. Script attached.

If this is the case for others, then we're in a spot of trouble, no? 


was (Author: airhorns):
[~sowen] I wrote a script to test that we can all run to prove if it's 
importable or not and check my assumptions. Both the 1.4 hadoop-2.6 and 
hadoop-provided artifacts fail to be imported by python, whereas the 1.3.1 
hadoop-2.6 artifact works fine for me on my machine. Script attached.

If this is the case for other, then we're in a spot of trouble, no? 

> Build assembly JAR via ant to avoid zip64 problems
> --
>
> Key: SPARK-7009
> URL: https://issues.apache.org/jira/browse/SPARK-7009
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.0
> Environment: Java 7+
>Reporter: Steve Loughran
> Attachments: check_spark_python.sh
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a 
> format incompatible with Java and pyspark.
> Provided the total number of .class files+resources is <64K, ant can be used 
> to make the final JAR instead, perhaps by unzipping the maven-generated JAR 
> then rezipping it with zip64=never, before publishing the artifact via maven.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK

2015-06-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6698.
--
Resolution: Won't Fix

Closing per PR

> RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
> --
>
> Key: SPARK-6698
> URL: https://issues.apache.org/jira/browse/SPARK-6698
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Michael Bieniosek
>Priority: Minor
>
> In RandomForest.scala the feature input is persisted with 
> StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging 
> rate is set at 100%.  This forces the RDD to be stored unserialized, which 
> causes major JVM GC headaches if the RDD is sizable.  
> Something similar happens in NodeIdCache.scala though I believe in this case 
> the RDD is smaller.
> A simple fix would be to use the same StorageLevel as the input RDD. 
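
As an illustration of that suggestion only (expressed in PySpark rather than in RandomForest.scala itself; the RDDs below are made up), reusing the caller's storage level could look like this:

{code}
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="storage-level-sketch")
input_rdd = sc.parallelize(range(1000)).persist(StorageLevel.MEMORY_ONLY)

# Stand-in for the bagging step: derive a new RDD from the persisted input.
bagged = input_rdd.map(lambda x: (x, 1.0))

# Reuse whatever level the caller chose instead of hard-coding MEMORY_AND_DISK;
# fall back to MEMORY_AND_DISK only when the input was never persisted.
level = input_rdd.getStorageLevel()
if not (level.useMemory or level.useDisk):
    level = StorageLevel.MEMORY_AND_DISK
bagged.persist(level)
print(bagged.count())
{code}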



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7199) Add date and timestamp support to UnsafeRow

2015-06-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-7199.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 5984
[https://github.com/apache/spark/pull/5984]

> Add date and timestamp support to UnsafeRow
> ---
>
> Key: SPARK-7199
> URL: https://issues.apache.org/jira/browse/SPARK-7199
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Josh Rosen
> Fix For: 1.5.0
>
>
> We should add date and timestamp support to UnsafeRow.  This should be fairly 
> easy, as both data types are fixed-length.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8320) Add example in streaming programming guide that shows union of multiple input streams

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8320:
---

Assignee: Apache Spark

> Add example in streaming programming guide that shows union of multiple input 
> streams
> -
>
> Key: SPARK-8320
> URL: https://issues.apache.org/jira/browse/SPARK-8320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> The section on "Level of Parallelism in Data Receiving" has a Scala and a 
> Java example for union of multiple input streams. A python example should be 
> added.
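
For reference, a minimal Python sketch of such a union example might look like the following (the ZooKeeper address, consumer group and topic are made up, and this is not the content of any pull request linked in this thread):

{code}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="union-of-receivers-sketch")
ssc = StreamingContext(sc, batchDuration=2)

# Create several receivers for the same topic and union them into one DStream.
num_streams = 5
streams = [KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group", {"my-topic": 1})
           for _ in range(num_streams)]
unified = ssc.union(*streams)
unified.pprint()

ssc.start()
ssc.awaitTermination()
{code}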



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8320) Add example in streaming programming guide that shows union of multiple input streams

2015-06-17 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589990#comment-14589990
 ] 

Neelesh Srinivas Salian commented on SPARK-8320:


Added python example:
https://github.com/apache/spark/pull/6860

Thank you.

> Add example in streaming programming guide that shows union of multiple input 
> streams
> -
>
> Key: SPARK-8320
> URL: https://issues.apache.org/jira/browse/SPARK-8320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> The section on "Level of Parallelism in Data Receiving" has a Scala and a 
> Java example for union of multiple input streams. A python example should be 
> added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8320) Add example in streaming programming guide that shows union of multiple input streams

2015-06-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8320:
---

Assignee: (was: Apache Spark)

> Add example in streaming programming guide that shows union of multiple input 
> streams
> -
>
> Key: SPARK-8320
> URL: https://issues.apache.org/jira/browse/SPARK-8320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> The section on "Level of Parallelism in Data Receiving" has a Scala and a 
> Java example for union of multiple input streams. A python example should be 
> added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8320) Add example in streaming programming guide that shows union of multiple input streams

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589991#comment-14589991
 ] 

Apache Spark commented on SPARK-8320:
-

User 'nssalian' has created a pull request for this issue:
https://github.com/apache/spark/pull/6860

> Add example in streaming programming guide that shows union of multiple input 
> streams
> -
>
> Key: SPARK-8320
> URL: https://issues.apache.org/jira/browse/SPARK-8320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> The section on "Level of Parallelism in Data Receiving" has a Scala and a 
> Java example for union of multiple input streams. A python example should be 
> added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error

2015-06-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590010#comment-14590010
 ] 

Yin Huai commented on SPARK-7819:
-

[~mandoskippy] Yeah, that is the official way to work around the problem. Since 
1.4.0 we have introduced isolated class loaders so that Spark SQL can connect 
to different versions of the Hive metastore. Basically, the metastore client 
uses a different class loader than the rest of Spark SQL. I am attaching the 
doc for this conf ({{spark.sql.hive.metastore.sharedPrefixes}}); I hope this is helpful.

{quote}
A comma separated list of class prefixes that should be loaded using the 
classloader that is shared between Spark SQL and a specific version of Hive. An 
example of classes that should be shared is JDBC drivers that are needed to 
talk to the metastore. Other classes that need to be shared are those that 
interact with classes that are already shared.  For example, custom appenders 
that are used by log4j.
{quote}
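
Purely as an illustration (the {{com.mapr}} prefix is an assumption about where the MapR client classes live; adjust the list for your distribution), the conf has to be in place before the HiveContext is created, e.g. from PySpark:

{code}
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Assumed prefixes: the MapR client plus the JDBC drivers that are shared by default.
conf = (SparkConf()
        .setAppName("shared-prefixes-sketch")
        .set("spark.sql.hive.metastore.sharedPrefixes",
             "com.mapr,com.mysql.jdbc,org.postgresql,oracle.jdbc,com.microsoft.sqlserver"))

sc = SparkContext(conf=conf)
# The metastore client now loads the shared-prefix classes from the common classloader.
sqlContext = HiveContext(sc)
{code}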


> Isolated Hive Client Loader appears to cause Native Library 
> libMapRClient.4.0.2-mapr.so already loaded in another classloader error
> ---
>
> Key: SPARK-7819
> URL: https://issues.apache.org/jira/browse/SPARK-7819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Fi
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.4.0
>
> Attachments: invalidClassException.log, stacktrace.txt, test.py
>
>
> In reference to the pull request: https://github.com/apache/spark/pull/5876
> I have been running the Spark 1.3 branch for some time with no major hiccups, 
> and recently switched to the Spark 1.4 branch.
> I build my spark distribution with the following build command:
> {noformat}
> make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive 
> -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver
> {noformat}
> When running a python script containing a series of smoke tests I use to 
> validate the build, I encountered an error under the following conditions:
> * start a spark context
> * start a hive context
> * run any hive query
> * stop the spark context
> * start a second spark context
> * run any hive query
> ** ERROR
> From what I can tell, the Isolated Class Loader is hitting a MapR class that 
> is loading its native library (presumedly as part of a static initializer).
> Unfortunately, the JVM prohibits this the second time around.
> I would think that shutting down the SparkContext would clear out any 
> vestigials of the JVM, so I'm surprised that this would even be a problem.
> Note: all other smoke tests we are running passes fine.
> I will attach the stacktrace and a python script reproducing the issue (at 
> least for my environment and build).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8320) Add example in streaming programming guide that shows union of multiple input streams

2015-06-17 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590040#comment-14590040
 ] 

Neelesh Srinivas Salian commented on SPARK-8320:


Deleted that PR.

Added https://github.com/apache/spark/pull/6861


> Add example in streaming programming guide that shows union of multiple input 
> streams
> -
>
> Key: SPARK-8320
> URL: https://issues.apache.org/jira/browse/SPARK-8320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> The section on "Level of Parallelism in Data Receiving" has a Scala and a 
> Java example for union of multiple input streams. A python example should be 
> added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8368) ClassNotFoundException in closure for map

2015-06-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590072#comment-14590072
 ] 

Yin Huai commented on SPARK-8368:
-

Can you add {{--verbose}} and post the extra information like {{Main class:}} 
and {{Classpath elements:}}? 

[~andrewor14] Have we changed anything related to spark-submit in 1.4.0?

> ClassNotFoundException in closure for map 
> --
>
> Key: SPARK-8368
> URL: https://issues.apache.org/jira/browse/SPARK-8368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the 
> project on Windows 7 and run in a spark standalone cluster(or local) mode on 
> Centos 6.X. 
>Reporter: CHEN Zhiwei
>
> After upgraded the cluster from spark 1.3.0 to 1.4.0(rc4), I encountered the 
> following exception:
> ==begin exception
> {quote}
> Exception in thread "main" java.lang.ClassNotFoundException: 
> com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:278)
>   at 
> org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
>  Source)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:293)
>   at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
>   at com.yhd.ycache.magic.Model$.main(SSExample.scala:239)
>   at com.yhd.ycache.magic.Model.main(SSExample.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}
> ===end exception===
> I simplified the code that causes this issue to the following:
> ==begin code==
> {noformat}
> object Model extends Serializable{
>   def main(args: Array[String]) {
> val Array(sql) = args
> val sparkConf = new SparkConf().setAppName("Mode Example")
> val sc = new SparkContext(sparkConf)
> val hive = new HiveContext(sc)
> //get data by hive sql
> val rows = hive.sql(sql)
> val data = rows.map(r => { 
>   val arr = r.toSeq.toArray
>   val label = 1.0
>   def fmap = ( input: Any ) => 1.0
>   val feature = arr.map(_=>1.0)
>   LabeledPoint(label, Vectors.dense(feature))
> })
> data.count()
>   }
> }
> {noformat}
> =end code===
> This code runs fine in spark-shell, but fails with this error when submitted to a 
> Spark cluster (standalone or local mode). I tried the same code on Spark 
> 1.3.0 (local mode), and no exception is encountered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsub

[jira] [Issue Comment Deleted] (SPARK-8320) Add example in streaming programming guide that shows union of multiple input streams

2015-06-17 Thread Neelesh Srinivas Salian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neelesh Srinivas Salian updated SPARK-8320:
---
Comment: was deleted

(was: Added python example:
https://github.com/apache/spark/pull/6860

Thank you.)

> Add example in streaming programming guide that shows union of multiple input 
> streams
> -
>
> Key: SPARK-8320
> URL: https://issues.apache.org/jira/browse/SPARK-8320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> The section on "Level of Parallelism in Data Receiving" has a Scala and a 
> Java example for union of multiple input streams. A python example should be 
> added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8320) Add example in streaming programming guide that shows union of multiple input streams

2015-06-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590076#comment-14590076
 ] 

Apache Spark commented on SPARK-8320:
-

User 'nssalian' has created a pull request for this issue:
https://github.com/apache/spark/pull/6862

> Add example in streaming programming guide that shows union of multiple input 
> streams
> -
>
> Key: SPARK-8320
> URL: https://issues.apache.org/jira/browse/SPARK-8320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> The section on "Level of Parallelism in Data Receiving" has a Scala and a 
> Java example for union of multiple input streams. A python example should be 
> added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8320) Add example in streaming programming guide that shows union of multiple input streams

2015-06-17 Thread Neelesh Srinivas Salian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neelesh Srinivas Salian updated SPARK-8320:
---
Comment: was deleted

(was: Deleted that PR.

Added https://github.com/apache/spark/pull/6861
)

> Add example in streaming programming guide that shows union of multiple input 
> streams
> -
>
> Key: SPARK-8320
> URL: https://issues.apache.org/jira/browse/SPARK-8320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> The section on "Level of Parallelism in Data Receiving" has a Scala and a 
> Java example for union of multiple input streams. A python example should be 
> added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-17 Thread Arun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590100#comment-14590100
 ] 

Arun commented on SPARK-8409:
-

Thanks Shivaram, I have a few questions I would like clarified.
1.) How can we increase the storage memory of a Spark standalone cluster? By default 
it has only around 234 MB. I am using Spark 1.4 SparkR in a Windows environment. 
What I found suggests editing spark-env.sh, but what exactly should I put in it?
2.) I have a Hortonworks standalone cluster (10.200.202.85:8020); can it be set 
as the master node in the Spark context? If yes, what is the command in a Windows environment?
3.) Or, keeping the Spark cluster as master, can we connect the Hortonworks standalone 
cluster as a worker node? If possible, what is the command in a Windows environment? If 
I am mistaken, kindly excuse me.

TIA,
Arun Gunalan

>  In windows cant able to read .csv or .json files using read.df()
> -
>
> Key: SPARK-8409
> URL: https://issues.apache.org/jira/browse/SPARK-8409
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
> Environment: sparkR API
>Reporter: Arun
>Priority: Critical
>  Labels: build
>
> Hi, 
> In SparkR shell, I invoke: 
> > mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> > header="false") 
> I have tried various filetypes (csv, txt), all fail.   
>  in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, 
> "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
>  source = "csv")
> RESPONSE: "ERROR RBackendHandler: load on 1 failed" 
> BELOW THE WHOLE RESPONSE: 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
> curMem=0, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 173.4 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
> curMem=177600, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.2 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
> 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
> NativeMethodAccessorImpl.java:-2 
> 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded. 
> 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
> java.lang.reflect.InvocationTargetException 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:606) 
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
>  
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>  
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioE

[jira] [Resolved] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-17 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-8409.
--
Resolution: Not A Problem

>  In windows cant able to read .csv or .json files using read.df()
> -
>
> Key: SPARK-8409
> URL: https://issues.apache.org/jira/browse/SPARK-8409
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
> Environment: sparkR API
>Reporter: Arun
>Priority: Critical
>  Labels: build
>
> Hi, 
> In SparkR shell, I invoke: 
> > mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> > header="false") 
> I have tried various filetypes (csv, txt), all fail.   
>  in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, 
> "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
>  source = "csv")
> RESPONSE: "ERROR RBackendHandler: load on 1 failed" 
> BELOW THE WHOLE RESPONSE: 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
> curMem=0, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 173.4 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
> curMem=177600, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.2 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
> 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
> NativeMethodAccessorImpl.java:-2 
> 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded. 
> 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
> java.lang.reflect.InvocationTargetException 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:606) 
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
>  
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>  
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>  
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>  
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json 
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.ja

[jira] [Commented] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590116#comment-14590116
 ] 

Shivaram Venkataraman commented on SPARK-8409:
--

[~b.arunguna...@gmail.com] These are good questions to post to the Spark user 
mailing list or StackOverflow (see http://spark.apache.org/community.html for 
details). I'm closing this issue for now, as JIRA is more suitable for 
development issues and for reporting bugs that need to be fixed in Spark.

>  In windows cant able to read .csv or .json files using read.df()
> -
>
> Key: SPARK-8409
> URL: https://issues.apache.org/jira/browse/SPARK-8409
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
> Environment: sparkR API
>Reporter: Arun
>Priority: Critical
>  Labels: build
>
> Hi, 
> In SparkR shell, I invoke: 
> > mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> > header="false") 
> I have tried various filetypes (csv, txt), all fail.   
>  in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, 
> "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
>  source = "csv")
> RESPONSE: "ERROR RBackendHandler: load on 1 failed" 
> BELOW THE WHOLE RESPONSE: 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
> curMem=0, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 173.4 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
> curMem=177600, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.2 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
> 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
> NativeMethodAccessorImpl.java:-2 
> 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded. 
> 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
> java.lang.reflect.InvocationTargetException 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:606) 
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
>  
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>  
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>  
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(Default

[jira] [Commented] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-17 Thread Arun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590119#comment-14590119
 ] 

Arun commented on SPARK-8409:
-

OK, thanks a lot, Shivaram.

>  In windows cant able to read .csv or .json files using read.df()
> -
>
> Key: SPARK-8409
> URL: https://issues.apache.org/jira/browse/SPARK-8409
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
> Environment: sparkR API
>Reporter: Arun
>Priority: Critical
>  Labels: build
>
> Hi, 
> In SparkR shell, I invoke: 
> > mydf<-read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
> > header="false") 
> I have tried various filetypes (csv, txt), all fail.   
>  in sparkR of spark 1.4 for eg.) df_1<- read.df(sqlContext, 
> "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
>  source = "csv")
> RESPONSE: "ERROR RBackendHandler: load on 1 failed" 
> BELOW THE WHOLE RESPONSE: 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
> curMem=0, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 173.4 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
> curMem=177600, maxMem=278302556 
> 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.2 KB, free 265.2 MB) 
> 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
> 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
> NativeMethodAccessorImpl.java:-2 
> 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded. 
> 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
> java.lang.reflect.InvocationTargetException 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:606) 
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
>  
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>  
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>  
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>  
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>  
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>  
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>  
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json 
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(Fi

[jira] [Created] (SPARK-8412) java#KafkaUtils.createDirectStream Java(Pair)RDDs do not implement HasOffsetRanges

2015-06-17 Thread jweinste (JIRA)
jweinste created SPARK-8412:
---

 Summary: java#KafkaUtils.createDirectStream Java(Pair)RDDs do not 
implement HasOffsetRanges
 Key: SPARK-8412
 URL: https://issues.apache.org/jira/browse/SPARK-8412
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: jweinste
Priority: Critical


// Create direct kafka stream with brokers and topics
final JavaPairInputDStream<String, String> messages =
    KafkaUtils.createDirectStream(jssc, String.class, String.class,
        StringDecoder.class, StringDecoder.class, kafkaParams, topics);

messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
    @Override
    public Void call(final JavaPairRDD<String, String> rdd) throws Exception {
        if (rdd instanceof HasOffsetRanges) {
            // will never happen: the JavaPairRDD wrapper does not implement HasOffsetRanges
        }
        return null;
    }
});


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8365) pyspark does not retain --packages or --jars passed on the command line as of 1.4.0

2015-06-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8365:
-
Priority: Blocker  (was: Major)

> pyspark does not retain --packages or --jars passed on the command line as of 
> 1.4.0
> ---
>
> Key: SPARK-8365
> URL: https://issues.apache.org/jira/browse/SPARK-8365
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Don Drake
>Priority: Blocker
>
> I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing 
> Python Spark application against it and got the following error:
> py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
> : java.lang.RuntimeException: Failed to load class for data source: 
> com.databricks.spark.csv
> I pass the following on the command-line to my spark-submit:
> --packages com.databricks:spark-csv_2.10:1.0.3
> This worked fine on 1.3.1, but not in 1.4.
> I was able to replicate it with the following pyspark:
> {code}
> a = {'a':1.0, 'b':'asdf'}
> rdd = sc.parallelize([a])
> df = sqlContext.createDataFrame(rdd)
> df.save("/tmp/d.csv", "com.databricks.spark.csv")
> {code}
> Even using the new 
> df.write.format('com.databricks.spark.csv').save('/tmp/d.csv') gives the same 
> error. 
> I see it was added in the web UI:
> file:/Users/drake/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar  Added 
> By User
> file:/Users/drake/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar   Added 
> By User
> http://10.0.0.222:56871/jars/com.databricks_spark-csv_2.10-1.0.3.jar  Added 
> By User
> http://10.0.0.222:56871/jars/org.apache.commons_commons-csv-1.1.jar   Added 
> By User
> Thoughts?
> *I also attempted using the Scala spark-shell to load a csv using the same 
> package and it worked just fine, so this seems specific to pyspark.*
> -Don
> Gory details:
> {code}
> $ pyspark --packages "com.databricks:spark-csv_2.10:1.0.3"
> Python 2.7.6 (default, Sep  9 2014, 15:04:36)
> [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Ivy Default Cache set to: /Users/drake/.ivy2/cache
> The jars for the packages stored in: /Users/drake/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/Users/drake/spark/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> com.databricks#spark-csv_2.10 added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found com.databricks#spark-csv_2.10;1.0.3 in central
>   found org.apache.commons#commons-csv;1.1 in central
> :: resolution report :: resolve 590ms :: artifacts dl 17ms
>   :: modules in use:
>   com.databricks#spark-csv_2.10;1.0.3 from central in [default]
>   org.apache.commons#commons-csv;1.1 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   2   |   0   |   0   |   0   ||   2   |   0   |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   0 artifacts copied, 2 already retrieved (0kB/15ms)
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/06/13 11:06:08 INFO SparkContext: Running Spark version 1.4.0
> 2015-06-13 11:06:08.921 java[19233:2145789] Unable to load realm info from 
> SCDynamicStore
> 15/06/13 11:06:09 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/06/13 11:06:09 WARN Utils: Your hostname, Dons-MacBook-Pro-2.local 
> resolves to a loopback address: 127.0.0.1; using 10.0.0.222 instead (on 
> interface en0)
> 15/06/13 11:06:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 15/06/13 11:06:09 INFO SecurityManager: Changing view acls to: drake
> 15/06/13 11:06:09 INFO SecurityManager: Changing modify acls to: drake
> 15/06/13 11:06:09 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(drake); users 
> with modify permissions: Set(drake)
> 15/06/13 11:06:10 INFO Slf4jLogger: Slf4jLogger started
> 15/06/13 11:06:10 INFO Remoting: Starting remoting
> 15/06/13 11:06:10 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriver@10.0.0.222:56870]
> 15/06/13 11:06:10 INFO Utils: Successfully sta

[jira] [Updated] (SPARK-8365) pyspark does not retain --packages or --jars passed on the command line as of 1.4.0

2015-06-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8365:
-
Target Version/s: 1.4.1, 1.5.0

> pyspark does not retain --packages or --jars passed on the command line as of 
> 1.4.0
> ---
>
> Key: SPARK-8365
> URL: https://issues.apache.org/jira/browse/SPARK-8365
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Don Drake
>Priority: Blocker
>
> I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing 
> Python Spark application against it and got the following error:
> py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
> : java.lang.RuntimeException: Failed to load class for data source: 
> com.databricks.spark.csv
> I pass the following on the command-line to my spark-submit:
> --packages com.databricks:spark-csv_2.10:1.0.3
> This worked fine on 1.3.1, but not in 1.4.
> I was able to replicate it with the following pyspark:
> {code}
> a = {'a':1.0, 'b':'asdf'}
> rdd = sc.parallelize([a])
> df = sqlContext.createDataFrame(rdd)
> df.save("/tmp/d.csv", "com.databricks.spark.csv")
> {code}
> Even using the new 
> df.write.format('com.databricks.spark.csv').save('/tmp/d.csv') gives the same 
> error. 
> I see it was added in the web UI:
> file:/Users/drake/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar  Added 
> By User
> file:/Users/drake/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar   Added 
> By User
> http://10.0.0.222:56871/jars/com.databricks_spark-csv_2.10-1.0.3.jar  Added 
> By User
> http://10.0.0.222:56871/jars/org.apache.commons_commons-csv-1.1.jar   Added 
> By User
> Thoughts?
> *I also attempted using the Scala spark-shell to load a csv using the same 
> package and it worked just fine, so this seems specific to pyspark.*
> -Don
> Gory details:
> {code}
> $ pyspark --packages "com.databricks:spark-csv_2.10:1.0.3"
> Python 2.7.6 (default, Sep  9 2014, 15:04:36)
> [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Ivy Default Cache set to: /Users/drake/.ivy2/cache
> The jars for the packages stored in: /Users/drake/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/Users/drake/spark/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> com.databricks#spark-csv_2.10 added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found com.databricks#spark-csv_2.10;1.0.3 in central
>   found org.apache.commons#commons-csv;1.1 in central
> :: resolution report :: resolve 590ms :: artifacts dl 17ms
>   :: modules in use:
>   com.databricks#spark-csv_2.10;1.0.3 from central in [default]
>   org.apache.commons#commons-csv;1.1 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   2   |   0   |   0   |   0   ||   2   |   0   |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   0 artifacts copied, 2 already retrieved (0kB/15ms)
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/06/13 11:06:08 INFO SparkContext: Running Spark version 1.4.0
> 2015-06-13 11:06:08.921 java[19233:2145789] Unable to load realm info from 
> SCDynamicStore
> 15/06/13 11:06:09 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/06/13 11:06:09 WARN Utils: Your hostname, Dons-MacBook-Pro-2.local 
> resolves to a loopback address: 127.0.0.1; using 10.0.0.222 instead (on 
> interface en0)
> 15/06/13 11:06:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 15/06/13 11:06:09 INFO SecurityManager: Changing view acls to: drake
> 15/06/13 11:06:09 INFO SecurityManager: Changing modify acls to: drake
> 15/06/13 11:06:09 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(drake); users 
> with modify permissions: Set(drake)
> 15/06/13 11:06:10 INFO Slf4jLogger: Slf4jLogger started
> 15/06/13 11:06:10 INFO Remoting: Starting remoting
> 15/06/13 11:06:10 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriver@10.0.0.222:56870]
> 15/06/13 11:06:10 INFO Utils: Successfully star

[jira] [Commented] (SPARK-8365) pyspark does not retain --packages or --jars passed on the command line as of 1.4.0

2015-06-17 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590150#comment-14590150
 ] 

Andrew Or commented on SPARK-8365:
--

Bumping to blocker because this seems like a regression.

> pyspark does not retain --packages or --jars passed on the command line as of 
> 1.4.0
> ---
>
> Key: SPARK-8365
> URL: https://issues.apache.org/jira/browse/SPARK-8365
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Don Drake
>Priority: Blocker
>
> I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing 
> Python Spark application against it and got the following error:
> py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
> : java.lang.RuntimeException: Failed to load class for data source: 
> com.databricks.spark.csv
> I pass the following on the command-line to my spark-submit:
> --packages com.databricks:spark-csv_2.10:1.0.3
> This worked fine on 1.3.1, but not in 1.4.
> I was able to replicate it with the following pyspark:
> {code}
> a = {'a':1.0, 'b':'asdf'}
> rdd = sc.parallelize([a])
> df = sqlContext.createDataFrame(rdd)
> df.save("/tmp/d.csv", "com.databricks.spark.csv")
> {code}
> Even using the new 
> df.write.format('com.databricks.spark.csv').save('/tmp/d.csv') gives the same 
> error. 
> I see it was added in the web UI:
> file:/Users/drake/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar  Added 
> By User
> file:/Users/drake/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar   Added 
> By User
> http://10.0.0.222:56871/jars/com.databricks_spark-csv_2.10-1.0.3.jar  Added 
> By User
> http://10.0.0.222:56871/jars/org.apache.commons_commons-csv-1.1.jar   Added 
> By User
> Thoughts?
> *I also attempted using the Scala spark-shell to load a csv using the same 
> package and it worked just fine, so this seems specific to pyspark.*
> -Don
> Gory details:
> {code}
> $ pyspark --packages "com.databricks:spark-csv_2.10:1.0.3"
> Python 2.7.6 (default, Sep  9 2014, 15:04:36)
> [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Ivy Default Cache set to: /Users/drake/.ivy2/cache
> The jars for the packages stored in: /Users/drake/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/Users/drake/spark/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> com.databricks#spark-csv_2.10 added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found com.databricks#spark-csv_2.10;1.0.3 in central
>   found org.apache.commons#commons-csv;1.1 in central
> :: resolution report :: resolve 590ms :: artifacts dl 17ms
>   :: modules in use:
>   com.databricks#spark-csv_2.10;1.0.3 from central in [default]
>   org.apache.commons#commons-csv;1.1 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   2   |   0   |   0   |   0   ||   2   |   0   |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   0 artifacts copied, 2 already retrieved (0kB/15ms)
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/06/13 11:06:08 INFO SparkContext: Running Spark version 1.4.0
> 2015-06-13 11:06:08.921 java[19233:2145789] Unable to load realm info from 
> SCDynamicStore
> 15/06/13 11:06:09 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/06/13 11:06:09 WARN Utils: Your hostname, Dons-MacBook-Pro-2.local 
> resolves to a loopback address: 127.0.0.1; using 10.0.0.222 instead (on 
> interface en0)
> 15/06/13 11:06:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 15/06/13 11:06:09 INFO SecurityManager: Changing view acls to: drake
> 15/06/13 11:06:09 INFO SecurityManager: Changing modify acls to: drake
> 15/06/13 11:06:09 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(drake); users 
> with modify permissions: Set(drake)
> 15/06/13 11:06:10 INFO Slf4jLogger: Slf4jLogger started
> 15/06/13 11:06:10 INFO Remoting: Starting remoting
> 15/06/13 11:06:10 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp:/
