[jira] [Commented] (SPARK-5456) Decimal Type comparison issue

2015-03-27 Thread Karthik G (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383392#comment-14383392
 ] 

Karthik G commented on SPARK-5456:
--

This is a blocker when using Spark with databases that have Decimal / BigDecimal 
columns. Is there a workaround?
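
One possible workaround sketch (untested, and only an assumption about the failure mode: that the ClassCastException is limited to the BigDecimal-to-Decimal conversion applySchema performs for DecimalType columns, and that double precision is acceptable). It builds the table with DoubleType columns so that conversion path is never hit; sc and sqlContext are assumed to be in scope, as in the reproducer below, using the Spark 1.3 Scala API.
{code}
// Hedged workaround sketch, not a confirmed fix: use DoubleType columns instead of
// DecimalType so the buggy BigDecimal -> Decimal conversion is never exercised.
// Precision is lost, so this only helps when doubles are good enough.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rows = sc.parallelize(Seq(Row(1.0, 2.0), Row(3.0, 4.0)))
val schema = StructType(Seq(
  StructField("a", DoubleType, nullable = true),
  StructField("b", DoubleType, nullable = true)))
sqlContext.applySchema(rows, schema).registerTempTable("foo")
sqlContext.sql("select * from foo where a > 0").collectAsList()
{code}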

 Decimal Type comparison issue
 -

 Key: SPARK-5456
 URL: https://issues.apache.org/jira/browse/SPARK-5456
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Kuldeep

 Not quite able to figure this out but here is a junit test to reproduce this, 
 in JavaAPISuite.java
 {code:title=DecimalBug.java}
   @Test
   public void decimalQueryTest() {
 List<Row> decimalTable = new ArrayList<Row>();
 decimalTable.add(RowFactory.create(new BigDecimal(1), new BigDecimal(2)));
 decimalTable.add(RowFactory.create(new BigDecimal(3), new BigDecimal(4)));
 JavaRDD<Row> rows = sc.parallelize(decimalTable);
 List<StructField> fields = new ArrayList<StructField>(7);
 fields.add(DataTypes.createStructField("a", DataTypes.createDecimalType(), true));
 fields.add(DataTypes.createStructField("b", DataTypes.createDecimalType(), true));
 sqlContext.applySchema(rows.rdd(), DataTypes.createStructType(fields)).registerTempTable("foo");
 Assert.assertEquals(sqlContext.sql("select * from foo where a > 0").collectAsList(), decimalTable);
   }
 {code}
 Fails with
 java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
 org.apache.spark.sql.types.Decimal



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6119) better support for working with missing data

2015-03-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6119:
---
Description: 
Real world data can be messy. An important feature of data frames is support 
for missing data. We should figure out what we want to do here.

Some ideas:

1. Support replacing all null value for a column (or all columns) with a fixed 
value.

2. Support replacing a set of values with another set of values.

3. interpolate


  was:
Real world data can be messy. An important feature of data frames is support 
for missing data. We should figure out what we want to do here.

Some ideas:

1. Support replacing all null value for a column with a fixed value.

2. Support replacing all null value for all columns with a fixed value.



 better support for working with missing data
 

 Key: SPARK-6119
 URL: https://issues.apache.org/jira/browse/SPARK-6119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame

 Real world data can be messy. An important feature of data frames is support 
 for missing data. We should figure out what we want to do here.
 Some ideas:
 1. Support replacing all null value for a column (or all columns) with a 
 fixed value.
 2. Support replacing a set of values with another set of values.
 3. interpolate



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6406) Launcher backward compatibility issues

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6406:
---

Assignee: (was: Apache Spark)

 Launcher backward compatibility issues
 --

 Key: SPARK-6406
 URL: https://issues.apache.org/jira/browse/SPARK-6406
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Nishkam Ravi
Priority: Minor

 The new launcher library breaks backward compatibility. The "hadoop" string in 
 the Spark assembly should not be mandatory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6561) Add partition support in saveAsParquet

2015-03-27 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-6561:


 Summary: Add partition support in saveAsParquet
 Key: SPARK-6561
 URL: https://issues.apache.org/jira/browse/SPARK-6561
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang


Now ParquetRelation2 supports automatic partition discovery which is very nice. 

When we save a DataFrame into Parquet files, we also want to have it 
partitioned.

The proposed API looks like this:

{code}
def saveAsParquet(path: String, partitionColumns: Seq[String])
{code}

Jianshi
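
Until such an API exists, here is a manual sketch of the same idea (assuming Spark 1.3; the table name "events", column "country", and output path are illustrative): write one Parquet directory per partition value, in the key=value layout that the automatic partition discovery already understands.
{code}
// Hedged sketch of a manual workaround, not the proposed API: emit one Parquet
// directory per value of the partition column, using the key=value directory
// layout that ParquetRelation2's partition discovery picks up.
val df = sqlContext.table("events")
val countries = df.select("country").distinct().collect().map(_.getString(0))
countries.foreach { c =>
  df.filter(df("country") === c)
    .saveAsParquetFile(s"/data/events/country=$c")
}
{code}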



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6119) better support for working with missing data

2015-03-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6119:
---
Labels: DataFrame  (was: )

 better support for working with missing data
 

 Key: SPARK-6119
 URL: https://issues.apache.org/jira/browse/SPARK-6119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame

 Real world data can be messy. An important feature of data frames is support 
 for missing data. We should figure out what we want to do here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6119) better support for working with missing data

2015-03-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6119:
---
Summary: better support for working with missing data  (was: missing data 
support)

 better support for working with missing data
 

 Key: SPARK-6119
 URL: https://issues.apache.org/jira/browse/SPARK-6119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame

 Real world data can be messy. An important feature of data frames is support 
 for missing data. We should figure out what we want to do here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6119) better support for working with missing data

2015-03-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6119:
---
Description: 
Real world data can be messy. An important feature of data frames is support 
for missing data. We should figure out what we want to do here.

Some ideas:

1. Support replacing all null value for a column with a fixed value.

2. Support replacing all null value for all columns with a fixed value.


  was:
Real world data can be messy. An important feature of data frames is support 
for missing data. We should figure out what we want to do here.




 better support for working with missing data
 

 Key: SPARK-6119
 URL: https://issues.apache.org/jira/browse/SPARK-6119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame

 Real world data can be messy. An important feature of data frames is support 
 for missing data. We should figure out what we want to do here.
 Some ideas:
 1. Support replacing all null value for a column with a fixed value.
 2. Support replacing all null value for all columns with a fixed value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6561) Add partition support in saveAsParquet

2015-03-27 Thread Jianshi Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianshi Huang updated SPARK-6561:
-
Description: 
Now ParquetRelation2 supports automatic partition discovery which is very nice. 

When we save a DataFrame into Parquet files, we also want to have it 
partitioned.

The proposed API looks like this:

{code}
def saveAsParquetFile(path: String, partitionColumns: Seq[String])
{code}

Jianshi

  was:
Now ParquetRelation2 supports automatic partition discovery which is very nice. 

When we save a DataFrame into Parquet files, we also want to have it 
partitioned.

The proposed API looks like this:

{code}
def saveAsParquet(path: String, partitionColumns: Seq[String])
{code}

Jianshi


 Add partition support in saveAsParquet
 --

 Key: SPARK-6561
 URL: https://issues.apache.org/jira/browse/SPARK-6561
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang

 Now ParquetRelation2 supports automatic partition discovery which is very 
 nice. 
 When we save a DataFrame into Parquet files, we also want to have it 
 partitioned.
 The proposed API looks like this:
 {code}
 def saveAsParquetFile(path: String, partitionColumns: Seq[String])
 {code}
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6561) Add partition support in saveAsParquet

2015-03-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383413#comment-14383413
 ] 

Patrick Wendell commented on SPARK-6561:


FYI - I just removed 'Affects Versions' since that field is only for bugs (to 
indicate which version has the bug).

 Add partition support in saveAsParquet
 --

 Key: SPARK-6561
 URL: https://issues.apache.org/jira/browse/SPARK-6561
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jianshi Huang

 Now ParquetRelation2 supports automatic partition discovery which is very 
 nice. 
 When we save a DataFrame into Parquet files, we also want to have it 
 partitioned.
 The proposed API looks like this:
 {code}
 def saveAsParquetFile(path: String, partitionColumns: Seq[String])
 {code}
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6561) Add partition support in saveAsParquet

2015-03-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6561:
---
Affects Version/s: (was: 1.3.1)
   (was: 1.3.0)

 Add partition support in saveAsParquet
 --

 Key: SPARK-6561
 URL: https://issues.apache.org/jira/browse/SPARK-6561
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jianshi Huang

 Now ParquetRelation2 supports automatic partition discovery which is very 
 nice. 
 When we save a DataFrame into Parquet files, we also want to have it 
 partitioned.
 The proposed API looks like this:
 {code}
 def saveAsParquetFile(path: String, partitionColumns: Seq[String])
 {code}
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6119) better support for working with missing data

2015-03-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6119:
---
Description: 
Real world data can be messy. An important feature of data frames is support 
for missing data. We should figure out what we want to do here.

Some ideas:

1. Support replacing all null value for a column (or all columns) with a fixed 
value.

2. Support dropping rows with null values (dropna).

3. Support replacing a set of values with another set of values (i.e. map join)



  was:
Real world data can be messy. An important feature of data frames is support 
for missing data. We should figure out what we want to do here.

Some ideas:

1. Support replacing all null value for a column (or all columns) with a fixed 
value.

2. Support replacing a set of values with another set of values.

3. interpolate



 better support for working with missing data
 

 Key: SPARK-6119
 URL: https://issues.apache.org/jira/browse/SPARK-6119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame

 Real world data can be messy. An important feature of data frames is support 
 for missing data. We should figure out what we want to do here.
 Some ideas:
 1. Support replacing all null value for a column (or all columns) with a 
 fixed value.
 2. Support dropping rows with null values (dropna).
 3. Support replacing a set of values with another set of values (i.e. map 
 join)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6406) Launcher backward compatibility issues

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6406:
---

Assignee: Apache Spark

 Launcher backward compatibility issues
 --

 Key: SPARK-6406
 URL: https://issues.apache.org/jira/browse/SPARK-6406
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Nishkam Ravi
Assignee: Apache Spark
Priority: Minor

 The new launcher library breaks backward compatibility. The "hadoop" string in 
 the Spark assembly should not be mandatory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6341) Upgrade breeze from 0.11.1 to 0.11.2 or later

2015-03-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6341.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 5222
[https://github.com/apache/spark/pull/5222]

 Upgrade breeze from 0.11.1 to 0.11.2 or later
 -

 Key: SPARK-6341
 URL: https://issues.apache.org/jira/browse/SPARK-6341
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.3.0
Reporter: Yu Ishikawa
Priority: Minor
 Fix For: 1.3.1, 1.4.0


 There is a bug when dividing a Breeze sparse vector that has zero values 
 by a scalar value. However, this bug is on Breeze's side. I heard that once 
 David fixes it and publishes it to Maven, we can upgrade to Breeze 0.11.2 or 
 later.
 - [Apache Spark Developers List: Is there any bugs to divide a Breeze sparse 
 vector at Spark 
 v1.3.0-rc3](http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-td11056.html)
 - [Is there any bugs to divide a sparse vector with `:/` at v0.11.1? · Issue 
 #382 · 
 scalanlp/breeze](https://github.com/scalanlp/breeze/issues/382#issuecomment-80896698)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6341) Upgrade breeze from 0.11.1 to 0.11.2 or later

2015-03-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6341:
-
Assignee: Yu Ishikawa

 Upgrade breeze from 0.11.1 to 0.11.2 or later
 -

 Key: SPARK-6341
 URL: https://issues.apache.org/jira/browse/SPARK-6341
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.3.0
Reporter: Yu Ishikawa
Assignee: Yu Ishikawa
Priority: Minor
 Fix For: 1.3.1, 1.4.0


 There is a bug when dividing a Breeze sparse vector that has zero values 
 by a scalar value. However, this bug is on Breeze's side. I heard that once 
 David fixes it and publishes it to Maven, we can upgrade to Breeze 0.11.2 or 
 later.
 - [Apache Spark Developers List: Is there any bugs to divide a Breeze sparse 
 vector at Spark 
 v1.3.0-rc3](http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-td11056.html)
 - [Is there any bugs to divide a sparse vector with `:/` at v0.11.1? · Issue 
 #382 · 
 scalanlp/breeze](https://github.com/scalanlp/breeze/issues/382#issuecomment-80896698)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6443) Could not submit app in standalone cluster mode when HA is enabled

2015-03-27 Thread Tao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Wang updated SPARK-6443:

Priority: Critical  (was: Major)

 Could not submit app in standalone cluster mode when HA is enabled
 --

 Key: SPARK-6443
 URL: https://issues.apache.org/jira/browse/SPARK-6443
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Reporter: Tao Wang
Priority: Critical

 After digging through some code, I found that users cannot submit an app in standalone 
 cluster mode when HA is enabled. But in client mode it works.
 I haven't tried it yet, but I will verify this and file a PR to resolve it if the 
 problem exists.
 3/23 update:
 I started a HA cluster with zk, and tried to submit SparkPi example with 
 command:
 ./spark-submit  --class org.apache.spark.examples.SparkPi --master 
 spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
 ../lib/spark-examples-1.2.0-hadoop2.4.0.jar 
 and it failed with error message:
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
 spark://doggie153:7077,doggie159:7077
 akka.actor.ActorInitializationException: exception during creation
 at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
 at akka.actor.ActorCell.create(ActorCell.scala:596)
 at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
 at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
 at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: org.apache.spark.SparkException: Invalid master URL: 
 spark://doggie153:7077,doggie159:7077
 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
 at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
 at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
 at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
 at akka.actor.ActorCell.create(ActorCell.scala:580)
 ... 9 more
 But in client mode it ended with the correct result, so my guess is right. I will 
 fix it in the related PR.
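
For context, a minimal sketch of the direction the report hints at (my assumption, not the actual PR): the stack trace shows Master.toAkkaUrl receiving the whole comma-separated HA master list as a single URL, so the cluster-mode client would need to expand it into one spark:// URL per master first, as the client-mode path already tolerates.
{code}
// Hedged sketch only, not the actual fix: expand a comma-separated HA master
// list into individual spark:// URLs before converting each one to an Akka URL.
def expandMasterUrls(master: String): Seq[String] = {
  require(master.startsWith("spark://"), s"Invalid master URL: $master")
  master.stripPrefix("spark://").split(",").map("spark://" + _).toSeq
}

// expandMasterUrls("spark://doggie153:7077,doggie159:7077")
//   == Seq("spark://doggie153:7077", "spark://doggie159:7077")
{code}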



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6567) Large linear model parallelism via a join and reduceByKey

2015-03-27 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-6567:
-

 Summary: Large linear model parallelism via a join and reduceByKey
 Key: SPARK-6567
 URL: https://issues.apache.org/jira/browse/SPARK-6567
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Reza Zadeh


To train a linear model, each training point in the training set needs its dot 
product computed against the model, per iteration. If the model is large (too 
large to fit in memory on a single machine), then SPARK-4590 proposes using a 
parameter server.

There is an easier way to achieve this without parameter servers. In 
particular, if the data is held as a BlockMatrix and the model as an RDD, then 
each block can be joined with the relevant part of the model, followed by a 
reduceByKey to compute the dot products.

This obviates the need for a parameter server, at least for linear models. 
However, it's unclear how it compares performance-wise to parameter servers.
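
A minimal sketch of the join-and-reduceByKey idea described above, using plain pair RDDs rather than BlockMatrix; the blocking scheme, key types, and all names are illustrative assumptions, not the proposed implementation.
{code}
// Hedged sketch: data is blocked by feature range. Each data block carries
// (rowId, slice of the feature vector) and the model RDD carries the matching
// slice of the weight vector. Joining on the block id and summing per row with
// reduceByKey yields the full dot products without a parameter server.
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on older Spark versions)
import org.apache.spark.rdd.RDD

def dotProducts(
    dataBlocks: RDD[(Int, (Long, Array[Double]))],  // (blockId, (rowId, featureSlice))
    modelBlocks: RDD[(Int, Array[Double])]          // (blockId, weightSlice)
  ): RDD[(Long, Double)] = {
  dataBlocks.join(modelBlocks).map { case (_, ((rowId, xs), ws)) =>
    (rowId, xs.zip(ws).map { case (x, w) => x * w }.sum)
  }.reduceByKey(_ + _)  // sum the per-block partial dot products for each row
}
{code}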



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6568) spark-shell.cmd --jars option does not accept the jar that has space in its path

2015-03-27 Thread Masayoshi TSUZUKI (JIRA)
Masayoshi TSUZUKI created SPARK-6568:


 Summary: spark-shell.cmd --jars option does not accept the jar 
that has space in its path
 Key: SPARK-6568
 URL: https://issues.apache.org/jira/browse/SPARK-6568
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Windows
Affects Versions: 1.3.0
 Environment: Windows 8.1
Reporter: Masayoshi TSUZUKI


spark-shell.cmd --jars option does not accept a jar that has a space in its 
path.
On Windows, the path of a jar sometimes contains spaces.

{code}
bin\spark-shell.cmd --jars "C:\Program Files\some\jar1.jar"
{code}
This fails with
{code}
Exception in thread "main" java.net.URISyntaxException: Illegal character in 
path at index 10: C:/Program Files/some/jar1.jar
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-03-27 Thread Platon Potapov (JIRA)
Platon Potapov created SPARK-6569:
-

 Summary: Kafka directInputStream logs what appear to be incorrect 
warnings
 Key: SPARK-6569
 URL: https://issues.apache.org/jira/browse/SPARK-6569
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
 Environment: Spark 1.3.0
Reporter: Platon Potapov
Priority: Minor


During what appears to be normal operation of streaming from a Kafka topic, the 
following log records are observed, logged periodically:

[Stage 391:==  (3 + 0) / 4]
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0

* the ${part.fromOffset} placeholder is not correctly substituted with a value (see the sketch below)
* does this condition really warrant a warning being logged?
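
On the first point, a hedged guess at the cause (the actual KafkaRDD source may differ): in Scala, ${...} is only substituted inside an s-interpolated string, so a log message written without the s prefix prints the placeholder verbatim. The names below are illustrative, not the real code.
{code}
// Illustration only; not the actual KafkaRDD source.
case class Part(fromOffset: Long)
val part = Part(42L)

// Without the s prefix the placeholder is printed literally:
println("Beginning offset ${part.fromOffset} is the same as ending offset")
// With the s prefix the value is substituted (prints 42):
println(s"Beginning offset ${part.fromOffset} is the same as ending offset")
{code}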




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5111:
---

Assignee: (was: Apache Spark)

 HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
 ---

 Key: SPARK-5111
 URL: https://issues.apache.org/jira/browse/SPARK-5111
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhan Zhang

 This fails due to a java.lang.NoSuchFieldError: SASL_PROPS error. We need to backport some 
 hive-0.14 fixes into Spark, since there is no effort to upgrade the Hive support 
 in Spark to 0.14.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5111:
---

Assignee: Apache Spark

 HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
 ---

 Key: SPARK-5111
 URL: https://issues.apache.org/jira/browse/SPARK-5111
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhan Zhang
Assignee: Apache Spark

 This fails due to a java.lang.NoSuchFieldError: SASL_PROPS error. We need to backport some 
 hive-0.14 fixes into Spark, since there is no effort to upgrade the Hive support 
 in Spark to 0.14.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6566) Update Spark to use the latest version of Parquet libraries

2015-03-27 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383627#comment-14383627
 ] 

Cheng Lian commented on SPARK-6566:
---

Hi [~k.shaposhni...@gmail.com], as described in SPARK-5463, we do want to 
upgrade Parquet. However, we currently have two concerns:
# The most recent Parquet RC release introduces subtle API incompatibilities 
related to filter push-down and Parquet metadata gathering, which I believe 
requires more work than the patch you provided if we want everything to work 
perfectly with the best performance.
# We'd like to wait for the official release of Parquet 1.6.0. This is the 
first release of Parquet as an Apache top-level project, so it is taking more time 
than usual.
We will probably first try to upgrade to the most recent 1.6.0 RC release in 
Spark master, and then switch to the official 1.6.0 release in Spark 1.4.0 (and 
Spark 1.3.2 if there is one).

 Update Spark to use the latest version of Parquet libraries
 ---

 Key: SPARK-6566
 URL: https://issues.apache.org/jira/browse/SPARK-6566
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov

 There are a lot of bug fixes in the latest version of parquet (1.6.0rc7). 
 E.g. PARQUET-136
 It would be good to update Spark to use the latest parquet version.
 The following changes are required:
 {code}
 diff --git a/pom.xml b/pom.xml
 index 5ad39a9..095b519 100644
 --- a/pom.xml
 +++ b/pom.xml
 @@ -132,7 +132,7 @@
  <!-- Version used for internal directory structure -->
  <hive.version.short>0.13.1</hive.version.short>
  <derby.version>10.10.1.1</derby.version>
 -<parquet.version>1.6.0rc3</parquet.version>
 +<parquet.version>1.6.0rc7</parquet.version>
  <jblas.version>1.2.3</jblas.version>
  <jetty.version>8.1.14.v20131031</jetty.version>
  <orbit.version>3.0.0.v201112011016</orbit.version>
 {code}
 and
 {code}
 --- a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 +++ b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 @@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
      globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
        mergedMetadata, globalMetaData.getCreatedBy)
 
 -    val readContext = getReadSupport(configuration).init(
 +    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
        new InitContext(configuration,
          globalMetaData.getKeyValueMetaData,
          globalMetaData.getSchema))
 {code}
 I am happy to prepare a pull request if necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-03-27 Thread Platon Potapov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Platon Potapov updated SPARK-6569:
--
Description: 
During what appears to be normal operation of streaming from a Kafka topic, the 
following log records are observed, logged periodically:

{code}
[Stage 391:==  (3 + 0) / 4]
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
{code}

* the part.fromOffset placeholder is not correctly substituted to a value
* is the condition really mandates a warning logged?


  was:
During what appears to be normal operation of streaming from a Kafka topic, the 
following log records are observed, logged periodically:

[Stage 391:==  (3 + 0) / 4]
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0

* the ${part.fromOffset} is not correctly substituted to a value
* is the condition really mandates a warning logged?



 Kafka directInputStream logs what appear to be incorrect warnings
 -

 Key: SPARK-6569
 URL: https://issues.apache.org/jira/browse/SPARK-6569
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
 Environment: Spark 1.3.0
Reporter: Platon Potapov
Priority: Minor

 During what appears to be normal operation of streaming from a Kafka topic, 
 the following log records are observed, logged periodically:
 {code}
 [Stage 391:==  (3 + 0) / 
 4]
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 {code}
 * the part.fromOffset placeholder is not correctly substituted to a value
 * is the condition really mandates a warning logged?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-03-27 Thread Platon Potapov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Platon Potapov updated SPARK-6569:
--
Description: 
During what appears to be normal operation of streaming from a Kafka topic, the 
following log records are observed, logged periodically:

{code}
[Stage 391:==  (3 + 0) / 4]
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
{code}

* the part.fromOffset placeholder is not correctly substituted to a value
* is the condition really mandates a warning being logged?


  was:
During what appears to be normal operation of streaming from a Kafka topic, the 
following log records are observed, logged periodically:

{code}
[Stage 391:==  (3 + 0) / 4]
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
same as ending offset skipping raw 0
{code}

* the part.fromOffset placeholder is not correctly substituted to a value
* is the condition really mandates a warning logged?



 Kafka directInputStream logs what appear to be incorrect warnings
 -

 Key: SPARK-6569
 URL: https://issues.apache.org/jira/browse/SPARK-6569
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
 Environment: Spark 1.3.0
Reporter: Platon Potapov
Priority: Minor

 During what appears to be normal operation of streaming from a Kafka topic, 
 the following log records are observed, logged periodically:
 {code}
 [Stage 391:==  (3 + 0) / 
 4]
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 {code}
 * the part.fromOffset placeholder is not correctly substituted to a value
 * is the condition really mandates a warning being logged?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6548) Adding stddev to DataFrame functions

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6548:
---

Assignee: (was: Apache Spark)

 Adding stddev to DataFrame functions
 

 Key: SPARK-6548
 URL: https://issues.apache.org/jira/browse/SPARK-6548
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame, starter
 Fix For: 1.4.0


 Add it to the list of aggregate functions:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 Also add it to 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
 We can either add a Stddev Catalyst expression, or just compute it using 
 existing functions like here: 
 https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776
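
As a rough, hedged illustration of the second option (computing it from existing aggregate functions rather than adding a Stddev Catalyst expression): a population standard deviation is sqrt(E[x^2] - E[x]^2). The DataFrame df and the column name "value" are illustrative assumptions.
{code}
// Hedged sketch, assuming an existing DataFrame df with a numeric column "value":
// population stddev via sqrt(avg(x * x) - avg(x) * avg(x)), using only aggregate
// functions already present in org.apache.spark.sql.functions.
import org.apache.spark.sql.functions.{avg, sqrt}

val x = df("value")
val stddevPop = df.agg(sqrt(avg(x * x) - avg(x) * avg(x)).as("stddev_pop"))
{code}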



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6548) Adding stddev to DataFrame functions

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6548:
---

Assignee: Apache Spark

 Adding stddev to DataFrame functions
 

 Key: SPARK-6548
 URL: https://issues.apache.org/jira/browse/SPARK-6548
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark
  Labels: DataFrame, starter
 Fix For: 1.4.0


 Add it to the list of aggregate functions:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 Also add it to 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
 We can either add a Stddev Catalyst expression, or just compute it using 
 existing functions like here: 
 https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6548) Adding stddev to DataFrame functions

2015-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383635#comment-14383635
 ] 

Apache Spark commented on SPARK-6548:
-

User 'dreamquster' has created a pull request for this issue:
https://github.com/apache/spark/pull/5228

 Adding stddev to DataFrame functions
 

 Key: SPARK-6548
 URL: https://issues.apache.org/jira/browse/SPARK-6548
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame, starter
 Fix For: 1.4.0


 Add it to the list of aggregate functions:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 Also add it to 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
 We can either add a Stddev Catalyst expression, or just compute it using 
 existing functions like here: 
 https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-03-27 Thread sdfox (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383650#comment-14383650
 ] 

sdfox commented on SPARK-6489:
--

I am interested in this.

 Optimize lateral view with explode to not read unnecessary columns
 --

 Key: SPARK-6489
 URL: https://issues.apache.org/jira/browse/SPARK-6489
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov
  Labels: starter

 Currently a query with lateral view explode(...) results in an execution 
 plan that reads all columns of the underlying RDD.
 E.g. given that the *ppl* table is a DataFrame created from the Person case class:
 {code}
 case class Person(val name: String, val age: Int, val data: Array[Int])
 {code}
 the following SQL:
 {code}
 select name, sum(d) from ppl lateral view explode(data) d as d group by name
 {code}
 executes as follows:
 {noformat}
 == Physical Plan ==
 Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L]
  Exchange (HashPartitioning [name#0], 200)
   Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS 
 PartialSum#38L]
Project [name#0,d#21]
 Generate explode(data#2), true, false
  InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation 
 [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), 
 (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at 
 ExistingRDD.scala:35), Some(ppl))
 {noformat}
 Note that *age* column is not needed to produce the output but it is still 
 read from the underlying RDD.
 A sample program to demonstrate the issue:
 {code}
 case class Person(val name: String, val age: Int, val data: Array[Int])
 object ExplodeDemo extends App {
   val ppl = Array(
     Person("A", 20, Array(10, 12, 19)),
     Person("B", 25, Array(7, 8, 4)),
     Person("C", 19, Array(12, 4, 232)))

   val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
   val sc = new SparkContext(conf)
   val sqlCtx = new HiveContext(sc)
   import sqlCtx.implicits._
   val df = sc.makeRDD(ppl).toDF
   df.registerTempTable("ppl")
   sqlCtx.cacheTable("ppl") // cache table otherwise ExistingRDD will be used that do not support column pruning
   val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
   s.explain(true)
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6564) SQLContext.emptyDataFrame should contain 0 rows, not 1 row

2015-03-27 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-6564:
--

 Summary: SQLContext.emptyDataFrame should contain 0 rows, not 1 row
 Key: SPARK-6564
 URL: https://issues.apache.org/jira/browse/SPARK-6564
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Right now emptyDataFrame actually contains 1 row.
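
A minimal way to observe the reported behaviour, assuming a Spark 1.3 shell with sqlContext in scope:
{code}
// Expected: 0. With the behaviour described in this issue it returns 1.
sqlContext.emptyDataFrame.count()
{code}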



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-27 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383599#comment-14383599
 ] 

Masayoshi TSUZUKI commented on SPARK-6435:
--

Release 1.3.0 works fine.
But the problem occurs in the latest script on the master branch (under development 
for 1.4).


 spark-shell --jars option does not add all jars to classpath
 

 Key: SPARK-6435
 URL: https://issues.apache.org/jira/browse/SPARK-6435
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Win64
Reporter: vijay

 Not all jars supplied via the --jars option will be added to the driver (and 
 presumably executor) classpath.  The first jar(s) will be added, but not all.
 To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
 then try to import a class from the last jar.  This fails.  A simple 
 reproducer: 
 Create a bunch of dummy jars:
 jar cfM jar1.jar log.txt
 jar cfM jar2.jar log.txt
 jar cfM jar3.jar log.txt
 jar cfM jar4.jar log.txt
 Start the spark-shell with the dummy jars and guava at the end:
 %SPARK_HOME%\bin\spark-shell --master local --jars 
 jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
 In the shell, try importing from guava; you'll get an error:
 {code}
 scala> import com.google.common.base.Strings
 <console>:19: error: object Strings is not a member of package 
 com.google.common.base
import com.google.common.base.Strings
   ^
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6562) DataFrame.replace value support

2015-03-27 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-6562:
--

 Summary: DataFrame.replace value support
 Key: SPARK-6562
 URL: https://issues.apache.org/jira/browse/SPARK-6562
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


Support replacing a set of values with another set of values (i.e. map join), 
similar to Pandas' replace.

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html
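
Until a dedicated replace exists, a hedged sketch of the same effect for a single column, done with a small lookup map and a UDF (not the eventual API); the DataFrame df, the column "status", and the replacement mapping are illustrative assumptions.
{code}
// Hedged sketch of the "map join" idea for one column: look each value up in a
// small replacement map, falling back to the original value when there is no match.
import org.apache.spark.sql.functions.udf

val replacements = Map("N/A" -> "unknown", "missing" -> "unknown")
val replace = udf((v: String) => replacements.getOrElse(v, v))
val cleaned = df.withColumn("status_clean", replace(df("status")))
{code}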



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6119) DataFrame.dropna support

2015-03-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6119:
---
Summary: DataFrame.dropna support  (was: better support for working with 
missing data)

 DataFrame.dropna support
 

 Key: SPARK-6119
 URL: https://issues.apache.org/jira/browse/SPARK-6119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame

 Real world data can be messy. An important feature of data frames is support 
 for missing data. We should figure out what we want to do here.
 Some ideas:
 1. Support replacing all null value for a column (or all columns) with a 
 fixed value.
 2. Support dropping rows with null values (dropna).
 3. Support replacing a set of values with another set of values (i.e. map 
 join)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6563) DataFrame.fillna

2015-03-27 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-6563:
--

 Summary: DataFrame.fillna
 Key: SPARK-6563
 URL: https://issues.apache.org/jira/browse/SPARK-6563
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


Support replacing all null value for a column (or all columns) with a fixed 
value.

Similar to Pandas' fillna.

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.fillna.html
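
Until a dedicated fillna exists, the single-column case can be approximated with coalesce over a literal default (a minimal sketch, assuming coalesce and lit from org.apache.spark.sql.functions; df and the column name "age" are illustrative).
{code}
// Hedged sketch of the single-column case: replace nulls in "age" with a fixed
// default, producing a new column rather than a full fillna over all columns.
import org.apache.spark.sql.functions.{coalesce, lit}

val filled = df.withColumn("age_filled", coalesce(df("age"), lit(0)))
{code}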



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6119) DataFrame.dropna support

2015-03-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6119:
---
Description: 
Support dropping rows with null values (dropna). Similar to Pandas' dropna

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html


  was:
Real world data can be messy. An important feature of data frames is support 
for missing data. We should figure out what we want to do here.

Some ideas:

1. Support replacing all null value for a column (or all columns) with a fixed 
value.

2. Support dropping rows with null values (dropna).

3. Support replacing a set of values with another set of values (i.e. map join)




 DataFrame.dropna support
 

 Key: SPARK-6119
 URL: https://issues.apache.org/jira/browse/SPARK-6119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame

 Support dropping rows with null values (dropna). Similar to Pandas' dropna
 http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html
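
Until dropna lands, dropping rows with nulls in specific columns can be approximated with a filter on isNotNull (a minimal sketch; df and the column names are illustrative assumptions).
{code}
// Hedged sketch: keep only rows where both illustrative columns are non-null,
// approximating dropna for a fixed set of columns.
val dropped = df.filter(df("name").isNotNull && df("age").isNotNull)
{code}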



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-27 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383560#comment-14383560
 ] 

Masayoshi TSUZUKI commented on SPARK-6435:
--

I looked into the script of the latest version and unfortunately found that it 
doesn't work properly either.
We see the same symptom when we specify multiple jars with the --jars option in 
spark-shell.cmd, but the cause is different.

These work fine.
{code}
bin\spark-shell.cmd --jars C:\jar1.jar
bin\spark-shell.cmd --jars C:\jar1.jar
{code}

But this doesn't work.
{code}
bin\spark-shell.cmd --jars C:\jar1.jar,C:\jar2.jar
{code}
This fails with
{code}
Exception in thread "main" java.net.URISyntaxException: Illegal character in 
path at index 11: C:/jar1.jar C:/jar2.jar
{code}


 spark-shell --jars option does not add all jars to classpath
 

 Key: SPARK-6435
 URL: https://issues.apache.org/jira/browse/SPARK-6435
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Win64
Reporter: vijay

 Not all jars supplied via the --jars option will be added to the driver (and 
 presumably executor) classpath.  The first jar(s) will be added, but not all.
 To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
 then try to import a class from the last jar.  This fails.  A simple 
 reproducer: 
 Create a bunch of dummy jars:
 jar cfM jar1.jar log.txt
 jar cfM jar2.jar log.txt
 jar cfM jar3.jar log.txt
 jar cfM jar4.jar log.txt
 Start the spark-shell with the dummy jars and guava at the end:
 %SPARK_HOME%\bin\spark-shell --master local --jars 
 jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
 In the shell, try importing from guava; you'll get an error:
 {code}
 scala> import com.google.common.base.Strings
 <console>:19: error: object Strings is not a member of package 
 com.google.common.base
import com.google.common.base.Strings
   ^
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-03-27 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383550#comment-14383550
 ] 

Masayoshi TSUZUKI commented on SPARK-5389:
--

Hmm... sorry, it seems to be a different case from what I expected.
And I still have no idea how to reproduce it.

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG


 spark-shell.cmd crashes in a DOS prompt on Windows 7. It works fine under PowerShell. 
 spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2.
 Marking as trivial since calling spark-shell2.cmd also works fine.
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6119) DataFrame.dropna support

2015-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383491#comment-14383491
 ] 

Apache Spark commented on SPARK-6119:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5225

 DataFrame.dropna support
 

 Key: SPARK-6119
 URL: https://issues.apache.org/jira/browse/SPARK-6119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame

 Support dropping rows with null values (dropna). Similar to Pandas' dropna
 http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6119) DataFrame.dropna support

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6119:
---

Assignee: (was: Apache Spark)

 DataFrame.dropna support
 

 Key: SPARK-6119
 URL: https://issues.apache.org/jira/browse/SPARK-6119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame

 Support dropping rows with null values (dropna). Similar to Pandas' dropna
 http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6119) DataFrame.dropna support

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6119:
---

Assignee: Apache Spark

 DataFrame.dropna support
 

 Key: SPARK-6119
 URL: https://issues.apache.org/jira/browse/SPARK-6119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark
  Labels: DataFrame

 Support dropping rows with null values (dropna). Similar to Pandas' dropna
 http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6564) SQLContext.emptyDataFrame should contain 0 rows, not 1 row

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6564:
---

Assignee: Apache Spark  (was: Reynold Xin)

 SQLContext.emptyDataFrame should contain 0 rows, not 1 row
 --

 Key: SPARK-6564
 URL: https://issues.apache.org/jira/browse/SPARK-6564
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 Right now emptyDataFrame actually contains 1 row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6564) SQLContext.emptyDataFrame should contain 0 rows, not 1 row

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6564:
---

Assignee: Reynold Xin  (was: Apache Spark)

 SQLContext.emptyDataFrame should contain 0 rows, not 1 row
 --

 Key: SPARK-6564
 URL: https://issues.apache.org/jira/browse/SPARK-6564
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 Right now emptyDataFrame actually contains 1 row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6564) SQLContext.emptyDataFrame should contain 0 rows, not 1 row

2015-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383495#comment-14383495
 ] 

Apache Spark commented on SPARK-6564:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5226

 SQLContext.emptyDataFrame should contain 0 rows, not 1 row
 --

 Key: SPARK-6564
 URL: https://issues.apache.org/jira/browse/SPARK-6564
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 Right now emptyDataFrame actually contains 1 row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6565) Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF

2015-03-27 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6565:
-

 Summary: Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF
 Key: SPARK-6565
 URL: https://issues.apache.org/jira/browse/SPARK-6565
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Priority: Minor


Since 1.3.0, {{SQLContext.jsonRDD}} actually returns a {{DataFrame}}, the 
original name becomes confusing. Would be better to deprecate it and add 
{{jsonDataFrame}} or {{jsonDF}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6406) Launcher backward compatibility issues

2015-03-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383519#comment-14383519
 ] 

Sean Owen commented on SPARK-6406:
--

(Might want to update the title and description to reflect what this is really 
about now; I wasn't 100% sure what the latest intent was.)

 Launcher backward compatibility issues
 --

 Key: SPARK-6406
 URL: https://issues.apache.org/jira/browse/SPARK-6406
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Nishkam Ravi
Priority: Minor

 The new launcher library breaks backward compatibility. The "hadoop" string in 
 the spark assembly should not be mandatory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383601#comment-14383601
 ] 

Apache Spark commented on SPARK-6435:
-

User 'tsudukim' has created a pull request for this issue:
https://github.com/apache/spark/pull/5227

 spark-shell --jars option does not add all jars to classpath
 

 Key: SPARK-6435
 URL: https://issues.apache.org/jira/browse/SPARK-6435
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Win64
Reporter: vijay

 Not all jars supplied via the --jars option will be added to the driver (and 
 presumably executor) classpath.  The first jar(s) will be added, but not all.
 To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
 then try to import a class from the last jar.  This fails.  A simple 
 reproducer: 
 Create a bunch of dummy jars:
 jar cfM jar1.jar log.txt
 jar cfM jar2.jar log.txt
 jar cfM jar3.jar log.txt
 jar cfM jar4.jar log.txt
 Start the spark-shell with the dummy jars and guava at the end:
 %SPARK_HOME%\bin\spark-shell --master local --jars 
 jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
 In the shell, try importing from guava; you'll get an error:
 {code}
 scala> import com.google.common.base.Strings
 <console>:19: error: object Strings is not a member of package com.google.common.base
        import com.google.common.base.Strings
               ^
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6435:
---

Assignee: (was: Apache Spark)

 spark-shell --jars option does not add all jars to classpath
 

 Key: SPARK-6435
 URL: https://issues.apache.org/jira/browse/SPARK-6435
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Win64
Reporter: vijay

 Not all jars supplied via the --jars option will be added to the driver (and 
 presumably executor) classpath.  The first jar(s) will be added, but not all.
 To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
 then try to import a class from the last jar.  This fails.  A simple 
 reproducer: 
 Create a bunch of dummy jars:
 jar cfM jar1.jar log.txt
 jar cfM jar2.jar log.txt
 jar cfM jar3.jar log.txt
 jar cfM jar4.jar log.txt
 Start the spark-shell with the dummy jars and guava at the end:
 %SPARK_HOME%\bin\spark-shell --master local --jars 
 jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
 In the shell, try importing from guava; you'll get an error:
 {code}
 scala> import com.google.common.base.Strings
 <console>:19: error: object Strings is not a member of package com.google.common.base
        import com.google.common.base.Strings
               ^
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6435:
---

Assignee: Apache Spark

 spark-shell --jars option does not add all jars to classpath
 

 Key: SPARK-6435
 URL: https://issues.apache.org/jira/browse/SPARK-6435
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Win64
Reporter: vijay
Assignee: Apache Spark

 Not all jars supplied via the --jars option will be added to the driver (and 
 presumably executor) classpath.  The first jar(s) will be added, but not all.
 To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
 then try to import a class from the last jar.  This fails.  A simple 
 reproducer: 
 Create a bunch of dummy jars:
 jar cfM jar1.jar log.txt
 jar cfM jar2.jar log.txt
 jar cfM jar3.jar log.txt
 jar cfM jar4.jar log.txt
 Start the spark-shell with the dummy jars and guava at the end:
 %SPARK_HOME%\bin\spark-shell --master local --jars 
 jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
 In the shell, try importing from guava; you'll get an error:
 {code}
 scala> import com.google.common.base.Strings
 <console>:19: error: object Strings is not a member of package com.google.common.base
        import com.google.common.base.Strings
               ^
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6566) Update Spark to use the latest version of Parquet libraries

2015-03-27 Thread Konstantin Shaposhnikov (JIRA)
Konstantin Shaposhnikov created SPARK-6566:
--

 Summary: Update Spark to use the latest version of Parquet 
libraries
 Key: SPARK-6566
 URL: https://issues.apache.org/jira/browse/SPARK-6566
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov


There are a lot of bug fixes in the latest version of parquet (1.6.0rc7). E.g. 
PARQUET-136

It would be good to update Spark to use the latest parquet version.

The following changes are required:
{code}
diff --git a/pom.xml b/pom.xml
index 5ad39a9..095b519 100644
--- a/pom.xml
+++ b/pom.xml
@@ -132,7 +132,7 @@
 <!-- Version used for internal directory structure -->
 <hive.version.short>0.13.1</hive.version.short>
 <derby.version>10.10.1.1</derby.version>
-<parquet.version>1.6.0rc3</parquet.version>
+<parquet.version>1.6.0rc7</parquet.version>
 <jblas.version>1.2.3</jblas.version>
 <jetty.version>8.1.14.v20131031</jetty.version>
 <orbit.version>3.0.0.v201112011016</orbit.version>
{code}
and
{code}
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
@@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
 globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
   mergedMetadata, globalMetaData.getCreatedBy)
 
-    val readContext = getReadSupport(configuration).init(
+    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
       new InitContext(configuration,
         globalMetaData.getKeyValueMetaData,
         globalMetaData.getSchema))

{code}

I am happy to prepare a pull request if necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-27 Thread vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383566#comment-14383566
 ] 

vijay commented on SPARK-6435:
--

Strange - when I test it with multiple jars (with the fixed script) everything 
works

 spark-shell --jars option does not add all jars to classpath
 

 Key: SPARK-6435
 URL: https://issues.apache.org/jira/browse/SPARK-6435
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Win64
Reporter: vijay

 Not all jars supplied via the --jars option will be added to the driver (and 
 presumably executor) classpath.  The first jar(s) will be added, but not all.
 To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
 then try to import a class from the last jar.  This fails.  A simple 
 reproducer: 
 Create a bunch of dummy jars:
 jar cfM jar1.jar log.txt
 jar cfM jar2.jar log.txt
 jar cfM jar3.jar log.txt
 jar cfM jar4.jar log.txt
 Start the spark-shell with the dummy jars and guava at the end:
 %SPARK_HOME%\bin\spark-shell --master local --jars 
 jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
 In the shell, try importing from guava; you'll get an error:
 {code}
 scala> import com.google.common.base.Strings
 <console>:19: error: object Strings is not a member of package com.google.common.base
        import com.google.common.base.Strings
               ^
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-27 Thread vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383566#comment-14383566
 ] 

vijay edited comment on SPARK-6435 at 3/27/15 9:17 AM:
---

Strange - when I test it with multiple jars (with the fixed script) everything 
works.
Something has changed in some other script wrt the released 1.3.0


was (Author: vjapache):
Strange - when I test it with multiple jars (with the fixed script) everything 
works

 spark-shell --jars option does not add all jars to classpath
 

 Key: SPARK-6435
 URL: https://issues.apache.org/jira/browse/SPARK-6435
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Win64
Reporter: vijay

 Not all jars supplied via the --jars option will be added to the driver (and 
 presumably executor) classpath.  The first jar(s) will be added, but not all.
 To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
 then try to import a class from the last jar.  This fails.  A simple 
 reproducer: 
 Create a bunch of dummy jars:
 jar cfM jar1.jar log.txt
 jar cfM jar2.jar log.txt
 jar cfM jar3.jar log.txt
 jar cfM jar4.jar log.txt
 Start the spark-shell with the dummy jars and guava at the end:
 %SPARK_HOME%\bin\spark-shell --master local --jars 
 jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
 In the shell, try importing from guava; you'll get an error:
 {code}
 scala> import com.google.common.base.Strings
 <console>:19: error: object Strings is not a member of package com.google.common.base
        import com.google.common.base.Strings
               ^
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6255) Python MLlib API missing items: Classification

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6255:
---

Assignee: Apache Spark  (was: Yanbo Liang)

 Python MLlib API missing items: Classification
 --

 Key: SPARK-6255
 URL: https://issues.apache.org/jira/browse/SPARK-6255
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Apache Spark

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 LogisticRegressionWithLBFGS
 * setNumClasses
 * setValidateData
 LogisticRegressionModel
 * getThreshold
 * numClasses
 * numFeatures
 SVMWithSGD
 * setValidateData
 SVMModel
 * getThreshold



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6255) Python MLlib API missing items: Classification

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6255:
---

Assignee: Yanbo Liang  (was: Apache Spark)

 Python MLlib API missing items: Classification
 --

 Key: SPARK-6255
 URL: https://issues.apache.org/jira/browse/SPARK-6255
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 LogisticRegressionWithLBFGS
 * setNumClasses
 * setValidateData
 LogisticRegressionModel
 * getThreshold
 * numClasses
 * numFeatures
 SVMWithSGD
 * setValidateData
 SVMModel
 * getThreshold



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5563) LDA with online variational inference

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5563:
---

Assignee: Apache Spark  (was: yuhao yang)

 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Apache Spark

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach. “Online Learning for 
 Latent Dirichlet Allocation.”  NIPS, 2010.  This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting Streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking in to the existing mllib.optimization frameworks.
 This will require some discussion about whether batch versions of online 
 variational inference should be supported, as well as what variational 
 approximation should be used now or in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5563) LDA with online variational inference

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5563:
---

Assignee: yuhao yang  (was: Apache Spark)

 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach. “Online Learning for 
 Latent Dirichlet Allocation.”  NIPS, 2010.  This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting Streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking in to the existing mllib.optimization frameworks.
 This will require some discussion about whether batch versions of online 
 variational inference should be supported, as well as what variational 
 approximation should be used now or in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383729#comment-14383729
 ] 

Sean Owen commented on SPARK-6435:
--

OK, [~vjapache] would you like to submit a PR that changes to use the brackets? 
You may need two PRs, one for branch-1.3 and one for master, since some 
occurrences are now gone in master.

[~tsudukim] OK I understand your pull request is to fix a similar issue but in 
the new 1.4 / master code?

 spark-shell --jars option does not add all jars to classpath
 

 Key: SPARK-6435
 URL: https://issues.apache.org/jira/browse/SPARK-6435
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Win64
Reporter: vijay

 Not all jars supplied via the --jars option will be added to the driver (and 
 presumably executor) classpath.  The first jar(s) will be added, but not all.
 To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
 then try to import a class from the last jar.  This fails.  A simple 
 reproducer: 
 Create a bunch of dummy jars:
 jar cfM jar1.jar log.txt
 jar cfM jar2.jar log.txt
 jar cfM jar3.jar log.txt
 jar cfM jar4.jar log.txt
 Start the spark-shell with the dummy jars and guava at the end:
 %SPARK_HOME%\bin\spark-shell --master local --jars 
 jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
 In the shell, try importing from guava; you'll get an error:
 {code}
 scala> import com.google.common.base.Strings
 <console>:19: error: object Strings is not a member of package com.google.common.base
        import com.google.common.base.Strings
               ^
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6556) Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver

2015-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6556:
-
Affects Version/s: 1.4.0

 Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in 
 HeartbeatReceiver
 

 Key: SPARK-6556
 URL: https://issues.apache.org/jira/browse/SPARK-6556
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
 Fix For: 1.4.0


 The current reading logic of executorTimeoutMs is:
 {code}
 private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout", 
 sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
 {code}
 So if spark.storage.blockManagerSlaveTimeoutMs is 1 and 
 spark.network.timeout is not set, executorTimeoutMs will be 1 * 1000. 
 But the correct value should have been 1. 
 checkTimeoutIntervalMs has the same issue. 
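 
 One possible shape of the corrected precedence, shown only as a sketch under the 
 assumption that spark.network.timeout is given in seconds while 
 spark.storage.blockManagerSlaveTimeoutMs is already in milliseconds (this is not 
 the actual patch):
 {code}
 import org.apache.spark.SparkConf
 
 def executorTimeoutMs(conf: SparkConf): Long =
   if (conf.contains("spark.network.timeout")) {
     conf.getLong("spark.network.timeout", 120) * 1000                    // seconds -> ms
   } else {
     conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120 * 1000) // already ms
   }
 {code}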



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6556) Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver

2015-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6556.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Shixiong Zhu

 Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in 
 HeartbeatReceiver
 

 Key: SPARK-6556
 URL: https://issues.apache.org/jira/browse/SPARK-6556
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
 Fix For: 1.4.0


 The current reading logic of executorTimeoutMs is:
 {code}
 private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout", 
 sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
 {code}
 So if spark.storage.blockManagerSlaveTimeoutMs is 1 and 
 spark.network.timeout is not set, executorTimeoutMs will be 1 * 1000. 
 But the correct value should have been 1. 
 checkTimeoutIntervalMs has the same issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5155) Python API for MQTT streaming

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5155:
---

Assignee: Prabeesh K  (was: Apache Spark)

 Python API for MQTT streaming
 -

 Key: SPARK-5155
 URL: https://issues.apache.org/jira/browse/SPARK-5155
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Davies Liu
Assignee: Prabeesh K

 Python API for MQTT Utils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5155) Python API for MQTT streaming

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5155:
---

Assignee: Apache Spark  (was: Prabeesh K)

 Python API for MQTT streaming
 -

 Key: SPARK-5155
 URL: https://issues.apache.org/jira/browse/SPARK-5155
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Davies Liu
Assignee: Apache Spark

 Python API for MQTT Utils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6569.
--
Resolution: Duplicate

 Kafka directInputStream logs what appear to be incorrect warnings
 -

 Key: SPARK-6569
 URL: https://issues.apache.org/jira/browse/SPARK-6569
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
 Environment: Spark 1.3.0
Reporter: Platon Potapov
Priority: Minor

 During what appears to be normal operation of streaming from a Kafka topic, 
 the following log records are observed, logged periodically:
 {code}
 [Stage 391:==  (3 + 0) / 
 4]
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 {code}
 * the part.fromOffset placeholder is not correctly substituted with a value (see 
 the interpolation sketch after this list)
 * does the condition really mandate a warning being logged?
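 
 A likely explanation for the literal placeholder, shown as a plain-Scala sketch 
 (whether KafkaRDD builds its message this way would need checking in the source): 
 only s-interpolated strings substitute ${...}; an ordinary double-quoted string 
 prints it verbatim.
 {code}
 case class Part(fromOffset: Long)
 val part = Part(42L)
 println("Beginning offset ${part.fromOffset}")   // prints the placeholder as-is
 println(s"Beginning offset ${part.fromOffset}")  // prints: Beginning offset 42
 {code}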



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6535) new RDD function that returns intermediate Future

2015-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6535.
--
Resolution: Not a Problem

I think it's fair to say that this would not require a change to Spark to 
implement the desired functionality, so closing it.

 new RDD function that returns intermediate Future
 -

 Key: SPARK-6535
 URL: https://issues.apache.org/jira/browse/SPARK-6535
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Reporter: Eric Johnston
Priority: Minor
  Labels: features, newbie
   Original Estimate: 168h
  Remaining Estimate: 168h

 I'm suggesting a possible Spark RDD method that I think could give value to a 
 number of people. I'd be interested in thoughts and feedback. Is this a good 
 or bad idea in general? Will it work well, but is too specific for Spark-Core?
 def mapIO[V : ClassTag](f1 : T => Future[U], f2 : U => V, batchSize : Int) : 
 RDD[V]
 The idea is that often times we have an RDD[T] containing metadata, for 
 example a file path or a unique identifier to data in an external database. 
 We would like to retrieve this data, process it, and provide the output as an 
 RDD. Right now, one way to do that is with two map calls: the first being T 
 = U, followed by U = V. However, this will block on all T = U IO 
 operations. By wrapping U in a Future, this problem is avoided. The 
 batchSize is added because we do not want to create a future for every row 
 in a partition -- we may get too much data back at once. The batchSize limits 
 the number of outstanding Futures within a partition. Ideally this number is 
 set to be big enough so that there is always data ready to process, but small 
 enough that not too much data is pulled at any one time. We could potentially 
 default the batchSize to 1.
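 
 A rough sketch of the idea on top of the existing API (the name mapIO and the 
 batchSize parameter come from the description above; this is not an existing Spark 
 method, it is written as a helper function rather than an RDD method, and per-batch 
 blocking with Await is only one possible policy):
 {code}
 import scala.concurrent.{Await, Future}
 import scala.concurrent.duration.Duration
 import scala.reflect.ClassTag
 import org.apache.spark.rdd.RDD
 
 def mapIO[T, U, V: ClassTag](rdd: RDD[T])(f1: T => Future[U], f2: U => V, batchSize: Int): RDD[V] =
   rdd.mapPartitions { iter =>
     iter.grouped(batchSize).flatMap { batch =>
       val futures = batch.map(f1)  // start up to batchSize IO calls for this batch
       futures.map(fu => f2(Await.result(fu, Duration.Inf)))  // wait for each result, then apply f2
     }
   }
 {code}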



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6440) ipv6 URI for HttpServer

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6440:
---

Assignee: (was: Apache Spark)

 ipv6 URI for HttpServer
 ---

 Key: SPARK-6440
 URL: https://issues.apache.org/jira/browse/SPARK-6440
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
 Environment: java 7 hotspot, spark 1.3.0, ipv6 only cluster
Reporter: Arsenii Krasikov
Priority: Minor

 In {{org.apache.spark.HttpServer}} the uri is generated as {code:java}"spark://" 
 + localHostname + ":" + masterPort{code}, where {{localHostname}} is 
 {code:java} org.apache.spark.util.Utils.localHostName() = 
 customHostname.getOrElse(localIpAddressHostname){code}. If the host has an 
 ipv6 address then it would be interpolated into an invalid URI: 
 {{spark://fe80:0:0:0:200:f8ff:fe21:67cf:42}} instead of 
 {{spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42}}.
 The solution is to separate uri and hostname entities.
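 
 A sketch of the bracketing a fix would need (plain Scala, not the actual patch; 
 the helper name sparkUri is made up for illustration):
 {code}
 def sparkUri(host: String, port: Int): String = {
   // an IPv6 literal contains ':' and must be wrapped in brackets inside a URI
   val h = if (host.contains(":")) s"[$host]" else host
   s"spark://$h:$port"
 }
 
 sparkUri("fe80:0:0:0:200:f8ff:fe21:67cf", 42)  // spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42
 {code}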



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6440) ipv6 URI for HttpServer

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6440:
---

Assignee: Apache Spark

 ipv6 URI for HttpServer
 ---

 Key: SPARK-6440
 URL: https://issues.apache.org/jira/browse/SPARK-6440
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
 Environment: java 7 hotspot, spark 1.3.0, ipv6 only cluster
Reporter: Arsenii Krasikov
Assignee: Apache Spark
Priority: Minor

 In {{org.apache.spark.HttpServer}} the uri is generated as {code:java}"spark://" 
 + localHostname + ":" + masterPort{code}, where {{localHostname}} is 
 {code:java} org.apache.spark.util.Utils.localHostName() = 
 customHostname.getOrElse(localIpAddressHostname){code}. If the host has an 
 ipv6 address then it would be interpolated into an invalid URI: 
 {{spark://fe80:0:0:0:200:f8ff:fe21:67cf:42}} instead of 
 {{spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42}}.
 The solution is to separate uri and hostname entities.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6558) Utils.getCurrentUserName returns the full principal name instead of login name

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6558:
---

Assignee: Apache Spark  (was: Thomas Graves)

 Utils.getCurrentUserName returns the full principal name instead of login name
 --

 Key: SPARK-6558
 URL: https://issues.apache.org/jira/browse/SPARK-6558
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Thomas Graves
Assignee: Apache Spark
Priority: Critical

 Utils.getCurrentUserName returns 
 UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't 
 set.  It should return 
 UserGroupInformation.getCurrentUser().getShortUserName().
 getUserName() returns the user's full principal name (i.e. us...@corp.com). 
 getShortUserName() returns just the user's login name (user1).
 This just happens to work on YARN because the Client code sets:
 env("SPARK_USER") = 
 UserGroupInformation.getCurrentUser().getShortUserName()
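 
 A sketch of the behaviour the description asks for (assumes hadoop-common on the 
 classpath; the helper name currentUserName is made up, and this is not the merged 
 change):
 {code}
 import org.apache.hadoop.security.UserGroupInformation
 
 def currentUserName(): String =
   sys.env.getOrElse("SPARK_USER",
     UserGroupInformation.getCurrentUser().getShortUserName())  // login name, not full principal
 {code}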



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6570) Spark SQL arrays: explode() fails and cannot save array type to Parquet

2015-03-27 Thread Jon Chase (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jon Chase updated SPARK-6570:
-
Summary: Spark SQL arrays: explode() fails and cannot save array type to 
Parquet  (was: Spark SQL explode() fails, assumes underlying SQL array is 
represented by Scala Seq)

 Spark SQL arrays: explode() fails and cannot save array type to Parquet
 -

 Key: SPARK-6570
 URL: https://issues.apache.org/jira/browse/SPARK-6570
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase

 {code}
 @Rule
 public TemporaryFolder tmp = new TemporaryFolder();

 @Test
 public void testPercentileWithExplode() throws Exception {
     StructType schema = DataTypes.createStructType(Lists.newArrayList(
             DataTypes.createStructField("col1", DataTypes.StringType, false),
             DataTypes.createStructField("col2s",
                     DataTypes.createArrayType(DataTypes.IntegerType, true), true)
     ));

     JavaRDD<Row> rowRDD = sc.parallelize(Lists.newArrayList(
             RowFactory.create("test", new int[]{1, 2, 3})
     ));

     DataFrame df = sql.createDataFrame(rowRDD, schema);
     df.registerTempTable("df");
     df.printSchema();

     List<int[]> ints = sql.sql("select col2s from df").javaRDD()
             .map(row -> (int[]) row.get(0)).collect();
     assertEquals(1, ints.size());
     assertArrayEquals(new int[]{1, 2, 3}, ints.get(0));

     // fails: lateral view explode does not work:
     // java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
     List<Integer> explodedInts = sql.sql("select col2 from df lateral view explode(col2s) splode as col2").javaRDD()
             .map(row -> row.getInt(0)).collect();
     assertEquals(3, explodedInts.size());
     assertEquals(Lists.newArrayList(1, 2, 3), explodedInts);

     // fails: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
     df.saveAsParquetFile(tmp.getRoot().getAbsolutePath() + "/parquet");
     DataFrame loadedDf = sql.load(tmp.getRoot().getAbsolutePath() + "/parquet");
     loadedDf.registerTempTable("loadedDf");

     List<int[]> moreInts = sql.sql("select col2s from loadedDf").javaRDD()
             .map(row -> (int[]) row.get(0)).collect();
     assertEquals(1, moreInts.size());
     assertArrayEquals(new int[]{1, 2, 3}, moreInts.get(0));
 }
 {code}
 {code}
 root
  |-- col1: string (nullable = false)
  |-- col2s: array (nullable = true)
  |    |-- element: integer (containsNull = true)
 ERROR org.apache.spark.executor.Executor Exception in task 7.0 in stage 1.0 
 (TID 15)
 java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
   at 
 org.apache.spark.sql.catalyst.expressions.Explode.eval(generators.scala:125) 
 ~[spark-catalyst_2.10-1.3.0.jar:1.3.0]
   at 
 org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:70)
  ~[spark-sql_2.10-1.3.0.jar:1.3.0]
   at 
 org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:69)
  ~[spark-sql_2.10-1.3.0.jar:1.3.0]
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 ~[scala-library-2.10.4.jar:na]
 {code}
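 
 The cast error suggests Catalyst expects array columns to arrive as a Scala Seq. 
 A possible workaround sketch from the Scala side (spark-shell on 1.3, with sc and 
 sqlContext provided by the shell); the Java int[] case reported above may still 
 need a proper fix:
 {code}
 import org.apache.spark.sql.Row
 import org.apache.spark.sql.types._
 
 val schema = StructType(Seq(
   StructField("col1", StringType, false),
   StructField("col2s", ArrayType(IntegerType, true), true)))
 val rowRDD = sc.parallelize(Seq(Row("test", Seq(1, 2, 3))))
 val df = sqlContext.createDataFrame(rowRDD, schema)
 df.registerTempTable("df")
 sqlContext.sql("select col2 from df lateral view explode(col2s) splode as col2").collect()
 {code}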



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6570) Spark SQL explode() fails, assumes underlying SQL array is represented by Scala Seq

2015-03-27 Thread Jon Chase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383794#comment-14383794
 ] 

Jon Chase commented on SPARK-6570:
--

Stack trace for saveAsParquetFile():

{code}
root
 |-- col1: string (nullable = false)
 |-- col2s: array (nullable = true)
 |    |-- element: integer (containsNull = true)

SLF4J: Failed to load class org.slf4j.impl.StaticLoggerBinder.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
ERROR org.apache.spark.executor.Executor Exception in task 7.0 in stage 1.0 
(TID 15)
java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
at 
org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:185)
 ~[spark-sql_2.10-1.3.0.jar:1.3.0]
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
 ~[spark-sql_2.10-1.3.0.jar:1.3.0]
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
 ~[spark-sql_2.10-1.3.0.jar:1.3.0]
at 
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
 ~[parquet-hadoop-1.6.0rc3.jar:na]
at 
parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) 
~[parquet-hadoop-1.6.0rc3.jar:na]
at 
parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) 
~[parquet-hadoop-1.6.0rc3.jar:na]
at 
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:631)
 ~[spark-sql_2.10-1.3.0.jar:1.3.0]
at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
 ~[spark-sql_2.10-1.3.0.jar:1.3.0]
at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
 ~[spark-sql_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) 
~[spark-core_2.10-1.3.0.jar:1.3.0]
at org.apache.spark.scheduler.Task.run(Task.scala:64) 
~[spark-core_2.10-1.3.0.jar:1.3.0]
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) 
~[spark-core_2.10-1.3.0.jar:1.3.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_31]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_31]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
WARN  o.a.spark.scheduler.TaskSetManager Lost task 7.0 in stage 1.0 (TID 15, 
localhost): java.lang.ClassCastException: [I cannot be cast to 
scala.collection.Seq
at 
org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:185)
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
at 
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at 
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:631)
at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR o.a.spark.scheduler.TaskSetManager Task 7 in stage 1.0 failed 1 times; 
aborting job

org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in 
stage 1.0 failed 1 times, most recent failure: Lost task 7.0 in stage 1.0 (TID 
15, localhost): java.lang.ClassCastException: [I cannot be cast to 
scala.collection.Seq
at 
org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:185)
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
at 
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at 

[jira] [Assigned] (SPARK-1684) Merge script should standardize SPARK-XXX prefix

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-1684:
---

Assignee: Patrick Wendell  (was: Apache Spark)

 Merge script should standardize SPARK-XXX prefix
 

 Key: SPARK-1684
 URL: https://issues.apache.org/jira/browse/SPARK-1684
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Minor
  Labels: starter
 Attachments: spark_pulls_before_after.txt


 If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: 
 Issue" we should convert it to "SPARK-XXX: Issue"
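 
 An illustrative sketch of that normalisation (plain Scala; the project's actual 
 merge script is a separate Python tool, and the exact canonical form is whatever 
 the project settles on):
 {code}
 def standardizeTitle(title: String): String = {
   val p = """(?i)^\[?SPARK[- ](\d+)\]?[:.]?\s*(.*)$""".r
   title match {
     case p(num, rest) => s"SPARK-$num: $rest"
     case _            => title
   }
 }
 
 standardizeTitle("[SPARK-1684] Merge script")  // SPARK-1684: Merge script
 standardizeTitle("SPARK 1684. Merge script")   // SPARK-1684: Merge script
 {code}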



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6571) MatrixFactorizationModel created by load fails on predictAll

2015-03-27 Thread Charles Hayden (JIRA)
Charles Hayden created SPARK-6571:
-

 Summary: MatrixFactorizationModel created by load fails on 
predictAll
 Key: SPARK-6571
 URL: https://issues.apache.org/jira/browse/SPARK-6571
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Charles Hayden


This code, adapted from the documentation, fails when using a loaded model.
from pyspark.mllib.recommendation import ALS, Rating, MatrixFactorizationModel

r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings = sc.parallelize([r1, r2, r3])
model = ALS.trainImplicit(ratings, 1, seed=10)
print '(2, 2)', model.predict(2, 2)
#0.43...
testset = sc.parallelize([(1, 2), (1, 1)])
print 'all', model.predictAll(testset).collect()
#[Rating(user=1, product=1, rating=1.0...), Rating(user=1, product=2, 
rating=1.9...)]
import os, tempfile
path = tempfile.mkdtemp()
model.save(sc, path)
sameModel = MatrixFactorizationModel.load(sc, path)
print '(2, 2)', sameModel.predict(2,2)
sameModel.predictAll(testset).collect()


This gives
(2, 2) 0.443547642944
all [Rating(user=1, product=1, rating=1.1538351103381217), Rating(user=1, 
product=2, rating=0.7153473708381739)]
(2, 2) 0.443547642944
---
Py4JError Traceback (most recent call last)
<ipython-input-18-af6612bed9d0> in <module>()
 19 sameModel = MatrixFactorizationModel.load(sc, path)
 20 print '(2, 2)', sameModel.predict(2,2)
---> 21 sameModel.predictAll(testset).collect()
 22 

/home/ubuntu/spark/python/pyspark/mllib/recommendation.pyc in predictAll(self, 
user_product)
104 assert len(first) == 2, "user_product should be RDD of (user, product)"
105 user_product = user_product.map(lambda (u, p): (int(u), int(p)))
--> 106 return self.call("predict", user_product)
107 
108 def userFeatures(self):

/home/ubuntu/spark/python/pyspark/mllib/common.pyc in call(self, name, *a)
134 def call(self, name, *a):
135 """Call method of java_model"""
--> 136 return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
137 
138 

/home/ubuntu/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, 
*args)
111 """ Call Java Function """
112 args = [_py2java(sc, a) for a in args]
--> 113 return _java2py(sc, func(*args))
114 
115 

/home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

/home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
302 raise Py4JError(
303 'An error occurred while calling {0}{1}{2}. 
Trace:\n{3}\n'.
-- 304 format(target_id, '.', name, value))
305 else:
306 raise Py4JError(

Py4JError: An error occurred while calling o450.predict. Trace:
py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) 
does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:744)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6348:
---

Assignee: (was: Apache Spark)

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 SVMWithSGD is a private class; its train methods are provided through the 
 SVMWithSGD object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on this dataset 
 (https://www.kaggle.com/c/avazu-ctr-prediction/data): I train on the first 
 day's data (ignoring the fields id/device_id/device_ip; all remaining fields 
 are treated as categorical variables and converted to sparse features before 
 SVM) and predict on the same data with the threshold cleared. The predictions 
 are all negative. When I set useFeatureScaling to true, the predictions are 
 normal (including both negative and positive results).
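 
 A stopgap sketch while the flag is not exposed: scale the features manually before 
 training (MLlib 1.2/1.3 API; note this is not identical to the internal 
 useFeatureScaling path, which also rescales the trained weights back):
 {code}
 import org.apache.spark.mllib.classification.SVMWithSGD
 import org.apache.spark.mllib.feature.StandardScaler
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.rdd.RDD
 
 def trainScaledSVM(data: RDD[LabeledPoint], numIterations: Int) = {
   // withMean = false keeps sparse vectors sparse
   val scaler = new StandardScaler(withMean = false, withStd = true).fit(data.map(_.features))
   val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features)))
   SVMWithSGD.train(scaled, numIterations)
 }
 {code}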



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6558) Utils.getCurrentUserName returns the full principal name instead of login name

2015-03-27 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-6558:


Assignee: Thomas Graves

 Utils.getCurrentUserName returns the full principal name instead of login name
 --

 Key: SPARK-6558
 URL: https://issues.apache.org/jira/browse/SPARK-6558
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Critical

 Utils.getCurrentUserName returns 
 UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't 
 set.  It should return 
 UserGroupInformation.getCurrentUser().getShortUserName().
 getUserName() returns the user's full principal name (i.e. us...@corp.com). 
 getShortUserName() returns just the user's login name (user1).
 This just happens to work on YARN because the Client code sets:
 env("SPARK_USER") = 
 UserGroupInformation.getCurrentUser().getShortUserName()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5493) Support proxy users under kerberos

2015-03-27 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383856#comment-14383856
 ] 

Thomas Graves commented on SPARK-5493:
--

[~vanzin]  I must be missing something.  Why is this feature needed?

I can run spark through oozie just fine without this on a secure yarn cluster. 
(and jobs run as the correct user)  

 Support proxy users under kerberos
 --

 Key: SPARK-5493
 URL: https://issues.apache.org/jira/browse/SPARK-5493
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Brock Noland
Assignee: Marcelo Vanzin
 Fix For: 1.3.0


 When using kerberos, services may want to use spark-submit to submit jobs as 
 a separate user. For example a service like oozie might want to submit jobs 
 as a client user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-1684) Merge script should standardize SPARK-XXX prefix

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-1684:
---

Assignee: Apache Spark  (was: Patrick Wendell)

 Merge script should standardize SPARK-XXX prefix
 

 Key: SPARK-1684
 URL: https://issues.apache.org/jira/browse/SPARK-1684
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Apache Spark
Priority: Minor
  Labels: starter
 Attachments: spark_pulls_before_after.txt


 If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: 
 Issue" we should convert it to "SPARK-XXX: Issue"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6348:
---

Assignee: Apache Spark

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Assignee: Apache Spark
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 SVMWithSGD is a private class; its train methods are provided through the 
 SVMWithSGD object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on this dataset 
 (https://www.kaggle.com/c/avazu-ctr-prediction/data): I train on the first 
 day's data (ignoring the fields id/device_id/device_ip; all remaining fields 
 are treated as categorical variables and converted to sparse features before 
 SVM) and predict on the same data with the threshold cleared. The predictions 
 are all negative. When I set useFeatureScaling to true, the predictions are 
 normal (including both negative and positive results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6558) Utils.getCurrentUserName returns the full principal name instead of login name

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6558:
---

Assignee: Thomas Graves  (was: Apache Spark)

 Utils.getCurrentUserName returns the full principal name instead of login name
 --

 Key: SPARK-6558
 URL: https://issues.apache.org/jira/browse/SPARK-6558
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Critical

 Utils.getCurrentUserName returns 
 UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't 
 set.  It should return 
 UserGroupInformation.getCurrentUser().getShortUserName().
 getUserName() returns the user's full principal name (i.e. us...@corp.com). 
 getShortUserName() returns just the user's login name (user1).
 This just happens to work on YARN because the Client code sets:
 env("SPARK_USER") = 
 UserGroupInformation.getCurrentUser().getShortUserName()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6558) Utils.getCurrentUserName returns the full principal name instead of login name

2015-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383843#comment-14383843
 ] 

Apache Spark commented on SPARK-6558:
-

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/5229

 Utils.getCurrentUserName returns the full principal name instead of login name
 --

 Key: SPARK-6558
 URL: https://issues.apache.org/jira/browse/SPARK-6558
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Critical

 Utils.getCurrentUserName returns 
 UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't 
 set.  It should return 
 UserGroupInformation.getCurrentUser().getShortUserName().
 getUserName() returns the user's full principal name (i.e. us...@corp.com). 
 getShortUserName() returns just the user's login name (user1).
 This just happens to work on YARN because the Client code sets:
 env("SPARK_USER") = 
 UserGroupInformation.getCurrentUser().getShortUserName()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5493) Support proxy users under kerberos

2015-03-27 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383856#comment-14383856
 ] 

Thomas Graves edited comment on SPARK-5493 at 3/27/15 2:00 PM:
---

[~vanzin]  I must be missing something.  Why is this feature needed?

I can run spark through oozie just fine without this on a secure yarn cluster. 
(and jobs run as the correct user)  

perhaps needed by hive?  Or is it just to allow a proxy user to manually run 
things (ie not through oozie), which seems a bit odd to me.


was (Author: tgraves):
[~vanzin]  I must be missing something.  Why is this feature needed?

I can run spark through oozie just fine without this on a secure yarn cluster. 
(and jobs run as the correct user)  

 Support proxy users under kerberos
 --

 Key: SPARK-5493
 URL: https://issues.apache.org/jira/browse/SPARK-5493
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Brock Noland
Assignee: Marcelo Vanzin
 Fix For: 1.3.0


 When using kerberos, services may want to use spark-submit to submit jobs as 
 a separate user. For example a service like oozie might want to submit jobs 
 as a client user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6544) Problem with Avro and Kryo Serialization

2015-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6544.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5193
[https://github.com/apache/spark/pull/5193]

 Problem with Avro and Kryo Serialization
 

 Key: SPARK-6544
 URL: https://issues.apache.org/jira/browse/SPARK-6544
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: Dean Chen
 Fix For: 1.4.0


 We're running in to the following bug with Avro 1.7.6 and the Kryo serializer 
 causing jobs to fail
 https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249
 PR here
 https://github.com/apache/spark/pull/5193



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6572) When I build Spark 1.3 sbt gives me the following error : unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1:

2015-03-27 Thread Frank Domoney (JIRA)
Frank Domoney created SPARK-6572:


 Summary: When I build Spark 1.3 sbt gives me the following error
: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found  
org.scalamacros#quasiquotes_2.11;2.0.1: not found [error] Total time: 27 s, 
completed 27-Mar-2015 14:24:39
 Key: SPARK-6572
 URL: https://issues.apache.org/jira/browse/SPARK-6572
 Project: Spark
  Issue Type: Bug
Reporter: Frank Domoney






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-03-27 Thread Platon Potapov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Platon Potapov reopened SPARK-6569:
---

Sean, could you explain whether the condition really warrants a warning being logged?

The scenario in which this record gets logged seems to be simply that there is no 
new data in the Kafka topic (the Kafka reader is at the head of the topic) - 
isn't that the case?


 Kafka directInputStream logs what appear to be incorrect warnings
 -

 Key: SPARK-6569
 URL: https://issues.apache.org/jira/browse/SPARK-6569
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
 Environment: Spark 1.3.0
Reporter: Platon Potapov
Priority: Minor

 During what appears to be normal operation of streaming from a Kafka topic, 
 the following log records are observed, logged periodically:
 {code}
 [Stage 391:==  (3 + 0) / 
 4]
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 {code}
 * the part.fromOffset placeholder is not correctly substituted with a value
 * does the condition really mandate a warning being logged?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)

2015-03-27 Thread Fabian Boehnlein (JIRA)
Fabian Boehnlein created SPARK-6573:
---

 Summary: expect pandas null values as numpy.nan (not only as None)
 Key: SPARK-6573
 URL: https://issues.apache.org/jira/browse/SPARK-6573
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Fabian Boehnlein


In pandas it is common to use numpy.nan as the null value, for missing data or 
whatever.

http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna

createDataFrame however only works with None as null values, parsing them as 
None in the RDD.

I suggest adding support for np.nan values in pandas DataFrames.

Current stack trace when calling createDataFrame on a pandas DataFrame with 
object-type columns containing np.nan values (which are floats):
{code}
TypeError Traceback (most recent call last)
<ipython-input-38-34f0263f0bf4> in <module>()
----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in 
createDataFrame(self, data, schema, samplingRatio)
339 schema = self._inferSchema(data.map(lambda r: row_cls(*r)), 
samplingRatio)
340 
--> 341 return self.applySchema(data, schema)
342 
343 def registerDataFrameAsTable(self, rdd, tableName):

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in 
applySchema(self, rdd, schema)
246 
247 for row in rows:
--> 248 _verify_type(row, schema)
249 
250 # convert python objects to sql data

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in 
_verify_type(obj, dataType)
   1064 "length of fields (%d)" % (len(obj), 
len(dataType.fields)))
   1065 for v, f in zip(obj, dataType.fields):
-> 1066 _verify_type(v, f.dataType)
   1067 
   1068 _cached_cls = weakref.WeakValueDictionary()

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in 
_verify_type(obj, dataType)
   1048 if type(obj) not in _acceptable_types[_type]:
   1049 raise TypeError("%s can not accept object in type %s"
-> 1050 % (dataType, type(obj)))
   1051 
   1052 if isinstance(dataType, ArrayType):

TypeError: StringType can not accept object in type <type 'float'>{code}
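
A possible workaround sketch (not the proposed fix): replace np.nan with None in 
the pandas frame before handing it to createDataFrame, so the existing None-based 
null handling applies. Column names and data below are hypothetical.

{code}
import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master="local[1]", appName="nan-to-none")
sqlCtx = SQLContext(sc)

df_ = pd.DataFrame({"name": ["alice", np.nan], "age": [30.0, np.nan]})
df_ = df_.where(pd.notnull(df_), None)   # NaN -> None (affected columns become object dtype)

sqldf = sqlCtx.createDataFrame(df_)      # schema is inferred; None is treated as null
sqldf.show()
{code}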



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-03-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383973#comment-14383973
 ] 

Sean Owen commented on SPARK-6569:
--

[~c...@koeninger.org] what do you think of the warning? if it's more suitable 
as info we can reopen and address that. The interpolation was already fixed 
separately.

 Kafka directInputStream logs what appear to be incorrect warnings
 -

 Key: SPARK-6569
 URL: https://issues.apache.org/jira/browse/SPARK-6569
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
 Environment: Spark 1.3.0
Reporter: Platon Potapov
Priority: Minor

 During what appears to be normal operation of streaming from a Kafka topic, 
 the following log records are observed, logged periodically:
 {code}
 [Stage 391:==  (3 + 0) / 
 4]
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 {code}
 * the part.fromOffset placeholder is not correctly substituted with a value
 * does the condition really mandate a warning being logged?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)

2015-03-27 Thread Fabian Boehnlein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Boehnlein updated SPARK-6573:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-6116

 expect pandas null values as numpy.nan (not only as None)
 -

 Key: SPARK-6573
 URL: https://issues.apache.org/jira/browse/SPARK-6573
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0
Reporter: Fabian Boehnlein

 In pandas it is common to use numpy.nan as the null value, for missing data 
 or whatever.
 http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
 http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
 http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
 createDataFrame however only works with None as null values, parsing them as 
 None in the RDD.
 I suggest adding support for np.nan values in pandas DataFrames.
 Current stack trace when calling createDataFrame on a pandas DataFrame with 
 object-type columns containing np.nan values (which are floats):
 {code}
 TypeError Traceback (most recent call last)
 <ipython-input-38-34f0263f0bf4> in <module>()
 ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
 /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in 
 createDataFrame(self, data, schema, samplingRatio)
 339 schema = self._inferSchema(data.map(lambda r: 
 row_cls(*r)), samplingRatio)
 340 
 --> 341 return self.applySchema(data, schema)
 342 
 343 def registerDataFrameAsTable(self, rdd, tableName):
 /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in 
 applySchema(self, rdd, schema)
 246 
 247 for row in rows:
 --> 248 _verify_type(row, schema)
 249 
 250 # convert python objects to sql data
 /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in 
 _verify_type(obj, dataType)
    1064 "length of fields (%d)" % (len(obj), 
 len(dataType.fields)))
    1065 for v, f in zip(obj, dataType.fields):
 -> 1066 _verify_type(v, f.dataType)
    1067 
    1068 _cached_cls = weakref.WeakValueDictionary()
 /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in 
 _verify_type(obj, dataType)
    1048 if type(obj) not in _acceptable_types[_type]:
    1049 raise TypeError("%s can not accept object in type %s"
 -> 1050 % (dataType, type(obj)))
    1051 
    1052 if isinstance(dataType, ArrayType):
 TypeError: StringType can not accept object in type <type 'float'>{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6574) Python Example sql.py not working in version 1.3

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6574:
---

Assignee: Apache Spark  (was: Davies Liu)

 Python Example sql.py not working in version 1.3
 

 Key: SPARK-6574
 URL: https://issues.apache.org/jira/browse/SPARK-6574
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
Reporter: Davies Liu
Assignee: Apache Spark
Priority: Blocker

 I downloaded spark version spark-1.3.0-bin-hadoop2.4.
 When the python version of sql.py is run the following error occurs:
 [root@nde-dev8-template python]#
 /root/spark-1.3.0-bin-hadoop2.4/bin/spark-submit sql.py
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Traceback (most recent call last):
   File "/root/spark-1.3.0-bin-hadoop2.4/examples/src/main/python/sql.py",
 line 22, in <module>
 from pyspark.sql import Row, StructField, StructType, StringType,
 IntegerType
 ImportError: cannot import name StructField
 --
 The sql.py version, spark-1.2.1-bin-hadoop2.4, does not throw the error:
 [root@nde-dev8-template python]#
 /root/spark-1.2.1-bin-hadoop2.4/bin/spark-submit sql.py
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 15/03/27 14:18:44 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 14:19:41 WARN ThreadLocalRandom: Failed to generate a seed from
 SecureRandom within 3 seconds. Not enough entrophy?
 root
  |-- age: integer (nullable = true)
  |-- name: string (nullable = true)
 root
  |-- person_name: string (nullable = false)
  |-- person_age: integer (nullable = false)
 root
  |-- age: integer (nullable = true)
  |-- name: string (nullable = true)
 Justin
 -
 The OS/JAVA environments are:
 OS: Linux nde-dev8-template 2.6.32-431.17.1.el6.x86_64 #1 SMP Fri Apr 11
 17:27:00 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
 JAVA: [root@nde-dev8-template bin]# java -version
 java version "1.7.0_51"
 Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
 The same error occurs when using bin/pyspark shell.
 >>> from pyspark.sql import StructField
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 ImportError: cannot import name StructField
 ---
 Any advice for resolving? Thanks in advance.
 Peter



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5493) Support proxy users under kerberos

2015-03-27 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384140#comment-14384140
 ] 

Marcelo Vanzin commented on SPARK-5493:
---

I'm not terribly familiar with how Oozie handles Spark, but Hive with 
impersonation enabled needs this.

 Support proxy users under kerberos
 --

 Key: SPARK-5493
 URL: https://issues.apache.org/jira/browse/SPARK-5493
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Brock Noland
Assignee: Marcelo Vanzin
 Fix For: 1.3.0


 When using kerberos, services may want to use spark-submit to submit jobs as 
 a separate user. For example a service like oozie might want to submit jobs 
 as a client user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5493) Support proxy users under kerberos

2015-03-27 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384145#comment-14384145
 ] 

Marcelo Vanzin commented on SPARK-5493:
---

Direct link:
https://github.com/apache/hive/blob/spark/spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java#L370

 Support proxy users under kerberos
 --

 Key: SPARK-5493
 URL: https://issues.apache.org/jira/browse/SPARK-5493
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Brock Noland
Assignee: Marcelo Vanzin
 Fix For: 1.3.0


 When using kerberos, services may want to use spark-submit to submit jobs as 
 a separate user. For example a service like oozie might want to submit jobs 
 as a client user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4660) JavaSerializer uses wrong classloader

2015-03-27 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384142#comment-14384142
 ] 

sam commented on SPARK-4660:


Furthermore it seems this issue is more likely to happen when I try to process 
more data.

 JavaSerializer uses wrong classloader
 -

 Key: SPARK-4660
 URL: https://issues.apache.org/jira/browse/SPARK-4660
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0, 1.1.1
Reporter: Piotr Kołaczkowski
Assignee: Piotr Kołaczkowski
Priority: Critical
 Fix For: 1.1.2, 1.2.1, 1.3.0

 Attachments: spark-serializer-classloader.patch


 During testing we found failures when trying to load some classes of the user 
 application:
 {noformat}
 ERROR 2014-11-29 20:01:56 org.apache.spark.storage.BlockManagerWorker: 
 Exception handling buffer message
 java.lang.ClassNotFoundException: 
 org.apache.spark.demo.HttpReceiverCases$HttpRequest
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:270)
   at 
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
   at 
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   at 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
   at 
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
   at 
 org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:126)
   at 
 org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:104)
   at org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:76)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:748)
   at 
 org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:639)
   at 
 org.apache.spark.storage.BlockManagerWorker.putBlock(BlockManagerWorker.scala:92)
   at 
 org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:73)
   at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
   at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
   at 
 org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at 
 org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
   at 
 org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:48)
   at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38)
   at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38)
   at 
 org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:682)
   at 
 org.apache.spark.network.ConnectionManager$$anon$10.run(ConnectionManager.scala:520)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Created] (SPARK-6574) Python Example sql.py not working in version 1.3

2015-03-27 Thread Davies Liu (JIRA)
Davies Liu created SPARK-6574:
-

 Summary: Python Example sql.py not working in version 1.3
 Key: SPARK-6574
 URL: https://issues.apache.org/jira/browse/SPARK-6574
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker


I downloaded spark version spark-1.3.0-bin-hadoop2.4.

When the python version of sql.py is run the following error occurs:

[root@nde-dev8-template python]#
/root/spark-1.3.0-bin-hadoop2.4/bin/spark-submit sql.py
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Traceback (most recent call last):
  File "/root/spark-1.3.0-bin-hadoop2.4/examples/src/main/python/sql.py",
line 22, in <module>
from pyspark.sql import Row, StructField, StructType, StringType,
IntegerType
ImportError: cannot import name StructField

--
The sql.py version, spark-1.2.1-bin-hadoop2.4, does not throw the error:

[root@nde-dev8-template python]#
/root/spark-1.2.1-bin-hadoop2.4/bin/spark-submit sql.py
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
15/03/27 14:18:44 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/03/27 14:19:41 WARN ThreadLocalRandom: Failed to generate a seed from
SecureRandom within 3 seconds. Not enough entrophy?
root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

root
 |-- person_name: string (nullable = false)
 |-- person_age: integer (nullable = false)

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

Justin


-

The OS/JAVA environments are:

OS: Linux nde-dev8-template 2.6.32-431.17.1.el6.x86_64 #1 SMP Fri Apr 11
17:27:00 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

JAVA: [root@nde-dev8-template bin]# java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

The same error occurs when using bin/pyspark shell.

>>> from pyspark.sql import StructField
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name StructField


---

Any advice for resolving? Thanks in advance.

Peter
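
A hedged workaround sketch for the ImportError above, assuming the classes simply 
moved to pyspark.sql.types in 1.3.0 while Row stayed in pyspark.sql (field names 
below are illustrative only):

{code}
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

# Build the same kind of schema sql.py constructs, using the 1.3 import locations.
schema = StructType([
    StructField("person_name", StringType(), False),
    StructField("person_age", IntegerType(), False),
])
print(schema)
{code}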




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6565) Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF

2015-03-27 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384161#comment-14384161
 ] 

Michael Armbrust commented on SPARK-6565:
-

It is not that it returns an RDD, it is that it takes an RDD of json data.  
Just like jsonFile does not return a file.
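
A small sketch of the naming point (names below are local to this example): jsonRDD 
consumes an RDD of JSON strings and returns a DataFrame.

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master="local[1]", appName="jsonRDD-naming")
sqlContext = SQLContext(sc)

json_rdd = sc.parallelize(['{"a": 1}', '{"a": 2}'])
df = sqlContext.jsonRDD(json_rdd)    # despite the name, the result is a DataFrame
df.printSchema()
sc.stop()
{code}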

 Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF
 --

 Key: SPARK-6565
 URL: https://issues.apache.org/jira/browse/SPARK-6565
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Priority: Minor

 Since 1.3.0, {{SQLContext.jsonRDD}} actually returns a {{DataFrame}}, so the 
 original name has become confusing. It would be better to deprecate it and add 
 {{jsonDataFrame}} or {{jsonDF}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6574) Python Example sql.py not working in version 1.3

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6574:
---

Assignee: Davies Liu  (was: Apache Spark)

 Python Example sql.py not working in version 1.3
 

 Key: SPARK-6574
 URL: https://issues.apache.org/jira/browse/SPARK-6574
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

 I downloaded spark version spark-1.3.0-bin-hadoop2.4.
 When the python version of sql.py is run the following error occurs:
 [root@nde-dev8-template python]#
 /root/spark-1.3.0-bin-hadoop2.4/bin/spark-submit sql.py
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Traceback (most recent call last):
   File "/root/spark-1.3.0-bin-hadoop2.4/examples/src/main/python/sql.py",
 line 22, in <module>
 from pyspark.sql import Row, StructField, StructType, StringType,
 IntegerType
 ImportError: cannot import name StructField
 --
 The sql.py version, spark-1.2.1-bin-hadoop2.4, does not throw the error:
 [root@nde-dev8-template python]#
 /root/spark-1.2.1-bin-hadoop2.4/bin/spark-submit sql.py
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 15/03/27 14:18:44 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 14:19:41 WARN ThreadLocalRandom: Failed to generate a seed from
 SecureRandom within 3 seconds. Not enough entrophy?
 root
  |-- age: integer (nullable = true)
  |-- name: string (nullable = true)
 root
  |-- person_name: string (nullable = false)
  |-- person_age: integer (nullable = false)
 root
  |-- age: integer (nullable = true)
  |-- name: string (nullable = true)
 Justin
 -
 The OS/JAVA environments are:
 OS: Linux nde-dev8-template 2.6.32-431.17.1.el6.x86_64 #1 SMP Fri Apr 11
 17:27:00 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
 JAVA: [root@nde-dev8-template bin]# java -version
 java version "1.7.0_51"
 Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
 The same error occurs when using bin/pyspark shell.
 >>> from pyspark.sql import StructField
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 ImportError: cannot import name StructField
 ---
 Any advice for resolving? Thanks in advance.
 Peter



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5493) Support proxy users under kerberos

2015-03-27 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384218#comment-14384218
 ] 

Brock Noland commented on SPARK-5493:
-

I don't know 100% how Oozie works, but I believe it submits a map-only task to do 
the actual job submission. The {{doas}} is done before submitting the map task, 
which then performs the job submission. In the Hive case we do not have this 
infrastructure, and adding it would introduce significant latency to HOS queries.

While {{HADOOP_PROXY_USER}} might work for testing, HOS will be used in 
production in the near future. This feature was created for those production 
use cases.
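
A minimal sketch (install path, master and user names are hypothetical) of how a 
service could hand a job to spark-submit on behalf of a client user with the 
--proxy-user option added by this issue, rather than exporting HADOOP_PROXY_USER:

{code}
import subprocess

cmd = [
    "/opt/spark/bin/spark-submit",   # hypothetical install location
    "--master", "yarn-cluster",
    "--proxy-user", "client_user",   # job is submitted as this user, not the service principal
    "my_job.py",                     # hypothetical application
]
subprocess.check_call(cmd)
{code}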

 Support proxy users under kerberos
 --

 Key: SPARK-5493
 URL: https://issues.apache.org/jira/browse/SPARK-5493
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Brock Noland
Assignee: Marcelo Vanzin
 Fix For: 1.3.0


 When using kerberos, services may want to use spark-submit to submit jobs as 
 a separate user. For example a service like hive might want to submit jobs as 
 a client user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6574) Python Example sql.py not working in version 1.3

2015-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384172#comment-14384172
 ] 

Apache Spark commented on SPARK-6574:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/5230

 Python Example sql.py not working in version 1.3
 

 Key: SPARK-6574
 URL: https://issues.apache.org/jira/browse/SPARK-6574
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

 I downloaded spark version spark-1.3.0-bin-hadoop2.4.
 When the python version of sql.py is run the following error occurs:
 [root@nde-dev8-template python]#
 /root/spark-1.3.0-bin-hadoop2.4/bin/spark-submit sql.py
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Traceback (most recent call last):
   File "/root/spark-1.3.0-bin-hadoop2.4/examples/src/main/python/sql.py",
 line 22, in <module>
 from pyspark.sql import Row, StructField, StructType, StringType,
 IntegerType
 ImportError: cannot import name StructField
 --
 The sql.py version, spark-1.2.1-bin-hadoop2.4, does not throw the error:
 [root@nde-dev8-template python]#
 /root/spark-1.2.1-bin-hadoop2.4/bin/spark-submit sql.py
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 15/03/27 14:18:44 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 14:19:41 WARN ThreadLocalRandom: Failed to generate a seed from
 SecureRandom within 3 seconds. Not enough entrophy?
 root
  |-- age: integer (nullable = true)
  |-- name: string (nullable = true)
 root
  |-- person_name: string (nullable = false)
  |-- person_age: integer (nullable = false)
 root
  |-- age: integer (nullable = true)
  |-- name: string (nullable = true)
 Justin
 -
 The OS/JAVA environments are:
 OS: Linux nde-dev8-template 2.6.32-431.17.1.el6.x86_64 #1 SMP Fri Apr 11
 17:27:00 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
 JAVA: [root@nde-dev8-template bin]# java -version
 java version "1.7.0_51"
 Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
 The same error occurs when using bin/pyspark shell.
 >>> from pyspark.sql import StructField
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 ImportError: cannot import name StructField
 ---
 Any advice for resolving? Thanks in advance.
 Peter



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5493) Support proxy users under kerberos

2015-03-27 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-5493:
-
Description: When using kerberos, services may want to use spark-submit to 
submit jobs as a separate user. For example a service like hive might want to 
submit jobs as a client user.  (was: When using kerberos, services may want to 
use spark-submit to submit jobs as a separate user. For example a service like 
oozie might want to submit jobs as a client user.)

 Support proxy users under kerberos
 --

 Key: SPARK-5493
 URL: https://issues.apache.org/jira/browse/SPARK-5493
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Brock Noland
Assignee: Marcelo Vanzin
 Fix For: 1.3.0


 When using kerberos, services may want to use spark-submit to submit jobs as 
 a separate user. For example a service like hive might want to submit jobs as 
 a client user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6572) When I build Spark 1.3 sbt gives me to following error : unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1

2015-03-27 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383999#comment-14383999
 ] 

Cheng Lian commented on SPARK-6572:
---

Would you please provide the exact command line you used to invoke SBT?

 When I build Spark 1.3 sbt gives me to following error: unresolved 
 dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found  
 org.scalamacros#quasiquotes_2.11;2.0.1: not found [error] Total time: 27 s, 
 completed 27-Mar-2015 14:24:39
 

 Key: SPARK-6572
 URL: https://issues.apache.org/jira/browse/SPARK-6572
 Project: Spark
  Issue Type: Bug
Reporter: Frank Domoney





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-03-27 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384038#comment-14384038
 ] 

Cody Koeninger commented on SPARK-6569:
---

I set it as warn because an empty batch can be the source of non-obvious 
problems that would be obscured if it was at the info level. Streams that don't 
get even one item during a batch are relatively rare for my use cases.
 
I don't feel super strongly about it, though, if there's a reason to reduce the 
log level.

 Kafka directInputStream logs what appear to be incorrect warnings
 -

 Key: SPARK-6569
 URL: https://issues.apache.org/jira/browse/SPARK-6569
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
 Environment: Spark 1.3.0
Reporter: Platon Potapov
Priority: Minor

 During what appears to be normal operation of streaming from a Kafka topic, 
 the following log records are observed, logged periodically:
 {code}
 [Stage 391:==  (3 + 0) / 
 4]
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 {code}
 * the part.fromOffset placeholder is not correctly substituted with a value
 * does the condition really mandate a warning being logged?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6572) When I build Spark 1.3 sbt gives me to following error : unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1

2015-03-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384037#comment-14384037
 ] 

Sean Owen commented on SPARK-6572:
--

It builds correctly for me in branch 1.3 with {{build/sbt -Pyarn -Phadoop-2.3 
assembly}}. [~Panzerfrank] that is not a URL; it's just the Maven coordinates 
being printed, and they are correct. Is your SBT somehow not picking up compiler 
plugins? That would cause this, I think. 

 When I build Spark 1.3 sbt gives me to following error: unresolved 
 dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found  
 org.scalamacros#quasiquotes_2.11;2.0.1: not found [error] Total time: 27 s, 
 completed 27-Mar-2015 14:24:39
 

 Key: SPARK-6572
 URL: https://issues.apache.org/jira/browse/SPARK-6572
 Project: Spark
  Issue Type: Bug
Reporter: Frank Domoney





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6572) When I build Spark 1.3 sbt gives me to following error : unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1

2015-03-27 Thread Frank Domoney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384006#comment-14384006
 ] 

Frank Domoney commented on SPARK-6572:
--

The correct URL for the Kafka artifact is kafka_2.11-0.8.1.1.


 When I build Spark 1.3 sbt gives me to following error: unresolved 
 dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found  
 org.scalamacros#quasiquotes_2.11;2.0.1: not found [error] Total time: 27 s, 
 completed 27-Mar-2015 14:24:39
 

 Key: SPARK-6572
 URL: https://issues.apache.org/jira/browse/SPARK-6572
 Project: Spark
  Issue Type: Bug
Reporter: Frank Domoney





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6576) DenseMatrix in PySpark should support indexing

2015-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384432#comment-14384432
 ] 

Apache Spark commented on SPARK-6576:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/5232
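
In the meantime, a workaround sketch (not the PR's implementation): a DenseMatrix 
can be materialized with toArray() and indexed as a NumPy array.

{code}
from pyspark.mllib.linalg import DenseMatrix

dm = DenseMatrix(2, 2, [0.0, 1.0, 2.0, 3.0])  # 2x2 matrix, values in column-major order
arr = dm.toArray()                            # copy out as a numpy ndarray
print(arr[0, 1])                              # element at row 0, column 1 -> 2.0
{code}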

 DenseMatrix in PySpark should support indexing
 --

 Key: SPARK-6576
 URL: https://issues.apache.org/jira/browse/SPARK-6576
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6576) DenseMatrix in PySpark should support indexing

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6576:
---

Assignee: (was: Apache Spark)

 DenseMatrix in PySpark should support indexing
 --

 Key: SPARK-6576
 URL: https://issues.apache.org/jira/browse/SPARK-6576
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4069) [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering itself from Yarn RM

2015-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4069:
---

Assignee: (was: Apache Spark)

 [SPARK-YARN] ApplicationMaster should release all executors' containers 
 before unregistering itself from Yarn RM
 

 Key: SPARK-4069
 URL: https://issues.apache.org/jira/browse/SPARK-4069
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Min Zhou

 Currently, the ApplicationMaster in YARN mode simply unregisters itself from the 
 YARN master, a.k.a. the ResourceManager. It never releases the executors' 
 containers before that. YARN's master will decide to kill all the executors' 
 containers when it faces such a scenario, so the ResourceManager log looks like 
 the following:
 {noformat}
 2014-10-22 23:39:09,903 DEBUG 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Processing event for appattempt_1414003182949_0004_01 of type UNREGISTERED
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1414003182949_0004_01 State change from RUNNING to FINAL_SAVING
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
 application application_1414003182949_0004 with final state: FINISHING
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
 application_1414003182949_0004 State change from RUNNING to FINAL_SAVING
 2014-10-22 23:39:09,903 DEBUG 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Processing event for appattempt_1414003182949_0004_01 of type 
 ATTEMPT_UPDATE_SAVED
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
 info for app: application_1414003182949_0004
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1414003182949_0004_01 State change from FINAL_SAVING to 
 FINISHING
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
 application_1414003182949_0004 State change from FINAL_SAVING to FINISHING
 2014-10-22 23:39:10,485 DEBUG 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Processing event for appattempt_1414003182949_0004_01 of type 
 CONTAINER_FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_1414003182949_0004_01_01 Container Transitioned from RUNNING to 
 COMPLETED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 Unregistering app attempt : appattempt_1414003182949_0004_01
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
 Completed container: container_1414003182949_0004_01_01 in state: 
 COMPLETED event:FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
  Finish information of container container_1414003182949_0004_01_01 is 
 written
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1414003182949_0004_01 State change from FINISHING to FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim   
 OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS  
 APPID=application_1414003182949_0004
 CONTAINERID=container_1414003182949_0004_01_01
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
 Stored the finish data of container container_1414003182949_0004_01_01
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
 Released container container_1414003182949_0004_01_01 of capacity 
 memory:3072, vCores:1 on host host1, which currently has 0 containers, 
 memory:0, vCores:0 used and memory:241901, vCores:32 available, release 
 resources=true
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
 application_1414003182949_0004 State change from FINISHING to FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
  Finish information of application attempt 
 appattempt_1414003182949_0004_01 is written
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim   
 OPERATION=Application 

[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)

2015-03-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384520#comment-14384520
 ] 

Steve Loughran commented on SPARK-6479:
---

Henry: utterly unrelated. I was merely offering to help define this API more 
formally and derive tests from it.

 Create off-heap block storage API (internal)
 

 Key: SPARK-6479
 URL: https://issues.apache.org/jira/browse/SPARK-6479
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Reporter: Reynold Xin
 Attachments: SparkOffheapsupportbyHDFS.pdf


 Would be great to create APIs for off-heap block stores, rather than doing a 
 bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)

2015-03-27 Thread Henry Saputra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384525#comment-14384525
 ] 

Henry Saputra commented on SPARK-6479:
--

@Steve: Ah cool, thanks for clarifying =)

 Create off-heap block storage API (internal)
 

 Key: SPARK-6479
 URL: https://issues.apache.org/jira/browse/SPARK-6479
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Reporter: Reynold Xin
 Attachments: SparkOffheapsupportbyHDFS.pdf


 Would be great to create APIs for off-heap block stores, rather than doing a 
 bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


