[jira] [Updated] (SPARK-13342) Cannot run INSERT statements in Spark

2016-02-16 Thread neo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

neo updated SPARK-13342:

Description: 
I cannot run an INSERT statement using spark-sql. I tried both versions 1.5.1 
and 1.6.0 without any luck, but it runs fine in Hive.

These are the steps I took.

1) Launch hive and create the table / insert a record.

create database test
use test
CREATE TABLE stgTable
(
sno string,
total bigint
);

INSERT INTO TABLE stgTable VALUES ('12',12)

2) Launch spark-sql (1.5.1 or 1.6.0)
3) Try inserting a record from the shell
INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1

I got this error message 
"Invalid method name: 'alter_table_with_cascade'"

I tried changing the Hive version inside the spark-sql shell with the SET 
command, from

SET spark.sql.hive.version=1.2.1  (the default setting for my Spark installation)

to

SET spark.sql.hive.version=0.14.0

but that did not help either.
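
A note on that workaround (an assumption about the cause, not a confirmed 
diagnosis): this Thrift error usually means the Hive client Spark uses (1.2.1 
here) is newer than the metastore service it talks to, and spark.sql.hive.version 
is essentially informational. The metastore client is selected with 
spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars, which have to 
be set before the HiveContext is created rather than with SET inside the shell. A 
minimal sketch, assuming the metastore really is Hive 0.14.0 (the app name and the 
"maven" jars setting are placeholders for the example):

{code}
// Sketch only: pick the metastore client version at startup
// (spark-defaults.conf or --conf spark.sql.hive.metastore.version=0.14.0 works equally well).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("insert-test")
  .set("spark.sql.hive.metastore.version", "0.14.0")  // match the running metastore
  .set("spark.sql.hive.metastore.jars", "maven")      // or a classpath containing Hive 0.14 jars
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

hiveContext.sql("USE test")
hiveContext.sql("INSERT INTO TABLE stgTable SELECT 'sno2', 224 FROM stgTable LIMIT 1")
{code}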



  was:
I cannot run a INSERT statement using spark-sql. I tried with both versions 
1.5.1 and 1.6.0 without any luck. But it runs ok on hive.

These are the steps I took.

1) Launch hive and create the table / insert a record.

create database test
use test
CREATE TABLE stgTable
(
sno string,
total bigint
);

INSERT INTO table stgTable SELECT 'sno1',124 from stgTable limit 1

2) Launch spark-sql (1.5.1 or 1.6.0)
3) Try inserting a record from the shell
INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1

I got this error message 
"Invalid method name: 'alter_table_with_cascade'"

I tried changing the hive version inside the spark-sql shell  using
from
SET spark.sql.hive.version=1.2.1  (this is the default setting for my spark 
installation)
to
SET spark.sql.hive.version=0.14.0

but that did not help either




> Cannot run INSERT statements in Spark
> -
>
> Key: SPARK-13342
> URL: https://issues.apache.org/jira/browse/SPARK-13342
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1, 1.6.0
>Reporter: neo
>
> I cannot run a INSERT statement using spark-sql. I tried with both versions 
> 1.5.1 and 1.6.0 without any luck. But it runs ok on hive.
> These are the steps I took.
> 1) Launch hive and create the table / insert a record.
> create database test
> use test
> CREATE TABLE stgTable
> (
> sno string,
> total bigint
> );
> INSERT INTO TABLE stgTable VALUES ('12',12)
> 2) Launch spark-sql (1.5.1 or 1.6.0)
> 3) Try inserting a record from the shell
> INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1
> I got this error message 
> "Invalid method name: 'alter_table_with_cascade'"
> I tried changing the hive version inside the spark-sql shell  using
> from
> SET spark.sql.hive.version=1.2.1  (this is the default setting for my spark 
> installation)
> to
> SET spark.sql.hive.version=0.14.0
> but that did not help either






[jira] [Updated] (SPARK-13342) Cannot run INSERT statements in Spark

2016-02-16 Thread neo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

neo updated SPARK-13342:

Description: 
I cannot run an INSERT statement using spark-sql. I tried both versions 1.5.1 
and 1.6.0 without any luck, but it runs fine in Hive.

These are the steps I took.

1) Launch hive and create the table / insert a record.

create database test
use test
CREATE TABLE stgTable
(
sno string,
total bigint
);

INSERT INTO TABLE stgTable VALUES ('12',12)

2) Launch spark-sql (1.5.1 or 1.6.0)
3) Try inserting a record from the shell
INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1

I got this error message 
"Invalid method name: 'alter_table_with_cascade'"

I tried changing the Hive version inside the spark-sql shell with the SET 
command, from

SET spark.sql.hive.version=1.2.1  (the default setting for my Spark installation)

to

SET spark.sql.hive.version=0.14.0

but that did not help either.



  was:
I cannot run a INSERT statement using spark-sql. I tried with both versions 
1.5.1 and 1.6.0 without any luck. But it runs ok on hive.

These are the steps I took.

1) Launch hive and create the table / insert a record.

create database test
use test
CREATE TABLE stgTable
(
sno string,
total bigint
);

INSERT INTO TABLE stgTable VALUES ('12',12)

2) Launch spark-sql (1.5.1 or 1.6.0)
3) Try inserting a record from the shell
INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1

I got this error message 
"Invalid method name: 'alter_table_with_cascade'"

I tried changing the hive version inside the spark-sql shell  using
from
SET spark.sql.hive.version=1.2.1  (this is the default setting for my spark 
installation)
to
SET spark.sql.hive.version=0.14.0

but that did not help either




> Cannot run INSERT statements in Spark
> -
>
> Key: SPARK-13342
> URL: https://issues.apache.org/jira/browse/SPARK-13342
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1, 1.6.0
>Reporter: neo
>
> I cannot run a INSERT statement using spark-sql. I tried with both versions 
> 1.5.1 and 1.6.0 without any luck. But it runs ok on hive.
> These are the steps I took.
> 1) Launch hive and create the table / insert a record.
> create database test
> use test
> CREATE TABLE stgTable
> (
> sno string,
> total bigint
> );
> INSERT INTO TABLE stgTable VALUES ('12',12)
> 2) Launch spark-sql (1.5.1 or 1.6.0)
> 3) Try inserting a record from the shell
> INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1
> I got this error message 
> "Invalid method name: 'alter_table_with_cascade'"
> I tried changing the hive version inside the spark-sql shell  using SET 
> command.
> I changed the hive version
> from
> SET spark.sql.hive.version=1.2.1  (this is the default setting for my spark 
> installation)
> to
> SET spark.sql.hive.version=0.14.0
> but that did not help either






[jira] [Updated] (SPARK-13342) Cannot run INSERT statements in Spark

2016-02-16 Thread neo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

neo updated SPARK-13342:

Description: 
I cannot run an INSERT statement using spark-sql. I tried both versions 1.5.1 
and 1.6.0 without any luck, but it runs fine in Hive.

These are the steps I took.

1) Launch hive and create the table / insert a record.

create database test
use test
CREATE TABLE stgTable
(
sno string,
total bigint
);

INSERT INTO table stgTable SELECT 'sno1',124 from stgTable limit 1

2) Launch spark-sql (1.5.1 or 1.6.0)
3) Try inserting a record from the shell
INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1

I got this error message 
"Invalid method name: 'alter_table_with_cascade'"

I tried changing the Hive version inside the spark-sql shell with the SET 
command, from

SET spark.sql.hive.version=1.2.1  (the default setting for my Spark installation)

to

SET spark.sql.hive.version=0.14.0

but that did not help either.



  was:
I cannot run a INSERT statement using spark-sql. I tried with both versions 
1.5.1 and 1.6.0 without any luck. But it runs ok on hive.

These are the steps I took.

1) Launch hive and Create the table  and insert a record.

create database test
use test
CREATE TABLE stgTable
(
sno string,
total bigint
);

INSERT INTO table stgTable SELECT 'sno1',124 from stgTable limit 1

2) Launch spark-sql (1.5.1 or 1.6.0)
3) Try inserting a record from the shell
INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1

I got this error message 
"Invalid method name: 'alter_table_with_cascade'"

I tried changing the hive version inside the spark-sql shell  using
from
SET spark.sql.hive.version=1.2.1  (this is the default setting for my spark 
installation)
to
SET spark.sql.hive.version=0.14.0

but that did not help either




> Cannot run INSERT statements in Spark
> -
>
> Key: SPARK-13342
> URL: https://issues.apache.org/jira/browse/SPARK-13342
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1, 1.6.0
>Reporter: neo
>
> I cannot run a INSERT statement using spark-sql. I tried with both versions 
> 1.5.1 and 1.6.0 without any luck. But it runs ok on hive.
> These are the steps I took.
> 1) Launch hive and create the table / insert a record.
> create database test
> use test
> CREATE TABLE stgTable
> (
> sno string,
> total bigint
> );
> INSERT INTO table stgTable SELECT 'sno1',124 from stgTable limit 1
> 2) Launch spark-sql (1.5.1 or 1.6.0)
> 3) Try inserting a record from the shell
> INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1
> I got this error message 
> "Invalid method name: 'alter_table_with_cascade'"
> I tried changing the hive version inside the spark-sql shell  using
> from
> SET spark.sql.hive.version=1.2.1  (this is the default setting for my spark 
> installation)
> to
> SET spark.sql.hive.version=0.14.0
> but that did not help either






[jira] [Created] (SPARK-13342) Cannot run INSERT statements in Spark

2016-02-16 Thread neo (JIRA)
neo created SPARK-13342:
---

 Summary: Cannot run INSERT statements in Spark
 Key: SPARK-13342
 URL: https://issues.apache.org/jira/browse/SPARK-13342
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.0, 1.5.1
Reporter: neo


I cannot run an INSERT statement using spark-sql. I tried both versions 1.5.1 
and 1.6.0 without any luck, but it runs fine in Hive.

These are the steps I took.

1) Launch Hive, create the table, and insert a record.

create database test
use test
CREATE TABLE stgTable
(
sno string,
total bigint
);

INSERT INTO table stgTable SELECT 'sno1',124 from stgTable limit 1

2) Launch spark-sql (1.5.1 or 1.6.0)
3) Try inserting a record from the shell
INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1

I got this error message 
"Invalid method name: 'alter_table_with_cascade'"

I tried changing the Hive version inside the spark-sql shell with the SET 
command, from

SET spark.sql.hive.version=1.2.1  (the default setting for my Spark installation)

to

SET spark.sql.hive.version=0.14.0

but that did not help either.








[jira] [Commented] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-02-16 Thread Qi Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148937#comment-15148937
 ] 

Qi Dai commented on SPARK-13289:


Yes, just download the data from the URL, unzip it, and change the path to the 
training data folder. Then step through the rest of the code and the issue should 
be reproducible. (It's probably better to run on a cluster because the dataset is 
big.)
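
For anyone reproducing this, a small diagnostic sketch (an addition here, not code 
from the report) that checks whether the learned vectors themselves already 
contain non-finite values, which would explain the Infinity similarities returned 
by findSynonyms:

{code}
import org.apache.spark.mllib.feature.Word2VecModel

// List vocabulary entries whose learned vector contains NaN or Infinity.
// `model` is the Word2VecModel produced by word2vec.fit(text) in the quoted snippet below.
def nonFiniteWords(model: Word2VecModel): Iterable[String] =
  model.getVectors.filter { case (_, vec) =>
    vec.exists(v => java.lang.Float.isNaN(v) || java.lang.Float.isInfinite(v))
  }.keys

// usage: nonFiniteWords(model).take(20).foreach(println)
{code}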

> Word2Vec generate infinite distances when numIterations>5
> -
>
> Key: SPARK-13289
> URL: https://issues.apache.org/jira/browse/SPARK-13289
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux, Scala
>Reporter: Qi Dai
>  Labels: features
>
> I recently ran some word2vec experiments on a cluster with 50 executors on 
> some large text dataset but find out that when number of iterations is larger 
> than 5 the distance between words will be all infinite. My code looks like 
> this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" 
> ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new 
> Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are: 
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.






[jira] [Commented] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-02-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148923#comment-15148923
 ] 

Sean Owen commented on SPARK-13289:
---

That's not usually how it works; I wouldn't expect others to debug for you. 
Just step through the code?


> Word2Vec generate infinite distances when numIterations>5
> -
>
> Key: SPARK-13289
> URL: https://issues.apache.org/jira/browse/SPARK-13289
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux, Scala
>Reporter: Qi Dai
>  Labels: features
>
> I recently ran some word2vec experiments on a cluster with 50 executors on 
> some large text dataset but find out that when number of iterations is larger 
> than 5 the distance between words will be all infinite. My code looks like 
> this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" 
> ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new 
> Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are: 
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.






[jira] [Commented] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-02-16 Thread Qi Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148920#comment-15148920
 ] 

Qi Dai commented on SPARK-13289:


I'm not familiar with the algorithm and implementation. Maybe we need to wait 
for other people in the community who were involved in the implementation to 
take a look at the issue.

> Word2Vec generate infinite distances when numIterations>5
> -
>
> Key: SPARK-13289
> URL: https://issues.apache.org/jira/browse/SPARK-13289
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux, Scala
>Reporter: Qi Dai
>  Labels: features
>
> I recently ran some word2vec experiments on a cluster with 50 executors on 
> some large text dataset but find out that when number of iterations is larger 
> than 5 the distance between words will be all infinite. My code looks like 
> this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" 
> ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new 
> Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are: 
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.






[jira] [Created] (SPARK-13341) Casting Unix timestamp to SQL timestamp fails

2016-02-16 Thread William Dee (JIRA)
William Dee created SPARK-13341:
---

 Summary: Casting Unix timestamp to SQL timestamp fails
 Key: SPARK-13341
 URL: https://issues.apache.org/jira/browse/SPARK-13341
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: William Dee


The way that unix timestamp casting is handled has been broken between Spark 
1.5.2 and Spark 1.6.0. This can be easily demonstrated via the spark-shell:

{code:title=1.5.2}
scala> sqlContext.sql("SELECT CAST(145558084 AS TIMESTAMP) as ts, 
CAST(CAST(145558084 AS TIMESTAMP) AS DATE) as d").show
++--+
|  ts| d|
++--+
|2016-02-16 00:00:...|2016-02-16|
++--+
{code}

{code:title=1.6.0}
scala> sqlContext.sql("SELECT CAST(145558084 AS TIMESTAMP) as ts, 
CAST(CAST(145558084 AS TIMESTAMP) AS DATE) as d").show
++--+
|  ts| d|
++--+
|48095-07-09 12:06...|095-07-09|
++--+
{code}

I'm not sure exactly what is causing this, but the defect was definitely 
introduced in Spark 1.6.0: jobs that relied on this functionality ran on 1.5.2 
and now fail on 1.6.0.
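
A possible interim workaround, offered as a sketch under the assumption that only 
the direct integer-to-TIMESTAMP cast regressed (not a confirmed fix): go through 
from_unixtime, which interprets its argument as seconds since the epoch, and cast 
its string result.

{code}
// Sketch only, run from spark-shell like the examples above.
sqlContext.sql(
  """SELECT CAST(from_unixtime(145558084) AS TIMESTAMP) AS ts,
    |       CAST(from_unixtime(145558084) AS DATE) AS d""".stripMargin).show()
{code}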






[jira] [Commented] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-02-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148876#comment-15148876
 ] 

Sean Owen commented on SPARK-13289:
---

I haven't run it. It'd be great if you could run with this and propose a fix.

> Word2Vec generate infinite distances when numIterations>5
> -
>
> Key: SPARK-13289
> URL: https://issues.apache.org/jira/browse/SPARK-13289
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux, Scala
>Reporter: Qi Dai
>  Labels: features
>
> I recently ran some word2vec experiments on a cluster with 50 executors on 
> some large text dataset but find out that when number of iterations is larger 
> than 5 the distance between words will be all infinite. My code looks like 
> this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" 
> ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new 
> Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are: 
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.






[jira] [Comment Edited] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-02-16 Thread Qi Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148872#comment-15148872
 ] 

Qi Dai edited comment on SPARK-13289 at 2/16/16 4:35 PM:
-

Hi Sean, are you able to reproduce the issue? Do you need any other details? 
(Maybe the reply I made a few days ago didn't reach you; if so, could you take a 
look at the details I provided below?) I tried some other parameters. It looks 
like it's more likely to fail with a larger dataset, more partitions, and more 
iterations.


was (Author: daiqi5477):
Hi Sean, Are you able to reproduce the issue? Do you need any other details? I 
tried some other parameters. It looks like it's more likely to fail with larger 
dataset, more partitions and more iterations. 

> Word2Vec generate infinite distances when numIterations>5
> -
>
> Key: SPARK-13289
> URL: https://issues.apache.org/jira/browse/SPARK-13289
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux, Scala
>Reporter: Qi Dai
>  Labels: features
>
> I recently ran some word2vec experiments on a cluster with 50 executors on 
> some large text dataset but find out that when number of iterations is larger 
> than 5 the distance between words will be all infinite. My code looks like 
> this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" 
> ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new 
> Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are: 
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.






[jira] [Commented] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-02-16 Thread Qi Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148872#comment-15148872
 ] 

Qi Dai commented on SPARK-13289:


Hi Sean, Are you able to reproduce the issue? Do you need any other details? I 
tried some other parameters. It looks like it's more likely to fail with larger 
dataset, more partitions and more iterations. 

> Word2Vec generate infinite distances when numIterations>5
> -
>
> Key: SPARK-13289
> URL: https://issues.apache.org/jira/browse/SPARK-13289
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux, Scala
>Reporter: Qi Dai
>  Labels: features
>
> I recently ran some word2vec experiments on a cluster with 50 executors on 
> some large text dataset but find out that when number of iterations is larger 
> than 5 the distance between words will be all infinite. My code looks like 
> this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" 
> ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new 
> Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are: 
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.






[jira] [Commented] (SPARK-4224) Support group acls

2016-02-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148796#comment-15148796
 ] 

Sean Owen commented on SPARK-4224:
--

I'd encourage anyone to read 
http://www.joelonsoftware.com/items/2012/07/09.html starting especially around 
"feature backlogs". I think Spark suffers badly from this problem, since a lot 
of well-meaning contributors (this is not directed at you or anyone in 
particular) have never experienced this and tend to think filing JIRAs is 
actually a contribution in itself.

If something's on a to-do list that long, it's not that important, and by 
definition isn't really something worthy of a list of to-dos that you share and 
plan around. That is, if you (or anyone else) haven't bothered in 16 months, 
why now? I find keeping things open merely seductive -- maybe someone will work 
on it for us? we can't seriously not do this, right? -- but actually damaging. 
The flip side is that someone might take from this that group ACLs will be 
implemented, when the evidence is that they will not be. It suggests someone 
should open a PR for it, when I don't know whether anyone would review it.

I'd prefer to leave it, go investigate why it hasn't been prioritized, and then 
decide to prioritize it -- and then reopen it. But, this exchange is more than 
enough for the moment.

> Support group acls
> --
>
> Key: SPARK-4224
> URL: https://issues.apache.org/jira/browse/SPARK-4224
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>
> Currently we support view and modify acls but you have to specify a list of 
> users. It would be nice to also support groups, so that anyone in the group 
> has permissions.






[jira] [Commented] (SPARK-2421) Spark should treat writable as serializable for keys

2016-02-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148776#comment-15148776
 ] 

Sean Owen commented on SPARK-2421:
--

Pretty much the same reason in all cases: no activity in 16 months, a 
nice-to-have (i.e. not a known bug), nobody asking for it or putting any work 
into it, and no reason to expect activity. It's more fruitful to reflect reality 
-- and then, if desired, ask: why is nobody (like yourself) working on it if you 
are interested?

Or: reopen it but please, only if there is a good-faith reason to expect 
someone will work on it imminently. Remember, things can be reopened later. 
Worst case, new issues can be opened. We can also make a different resolution 
like "Later" if people find that softer.

> Spark should treat writable as serializable for keys
> 
>
> Key: SPARK-2421
> URL: https://issues.apache.org/jira/browse/SPARK-2421
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.0.0
>Reporter: Xuefu Zhang
>
> It seems that Spark requires the key be serializable (class implement 
> Serializable interface). In Hadoop world, Writable interface is used for the 
> same purpose. A lot of existing classes, while writable, are not considered 
> by Spark as Serializable. It would be nice if Spark can treate Writable as 
> serializable and automatically serialize and de-serialize these classes using 
> writable interface.
> This is identified in HIVE-7279, but its benefits are seen global.
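
The workaround pattern usually applied to the request quoted above, sketched 
purely as an illustration (this is not Spark's API, and the wrapper name is made 
up for the example): a Serializable holder that delegates Java serialization to 
the Writable's own write/readFields.

{code}
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.io.Writable

// Hypothetical wrapper: Java serialization is routed through the Writable protocol,
// so Hadoop key types can cross Spark's serialization boundary.
class WritableWrapper[T <: Writable](@transient var value: T) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.writeUTF(value.getClass.getName)  // remember the concrete Writable class
    value.write(out)                      // the Writable serializes its own fields
  }
  private def readObject(in: ObjectInputStream): Unit = {
    value = Class.forName(in.readUTF()).newInstance().asInstanceOf[T]
    value.readFields(in)                  // rebuild the Writable from the stream
  }
}
{code}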






[jira] [Comment Edited] (SPARK-13317) SPARK_LOCAL_IP does not bind to public IP on Slaves

2016-02-16 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148705#comment-15148705
 ] 

Christopher Bourez edited comment on SPARK-13317 at 2/16/16 2:54 PM:
-

I'm trying my best, for the second time, but when I specify the public IP with 
{code}SPARK_PUBLIC_IP{code} in spark-env.sh and restart, I get an error during 
Spark context initialization in spark-shell:

{code}
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 ERROR SparkContext: Error initializing SparkContext.
java.net.BindException:
{code}

Do you have any clue ?
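
One way to narrow this down (a diagnostic sketch added here, not something from 
the thread): check whether the address configured in SPARK_LOCAL_IP / 
SPARK_PUBLIC_IP is one the host can actually bind. On EC2 the public IP is 
normally not assigned to any local interface (only the private IP is), so trying 
to bind it produces exactly this kind of repeated bind failure; the public 
address usually belongs in SPARK_PUBLIC_DNS instead.

{code}
import java.net.{InetAddress, ServerSocket}

// Returns true if this host can open a listening socket on the given address.
def canBind(address: String): Boolean =
  try {
    val socket = new ServerSocket(0, 1, InetAddress.getByName(address)) // port 0 = any free port
    socket.close()
    true
  } catch {
    case _: java.io.IOException => false
  }

// e.g. canBind("<private ip>") is expected to be true, canBind("<public ip>") false on EC2
{code}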


was (Author: christopher5106):
I'm trying my best, second time, but when I specify the public IP with 
{code}SPARK_PUBLIC_ID{code} I get an error during spark context initialization 
in spark-shell : 

{code}
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 ERROR SparkContext: Error initializing SparkContext.
java.net.BindException:
{code}

Do you have any clue ?

> SPARK_LOCAL_IP does not bind to public IP on Slaves
> ---
>
> Key: SPARK-13317
> URL: https://issues.apache.org/jira/browse/SPARK-13317
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, EC2
> Environment: Linux EC2, different VPC 
>Reporter: Christopher Bourez
>Priority: Minor
>
> SPARK_LOCAL_IP does not bind to the provided IP on slaves.
> When launching a job or a spark-shell from a second network, the returned IP 
> for the slave is still the first IP of the slave. 
> So the job fails with the message : 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> It is not a question of resources but the driver which cannot connect to the 
> slave 

[jira] [Commented] (SPARK-13317) SPARK_LOCAL_IP does not bind to public IP on Slaves

2016-02-16 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148705#comment-15148705
 ] 

Christopher Bourez commented on SPARK-13317:


I'm trying my best, second time, but when I specify the public IP with 
{code}SPARK_PUBLIC_ID{code} I get an error during spark context initialization 
in spark-shell : 

{code}
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 WARN Utils: Service 'sparkDriver' could not bind on port 0. 
Attempting port 1.
16/02/16 14:40:51 ERROR SparkContext: Error initializing SparkContext.
java.net.BindException:
{code}

Do you have any clue ?

> SPARK_LOCAL_IP does not bind to public IP on Slaves
> ---
>
> Key: SPARK-13317
> URL: https://issues.apache.org/jira/browse/SPARK-13317
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, EC2
> Environment: Linux EC2, different VPC 
>Reporter: Christopher Bourez
>Priority: Minor
>
> SPARK_LOCAL_IP does not bind to the provided IP on slaves.
> When launching a job or a spark-shell from a second network, the returned IP 
> for the slave is still the first IP of the slave. 
> So the job fails with the message : 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> It is not a question of resources but the driver which cannot connect to the 
> slave given the wrong IP.






[jira] [Commented] (SPARK-13340) [ML] PolynomialExpansion and Normalizer should validate input type

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148696#comment-15148696
 ] 

Apache Spark commented on SPARK-13340:
--

User 'grzegorz-chilkiewicz' has created a pull request for this issue:
https://github.com/apache/spark/pull/11218

> [ML] PolynomialExpansion and Normalizer should validate input type
> --
>
> Key: SPARK-13340
> URL: https://issues.apache.org/jira/browse/SPARK-13340
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Trivial
>
> PolynomialExpansion and Normalizer should override 
> UnaryTransformer::validateInputType
> Now, in case of trying to operate on String column:
> java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.mllib.linalg.Vector
> is thrown, but:
> java.lang.IllegalArgumentException: requirement failed: Input type must be 
> VectorUDT but got StringType
> will be more clear and adequate






[jira] [Assigned] (SPARK-13340) [ML] PolynomialExpansion and Normalizer should validate input type

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13340:


Assignee: (was: Apache Spark)

> [ML] PolynomialExpansion and Normalizer should validate input type
> --
>
> Key: SPARK-13340
> URL: https://issues.apache.org/jira/browse/SPARK-13340
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Trivial
>
> PolynomialExpansion and Normalizer should override 
> UnaryTransformer::validateInputType
> Now, in case of trying to operate on String column:
> java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.mllib.linalg.Vector
> is thrown, but:
> java.lang.IllegalArgumentException: requirement failed: Input type must be 
> VectorUDT but got StringType
> will be more clear and adequate






[jira] [Commented] (SPARK-2421) Spark should treat writable as serializable for keys

2016-02-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148684#comment-15148684
 ] 

Xuefu Zhang commented on SPARK-2421:


[~sowen], I saw you had closed this without giving any explanation. Do you mind 
sharing?

> Spark should treat writable as serializable for keys
> 
>
> Key: SPARK-2421
> URL: https://issues.apache.org/jira/browse/SPARK-2421
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.0.0
>Reporter: Xuefu Zhang
>
> It seems that Spark requires the key be serializable (class implement 
> Serializable interface). In Hadoop world, Writable interface is used for the 
> same purpose. A lot of existing classes, while writable, are not considered 
> by Spark as Serializable. It would be nice if Spark can treate Writable as 
> serializable and automatically serialize and de-serialize these classes using 
> writable interface.
> This is identified in HIVE-7279, but its benefits are seen global.






[jira] [Commented] (SPARK-4224) Support group acls

2016-02-16 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148682#comment-15148682
 ] 

Thomas Graves commented on SPARK-4224:
--

Ok, I understand closing older things, but in this instance I would like it to 
stay open. It would also be nice if you put a comment in there stating why it was 
closed.

It is still on our list of to-dos and it's a feature that I would definitely like 
in Spark. It makes certain things much easier for organizations, and if you look 
at many other open source products (Hadoop, Storm, etc.) they all support group 
ACLs.

When you work in teams (with 10-30 people) it's much easier to just add groups to 
the ACLs than to list out individual users.

> Support group acls
> --
>
> Key: SPARK-4224
> URL: https://issues.apache.org/jira/browse/SPARK-4224
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>
> Currently we support view and modify acls but you have to specify a list of 
> users. It would be nice to also support groups, so that anyone in the group 
> has permissions.






[jira] [Commented] (SPARK-5377) Dynamically add jar into Spark Driver's classpath.

2016-02-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148675#comment-15148675
 ] 

Xuefu Zhang commented on SPARK-5377:


[~sowen], I saw you had closed this without giving any explanation. Do you mind 
sharing?

> Dynamically add jar into Spark Driver's classpath.
> --
>
> Key: SPARK-5377
> URL: https://issues.apache.org/jira/browse/SPARK-5377
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Chengxiang Li
>
> Spark support dynamically add jar to executor classpath through 
> SparkContext::addJar(), while it does not support dynamically add jar into 
> driver classpath. In most case(if not all the case), user dynamically add jar 
> with SparkContext::addJar()  because some classes from the jar would be 
> referred in upcoming Spark job, which means the classes need to be loaded in 
> Spark driver side either,e.g during serialization. I think it make sense to 
> add an API to add jar into driver classpath, or just make it available in 
> SparkContext::addJar(). HIVE-9410 is a real case from Hive on Spark.






[jira] [Commented] (SPARK-12316) Stack overflow with endless call of `Delegation token thread` when application end.

2016-02-16 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148662#comment-15148662
 ] 

Thomas Graves commented on SPARK-12316:
---

I'm not following how this ended up in an infinite loop.  Can you please 
describe exactly what you are seeing?

For instance, shutdown is happening and you happen to hit 
updateCredentialsIfRequired. But if the file isn't found you would get an 
exception and fall back to scheduling it an hour later in the NonFatal catch. If 
stop was already called then delegationTokenRenewer.shutdown() should have 
happened, and I assume schedule would have thrown (perhaps I'm wrong here).

 

> Stack overflow with endless call of `Delegation token thread` when 
> application end.
> ---
>
> Key: SPARK-12316
> URL: https://issues.apache.org/jira/browse/SPARK-12316
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Assignee: SaintBacchus
> Attachments: 20151210045149.jpg, 20151210045533.jpg
>
>
> When the application ends, the AM will clean the staging dir.
> But if the driver then triggers a delegation token update, it can't find the 
> right token file and endlessly calls the method 'updateCredentialsIfRequired'.
> This leads to a StackOverflowError.
> !https://issues.apache.org/jira/secure/attachment/12779495/20151210045149.jpg!
> !https://issues.apache.org/jira/secure/attachment/12779496/20151210045533.jpg!






[jira] [Commented] (SPARK-4224) Support group acls

2016-02-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148653#comment-15148653
 ] 

Sean Owen commented on SPARK-4224:
--

No follow up in 16 months. I realize that doesn't automatically qualify 
something for closing, but since I've seen no other mention of this or demand 
for this enhancement, and no reason to expect activity, I prefer to reflect 
reality here. Of course, it can be reopened if someone wants to implement it. 
And, the existence of dupes have never stopped anyone from making new JIRAs.

[~tgraves] I won't get into a Reopen/Close battle here of course, but unless 
you intend to work on this, I strongly disagree with reopening it.

> Support group acls
> --
>
> Key: SPARK-4224
> URL: https://issues.apache.org/jira/browse/SPARK-4224
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>
> Currently we support view and modify acls but you have to specify a list of 
> users. It would be nice to also support groups, so that anyone in the group 
> has permissions.






[jira] [Reopened] (SPARK-4224) Support group acls

2016-02-16 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reopened SPARK-4224:
--

> Support group acls
> --
>
> Key: SPARK-4224
> URL: https://issues.apache.org/jira/browse/SPARK-4224
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>
> Currently we support view and modify acls but you have to specify a list of 
> users. It would be nice to also support groups, so that anyone in the group 
> has permissions.






[jira] [Commented] (SPARK-4224) Support group acls

2016-02-16 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148627#comment-15148627
 ] 

Thomas Graves commented on SPARK-4224:
--

why did you close this?

> Support group acls
> --
>
> Key: SPARK-4224
> URL: https://issues.apache.org/jira/browse/SPARK-4224
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>
> Currently we support view and modify acls but you have to specify a list of 
> users. It would be nice to also support groups, so that anyone in the group 
> has permissions.






[jira] [Created] (SPARK-13340) [ML] PolynomialExpansion and Normalizer should validate input type

2016-02-16 Thread Grzegorz Chilkiewicz (JIRA)
Grzegorz Chilkiewicz created SPARK-13340:


 Summary: [ML] PolynomialExpansion and Normalizer should validate 
input type
 Key: SPARK-13340
 URL: https://issues.apache.org/jira/browse/SPARK-13340
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 1.6.0, 2.0.0
Reporter: Grzegorz Chilkiewicz
Priority: Trivial


PolynomialExpansion and Normalizer should override 
UnaryTransformer::validateInputType

Currently, when trying to operate on a String column:
java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.mllib.linalg.Vector
is thrown, but:
java.lang.IllegalArgumentException: requirement failed: Input type must be 
VectorUDT but got StringType
would be clearer and more appropriate.
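
A minimal sketch of the kind of check being proposed (an assumption about the 
shape of the fix, not the actual patch):

{code}
import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.types.DataType

// Fail fast with a readable message instead of a late ClassCastException; this is
// the check PolynomialExpansion and Normalizer could perform in validateInputType.
def validateVectorInput(inputType: DataType): Unit = {
  require(inputType.isInstanceOf[VectorUDT],
    s"Input type must be VectorUDT but got $inputType.")
}
{code}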






[jira] [Resolved] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13198.
---
Resolution: Not A Problem

I don't think this is a problem since the intended usage is 1 context == 1 JVM. 
The JVM terminates when it finishes. (Still if you found some obvious, cheap 
way to release whatever is referencing this memory, it might not hurt to make 
that change.)
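
For reference, the intended pattern described above, sketched against the 
reproduction code from the report (same placeholder master URL and path): create 
the context once, reuse it for every iteration, and stop it when the JVM is done.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SingleContextExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("MASTER_URL").setAppName("")
    conf.set("spark.mesos.coarse", "true")
    conf.set("spark.cores.max", "10")
    val sc = new SparkContext(conf)          // one context per JVM
    val sqlContext = new SQLContext(sc)
    for (i <- 1 until 100) {                 // reuse the same context each time
      val events = sqlContext.read.parquet("hdfs://locahost/tmp/something")
      println(s"Iteration ($i), number of events: " + events.count)
    }
    sc.stop()                                // stop once, at the end
  }
}
{code}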

> sc.stop() does not clean up on driver, causes Java heap OOM.
> 
>
> Key: SPARK-13198
> URL: https://issues.apache.org/jira/browse/SPARK-13198
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Herman Schistad
> Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot 
> 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen 
> Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, 
> Screen Shot 2016-02-08 at 10.03.04.png, gc.log
>
>
> When starting and stopping multiple SparkContext's linearly eventually the 
> driver stops working with a "io.netty.handler.codec.EncoderException: 
> java.lang.OutOfMemoryError: Java heap space" error.
> Reproduce by running the following code and loading in ~7MB parquet data each 
> time. The driver heap space is not changed and thus defaults to 1GB:
> {code:java}
> def main(args: Array[String]) {
>   val conf = new SparkConf().setMaster("MASTER_URL").setAppName("")
>   conf.set("spark.mesos.coarse", "true")
>   conf.set("spark.cores.max", "10")
>   for (i <- 1 until 100) {
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val events = sqlContext.read.parquet("hdfs://locahost/tmp/something")
> println(s"Context ($i), number of events: " + events.count)
> sc.stop()
>   }
> }
> {code}
> The heap space fills up within 20 loops on my cluster. Increasing the number 
> of cores to 50 in the above example results in heap space error after 12 
> contexts.
> Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" 
> objects (see attachments). Digging into the inner objects tells me that the 
> `executorDataMap` is where 99% of the data in said object is stored. I do 
> believe though that this is beside the point as I'd expect this whole object 
> to be garbage collected or freed on sc.stop(). 
> Additionally I can see in the Spark web UI that each time a new context is 
> created the number of the "SQL" tab increments by one (i.e. last iteration 
> would have SQL99). After doing stop and creating a completely new context I 
> was expecting this number to be reset to 1 ("SQL").
> I'm submitting the jar file with `spark-submit` and no special flags. The 
> cluster is running Mesos 0.23. I'm running Spark 1.6.0.






[jira] [Assigned] (SPARK-13339) Clarify commutative / associative operator requirements for reduce, fold

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13339:


Assignee: Apache Spark  (was: Sean Owen)

> Clarify commutative / associative operator requirements for reduce, fold
> 
>
> Key: SPARK-13339
> URL: https://issues.apache.org/jira/browse/SPARK-13339
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> As begun in https://github.com/apache/spark/pull/11091 we should make 
> consistent the documentation for the function supplied to various reduce and 
> fold methods. reduce needs associative and commutative; fold needs 
> associative.






[jira] [Commented] (SPARK-13339) Clarify commutative / associative operator requirements for reduce, fold

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148595#comment-15148595
 ] 

Apache Spark commented on SPARK-13339:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11217

> Clarify commutative / associative operator requirements for reduce, fold
> 
>
> Key: SPARK-13339
> URL: https://issues.apache.org/jira/browse/SPARK-13339
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> As begun in https://github.com/apache/spark/pull/11091 we should make 
> consistent the documentation for the function supplied to various reduce and 
> fold methods. reduce needs associative and commutative; fold needs 
> associative.






[jira] [Assigned] (SPARK-13339) Clarify commutative / associative operator requirements for reduce, fold

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13339:


Assignee: Sean Owen  (was: Apache Spark)

> Clarify commutative / associative operator requirements for reduce, fold
> 
>
> Key: SPARK-13339
> URL: https://issues.apache.org/jira/browse/SPARK-13339
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> As begun in https://github.com/apache/spark/pull/11091 we should make 
> consistent the documentation for the function supplied to various reduce and 
> fold methods. reduce needs associative and commutative; fold needs 
> associative.






[jira] [Created] (SPARK-13339) Clarify commutative / associative operator requirements for reduce, fold

2016-02-16 Thread Sean Owen (JIRA)
Sean Owen created SPARK-13339:
-

 Summary: Clarify commutative / associative operator requirements 
for reduce, fold
 Key: SPARK-13339
 URL: https://issues.apache.org/jira/browse/SPARK-13339
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.6.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor


As begun in https://github.com/apache/spark/pull/11091, we should make the 
documentation consistent for the function supplied to the various reduce and fold 
methods: reduce needs an associative and commutative operator; fold needs an 
associative one.
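To make the distinction concrete, here is a hedged, minimal sketch (not part of the ticket) of why the operator properties matter; the numbers and the local[4] master are arbitrary:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object ReduceFoldDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reduce-fold-demo").setMaster("local[4]"))
    val rdd = sc.parallelize(1 to 8, numSlices = 4)

    // '+' is associative and commutative, so reduce is deterministic.
    println(rdd.reduce(_ + _)) // always 36

    // '-' is neither associative nor commutative; the result depends on how the
    // data is partitioned and on the order in which partition results arrive.
    println(rdd.reduce(_ - _)) // partition- and scheduling-dependent

    // fold merges per-partition results with the same operator, so the operator
    // still has to be associative for the result to be well defined.
    println(rdd.fold(0)(_ + _)) // always 36

    sc.stop()
  }
}
{code}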



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12714) Transforming Dataset with sequences of case classes to RDD causes Task Not Serializable exception

2016-02-16 Thread James Eastwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Eastwood resolved SPARK-12714.

Resolution: Fixed

[~marmbrus] Sorry for taking an age to get back to you -- I've tested this with 
1.6.0-SNAPSHOT and it is indeed working. Thanks :).

> Transforming Dataset with sequences of case classes to RDD causes Task Not 
> Serializable exception
> -
>
> Key: SPARK-12714
> URL: https://issues.apache.org/jira/browse/SPARK-12714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: linux 3.13.0-24-generic, scala 2.10.6
>Reporter: James Eastwood
>
> Attempting to transform a Dataset of a case class containing a nested 
> sequence of case classes causes an exception to be thrown: 
> `org.apache.spark.SparkException: Task not serializable`.
> Here is a minimum repro:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
> case class Top(a: String, nested: Array[Nested])
> case class Nested(b: String)
> object scratch {
>   def main ( args: Array[String] ) {
> lazy val sparkConf = new 
> SparkConf().setAppName("scratch").setMaster("local[1]")
> lazy val sparkContext = new SparkContext(sparkConf)
> lazy val sqlContext = new SQLContext(sparkContext)
> val input = List(
>   """{ "a": "123", "nested": [{ "b": "123" }] }"""
> )
> import sqlContext.implicits._
> val ds = sqlContext.read.json(sparkContext.parallelize(input)).as[Top]
> ds.rdd.foreach(println)
> sparkContext.stop()
>   }
> }
> {code}
> {code}
> scalaVersion := "2.10.6"
> lazy val sparkVersion = "1.6.0"
> libraryDependencies ++= List(
>   "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
>   "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
>   "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
> )
> {code}
> Full stack trace:
> {code}
> [error] (run-main-0) org.apache.spark.SparkException: Task not serializable
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at org.apache.spark.sql.Dataset.rdd(Dataset.scala:166)
>   at scratch$.main(scratch.scala:26)
>   at scratch.main(scratch.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
> Caused by: java.io.NotSerializableException: 
> scala.reflect.internal.Mirrors$Roots$EmptyPackageClass$
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.internal.Mirrors$Roots$EmptyPackageClass$, value: package 
> )
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, )
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$TypeRef$$anon$6, Nested)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2,
>  name: elementType$1, type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2,
>  )
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2$$anonfun$apply$1,
>  name: $outer, type: class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2)
>   - object (class 
> 

[jira] [Resolved] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12247.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10411
[https://github.com/apache/spark/pull/10411]

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 2.0.0
>Reporter: Timothy Hunter
>Assignee: Benjamin Fradet
> Fix For: 2.0.0
>
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13338) [ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13338:


Assignee: (was: Apache Spark)

> [ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion
> --
>
> Key: SPARK-13338
> URL: https://issues.apache.org/jira/browse/SPARK-13338
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Grzegorz Chilkiewicz
>
> PolynomialExpansion has a bug in the validation of its 'degree' parameter:
> it does not allow setting the degree to 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13338) [ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13338:


Assignee: Apache Spark

> [ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion
> --
>
> Key: SPARK-13338
> URL: https://issues.apache.org/jira/browse/SPARK-13338
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Grzegorz Chilkiewicz
>Assignee: Apache Spark
>
> PolynomialExpansion has a bug in the validation of its 'degree' parameter:
> it does not allow setting the degree to 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13338) [ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148525#comment-15148525
 ] 

Apache Spark commented on SPARK-13338:
--

User 'grzegorz-chilkiewicz' has created a pull request for this issue:
https://github.com/apache/spark/pull/11216

> [ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion
> --
>
> Key: SPARK-13338
> URL: https://issues.apache.org/jira/browse/SPARK-13338
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Grzegorz Chilkiewicz
>
> PolynomialExpansion has a bug in the validation of its 'degree' parameter:
> it does not allow setting the degree to 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13338) [ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion

2016-02-16 Thread Grzegorz Chilkiewicz (JIRA)
Grzegorz Chilkiewicz created SPARK-13338:


 Summary: [ML] Allow setting 'degree' parameter to 1 for 
PolynomialExpansion
 Key: SPARK-13338
 URL: https://issues.apache.org/jira/browse/SPARK-13338
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 1.6.0, 2.0.0
Reporter: Grzegorz Chilkiewicz


PolynomialExpansion has a bug in the validation of its 'degree' parameter:
it does not allow setting the degree to 1.
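A hedged repro sketch (the column names are arbitrary); in the affected versions the validator attached to the 'degree' param rejects the value 1, so the setter below is expected to throw:

{code}
import org.apache.spark.ml.feature.PolynomialExpansion

object PolyExpansionDegreeOne {
  def main(args: Array[String]): Unit = {
    try {
      new PolynomialExpansion()
        .setInputCol("features")
        .setOutputCol("expanded")
        .setDegree(1) // IllegalArgumentException before the fix; degree 1 amounts to no expansion
      println("degree = 1 accepted")
    } catch {
      case e: IllegalArgumentException =>
        println(s"degree = 1 rejected: ${e.getMessage}")
    }
  }
}
{code}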



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-02-16 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148472#comment-15148472
 ] 

Emlyn Corrin commented on SPARK-9740:
-

Thanks, the {{registerTempTable}} + {{sql}} workaround is fine for now, I guess 
I'll just wait for 2.0 to clean that up.
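For readers hitting the same limitation, here is a hedged sketch of the {{registerTempTable}} + {{sql}} pattern mentioned above; the table and column names are made up, and only the pattern matters:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sql-workaround").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical data; the point is only the registerTempTable + sql pattern.
    val df = Seq(("a", Some(1)), ("a", None), ("b", Some(3))).toDF("key", "value")
    df.registerTempTable("events")

    // Aggregations not yet exposed through the DataFrame API can be written in SQL.
    sqlContext.sql(
      "SELECT key, first(value) AS first_value FROM events GROUP BY key").show()

    sc.stop()
  }
}
{code}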

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates, implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable by adding a skipNulls flag. I would 
> suggest doing the same, and making the default behavior compatible with Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-02-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148463#comment-15148463
 ] 

Herman van Hovell commented on SPARK-9740:
--

[~emlyn] We have merged a patch for this, see: 
https://issues.apache.org/jira/browse/SPARK-13049. Things like this typically 
don't get backported, since it is more a feature than a bug.

You could still use the SQL workaround. You could also create some glue code, 
i.e. put the contents of the merged PR in a different object 
(ExtendedFunctions?) in the org.apache.spark.sql package. I sometimes do this 
when I need such functionality.
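A hedged sketch of that glue-code idea (the names are illustrative, and the body is deliberately left as a placeholder to be copied from the merged PR):

{code}
package org.apache.spark.sql

// A user-maintained object placed in the org.apache.spark.sql package so that
// code copied from the merged PR, which may rely on package-private helpers,
// compiles against a pre-2.0 build.
object ExtendedFunctions {
  // e.g. def first(e: Column, ignoreNulls: Boolean): Column = ... (copied from the PR)
}
{code}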


> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates, implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable by adding a skipNulls flag. I would 
> suggest doing the same, and making the default behavior compatible with Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6622) Spark SQL cannot communicate with Hive meta store

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6622.
--
Resolution: Not A Problem

This just looks like you're missing a MySQL driver. It doesn't look 
Spark-related.
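For anyone who lands here with the same stack trace, a hedged diagnostic sketch (not Spark-specific; the JDBC URL and user come from the report, the password is a placeholder):

{code}
import java.sql.DriverManager

object MetastoreDriverCheck {
  def main(args: Array[String]): Unit = {
    // Throws ClassNotFoundException if the MySQL Connector/J jar is not on the
    // classpath, which is what "No suitable driver found" usually boils down to.
    Class.forName("com.mysql.jdbc.Driver")
    val conn = DriverManager.getConnection(
      "jdbc:mysql://hostname.vip.company.com:3306/HDB", "hiveuser", "<password>")
    println("metastore database reachable: " + !conn.isClosed)
    conn.close()
  }
}
{code}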

> Spark SQL cannot communicate with Hive meta store
> -
>
> Key: SPARK-6622
> URL: https://issues.apache.org/jira/browse/SPARK-6622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Deepak Kumar V
>  Labels: Hive
> Attachments: exception.txt
>
>
> I have multiple tables (among them dw_bid) that are created through Apache 
> Hive. I have data in Avro on HDFS that I want to join with the dw_bid table; 
> this join needs to be done using Spark SQL.
> Spark SQL is unable to communicate with the Apache Hive metastore and fails 
> with the exception
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
> connection to the given database. JDBC url = 
> jdbc:mysql://hostname.vip.company.com:3306/HDB, username = hiveuser. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: No suitable driver found for 
> jdbc:mysql://hostname.vip. company.com:3306/HDB
>   at java.sql.DriverManager.getConnection(DriverManager.java:596)
> Spark Submit Command
> ./bin/spark-submit -v --master yarn-cluster --driver-class-path 
> /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
>  --jars 
> /apache/hadoop/lib/hadoop-lzo-0.6.0.jar,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.35-bin.jar,/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/conf/hive-site.xml
>  --num-executors 1 --driver-memory 4g --driver-java-options 
> "-XX:MaxPermSize=2G" --executor-memory 2g --executor-cores 1 --queue 
> hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp 
> spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 
> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro 
> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
> MySQL Java Connector versions tried
> mysql-connector-java-5.0.8-bin.jar (Picked from Apache Hive installation lib 
> folder)
> mysql-connector-java-5.1.34.jar
> mysql-connector-java-5.1.35.jar
> Spark Version: 1.3.0 - Prebuilt for Hadoop 2.4.x 
> (http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz)
> $ hive --version
> Hive 0.13.0.2.1.3.6-2
> Subversion 
> git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6
>  -r 87da9430050fb9cc429d79d95626d26ea382b96c



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-02-16 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148458#comment-15148458
 ] 

Emlyn Corrin commented on SPARK-9740:
-

Any update on this? Should I open a new issue for it so it doesn't fall through 
the cracks?

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates, implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive, which would return a null value 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable by adding a skipNulls flag. I would 
> suggest doing the same, and making the default behavior compatible with Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13337:
--
Component/s: SQL

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-02-16 Thread Zhong Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhong Wang updated SPARK-13337:
---
Description: 
Currently, the join-on-columns function:
{code}
def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
DataFrame
{code}
performs a null-unsafe join. It would be great if there were an option for a 
null-safe join.

  was:
Currently, the join-on-columns function:

def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
DataFrame

performs a null-unsafe join. It would be great if there were an option for a 
null-safe join.


> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-02-16 Thread Zhong Wang (JIRA)
Zhong Wang created SPARK-13337:
--

 Summary: DataFrame join-on-columns function should support 
null-safe equal
 Key: SPARK-13337
 URL: https://issues.apache.org/jira/browse/SPARK-13337
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.6.0
Reporter: Zhong Wang
Priority: Minor


Currently, the join-on-columns function:

def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
DataFrame

performs a null-unsafe join. It would be great if there were an option for a 
null-safe join.
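Until such an option exists, a hedged workaround sketch using an explicit join condition with the null-safe equality operator; the data and column names below are made up:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object NullSafeJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("null-safe-join").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val left = Seq((Some("a"), 1), (None, 2)).toDF("key", "l")
    val right = Seq((Some("a"), 10), (None, 20)).toDF("key", "r")

    // join(right, usingColumns, joinType) uses plain equality, so NULL keys never
    // match. An explicit condition with <=> treats NULL = NULL as true.
    val joined = left.join(right, left("key") <=> right("key"), "inner")
    joined.show()

    sc.stop()
  }
}
{code}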



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5855) [Spark SQL] 'explain' command in SparkSQL don't support to analyze the DDL 'VIEW'

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5855.
--
Resolution: Won't Fix

> [Spark SQL] 'explain' command in SparkSQL don't support to analyze the DDL 
> 'VIEW' 
> --
>
> Key: SPARK-5855
> URL: https://issues.apache.org/jira/browse/SPARK-5855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Yi Zhou
>Priority: Minor
>
> The 'explain' command in SparkSQL doesn't support analyzing the DDL 'VIEW'. 
> For example, in the Spark-SQL CLI:
>  > explain
>  > CREATE VIEW q24_spark_RUN_QUERY_0_temp_competitor_price_view AS
>  > SELECT
>  >   i_item_sk, (imp_competitor_price - 
> i_current_price)/i_current_price AS price_change,
>  >   imp_start_date, (imp_end_date - imp_start_date) AS no_days
>  > FROM item i
>  > JOIN item_marketprices imp ON i.i_item_sk = imp.imp_item_sk
>  > WHERE i.i_item_sk IN (7, 17)
>  > AND imp.imp_competitor_price < i.i_current_price;
> 15/02/17 14:06:50 WARN HiveConf: DEPRECATED: Configuration property 
> hive.metastore.local no longer has any effect. Make sure to provide a valid 
> value for hive.metastore.uris if you are connecting to a remote metastore.
> 15/02/17 14:06:50 INFO ParseDriver: Parsing command: explain
> CREATE VIEW q24_spark_RUN_QUERY_0_temp_competitor_price_view AS
> SELECT
>   i_item_sk, (imp_competitor_price - i_current_price)/i_current_price AS 
> price_change,
>   imp_start_date, (imp_end_date - imp_start_date) AS no_days
> FROM item i
> JOIN item_marketprices imp ON i.i_item_sk = imp.imp_item_sk
> WHERE i.i_item_sk IN (7, 17)
> AND imp.imp_competitor_price < i.i_current_price
> 15/02/17 14:06:50 INFO ParseDriver: Parse Completed
> 15/02/17 14:06:50 INFO SparkContext: Starting job: collect at 
> SparkPlan.scala:84
> 15/02/17 14:06:50 INFO DAGScheduler: Got job 3 (collect at 
> SparkPlan.scala:84) with 1 output partitions (allowLocal=false)
> 15/02/17 14:06:50 INFO DAGScheduler: Final stage: Stage 3(collect at 
> SparkPlan.scala:84)
> 15/02/17 14:06:50 INFO DAGScheduler: Parents of final stage: List()
> 15/02/17 14:06:50 INFO DAGScheduler: Missing parents: List()
> 15/02/17 14:06:50 INFO DAGScheduler: Submitting Stage 3 (MappedRDD[12] at map 
> at SparkPlan.scala:84), which has no missing parents
> 15/02/17 14:06:50 INFO MemoryStore: ensureFreeSpace(2560) called with 
> curMem=4122, maxMem=370503843
> 15/02/17 14:06:50 INFO MemoryStore: Block broadcast_5 stored as values in 
> memory (estimated size 2.5 KB, free 353.3 MB)
> 15/02/17 14:06:50 INFO MemoryStore: ensureFreeSpace(1562) called with 
> curMem=6682, maxMem=370503843
> 15/02/17 14:06:50 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes 
> in memory (estimated size 1562.0 B, free 353.3 MB)
> 15/02/17 14:06:50 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory 
> on bignode1:56237 (size: 1562.0 B, free: 353.3 MB)
> 15/02/17 14:06:50 INFO BlockManagerMaster: Updated info of block 
> broadcast_5_piece0
> 15/02/17 14:06:50 INFO SparkContext: Created broadcast 5 from broadcast at 
> DAGScheduler.scala:838
> 15/02/17 14:06:50 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3 
> (MappedRDD[12] at map at SparkPlan.scala:84)
> 15/02/17 14:06:50 INFO YarnClientClusterScheduler: Adding task set 3.0 with 1 
> tasks
> 15/02/17 14:06:50 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, 
> bignode2, PROCESS_LOCAL, 2425 bytes)
> 15/02/17 14:06:50 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory 
> on bignode2:51446 (size: 1562.0 B, free: 1766.5 MB)
> 15/02/17 14:06:51 INFO DAGScheduler: Stage 3 (collect at SparkPlan.scala:84) 
> finished in 0.147 s
> 15/02/17 14:06:51 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) 
> in 135 ms on bignode2 (1/1)
> 15/02/17 14:06:51 INFO DAGScheduler: Job 3 finished: collect at 
> SparkPlan.scala:84, took 0.164711 s
> 15/02/17 14:06:51 INFO YarnClientClusterScheduler: Removed TaskSet 3.0, whose 
> tasks have all completed, from pool
> == Physical Plan ==
> PhysicalRDD [], ParallelCollectionRDD[4] at parallelize at 
> SparkStrategies.scala:195
> Time taken: 0.292 seconds
> 15/02/17 14:06:51 INFO CliDriver: Time taken: 0.292 seconds  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6002) MLLIB should support the RandomIndexing transform

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6002.
--
Resolution: Won't Fix

> MLLIB should support the RandomIndexing transform
> -
>
> Key: SPARK-6002
> URL: https://issues.apache.org/jira/browse/SPARK-6002
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.1
>Reporter: Derrick Burns
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> MLLIB offers the HashingTF.  However, this simple transform offers no 
> guarantees on the relationship between the input and the output. 
> Instead of the HashingTF, MLLIB should offer Random Indexing 
> (http://en.wikipedia.org/wiki/Random_indexing) which does offer such 
> guarantees.
> The K-means clusterer at 
> https://github.com/derrickburns/generalized-kmeans-clustering includes an 
> implementation of the Random Indexing transform.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5377) Dynamically add jar into Spark Driver's classpath.

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5377.
--
Resolution: Won't Fix

> Dynamically add jar into Spark Driver's classpath.
> --
>
> Key: SPARK-5377
> URL: https://issues.apache.org/jira/browse/SPARK-5377
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Chengxiang Li
>
> Spark supports dynamically adding a jar to the executor classpath through 
> SparkContext::addJar(), but it does not support dynamically adding a jar to 
> the driver classpath. In most cases (if not all), users dynamically add a jar 
> with SparkContext::addJar() because some classes from the jar will be 
> referenced in an upcoming Spark job, which means the classes need to be 
> loaded on the Spark driver side as well, e.g. during serialization. I think 
> it makes sense to add an API to add a jar to the driver classpath, or just 
> make it available in SparkContext::addJar(). HIVE-9410 is a real case from 
> Hive on Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5590) Create a complete reference of configurable environment variables, config files and command-line parameters

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5590.
--
Resolution: Won't Fix

> Create a complete reference of configurable environment variables, config 
> files and command-line parameters
> ---
>
> Key: SPARK-5590
> URL: https://issues.apache.org/jira/browse/SPARK-5590
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
> Environment: All
>Reporter: Tobias Bertelsen
>
> This originated as [a question on 
> Stack Overflow|http://stackoverflow.com/q/28219279/].
> It would be great to have a complete reference of the different ways of 
> configuring the Spark master and workers, especially the different names for 
> the same parameter and the precedence among the different ways of configuring 
> the same thing.
> From the original Stack Overflow question:
> h2. Known resources
>  - [The standalone 
> documentation|http://spark.apache.org/docs/1.2.0/spark-standalone.html] is 
> the best I have found, but it does not clearly describe the relationships 
> between the different variables/parameters, nor which take precedence over 
> others.
>  - [The configuration 
> documentation|http://spark.apache.org/docs/1.2.0/configuration.html] provides 
> a good overview for application-properties, but not for the master/slave 
> launch-time parameters.
> h2. Example problem
> The [standalone 
> documentation|http://spark.apache.org/docs/1.2.0/spark-standalone.html] 
> writes the following:
> {quote}
>  the following configuration options can be passed to the master and worker
>  ...
>  `-d DIR, --work-dir DIR` Directory to use for scratch space and job 
> output logs (default: SPARK_HOME/work); only on worker
> {quote}
> and later
> {quote}
>  `SPARK_LOCAL_DIRS` Directory to use for "scratch" space in Spark
>  `SPARK_WORKER_DIR` Directory to run applications in, which will include both 
> logs and scratch space (default: SPARK_HOME/work).
> {quote}
> As a Spark newbie I am a little confused by now. 
>  - What is the relationship between `SPARK_LOCAL_DIRS`, `SPARK_WORKER_DIR`, 
> and `-d`?
>  - What if I set them all to different values? Which takes precedence?
>  - Do variables written in `$SPARK_HOME/conf/spark-env.sh` take precedence 
> over variables defined in the shell/script starting Spark?
> h2. Ideal Solution
> What I am looking for is essentially a single reference that
>  1. defines the precedence of the different ways of specifying variables for 
> Spark, and
>  2. lists all variables/parameters.
> For example something like this:
> || Variable || Cmd-line || Default || Description ||
>  | SPARK_MASTER_PORT | -p --port | 8080 | Port for master to listen on |
>  | SPARK_SLAVE_PORT  | -p --port | random | Port for slave to listen on |
>  | SPARK_WORKER_DIR  | -d --dir  | $SPARK_HOME/work | Used as default for worker data |
>  | SPARK_LOCAL_DIRS  |           | $SPARK_WORKER_DIR | Scratch space for RDDs |
>  |   |   |   |   |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2421) Spark should treat writable as serializable for keys

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2421.
--
Resolution: Won't Fix

> Spark should treat writable as serializable for keys
> 
>
> Key: SPARK-2421
> URL: https://issues.apache.org/jira/browse/SPARK-2421
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.0.0
>Reporter: Xuefu Zhang
>
> It seems that Spark requires the key to be serializable (the class must 
> implement the Serializable interface). In the Hadoop world, the Writable 
> interface is used for the same purpose. A lot of existing classes, while 
> writable, are not considered serializable by Spark. It would be nice if Spark 
> could treat Writable as serializable and automatically serialize and 
> deserialize these classes using the Writable interface.
> This was identified in HIVE-7279, but its benefits apply globally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5024) Provide duration summary by job group ID

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5024.
--
Resolution: Won't Fix

> Provide duration summary by job group ID
> 
>
> Key: SPARK-5024
> URL: https://issues.apache.org/jira/browse/SPARK-5024
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Eran Medan
>
> Right now one has to manually sum all jobs pertaining to the same job group 
> ID (e.g. to get stats on how long a certain job group took). It would be very 
> nice to have a summary of duration by group ID.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4776) Spark IO Messages are difficult to Debug

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4776.
--
Resolution: Won't Fix

> Spark IO Messages are difficult to Debug
> 
>
> Key: SPARK-4776
> URL: https://issues.apache.org/jira/browse/SPARK-4776
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
>Reporter: Kevin Mader
>
> With messages like this it is very difficult to determine which RDDs are 
> causing the problem and which task is being applied to them at the time of 
> the crash. It would be helpful if the RDD's ID and Task ID were shown to make 
> tracking down the source of a problem easier.
> ```
> 14/12/06 12:10:26 ERROR storage.DiskBlockObjectWriter: Uncaught exception 
> while reverting partial writes to file 
> /scratch/spark-local-20141206120823-caaa/1d/merged_shuffle_0_254_3
> java.io.FileNotFoundException: 
> /scratch/spark-local-20141206120823-caaa/1d/merged_shuffle_0_254_3 (No such 
> file or directory)
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5318) Add ability to control partition count in SparkSql

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5318.
--
Resolution: Won't Fix

> Add ability to control partition count in SparkSql
> --
>
> Key: SPARK-5318
> URL: https://issues.apache.org/jira/browse/SPARK-5318
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Idan Zalzberg
>
> When using Spark SQL, e.g. sqlContext.sql("..."), Spark might need to read 
> Hadoop files.
> However, unlike the hadoopFile API, there is no documented way to set the 
> minimum partition count when reading.
> There is an undocumented way, though, using "mapred.map.tasks" in hiveConf.
> I suggest we add a documented way to do it that works in the exact same way 
> (possibly with a better name).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5103) Add Functionality to Pass Config Options to KeyConverter and ValueConverter in PySpark

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5103.
--
Resolution: Won't Fix

> Add Functionality to Pass Config Options to KeyConverter and ValueConverter 
> in PySpark
> --
>
> Key: SPARK-5103
> URL: https://issues.apache.org/jira/browse/SPARK-5103
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.2.0
>Reporter: Brett Meyer
>Priority: Minor
>  Labels: features
>
> Currently when using the provided PySpark loaders and using a KeyConverter or 
> ValueConverter class, there is no way to pass in additional information to 
> the converter classes. I would like to add functionality to pass in options 
> either through configuration that can be set on the SparkContext, or through 
> parameters that can be passed to the KeyConverter and ValueConverter classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5008) Persistent HDFS does not recognize EBS Volumes

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5008.
--
Resolution: Won't Fix

> Persistent HDFS does not recognize EBS Volumes
> --
>
> Key: SPARK-5008
> URL: https://issues.apache.org/jira/browse/SPARK-5008
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.2.0
> Environment: 8 Node Cluster Generated from 1.2.0 spark-ec2 script.
> -m c3.2xlarge -t c3.8xlarge --ebs-vol-size 300 --ebs-vol-type gp2 
> --ebs-vol-num 1
>Reporter: Brad Willard
>
> The cluster is built with correct-size EBS volumes. It creates the volume at 
> /dev/xvds and it is mounted to /vol0. However, when you start persistent HDFS 
> with the start-all script, it starts but isn't correctly configured to use the 
> EBS volume.
> I'm assuming some symlinks or expected mounts are not correctly configured.
> This has worked flawlessly on all previous versions of Spark.
> I have a stupid workaround of installing pssh and mucking with it by mounting 
> it to /vol, which worked; however, it does not survive restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4474) Improve handling of jars that cannot be included in the uber jar

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4474.
--
Resolution: Won't Fix

> Improve handling of jars that cannot be included in the uber jar
> 
>
> Key: SPARK-4474
> URL: https://issues.apache.org/jira/browse/SPARK-4474
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Jim Lim
>Priority: Minor
>
> Please refer to this [pull request|https://github.com/apache/spark/pull/3238] 
> for more details.
> Some jars, such as the datanucleus jars, are distributed with spark but 
> cannot be included in the uber (aka assembly) jar. This caused some classpath 
> issues, such as [SPARK-2624]. A workaround was done in the above pull 
> request, allowing the user to specify the location of the datanucleus jars in 
> {{spark.yarn.datanucleus.dir}}.
> Some things to consider:
> - rename {{spark.yarn.datanucleus.dir}} to be more generic to deal with such 
> jars in the future
> - figure out how to include datanucleus jars in the uber jar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4528) [SQL] add comment support for Spark SQL CLI

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4528.
--
Resolution: Won't Fix

> [SQL] add comment support for Spark SQL CLI 
> 
>
> Key: SPARK-4528
> URL: https://issues.apache.org/jira/browse/SPARK-4528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Fuqing Yang
>Priority: Minor
>  Labels: features
>
> While using spark-sql, I found it does not support comments when writing SQL. 
> It returns an error if we add some comments, 
> for example:
> spark-sql> 
>  > show tables; --list tables in current db;
> 14/11/21 11:30:37 INFO parse.ParseDriver: Parsing command: show tables
> 14/11/21 11:30:37 INFO parse.ParseDriver: Parse Completed
> 14/11/21 11:30:37 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch MultiInstanceRelations
> 14/11/21 11:30:37 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch CaseInsensitiveAttributeReferences
> ..
> 14/11/21 11:30:38 INFO HiveMetaStore.audit: ugi=hadoop
> ip=unknown-ip-addr  cmd=get_tables: db=default pat=.*   
> 14/11/21 11:30:38 INFO ql.Driver:  start=1416540638191 end=1416540638202 duration=11>
> 14/11/21 11:30:38 INFO ql.Driver:  start=1416540638191 end=1416540638202 duration=11>
> 14/11/21 11:30:38 INFO ql.Driver:  start=1416540638190 end=1416540638203 duration=13>
> OK
> 14/11/21 11:30:38 INFO ql.Driver: OK
> 14/11/21 11:30:38 INFO ql.Driver: 
> 14/11/21 11:30:38 INFO ql.Driver:  start=1416540638203 end=1416540638203 duration=0>
> 14/11/21 11:30:38 INFO ql.Driver:  start=1416540637998 end=1416540638203 duration=205>
> 14/11/21 11:30:38 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 14/11/21 11:30:38 INFO ql.Driver: 
> 14/11/21 11:30:38 INFO ql.Driver:  start=1416540638207 end=1416540638208 duration=1>
> 14/11/21 11:30:38 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch MultiInstanceRelations
> 14/11/21 11:30:38 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch CaseInsensitiveAttributeReferences
> 14/11/21 11:30:38 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch Check Analysis
> 14/11/21 11:30:38 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for 
> batch Add exchange
> 14/11/21 11:30:38 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for 
> batch Prepare Expressions
> dummy
> records
> tab_sogou10
> Time taken: 0.412 seconds
> 14/11/21 11:30:38 INFO CliDriver: Time taken: 0.412 seconds
> 14/11/21 11:30:38 INFO parse.ParseDriver: Parsing command:  --list tables in 
> current db
> NoViableAltException(-1@[])
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:902)
> Comment support is widely used in projects, so it is necessary to add this 
> feature.
> This implementation can be achieved in the source file 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.scala.
> We may need to support three comment styles:
> From a ‘#’ character to the end of the line.
> From a ‘-- ’ sequence to the end of the line
> From a /* sequence to the following */ sequence, as in the C programming 
> language.
> This syntax allows a comment to extend over multiple lines because the 
> beginning and closing sequences need not be on the same line.
> What do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4224) Support group acls

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4224.
--
Resolution: Won't Fix

> Support group acls
> --
>
> Key: SPARK-4224
> URL: https://issues.apache.org/jira/browse/SPARK-4224
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>
> Currently we support view and modify acls but you have to specify a list of 
> users. It would be nice to also support groups, so that anyone in the group 
> has permissions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2369) Enable Spark SQL UDF to influence at runtime the decision to read a partition

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2369.
--
Resolution: Won't Fix

> Enable Spark SQL UDF to influence at runtime the decision to read a partition
> -
>
> Key: SPARK-2369
> URL: https://issues.apache.org/jira/browse/SPARK-2369
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Mansour Raad
>  Labels: UDF
>
> Let's say I have a custom partitioner on my RDD, and that RDD is registered 
> as a SQL table. I want to do a "select myfield from mytable where 
> myudf(myfield, "some condition") = somevalue" without performing a 
> "full table" scan to get myfield.
> However, if the UDF API were extended so that at runtime it could be "asked" 
> whether the current partition is relevant, only the relevant partitions would 
> be scanned.
> I could see the UDF API being extended with a method such as:
> readPartition(partitioner: Partitioner, partitionId: Int): Boolean
> where I can cast the partitioner to my own custom one and, based on the given 
> partition id and runtime arguments, the method decides whether to read that 
> partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13331) Spark network encryption optimization

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13331:
--
   Priority: Minor  (was: Major)
Component/s: Deploy

> Spark network encryption optimization
> -
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Dong Chen
>Priority: Minor
>
> In network/common, SASL encryption uses the DIGEST-MD5 mechanism, which 
> supports 3DES, DES, and RC4.
> 3DES and RC4 are relatively slow. We could make it support AES for better 
> security and performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13317) SPARK_LOCAL_IP does not bind to public IP on Slaves

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13317:
--
   Priority: Minor  (was: Major)
Component/s: EC2
 Deploy
Summary: SPARK_LOCAL_IP does not bind to public IP on Slaves  (was: 
SPARK_LOCAL_IP does not bind on Slaves)

> SPARK_LOCAL_IP does not bind to public IP on Slaves
> ---
>
> Key: SPARK-13317
> URL: https://issues.apache.org/jira/browse/SPARK-13317
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, EC2
> Environment: Linux EC2, different VPC 
>Reporter: Christopher Bourez
>Priority: Minor
>
> SPARK_LOCAL_IP does not bind to the provided IP on slaves.
> When launching a job or a spark-shell from a second network, the returned IP 
> for the slave is still the first IP of the slave. 
> So the job fails with the message : 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> It is not a question of resources; rather, the driver cannot connect to the 
> slave because it is given the wrong IP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12921) Use SparkHadoopUtil reflection to access TaskAttemptContext in SpecificParquetRecordReaderBase

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12921.
---
Resolution: Fixed

> Use SparkHadoopUtil reflection to access TaskAttemptContext in 
> SpecificParquetRecordReaderBase
> --
>
> Key: SPARK-12921
> URL: https://issues.apache.org/jira/browse/SPARK-12921
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.1
>
>
> It looks like there's one place left in the codebase, 
> SpecificParquetRecordReaderBase,  where we didn't use SparkHadoopUtil's 
> reflective accesses of TaskAttemptContext methods, creating problems when 
> using a single Spark artifact with both Hadoop 1.x and 2.x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13221) GroupingSets Returns an Incorrect Results

2016-02-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13221:
--
Assignee: Xiao Li

> GroupingSets Returns an Incorrect Results
> -
>
> Key: SPARK-13221
> URL: https://issues.apache.org/jira/browse/SPARK-13221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The following query returns a wrong result:
> {code}
> sql("select course, sum(earnings) as sum from courseSales group by course, 
> earnings" +
>  " grouping sets((), (course), (course, earnings))" +
>  " order by course, sum").show()
> {code}
> Before the fix, the results are like
> {code}
> [null,null]
> [Java,null]
> [Java,2.0]
> [Java,3.0]
> [dotNET,null]
> [dotNET,5000.0]
> [dotNET,1.0]
> [dotNET,48000.0]
> {code}
> After the fix, the results are corrected:
> {code}
> [null,113000.0]
> [Java,2.0]
> [Java,3.0]
> [Java,5.0]
> [dotNET,5000.0]
> [dotNET,1.0]
> [dotNET,48000.0]
> [dotNET,63000.0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2016-02-16 Thread Christoph Pirkl (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148366#comment-15148366
 ] 

Christoph Pirkl commented on SPARK-10969:
-

I created PR https://github.com/apache/spark/pull/11215 that implements this.

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148365#comment-15148365
 ] 

Apache Spark commented on SPARK-10969:
--

User 'kaklakariada' has created a pull request for this issue:
https://github.com/apache/spark/pull/11215

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10969:


Assignee: Apache Spark

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Assignee: Apache Spark
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10969) Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis and DynamoDB

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10969:


Assignee: (was: Apache Spark)

> Spark Streaming Kinesis: Allow specifying separate credentials for Kinesis 
> and DynamoDB
> ---
>
> Key: SPARK-10969
> URL: https://issues.apache.org/jira/browse/SPARK-10969
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Christoph Pirkl
>Priority: Critical
>
> {{KinesisUtils.createStream()}} allows specifying only one set of AWS 
> credentials that will be used by Amazon KCL for accessing Kinesis, DynamoDB 
> and CloudWatch.
> h5. Motivation
> In a scenario where one needs to read from a Kinesis Stream owned by a 
> different AWS account the user usually has minimal rights (i.e. only read 
> from the stream). In this case creating the DynamoDB table in KCL will fail.
> h5. Proposal
> My proposed solution would be to allow specifying multiple credentials in 
> {{KinesisUtils.createStream()}} for Kinesis, DynamoDB and CloudWatch. The 
> additional credentials could then be passed to the constructor of 
> {{KinesisClientLibConfiguration}} or method 
> {{KinesisClientLibConfiguration.withDynamoDBClientConfig()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2016-02-16 Thread Sateesh Babu G (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148342#comment-15148342
 ] 

Sateesh Babu G commented on SPARK-9273:
---

Hi All,

Is there a CNN implementation for Spark? 

Thanks in advance!

Best,
Sateesh

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13311) prettyString of IN is not good

2016-02-16 Thread Jayadevan M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148318#comment-15148318
 ] 

Jayadevan M commented on SPARK-13311:
-

[~davies] I would like to work on this issue. Could you share the steps to 
reproduce it?

> prettyString of IN is not good
> --
>
> Key: SPARK-13311
> URL: https://issues.apache.org/jira/browse/SPARK-13311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> In(i_class,[Ljava.lang.Object;@1a575883))
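For illustration, one way to surface the reported rendering from a shell (a sketch against the 1.6-era DataFrame API; the table and column names are made up, and the exact plan string depends on which string form gets printed):

{code:scala}
// spark-shell; sqlContext is provided by the shell
import org.apache.spark.sql.functions.col

val df = sqlContext.range(10).withColumnRenamed("id", "i_class")

// An IN predicate is planned as a catalyst In(...) expression; printing the
// plan is where the reporter saw the value list rendered as a raw Java array
// ("[Ljava.lang.Object;@...") instead of the individual values.
df.filter(col("i_class").isin(1, 2, 3)).explain(true)
{code}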



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13331) Spark network encryption optimization

2016-02-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148314#comment-15148314
 ] 

Sean Owen commented on SPARK-13331:
---

This sounds like a duplicate of SPARK-10771. You should provide some detail on 
how you think this change would be implemented.

> Spark network encryption optimization
> -
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dong Chen
>
> In network/common, SASL encryption uses the DIGEST-MD5 mechanism, which supports 
> 3DES, DES, and RC4.
> 3DES and RC4 are relatively slow. We could add AES support for better security 
> and performance.
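For illustration only, a minimal javax.crypto sketch of the kind of AES stream cipher (AES/CTR here) that could replace the RC4/3DES ciphers negotiated by DIGEST-MD5; key exchange and integrity protection are omitted, and none of this is the actual Spark implementation:

{code:scala}
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// A 128-bit session key and IV; in practice both sides would derive these
// from the authentication/negotiation step, not from a local RNG.
val random = new SecureRandom()
val key = new Array[Byte](16); random.nextBytes(key)
val iv  = new Array[Byte](16); random.nextBytes(iv)

def cipher(mode: Int): Cipher = {
  val c = Cipher.getInstance("AES/CTR/NoPadding")
  c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
  c
}

// Encrypt on the sender, decrypt on the receiver.
val plaintext  = "shuffle block bytes".getBytes("UTF-8")
val ciphertext = cipher(Cipher.ENCRYPT_MODE).doFinal(plaintext)
val roundTrip  = cipher(Cipher.DECRYPT_MODE).doFinal(ciphertext)
assert(new String(roundTrip, "UTF-8") == "shuffle block bytes")
{code}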



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13183) Bytebuffers occupy a large amount of heap memory

2016-02-16 Thread dylanzhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148205#comment-15148205
 ] 

dylanzhou edited comment on SPARK-13183 at 2/16/16 8:36 AM:


@Sean Owen it may be a memory leak: the job eventually fails with 
java.lang.OutOfMemoryError: Java heap space. When I increase driver memory the 
streaming program only runs a little longer, so in my opinion the byte[] objects 
are never reclaimed by the GC. When I increase the amount of data flowing into 
Kafka, memory is consumed even faster. Can you give me some advice? Here is my 
question, thank you!
http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html



was (Author: dylanzhou):
@Sean Owen it may be a memory leak: the job eventually fails with 
java.lang.OutOfMemoryError: Java heap space. When I increase driver memory the 
streaming program only runs a little longer, so in my opinion the byte[] objects 
are never reclaimed by the GC. Can you give me some advice? Here is my question, 
thank you!
http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html


> Bytebuffers occupy a large amount of heap memory
> 
>
> Key: SPARK-13183
> URL: https://issues.apache.org/jira/browse/SPARK-13183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: dylanzhou
>
> When I use Spark Streaming and Spark SQL and cache a table, the old generation 
> grows very fast and full GCs become very frequent; after running for a while the 
> job runs out of memory. A heap dump shows a large number of 
> org.apache.spark.sql.columnar.ColumnBuilder[38] @ 0xd022a0b8 instances taking up 
> 90% of the space; looking at the source, the memory is held by HeapByteBuffer 
> objects, and I don't know why they are never released and reclaimed by GC. If I 
> do not cache the table the problem disappears, but I need to query this table 
> repeatedly.
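Not a fix for the underlying issue, but a minimal sketch (1.4-era SQLContext API, hypothetical table and DataFrame names) of explicitly uncaching the table between query rounds, which is one way to bound the cached-column / HeapByteBuffer footprint described above:

{code:scala}
// driver code; sqlContext and a DataFrame `events` are assumed to exist
events.registerTempTable("events")

sqlContext.cacheTable("events")      // in-memory columnar cache, backed by
try {                                // the HeapByteBuffers seen in the dump
  val counts = sqlContext.sql("SELECT key, count(*) FROM events GROUP BY key")
  counts.show()
} finally {
  sqlContext.uncacheTable("events")  // drop the cached column batches
}
{code}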



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


