[jira] [Created] (SPARK-19383) Spark Sql Fails with Cassandra 3.6 and later PER PARTITION LIMIT option

2017-01-26 Thread Brent Dorsey (JIRA)
Brent Dorsey created SPARK-19383:


 Summary: Spark Sql Fails with Cassandra 3.6 and later PER 
PARTITION LIMIT option 
 Key: SPARK-19383
 URL: https://issues.apache.org/jira/browse/SPARK-19383
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
 Environment: PER PARTITION LIMIT Error documented in github and 
reproducible by cloning: 
[BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job]

Java 1.8

Cassandra Version
[cqlsh 5.0.1 | Cassandra 3.9.0 | CQL spec 3.4.2 | Native protocol v4]

{code:title=POM.xml|borderStyle=solid}
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>2.0.0-M3</version>
</dependency>
<dependency>
    <groupId>com.datastax.cassandra</groupId>
    <artifactId>cassandra-driver-mapping</artifactId>
    <version>3.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.72</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-catalyst_2.10</artifactId>
    <version>2.0.2</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>2.0.2</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>2.0.2</version>
    <scope>compile</scope>
</dependency>
{code}
Reporter: Brent Dorsey
Priority: Minor


Attempting to use version 2.0.0-M3 of the datastax/spark-cassandra-connector to 
select the most recent version of each partition key using the Cassandra 3.6 
and later PER PARTITION LIMIT option fails. I've tried all the Cassandra 
Java RDDs and Spark SQL, with and without partition key equality constraints. 
All attempts have failed due to syntax errors and/or start/end bound 
restriction errors.

The 
[BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job]
 repo contains working code that demonstrates the error. Clone the repo, create 
the keyspace and table locally, supply connection information, then run main.
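
For reference, a minimal Scala sketch of the two sides of the problem (the keyspace/table names here are hypothetical): the Dataset .where call that produces the ParseException below, since PER PARTITION LIMIT is CQL-only syntax that Spark's Catalyst parser does not understand, and a possible workaround that pushes the clause down to Cassandra through CassandraConnector.withSessionDo.

{code:title=sketch|borderStyle=solid}
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("per-partition-limit-test").getOrCreate()

// Fails: PER PARTITION LIMIT is a Cassandra 3.6+ CQL clause, not Spark SQL,
// so Catalyst rejects it with "mismatched input 'PARTITION'".
val ds = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "per_partition_limit_test"))
  .load()
ds.where("TOKEN(item_uuid) > TOKEN(6616b548-4fd1-4661-a938-0af3c77357f7) PER PARTITION LIMIT 1")

// Possible workaround: run the statement on the Cassandra side, where the
// clause is understood, instead of going through the Spark SQL parser.
val connector = CassandraConnector(spark.sparkContext.getConf)
connector.withSessionDo { session =>
  session.execute("SELECT * FROM test.per_partition_limit_test PER PARTITION LIMIT 1")
}
{code}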


Spark Dataset .where & Spark Sql Errors:
{code:title=errors|borderStyle=solid}
ERROR [2017-01-27 06:35:19,919] (main) 
org.per.partition.limit.test.spark.job.Main: 
getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan failed.
org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input 'PARTITION' expecting (line 1, pos 67)

== SQL ==
TOKEN(item_uuid) > TOKEN(6616b548-4fd1-4661-a938-0af3c77357f7) PER PARTITION 
LIMIT 1
---^^^

at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseExpression(ParseDriver.scala:43)
at org.apache.spark.sql.Dataset.where(Dataset.scala:1153)
at 
org.per.partition.limit.test.spark.job.Main.getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan(Main.java:349)
at org.per.partition.limit.test.spark.job.Main.run(Main.java:128)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.per.partition.limit.test.spark.job.Main.main(Main.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
ERROR [2017-01-27 06:35:20,238] (main) 
org.per.partition.limit.test.spark.job.Main: 
getSparkSqlDatasetPerPartitionLimitTest failed.
org.apache.spark.sql.catalyst.parser.ParseException: 
extraneous input ''' expecting {'(', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 
'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 
'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 
'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'INTERSECT', 'TO', 
'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
'COMMENT', 'SET', ...
{code}

[jira] [Resolved] (SPARK-18929) Add Tweedie distribution in GLM

2017-01-26 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-18929.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Add Tweedie distribution in GLM
> ---
>
> Key: SPARK-18929
> URL: https://issues.apache.org/jira/browse/SPARK-18929
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
> Fix For: 2.2.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I propose to add the full Tweedie family into the GeneralizedLinearRegression 
> model. The Tweedie family is characterized by a power variance function. 
> Currently supported distributions such as Gaussian,  Poisson and Gamma 
> families are a special case of the 
> [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. 
> I propose to add support for the other distributions:
> * compound Poisson: 1 < variancePower < 2. This one is widely used to model 
> zero-inflated continuous distributions. 
> * positive stable: variancePower > 2 and variancePower != 3. Used to model 
> extreme values.
> * inverse Gaussian: variancePower = 3.
>  The Tweedie family is supported in most statistical packages such as R 
> (statmod), SAS, h2o etc. 
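
For context, a minimal Scala sketch of how the new option might be used once this lands in 2.2.0 (parameter names are illustrative of the proposal, not a committed API; spark is an existing SparkSession): the family is selected as "tweedie" and the power variance function is set through a variance power parameter.

{code}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Tiny illustrative training set with "label" and "features" columns.
val training = spark.createDataFrame(Seq(
  (0.5, Vectors.dense(1.0, 0.0)),
  (1.0, Vectors.dense(0.0, 1.0)),
  (2.0, Vectors.dense(1.0, 2.0))
)).toDF("label", "features")

// Compound Poisson case (1 < variancePower < 2), useful for zero-inflated
// continuous responses.
val glr = new GeneralizedLinearRegression()
  .setFamily("tweedie")
  .setVariancePower(1.5)
  .setMaxIter(25)
  .setRegParam(0.3)

val model = glr.fit(training)
{code}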






[jira] [Created] (SPARK-19382) Test sparse vectors in LinearSVCSuite

2017-01-26 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19382:
-

 Summary: Test sparse vectors in LinearSVCSuite
 Key: SPARK-19382
 URL: https://issues.apache.org/jira/browse/SPARK-19382
 Project: Spark
  Issue Type: Test
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


Currently, LinearSVCSuite does not test sparse vectors.  We should.  I 
recommend that generateSVMInput be modified to create a mix of dense and sparse 
vectors, rather than adding an additional test.
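
For illustration, a rough Scala sketch of the kind of change suggested above (the helper name and signature are hypothetical, not the actual test utility): alternate dense and sparse encodings of the same points so both code paths are exercised, rather than adding a separate test.

{code}
import scala.util.Random

import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors

// Hypothetical variant of generateSVMInput: every other point uses a sparse
// encoding of the same feature values, so LinearSVC sees both representations.
def generateMixedSVMInput(
    intercept: Double,
    weights: Array[Double],
    nPoints: Int,
    seed: Int): Seq[LabeledPoint] = {
  val rnd = new Random(seed)
  (0 until nPoints).map { i =>
    val x = Array.fill(weights.length)(rnd.nextDouble() * 2.0 - 1.0)
    val margin = x.zip(weights).map { case (v, w) => v * w }.sum + intercept
    val label = if (margin > 0) 1.0 else 0.0
    val features =
      if (i % 2 == 0) Vectors.dense(x)                    // dense representation
      else Vectors.sparse(x.length, x.indices.toArray, x) // same values, sparse
    LabeledPoint(label, features)
  }
}
{code}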






[jira] [Commented] (SPARK-9215) Implement WAL-free Kinesis receiver that give at-least once guarantee

2017-01-26 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15841011#comment-15841011
 ] 

Gaurav Shah commented on SPARK-9215:


[~tdas] ping

> Implement WAL-free Kinesis receiver that give at-least once guarantee
> -
>
> Key: SPARK-9215
> URL: https://issues.apache.org/jira/browse/SPARK-9215
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 1.4.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.5.0
>
>
> Currently, the KinesisReceiver can loose some data in the case of certain 
> failures (receiver and driver failures). Using the write ahead logs can 
> mitigate some of the problem, but it is not ideal because WALs dont work with 
> S3 (eventually consistency, etc.) which is the most likely file system to be 
> used in the EC2 environment. Hence, we have to take a different approach to 
> improving reliability for Kinesis.
> Detailed design doc - 
> https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit?usp=sharing






[jira] [Commented] (SPARK-19304) Kinesis checkpoint recovery is 10x slow

2017-01-26 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15841009#comment-15841009
 ] 

Gaurav Shah commented on SPARK-19304:
-

There are two issues in `KinesisSequenceRangeIterator.getNext`:

* When `internalIterator` is exhausted, it calls `getRecords`, which in turn 
makes two API calls: one to get a shard iterator and another to get the records. 
This can be avoided by storing the next-iterator sequence number returned from 
`getRecordsAndNextKinesisIterator`.

* The second issue is more complicated: each checkpoint block results in a 
separate API call. Suppose there is one shard and 10 checkpoint blocks. For 
each block (or range) we invoke `new KinesisSequenceRangeIterator`, which makes 
a separate `getRecords` call. That API call may return many more records than 
the range actually needs, and the surplus is wasted. For the second range we 
again call `new KinesisSequenceRangeIterator`, which may fetch records already 
returned by the first call.

Example:
ranges \[1-10\],\[11-20\],\[21-30\],\[31-40\]. The first "new 
KinesisSequenceRangeIterator" will fetch records \[1-30\] but ignore everything 
after "10". The next "new KinesisSequenceRangeIterator" will again fetch 
records from \[11-40\] but only use \[11-20\].

> Kinesis checkpoint recovery is 10x slow
> ---
>
> Key: SPARK-19304
> URL: https://issues.apache.org/jira/browse/SPARK-19304
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: using s3 for checkpoints using 1 executor, with 19g mem 
> & 3 cores per executor
>Reporter: Gaurav Shah
>  Labels: kinesis
>
> Application runs fine initially, running batches of 1hour and the processing 
> time is less than 30 minutes on average. For some reason lets say the 
> application crashes, and we try to restart from checkpoint. The processing 
> now takes forever and does not move forward. We tried to test out the same 
> thing at batch interval of 1 minute, the processing runs fine and takes 1.2 
> minutes for batch to finish. When we recover from checkpoint it takes about 
> 15 minutes for each batch. Post the recovery the batches again process at 
> normal speed
> I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown.
> Stackoverflow post with more details: 
> http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow






[jira] [Resolved] (SPARK-18788) Add getNumPartitions() to SparkR

2017-01-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18788.
--
  Resolution: Fixed
   Fix Version/s: 2.2.0
  2.1.1
Target Version/s: 2.1.1, 2.2.0

> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> Would be really convenient to have getNumPartitions() in SparkR, which was in 
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)






[jira] [Commented] (SPARK-18821) Bisecting k-means wrapper in SparkR

2017-01-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15841004#comment-15841004
 ] 

Felix Cheung commented on SPARK-18821:
--

Need to follow up with the programming guide, examples and vignettes.

> Bisecting k-means wrapper in SparkR
> ---
>
> Key: SPARK-18821
> URL: https://issues.apache.org/jira/browse/SPARK-18821
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> Implement a wrapper in SparkR to support bisecting k-means






[jira] [Updated] (SPARK-18821) Bisecting k-means wrapper in SparkR

2017-01-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18821:
-
Fix Version/s: 2.2.0

> Bisecting k-means wrapper in SparkR
> ---
>
> Key: SPARK-18821
> URL: https://issues.apache.org/jira/browse/SPARK-18821
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> Implement a wrapper in SparkR to support bisecting k-means






[jira] [Resolved] (SPARK-18821) Bisecting k-means wrapper in SparkR

2017-01-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18821.
--
Resolution: Fixed
  Assignee: Miao Wang

> Bisecting k-means wrapper in SparkR
> ---
>
> Key: SPARK-18821
> URL: https://issues.apache.org/jira/browse/SPARK-18821
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Assignee: Miao Wang
>
> Implement a wrapper in SparkR to support bisecting k-means






[jira] [Updated] (SPARK-18218) Optimize BlockMatrix multiplication, which may cause OOM and low parallelism usage problem in several cases

2017-01-26 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18218:

Assignee: Weichen Xu

> Optimize BlockMatrix multiplication, which may cause OOM and low parallelism 
> usage problem in several cases
> ---
>
> Key: SPARK-18218
> URL: https://issues.apache.org/jira/browse/SPARK-18218
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After I take a deep look into `BlockMatrix.multiply` implementation, I found 
> that current implementation may cause some problem in special cases.
> Now let me use an extreme case to represent it:
> Suppose we have two blockMatrix A and B
> A has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> B also has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> Now if we call A.mulitiply(B), no matter how A and B is partitioned,
> the resultPartitioner will always contains only one partition,
> this muliplication implementation will shuffle 1 * 1 blocks into one 
> reducer, this will cause the parallism became 1, 
> what's worse, because `RDD.cogroup` will load the total group element into 
> memory, now at reducer-side, 1 * 1 blocks will be loaded into memory, 
> because they are all shuffled into the same group. It will easily cause 
> executor OOM.
> The above case is a little extreme, but other case, such as M*N dimensions 
> matrix A multiply N*P dimensions matrix B, when N is much larger than M and 
> P, we met the similar problem.
> The multiplication implementation do not handle the task partition properly, 
> it will cause:
> 1. when the middle dimension N is too large, it will cause reducer OOM.
> 2. even if OOM do not occur, it will still cause parallism too low.
> 3. when N is much large than M and P, and matrix A and B have many 
> partitions, it will cause too many partition on M and P dimension, it will 
> cause much larger shuffled data size.






[jira] [Resolved] (SPARK-18218) Optimize BlockMatrix multiplication, which may cause OOM and low parallelism usage problem in several cases

2017-01-26 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-18218.
-
   Resolution: Implemented
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/15730
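
For context, a small self-contained Scala sketch of the shape that triggers the problem and of how the fix might be exercised (the extra multiply argument is assumed from the PR above; sc is an existing SparkContext):

{code}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// A is 2 x 4 (1 row block, 2 col blocks); B is 4 x 2 (2 row blocks, 1 col block).
// The shared "middle" dimension is the one that, when very large, funnels all
// blocks into a single reducer during A.multiply(B).
val blocksA = sc.parallelize(Seq(
  ((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))),
  ((0, 1), Matrices.dense(2, 2, Array(5.0, 6.0, 7.0, 8.0)))))
val blocksB = sc.parallelize(Seq(
  ((0, 0), Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))),
  ((1, 0), Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0)))))
val a = new BlockMatrix(blocksA, 2, 2)
val b = new BlockMatrix(blocksB, 2, 2)

// Existing behavior: all contributions to a result block meet in one group.
val c1 = a.multiply(b)

// Assuming the change adds an overload taking a mid-dimension split count,
// the shared dimension can be spread over several reducers instead:
val c2 = a.multiply(b, 2)
{code}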

> Optimize BlockMatrix multiplication, which may cause OOM and low parallelism 
> usage problem in several cases
> ---
>
> Key: SPARK-18218
> URL: https://issues.apache.org/jira/browse/SPARK-18218
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After I take a deep look into `BlockMatrix.multiply` implementation, I found 
> that current implementation may cause some problem in special cases.
> Now let me use an extreme case to represent it:
> Suppose we have two blockMatrix A and B
> A has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> B also has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> Now if we call A.mulitiply(B), no matter how A and B is partitioned,
> the resultPartitioner will always contains only one partition,
> this muliplication implementation will shuffle 1 * 1 blocks into one 
> reducer, this will cause the parallism became 1, 
> what's worse, because `RDD.cogroup` will load the total group element into 
> memory, now at reducer-side, 1 * 1 blocks will be loaded into memory, 
> because they are all shuffled into the same group. It will easily cause 
> executor OOM.
> The above case is a little extreme, but other case, such as M*N dimensions 
> matrix A multiply N*P dimensions matrix B, when N is much larger than M and 
> P, we met the similar problem.
> The multiplication implementation do not handle the task partition properly, 
> it will cause:
> 1. when the middle dimension N is too large, it will cause reducer OOM.
> 2. even if OOM do not occur, it will still cause parallism too low.
> 3. when N is much large than M and P, and matrix A and B have many 
> partitions, it will cause too many partition on M and P dimension, it will 
> cause much larger shuffled data size.






[jira] [Commented] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance

2017-01-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840872#comment-15840872
 ] 

Hyukjin Kwon commented on SPARK-14480:
--

The removed `StringIteratorReader` concatenated the lines in each iterator into 
a reader per partition, IIRC.

Newlines inside a column were not supported correctly, to my understanding, 
because rows can span multiple blocks. This is similar to the reason we did not 
support multi-line JSON before, to my knowledge.

Currently, we have some open PRs for multi-line support, using something like 
`wholeTextFile` for text or treating each file as a multi-line JSON document; 
if any of them is merged, I think we could solve this problem that way.

I guess we introduced several regressions or behaviour changes during the 
porting (which I believe were properly communicated to committers beforehand).

(Actually, _if I remember this correctly_, I raised this problem several times 
with a few committers/PMCs. I can try to find the JIRA or mailing thread if 
anyone wants to verify this.)

> Remove meaningless StringIteratorReader for CSV data source for better 
> performance
> --
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, CSV data source reads and parses CSV data bytes by bytes (not line 
> by line).
> In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think 
> is made like this for better performance. However, it looks there are two 
> problems.
> Firstly, it was actually not faster than processing line by line with 
> {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}.
> Secondly, this brought a bit of complexity because it needs additional logics 
> to allow every line to be read bytes by bytes. So, it was pretty difficult to 
> figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes 
> in {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem 
> are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more details,
> h4. Method
> - TCP-H lineitem table is being tested.
> - The results are collected only by 100.
> - End-to-end tests and parsing time tests are performed 10 times and averages 
> are calculated for each.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)]) 
> - Size : 724.66 MB
> h4.  Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}






[jira] [Created] (SPARK-19381) spark 2.1.0 raises unrelated (unhelpful) error for parquet files beginning with '_'

2017-01-26 Thread Paul Pearce (JIRA)
Paul Pearce created SPARK-19381:
---

 Summary: spark 2.1.0 raises unrelated (unhelpful) error for 
parquet files beginning with '_'
 Key: SPARK-19381
 URL: https://issues.apache.org/jira/browse/SPARK-19381
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.1.0
Reporter: Paul Pearce
Priority: Minor


Under spark 2.1.0 if you attempt to read a parquet file with filename beginning 
with '_' the error returned is 

"Unable to infer schema for Parquet. It must be specified manually."

The bug is not the inability to read the file, rather that the error is 
unrelated to the actual problem. Below shows the generation of parquet files 
under spark 2.0.0 and the attempted reading of them under spark 2.1.0.

Generation:
{code}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0.cloudera1
  /_/

Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
SparkSession available as 'spark'.

>>> from pyspark.sql import Row
>>> df = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: 
>>> Row(single=i, double=i ** 2)))
>>> df.write.parquet("debug.parquet")
>>> df.write.parquet("_debug.parquet")
{code}

Reading
{code}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
  /_/

Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
SparkSession available as 'spark'.
>>> df = spark.read.parquet("debug.parquet")
>>> df = spark.read.parquet("_debug.parquet")
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/readwriter.py", line 
274, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File 
"/opt/apache/spark-2.1.0-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1133, in __call__
  File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/utils.py", 
line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
must be specified manually.;'
{code}

I only realized the source of the problem when reading issue: 
https://issues.apache.org/jira/browse/SPARK-16975 which describes a similar 
problem but with column names.
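
As a side note, file and directory names that start with '_' (or '.') are treated as hidden/metadata paths by Spark's file-based data sources, which is why the generic schema-inference message appears instead of something more descriptive. A hedged Scala sketch of the obvious workaround (renaming the output; the target name is hypothetical, and spark is an existing SparkSession):

{code}
import java.nio.file.{Files, Paths}

// Rename the underscore-prefixed output so it is no longer filtered out as a
// metadata path (works as a simple rename on a local filesystem).
Files.move(Paths.get("_debug.parquet"), Paths.get("debug_renamed.parquet"))

val df = spark.read.parquet("debug_renamed.parquet")
df.show()
{code}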






[jira] [Assigned] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance

2017-01-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14480:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Remove meaningless StringIteratorReader for CSV data source for better 
> performance
> --
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
> Fix For: 2.1.0
>
>
> Currently, CSV data source reads and parses CSV data bytes by bytes (not line 
> by line).
> In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think 
> is made like this for better performance. However, it looks there are two 
> problems.
> Firstly, it was actually not faster than processing line by line with 
> {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}.
> Secondly, this brought a bit of complexity because it needs additional logics 
> to allow every line to be read bytes by bytes. So, it was pretty difficult to 
> figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes 
> in {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem 
> are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more details,
> h4. Method
> - TCP-H lineitem table is being tested.
> - The results are collected only by 100.
> - End-to-end tests and parsing time tests are performed 10 times and averages 
> are calculated for each.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)]) 
> - Size : 724.66 MB
> h4.  Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}






[jira] [Assigned] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance

2017-01-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14480:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Remove meaningless StringIteratorReader for CSV data source for better 
> performance
> --
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, CSV data source reads and parses CSV data bytes by bytes (not line 
> by line).
> In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think 
> is made like this for better performance. However, it looks there are two 
> problems.
> Firstly, it was actually not faster than processing line by line with 
> {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}.
> Secondly, this brought a bit of complexity because it needs additional logics 
> to allow every line to be read bytes by bytes. So, it was pretty difficult to 
> figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes 
> in {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem 
> are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more details,
> h4. Method
> - TCP-H lineitem table is being tested.
> - The results are collected only by 100.
> - End-to-end tests and parsing time tests are performed 10 times and averages 
> are calculated for each.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)]) 
> - Size : 724.66 MB
> h4.  Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}






[jira] [Updated] (SPARK-19381) spark 2.1.0 raises unrelated (unhelpful) error for parquet filenames beginning with '_'

2017-01-26 Thread Paul Pearce (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Pearce updated SPARK-19381:

Summary: spark 2.1.0 raises unrelated (unhelpful) error for parquet 
filenames beginning with '_'  (was: spark 2.1.0 raises unrelated (unhelpful) 
error for parquet files beginning with '_')

> spark 2.1.0 raises unrelated (unhelpful) error for parquet filenames 
> beginning with '_'
> ---
>
> Key: SPARK-19381
> URL: https://issues.apache.org/jira/browse/SPARK-19381
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Paul Pearce
>Priority: Minor
>
> Under spark 2.1.0 if you attempt to read a parquet file with filename 
> beginning with '_' the error returned is 
> "Unable to infer schema for Parquet. It must be specified manually."
> The bug is not the inability to read the file, rather that the error is 
> unrelated to the actual problem. Below shows the generation of parquet files 
> under spark 2.0.0 and the attempted reading of them under spark 2.1.0.
> Generation:
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.0.0.cloudera1
>   /_/
> Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
> SparkSession available as 'spark'.
> >>> from pyspark.sql import Row
> >>> df = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: 
> >>> Row(single=i, double=i ** 2)))
> >>> df.write.parquet("debug.parquet")
> >>> df.write.parquet("_debug.parquet")
> {code}
> Reading
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
> SparkSession available as 'spark'.
> >>> df = spark.read.parquet("debug.parquet")
> >>> df = spark.read.parquet("_debug.parquet")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/readwriter.py", 
> line 274, in parquet
> return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
>   File 
> "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
>   File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/utils.py", 
> line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
> must be specified manually.;'
> {code}
> I only realized the source of the problem when reading issue: 
> https://issues.apache.org/jira/browse/SPARK-16975 which describes a similar 
> problem but with column names.






[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2017-01-26 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840857#comment-15840857
 ] 

Liang-Chi Hsieh commented on SPARK-18539:
-

[~lian cheng] Yeah, I see. The term {{optional}} is more appropriate and won't 
cause misunderstanding. If we can upgrade to 1.8.2, that would be great, since 
we could then remove the workaround. The workaround is, after all, a hacky 
solution.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint

[jira] [Reopened] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance

2017-01-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reopened SPARK-14480:


This patch has a regression: a column that contains an escaped newline can't be 
parsed correctly anymore.
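
For illustration, a minimal Scala sketch (hypothetical path and data; spark is an existing SparkSession) of the kind of input affected: a quoted field containing an embedded newline, which purely line-by-line parsing splits into two records.

{code}
import java.nio.file.{Files, Paths}

// One logical record whose quoted "comment" field spans two physical lines.
val csv = Seq(
  "id,comment",
  "1,\"first line",
  "second line\"").mkString("\n")
Files.write(Paths.get("/tmp/escaped_newline.csv"), csv.getBytes("UTF-8"))

val df = spark.read
  .option("header", "true")
  .csv("/tmp/escaped_newline.csv")

// With line-by-line parsing the embedded newline breaks the record in two,
// so this shows 2 rows instead of the expected 1.
df.show(false)
{code}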

> Remove meaningless StringIteratorReader for CSV data source for better 
> performance
> --
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, CSV data source reads and parses CSV data bytes by bytes (not line 
> by line).
> In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think 
> is made like this for better performance. However, it looks there are two 
> problems.
> Firstly, it was actually not faster than processing line by line with 
> {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}.
> Secondly, this brought a bit of complexity because it needs additional logics 
> to allow every line to be read bytes by bytes. So, it was pretty difficult to 
> figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes 
> in {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem 
> are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more details,
> h4. Method
> - TCP-H lineitem table is being tested.
> - The results are collected only by 100.
> - End-to-end tests and parsing time tests are performed 10 times and averages 
> are calculated for each.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)]) 
> - Size : 724.66 MB
> h4.  Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}






[jira] [Updated] (SPARK-18218) Optimize BlockMatrix multiplication, which may cause OOM and low parallelism usage problem in several cases

2017-01-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18218:
--
Shepherd: Burak Yavuz  (was: Yanbo Liang)

> Optimize BlockMatrix multiplication, which may cause OOM and low parallelism 
> usage problem in several cases
> ---
>
> Key: SPARK-18218
> URL: https://issues.apache.org/jira/browse/SPARK-18218
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After I take a deep look into `BlockMatrix.multiply` implementation, I found 
> that current implementation may cause some problem in special cases.
> Now let me use an extreme case to represent it:
> Suppose we have two blockMatrix A and B
> A has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> B also has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> Now if we call A.mulitiply(B), no matter how A and B is partitioned,
> the resultPartitioner will always contains only one partition,
> this muliplication implementation will shuffle 1 * 1 blocks into one 
> reducer, this will cause the parallism became 1, 
> what's worse, because `RDD.cogroup` will load the total group element into 
> memory, now at reducer-side, 1 * 1 blocks will be loaded into memory, 
> because they are all shuffled into the same group. It will easily cause 
> executor OOM.
> The above case is a little extreme, but other case, such as M*N dimensions 
> matrix A multiply N*P dimensions matrix B, when N is much larger than M and 
> P, we met the similar problem.
> The multiplication implementation do not handle the task partition properly, 
> it will cause:
> 1. when the middle dimension N is too large, it will cause reducer OOM.
> 2. even if OOM do not occur, it will still cause parallism too low.
> 3. when N is much large than M and P, and matrix A and B have many 
> partitions, it will cause too many partition on M and P dimension, it will 
> cause much larger shuffled data size.






[jira] [Updated] (SPARK-19380) YARN - Dynamic allocation should use configured number of executors as max number of executors

2017-01-26 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated SPARK-19380:
--
Description: 
SPARK-13723 only uses the user's requested number of executors as the initial 
number of executors when dynamic allocation is turned on.

If the configured maximum number of executors is larger than the number 
requested by the user, the application can keep requesting more executors, up 
to the configured maximum, whenever tasks are backed up. This behavior is not 
very friendly to the cluster if we allow every Spark application to reach the 
max number of executors.
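
For reference, a Scala sketch of the configuration interplay being described (values are illustrative only):

{code}
import org.apache.spark.SparkConf

// With SPARK-13723, the user's request only seeds the *initial* executor
// count; growth is still capped by the separately configured maximum, which
// this issue argues should default to the user's request instead.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.executor.instances", "50")                 // user's request
  .set("spark.dynamicAllocation.initialExecutors", "50") // seeded from the request
  .set("spark.dynamicAllocation.maxExecutors", "500")    // app can still grow to 500
{code}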

> YARN - Dynamic allocation should use configured number of executors as max 
> number of executors
> --
>
> Key: SPARK-19380
> URL: https://issues.apache.org/jira/browse/SPARK-19380
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.3
>Reporter: Zhe Zhang
>
>  SPARK-13723 only uses user's number of executors as the initial number of 
> executors when dynamic allocation is turned on.
> If the configured max number of executors is larger than the number of 
> executors requested by the user, user's application could continue to request 
> for more executors to reach the configured max number if there're tasks 
> backed up. This behavior is not very friendly to the cluster if we allow 
> every Spark application to reach the max number of executors.






[jira] [Created] (SPARK-19380) YARN - Dynamic allocation should use configured number of executors as max number of executors

2017-01-26 Thread Zhe Zhang (JIRA)
Zhe Zhang created SPARK-19380:
-

 Summary: YARN - Dynamic allocation should use configured number of 
executors as max number of executors
 Key: SPARK-19380
 URL: https://issues.apache.org/jira/browse/SPARK-19380
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.6.3
Reporter: Zhe Zhang









[jira] [Commented] (SPARK-19220) SSL redirect handler only redirects the server's root

2017-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840785#comment-15840785
 ] 

Apache Spark commented on SPARK-19220:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16717

> SSL redirect handler only redirects the server's root
> -
>
> Key: SPARK-19220
> URL: https://issues.apache.org/jira/browse/SPARK-19220
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.1.1, 2.2.0
>
>
> The redirect handler that is started in the HTTP port when SSL is enabled 
> only redirects the root of the server. Additional handlers do not go through 
> the handler, so if you have a deep link to the non-https server, you won't be 
> redirected to the https port.
> I tested this with the history server, but it should be the same for the 
> normal UI; the fix should be the same for both too.






[jira] [Updated] (SPARK-19379) SparkAppHandle.getState not registering FAILED state upon Spark app failure in Local mode

2017-01-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated SPARK-19379:

Summary: SparkAppHandle.getState not registering FAILED state upon Spark 
app failure in Local mode  (was: SparkAppHandle.getState not registered FAILED 
state upon Spark app failure in Local mode)

> SparkAppHandle.getState not registering FAILED state upon Spark app failure 
> in Local mode
> -
>
> Key: SPARK-19379
> URL: https://issues.apache.org/jira/browse/SPARK-19379
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Kramer
>Priority: Blocker
>
> LocalSchedulerBackend does not handle calling back to the Launcher upon 
> TaskState change. It does send a callback to setState to FINISHED upon 
> stop(). Apps that are FAILED are set as FINISHED in SparkAppHandle.State.
> It looks like a case statement is needed in the statusUpdate() method in 
> LocalSchedulerBacked to call stop( state) or  launcherBackend.setState(state) 
> with the appropriate SparkAppHandle.State for TaskStates FAILED, LAUNCHING, 
> and, possibly, FINISHED.






[jira] [Updated] (SPARK-19220) SSL redirect handler only redirects the server's root

2017-01-26 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-19220:
---
Fix Version/s: 2.1.1

> SSL redirect handler only redirects the server's root
> -
>
> Key: SPARK-19220
> URL: https://issues.apache.org/jira/browse/SPARK-19220
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.1.1, 2.2.0
>
>
> The redirect handler that is started in the HTTP port when SSL is enabled 
> only redirects the root of the server. Additional handlers do not go through 
> the handler, so if you have a deep link to the non-https server, you won't be 
> redirected to the https port.
> I tested this with the history server, but it should be the same for the 
> normal UI; the fix should be the same for both too.






[jira] [Updated] (SPARK-19379) SparkAppHandle.getState not registered FAILED state upon Spark app failure in Local mode

2017-01-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated SPARK-19379:

Summary: SparkAppHandle.getState not registered FAILED state upon Spark app 
failure in Local mode  (was: SparkAppHandle.getState not registered FAILED 
state upon Spark app failure)

> SparkAppHandle.getState not registered FAILED state upon Spark app failure in 
> Local mode
> 
>
> Key: SPARK-19379
> URL: https://issues.apache.org/jira/browse/SPARK-19379
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Kramer
>Priority: Blocker
>
> LocalSchedulerBackend does not handle calling back to the Launcher upon 
> TaskState change. It does send a callback to setState to FINISHED upon 
> stop(). Apps that are FAILED are set as FINISHED in SparkAppHandle.State.
> It looks like a case statement is needed in the statusUpdate() method in 
> LocalSchedulerBacked to call stop( state) or  launcherBackend.setState(state) 
> with the appropriate SparkAppHandle.State for TaskStates FAILED, LAUNCHING, 
> and, possibly, FINISHED.






[jira] [Updated] (SPARK-19379) SparkAppHandle.getState not registered FAILED state upon Spark app failure in Local mode

2017-01-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated SPARK-19379:

Affects Version/s: 2.1.0

> SparkAppHandle.getState not registered FAILED state upon Spark app failure in 
> Local mode
> 
>
> Key: SPARK-19379
> URL: https://issues.apache.org/jira/browse/SPARK-19379
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Kramer
>Priority: Blocker
>
> LocalSchedulerBackend does not handle calling back to the Launcher upon 
> TaskState change. It does send a callback to setState to FINISHED upon 
> stop(). Apps that are FAILED are set as FINISHED in SparkAppHandle.State.
> It looks like a case statement is needed in the statusUpdate() method in 
> LocalSchedulerBacked to call stop( state) or  launcherBackend.setState(state) 
> with the appropriate SparkAppHandle.State for TaskStates FAILED, LAUNCHING, 
> and, possibly, FINISHED.






[jira] [Created] (SPARK-19379) SparkAppHandle.getState not registered FAILED state upon Spark app failure

2017-01-26 Thread Adam Kramer (JIRA)
Adam Kramer created SPARK-19379:
---

 Summary: SparkAppHandle.getState not registered FAILED state upon 
Spark app failure
 Key: SPARK-19379
 URL: https://issues.apache.org/jira/browse/SPARK-19379
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Adam Kramer
Priority: Blocker


LocalSchedulerBackend does not call back to the Launcher upon a TaskState 
change. It only calls setState(FINISHED) upon stop(), so apps that FAILED end 
up reported as FINISHED in SparkAppHandle.State.

It looks like a case statement is needed in the statusUpdate() method of 
LocalSchedulerBackend to call stop(state) or launcherBackend.setState(state) 
with the appropriate SparkAppHandle.State for the TaskStates FAILED, LAUNCHING 
and, possibly, FINISHED.






[jira] [Updated] (SPARK-19378) StateOperator metrics should still return the total number of rows in state even if there was no data for a trigger

2017-01-26 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-19378:

Description: 
If you have a StreamingDataFrame with an aggregation, we report a metric called 
stateOperators which consists of a list of data points per aggregation for our 
query (With Spark 2.1, only one aggregation is supported).

These data points report:
 - numUpdatedStateRows
 - numTotalStateRows

If a trigger had no data (and therefore did not fire), we return 0 data points; 
however, we should actually return a data point with
 - numTotalStateRows: numTotalStateRows in lastExecution
 - numUpdatedStateRows: 0

This also affects eventTime statistics. We should still provide the min, max, 
avg even though the data didn't change.

  was:
If you have a StreamingDataFrame with an aggregation, we report a metric called 
stateOperators which consists of a list of data points per aggregation for our 
query (With Spark 2.1, only one aggregation is supported).

These data points report:
 - numUpdatedStateRows
 - numTotalStateRows

If a trigger had no data (and therefore did not fire), we return 0 data points; 
however, we should actually return a data point with
 - numTotalStateRows: numTotalStateRows in lastExecution
 - numUpdatedStateRows: 0



> StateOperator metrics should still return the total number of rows in state 
> even if there was no data for a trigger
> ---
>
> Key: SPARK-19378
> URL: https://issues.apache.org/jira/browse/SPARK-19378
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> If you have a StreamingDataFrame with an aggregation, we report a metric 
> called stateOperators which consists of a list of data points per aggregation 
> for our query (With Spark 2.1, only one aggregation is supported).
> These data points report:
>  - numUpdatedStateRows
>  - numTotalStateRows
> If a trigger had no data (and therefore did not fire), we return 0 data points; 
> however, we should actually return a data point with
>  - numTotalStateRows: numTotalStateRows in lastExecution
>  - numUpdatedStateRows: 0
> This also affects eventTime statistics. We should still provide the min, max, 
> avg even though the data didn't change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19378) StateOperator metrics should still return the total number of rows in state even if there was no data for a trigger

2017-01-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19378:


Assignee: Burak Yavuz  (was: Apache Spark)

> StateOperator metrics should still return the total number of rows in state 
> even if there was no data for a trigger
> ---
>
> Key: SPARK-19378
> URL: https://issues.apache.org/jira/browse/SPARK-19378
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> If you have a StreamingDataFrame with an aggregation, we report a metric 
> called stateOperators which consists of a list of data points per aggregation 
> for our query (With Spark 2.1, only one aggregation is supported).
> These data points report:
>  - numUpdatedStateRows
>  - numTotalStateRows
> If a trigger had no data (and therefore did not fire), we return 0 data points; 
> however, we should actually return a data point with
>  - numTotalStateRows: numTotalStateRows in lastExecution
>  - numUpdatedStateRows: 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19378) StateOperator metrics should still return the total number of rows in state even if there was no data for a trigger

2017-01-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19378:


Assignee: Apache Spark  (was: Burak Yavuz)

> StateOperator metrics should still return the total number of rows in state 
> even if there was no data for a trigger
> ---
>
> Key: SPARK-19378
> URL: https://issues.apache.org/jira/browse/SPARK-19378
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> If you have a StreamingDataFrame with an aggregation, we report a metric 
> called stateOperators which consists of a list of data points per aggregation 
> for our query (With Spark 2.1, only one aggregation is supported).
> These data points report:
>  - numUpdatedStateRows
>  - numTotalStateRows
> If a trigger had no data (and therefore did not fire), we return 0 data points; 
> however, we should actually return a data point with
>  - numTotalStateRows: numTotalStateRows in lastExecution
>  - numUpdatedStateRows: 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19378) StateOperator metrics should still return the total number of rows in state even if there was no data for a trigger

2017-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840757#comment-15840757
 ] 

Apache Spark commented on SPARK-19378:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/16716

> StateOperator metrics should still return the total number of rows in state 
> even if there was no data for a trigger
> ---
>
> Key: SPARK-19378
> URL: https://issues.apache.org/jira/browse/SPARK-19378
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> If you have a StreamingDataFrame with an aggregation, we report a metric 
> called stateOperators which consists of a list of data points per aggregation 
> for our query (With Spark 2.1, only one aggregation is supported).
> These data points report:
>  - numUpdatedStateRows
>  - numTotalStateRows
> If a trigger had no data (and therefore did not fire), we return 0 data points; 
> however, we should actually return a data point with
>  - numTotalStateRows: numTotalStateRows in lastExecution
>  - numUpdatedStateRows: 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries

2017-01-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840711#comment-15840711
 ] 

Joseph K. Bradley commented on SPARK-4638:
--

Commenting here b/c of the recent dev list thread: Non-linear kernels for SVMs 
in Spark would be great to have.  The main barriers are:
* Kernelized SVM training is hard to distribute.  Naive methods require a lot 
of communication.  To get this feature into Spark, we'd need to do proper 
background research and write up a good design.
* Other ML algorithms are arguably more in demand and still need improvements 
(as of the date of this comment).  Tree ensembles are first-and-foremost in my 
mind.

> Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to 
> find non linear boundaries
> ---
>
> Key: SPARK-4638
> URL: https://issues.apache.org/jira/browse/SPARK-4638
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: madankumar s
>  Labels: Gaussian, Kernels, SVM
> Attachments: kernels-1.3.patch
>
>
> SPARK MLlib Classification Module:
> Add Kernel functionalities to SVM Classifier to find non linear patterns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19378) StateOperator metrics should still return the total number of rows in state even if there was no data for a trigger

2017-01-26 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-19378:
---

 Summary: StateOperator metrics should still return the total 
number of rows in state even if there was no data for a trigger
 Key: SPARK-19378
 URL: https://issues.apache.org/jira/browse/SPARK-19378
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz


If you have a StreamingDataFrame with an aggregation, we report a metric called 
stateOperators which consists of a list of data points per aggregation for our 
query (With Spark 2.1, only one aggregation is supported).

These data points report:
 - numUpdatedStateRows
 - numTotalStateRows

If a trigger had no data (and therefore did not fire), we return 0 data points; 
however, we should actually return a data point with
 - numTotalStateRows: numTotalStateRows in lastExecution
 - numUpdatedStateRows: 0
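
For context, a hedged sketch of how these numbers surface to users through the 
Structured Streaming progress API in Spark 2.1; the streaming DataFrame 
(aggregatedDf) and the sink settings are assumptions for illustration, and the 
public StateOperatorProgress fields (numRowsTotal / numRowsUpdated) correspond 
to the internal metrics named above.

{code}
// Hedged sketch: reading per-operator state metrics from the last progress
// report of a streaming aggregation query. aggregatedDf is assumed to be a
// streaming DataFrame that contains exactly one aggregation.
val query = aggregatedDf.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("state_metrics_example")
  .start()

val progress = query.lastProgress   // null until the first trigger has run
if (progress != null) {
  progress.stateOperators.foreach { op =>
    println(s"numRowsTotal=${op.numRowsTotal}, numRowsUpdated=${op.numRowsUpdated}")
  }
}
{code}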




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API

2017-01-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18080:
--
Assignee: Yun Ni  (was: Yanbo Liang)

> Locality Sensitive Hashing (LSH) Python API
> ---
>
> Key: SPARK-18080
> URL: https://issues.apache.org/jira/browse/SPARK-18080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-01-26 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated SPARK-19354:
--
Description: 
When we enable speculation, we can see multiple attempts running for the same 
task when the first attempt's progress is slow. If any of the task attempts 
succeeds, the other attempts are killed; while they are being killed, those 
attempts get marked as FAILED due to the error below. We need to handle this 
error and mark the attempt as KILLED instead of FAILED.
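
For reference, a minimal sketch of the configuration that produces the 
speculative attempts described above; these are standard Spark settings, and 
the values here are only illustrative.

{code}
import org.apache.spark.SparkConf

// Minimal sketch: enabling speculation so slow task attempts get duplicated.
// The killed duplicates are what end up marked FAILED in the UI, per this issue.
val conf = new SparkConf()
  .setAppName("speculation-example")
  .set("spark.speculation", "true")              // launch speculative copies of slow tasks
  .set("spark.speculation.multiplier", "1.5")    // how much slower than the median counts as slow
  .set("spark.speculation.quantile", "0.75")     // fraction of tasks that must finish first
{code}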


||93||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
xx.xx.xx.x2
stdout
stderr||2017/01/24 10:30:44 ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
||java.io.IOException: Failed on local exception: 
java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
"node2/xx.xx.xx.x2"; destination host is: "node1":9000; 
+details||

{code:xml}
17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in stage 
1.0 (TID 214)
17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
version is 1
17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 214)
java.io.IOException: Failed on local exception: 
java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
"stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy17.create(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy18.create(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
at org.apache.spark.scheduler.Task.run(Task.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedByInterruptException
at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:659)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.a

[jira] [Updated] (SPARK-19377) Killed tasks should have the status as KILLED

2017-01-26 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated SPARK-19377:
--
Description: 
|143|10 |0  |SUCCESS|NODE_LOCAL |6 / x.xx.x.x
stdout
stderr |2017/01/25 07:49:27 |0 ms   |0.0 B / 0  |0.0 B 
/ 0  |TaskKilled (killed intentionally)|



|156|11 |0  |SUCCESS|NODE_LOCAL |5 / x.xx.x.x
stdout
stderr |2017/01/25 07:49:27 |0 ms   |0.0 B / 0  |0.0 B 
/ 0  |TaskKilled (killed intentionally)|

Killed tasks show the task status as SUCCESS; I think we should show the status 
as KILLED for the killed tasks.

  was:
|143|10 |0  |SUCCESS|NODE_LOCAL |6 / x.xx.x.x
stdout
stderr
|2017/01/25 07:49:27|0 ms   |0.0 B / 0  |0.0 B / 0  
|TaskKilled (killed intentionally)|



|156|11 |0  |SUCCESS|NODE_LOCAL |5 / x.xx.x.x
stdout
stderr
|2017/01/25 07:49:27|0 ms   |0.0 B / 0  |0.0 B / 0  
|TaskKilled (killed intentionally)|

Killed tasks show the task status as SUCCESS; I think we should show the status 
as KILLED for the killed tasks.


> Killed tasks should have the status as KILLED
> -
>
> Key: SPARK-19377
> URL: https://issues.apache.org/jira/browse/SPARK-19377
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Reporter: Devaraj K
>Priority: Minor
>
> |143  |10 |0  |SUCCESS|NODE_LOCAL |6 / x.xx.x.x
> stdout
> stderr |2017/01/25 07:49:27   |0 ms   |0.0 B / 0  |0.0 B 
> / 0  |TaskKilled (killed intentionally)|
> |156  |11 |0  |SUCCESS|NODE_LOCAL |5 / x.xx.x.x
> stdout
> stderr |2017/01/25 07:49:27   |0 ms   |0.0 B / 0  |0.0 B 
> / 0  |TaskKilled (killed intentionally)|
> Killed tasks show the task status as SUCCESS; I think we should show the 
> status as KILLED for the killed tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19377) Killed tasks should have the status as KILLED

2017-01-26 Thread Devaraj K (JIRA)
Devaraj K created SPARK-19377:
-

 Summary: Killed tasks should have the status as KILLED
 Key: SPARK-19377
 URL: https://issues.apache.org/jira/browse/SPARK-19377
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Reporter: Devaraj K
Priority: Minor


|143|10 |0  |SUCCESS|NODE_LOCAL |6 / x.xx.x.x
stdout
stderr
|2017/01/25 07:49:27|0 ms   |0.0 B / 0  |0.0 B / 0  
|TaskKilled (killed intentionally)|



|156|11 |0  |SUCCESS|NODE_LOCAL |5 / x.xx.x.x
stdout
stderr
|2017/01/25 07:49:27|0 ms   |0.0 B / 0  |0.0 B / 0  
|TaskKilled (killed intentionally)|

Killed tasks show the task status as SUCCESS; I think we should show the status 
as KILLED for the killed tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19067) mapWithState Style API

2017-01-26 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-19067:
-

Assignee: Tathagata Das

> mapWithState Style API
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
>
> Right now the only way to do stateful operations is with Aggregator or 
> UDAF.  However, this does not give users control of emission or expiration of 
> state, making it hard to implement things like sessionization.  We should add 
> a more general construct (probably similar to {{mapWithState}}) to structured 
> streaming.
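
For reference, a hedged sketch of the existing DStream-side construct the 
summary alludes to (mapWithState in spark.streaming); the input stream and the 
running-count state logic are assumptions chosen only to illustrate the per-key 
control over state, emission, and expiration being requested for Structured 
Streaming.

{code}
import org.apache.spark.streaming.{Seconds, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Hedged sketch: per-key running counts with explicit state update, emission,
// and expiration, using the DStream mapWithState API referenced above.
// `events` is an assumed DStream[(String, Int)] of (key, count) pairs.
def runningCounts(events: DStream[(String, Int)]): DStream[(String, Int)] = {
  val spec = StateSpec.function { (key: String, value: Option[Int], state: State[Int]) =>
    val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum)          // explicit control over what is kept as state
    (key, sum)                 // explicit control over what is emitted downstream
  }.timeout(Seconds(3600))     // explicit control over state expiration
  events.mapWithState(spec)
}
{code}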



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-01-26 Thread Thunder Stumpges (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840569#comment-15840569
 ] 

Thunder Stumpges commented on SPARK-19371:
--

Thanks Sean. I don't think I can (always) rely on perfect distribution of the 
underlying data in HDFS and data locality for this as the number of files (I'm 
using parquet, but this could vary a lot) and data source in general could be 
different in different cases. An extreme example might be a query to 
ElasticSearch or Cassandra or other external store that has no locality with 
the spark executors.

I guess I can agree that you might not want to classify this as a bug, but it 
is at minimum a very important improvement / new feature request, as it can 
make certain workloads prohibitively slow, and create quite unbalanced 
utilization of a cluster on the whole.

I think shuffling is probably the right idea. It seems to me once an RDD is 
cached, a simple "rebalance" could be used to move the partitions around the 
executors in preparation for additional operations.
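
A hedged sketch of the closest thing available today with existing APIs (there 
is no built-in "rebalance" call): re-shuffle the loaded data before persisting, 
as already tried in the issue description. The partition count and path are 
taken from that description and are not guaranteed to yield an even spread.

{code}
// Hedged sketch: approximate a post-load "rebalance" with an explicit shuffle
// before caching. This mirrors the attempts in the issue description and does
// NOT guarantee even placement of cached partitions across executors.
val numPartitions = 48 * 16
val df = sqlContext.read
  .parquet("/data/folder_to_load")   // path from the issue description
  .repartition(numPartitions)        // force a shuffle so partitions are re-assigned
  .persist()
df.count()                           // materialize the cache
{code}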


> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads which are not 
> data-locality sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if when persisting an RDD/DataFrame, if we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (i.e. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the over-all running time of the remaining work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19376) CLONE - CheckAnalysis rejects TPCDS query 32

2017-01-26 Thread Mostafa Shahdadi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mostafa Shahdadi resolved SPARK-19376.
--
Resolution: Fixed

> CLONE - CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-19376
> URL: https://issues.apache.org/jira/browse/SPARK-19376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mostafa Shahdadi
>Assignee: Nattavut Sutyanyong
>Priority: Blocker
> Fix For: 2.1.0
>
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_

[jira] [Created] (SPARK-19376) CLONE - CheckAnalysis rejects TPCDS query 32

2017-01-26 Thread Mostafa Shahdadi (JIRA)
Mostafa Shahdadi created SPARK-19376:


 Summary: CLONE - CheckAnalysis rejects TPCDS query 32
 Key: SPARK-19376
 URL: https://issues.apache.org/jira/browse/SPARK-19376
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Mostafa Shahdadi
Assignee: Nattavut Sutyanyong
Priority: Blocker
 Fix For: 2.1.0


It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem to 
be any obvious error in the query or the check rule though: in the plan below, 
the scalar subquery's condition field is "scalar-subquery#24 
[(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by the 
scalar subquery predicates.

analysis error:
{code}
== Query: q32-v1.4 ==
 Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
in a scalar correlated subquery cannot contain non-correlated columns: 
cs_item_sk#39;;
GlobalLimit 100
+- LocalLimit 100
   +- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
  +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = cs_item_sk#39)) 
&& ((d_date#83 >= 2000-01-27) && (d_date#83 <= cast(cast(cast(cast(2000-01-27 
as date) as timestamp) + interval 12 weeks 6 days as date) as string && 
((d_date_sk#81 = cs_sold_date_sk#58) && (cast(cs_ext_discount_amt#46 as 
decimal(14,7)) > cast(scalar-subquery#24 [(cs_item_sk#39#111 = i_item_sk#59)] 
as decimal(14,7)
 :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
cs_item_sk#39#111]
 : +- Aggregate [cs_item_sk#39], 
[CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
 :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
 :   +- Join Inner
 :  :- SubqueryAlias catalog_sales
 :  :  +- 
Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
 10 more fields] parquet
 :  +- SubqueryAlias date_dim
 : +- 
Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
 4 more fields] parquet
 +- Join Inner
:- Join Inner
:  :- SubqueryAlias catalog_sales
:  :  +- 
Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
 10 more fields] parquet
:  +- SubqueryAlias item
: +- 
Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
 parquet
+- SubqueryAlias date_dim
   +- 
Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
 4 more fields] parquet

{code}

query text:
{code}
select sum(cs_ext_dis

[jira] [Updated] (SPARK-19364) Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are enabled and an exception is thrown

2017-01-26 Thread Andrew Milkowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Milkowski updated SPARK-19364:
-
Priority: Blocker  (was: Major)

> Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are 
> enabled and an exception is thrown 
> --
>
> Key: SPARK-19364
> URL: https://issues.apache.org/jira/browse/SPARK-19364
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: ubuntu unix
> spark 2.0.2
> application is java
>Reporter: Andrew Milkowski
>Priority: Blocker
>
> -- update --- we found that below situation occurs when we encounter
> "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard"
> https://github.com/awslabs/amazon-kinesis-client/issues/108
> we use s3 directory (and dynamodb) to store checkpoints, but if such occurs 
> blocks should not get stuck but continue to be evicted gracefully from 
> memory, obviously kinesis library race condition is a problem onto itself...
> -- exception leading to a block not being freed up --
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/mnt/yarn/usercache/hadoop/filecache/24/__spark_libs__7928020266533182031.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 17/01/26 13:52:00 ERROR KinesisRecordProcessor: ShutdownException:  Caught 
> shutdown exception, skipping checkpoint.
> com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:144)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
>   at 
> org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
>   at 
> org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
>   at 
> org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
> running standard kinesis stream ingestion with a java spark app and creating 
> dstream after running for some time some block streams seem to persist 
> forever and never cleaned up and this eventually leads to memory depletion on 
> workers
> we even tried cleaning RDD's with the following:
> cleaner = ssc.sparkContext().sc().cleaner().get();
> filtered.foreachRDD(new VoidFunction>() {
> @Override
> public void call(JavaRDD rdd) throws Exception {
>cleaner.doCle

[jira] [Updated] (SPARK-19364) Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are enabled and an exception is thrown

2017-01-26 Thread Andrew Milkowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Milkowski updated SPARK-19364:
-
Description: 
-- update --- we found that below situation occurs when we encounter

"com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
Can't update checkpoint - instance doesn't hold the lease for this shard"

https://github.com/awslabs/amazon-kinesis-client/issues/108

we use s3 directory (and dynamodb) to store checkpoints, but if such occurs 
blocks should not get stuck but continue to be evicted gracefully from memory, 
obviously kinesis library race condition is a problem onto itself...

-- exception leading to a block not being freed up --

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/mnt/yarn/usercache/hadoop/filecache/24/__spark_libs__7928020266533182031.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/01/26 13:52:00 ERROR KinesisRecordProcessor: ShutdownException:  Caught 
shutdown exception, skipping checkpoint.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
Can't update checkpoint - instance doesn't hold the lease for this shard
at 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
at 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
at 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
at 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:144)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
at 
org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
at 
org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at 
org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at 
org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)

running standard kinesis stream ingestion with a java spark app and creating 
dstream after running for some time some block streams seem to persist forever 
and never cleaned up and this eventually leads to memory depletion on workers

we even tried cleaning RDD's with the following:

cleaner = ssc.sparkContext().sc().cleaner().get();

filtered.foreachRDD(new VoidFunction>() {
@Override
public void call(JavaRDD rdd) throws Exception {
   cleaner.doCleanupRDD(rdd.id(), true);
}
});

despite the above, blocks still persist; this can be seen in the Spark admin UI

for instance

input-0-1485362233945   1   ip-<>:34245 Memory Serialized   1442.5 
KB

above block stays and is never cleaned up

  was:
-- update --- we found that below situation occurs when we encounter

"com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
Can't update checkpoint - instance doesn't hold the lease for this shard"

https://github.com/awslabs/amazon-kinesis-client/issues/108

we use s3 directory (and dynamodb) to store checkpoints, but if such occurs 
blocks should not get stuck but continue to be evicted gracefull

[jira] [Updated] (SPARK-19364) Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are enabled and an exception is thrown

2017-01-26 Thread Andrew Milkowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Milkowski updated SPARK-19364:
-
Summary: Stream Blocks in Storage Persists Forever when Kinesis Checkpoints 
are enabled and an exception is thrown   (was: Some Stream Blocks in Storage 
Persists Forever)

> Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are 
> enabled and an exception is thrown 
> --
>
> Key: SPARK-19364
> URL: https://issues.apache.org/jira/browse/SPARK-19364
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: ubuntu unix
> spark 2.0.2
> application is java
>Reporter: Andrew Milkowski
>
> -- update --- we found that below situation occurs when we encounter
> "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard"
> https://github.com/awslabs/amazon-kinesis-client/issues/108
> we use s3 directory (and dynamodb) to store checkpoints, but if such occurs 
> blocks should not get stuck but continue to be evicted gracefully from 
> memory, obviously kinesis library race condition is a problem onto itself...
> running standard kinesis stream ingestion with a java spark app and creating 
> dstream after running for some time some block streams seem to persist 
> forever and never cleaned up and this eventually leads to memory depletion on 
> workers
> we even tried cleaning RDD's with the following:
> cleaner = ssc.sparkContext().sc().cleaner().get();
> filtered.foreachRDD(new VoidFunction>() {
> @Override
> public void call(JavaRDD rdd) throws Exception {
>cleaner.doCleanupRDD(rdd.id(), true);
> }
> });
> despite the above, blocks still persist; this can be seen in the Spark admin UI
> for instance
> input-0-1485362233945 1   ip-<>:34245 Memory Serialized   1442.5 
> KB
> above block stays and is never cleaned up



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-26 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840419#comment-15840419
 ] 

Charles Allen commented on SPARK-19111:
---

While switching to s3a helped the logs upload, it made the spark history server 
unusable, which is probably another bug.
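
For readers wanting to reproduce the switch mentioned above, a hedged sketch of 
pointing event logging at an s3a:// destination instead of s3n; the bucket name 
is a placeholder (it is not given in this issue), and only the filesystem 
scheme change is the point.

{code}
import org.apache.spark.SparkConf

// Hedged sketch: write event logs via the s3a filesystem instead of s3n.
// <bucket> is a placeholder; the key prefix matches the logs quoted below.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "s3a://<bucket>/eventLogs/remnant")
{code}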

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running spark on mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yield an upload are as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:59:44,679 IN

[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-26 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840418#comment-15840418
 ] 

Charles Allen commented on SPARK-19111:
---

We have a patch https://github.com/apache/spark/pull/16714 for SPARK-16333 
which fixes the problem on our side by disabling the verbose new metrics.

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running spark on mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yield an upload are as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> closed. Now beginning upload

[jira] [Updated] (SPARK-19364) Some Stream Blocks in Storage Persists Forever

2017-01-26 Thread Andrew Milkowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Milkowski updated SPARK-19364:
-
Description: 
-- update --- we found that below situation occurs when we encounter

"com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
Can't update checkpoint - instance doesn't hold the lease for this shard"

https://github.com/awslabs/amazon-kinesis-client/issues/108

we use s3 directory (and dynamodb) to store checkpoints, but if such occurs 
blocks should not get stuck but continue to be evicted gracefully from memory, 
obviously kinesis library race condition is a problem onto itself...

running standard kinesis stream ingestion with a java spark app and creating 
dstream after running for some time some block streams seem to persist forever 
and never cleaned up and this eventually leads to memory depletion on workers

we even tried cleaning RDD's with the following:

cleaner = ssc.sparkContext().sc().cleaner().get();

filtered.foreachRDD(new VoidFunction>() {
@Override
public void call(JavaRDD rdd) throws Exception {
   cleaner.doCleanupRDD(rdd.id(), true);
}
});

despite the above, blocks still persist; this can be seen in the Spark admin UI

for instance

input-0-1485362233945   1   ip-<>:34245 Memory Serialized   1442.5 
KB

above block stays and is never cleaned up

  was:
-- update --- we found that below situation occurs when we encounter

"com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
Can't update checkpoint - instance doesn't hold the lease for this shard"

https://github.com/awslabs/amazon-kinesis-client/issues/108

we use s3 directory (and dynamodb) to store checkpoints, but if such occurs 
blocks should not get stuck but continue gracefully

running standard kinesis stream ingestion with a java spark app and creating 
dstream after running for some time some block streams seem to persist forever 
and never cleaned up and this eventually leads to memory depletion on workers

we even tried cleaning RDD's with the following:

cleaner = ssc.sparkContext().sc().cleaner().get();

filtered.foreachRDD(new VoidFunction>() {
@Override
public void call(JavaRDD rdd) throws Exception {
   cleaner.doCleanupRDD(rdd.id(), true);
}
});

despite the above, blocks still persist; this can be seen in the Spark admin UI

for instance

input-0-1485362233945   1   ip-<>:34245 Memory Serialized   1442.5 
KB

above block stays and is never cleaned up


> Some Stream Blocks in Storage Persists Forever
> --
>
> Key: SPARK-19364
> URL: https://issues.apache.org/jira/browse/SPARK-19364
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: ubuntu unix
> spark 2.0.2
> application is java
>Reporter: Andrew Milkowski
>
> -- update --- we found that the situation below occurs when we encounter
> "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard"
> https://github.com/awslabs/amazon-kinesis-client/issues/108
> we use an s3 directory (and dynamodb) to store checkpoints, but when this occurs 
> blocks should not get stuck; they should continue to be evicted gracefully from 
> memory. obviously the kinesis library race condition is a problem unto itself...
> running standard kinesis stream ingestion with a java spark app and creating a 
> dstream, after running for some time some stream blocks seem to persist forever 
> and are never cleaned up, which eventually leads to memory depletion on workers
> we even tried cleaning RDDs with the following:
> cleaner = ssc.sparkContext().sc().cleaner().get();
> filtered.foreachRDD(new VoidFunction>() {
> @Override
> public void call(JavaRDD rdd) throws Exception {
>cleaner.doCleanupRDD(rdd.id(), true);
> }
> });
> despite the above, the blocks still persist; this can be seen in the spark admin UI
> for instance
> input-0-1485362233945 1   ip-<>:34245 Memory Serialized   1442.5 
> KB
> above block stays and is never cleaned up



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19364) Some Stream Blocks in Storage Persists Forever

2017-01-26 Thread Andrew Milkowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Milkowski updated SPARK-19364:
-
Description: 
-- update --- we found that the situation below occurs when we encounter

"com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
Can't update checkpoint - instance doesn't hold the lease for this shard"

https://github.com/awslabs/amazon-kinesis-client/issues/108

we use an s3 directory (and dynamodb) to store checkpoints, but when this occurs 
blocks should not get stuck but should continue gracefully

running standard kinesis stream ingestion with a java spark app and creating a 
dstream, after running for some time some stream blocks seem to persist forever 
and are never cleaned up, which eventually leads to memory depletion on workers

we even tried cleaning RDDs with the following:

cleaner = ssc.sparkContext().sc().cleaner().get();

filtered.foreachRDD(new VoidFunction>() {
@Override
public void call(JavaRDD rdd) throws Exception {
   cleaner.doCleanupRDD(rdd.id(), true);
}
});

despite the above, the blocks still persist; this can be seen in the spark admin UI

for instance

input-0-1485362233945   1   ip-<>:34245 Memory Serialized   1442.5 
KB

above block stays and is never cleaned up

  was:
*** update *** we found that the situation below occurs when we encounter

"com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
Can't update checkpoint - instance doesn't hold the lease for this shard"

https://github.com/awslabs/amazon-kinesis-client/issues/108

we use an s3 directory (and dynamodb) to store checkpoints, but when this occurs 
blocks should not get stuck but should continue gracefully

running standard kinesis stream ingestion with a java spark app and creating a 
dstream, after running for some time some stream blocks seem to persist forever 
and are never cleaned up, which eventually leads to memory depletion on workers

we even tried cleaning RDDs with the following:

cleaner = ssc.sparkContext().sc().cleaner().get();

filtered.foreachRDD(new VoidFunction>() {
@Override
public void call(JavaRDD rdd) throws Exception {
   cleaner.doCleanupRDD(rdd.id(), true);
}
});

despite the above, the blocks still persist; this can be seen in the spark admin UI

for instance

input-0-1485362233945   1   ip-<>:34245 Memory Serialized   1442.5 
KB

above block stays and is never cleaned up


> Some Stream Blocks in Storage Persists Forever
> --
>
> Key: SPARK-19364
> URL: https://issues.apache.org/jira/browse/SPARK-19364
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: ubuntu unix
> spark 2.0.2
> application is java
>Reporter: Andrew Milkowski
>
> -- update --- we found that the situation below occurs when we encounter
> "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard"
> https://github.com/awslabs/amazon-kinesis-client/issues/108
> we use an s3 directory (and dynamodb) to store checkpoints, but when this occurs 
> blocks should not get stuck but should continue gracefully
> running standard kinesis stream ingestion with a java spark app and creating a 
> dstream, after running for some time some stream blocks seem to persist forever 
> and are never cleaned up, which eventually leads to memory depletion on workers
> we even tried cleaning RDDs with the following:
> cleaner = ssc.sparkContext().sc().cleaner().get();
> filtered.foreachRDD(new VoidFunction>() {
> @Override
> public void call(JavaRDD rdd) throws Exception {
>cleaner.doCleanupRDD(rdd.id(), true);
> }
> });
> despite the above, the blocks still persist; this can be seen in the spark admin UI
> for instance
> input-0-1485362233945 1   ip-<>:34245 Memory Serialized   1442.5 
> KB
> above block stays and is never cleaned up



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19364) Some Stream Blocks in Storage Persists Forever

2017-01-26 Thread Andrew Milkowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Milkowski updated SPARK-19364:
-
Description: 
*** update *** we found that the situation below occurs when we encounter

"com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
Can't update checkpoint - instance doesn't hold the lease for this shard"

https://github.com/awslabs/amazon-kinesis-client/issues/108

we use an s3 directory (and dynamodb) to store checkpoints, but when this occurs 
blocks should not get stuck but should continue gracefully

running standard kinesis stream ingestion with a java spark app and creating a 
dstream, after running for some time some stream blocks seem to persist forever 
and are never cleaned up, which eventually leads to memory depletion on workers

we even tried cleaning RDDs with the following:

cleaner = ssc.sparkContext().sc().cleaner().get();

filtered.foreachRDD(new VoidFunction>() {
@Override
public void call(JavaRDD rdd) throws Exception {
   cleaner.doCleanupRDD(rdd.id(), true);
}
});

despite the above, the blocks still persist; this can be seen in the spark admin UI

for instance

input-0-1485362233945   1   ip-<>:34245 Memory Serialized   1442.5 
KB

above block stays and is never cleaned up

  was:
running standard kinesis stream ingestion with a java spark app and creating a 
dstream, after running for some time some stream blocks seem to persist forever 
and are never cleaned up, which eventually leads to memory depletion on workers

we even tried cleaning RDDs with the following:

cleaner = ssc.sparkContext().sc().cleaner().get();

filtered.foreachRDD(new VoidFunction>() {
@Override
public void call(JavaRDD rdd) throws Exception {
   cleaner.doCleanupRDD(rdd.id(), true);
}
});

despite the above, the blocks still persist; this can be seen in the spark admin UI

for instance

input-0-1485362233945   1   ip-<>:34245 Memory Serialized   1442.5 
KB

above block stays and is never cleaned up


> Some Stream Blocks in Storage Persists Forever
> --
>
> Key: SPARK-19364
> URL: https://issues.apache.org/jira/browse/SPARK-19364
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: ubuntu unix
> spark 2.0.2
> application is java
>Reporter: Andrew Milkowski
>
> *** update *** we found that the situation below occurs when we encounter
> "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard"
> https://github.com/awslabs/amazon-kinesis-client/issues/108
> we use an s3 directory (and dynamodb) to store checkpoints, but when this occurs 
> blocks should not get stuck but should continue gracefully
> running standard kinesis stream ingestion with a java spark app and creating a 
> dstream, after running for some time some stream blocks seem to persist forever 
> and are never cleaned up, which eventually leads to memory depletion on workers
> we even tried cleaning RDDs with the following:
> cleaner = ssc.sparkContext().sc().cleaner().get();
> filtered.foreachRDD(new VoidFunction>() {
> @Override
> public void call(JavaRDD rdd) throws Exception {
>cleaner.doCleanupRDD(rdd.id(), true);
> }
> });
> despite the above, the blocks still persist; this can be seen in the spark admin UI
> for instance
> input-0-1485362233945 1   ip-<>:34245 Memory Serialized   1442.5 
> KB
> above block stays and is never cleaned up



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15505) Explode nested Array in DF Column into Multiple Columns

2017-01-26 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840390#comment-15840390
 ] 

Herman van Hovell commented on SPARK-15505:
---

I am closing this as a won't fix. Thanks for the input though.

> Explode nested Array in DF Column into Multiple Columns 
> 
>
> Key: SPARK-15505
> URL: https://issues.apache.org/jira/browse/SPARK-15505
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Jorge Machado
>Priority: Minor
>
> At the moment if we have a DF like this : 
> {noformat}
> +--+-+
> | Col1 | Col2|
> +--+-+
> |  1   |[2, 3, 4]|
> |  1   |[2, 3, 4]|
> +--+-+
> {noformat}
> There is no way to directly transform it into : 
> {noformat}
> +--+--+--+--+
> | Col1 | Col2 | Col3 | Col4 |
> +--+--+--+--+
> |  1   |  2   |  3   |  4   |
> |  1   |  2   |  3   |  4   |
> +--+--+--+--+ 
> {noformat}
> I think this should be easy to implement
> More info here: 
> http://stackoverflow.com/questions/37391241/explode-spark-columns/37392793#37392793
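
For what it's worth, the second layout can be approximated with the existing API by 
indexing into the array column when its length is known up front; a minimal Scala 
sketch (column names taken from the example above, everything else illustrative):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local").appName("explode-to-columns").getOrCreate()
import spark.implicits._

val df = Seq((1, Seq(2, 3, 4)), (1, Seq(2, 3, 4))).toDF("Col1", "Col2")

// Pull each array element out by position; this only works when the array
// length is fixed and known, which is exactly the limitation the issue raises.
val widened = df.select(
  col("Col1"),
  col("Col2").getItem(0).as("Col2"),
  col("Col2").getItem(1).as("Col3"),
  col("Col2").getItem(2).as("Col4"))

widened.show()
{code}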



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19375) na.fill() should not change the data type of column

2017-01-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-19375.
--
Resolution: Duplicate

> na.fill() should not change the data type of column
> ---
>
> Key: SPARK-19375
> URL: https://issues.apache.org/jira/browse/SPARK-19375
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Davies Liu
>
> When these functions are called, a column of numeric type will be converted 
> into DoubleType, which will cause unexpected type casting and also precision 
> loss for LongType.
> {code}
> def fill(value: Double)
> def fill(value: Double, cols: Array[String])
> def fill(value: Double, cols: Seq[String])
> {code}
> Workaround
> {code}
> fill(Map("name" -> v))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15505) Explode nested Array in DF Column into Multiple Columns

2017-01-26 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-15505.
-
Resolution: Won't Fix

> Explode nested Array in DF Column into Multiple Columns 
> 
>
> Key: SPARK-15505
> URL: https://issues.apache.org/jira/browse/SPARK-15505
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Jorge Machado
>Priority: Minor
>
> At the moment if we have a DF like this : 
> {noformat}
> +--+-+
> | Col1 | Col2|
> +--+-+
> |  1   |[2, 3, 4]|
> |  1   |[2, 3, 4]|
> +--+-+
> {noformat}
> There is no way to directly transform it into : 
> {noformat}
> +--+--+--+--+
> | Col1 | Col2 | Col3 | Col4 |
> +--+--+--+--+
> |  1   |  2   |  3   |  4   |
> |  1   |  2   |  3   |  4   |
> +--+--+--+--+ 
> {noformat}
> I think this should be easy to implement
> More info here: 
> http://stackoverflow.com/questions/37391241/explode-spark-columns/37392793#37392793



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API

2017-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840388#comment-15840388
 ] 

Apache Spark commented on SPARK-18080:
--

User 'Yunni' has created a pull request for this issue:
https://github.com/apache/spark/pull/16715

> Locality Sensitive Hashing (LSH) Python API
> ---
>
> Key: SPARK-18080
> URL: https://issues.apache.org/jira/browse/SPARK-18080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19374) java.security.KeyManagementException: Default SSLContext is initialized automatically

2017-01-26 Thread Derek M Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Derek M Miller updated SPARK-19374:
---
Description: 
I am currently getting an SSL error when turning on ssl. I have confirmed 
nothing is wrong with the certificates as well. This is running on EMR and has 
the following configuration for spark-defaults (obviously with the passwords 
and paths changed):

{code}
  {
"Classification": "spark-defaults",
"Properties": {
  "spark.yarn.maxAppAttempts": "1",
  "spark.yarn.executor.memoryOverhead": "2048",
  "spark.ssl.enabled": "true",
  "spark.ssl.keyStore": "not_real_path/keystore.jks",
  "spark.ssl.keyStorePassword": "not_a_real_password",
  "spark.ssl.trustStore": "not_real_path/truststore.jks",
  "spark.ssl.trustStorePassword": "not_a_real_password"
}
  }
{code}

and I am getting the following exception:

{code}
17/01/26 20:02:31 INFO spark.SecurityManager: Changing view acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: Changing modify acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
with modify permissions: Set(hadoop)
17/01/26 20:02:31 INFO yarn.Client: Deleting staging directory 
.sparkStaging/application_1485460802835_0001
Exception in thread "main" java.security.KeyManagementException: Default 
SSLContext is initialized automatically
at 
sun.security.ssl.SSLContextImpl$DefaultSSLContext.engineInit(SSLContextImpl.java:749)
at javax.net.ssl.SSLContext.init(SSLContext.java:282)
at org.apache.spark.SecurityManager.(SecurityManager.scala:284)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:881)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:142)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1021)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/01/26 20:02:31 INFO util.ShutdownHookManager: Shutdown hook called

{code}


  was:
I am currently getting an SSL error when turning on ssl. I have confirmed 
nothing is wrong with the certificates as well. This is running on EMR and has 
the following configuration for spark-defaults:

{code}
  {
"Classification": "spark-defaults",
"Properties": {
  "spark.yarn.maxAppAttempts": "1",
  "spark.yarn.executor.memoryOverhead": "2048",
  "spark.ssl.enabled": "true",
  "spark.ssl.keyStore": "not_real_path/keystore.jks",
  "spark.ssl.keyStorePassword": "not_a_real_password",
  "spark.ssl.trustStore": "not_real_path/truststore.jks",
  "spark.ssl.trustStorePassword": "not_a_real_password"
}
  }
{code}

and I am getting the following exception:

{code}
17/01/26 20:02:31 INFO spark.SecurityManager: Changing view acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: Changing modify acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
with modify permissions: Set(hadoop)
17/01/26 20:02:31 INFO yarn.Client: Deleting staging directory 
.sparkStaging/application_1485460802835_0001
Exception in thread "main" java.security.KeyManagementException: Default 
SSLContext is initialized automatically
at 
sun.security.ssl.SSLContextImpl$DefaultSSLContext.engineInit(SSLContextImpl.java:749)
at javax.net.ssl.SSLContext.init(SSLContext.java:282)
at org.apache.spark.SecurityManager.(SecurityManager.scala:284)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:881)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:142)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1021)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAc

[jira] [Updated] (SPARK-19374) java.security.KeyManagementException: Default SSLContext is initialized automatically

2017-01-26 Thread Derek M Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Derek M Miller updated SPARK-19374:
---
Description: 
I am currently getting an SSL error when turning on ssl. I have confirmed 
nothing is wrong with the certificates as well. This is running on EMR and has 
the following configuration for spark-defaults:

{code}
  {
"Classification": "spark-defaults",
"Properties": {
  "spark.yarn.maxAppAttempts": "1",
  "spark.yarn.executor.memoryOverhead": "2048",
  "spark.ssl.enabled": "true",
  "spark.ssl.keyStore": "not_real_path/keystore.jks",
  "spark.ssl.keyStorePassword": "not_a_real_password",
  "spark.ssl.trustStore": "not_real_path/truststore.jks",
  "spark.ssl.trustStorePassword": "not_a_real_password"
}
  }
{code}

and I am getting the following exception:

{code}
17/01/26 20:02:31 INFO spark.SecurityManager: Changing view acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: Changing modify acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
with modify permissions: Set(hadoop)
17/01/26 20:02:31 INFO yarn.Client: Deleting staging directory 
.sparkStaging/application_1485460802835_0001
Exception in thread "main" java.security.KeyManagementException: Default 
SSLContext is initialized automatically
at 
sun.security.ssl.SSLContextImpl$DefaultSSLContext.engineInit(SSLContextImpl.java:749)
at javax.net.ssl.SSLContext.init(SSLContext.java:282)
at org.apache.spark.SecurityManager.(SecurityManager.scala:284)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:881)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:142)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1021)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/01/26 20:02:31 INFO util.ShutdownHookManager: Shutdown hook called

{code}


  was:
I am currently getting an SSL error when turning on ssl. I have confirmed 
nothing is wrong with the certificates as well. This is running on EMR and has 
the following configuration for spark-defaults:

{code}
  {
"Classification": "spark-defaults",
"Properties": {
  "spark.yarn.maxAppAttempts": "1",
  "spark.yarn.executor.memoryOverhead": "2048",
  "spark.ssl.enabled": "true",
  "spark.ssl.keyStore": "not_real_path/keystore.jks",
  "spark.ssl.keyStorePassword": "not_a_real_password",
  "spark.ssl.trustStore": "not_real_path/truststore.jks",
  "spark.ssl.trustStorePassword": "not_a_real_password"
}
  }
{code}

and I am getting the following exception:

```
17/01/26 20:02:31 INFO spark.SecurityManager: Changing view acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: Changing modify acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
with modify permissions: Set(hadoop)
17/01/26 20:02:31 INFO yarn.Client: Deleting staging directory 
.sparkStaging/application_1485460802835_0001
Exception in thread "main" java.security.KeyManagementException: Default 
SSLContext is initialized automatically
at 
sun.security.ssl.SSLContextImpl$DefaultSSLContext.engineInit(SSLContextImpl.java:749)
at javax.net.ssl.SSLContext.init(SSLContext.java:282)
at org.apache.spark.SecurityManager.(SecurityManager.scala:284)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:881)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:142)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1021)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:

[jira] [Updated] (SPARK-19374) java.security.KeyManagementException: Default SSLContext is initialized automatically

2017-01-26 Thread Derek M Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Derek M Miller updated SPARK-19374:
---
Description: 
I am currently getting an SSL error when turning on ssl. I have confirmed 
nothing is wrong with the certificates as well. This is running on EMR and has 
the following configuration for spark-defaults:

{code}
  {
"Classification": "spark-defaults",
"Properties": {
  "spark.yarn.maxAppAttempts": "1",
  "spark.yarn.executor.memoryOverhead": "2048",
  "spark.ssl.enabled": "true",
  "spark.ssl.keyStore": "not_real_path/keystore.jks",
  "spark.ssl.keyStorePassword": "not_a_real_password",
  "spark.ssl.trustStore": "not_real_path/truststore.jks",
  "spark.ssl.trustStorePassword": "not_a_real_password"
}
  }
{code}

and I am getting the following exception:

```
17/01/26 20:02:31 INFO spark.SecurityManager: Changing view acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: Changing modify acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
with modify permissions: Set(hadoop)
17/01/26 20:02:31 INFO yarn.Client: Deleting staging directory 
.sparkStaging/application_1485460802835_0001
Exception in thread "main" java.security.KeyManagementException: Default 
SSLContext is initialized automatically
at 
sun.security.ssl.SSLContextImpl$DefaultSSLContext.engineInit(SSLContextImpl.java:749)
at javax.net.ssl.SSLContext.init(SSLContext.java:282)
at org.apache.spark.SecurityManager.(SecurityManager.scala:284)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:881)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:142)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1021)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/01/26 20:02:31 INFO util.ShutdownHookManager: Shutdown hook called

```


  was:
I am currently getting an SSL error when turning on ssl. I have confirmed 
nothing is wrong with the certificates as well. This is running on EMR and has 
the following configuration for spark-defaults:

```
  {
"Classification": "spark-defaults",
"Properties": {
  "spark.yarn.maxAppAttempts": "1",
  "spark.yarn.executor.memoryOverhead": "2048",
  "spark.ssl.enabled": "true",
  "spark.ssl.keyStore": "not_real_path/keystore.jks",
  "spark.ssl.keyStorePassword": "not_a_real_password",
  "spark.ssl.trustStore": "not_real_path/truststore.jks",
  "spark.ssl.trustStorePassword": "not_a_real_password"
}
  }
```

and I am getting the following exception:

```
17/01/26 20:02:31 INFO spark.SecurityManager: Changing view acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: Changing modify acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
with modify permissions: Set(hadoop)
17/01/26 20:02:31 INFO yarn.Client: Deleting staging directory 
.sparkStaging/application_1485460802835_0001
Exception in thread "main" java.security.KeyManagementException: Default 
SSLContext is initialized automatically
at 
sun.security.ssl.SSLContextImpl$DefaultSSLContext.engineInit(SSLContextImpl.java:749)
at javax.net.ssl.SSLContext.init(SSLContext.java:282)
at org.apache.spark.SecurityManager.(SecurityManager.scala:284)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:881)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:142)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1021)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

[jira] [Created] (SPARK-19375) na.fill() should not change the data type of column

2017-01-26 Thread Davies Liu (JIRA)
Davies Liu created SPARK-19375:
--

 Summary: na.fill() should not change the data type of column
 Key: SPARK-19375
 URL: https://issues.apache.org/jira/browse/SPARK-19375
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 1.6.3, 1.5.2, 1.4.1, 1.3.1, 2.2.0
Reporter: Davies Liu


When these functions are called, a column of numeric type will be converted into 
DoubleType, which will cause unexpected type casting and also precision loss 
for LongType.

{code}
def fill(value: Double)
def fill(value: Double, cols: Array[String])
def fill(value: Double, cols: Seq[String])
{code}

Workaround
{code}
fill(Map("name" -> v))
{code}
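
A small Scala sketch of the behavior and the workaround described above (the column 
name and values are illustrative, not taken from the report):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").appName("na-fill-example").getOrCreate()
import spark.implicits._

// A LongType column containing a null and a value larger than 2^53.
val df = Seq(Some(1L), None, Some(1L << 60)).toDF("id")

// fill(value: Double) routes numeric columns through DoubleType, which can
// lose precision for large long values.
df.na.fill(0.0).show(false)

// Workaround from the description: supply a typed value per column name.
df.na.fill(Map("id" -> 0L)).show(false)
{code}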



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19374) java.security.KeyManagementException: Default SSLContext is initialized automatically

2017-01-26 Thread Derek M Miller (JIRA)
Derek M Miller created SPARK-19374:
--

 Summary: java.security.KeyManagementException: Default SSLContext 
is initialized automatically
 Key: SPARK-19374
 URL: https://issues.apache.org/jira/browse/SPARK-19374
 Project: Spark
  Issue Type: Bug
Reporter: Derek M Miller


I am currently getting an SSL error when turning on ssl. I have confirmed 
nothing is wrong with the certificates as well. This is running on EMR and has 
the following configuration for spark-defaults:

```
  {
"Classification": "spark-defaults",
"Properties": {
  "spark.yarn.maxAppAttempts": "1",
  "spark.yarn.executor.memoryOverhead": "2048",
  "spark.ssl.enabled": "true",
  "spark.ssl.keyStore": "not_real_path/keystore.jks",
  "spark.ssl.keyStorePassword": "not_a_real_password",
  "spark.ssl.trustStore": "not_real_path/truststore.jks",
  "spark.ssl.trustStorePassword": "not_a_real_password"
}
  }
```

and I am getting the following exception:

```
17/01/26 20:02:31 INFO spark.SecurityManager: Changing view acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: Changing modify acls to: hadoop
17/01/26 20:02:31 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
with modify permissions: Set(hadoop)
17/01/26 20:02:31 INFO yarn.Client: Deleting staging directory 
.sparkStaging/application_1485460802835_0001
Exception in thread "main" java.security.KeyManagementException: Default 
SSLContext is initialized automatically
at 
sun.security.ssl.SSLContextImpl$DefaultSSLContext.engineInit(SSLContextImpl.java:749)
at javax.net.ssl.SSLContext.init(SSLContext.java:282)
at org.apache.spark.SecurityManager.(SecurityManager.scala:284)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:881)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:142)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1021)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/01/26 20:02:31 INFO util.ShutdownHookManager: Shutdown hook called

```




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19316) Spark event logs are huge compared to 1.5.2

2017-01-26 Thread Jisoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840348#comment-15840348
 ] 

Jisoo Kim commented on SPARK-19316:
---

Found the duplicate and made a PR to resolve the issue: 
https://github.com/apache/spark/pull/16714. I wasn't sure if I needed to 
include this JIRA ticket in the name, so I only mentioned the old one.

> Spark event logs are huge compared to 1.5.2
> ---
>
> Key: SPARK-19316
> URL: https://issues.apache.org/jira/browse/SPARK-19316
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Jisoo Kim
>
> I have a Spark application with many tasks (more than 40k). The event logs 
> for such an application used to be around 2g when I was using a Spark 1.5.2 
> standalone cluster. Now that I am using Spark 2.0 with Mesos, the size of the 
> event log of such an application drastically increased from 2g to 60g with a 
> similar number of tasks. This is affecting the Spark History Server since it is 
> having trouble reading such a huge event log. I wonder whether the increase in 
> the size of an event log is expected in Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16333) Excessive Spark history event/json data size (5GB each)

2017-01-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16333:


Assignee: (was: Apache Spark)

> Excessive Spark history event/json data size (5GB each)
> ---
>
> Key: SPARK-16333
> URL: https://issues.apache.org/jira/browse/SPARK-16333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server 
> release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build)
>Reporter: Peter Liu
>  Labels: performance, spark2.0.0
>
> With Spark2.0.0-preview (May-24 build), the history event data (the json 
> file) that is generated for each Spark application run (see below) can be 
> as big as 5GB (instead of 14 MB for exactly the same application run and the 
> same input data of 1TB under Spark1.6.1)
> -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
> -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-
> The test is done with Sparkbench V2, SQL RDD (see github: 
> https://github.com/SparkTC/spark-bench)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16333) Excessive Spark history event/json data size (5GB each)

2017-01-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16333:


Assignee: Apache Spark

> Excessive Spark history event/json data size (5GB each)
> ---
>
> Key: SPARK-16333
> URL: https://issues.apache.org/jira/browse/SPARK-16333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server 
> release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build)
>Reporter: Peter Liu
>Assignee: Apache Spark
>  Labels: performance, spark2.0.0
>
> With Spark2.0.0-preview (May-24 build), the history event data (the json 
> file) that is generated for each Spark application run (see below) can be 
> as big as 5GB (instead of 14 MB for exactly the same application run and the 
> same input data of 1TB under Spark1.6.1)
> -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
> -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-
> The test is done with Sparkbench V2, SQL RDD (see github: 
> https://github.com/SparkTC/spark-bench)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16333) Excessive Spark history event/json data size (5GB each)

2017-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840340#comment-15840340
 ] 

Apache Spark commented on SPARK-16333:
--

User 'jisookim0513' has created a pull request for this issue:
https://github.com/apache/spark/pull/16714

> Excessive Spark history event/json data size (5GB each)
> ---
>
> Key: SPARK-16333
> URL: https://issues.apache.org/jira/browse/SPARK-16333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server 
> release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build)
>Reporter: Peter Liu
>  Labels: performance, spark2.0.0
>
> With Spark2.0.0-preview (May-24 build), the history event data (the json 
> file) that is generated for each Spark application run (see below) can be 
> as big as 5GB (instead of 14 MB for exactly the same application run and the 
> same input data of 1TB under Spark1.6.1)
> -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
> -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-
> The test is done with Sparkbench V2, SQL RDD (see github: 
> https://github.com/SparkTC/spark-bench)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15505) Explode nested Array in DF Column into Multiple Columns

2017-01-26 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840227#comment-15840227
 ] 

Jorge Machado commented on SPARK-15505:
---

Sometimes you read from a database that has an array in it; at least that was my 
case. But feel free to reject it. 

> Explode nested Array in DF Column into Multiple Columns 
> 
>
> Key: SPARK-15505
> URL: https://issues.apache.org/jira/browse/SPARK-15505
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Jorge Machado
>Priority: Minor
>
> At the moment if we have a DF like this : 
> {noformat}
> +--+-+
> | Col1 | Col2|
> +--+-+
> |  1   |[2, 3, 4]|
> |  1   |[2, 3, 4]|
> +--+-+
> {noformat}
> There is no way to directly transform it into : 
> {noformat}
> +--+--+--+--+
> | Col1 | Col2 | Col3 | Col4 |
> +--+--+--+--+
> |  1   |  2   |  3   |  4   |
> |  1   |  2   |  3   |  4   |
> +--+--+--+--+ 
> {noformat}
> I think this should be easy to implement
> More info here: 
> http://stackoverflow.com/questions/37391241/explode-spark-columns/37392793#37392793



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18873) New test cases for scalar subquery

2017-01-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18873:


Assignee: Apache Spark

> New test cases for scalar subquery
> --
>
> Key: SPARK-18873
> URL: https://issues.apache.org/jira/browse/SPARK-18873
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>Assignee: Apache Spark
>
> This JIRA is for submitting a PR for new test cases on scalar subquery.
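
For readers unfamiliar with the term, a scalar subquery is a subquery that returns a 
single value and is used where an expression is expected; a tiny Scala illustration 
(the temp view names are made up):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").appName("scalar-subquery").getOrCreate()
spark.range(10).createOrReplaceTempView("t1")
spark.range(5).createOrReplaceTempView("t2")

// The parenthesized SELECT returns exactly one value and is used as a column.
spark.sql("SELECT id, (SELECT max(id) FROM t2) AS max_t2 FROM t1").show()
{code}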



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18873) New test cases for scalar subquery

2017-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840197#comment-15840197
 ] 

Apache Spark commented on SPARK-18873:
--

User 'nsyca' has created a pull request for this issue:
https://github.com/apache/spark/pull/16712

> New test cases for scalar subquery
> --
>
> Key: SPARK-18873
> URL: https://issues.apache.org/jira/browse/SPARK-18873
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>
> This JIRA is for submitting a PR for new test cases on scalar subquery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18873) New test cases for scalar subquery

2017-01-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18873:


Assignee: (was: Apache Spark)

> New test cases for scalar subquery
> --
>
> Key: SPARK-18873
> URL: https://issues.apache.org/jira/browse/SPARK-18873
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>
> This JIRA is for submitting a PR for new test cases on scalar subquery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19373) Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than registerd cores

2017-01-26 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-19373:
---

 Summary: Mesos implementation of 
spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than 
registerd cores
 Key: SPARK-19373
 URL: https://issues.apache.org/jira/browse/SPARK-19373
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.1.0
Reporter: Michael Gummelt


We're currently using `totalCoresAcquired` to account for registered resources, 
which is incorrect.  That variable measures the number of cores the scheduler 
has accepted.  We should be using `totalCoreCount` like the other schedulers do.

Fixing this is important for locality, since users often want to wait for all 
executors to come up before scheduling tasks to ensure they get a node-local 
placement. 

original PR to add support: https://github.com/apache/spark/pull/8672/files
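
For reference, the setting whose Mesos behavior is under discussion is configured like 
this (values are illustrative):

{code}
import org.apache.spark.SparkConf

// Illustrative only: ask the scheduler to wait until all requested cores have
// registered (or the wait time elapses) before it starts scheduling tasks.
val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "120s")
{code}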



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2017-01-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840186#comment-15840186
 ] 

Cheng Lian commented on SPARK-18539:


[~viirya], sorry for the (super) late reply. What I mentioned was a *nullable* 
column instead of a *null* column. To be more specific, say we have two Parquet 
files:

- File {{A}} has columns {{}}
- File {{B}} has columns {{}}, where {{c}} is marked as nullable (or 
{{optional}} in Parquet terms)

Then it should be fine to treat these two files as a single dataset with a 
merged schema {{}} and you should be able to push down predicates 
involving {{c}}.

BTW, the Parquet community just made a patch release 1.8.2 that includes a fix 
for PARQUET-389 and we probably will upgrade to 1.8.2 in 2.2.0. Then we'll have 
a proper fix for this issue and remove the workaround we did while doing schema 
merging.
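
A small Scala sketch of the two-file scenario described above (paths and column names 
are illustrative):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").appName("merge-schema").getOrCreate()

// File A has columns (a, b); file B has an extra column c that is absent from A
// and therefore nullable in the merged schema.
spark.range(5).selectExpr("id AS a", "id AS b")
  .write.mode("overwrite").parquet("/tmp/merge/fileA")
spark.range(5).selectExpr("id AS a", "id AS b", "id AS c")
  .write.mode("overwrite").parquet("/tmp/merge/fileB")

// Read both files as one dataset with a merged schema (a, b, c); the point of
// this issue is that a filter on the column missing from file A should not fail.
val merged = spark.read.option("mergeSchema", "true")
  .parquet("/tmp/merge/fileA", "/tmp/merge/fileB")
merged.filter("c IS NOT NULL").show()
{code}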

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType))))
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true))))
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonf

[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-01-26 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840151#comment-15840151
 ] 

Ilya Matiach commented on SPARK-17975:
--

[~josephkb] I was able to verify that this issue is now fixed after rebasing to 
master - can you please close the bug?  Thank you!

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes the 
> RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

--

[jira] [Commented] (SPARK-19220) SSL redirect handler only redirects the server's root

2017-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840138#comment-15840138
 ] 

Apache Spark commented on SPARK-19220:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16711

> SSL redirect handler only redirects the server's root
> -
>
> Key: SPARK-19220
> URL: https://issues.apache.org/jira/browse/SPARK-19220
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.2.0
>
>
> The redirect handler that is started on the HTTP port when SSL is enabled 
> only redirects the root of the server. Additional handlers do not go through 
> the redirect handler, so if you have a deep link to the non-https server, you 
> won't be redirected to the https port.
> I tested this with the history server, but it should be the same for the 
> normal UI; the fix should be the same for both too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19338) Always Identical Name for UDF in the EXPLAIN output

2017-01-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19338.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> Always Identical Name for UDF in the EXPLAIN output 
> 
>
> Key: SPARK-19338
> URL: https://issues.apache.org/jira/browse/SPARK-19338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
> Fix For: 2.1.1, 2.2.0
>
>
> {noformat}
> sql("SELECT udf1(udf2(42))").explain()
> {noformat}
> {noformat}
> == Physical Plan ==
> *Project [UDF(UDF(42)) AS UDF(UDF(42))#7]
> +- Scan OneRowRelation[]
> {noformat}
> Although udf1 and udf2 are different UDFs, the names in the plan are the 
> same. It looks confusing. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19338) Always Identical Name for UDF in the EXPLAIN output

2017-01-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19338:

Assignee: Takeshi Yamamuro

> Always Identical Name for UDF in the EXPLAIN output 
> 
>
> Key: SPARK-19338
> URL: https://issues.apache.org/jira/browse/SPARK-19338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>
> {noformat}
> sql("SELECT udf1(udf2(42))").explain()
> {noformat}
> {noformat}
> == Physical Plan ==
> *Project [UDF(UDF(42)) AS UDF(UDF(42))#7]
> +- Scan OneRowRelation[]
> {noformat}
> Although udf1 and udf2 are different UDFs, the names in the plan are the 
> same, which is confusing. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-01-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840113#comment-15840113
 ] 

Sean Owen commented on SPARK-19371:
---

There's a tension between waiting for locality and taking a free slot that's 
not local. You can increase spark.locality.wait to prefer locality even at the 
expense of delays.

If your underlying HDFS data isn't spread out well, it won't help though. You 
should make sure you have some HDFS replication enabled.

You can try 2x caching replication to make more copies available, obviously at 
the expense of more memory.

Shuffling is actually a decent idea; one way or the other, if the data isn't 
evenly spread, some of it has to be copied.

I don't think this is a bug per se.
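For concreteness, a minimal sketch of the two suggestions above using the 2.x SparkSession API (the config value and storage level are illustrative, not recommendations):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Wait longer for data-local slots before falling back to non-local ones.
val spark = SparkSession.builder()
  .appName("locality-sketch")
  .config("spark.locality.wait", "10s")   // illustrative value
  .getOrCreate()

// Cache each partition on two executors so tasks have more local copies.
val df = spark.read.parquet("/data/folder_to_load")
df.persist(StorageLevel.MEMORY_ONLY_2)
df.count()
{code}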

> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads, which are not 
> data-locality sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if, when persisting an RDD/DataFrame, we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure if this is a more general issue (i.e. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the overall running time of the remaining work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18872) New test cases for EXISTS subquery

2017-01-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18872:


Assignee: Apache Spark

> New test cases for EXISTS subquery
> --
>
> Key: SPARK-18872
> URL: https://issues.apache.org/jira/browse/SPARK-18872
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>Assignee: Apache Spark
>
> This JIRA is for submitting a PR for new EXISTS/NOT EXISTS subquery test 
> cases. It follows the same idea as the IN subquery test cases, which contain 
> simple patterns and then build more complex constructs on both the parent 
> and subquery sides. This batch of test cases is mostly, if not all, positive 
> test cases that do not hit any syntax errors or unsupported functionality. 
> We make an effort to have test cases return rows in the result set so that 
> they can indirectly detect incorrect-result problems.
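An illustrative sketch of the kind of "simple pattern" positive case described above (table and column names are invented, and a SparkSession named {{spark}} is assumed):
{code}
// Positive EXISTS test pattern: the subquery correlates on a single column
// and the outer query is expected to return rows.
spark.sql("""
  SELECT t1.k
  FROM   t1
  WHERE  EXISTS (SELECT 1 FROM t2 WHERE t2.k = t1.k)
""").show()
{code}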



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18872) New test cases for EXISTS subquery

2017-01-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840098#comment-15840098
 ] 

Apache Spark commented on SPARK-18872:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/16710

> New test cases for EXISTS subquery
> --
>
> Key: SPARK-18872
> URL: https://issues.apache.org/jira/browse/SPARK-18872
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>
> This JIRA is for submitting a PR for new EXISTS/NOT EXISTS subquery test 
> cases. It follows the same idea as the IN subquery test cases, which contain 
> simple patterns and then build more complex constructs on both the parent 
> and subquery sides. This batch of test cases is mostly, if not all, positive 
> test cases that do not hit any syntax errors or unsupported functionality. 
> We make an effort to have test cases return rows in the result set so that 
> they can indirectly detect incorrect-result problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18872) New test cases for EXISTS subquery

2017-01-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18872:


Assignee: (was: Apache Spark)

> New test cases for EXISTS subquery
> --
>
> Key: SPARK-18872
> URL: https://issues.apache.org/jira/browse/SPARK-18872
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>
> This JIRA is for submitting a PR for new EXISTS/NOT EXISTS subquery test 
> cases. It follows the same idea as the IN subquery test cases, which contain 
> simple patterns and then build more complex constructs on both the parent 
> and subquery sides. This batch of test cases is mostly, if not all, positive 
> test cases that do not hit any syntax errors or unsupported functionality. 
> We make an effort to have test cases return rows in the result set so that 
> they can indirectly detect incorrect-result problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-01-26 Thread Jay Pranavamurthi (JIRA)
Jay Pranavamurthi created SPARK-19372:
-

 Summary: Code generation for Filter predicate including many OR 
conditions exceeds JVM method size limit 
 Key: SPARK-19372
 URL: https://issues.apache.org/jira/browse/SPARK-19372
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.1.0
Reporter: Jay Pranavamurthi
 Attachments: wide400cols.csv

For the attached csv file, the code below causes the exception 
"org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
grows beyond 64 KB

Code:
{code:borderStyle=solid}
  val conf = new SparkConf().setMaster("local[1]")
  val sqlContext = SparkSession.builder().config(conf).getOrCreate().sqlContext

  val dataframe =
sqlContext
  .read
  .format("com.databricks.spark.csv")
  .load("wide400cols.csv")

  val filter = (0 to 399)
.foldLeft(lit(false))((e, index) => 
e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}"))

  val filtered = dataframe.filter(filter)
  filtered.show(100)
{code}
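A possible mitigation sketch, offered here as an assumption rather than a fix for the codegen limit itself: evaluate the same predicate with a typed filter, so no single generated method has to hold all 400 OR branches. Note that null handling differs slightly from the SQL {{=!=}} operator:
{code}
val filtered2 = dataframe.filter { row =>
  // true if any of the 400 string columns differs from its expected literal
  (0 to 399).exists(i => row.getString(i) != s"column${i + 1}")
}
filtered2.show(100)
{code}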



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-01-26 Thread Jay Pranavamurthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Pranavamurthi updated SPARK-19372:
--
Attachment: wide400cols.csv

> Code generation for Filter predicate including many OR conditions exceeds JVM 
> method size limit 
> 
>
> Key: SPARK-19372
> URL: https://issues.apache.org/jira/browse/SPARK-19372
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
> Attachments: wide400cols.csv
>
>
> For the attached csv file, the code below causes the exception 
> "org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
> grows beyond 64 KB
> Code:
> {code:borderStyle=solid}
>   val conf = new SparkConf().setMaster("local[1]")
>   val sqlContext = 
> SparkSession.builder().config(conf).getOrCreate().sqlContext
>   val dataframe =
> sqlContext
>   .read
>   .format("com.databricks.spark.csv")
>   .load("wide400cols.csv")
>   val filter = (0 to 399)
> .foldLeft(lit(false))((e, index) => 
> e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}"))
>   val filtered = dataframe.filter(filter)
>   filtered.show(100)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-01-26 Thread Thunder Stumpges (JIRA)
Thunder Stumpges created SPARK-19371:


 Summary: Cannot spread cached partitions evenly across executors
 Key: SPARK-19371
 URL: https://issues.apache.org/jira/browse/SPARK-19371
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Thunder Stumpges


Before running an intensive iterative job (in this case a distributed topic 
model training), we need to load a dataset and persist it across executors. 

After loading from HDFS and persisting, the partitions are spread unevenly 
across executors (based on the initial scheduling of the reads, which are not 
data-locality sensitive). The partition sizes are even, just not their 
distribution over executors. We currently have no way to force the partitions 
to spread evenly, and as the iterative algorithm begins, tasks are distributed 
to executors based on this initial load, forcing some very unbalanced work.

This has been mentioned a 
[number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
 of 
[times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
 in 
[various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
 user/dev group threads.

None of the discussions I could find had solutions that worked for me. Here are 
examples of things I have tried. All resulted in partitions in memory that were 
NOT evenly distributed to executors, causing future tasks to be imbalanced 
across executors as well.

*Reduce Locality*
{code}spark.shuffle.reduceLocality.enabled=false/true{code}

*"Legacy" memory mode*
{code}spark.memory.useLegacyMode = true/false{code}

*Basic load and repartition*
{code}
val numPartitions = 48*16
val df = sqlContext.read.
parquet("/data/folder_to_load").
repartition(numPartitions).
persist
df.count
{code}

*Load and repartition to 2x partitions, then shuffle repartition down to 
desired partitions*
{code}
val numPartitions = 48*16
val df2 = sqlContext.read.
parquet("/data/folder_to_load").
repartition(numPartitions*2)
val df = df2.repartition(numPartitions).
persist
df.count
{code}

It would be great if, when persisting an RDD/DataFrame, we could request that 
those partitions be stored evenly across executors in preparation for future 
tasks. 

I'm not sure if this is a more general issue (i.e. not just involving 
persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
difference in the overall running time of the remaining work.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19369) SparkConf not getting properly initialized in PySpark 2.1.0

2017-01-26 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19369.

Resolution: Duplicate

Try setting those in the command line for now; this will be fixed in 2.1.1.

> SparkConf not getting properly initialized in PySpark 2.1.0
> ---
>
> Key: SPARK-19369
> URL: https://issues.apache.org/jira/browse/SPARK-19369
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Windows/Linux
>Reporter: Sidney Feiner
>  Labels: configurations, context, pyspark
>
> Trying to migrate from Spark 1.6 to 2.1, I've stumbled upon a small problem - 
> my SparkContext doesn't get its configurations from the SparkConf object. 
> Before passing them on to the SparkContext constructor, I've made sure my 
> configurations are set.
> I've done some digging and this is what I've found:
> When I initialize the SparkContext, the following code is executed:
> def _do_init(self, master, appName, sparkHome, pyFiles, environment,
>              batchSize, serializer, conf, jsc, profiler_cls):
>     self.environment = environment or {}
>     if conf is not None and conf._jconf is not None:
>         self._conf = conf
>     else:
>         self._conf = SparkConf(_jvm=SparkContext._jvm)
> So I can see that the only way that my SparkConf will be used is if it also 
> has a _jvm object.
> I've used spark-submit to submit my job and printed the _jvm object but it is 
> null, which explains why my SparkConf object is ignored.
> I've tried running exactly the same on Spark 2.0.1 and it worked! My 
> SparkConf object had a valid _jvm object.
> Am I doing something wrong, or is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19370) Flaky test: MetadataCacheSuite

2017-01-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-19370:
---
Affects Version/s: 2.1.0

> Flaky test: MetadataCacheSuite
> --
>
> Key: SPARK-19370
> URL: https://issues.apache.org/jira/browse/SPARK-19370
> Project: Spark
>  Issue Type: Test
>Affects Versions: 2.1.0
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.4/2703/consoleFull
> {code}
> MetadataCacheSuite:
> - SPARK-16336 Suggest doing table refresh when encountering 
> FileNotFoundException
> Exception encountered when invoking run on a nested suite - There are 1 
> possibly leaked file streams. *** ABORTED ***
>   java.lang.RuntimeException: There are 1 possibly leaked file streams.
>   at 
> org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47)
>   at 
> org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90)
>   at 
> org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
>   at 
> org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
>   at 
> org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264)
>   at 
> org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19370) Flaky test: MetadataCacheSuite

2017-01-26 Thread Davies Liu (JIRA)
Davies Liu created SPARK-19370:
--

 Summary: Flaky test: MetadataCacheSuite
 Key: SPARK-19370
 URL: https://issues.apache.org/jira/browse/SPARK-19370
 Project: Spark
  Issue Type: Test
Reporter: Davies Liu
Assignee: Shixiong Zhu


{code}
MetadataCacheSuite:
- SPARK-16336 Suggest doing table refresh when encountering 
FileNotFoundException
Exception encountered when invoking run on a nested suite - There are 1 
possibly leaked file streams. *** ABORTED ***
  java.lang.RuntimeException: There are 1 possibly leaked file streams.
  at 
org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47)
  at 
org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90)
  at 
org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
  at 
org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
  at 
org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
  at 
org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
  at 
org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
  at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264)
  at 
org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  ...
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19370) Flaky test: MetadataCacheSuite

2017-01-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-19370:
---
Description: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.4/2703/consoleFull
{code}
MetadataCacheSuite:
- SPARK-16336 Suggest doing table refresh when encountering 
FileNotFoundException
Exception encountered when invoking run on a nested suite - There are 1 
possibly leaked file streams. *** ABORTED ***
  java.lang.RuntimeException: There are 1 possibly leaked file streams.
  at 
org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47)
  at 
org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90)
  at 
org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
  at 
org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
  at 
org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
  at 
org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
  at 
org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
  at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264)
  at 
org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  ...
{code}

  was:
{code}
MetadataCacheSuite:
- SPARK-16336 Suggest doing table refresh when encountering 
FileNotFoundException
Exception encountered when invoking run on a nested suite - There are 1 
possibly leaked file streams. *** ABORTED ***
  java.lang.RuntimeException: There are 1 possibly leaked file streams.
  at 
org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47)
  at 
org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90)
  at 
org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
  at 
org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
  at 
org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
  at 
org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
  at 
org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
  at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264)
  at 
org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  ...
{code}


> Flaky test: MetadataCacheSuite
> --
>
> Key: SPARK-19370
> URL: https://issues.apache.org/jira/browse/SPARK-19370
> Project: Spark
>  Issue Type: Test
>Affects Versions: 2.1.0
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.4/2703/consoleFull
> {code}
> MetadataCacheSuite:
> - SPARK-16336 Suggest doing table refresh when encountering 
> FileNotFoundException
> Exception encountered when invoking run on a nested suite - There are 1 
> possibly leaked file streams. *** ABORTED ***
>   java.lang.RuntimeException: There are 1 possibly leaked file streams.
>   at 
> org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47)
>   at 
> org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90)
>   at 
> org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
>   at 
> org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
>   at 
> org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264)
>   at 
> org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19368) Very bad performance in BlockMatrix.toIndexedRowMatrix()

2017-01-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19368:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

Do you see ways to optimize for both types of cases at the same time?

> Very bad performance in BlockMatrix.toIndexedRowMatrix()
> 
>
> Key: SPARK-19368
> URL: https://issues.apache.org/jira/browse/SPARK-19368
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Ohad Raviv
>Priority: Minor
> Attachments: profiler snapshot.png
>
>
> In SPARK-12869, this function was optimized for the case of dense matrices 
> using Breeze. However, I have a case with very very sparse matrices which 
> suffers a great deal from this optimization. A process we have that took 
> about 20 mins now takes about 6.5 hours.
> Here is a sample code to see the difference:
> {quote}
> val n = 4
> val density = 0.0002
> val rnd = new Random(123)
> val rndEntryList = (for (i <- 0 until (n*n*density).toInt) yield 
> (rnd.nextInt\(n\), rnd.nextInt\(n\), rnd.nextDouble()))
>   .groupBy(t => (t._1,t._2)).map\(t => t._2.last).map\{ case 
> (i,j,d) => (i,(j,d)) }.toSeq
> val entries: RDD\[(Int, (Int, Double))] = sc.parallelize(rndEntryList, 10)
> val indexedRows = entries.groupByKey().map(e => IndexedRow(e._1, 
> Vectors.sparse(n, e._2.toSeq)))
> val mat = new IndexedRowMatrix(indexedRows, nRows = n, nCols = n)
> val t1 = System.nanoTime()
> 
> println(mat.toBlockMatrix(1,1).toCoordinateMatrix().toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
> val t2 = System.nanoTime()
> println("took: " + (t2 - t1) / 1000 / 1000 + " ms")
> println("")
> 
> println(mat.toBlockMatrix(1,1).toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
> val t3 = System.nanoTime()
> println("took: " + (t3 - t2) / 1000 / 1000 + " ms")
> println("")
> {quote}
> I get:
> {quote}
> took: 9404 ms
> 
> took: 57350 ms
> 
> {quote}
> Looking at it a little with a profiler, I see that the problem is with the 
> SliceVector.update() and SparseVector.apply.
> I currently work-around this by doing:
> {quote}
> blockMatrix.toCoordinateMatrix().toIndexedRowMatrix()
> {quote}
> like it was in version 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15505) Explode nested Array in DF Column into Multiple Columns

2017-01-26 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839908#comment-15839908
 ] 

Herman van Hovell commented on SPARK-15505:
---

This will be quite bad in terms of performance. To me, the 
usefulness also seems a bit limited; you cannot really reason about the 
structure of the data frame after doing this. Why would this be useful?
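For reference, a fixed-width array can already be flattened with an explicit select; a sketch assuming a frame {{df}} shaped like the example in the description below:
{code}
import org.apache.spark.sql.functions.col

// Pick each array element out by index instead of exploding into new columns.
val flattened = df.select(
  col("Col1"),
  col("Col2")(0).as("Col2"),
  col("Col2")(1).as("Col3"),
  col("Col2")(2).as("Col4"))
{code}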

> Explode nested Array in DF Column into Multiple Columns 
> 
>
> Key: SPARK-15505
> URL: https://issues.apache.org/jira/browse/SPARK-15505
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Jorge Machado
>Priority: Minor
>
> At the moment if we have a DF like this : 
> {noformat}
> +--+-+
> | Col1 | Col2|
> +--+-+
> |  1   |[2, 3, 4]|
> |  1   |[2, 3, 4]|
> +--+-+
> {noformat}
> There is no way to directly transform it into : 
> {noformat}
> +--+--+--+--+
> | Col1 | Col2 | Col3 | Col4 |
> +--+--+--+--+
> |  1   |  2   |  3   |  4   |
> |  1   |  2   |  3   |  4   |
> +--+--+--+--+ 
> {noformat}
> I think this should be easy to implement
> More infos here : 
> http://stackoverflow.com/questions/37391241/explode-spark-columns/37392793#37392793



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19369) SparkConf not getting properly initialized in PySpark 2.1.0

2017-01-26 Thread Sidney Feiner (JIRA)
Sidney Feiner created SPARK-19369:
-

 Summary: SparkConf not getting properly initialized in PySpark 
2.1.0
 Key: SPARK-19369
 URL: https://issues.apache.org/jira/browse/SPARK-19369
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
 Environment: Windows/Linux
Reporter: Sidney Feiner


Trying to migrate from Spark 1.6 to 2.1, I've stumbled upon a small problem - 
my SparkContext doesn't get its configurations from the SparkConf object. 
Before passing them on to the SparkContext constructor, I've made sure my 
configurations are set.

I've done some digging and this is what I've found:

When I initialize the SparkContext, the following code is executed:

def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
             serializer, conf, jsc, profiler_cls):
    self.environment = environment or {}
    # java gateway must have been launched at this point.
    if conf is not None and conf._jconf is not None:
        # conf has been initialized in JVM properly, so use conf directly. This
        # represent the scenario that JVM has been launched before SparkConf is
        # created (e.g. SparkContext is created and then stopped, and we create
        # a new SparkConf and new SparkContext again)
        self._conf = conf
    else:
        self._conf = SparkConf(_jvm=SparkContext._jvm)


So I can see that the only way that my SparkConf will be used is if it also has 
a _jvm object.
I've used spark-submit to submit my job and printed the _jvm object but it is 
null, which explains why my SparkConf object is ignored.
I've tried running exactly the same on Spark 2.0.1 and it worked! My SparkConf 
object had a valid _jvm object.

Am I doing something wrong, or is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19369) SparkConf not getting properly initialized in PySpark 2.1.0

2017-01-26 Thread Sidney Feiner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sidney Feiner updated SPARK-19369:
--
Description: 
Trying to migrate from Spark 1.6 to 2.1, I've stumbled upon a small problem - 
my SparkContext doesn't get its configurations from the SparkConf object. 
Before passing them on to the SparkContext constructor, I've made sure my 
configurations are set.

I've done some digging and this is what I've found:

When I initialize the SparkContext, the following code is executed:

def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
             serializer, conf, jsc, profiler_cls):
    self.environment = environment or {}
    if conf is not None and conf._jconf is not None:
        self._conf = conf
    else:
        self._conf = SparkConf(_jvm=SparkContext._jvm)


So I can see that the only way that my SparkConf will be used is if it also has 
a _jvm object.
I've used spark-submit to submit my job and printed the _jvm object but it is 
null, which explains why my SparkConf object is ignored.
I've tried running exactly the same on Spark 2.0.1 and it worked! My SparkConf 
object had a valid _jvm object.

Am I doing something wrong, or is this a bug?

  was:
Trying to migrate from Spark 1.6 to 2.1, I've stumbled upon a small problem - 
my SparkContext doesn't get its configurations from the SparkConf object. 
Before passing them onto to the SparkContext constructor, I've made sure my 
configuration are set.

I've done some digging and this is what I've found:

When I initialize the SparkContext, the following code is executed:

def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, 
serializer,
 conf, jsc, profiler_cls):
self.environment = environment or {}
# java gateway must have been launched at this point.
if conf is not None and conf._jconf is not None:
# conf has been initialized in JVM properly, so use conf directly. This 
represent the
# scenario that JVM has been launched before SparkConf is created (e.g. 
SparkContext is
# created and then stopped, and we create a new SparkConf and new 
SparkContext again)
   self._conf = conf
else:
self._conf = SparkConf(_jvm=SparkContext._jvm)


So I can see that the only way that my SparkConf will be used is if it also has 
a _jvm object.
I've used spark-submit to submit my job and printed the _jvm object but it is 
null, which explains why my SparkConf object is ignored.
I've tried running exactly the same on Spark 2.0.1 and it worked! My SparkConf 
object had a valid _jvm object.

Am i doing something wrong or is this a bug?


> SparkConf not getting properly initialized in PySpark 2.1.0
> ---
>
> Key: SPARK-19369
> URL: https://issues.apache.org/jira/browse/SPARK-19369
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Windows/Linux
>Reporter: Sidney Feiner
>  Labels: configurations, context, pyspark
>
> Trying to migrate from Spark 1.6 to 2.1, I've stumbled upon a small problem - 
> my SparkContext doesn't get its configurations from the SparkConf object. 
> Before passing them on to the SparkContext constructor, I've made sure my 
> configurations are set.
> I've done some digging and this is what I've found:
> When I initialize the SparkContext, the following code is executed:
> def _do_init(self, master, appName, sparkHome, pyFiles, environment,
>              batchSize, serializer, conf, jsc, profiler_cls):
>     self.environment = environment or {}
>     if conf is not None and conf._jconf is not None:
>         self._conf = conf
>     else:
>         self._conf = SparkConf(_jvm=SparkContext._jvm)
> So I can see that the only way that my SparkConf will be used is if it also 
> has a _jvm object.
> I've used spark-submit to submit my job and printed the _jvm object but it is 
> null, which explains why my SparkConf object is ignored.
> I've tried running exactly the same on Spark 2.0.1 and it worked! My 
> SparkConf object had a valid _jvm object.
> Am I doing something wrong, or is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-01-26 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839845#comment-15839845
 ] 

Herman van Hovell commented on SPARK-19122:
---

[~tejasp] We should fix this.

The rules make sense to me. I only want to add that {{EnsureRequirements}} 
fixes the plan bottom up. This means that we can use {{outputPartitioning}} and 
the {{outputOrdering}} of a plan's children to compute the plan's 
{{requiredChildDistribution}} and {{requiredChildOrdering}}. So we don't have 
to modify these methods.
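As a user-level workaround until this is fixed, one can build the equi-join condition in bucketing order; a hypothetical helper (name and signature invented for illustration) might look like:
{code}
import org.apache.spark.sql.{Column, DataFrame}

// Emit the equi-join predicates in the same order as the table's
// bucket/sort columns so the planner sees the keys in bucketing order.
def orderedJoinCond(left: DataFrame, right: DataFrame, bucketCols: Seq[String]): Column =
  bucketCols.map(c => left(c) === right(c)).reduce(_ && _)

// e.g. table1.join(table2, orderedJoinCond(table1, table2, Seq("j", "k")))
{code}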


> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in 
> respective order)
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if join predicates are specified in query in *same* order as bucketing 
> and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}
> The same query with join predicates in *different* order from bucketing and 
> sort order leads to extra shuffle and sort being introduced
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(k#62, j#61, 200)
> : +- *Project [i#60, j#61, k#62]
> :+- *Filter (isnotnull(k#62) && isnotnull(j#61))
> :   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k#101, j#100, 200)
>   +- *Project [i#99, j#100, k#101]
>  +- *Filter (isnotnull(j#100) && isnotnull(k#101))
> +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5786) Documentation of Narrow Dependencies

2017-01-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839788#comment-15839788
 ] 

Hyukjin Kwon commented on SPARK-5786:
-

It seems they are documented, at least, in API docs, e.g., 
https://github.com/apache/spark/blob/4cb49412d1d7d10ffcc738475928c7de2bc59fd4/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L55-L61
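As one concrete illustration of the behaviour such documentation could cover (a minimal sketch, assuming an existing SparkContext {{sc}}):
{code}
import org.apache.spark.HashPartitioner

// Co-partition two pair RDDs with the same partitioner; the join below is
// then a narrow dependency and does not need an extra shuffle stage.
val part = new HashPartitioner(8)
val a = sc.parallelize(1 to 100).map(i => (i, i)).partitionBy(part)
val b = sc.parallelize(1 to 100).map(i => (i, i * 2)).partitionBy(part)
val joined = a.join(b)   // narrow: both parents share `part`
{code}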

> Documentation of Narrow Dependencies
> 
>
> Key: SPARK-5786
> URL: https://issues.apache.org/jira/browse/SPARK-5786
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Imran Rashid
>
> Narrow dependencies can really improve job performance by skipping shuffles 
> entirely.  However aside from being mentioned in some early papers and during 
> some meetups, they aren't explained (or even mentioned) in the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2687) after receving allocated containers,amClient should remove ContainerRequest.

2017-01-26 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-2687.
-
Resolution: Duplicate

I am resolving this per 
https://github.com/apache/spark/pull/3245#issuecomment-67898632 and reporter's 
approval https://github.com/apache/spark/pull/3245#issuecomment-68059290

> after receving allocated containers,amClient should remove ContainerRequest.
> 
>
> Key: SPARK-2687
> URL: https://issues.apache.org/jira/browse/SPARK-2687
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Lianhui Wang
>
> As described in https://issues.apache.org/jira/browse/YARN-1902, after receiving 
> allocated containers, if amClient does not remove the ContainerRequest, 
> the RM will continually allocate containers for the Spark AppMaster. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18839) Executor is active on web, but actually is dead

2017-01-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839719#comment-15839719
 ] 

Hyukjin Kwon commented on SPARK-18839:
--

[~uncleGen] Would you mind elaborating on why you think it is not a bug, and 
taking action on this JIRA if you strongly feel it is not one?

> Executor is active on web, but actually is dead
> ---
>
> Key: SPARK-18839
> URL: https://issues.apache.org/jira/browse/SPARK-18839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: meiyoula
>Priority: Minor
>
> When a container is preempted, the AM finds it completed and the driver removes 
> the block manager. But the executor actually dies after a few seconds; during 
> this period it updates blocks and re-registers the block manager, so the 
> executors page shows the executor as active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18579) spark-csv strips whitespace (pyspark)

2017-01-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839704#comment-15839704
 ] 

Hyukjin Kwon commented on SPARK-18579:
--

Can we just strip them within the dataframe/dataset? The available options for 
read/write are well documented. IMHO, we should not add options/APIs just 
for consistency.
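A sketch of that suggestion (the column name and output path are invented, and {{df}} is an assumed DataFrame): handle the whitespace explicitly in the frame before writing, instead of through a new writer option:
{code}
import org.apache.spark.sql.functions.trim

// Strip the whitespace yourself, explicitly, before the CSV write.
val cleaned = df.withColumn("c", trim(df("c")))
cleaned.write.csv("/tmp/out")
{code}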

> spark-csv strips whitespace (pyspark) 
> --
>
> Key: SPARK-18579
> URL: https://issues.apache.org/jira/browse/SPARK-18579
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.2
>Reporter: Adrian Bridgett
>Priority: Minor
>
> ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace are supported on the CSV 
> reader (and default to false).
> However, these are not supported options on the CSV writer, so the library 
> defaults apply, which strip the whitespace.
> I think it would make the most sense if the writer semantics matched the 
> reader's (and did not alter the data); however, this would be a change in 
> behaviour. In any case, it'd be great to have the _option_ to strip or not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19361) kafka.maxRatePerPartition for compacted topic cause exception

2017-01-26 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger resolved SPARK-19361.

Resolution: Duplicate

> kafka.maxRatePerPartition for compacted topic cause exception
> -
>
> Key: SPARK-19361
> URL: https://issues.apache.org/jira/browse/SPARK-19361
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.1
>Reporter: Natalia Gorchakova
>
> Creating a DirectKafkaInputDStream with the param 
> spark.streaming.kafka.maxRatePerPartition for a compacted topic causes an exception:
> ERROR [Executor task launch worker-2] executor.Executor: Exception in task 
> 1.0 in stage 2.0 (TID 22)
> java.lang.AssertionError: assertion failed: Got 3740923 > ending offset 
> 2428156 for topic COMPACTED.KAFKA.TOPIC partition 6 start 2228156. This 
> should not happen, and indicates a message may have been skipped
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:217)
> as KafkaRDD expects maxOffset in a batch <= startOffset + 
> maxRatePerPartition*secondsInBatch, while for a compacted topic some offsets 
> can be missing. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19361) kafka.maxRatePerPartition for compacted topic cause exception

2017-01-26 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839683#comment-15839683
 ] 

Cody Koeninger commented on SPARK-19361:


Compacted topics in general don't work with the direct stream; this isn't unique 
to maxRate.

There's an existing ticket on this, 
https://issues.apache.org/jira/browse/SPARK-17147, with work-in-progress code. 
If this feature is important to you, go help test it.

> kafka.maxRatePerPartition for compacted topic cause exception
> -
>
> Key: SPARK-19361
> URL: https://issues.apache.org/jira/browse/SPARK-19361
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.1
>Reporter: Natalia Gorchakova
>
> Creating a DirectKafkaInputDStream with the param 
> spark.streaming.kafka.maxRatePerPartition for a compacted topic causes an exception:
> ERROR [Executor task launch worker-2] executor.Executor: Exception in task 
> 1.0 in stage 2.0 (TID 22)
> java.lang.AssertionError: assertion failed: Got 3740923 > ending offset 
> 2428156 for topic COMPACTED.KAFKA.TOPIC partition 6 start 2228156. This 
> should not happen, and indicates a message may have been skipped
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:217)
> as KafkaRDD expects maxOffset in a batch <= startOffset + 
> maxRatePerPartition*secondsInBatch, while for a compacted topic some offsets 
> can be missing. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19368) Very bad performance in BlockMatrix.toIndexedRowMatrix()

2017-01-26 Thread Ohad Raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839673#comment-15839673
 ] 

Ohad Raviv commented on SPARK-19368:


caused by..

> Very bad performance in BlockMatrix.toIndexedRowMatrix()
> 
>
> Key: SPARK-19368
> URL: https://issues.apache.org/jira/browse/SPARK-19368
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Ohad Raviv
> Attachments: profiler snapshot.png
>
>
> In SPARK-12869, this function was optimized for the case of dense matrices 
> using Breeze. However, I have a case with very very sparse matrices which 
> suffers a great deal from this optimization. A process we have that took 
> about 20 mins now takes about 6.5 hours.
> Here is a sample code to see the difference:
> {quote}
> val n = 4
> val density = 0.0002
> val rnd = new Random(123)
> val rndEntryList = (for (i <- 0 until (n*n*density).toInt) yield 
> (rnd.nextInt\(n\), rnd.nextInt\(n\), rnd.nextDouble()))
>   .groupBy(t => (t._1,t._2)).map\(t => t._2.last).map\{ case 
> (i,j,d) => (i,(j,d)) }.toSeq
> val entries: RDD\[(Int, (Int, Double))] = sc.parallelize(rndEntryList, 10)
> val indexedRows = entries.groupByKey().map(e => IndexedRow(e._1, 
> Vectors.sparse(n, e._2.toSeq)))
> val mat = new IndexedRowMatrix(indexedRows, nRows = n, nCols = n)
> val t1 = System.nanoTime()
> 
> println(mat.toBlockMatrix(1,1).toCoordinateMatrix().toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
> val t2 = System.nanoTime()
> println("took: " + (t2 - t1) / 1000 / 1000 + " ms")
> println("")
> 
> println(mat.toBlockMatrix(1,1).toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
> val t3 = System.nanoTime()
> println("took: " + (t3 - t2) / 1000 / 1000 + " ms")
> println("")
> {quote}
> I get:
> {quote}
> took: 9404 ms
> 
> took: 57350 ms
> 
> {quote}
> Looking at it a little with a profiler, I see that the problem is with the 
> SliceVector.update() and SparseVector.apply.
> I currently work-around this by doing:
> {quote}
> blockMatrix.toCoordinateMatrix().toIndexedRowMatrix()
> {quote}
> like it was in version 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19368) Very bad performance in BlockMatrix.toIndexedRowMatrix()

2017-01-26 Thread Ohad Raviv (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad Raviv updated SPARK-19368:
---
Attachment: profiler snapshot.png

> Very bad performance in BlockMatrix.toIndexedRowMatrix()
> 
>
> Key: SPARK-19368
> URL: https://issues.apache.org/jira/browse/SPARK-19368
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Ohad Raviv
> Attachments: profiler snapshot.png
>
>
> In SPARK-12869, this function was optimized for the case of dense matrices 
> using Breeze. However, I have a case with very very sparse matrices which 
> suffers a great deal from this optimization. A process we have that took 
> about 20 mins now takes about 6.5 hours.
> Here is a sample code to see the difference:
> {quote}
> val n = 4
> val density = 0.0002
> val rnd = new Random(123)
> val rndEntryList = (for (i <- 0 until (n*n*density).toInt) yield 
> (rnd.nextInt\(n\), rnd.nextInt\(n\), rnd.nextDouble()))
>   .groupBy(t => (t._1,t._2)).map\(t => t._2.last).map\{ case 
> (i,j,d) => (i,(j,d)) }.toSeq
> val entries: RDD\[(Int, (Int, Double))] = sc.parallelize(rndEntryList, 10)
> val indexedRows = entries.groupByKey().map(e => IndexedRow(e._1, 
> Vectors.sparse(n, e._2.toSeq)))
> val mat = new IndexedRowMatrix(indexedRows, nRows = n, nCols = n)
> val t1 = System.nanoTime()
> 
> println(mat.toBlockMatrix(1,1).toCoordinateMatrix().toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
> val t2 = System.nanoTime()
> println("took: " + (t2 - t1) / 1000 / 1000 + " ms")
> println("")
> 
> println(mat.toBlockMatrix(1,1).toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
> val t3 = System.nanoTime()
> println("took: " + (t3 - t2) / 1000 / 1000 + " ms")
> println("")
> {quote}
> I get:
> {quote}
> took: 9404 ms
> 
> took: 57350 ms
> 
> {quote}
> Looking at it a little with a profiler, I see that the problem is with the 
> SliceVector.update() and SparseVector.apply.
> I currently work-around this by doing:
> {quote}
> blockMatrix.toCoordinateMatrix().toIndexedRowMatrix()
> {quote}
> like it was in version 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17734) inner equi-join shorthand that returns Datasets, like DataFrame already has

2017-01-26 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17734.
--
Resolution: Won't Fix

We have {{joinWith}} to return a {{Dataset}}. Also, we have {{join}}, and 
{{Dataset}} and {{DataFrame}} are, to my knowledge, really interchangeable in 
concept. 

Even if the duplicated field names are a problem, they can easily be renamed. 
It does not sound worth introducing another new API or changing behaviour.

I am resolving this. Please reopen it if I misunderstood.
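A short sketch of the existing alternatives mentioned above (case classes, column names, and data are invented; a SparkSession named {{spark}} is assumed):
{code}
import spark.implicits._

case class L(foo: Int, a: String)
case class R(foo: Int, b: String)

val left  = Seq(L(1, "a")).toDS()
val right = Seq(R(1, "b")).toDS()

// Typed variant: joinWith keeps both sides' types, returning Dataset[(L, R)].
val typed = left.joinWith(right, left("foo") === right("foo"))

// Untyped variant: rename the duplicated key first, then join as DataFrames.
val rightR = right.withColumnRenamed("foo", "foo_r")
val flat   = left.join(rightR, left("foo") === rightR("foo_r"))
{code}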

> inner equi-join shorthand that returns Datasets, like DataFrame already has
> ---
>
> Key: SPARK-17734
> URL: https://issues.apache.org/jira/browse/SPARK-17734
> Project: Spark
>  Issue Type: Wish
>Reporter: Leif Warner
>Priority: Minor
>
> There's an existing ".join(right: Dataset[_], usingColumn: String): 
> DataFrame" method on Dataset.
> Would appreciate it if a variant that returns typed Datasets were also 
> available.
> If you write a join that contains the common column name, you get an 
> AnalysisException thrown because that's ambiguous, e.g.:
> $"foo" === $"foo"
> So I wrote table1.toDF()("foo") === table2.toDF()("foo"), but that's a little 
> error prone, and coworkers considered it a hack and didn't want to use it, 
> because it "mixes DataFrame and Dataset api".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19368) Very bad performance in BlockMatrix.toIndexedRowMatrix()

2017-01-26 Thread Ohad Raviv (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad Raviv updated SPARK-19368:
---
Description: 
In SPARK-12869, this function was optimized for the case of dense matrices 
using Breeze. However, I have a case with very very sparse matrices which 
suffers a great deal from this optimization. A process we have that took about 
20 mins now takes about 6.5 hours.
Here is a sample code to see the difference:
{quote}
val n = 4
val density = 0.0002
val rnd = new Random(123)
val rndEntryList = (for (i <- 0 until (n*n*density).toInt) yield 
(rnd.nextInt\(n\), rnd.nextInt\(n\), rnd.nextDouble()))
  .groupBy(t => (t._1,t._2)).map\(t => t._2.last).map\{ case 
(i,j,d) => (i,(j,d)) }.toSeq
val entries: RDD\[(Int, (Int, Double))] = sc.parallelize(rndEntryList, 10)
val indexedRows = entries.groupByKey().map(e => IndexedRow(e._1, 
Vectors.sparse(n, e._2.toSeq)))
val mat = new IndexedRowMatrix(indexedRows, nRows = n, nCols = n)

val t1 = System.nanoTime()

println(mat.toBlockMatrix(1,1).toCoordinateMatrix().toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
val t2 = System.nanoTime()
println("took: " + (t2 - t1) / 1000 / 1000 + " ms")
println("")

println(mat.toBlockMatrix(1,1).toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
val t3 = System.nanoTime()
println("took: " + (t3 - t2) / 1000 / 1000 + " ms")
println("")
{quote}

I get:
{quote}
took: 9404 ms

took: 57350 ms

{quote}

Looking at it a little with a profiler, I see that the problem is with the 
SliceVector.update() and SparseVector.apply.

I currently work-around this by doing:
{quote}
blockMatrix.toCoordinateMatrix().toIndexedRowMatrix()
{quote}
like it was in version 1.6.




  was:
In SPARK-12869, this function was optimized for the case of dense matrices 
using Breeze. However, I have a case with very very sparse matrices which 
suffers a great deal from this optimization. A process we have that took about 
20 mins now takes about 6.5 hours.
Here is a sample code to see the difference:
{quote}
val n = 4
val density = 0.0002
val rnd = new Random(123)
val rndEntryList = (for (i <- 0 until (n*n*density).toInt) yield 
(rnd.nextInt(n), rnd.nextInt(n), rnd.nextDouble()))
  .groupBy(t => (t._1,t._2)).map(t => t._2.last).map { case 
(i,j,d) => (i,(j,d)) }.toSeq
val entries: RDD[(Int, (Int, Double))] = sc.parallelize(rndEntryList, 10)
val indexedRows = entries.groupByKey().map(e => IndexedRow(e._1, 
Vectors.sparse(n, e._2.toSeq)))
val mat = new IndexedRowMatrix(indexedRows, nRows = n, nCols = n)

val t1 = System.nanoTime()

println(mat.toBlockMatrix(1,1).toCoordinateMatrix().toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
val t2 = System.nanoTime()
println("took: " + (t2 - t1) / 1000 / 1000 + " ms")
println("")

println(mat.toBlockMatrix(1,1).toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
val t3 = System.nanoTime()
println("took: " + (t3 - t2) / 1000 / 1000 + " ms")
println("")
{quote}

I get:
{quote}
took: 9404 ms

took: 57350 ms

{quote}

Looking at it a little with a profiler, I see that the problem is with the 
SliceVector.update() and SparseVector.apply.

I currently work-around this by doing:
BlockMatrix.toCoordinateMatrix().toIndexedRowMatrix()
like it was in the previous version.





> Very bad performance in BlockMatrix.toIndexedRowMatrix()
> 
>
> Key: SPARK-19368
> URL: https://issues.apache.org/jira/browse/SPARK-19368
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Ohad Raviv
>
> In SPARK-12869, this function was optimized for the case of dense matrices 
> using Breeze. However, I have a case with very very sparse matrices which 
> suffers a great deal from this optimization. A process we have that took 
> about 20 mins now takes about 6.5 hours.
> Here is a sample code to see the difference:
> {quote}
> val n = 4
> val density = 0.0002
> val rnd = new Random(123)
> val rndEntryList = (for (i <- 0 until (n*n*density).toInt) yield 
> (rnd.nextInt\(n\), rnd.nextInt\(n\), rnd.nextDouble()))
>   .groupBy(t => (t._1,t._2)).map\(t =>

[jira] [Created] (SPARK-19368) Very bad performance in BlockMatrix.toIndexedRowMatrix()

2017-01-26 Thread Ohad Raviv (JIRA)
Ohad Raviv created SPARK-19368:
--

 Summary: Very bad performance in BlockMatrix.toIndexedRowMatrix()
 Key: SPARK-19368
 URL: https://issues.apache.org/jira/browse/SPARK-19368
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.1.0, 2.0.0
Reporter: Ohad Raviv


In SPARK-12869, this function was optimized for the case of dense matrices 
using Breeze. However, I have a case with very sparse matrices that suffers 
a great deal from this optimization: a process we have that took about 
20 minutes now takes about 6.5 hours.
Here is sample code showing the difference:
{quote}
val n = 4
val density = 0.0002
val rnd = new Random(123)
val rndEntryList = (for (i <- 0 until (n * n * density).toInt)
    yield (rnd.nextInt(n), rnd.nextInt(n), rnd.nextDouble()))
  .groupBy(t => (t._1, t._2)).map(t => t._2.last)
  .map { case (i, j, d) => (i, (j, d)) }.toSeq
val entries: RDD[(Int, (Int, Double))] = sc.parallelize(rndEntryList, 10)
val indexedRows = entries.groupByKey()
  .map(e => IndexedRow(e._1, Vectors.sparse(n, e._2.toSeq)))
val mat = new IndexedRowMatrix(indexedRows, nRows = n, nCols = n)

val t1 = System.nanoTime()
println(mat.toBlockMatrix(1, 1).toCoordinateMatrix().toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
val t2 = System.nanoTime()
println("took: " + (t2 - t1) / 1000 / 1000 + " ms")
println("")

println(mat.toBlockMatrix(1, 1).toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
val t3 = System.nanoTime()
println("took: " + (t3 - t2) / 1000 / 1000 + " ms")
println("")
{quote}

I get:
{quote}
took: 9404 ms

took: 57350 ms

{quote}

Looking at it with a profiler, I see that the time is spent in 
SliceVector.update() and SparseVector.apply().

I currently work around this by doing
BlockMatrix.toCoordinateMatrix().toIndexedRowMatrix()
as it was done in the previous version.
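
For reference, a minimal sketch of how this workaround might be wrapped as a helper. The method name is hypothetical and not part of MLlib; it only routes the conversion through CoordinateMatrix as described above:

{code:title=workaround sketch|borderStyle=solid}
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, IndexedRowMatrix}

// Hypothetical helper: go through CoordinateMatrix instead of calling
// BlockMatrix.toIndexedRowMatrix() directly, which sidesteps the dense-oriented
// per-element updates that dominate on very sparse blocks.
def toIndexedRowMatrixViaCoordinate(blockMatrix: BlockMatrix): IndexedRowMatrix =
  blockMatrix.toCoordinateMatrix().toIndexedRowMatrix()
{code}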









[jira] [Created] (SPARK-19367) Hive metastore temporary configuration doesn't specify default filesystem

2017-01-26 Thread Jacek Lewandowski (JIRA)
Jacek Lewandowski created SPARK-19367:
-

 Summary: Hive metastore temporary configuration doesn't specify 
default filesystem
 Key: SPARK-19367
 URL: https://issues.apache.org/jira/browse/SPARK-19367
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 1.6.3, 2.2.0
Reporter: Jacek Lewandowski
Priority: Minor


When a temporary configuration is built in HiveUtils, it sets the warehouse 
location to a path on the local file system. This is fine, but if the default 
file system is something other than the local one, Hive may fail because the 
configuration options that the default file system requires are missing. 

What I propose here is to also set the default file system to {{file:///}}, in 
addition to pointing the warehouse location at the local file system. 

Why: regardless of which file system Hive actually accesses, it performs its 
initialisation against the default file system, and if that file system needs 
extra configuration, initialisation will fail. 
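
A minimal sketch of the proposed change, assuming it is applied to the Hadoop 
Configuration used for the temporary metastore setup; the surrounding code and 
the warehouse path shown are illustrative, not the actual HiveUtils source:

{code:title=proposed change (sketch)|borderStyle=solid}
import org.apache.hadoop.conf.Configuration

val hadoopConf = new Configuration()
// Existing behaviour: point the warehouse at a local scratch directory
// (the path here is only an example).
hadoopConf.set("hive.metastore.warehouse.dir", "file:///tmp/spark-warehouse")
// Proposed addition: make the local file system the default, so Hive's
// initialisation never touches HDFS/S3/etc. and needs no extra configuration.
hadoopConf.set("fs.defaultFS", "file:///")
{code}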






