[jira] [Resolved] (SPARK-13125) makes the ratio of KafkaRDD partition to kafka topic partition configurable.

2016-02-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13125.
---
Resolution: Not A Problem

[~zhengcanbin] don't reopen an issue unless the discussion has substantially 
changed. I don't see that you addressed the point here. Shuffling is a 
solution; anything else that maps one topic partition to many has similar 
characteristics. At a minimum you need to propose a specific mechanism. 
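For reference, a minimal sketch of the shuffle-based workaround mentioned above, 
assuming the 1.6-era direct stream API; the broker address, topic name, batch 
interval and partition count are illustrative, and {{sc}} is an existing 
SparkContext:

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// Each KafkaRDD partition maps 1:1 to a Kafka topic partition.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

// repartition() trades one shuffle per batch for higher downstream parallelism,
// which is the trade-off being discussed in this ticket.
val widened = stream.transform(_.repartition(40))
{code}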

> makes the ratio of KafkaRDD partition to kafka topic partition  configurable.
> -
>
> Key: SPARK-13125
> URL: https://issues.apache.org/jira/browse/SPARK-13125
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: zhengcanbin
>  Labels: features
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Now each given Kafka topic/partition corresponds to an RDD partition; in some 
> cases it's quite necessary to make this configurable, namely a ratio 
> configuration of RDDPartition/kafkaTopicPartition is needed.






[jira] [Issue Comment Deleted] (SPARK-13125) makes the ratio of KafkaRDD partition to kafka topic partition configurable.

2016-02-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13125:
--
Comment: was deleted

(was: Shuffling will increase the network burden, and the number of partitions 
is limited by the total number of disks. In strictly real-time scenarios, having 
one topic partition correspond to multiple RDD partitions is important for 
increasing parallelism. A lot of clients who run our application have raised 
this problem, so I still think it makes sense.)

> makes the ratio of KafkaRDD partition to kafka topic partition  configurable.
> -
>
> Key: SPARK-13125
> URL: https://issues.apache.org/jira/browse/SPARK-13125
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: zhengcanbin
>  Labels: features
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Now each given Kafka topic/partition corresponds to an RDD partition; in some 
> cases it's quite necessary to make this configurable, namely a ratio 
> configuration of RDDPartition/kafkaTopicPartition is needed.






[jira] [Commented] (SPARK-13157) ADD JAR command cannot handle path with @ character

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129994#comment-15129994
 ] 

Davies Liu commented on SPARK-13157:


This was introduced by https://github.com/apache/spark/pull/10905/files

cc [~hvanhovell]

> ADD JAR command cannot handle path with @ character
> ---
>
> Key: SPARK-13157
> URL: https://issues.apache.org/jira/browse/SPARK-13157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> To reproduce this issue locally, copy {{TestUDTF.jar}} under 
> {{$SPARK_HOME/sql/hive/src/test/resources/TestUDTF.jar}} to 
> {{/tmp/a@b/TestUDTF.jar}}. Then start the Thrift server and run the following 
> commands using Beeline:
> {noformat}
> > add jar file:///tmp/a@b/TestUDTF.jar;
> ...
> > CREATE TEMPORARY FUNCTION udtf_count2 AS 
> > 'org.apache.spark.sql.hive.execution.GenericUDTFCount2';
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.FunctionTask (state=,code=0)
> {noformat}
> Please refer to [this PR comment 
> thread|https://github.com/apache/spark/pull/11040] for more details.






[jira] [Commented] (SPARK-13157) ADD JAR command cannot handle path with @ character

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129995#comment-15129995
 ] 

Davies Liu commented on SPARK-13157:


Can be reproduced by: 

{code}
  test("path with @") {
    val plan = parser.parsePlan("ADD JAR /a/b@1/c.jar")
    assert(plan.isInstanceOf[AddJar])
    assert(plan.asInstanceOf[AddJar].path === "/a/b@1/c.jar")
  }
{code}

> ADD JAR command cannot handle path with @ character
> ---
>
> Key: SPARK-13157
> URL: https://issues.apache.org/jira/browse/SPARK-13157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> To reproduce this issue locally, copy {{TestUDTF.jar}} under 
> {{$SPARK_HOME/sql/hive/src/test/resources/TestUDTF.jar}} to 
> {{/tmp/a@b/TestUDTF.jar}}. Then start the Thrift server and run the following 
> commands using Beeline:
> {noformat}
> > add jar file:///tmp/a@b/TestUDTF.jar;
> ...
> > CREATE TEMPORARY FUNCTION udtf_count2 AS 
> > 'org.apache.spark.sql.hive.execution.GenericUDTFCount2';
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.FunctionTask (state=,code=0)
> {noformat}
> Please refer to [this PR comment 
> thread|https://github.com/apache/spark/pull/11040] for more details.






[jira] [Issue Comment Deleted] (SPARK-13125) makes the ratio of KafkaRDD partition to kafka topic partition configurable.

2016-02-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13125:
--
Comment: was deleted

(was: Shuffling will increase the network burden, and the number of partitions 
is limited by the total number of disks. In strictly real-time scenarios, having 
one topic partition correspond to multiple RDD partitions is important for 
increasing parallelism. A lot of clients who run our application have raised 
this problem, so I still think it makes sense.)

> makes the ratio of KafkaRDD partition to kafka topic partition  configurable.
> -
>
> Key: SPARK-13125
> URL: https://issues.apache.org/jira/browse/SPARK-13125
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: zhengcanbin
>  Labels: features
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Now each given Kafka topic/partition corresponds to an RDD partition; in some 
> cases it's quite necessary to make this configurable, namely a ratio 
> configuration of RDDPartition/kafkaTopicPartition is needed.






[jira] [Closed] (SPARK-13125) makes the ratio of KafkaRDD partition to kafka topic partition configurable.

2016-02-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-13125.
-

> makes the ratio of KafkaRDD partition to kafka topic partition  configurable.
> -
>
> Key: SPARK-13125
> URL: https://issues.apache.org/jira/browse/SPARK-13125
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: zhengcanbin
>  Labels: features
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Now each given Kafka topic/partition corresponds to an RDD partition; in some 
> cases it's quite necessary to make this configurable, namely a ratio 
> configuration of RDDPartition/kafkaTopicPartition is needed.






[jira] [Resolved] (SPARK-13009) spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json

2016-02-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13009.
---
Resolution: Not A Problem

> spark-streaming-twitter_2.10 does not make it possible to access the raw 
> twitter json
> -
>
> Key: SPARK-13009
> URL: https://issues.apache.org/jira/browse/SPARK-13009
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Andrew Davidson
>Priority: Minor
>
> The spark-streaming-twitter package makes it easy for Java programmers to work 
> with Twitter. The implementation returns the raw Twitter data, in JSON format, 
> as a twitter4j StatusJSONImpl object:
> JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth);
> The Status class is different from the raw JSON, i.e. serializing the Status 
> object will not yield the same JSON as the original. I have downstream systems 
> that can only process raw tweets, not twitter4j Status objects. 
> Here is my bug/RFE request made to Twitter4J; they asked that I create a Spark 
> tracking issue.
> On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
> Hi All
> Quick problem summary:
> My system uses the Status objects to do some analysis; however, I need to 
> store the raw JSON. There are other systems that process that data that are 
> not written in Java.
> Currently we are serializing the Status object. That JSON is going to break 
> downstream systems.
> I am using the Apache Spark Streaming spark-streaming-twitter_2.10  
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
> Request For Enhancement:
> I imagine easy access to the raw JSON is a common requirement. Would it be 
> possible to add a member function getRawJson() to StatusJSONImpl? By default 
> the returned value would be null unless jsonStoreEnabled=true is set in the 
> config.
> Alternative implementations:
>  
> It should be possible to modify spark-streaming-twitter_2.10 to provide 
> this support, but the solution is not very clean:
> It would require Apache Spark to define its own Status POJO; the current 
> StatusJSONImpl class is marked final, and a wrapper is not going to work 
> nicely with existing code.
> spark-streaming-twitter_2.10 does not expose all of the Twitter streaming 
> API, so many developers are writing their own implementations of 
> org.apache.spark.streaming.twitter.TwitterInputDStream. This makes maintenance 
> difficult: it's not easy to know when the Spark implementation for Twitter has 
> changed. 
> Code listing for 
> spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
>   @volatile private var twitterStream: TwitterStream = _
>   @volatile private var stopped = false
>   def onStart() {
> try {
>   val newTwitterStream = new 
> TwitterStreamFactory().getInstance(twitterAuth)
>   newTwitterStream.addListener(new StatusListener {
> def onStatus(status: Status): Unit = {
>   store(status)
> }
> Ref: 
> https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html
> What do people think?
> Kind regards
> Andy
> From:  on behalf of Igor Brigadir 
> 
> Reply-To: 
> Date: Tuesday, January 19, 2016 at 5:55 AM
> To: Twitter4J 
> Subject: Re: [Twitter4J] trouble writing unit test
> The main issue is that the JSON object is in the wrong JSON format.
> e.g.: "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44 
> + 2015", ...
> It looks like the JSON you have was serialized from a Java Status object, 
> which makes the JSON objects different from what you get from the API; 
> TwitterObjectFactory expects JSON from Twitter (I haven't had any problems 
> using TwitterObjectFactory instead of the deprecated DataObjectFactory).
> You could "fix" it by matching the keys and values you have with the correct 
> Twitter API JSON - it should look like the example here: 
> https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid
> But it might be easier to download the tweets again, this time using 
> TwitterObjectFactory.getRawJSON(status) to get the original JSON from the 
> Twitter API, and save that for later. (You must have jsonStoreEnabled=true in 
> your config, and call getRawJSON in the same thread as .showStatus() or 
> lookup() or whatever you're using to load tweets.)
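A minimal sketch of the receiver-based approach discussed above (this is not 
Spark's shipped API): it assumes twitter4j 4.x on the classpath and 
jsonStoreEnabled=true in the configuration, and it relies on getRawJSON being 
called on the listener thread that produced the Status. Class and variable names 
are illustrative.

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import twitter4j._
import twitter4j.auth.Authorization

// Stores the raw tweet JSON strings instead of twitter4j Status objects.
class RawJsonTwitterReceiver(auth: Authorization, storageLevel: StorageLevel)
  extends Receiver[String](storageLevel) {

  @volatile private var twitterStream: TwitterStream = _

  def onStart(): Unit = {
    twitterStream = new TwitterStreamFactory().getInstance(auth)
    twitterStream.addListener(new StatusAdapter {
      override def onStatus(status: Status): Unit = {
        // Must run on the thread that produced the Status, and requires
        // jsonStoreEnabled=true, otherwise getRawJSON returns null.
        val raw = TwitterObjectFactory.getRawJSON(status)
        if (raw != null) store(raw)
      }
    })
    twitterStream.sample()
  }

  def onStop(): Unit = if (twitterStream != null) twitterStream.shutdown()
}
{code}

The resulting DStream carries raw JSON strings that downstream, non-Java systems 
can consume directly.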





[jira] [Commented] (SPARK-13158) Show the information of broadcast blocks in WebUI

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130071#comment-15130071
 ] 

Apache Spark commented on SPARK-13158:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/11046

> Show the information of broadcast blocks in WebUI
> -
>
> Key: SPARK-13158
> URL: https://issues.apache.org/jira/browse/SPARK-13158
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> This ticket targets a function to show information about broadcast blocks: the 
> number of blocks and their total size in mem/disk across the cluster.






[jira] [Assigned] (SPARK-13158) Show the information of broadcast blocks in WebUI

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13158:


Assignee: (was: Apache Spark)

> Show the information of broadcast blocks in WebUI
> -
>
> Key: SPARK-13158
> URL: https://issues.apache.org/jira/browse/SPARK-13158
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> This ticket targets a function to show information about broadcast blocks: the 
> number of blocks and their total size in mem/disk across the cluster.






[jira] [Assigned] (SPARK-13158) Show the information of broadcast blocks in WebUI

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13158:


Assignee: Apache Spark

> Show the information of broadcast blocks in WebUI
> -
>
> Key: SPARK-13158
> URL: https://issues.apache.org/jira/browse/SPARK-13158
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>
> This ticket targets a function to show information about broadcast blocks: the 
> number of blocks and their total size in mem/disk across the cluster.






[jira] [Assigned] (SPARK-13139) Create native DDL commands

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13139:


Assignee: Apache Spark

> Create native DDL commands
> --
>
> Key: SPARK-13139
> URL: https://issues.apache.org/jira/browse/SPARK-13139
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We currently delegate most DDLs directly to Hive, through NativePlaceholder 
> in HiveQl.scala. In Spark 2.0, we want to provide native implementations for 
> DDLs for both SQLContext and HiveContext.
> The first step is to properly parse these DDLs, and then create logical 
> commands that encapsulate them. The actual implementation can still delegate 
> to HiveNativeCommand. As an example, we should define a command for 
> RenameTable with the proper fields, and just delegate the implementation to 
> HiveNativeCommand (we might need to track the original sql query in order to 
> run HiveNativeCommand, but we can remove the sql query in the future once we 
> do the next step).
> Once we flush out the internal persistent catalog API, we can then switch the 
> implementation of these newly added commands to use the catalog API.
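As a purely illustrative sketch of the first step described above (these are not 
the actual Spark classes; the trait, case class and field names are assumptions):

{code}
// Model a parsed DDL as a typed logical command while execution still falls
// back to the original SQL text, as the description suggests for the interim step.
trait NativeDdlCommand { def originalSql: String }

case class RenameTable(oldName: String, newName: String, originalSql: String)
  extends NativeDdlCommand

// Stand-in for delegating to HiveNativeCommand until the catalog API lands.
def execute(cmd: NativeDdlCommand): Unit =
  println(s"delegating to Hive: ${cmd.originalSql}")

execute(RenameTable("old_tbl", "new_tbl", "ALTER TABLE old_tbl RENAME TO new_tbl"))
{code}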






[jira] [Commented] (SPARK-13139) Create native DDL commands

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130174#comment-15130174
 ] 

Apache Spark commented on SPARK-13139:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11048

> Create native DDL commands
> --
>
> Key: SPARK-13139
> URL: https://issues.apache.org/jira/browse/SPARK-13139
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> We currently delegate most DDLs directly to Hive, through NativePlaceholder 
> in HiveQl.scala. In Spark 2.0, we want to provide native implementations for 
> DDLs for both SQLContext and HiveContext.
> The first step is to properly parse these DDLs, and then create logical 
> commands that encapsulate them. The actual implementation can still delegate 
> to HiveNativeCommand. As an example, we should define a command for 
> RenameTable with the proper fields, and just delegate the implementation to 
> HiveNativeCommand (we might need to track the original sql query in order to 
> run HiveNativeCommand, but we can remove the sql query in the future once we 
> do the next step).
> Once we flush out the internal persistent catalog API, we can then switch the 
> implementation of these newly added commands to use the catalog API.






[jira] [Assigned] (SPARK-13139) Create native DDL commands

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13139:


Assignee: (was: Apache Spark)

> Create native DDL commands
> --
>
> Key: SPARK-13139
> URL: https://issues.apache.org/jira/browse/SPARK-13139
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> We currently delegate most DDLs directly to Hive, through NativePlaceholder 
> in HiveQl.scala. In Spark 2.0, we want to provide native implementations for 
> DDLs for both SQLContext and HiveContext.
> The first step is to properly parse these DDLs, and then create logical 
> commands that encapsulate them. The actual implementation can still delegate 
> to HiveNativeCommand. As an example, we should define a command for 
> RenameTable with the proper fields, and just delegate the implementation to 
> HiveNativeCommand (we might need to track the original sql query in order to 
> run HiveNativeCommand, but we can remove the sql query in the future once we 
> do the next step).
> Once we flush out the internal persistent catalog API, we can then switch the 
> implementation of these newly added commands to use the catalog API.






[jira] [Assigned] (SPARK-8321) Authorization Support(on all operations not only DDL) in Spark Sql

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8321:
---

Assignee: (was: Apache Spark)

> Authorization Support(on all operations not only DDL) in Spark Sql
> --
>
> Key: SPARK-8321
> URL: https://issues.apache.org/jira/browse/SPARK-8321
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0
>Reporter: Sunil
> Attachments: SparkSQLauthorizationDesignDocument.pdf
>
>
> Currently, if you run Spark SQL with the Thrift server, it only supports 
> authentication and limited (DDL-level) authorization. We want to extend it to 
> provide full authorization, or pluggable authorization like Apache Sentry, so 
> that users with the proper roles can access data.






[jira] [Assigned] (SPARK-8321) Authorization Support(on all operations not only DDL) in Spark Sql

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8321:
---

Assignee: Apache Spark

> Authorization Support(on all operations not only DDL) in Spark Sql
> --
>
> Key: SPARK-8321
> URL: https://issues.apache.org/jira/browse/SPARK-8321
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0
>Reporter: Sunil
>Assignee: Apache Spark
> Attachments: SparkSQLauthorizationDesignDocument.pdf
>
>
> Currently, if you run Spark SQL with the Thrift server, it only supports 
> authentication and limited (DDL-level) authorization. We want to extend it to 
> provide full authorization, or pluggable authorization like Apache Sentry, so 
> that users with the proper roles can access data.






[jira] [Commented] (SPARK-8321) Authorization Support(on all operations not only DDL) in Spark Sql

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129990#comment-15129990
 ] 

Apache Spark commented on SPARK-8321:
-

User 'winningsix' has created a pull request for this issue:
https://github.com/apache/spark/pull/11045

> Authorization Support(on all operations not only DDL) in Spark Sql
> --
>
> Key: SPARK-8321
> URL: https://issues.apache.org/jira/browse/SPARK-8321
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0
>Reporter: Sunil
> Attachments: SparkSQLauthorizationDesignDocument.pdf
>
>
> Currently, if you run Spark SQL with the Thrift server, it only supports 
> authentication and limited (DDL-level) authorization. We want to extend it to 
> provide full authorization, or pluggable authorization like Apache Sentry, so 
> that users with the proper roles can access data.






[jira] [Commented] (SPARK-12985) Spark Hive thrift server big decimal data issue

2016-02-03 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130013#comment-15130013
 ] 

Adrian Wang commented on SPARK-12985:
-

I think this is a problem with Simba's driver; JDBC never requires a `Decimal` 
to be a `HiveDecimal`.
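To illustrate the point, plain JDBC exposes decimals as java.math.BigDecimal 
with no dependency on Hive's HiveDecimal type. A hedged sketch (the connection 
URL and query are made up, and a suitable JDBC driver must be on the classpath):

{code}
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
val rs = conn.createStatement().executeQuery("SELECT CAST(1.23 AS DECIMAL(10, 2))")
while (rs.next()) {
  // Standard JDBC accessor: a java.math.BigDecimal is all a driver should need.
  val d: java.math.BigDecimal = rs.getBigDecimal(1)
  println(d)
}
conn.close()
{code}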

> Spark Hive thrift server big decimal data issue
> ---
>
> Key: SPARK-12985
> URL: https://issues.apache.org/jira/browse/SPARK-12985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Alex Liu
>Priority: Minor
>
> I tested the trial version of the Simba JDBC driver; it works for simple 
> queries, but there is an issue with data mapping, e.g.:
> {code}
> java.sql.SQLException: [Simba][SparkJDBCDriver](500312) Error in fetching 
> data rows: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   at 
> com.simba.spark.hivecommon.api.HS2Client.buildExceptionFromTStatus(Unknown 
> Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchNRows(Unknown Source)
>   at com.simba.spark.hivecommon.api.HS2Client.fetchRows(Unknown Source)
>   at com.simba.spark.hivecommon.dataengine.BackgroundFetcher.run(Unknown 
> Source)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> Caused by: com.simba.spark.support.exceptions.GeneralException: 
> [Simba][SparkJDBCDriver](500312) Error in fetching data rows: 
> java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal;
>   ... 8 more
> {code}
> To fix it
> {code}
>case DecimalType() =>
>  -to += from.getDecimal(ordinal)
>  +to += HiveDecimal.create(from.getDecimal(ordinal))
> {code}
> to 
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L87






[jira] [Created] (SPARK-13158) Show the information of broadcast blocks in WebUI

2016-02-03 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-13158:


 Summary: Show the information of broadcast blocks in WebUI
 Key: SPARK-13158
 URL: https://issues.apache.org/jira/browse/SPARK-13158
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro


This ticket targets a function to show information about broadcast blocks: the 
number of blocks and their total size in mem/disk across the cluster.






[jira] [Assigned] (SPARK-13002) Mesos scheduler backend does not follow the property spark.dynamicAllocation.initialExecutors

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13002:


Assignee: Apache Spark

> Mesos scheduler backend does not follow the property 
> spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-13002
> URL: https://issues.apache.org/jira/browse/SPARK-13002
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Luc Bourlier
>Assignee: Apache Spark
>  Labels: dynamic_allocation, mesos
>
> When starting a Spark job on a Mesos cluster, all available cores are 
> reserved (up to {{spark.cores.max}}), creating one executor per Mesos node, 
> and as many executors as needed.
> This is the case even when dynamic allocation is enabled.
> When dynamic allocation is enabled, the number of executors launched at 
> startup should be limited to the value of 
> {{spark.dynamicAllocation.initialExecutors}}.
> The Mesos scheduler backend already follows the value computed by the 
> {{ExecutorAllocationManager}} for the number of executors that should be up 
> and running, except at startup, when it just creates all the executors it can.
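For context, a hedged illustration of the configuration being discussed (the 
master URL and values are made up): with dynamic allocation enabled, only 
{{spark.dynamicAllocation.initialExecutors}} worth of executors should be 
requested at startup, rather than everything allowed by {{spark.cores.max}}.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos")              // assumed Mesos master URL
  .set("spark.cores.max", "100")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "2")  // expected startup count
  .set("spark.dynamicAllocation.maxExecutors", "50")
{code}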






[jira] [Commented] (SPARK-13002) Mesos scheduler backend does not follow the property spark.dynamicAllocation.initialExecutors

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130160#comment-15130160
 ] 

Apache Spark commented on SPARK-13002:
--

User 'skyluc' has created a pull request for this issue:
https://github.com/apache/spark/pull/11047

> Mesos scheduler backend does not follow the property 
> spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-13002
> URL: https://issues.apache.org/jira/browse/SPARK-13002
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Luc Bourlier
>  Labels: dynamic_allocation, mesos
>
> When starting a Spark job on a Mesos cluster, all available cores are 
> reserved (up to {{spark.cores.max}}), creating one executor per Mesos node, 
> and as many executors as needed.
> This is the case even when dynamic allocation is enabled.
> When dynamic allocation is enabled, the number of executors launched at 
> startup should be limited to the value of 
> {{spark.dynamicAllocation.initialExecutors}}.
> The Mesos scheduler backend already follows the value computed by the 
> {{ExecutorAllocationManager}} for the number of executors that should be up 
> and running, except at startup, when it just creates all the executors it can.






[jira] [Assigned] (SPARK-13002) Mesos scheduler backend does not follow the property spark.dynamicAllocation.initialExecutors

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13002:


Assignee: (was: Apache Spark)

> Mesos scheduler backend does not follow the property 
> spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-13002
> URL: https://issues.apache.org/jira/browse/SPARK-13002
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Luc Bourlier
>  Labels: dynamic_allocation, mesos
>
> When starting a Spark job on a Mesos cluster, all available cores are 
> reserved (up to {{spark.cores.max}}), creating one executor per Mesos node, 
> and as many executors as needed.
> This is the case even when dynamic allocation is enabled.
> When dynamic allocation is enabled, the number of executors launched at 
> startup should be limited to the value of 
> {{spark.dynamicAllocation.initialExecutors}}.
> The Mesos scheduler backend already follows the value computed by the 
> {{ExecutorAllocationManager}} for the number of executors that should be up 
> and running, except at startup, when it just creates all the executors it can.






[jira] [Commented] (SPARK-13103) HashTF dosn't count TF correctly

2016-02-03 Thread Louis Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130187#comment-15130187
 ] 

Louis Liu commented on SPARK-13103:
---

I'm sorry, you are right. The negative numbers don't matter.

This code should explain the problem:

>>> from pyspark.mllib.feature import HashingTF, IDF
>>> hashtf = HashingTF()
>>> hash('的問題哦')
-234244945207099392
>>> hash('豪們都把')
8689153874407194624
>>> hashtf.indexOf('的問題哦')
0 
>>> hashtf.indexOf('豪們都把')
0
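For comparison, a minimal sketch of the non-negative indexing scheme used on the 
Scala side (assuming the 1.6-era MLlib behaviour of hashing a term and folding 
it into [0, numFeatures)): negative hash codes alone are harmless, but distinct 
terms can still collide into the same bucket.

{code}
// Fold an arbitrary hash code into [0, mod) without a sign problem.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

val numFeatures = 1 << 20  // HashingTF's default feature dimension
Seq("的問題哦", "豪們都把", "一一下嗎", "一一下").foreach { term =>
  println(s"$term -> ${nonNegativeMod(term.##, numFeatures)}")
}
{code}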

> HashTF dosn't count TF correctly
> 
>
> Key: SPARK-13103
> URL: https://issues.apache.org/jira/browse/SPARK-13103
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Ubuntu 14.04
> Python 3.4.3
>Reporter: Louis Liu
>
> I wrote a Python program to calculate frequencies of n-gram sequences with 
> HashTF, but it generates strange output: it found more "一一下嗎" than "一一下".
> HashTF gets a word's index with hash(), but the hashes of some Chinese words 
> are negative.
> Ex:
> >>> hash('一一下嗎')
> -6433835193350070115
> >>> hash('一一下')
> -5938108283593463272






[jira] [Updated] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-02-03 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-12807:
---
Target Version/s: 1.6.1

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (Reported on the Spark dev list.)






[jira] [Commented] (SPARK-13157) ADD JAR command cannot handle path with @ character

2016-02-03 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130045#comment-15130045
 ] 

Herman van Hovell commented on SPARK-13157:
---

Hmmm... The lexer is swallowing @'s. The easiest way of solving this is to 
change the ASTNode's source and remaining vals to use the actual source string. 
I'll create a fix today.
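Purely as an illustration of the point (this is not the actual parser code): 
reading the ADD JAR argument from the raw command string, rather than from 
re-joined lexer tokens, keeps characters like '@' intact.

{code}
// Hypothetical regex-based extraction from the original source string.
val AddJarPattern = """(?i)^\s*ADD\s+JAR\s+(.+?)\s*;?\s*$""".r

"ADD JAR /a/b@1/c.jar" match {
  case AddJarPattern(path) => println(path)   // prints /a/b@1/c.jar
  case _                   => println("not an ADD JAR command")
}
{code}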

> ADD JAR command cannot handle path with @ character
> ---
>
> Key: SPARK-13157
> URL: https://issues.apache.org/jira/browse/SPARK-13157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> To reproduce this issue locally, copy {{TestUDTF.jar}} under 
> {{$SPARK_HOME/sql/hive/src/test/resources/TestUDTF.jar}} to 
> {{/tmp/a@b/TestUDTF.jar}}. Then start the Thrift server and run the following 
> commands using Beeline:
> {noformat}
> > add jar file:///tmp/a@b/TestUDTF.jar;
> ...
> > CREATE TEMPORARY FUNCTION udtf_count2 AS 
> > 'org.apache.spark.sql.hive.execution.GenericUDTFCount2';
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.FunctionTask (state=,code=0)
> {noformat}
> Please refer to [this PR comment 
> thread|https://github.com/apache/spark/pull/11040] for more details.






[jira] [Updated] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one

2016-02-03 Thread Charles Drotar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Drotar updated SPARK-13156:
---
Description: 
I can successfully kick off a query through JDBC to Teradata, and when it runs 
it creates a task on each executor for every partition. The problem is that all 
of the tasks except for one complete within a couple seconds and the final task 
handles the entire dataset.

Example Code:
private val properties = new java.util.Properties()
properties.setProperty("driver","com.teradata.jdbc.TeraDriver")
properties.setProperty("username","foo")
properties.setProperty("password","bar")
val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
val numPartitions = 5
val dbTableTemp = "( SELECT  id MOD $numPartitions%d AS modulo, id FROM 
db.table) AS TEMP_TABLE"
val partitionColumn = "modulo"
val lowerBound = 0.toLong
val upperBound = (numPartitions-1).toLong
val df = 
sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties)
df.write.parquet("/output/path/for/df/")

When I look at the Spark UI I see that 5 tasks, but only 1 is actually querying.

  was:
I can successfully kick off a query through JDBC to Teradata, and when it runs 
it creates a task on each executor for every partition. The problem is that all 
of the tasks except for one complete within a couple seconds and the final task 
handles the entire dataset.

Example Code:
private val properties = new java.util.Properties()
properties.setProperty("driver","com.teradata.jdbc.TeraDriver")
properties.setProperty("username","foo")
properties.setProperty("password","bar")
val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
val numPartitions = 5
val dbTableTemp = "( SELECT  id MOD $numPartitions%d AS modulo, id
  FROM db.table
) AS TEMP_TABLE"
val partitionColumn = "modulo"
val lowerBound = 0.toLong
val upperBound = (numPartitions-1).toLong
val df = 
sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties)
df.write.parquet("/output/path/for/df/")

When I look at the Spark UI I see that 5 tasks, but only 1 is actually querying.


> JDBC using multiple partitions creates additional tasks but only executes on 
> one
> 
>
> Key: SPARK-13156
> URL: https://issues.apache.org/jira/browse/SPARK-13156
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.0
> Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client
>Reporter: Charles Drotar
>
> I can successfully kick off a query through JDBC to Teradata, and when it 
> runs it creates a task on each executor for every partition. The problem is 
> that all of the tasks except for one complete within a couple seconds and the 
> final task handles the entire dataset.
> Example Code:
> private val properties = new java.util.Properties()
> properties.setProperty("driver","com.teradata.jdbc.TeraDriver")
> properties.setProperty("username","foo")
> properties.setProperty("password","bar")
> val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
> val numPartitions = 5
> val dbTableTemp = "( SELECT  id MOD $numPartitions%d AS modulo, id FROM 
> db.table) AS TEMP_TABLE"
> val partitionColumn = "modulo"
> val lowerBound = 0.toLong
> val upperBound = (numPartitions-1).toLong
> val df = 
> sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties)
> df.write.parquet("/output/path/for/df/")
> When I look at the Spark UI I see that 5 tasks, but only 1 is actually 
> querying.






[jira] [Updated] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one

2016-02-03 Thread Charles Drotar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Drotar updated SPARK-13156:
---
Description: 
I can successfully kick off a query through JDBC to Teradata, and when it runs 
it creates a task on each executor for every partition. The problem is that all 
of the tasks except for one complete within a couple seconds and the final task 
handles the entire dataset.

Example Code:
private val properties = new java.util.Properties()
properties.setProperty("driver","com.teradata.jdbc.TeraDriver")
properties.setProperty("username","foo")
properties.setProperty("password","bar")
val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
val numPartitions = 5
val dbTableTemp = "( SELECT  id MOD $numPartitions%d AS modulo, id FROM 
db.table) AS TEMP_TABLE"
val partitionColumn = "modulo"
val lowerBound = 0.toLong
val upperBound = (numPartitions-1).toLong
val df = 
sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties)
df.write.parquet("/output/path/for/df/")

When I look at the Spark UI I see the 5 tasks, but only 1 is actually querying.

  was:
I can successfully kick off a query through JDBC to Teradata, and when it runs 
it creates a task on each executor for every partition. The problem is that all 
of the tasks except for one complete within a couple seconds and the final task 
handles the entire dataset.

Example Code:
private val properties = new java.util.Properties()
properties.setProperty("driver","com.teradata.jdbc.TeraDriver")
properties.setProperty("username","foo")
properties.setProperty("password","bar")
val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
val numPartitions = 5
val dbTableTemp = "( SELECT  id MOD $numPartitions%d AS modulo, id FROM 
db.table) AS TEMP_TABLE"
val partitionColumn = "modulo"
val lowerBound = 0.toLong
val upperBound = (numPartitions-1).toLong
val df = 
sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties)
df.write.parquet("/output/path/for/df/")

When I look at the Spark UI I see that 5 tasks, but only 1 is actually querying.


> JDBC using multiple partitions creates additional tasks but only executes on 
> one
> 
>
> Key: SPARK-13156
> URL: https://issues.apache.org/jira/browse/SPARK-13156
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.0
> Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client
>Reporter: Charles Drotar
>
> I can successfully kick off a query through JDBC to Teradata, and when it 
> runs it creates a task on each executor for every partition. The problem is 
> that all of the tasks except for one complete within a couple seconds and the 
> final task handles the entire dataset.
> Example Code:
> private val properties = new java.util.Properties()
> properties.setProperty("driver","com.teradata.jdbc.TeraDriver")
> properties.setProperty("username","foo")
> properties.setProperty("password","bar")
> val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
> val numPartitions = 5
> val dbTableTemp = "( SELECT  id MOD $numPartitions%d AS modulo, id FROM 
> db.table) AS TEMP_TABLE"
> val partitionColumn = "modulo"
> val lowerBound = 0.toLong
> val upperBound = (numPartitions-1).toLong
> val df = 
> sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties)
> df.write.parquet("/output/path/for/df/")
> When I look at the Spark UI I see the 5 tasks, but only 1 is actually 
> querying.






[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2016-02-03 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130351#comment-15130351
 ] 

Daniel Darabos commented on SPARK-1239:
---

I've read an interesting article about the "Kylix" butterfly allreduce 
(http://www.cs.berkeley.edu/~jfc/papers/14/Kylix.pdf). I think this is a direct 
solution to this problem, and the authors say integration with Spark should be 
"easy".

Perhaps the same approach could be simulated within the current Spark shuffle 
implementation. I think the idea is to break up the M*R shuffle into an M*K and 
a K*R shuffle, where K is much less than M or R. So those K partitions will be 
large, but that should be fine.
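A rough sketch of that two-stage idea expressed with plain RDD operations, 
assuming an existing SparkContext {{sc}} (the K and R values are made up; this 
only illustrates the data movement, not a proposed implementation):

{code}
import org.apache.spark.HashPartitioner

val K = 16     // small intermediate fan-in, K << M and K << R
val R = 2048   // final reducer-side partitions

val mapped = sc.parallelize(1 to 1000000, 4000).map(x => (x % R, x))

// Stage 1: M*K shuffle into a few large intermediate partitions.
// Stage 2: K*R shuffle from those partitions out to the reducers.
val twoStage = mapped
  .partitionBy(new HashPartitioner(K))
  .partitionBy(new HashPartitioner(R))
{code}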

> Don't fetch all map output statuses at each reducer during shuffles
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>Assignee: Thomas Graves
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 






[jira] [Commented] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one

2016-02-03 Thread Charles Drotar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130411#comment-15130411
 ] 

Charles Drotar commented on SPARK-13156:


Thanks Sean for the quick response! 

That was exactly my initial thought. I created the modulo of the id column as 
the partition column to address any skewness, by looking only at the 
distribution of the final digit across five bins. The total number of distinct 
ids is approximately 48 million, and they are pretty evenly distributed between 
the five bins since they truly represent account id numbers. As a means of 
validation I wrote a simple Python script to plot the modulo column: all five 
bins, 0 through 4, are very close in counts and differ only by a minor amount.
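One arithmetic detail worth checking, sketched below as a simplified model (not 
the actual Spark source) under the assumption that lowerBound/upperBound are used 
only to compute the stride between generated WHERE-clause boundaries and never 
filter rows: with lowerBound=0 and upperBound=numPartitions-1 the integer stride 
rounds down to 0, so the middle ranges are empty and the open-ended last range 
matches every row, which would produce exactly the observed behaviour.

{code}
val lowerBound = 0L
val upperBound = 4L        // numPartitions - 1, as in the example code above
val numPartitions = 5

val stride = (upperBound - lowerBound) / numPartitions   // integer division -> 0

var current = lowerBound
(0 until numPartitions).foreach { i =>
  val lo = if (i == 0) "-inf" else s"modulo >= $current"
  current += stride
  val hi = if (i == numPartitions - 1) "+inf" else s"modulo < $current"
  println(s"partition $i: $lo AND $hi")
}
{code}

If that is the cause, setting upperBound to 5 here (one past the largest modulo 
value) would give a stride of 1 and one modulo value per range.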

> JDBC using multiple partitions creates additional tasks but only executes on 
> one
> 
>
> Key: SPARK-13156
> URL: https://issues.apache.org/jira/browse/SPARK-13156
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.0
> Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client
>Reporter: Charles Drotar
>
> I can successfully kick off a query through JDBC to Teradata, and when it 
> runs it creates a task on each executor for every partition. The problem is 
> that all of the tasks except for one complete within a couple seconds and the 
> final task handles the entire dataset.
> Example Code:
> private val properties = new java.util.Properties()
> properties.setProperty("driver","com.teradata.jdbc.TeraDriver")
> properties.setProperty("username","foo")
> properties.setProperty("password","bar")
> val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
> val numPartitions = 5
> val dbTableTemp = "( SELECT  id MOD $numPartitions%d AS modulo, id FROM 
> db.table) AS TEMP_TABLE"
> val partitionColumn = "modulo"
> val lowerBound = 0.toLong
> val upperBound = (numPartitions-1).toLong
> val df = 
> sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties)
> df.write.parquet("/output/path/for/df/")
> When I look at the Spark UI I see the 5 tasks, but only 1 is actually 
> querying.






[jira] [Commented] (SPARK-13125) makes the ratio of KafkaRDD partition to kafka topic partition configurable.

2016-02-03 Thread zhengcanbin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130088#comment-15130088
 ] 

zhengcanbin commented on SPARK-13125:
-

Sorry, I got it; this is my first time creating a JIRA, and I didn't know how to 
handle my issue properly once it was closed.

Next time I will submit a patch and propose a specific mechanism.



> makes the ratio of KafkaRDD partition to kafka topic partition  configurable.
> -
>
> Key: SPARK-13125
> URL: https://issues.apache.org/jira/browse/SPARK-13125
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: zhengcanbin
>  Labels: features
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Now each given Kafka topic/partition corresponds to an RDD partition; in some 
> cases it's quite necessary to make this configurable, namely a ratio 
> configuration of RDDPartition/kafkaTopicPartition is needed.






[jira] [Commented] (SPARK-13163) Column width on new History Server DataTables not getting set correctly

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131092#comment-15131092
 ] 

Apache Spark commented on SPARK-13163:
--

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/11057

> Column width on new History Server DataTables not getting set correctly
> ---
>
> Key: SPARK-13163
> URL: https://issues.apache.org/jira/browse/SPARK-13163
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Priority: Minor
> Attachments: page_width_fixed.png, width_long_name.png
>
>
> The column width on the DataTable UI for the History Server is being set based 
> on all entries in the table, not just the current page. This means that if 
> there is even one app with a long name in your history, the table will look 
> really odd, as seen below.






[jira] [Assigned] (SPARK-13163) Column width on new History Server DataTables not getting set correctly

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13163:


Assignee: (was: Apache Spark)

> Column width on new History Server DataTables not getting set correctly
> ---
>
> Key: SPARK-13163
> URL: https://issues.apache.org/jira/browse/SPARK-13163
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Priority: Minor
> Attachments: page_width_fixed.png, width_long_name.png
>
>
> The column width on the DataTable UI for the History Server is being set based 
> on all entries in the table, not just the current page. This means that if 
> there is even one app with a long name in your history, the table will look 
> really odd, as seen below.






[jira] [Created] (SPARK-13164) Replace deprecated synchronizedBuffer in core

2016-02-03 Thread holdenk (JIRA)
holdenk created SPARK-13164:
---

 Summary: Replace deprecated synchronizedBuffer in core
 Key: SPARK-13164
 URL: https://issues.apache.org/jira/browse/SPARK-13164
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: holdenk
Priority: Minor


See parent for details
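For context, a sketch of the kind of replacement this sub-task implies (the 
concrete target collection is an assumption; SynchronizedBuffer is deprecated as 
of Scala 2.11):

{code}
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.mutable

// Deprecated pattern:
val oldBuf = new mutable.ArrayBuffer[Int]() with mutable.SynchronizedBuffer[Int]
oldBuf += 1

// One possible replacement with its own thread-safety guarantees:
val newBuf = new ConcurrentLinkedQueue[Int]()
newBuf.add(1)
{code}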






[jira] [Assigned] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13116:


Assignee: Apache Spark

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
>Assignee: Apache Spark
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow while the target is 
> an UnsafeRow, the current code will try to set the fields of the UnsafeRow 
> using UnsafeRow's update method. 
> This method is called via TungstenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I did for our forked branch , on the class 
> InterpretedProjection is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }






[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131102#comment-15131102
 ] 

Apache Spark commented on SPARK-13116:
--

User 'ahshahid' has created a pull request for this issue:
https://github.com/apache/spark/pull/11058

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow while the target is 
> an UnsafeRow, the current code will try to set the fields in the UnsafeRow 
> using the update method of UnsafeRow. 
> This method is called via TungstenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> For an UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I made on our forked branch, in the class 
> InterpretedProjection, is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13116:


Assignee: (was: Apache Spark)

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow while the target is 
> an UnsafeRow, the current code will try to set the fields in the UnsafeRow 
> using the update method of UnsafeRow. 
> This method is called via TungstenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> For an UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I made on our forked branch, in the class 
> InterpretedProjection, is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13046) Partitioning looks broken in 1.6

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131170#comment-15131170
 ] 

Davies Liu commented on SPARK-13046:


I tried Spark 1.6 and master with a directory like this
{code}
test_data
test_data/._common_metadata.crc
test_data/._metadata.crc
test_data/._SUCCESS.crc
test_data/_common_metadata
test_data/_metadata
test_data/_SUCCESS
test_data/id=0
test_data/id=0/id2=0
test_data/id=0/id2=0/.part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet.crc
test_data/id=0/id2=0/part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet
test_data/id=1
test_data/id=1/id2=1
test_data/id=1/id2=1/.part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet.crc
test_data/id=1/id2=1/part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet
test_data/id=10
test_data/id=10/id2=10
test_data/id=10/id2=10/.part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet.crc
test_data/id=10/id2=10/part-r-0-2be05edc-caf1-4d13-8fc9-db6f1b8f2d6d.gz.parquet
test_data/id=11
{code}

It works well; are there any special files in s3://bucket/some_path ?
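For reference, here is a hedged sketch (not from the original report) of how a layout 
like the one above can be generated and read back to check partition discovery; the 
paths, column names and local master are illustrative assumptions.

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object PartitionDiscoveryCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pd-check").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Two nested partition columns, as in the id=.../id2=... listing above.
    val df = sqlContext.range(0, 20)
      .select(($"id" % 4).as("id"), ($"id" % 4).as("id2"), $"id".as("value"))

    df.write.mode("overwrite").partitionBy("id", "id2").parquet("test_data")

    // Partition discovery should recover both partition columns.
    val back = sqlContext.read.parquet("test_data")
    back.printSchema()
    println(back.where($"id" === 1).count())

    sc.stop()
  }
}
{code}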

> Partitioning looks broken in 1.6
> 
>
> Key: SPARK-13046
> URL: https://issues.apache.org/jira/browse/SPARK-13046
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Julien Baley
>
> Hello,
> I have a list of files in s3:
> {code}
> s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> {code}
> Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
> for the three lines) would correctly identify 2 pairs of key/value, one 
> `date_received` and one `fingerprint`.
> From 1.6.0, I get the following exception:
> {code}
> assertion failed: Conflicting directory structures detected. Suspicious paths
> s3://bucket/some_path/date_received=2016-01-13
> s3://bucket/some_path/date_received=2016-01-14
> s3://bucket/some_path/date_received=2016-01-15
> {code}
> That is to say, the partitioning code now fails to identify 
> date_received=2016-01-13 as a key/value pair.
> I can see that there has been some activity on 
> spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
>  recently, so that seems related (especially the commits 
> https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b
>   and 
> https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14
>  ).
> If I read the tests added in those commits correctly:
> - they don't seem to actually test the return value, only that it doesn't crash
> - they only test cases where the s3 path contains 1 key/value pair (which 
> otherwise would catch the bug)
> This is problematic for us as we're trying to migrate all of our spark 
> services to 1.6.0 and this bug is a real blocker. I know it's possible to 
> force a 'union', but I'd rather not do that if the bug can be fixed.
> Any question, please shoot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13131) Use best time and average time in micro benchmark

2016-02-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13131:
---
Summary: Use best  time and average time in micro benchmark  (was: Use 
median time in benchmark)

> Use best  time and average time in micro benchmark
> --
>
> Key: SPARK-13131
> URL: https://issues.apache.org/jira/browse/SPARK-13131
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Median time should be more stable than average time in benchmark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13166) Remove DataStreamReader/Writer

2016-02-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-13166:
---

 Summary: Remove DataStreamReader/Writer
 Key: SPARK-13166
 URL: https://issues.apache.org/jira/browse/SPARK-13166
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


They seem redundant and we can simply use DataFrameReader/Writer. 

The usage looks like:
{code}
val df = sqlContext.read.stream("...")
val handle = df.write.stream("...")
handle.stop()
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13163) Column width on new History Server DataTables not getting set correctly

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13163:


Assignee: Apache Spark

> Column width on new History Server DataTables not getting set correctly
> ---
>
> Key: SPARK-13163
> URL: https://issues.apache.org/jira/browse/SPARK-13163
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Assignee: Apache Spark
>Priority: Minor
> Attachments: page_width_fixed.png, width_long_name.png
>
>
> The column width on the DataTable UI for the History Server is being set for 
> all entries in the table not just the current page. This means if there is 
> even one App with a long name in your history the table will look really odd 
> as seen below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13164) Replace deprecated synchronizedBuffer in core

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13164:


Assignee: (was: Apache Spark)

> Replace deprecated synchronizedBuffer in core
> -
>
> Key: SPARK-13164
> URL: https://issues.apache.org/jira/browse/SPARK-13164
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
>
> See parent for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13164) Replace deprecated synchronizedBuffer in core

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131126#comment-15131126
 ] 

Apache Spark commented on SPARK-13164:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/11059

> Replace deprecated synchronizedBuffer in core
> -
>
> Key: SPARK-13164
> URL: https://issues.apache.org/jira/browse/SPARK-13164
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
>
> See parent for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11316) isEmpty before coalesce seems to cause huge performance issue in setupGroups

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131196#comment-15131196
 ] 

Apache Spark commented on SPARK-11316:
--

User 'zhuoliu' has created a pull request for this issue:
https://github.com/apache/spark/pull/11060

> isEmpty before coalesce seems to cause huge performance issue in setupGroups
> 
>
> Key: SPARK-11316
> URL: https://issues.apache.org/jira/browse/SPARK-11316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Critical
>
> So I haven't fully debugged this yet, but I'm reporting what I'm seeing and what I 
> think might be going on.
> I have a graph processing job that is seeing a huge slowdown in setupGroups in 
> the location iterator where it's getting the preferred locations for the 
> coalesce.  They are coalescing from 2400 down to 1200 and it's taking 17+ 
> hours to do the calculation.  I killed it at this point, so I don't know the total 
> time.
> It appears that the job is doing an isEmpty call, a bunch of other 
> transformations, then a coalesce (where it takes so long), other 
> transformations, then finally a count to trigger it.
> It appears that there is only one node that it's finding in the setupGroups 
> call, and to get to that node it first has to go through the while loop:
> while (numCreated < targetLen && tries < expectedCoupons2) {
> where expectedCoupons2 is around 19000.  It finds very few or none in this 
> loop.  
> Then it does the second loop:
> while (numCreated < targetLen) {  // if we don't have enough partition 
> groups, create duplicates
>   var (nxt_replica, nxt_part) = rotIt.next()
>   val pgroup = PartitionGroup(nxt_replica)
>   groupArr += pgroup
>   groupHash.getOrElseUpdate(nxt_replica, ArrayBuffer()) += pgroup
>   var tries = 0
>   while (!addPartToPGroup(nxt_part, pgroup) && tries < targetLen) { // 
> ensure at least one part
> nxt_part = rotIt.next()._2
> tries += 1
>   }
>   numCreated += 1
> }
> Where it has an inner while loop and both of those are going 1200 times.  
> 1200*1200 loops.  This is taking a very long time.
> The user can work around the issue by adding a count() call right after 
> the isEmpty call, before the coalesce is called.  I also tried putting 
> a take(1) right before the isEmpty call and it seems to work around 
> the issue; it took 1 hour with the take vs. a few minutes with the count().
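A hedged sketch of the workaround described above; the RDD contents, partition counts 
and local master are made up for illustration and are not taken from the reporter's job.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object IsEmptyCoalesceWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("workaround").setMaster("local[4]"))

    // Stand-in for the real pipeline: many input partitions, some transformations.
    val data = sc.parallelize(1 to 100000, numSlices = 2400).map(x => (x % 100, x))

    if (!data.isEmpty()) {
      // Workaround from the description: force an action (a count here, or a
      // take(1) before the isEmpty) so the coalesce below no longer spends
      // hours in setupGroups computing preferred locations.
      data.count()

      val merged = data.coalesce(1200)
      println(merged.count())
    }

    sc.stop()
  }
}
{code}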



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11316) isEmpty before coalesce seems to cause huge performance issue in setupGroups

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11316:


Assignee: Apache Spark

> isEmpty before coalesce seems to cause huge performance issue in setupGroups
> 
>
> Key: SPARK-11316
> URL: https://issues.apache.org/jira/browse/SPARK-11316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Critical
>
> So I haven't fully debugged this yet, but I'm reporting what I'm seeing and what I 
> think might be going on.
> I have a graph processing job that is seeing a huge slowdown in setupGroups in 
> the location iterator where it's getting the preferred locations for the 
> coalesce.  They are coalescing from 2400 down to 1200 and it's taking 17+ 
> hours to do the calculation.  I killed it at this point, so I don't know the total 
> time.
> It appears that the job is doing an isEmpty call, a bunch of other 
> transformations, then a coalesce (where it takes so long), other 
> transformations, then finally a count to trigger it.
> It appears that there is only one node that it's finding in the setupGroups 
> call, and to get to that node it first has to go through the while loop:
> while (numCreated < targetLen && tries < expectedCoupons2) {
> where expectedCoupons2 is around 19000.  It finds very few or none in this 
> loop.  
> Then it does the second loop:
> while (numCreated < targetLen) {  // if we don't have enough partition 
> groups, create duplicates
>   var (nxt_replica, nxt_part) = rotIt.next()
>   val pgroup = PartitionGroup(nxt_replica)
>   groupArr += pgroup
>   groupHash.getOrElseUpdate(nxt_replica, ArrayBuffer()) += pgroup
>   var tries = 0
>   while (!addPartToPGroup(nxt_part, pgroup) && tries < targetLen) { // 
> ensure at least one part
> nxt_part = rotIt.next()._2
> tries += 1
>   }
>   numCreated += 1
> }
> Where it has an inner while loop and both of those are going 1200 times.  
> 1200*1200 loops.  This is taking a very long time.
> The user can work around the issue by adding a count() call right after 
> the isEmpty call, before the coalesce is called.  I also tried putting 
> a take(1) right before the isEmpty call and it seems to work around 
> the issue; it took 1 hour with the take vs. a few minutes with the count().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13131) Use best time and average time in micro benchmark

2016-02-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13131:
---
Description: Best time should be more stable than average time in 
benchmark; together with the average time, they show more information.  (was: 
Median time should be more stable than average time in benchmark.)

> Use best  time and average time in micro benchmark
> --
>
> Key: SPARK-13131
> URL: https://issues.apache.org/jira/browse/SPARK-13131
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Best time should be more stable than average time in benchmark; together with 
> the average time, they show more information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13166) Remove DataStreamReader/Writer

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131252#comment-15131252
 ] 

Apache Spark commented on SPARK-13166:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11062

> Remove DataStreamReader/Writer
> --
>
> Key: SPARK-13166
> URL: https://issues.apache.org/jira/browse/SPARK-13166
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> They seem redundant and we can simply use DataFrameReader/Writer. 
> The usage looks like:
> {code}
> val df = sqlContext.read.stream("...")
> val handle = df.write.stream("...")
> handle.stop()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13166) Remove DataStreamReader/Writer

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13166:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove DataStreamReader/Writer
> --
>
> Key: SPARK-13166
> URL: https://issues.apache.org/jira/browse/SPARK-13166
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> They seem redundant and we can simply use DataFrameReader/Writer. 
> The usage looks like:
> {code}
> val df = sqlContext.read.stream("...")
> val handle = df.write.stream("...")
> handle.stop()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13166) Remove DataStreamReader/Writer

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13166:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove DataStreamReader/Writer
> --
>
> Key: SPARK-13166
> URL: https://issues.apache.org/jira/browse/SPARK-13166
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> They seem redundant and we can simply use DataFrameReader/Writer. 
> The usage looks like:
> {code}
> val df = sqlContext.read.stream("...")
> val handle = df.write.stream("...")
> handle.stop()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9414) HiveContext:saveAsTable creates wrong partition for existing hive table(append mode)

2016-02-03 Thread Xiu (Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131054#comment-15131054
 ] 

Xiu (Joe) Guo commented on SPARK-9414:
--

With the current master 
[b938301|https://github.com/apache/spark/commit/b93830126cc59a26e2cfb5d7b3c17f9cfbf85988],
 I could not reproduce this issue by doing:

From Hive 1.2.1 CLI:
{code}
create table test4DimBySpark (mydate int, hh int, x int, y int, height float, u 
float, v float, w float, ph float, phb float, p float, pb float, qva float, por 
float, qgraup float, qnice float, qnrain float, tke_pbl float, el_pbl float) 
partitioned by (zone int, z int, year int, month int);
{code}

In Spark-shell, use the first block of scala code from description to insert 
data.

I see correct partition directories in /user/hive/warehouse and Hive can read 
the data back fine.

Can you check with the newer versions of the code? It's probably fixed.

> HiveContext:saveAsTable creates wrong partition for existing hive 
> table(append mode)
> 
>
> Key: SPARK-9414
> URL: https://issues.apache.org/jira/browse/SPARK-9414
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Hadoop 2.6, Spark 1.4.0, Hive 0.14.0.
>Reporter: Chetan Dalal
>Priority: Critical
>
> Raising this bug because I found this issue was already reported on the Apache mail 
> archive and I am facing a similar issue.
> ---original--
> I am using spark 1.4 and HiveContext to append data into a partitioned
> hive table. I found that the data inserted into the table is correct, but the
> partition (folder) created is totally wrong.
> {code}
>  val schemaString = "zone z year month date hh x y height u v w ph phb 
> p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"
> val schema =
>   StructType(
> schemaString.split(" ").map(fieldName =>
>   if (fieldName.equals("zone") || fieldName.equals("z") ||
> fieldName.equals("year") || fieldName.equals("month") ||
>   fieldName.equals("date") || fieldName.equals("hh") ||
> fieldName.equals("x") || fieldName.equals("y"))
> StructField(fieldName, IntegerType, true)
>   else
> StructField(fieldName, FloatType, true)
> ))
> val pairVarRDD =
> sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(),
> 97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),0.0.floatValue(),13195.351.floatValue(),
> 0.0.floatValue(),0.1.floatValue(),0.0.floatValue()))
> ))
> val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)
> partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
> .mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("test4DimBySpark")
> {code}
> -
> The table contains 23 columns (longer than Tuple maximum length), so I
> use Row Object to store raw data, not Tuple.
> Here is some message from spark when it saved data>>
> {code}
> 
> 15/06/16 10:39:22 INFO metadata.Hive: Renaming
> src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest:
> hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true
> 
> 15/06/16 10:39:22 INFO metadata.Hive: New loading path =
> hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0
> with partSpec {zone=13195, z=0, year=0, month=0}
> 
> From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month =
> 3. But spark created a partition {zone=13195, z=0, year=0, month=0}. (x)
> 
> When I queried from hive>>
> 
> hive> select * from test4dimBySpark;
> OK
> 242200931.00.0218.0365.09989.497
> 29.62711319.0717930.11982734-3174.681297735.2 16.389032
> -96.6289125135.3652.6476808E-50.0 13195000
> hive> select zone, z, year, month from test4dimBySpark;
> OK
> 13195000
> hive> dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
> Found 2 items
> -rw-r--r--   3 patcharee hdfs   1411 2015-06-16 10:39
> /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1
> 
> The data stored in the table is correct zone = 2, z = 42, year = 2009,
> month = 3, but the partition created was wrong
> "zone=13195/z=0/year=0/month=0" 

[jira] [Updated] (SPARK-13163) Column width on new History Server DataTables not getting set correctly

2016-02-03 Thread Alex Bozarth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Bozarth updated SPARK-13163:
-
Attachment: width_long_name.png
page_width_fixed.png

I have a fix and will open a PR

> Column width on new History Server DataTables not getting set correctly
> ---
>
> Key: SPARK-13163
> URL: https://issues.apache.org/jira/browse/SPARK-13163
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Priority: Minor
> Attachments: page_width_fixed.png, width_long_name.png
>
>
> The column width on the DataTable UI for the History Server is being set for 
> all entries in the table not just the current page. This means if there is 
> even one App with a long name in your history the table will look really odd 
> as seen below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13164) Replace deprecated synchronizedBuffer in core

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13164:


Assignee: Apache Spark

> Replace deprecated synchronizedBuffer in core
> -
>
> Key: SPARK-13164
> URL: https://issues.apache.org/jira/browse/SPARK-13164
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> See parent for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11316) isEmpty before coalesce seems to cause huge performance issue in setupGroups

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11316:


Assignee: (was: Apache Spark)

> isEmpty before coalesce seems to cause huge performance issue in setupGroups
> 
>
> Key: SPARK-11316
> URL: https://issues.apache.org/jira/browse/SPARK-11316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Critical
>
> So I haven't fully debugged this yet, but I'm reporting what I'm seeing and what I 
> think might be going on.
> I have a graph processing job that is seeing a huge slowdown in setupGroups in 
> the location iterator where it's getting the preferred locations for the 
> coalesce.  They are coalescing from 2400 down to 1200 and it's taking 17+ 
> hours to do the calculation.  I killed it at this point, so I don't know the total 
> time.
> It appears that the job is doing an isEmpty call, a bunch of other 
> transformations, then a coalesce (where it takes so long), other 
> transformations, then finally a count to trigger it.
> It appears that there is only one node that it's finding in the setupGroups 
> call, and to get to that node it first has to go through the while loop:
> while (numCreated < targetLen && tries < expectedCoupons2) {
> where expectedCoupons2 is around 19000.  It finds very few or none in this 
> loop.  
> Then it does the second loop:
> while (numCreated < targetLen) {  // if we don't have enough partition 
> groups, create duplicates
>   var (nxt_replica, nxt_part) = rotIt.next()
>   val pgroup = PartitionGroup(nxt_replica)
>   groupArr += pgroup
>   groupHash.getOrElseUpdate(nxt_replica, ArrayBuffer()) += pgroup
>   var tries = 0
>   while (!addPartToPGroup(nxt_part, pgroup) && tries < targetLen) { // 
> ensure at least one part
> nxt_part = rotIt.next()._2
> tries += 1
>   }
>   numCreated += 1
> }
> Where it has an inner while loop and both of those are going 1200 times.  
> 1200*1200 loops.  This is taking a very long time.
> The user can work around the issue by adding a count() call right after 
> the isEmpty call, before the coalesce is called.  I also tried putting 
> a take(1) right before the isEmpty call and it seems to work around 
> the issue; it took 1 hour with the take vs. a few minutes with the count().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13163) Column width on new History Server DataTables not getting set correctly

2016-02-03 Thread Alex Bozarth (JIRA)
Alex Bozarth created SPARK-13163:


 Summary: Column width on new History Server DataTables not getting 
set correctly
 Key: SPARK-13163
 URL: https://issues.apache.org/jira/browse/SPARK-13163
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.0
Reporter: Alex Bozarth
Priority: Minor


The column width on the DataTable UI for the History Server is being set for 
all entries in the table not just the current page. This means if there is even 
one App with a long name in your history the table will look really odd as seen 
below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13151) Investigate replacing SynchronizedBuffer as it is deprecated/unreliable

2016-02-03 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131100#comment-15131100
 ] 

holdenk commented on SPARK-13151:
-

This seems pretty reasonable; we already use ConcurrentLinkedQueue elsewhere in 
the code. I'm going to make sub-tasks for each component so they don't get hung 
up on reviewers.
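A hedged before/after sketch of that kind of replacement, assuming the buffer is only 
used to collect elements concurrently and then inspect them (as in many of the 
affected tests); it is not taken from any specific Spark suite.

{code}
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._
import scala.collection.mutable.{ArrayBuffer, SynchronizedBuffer}

object BufferReplacementSketch {
  def main(args: Array[String]): Unit = {
    // Deprecated pattern (warns under Scala 2.11): synchronization via a trait.
    val oldStyle = new ArrayBuffer[Int]() with SynchronizedBuffer[Int]
    oldStyle += 1

    // Suggested replacement: a lock-free concurrent queue.
    val queue = new ConcurrentLinkedQueue[Int]()
    queue.add(1)
    queue.add(2)

    // Take a Scala snapshot when a test needs to assert on the contents.
    val snapshot: Seq[Int] = queue.asScala.toSeq
    println(snapshot)
  }
}
{code}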

> Investigate replacing SynchronizedBuffer as it is deprecated/unreliable
> ---
>
> Key: SPARK-13151
> URL: https://issues.apache.org/jira/browse/SPARK-13151
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: holdenk
>Priority: Trivial
>
> Building with scala 2.11 results in the warning trait SynchronizedBuffer in 
> package mutable is deprecated: Synchronization via traits is deprecated as it 
> is inherently unreliable.  Consider 
> java.util.concurrent.ConcurrentLinkedQueue as an alternative - we should 
> investigate if this is a reasonable suggestion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13150.
-
Resolution: Fixed
  Assignee: Herman van Hovell  (was: Cheng Lian)

> Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test 
> single session
> -
>
> Key: SPARK-13150
> URL: https://issues.apache.org/jira/browse/SPARK-13150
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131138#comment-15131138
 ] 

Davies Liu commented on SPARK-13116:


Could you provide a test to reproduce this issue?

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow while the target is 
> an UnsafeRow, the current code will try to set the fields in the UnsafeRow 
> using the update method of UnsafeRow. 
> This method is called via TungstenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> For an UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I made on our forked branch, in the class 
> InterpretedProjection, is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13131) Use median time in benchmark

2016-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131205#comment-15131205
 ] 

Davies Liu commented on SPARK-13131:


[~piccolbo] Thanks for your comments; we also had a lot of offline discussion 
and debate here. I'd like to provide more information.

This is a micro benchmark we use internally to measure performance 
improvements on CPU-bound workloads. Each benchmark has constant input (not 
random) and runs after the JVM is warmed up (the first run is dropped).

Because this micro benchmark is used as a tool in daily development, it should 
not take too long to get a rough number, and it should not have too much noise 
without many retries (otherwise daily development would be slowed down a lot). 

After some testing, we saw that the best time is much more stable than the 
average time, and also reasonable. So we (Reynold, Nong and I) decided to show 
both the best time and the average time as the result of the benchmark.
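For illustration, a minimal sketch of reporting both numbers over several timed runs 
after a warm-up; this is a hypothetical standalone helper, not Spark's actual 
benchmark utility.

{code}
object BestAndAverageTime {
  // Runs the body `iters` times after one warm-up run and returns
  // (best time, average time) in milliseconds.
  def measure(iters: Int)(body: => Unit): (Double, Double) = {
    body  // warm-up run, dropped from the statistics
    val timesMs = (1 to iters).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e6
    }
    (timesMs.min, timesMs.sum / timesMs.size)
  }

  def main(args: Array[String]): Unit = {
    val (best, avg) = measure(5) {
      var i = 0L
      var s = 0L
      while (i < 10000000L) { s += i; i += 1 }
      if (s == -1L) println(s)  // keep the loop from being optimized away
    }
    println(f"best: $best%.1f ms, average: $avg%.1f ms")
  }
}
{code}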

> Use median time in benchmark
> 
>
> Key: SPARK-13131
> URL: https://issues.apache.org/jira/browse/SPARK-13131
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Median time should be more stable than average time in benchmark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13167) JDBC data source does not include null value partition columns rows in the result.

2016-02-03 Thread Suresh Thalamati (JIRA)
Suresh Thalamati created SPARK-13167:


 Summary: JDBC data source does not include null value partition 
columns rows in the result.
 Key: SPARK-13167
 URL: https://issues.apache.org/jira/browse/SPARK-13167
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0, 2.0.0
Reporter: Suresh Thalamati


Reading from a JDBC data source using a partition column that is nullable can 
return an incorrect number of rows if there are rows with a null value in the 
partition column.

{code}
val emp = 
sqlContext.read.jdbc("jdbc:h2:mem:testdb0;user=testUser;password=testPass", 
"TEST.EMP", "theid", 0, 4, 3, new Properties)
emp.count()
{code}

The above jdbc read call sets up partitions of the following form. It does not 
include a null predicate.

{code}
JDBCPartition(THEID < 1,0),JDBCPartition(THEID >= 1 AND THEID < 
2,1),JDBCPartition(THEID >= 2,2)
{code}

Rows with null values in the partition column are not included in the results 
because none of the partition predicates includes an IS NULL check.
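For illustration, a hedged sketch of what the generated partition predicates could 
look like once the null case is covered, e.g. by folding an IS NULL clause into the 
first partition; this is not necessarily the actual fix:

{code}
JDBCPartition(THEID < 1 OR THEID IS NULL,0)
JDBCPartition(THEID >= 1 AND THEID < 2,1)
JDBCPartition(THEID >= 2,2)
{code}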





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13157) ADD JAR command cannot handle path with @ character

2016-02-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13157.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11052
[https://github.com/apache/spark/pull/11052]

> ADD JAR command cannot handle path with @ character
> ---
>
> Key: SPARK-13157
> URL: https://issues.apache.org/jira/browse/SPARK-13157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Blocker
> Fix For: 2.0.0
>
>
> To reproduce this issue locally, copy {{TestUDTF.jar}} under 
> {{$SPARK_HOME/sql/hive/src/test/resources/TestUDTF.jar}} to 
> {{/tmp/a@b/TestUDTF.jar}}. Then start the Thrift server and run the following 
> commands using Beeline:
> {noformat}
> > add jar file:///tmp/a@b/TestUDTF.jar;
> ...
> > CREATE TEMPORARY FUNCTION udtf_count2 AS 
> > 'org.apache.spark.sql.hive.execution.GenericUDTFCount2';
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.FunctionTask (state=,code=0)
> {noformat}
> Please refer to [this PR comment 
> thread|https://github.com/apache/spark/pull/11040] for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13165) Replace deprecated synchronizedBuffer in streaming

2016-02-03 Thread holdenk (JIRA)
holdenk created SPARK-13165:
---

 Summary: Replace deprecated synchronizedBuffer in streaming
 Key: SPARK-13165
 URL: https://issues.apache.org/jira/browse/SPARK-13165
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: holdenk
Priority: Trivial


See parent for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12739) Details of batch in Streaming tab uses two Duration columns

2016-02-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-12739.
--
Resolution: Fixed

> Details of batch in Streaming tab uses two Duration columns
> ---
>
> Key: SPARK-12739
> URL: https://issues.apache.org/jira/browse/SPARK-12739
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
> Attachments: SPARK-12739.png
>
>
> "Details of batch" screen in Streaming tab in web UI uses two Duration 
> columns. I think one should be "Processing Time" while the other "Job 
> Duration".
> See the attachment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-02-03 Thread Asif Hussain Shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131175#comment-15131175
 ] 

Asif Hussain Shahid commented on SPARK-13116:
-

I will check whether my tests still hit the issue with the latest code base.

> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow while the target is 
> an UnsafeRow, the current code will try to set the fields in the UnsafeRow 
> using the update method of UnsafeRow. 
> This method is called via TungstenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> For an UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I made on our forked branch, in the class 
> InterpretedProjection, is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -  mutableRow(i) = exprArray(i).eval(input)
>  +  if(targetUnsafe) {
>  +setters(i)(mutableRow.asInstanceOf[UnsafeRow], 
> exprArray(i).eval(input))
>  +  }else {
>  +mutableRow(i) = exprArray(i).eval(input)
>  +  }
> i += 1
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13167) JDBC data source does not include null value partition columns rows in the result.

2016-02-03 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131258#comment-15131258
 ] 

Suresh Thalamati commented on SPARK-13167:
--

I am working on a fix for this issue. 

> JDBC data source does not include null value partition columns rows in the 
> result.
> --
>
> Key: SPARK-13167
> URL: https://issues.apache.org/jira/browse/SPARK-13167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Suresh Thalamati
>
> Reading from a JDBC data source using a partition column that is nullable 
> can return an incorrect number of rows if there are rows with a null value in the 
> partition column.
> {code}
> val emp = 
> sqlContext.read.jdbc("jdbc:h2:mem:testdb0;user=testUser;password=testPass", 
> "TEST.EMP", "theid", 0, 4, 3, new Properties)
> emp.count()
> {code}
> The above jdbc read call sets up partitions of the following form. It does 
> not include a null predicate.
> {code}
> JDBCPartition(THEID < 1,0),JDBCPartition(THEID >= 1 AND THEID < 
> 2,1),JDBCPartition(THEID >= 2,2)
> {code}
> Rows with null values in the partition column are not included in the results 
> because none of the partition predicates includes an IS NULL check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12982) SQLContext: temporary table registration does not accept valid identifier

2016-02-03 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130471#comment-15130471
 ] 

Thomas Sebastian commented on SPARK-12982:
--

Making changes in SQLContext.scala; testing is in progress with 
DataFrameSuite.scala, together with [~jayadevan.maym...@ibsplc.com].

> SQLContext: temporary table registration does not accept valid identifier
> -
>
> Key: SPARK-12982
> URL: https://issues.apache.org/jira/browse/SPARK-12982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Minor
>  Labels: sql
>
> We have encountered very strange behavior in SparkSQL temporary table 
> registration.
> Which identifiers should be valid for a temporary table?
> Alphanumeric + '_' with at least one non-digit?
> Valid identifiers:
> df
> 674123a
> 674123_
> a0e97c59_4445_479d_a7ef_d770e3874123
> 1ae97c59_4445_479d_a7ef_d770e3874123
> Invalid identifier:
> 10e97c59_4445_479d_a7ef_d770e3874123
> Stack trace:
> {code:xml}
> java.lang.RuntimeException: [1.1] failure: identifier expected
> 10e97c59_4445_479d_a7ef_d770e3874123
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>   at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
>   at 
> SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:9)
>   at 
> SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:42)
>   at 
> SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at sbt.Run.invokeMain(Run.scala:67)
>   at sbt.Run.run0(Run.scala:61)
>   at sbt.Run.sbt$Run$$execute$1(Run.scala:51)
>   at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55)
>   at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
>   at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
>   at sbt.Logger$$anon$4.apply(Logger.scala:85)
>   at sbt.TrapExit$App.run(TrapExit.scala:248)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Code to reproduce this bug:
> https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier
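For convenience, a minimal inline repro sketch, assuming a Spark 1.6-era sqlContext in 
spark-shell; per the stack trace above, the failure surfaces when the registered name 
is parsed again (e.g. in dropTempTable):

{code}
val df = sqlContext.range(10)

// One of the identifiers listed as valid above: registers and drops fine.
df.registerTempTable("a0e97c59_4445_479d_a7ef_d770e3874123")
sqlContext.dropTempTable("a0e97c59_4445_479d_a7ef_d770e3874123")

// The invalid identifier (its "10e97..." prefix probably lexes like a numeric
// literal): registration appears to succeed, but dropping it fails with
// "[1.1] failure: identifier expected", as in the stack trace above.
df.registerTempTable("10e97c59_4445_479d_a7ef_d770e3874123")
sqlContext.dropTempTable("10e97c59_4445_479d_a7ef_d770e3874123")
{code}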



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13160) PySpark CDH 5

2016-02-03 Thread David Vega (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130572#comment-15130572
 ] 

David Vega commented on SPARK-13160:


I managed to attach the files.


> PySpark CDH 5
> -
>
> Key: SPARK-13160
> URL: https://issues.apache.org/jira/browse/SPARK-13160
> Project: Spark
>  Issue Type: Question
>  Components: Deploy, PySpark
>Affects Versions: 1.3.0
>Reporter: David Vega
> Attachments: job.properties, wordcount.py, workflow.xml
>
>
> Hi,
> I am trying to deploy my simple PySpark job on CDH5 and it is almost impossible.
> I tried a lot of Oozie configurations. It is difficult to find the right 
> documentation.
> I can't attach the configuration, so I write it here:
> * wordcount.py
> {code}
> import sys
> from operator import add
> from pyspark import SparkContext
>
> if __name__ == "__main__":
>     if len(sys.argv) != 2:
>         print >> sys.stderr, "Usage: wordcount <file>"
>         exit(-1)
>     sc = SparkContext(appName="PythonWordCount")
>     lines = sc.textFile(sys.argv[1], 1)
>     counts = lines.flatMap(lambda x: x.split(' ')) \
>                   .map(lambda x: (x, 1)) \
>                   .reduceByKey(add)
>     output = counts.collect()
>     for (word, count) in output:
>         print "%s: %i" % (word, count)
>     sc.stop()
> {code}
> * workflow.xml (Oozie; the XML markup was stripped when it was pasted here — see the 
> attached workflow.xml for the full file). The values that survive, in order, are:
> ${jobTracker}
> ${nameNode}
> startDate = ${firstNotNull(wf:conf("initial-date"),firstNotNull(wf:conf("dateFromFile"),"sysdate"))}
> ${jobTracker}
> ${nameNode}
> yarn
> cluster
> ${spark_job_name}
> ${spark_code_path_jar_or_py}
> --executor-memory 256m --driver-memory 256m --executor-cores 1 --num-executors 1 --conf spark.yarn.queue=default
> ${nameNode}/group/saludar.txt
> Hello World failed, error message[${wf:errorMessage(wf:lastErrorNode())}]
> I can't attach the state of my jobs, so I write it here:
> Summary Metrics
> No tasks have started yet
> Tasks
> No tasks have started yet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8688) Hadoop Configuration has to disable client cache when writing or reading delegation tokens.

2016-02-03 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130421#comment-15130421
 ] 

Steve Loughran commented on SPARK-8688:
---

Has anyone filed a bug against HDFS for this?

> Hadoop Configuration has to disable client cache when writing or reading 
> delegation tokens.
> ---
>
> Key: SPARK-8688
> URL: https://issues.apache.org/jira/browse/SPARK-8688
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0
>Reporter: SaintBacchus
>Assignee: SaintBacchus
> Fix For: 1.5.0
>
>
> In the classes *AMDelegationTokenRenewer* and *ExecutorDelegationTokenUpdater*, 
> Spark writes and reads the credentials.
> But if we don't set *fs.hdfs.impl.disable.cache*, Spark will use a cached 
> FileSystem (which still holds the old token) to upload or download files.
> Then, once the old token has expired, it can no longer authenticate to read 
> from or write to HDFS.
> (I only tested for a very short time with the configuration:
> dfs.namenode.delegation.token.renew-interval=3min
> dfs.namenode.delegation.token.max-lifetime=10min
> I'm not sure whether it matters.)
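A minimal sketch of the idea above, assuming a plain Hadoop {{Configuration}} (illustrative only, not the actual Spark code):

{code}
// Disable the client-side FileSystem cache before obtaining the FileSystem
// used for the token files, so a fresh instance (carrying the current token)
// is created instead of a cached one that still holds the expired token.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val hadoopConf = new Configuration()
hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true)
val fs = FileSystem.get(hadoopConf)  // not served from the shared cache
{code}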



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12725) SQL generation suffers from name conficts introduced by some analysis rules

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12725:


Assignee: Xiao Li  (was: Apache Spark)

> SQL generation suffers from name conficts introduced by some analysis rules
> ---
>
> Key: SPARK-12725
> URL: https://issues.apache.org/jira/browse/SPARK-12725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> Some analysis rules generate auxiliary attribute references with the same 
> name but different expression IDs. For example, {{ResolveAggregateFunctions}} 
> introduces {{havingCondition}} and {{aggOrder}}, and 
> {{DistinctAggregationRewriter}} introduces {{gid}}.
> This is OK for normal query execution since these attribute references get 
> expression IDs. However, it's troublesome when converting resolved query 
> plans back to SQL query strings since expression IDs are erased.
> Here's an example Spark 1.6.0 snippet for illustration:
> {code}
> sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
> sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), 
> COUNT(b)").explain(true)
> {code}
> The above code produces the following resolved plan:
> {noformat}
> == Analyzed Logical Plan ==
> _c0: bigint
> Project [_c0#101L]
> +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
>+- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) 
> AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS 
> aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
>   +- Subquery t
>  +- Project [id#46L AS a#47L,id#46L AS b#48L]
> +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at 
> :26
> {noformat}
> Here we can see that both aggregate expressions in {{ORDER BY}} are extracted 
> into an {{Aggregate}} operator, and both of them are named {{aggOrder}} with 
> different expression IDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12982) SQLContext: temporary table registration does not accept valid identifier

2016-02-03 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130478#comment-15130478
 ] 

Thomas Sebastian commented on SPARK-12982:
--

One difference from the description of this bug is that the latest Spark 2.0 
code throws an AnalysisException with a NoViableAlt... message for the same 
scenario. We hope that is fine, as the fix should hold for both Spark 1.6 and 
Spark 2.0.

> SQLContext: temporary table registration does not accept valid identifier
> -
>
> Key: SPARK-12982
> URL: https://issues.apache.org/jira/browse/SPARK-12982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Minor
>  Labels: sql
>
> We have encountered very strange behavior in Spark SQL temporary table 
> registration.
> Which identifiers should be valid for a temporary table?
> Alphanumeric characters plus '_', with at least one non-digit?
> Valid identifiers:
> df
> 674123a
> 674123_
> a0e97c59_4445_479d_a7ef_d770e3874123
> 1ae97c59_4445_479d_a7ef_d770e3874123
> Invalid identifier:
> 10e97c59_4445_479d_a7ef_d770e3874123
> Stack trace:
> {code:xml}
> java.lang.RuntimeException: [1.1] failure: identifier expected
> 10e97c59_4445_479d_a7ef_d770e3874123
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>   at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
>   at 
> SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:9)
>   at 
> SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:42)
>   at 
> SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at sbt.Run.invokeMain(Run.scala:67)
>   at sbt.Run.run0(Run.scala:61)
>   at sbt.Run.sbt$Run$$execute$1(Run.scala:51)
>   at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55)
>   at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
>   at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
>   at sbt.Logger$$anon$4.apply(Logger.scala:85)
>   at sbt.TrapExit$App.run(TrapExit.scala:248)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Code to reproduce this bug:
> https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier
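A minimal sketch of the failing scenario, assuming a {{SQLContext}} named {{sqlContext}} (based on the stack trace above, not taken from the linked repository):

{code}
// Registering a temp table whose name starts with digits succeeds, but
// dropping it fails because the identifier parser rejects the name.
val df = sqlContext.range(10).toDF("id")
df.registerTempTable("10e97c59_4445_479d_a7ef_d770e3874123")      // ok
sqlContext.dropTempTable("10e97c59_4445_479d_a7ef_d770e3874123")  // RuntimeException: identifier expected
{code}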



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12725) SQL generation suffers from name conficts introduced by some analysis rules

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130481#comment-15130481
 ] 

Apache Spark commented on SPARK-12725:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11050

> SQL generation suffers from name conficts introduced by some analysis rules
> ---
>
> Key: SPARK-12725
> URL: https://issues.apache.org/jira/browse/SPARK-12725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> Some analysis rules generate auxiliary attribute references with the same 
> name but different expression IDs. For example, {{ResolveAggregateFunctions}} 
> introduces {{havingCondition}} and {{aggOrder}}, and 
> {{DistinctAggregationRewriter}} introduces {{gid}}.
> This is OK for normal query execution since these attribute references get 
> expression IDs. However, it's troublesome when converting resolved query 
> plans back to SQL query strings since expression IDs are erased.
> Here's an example Spark 1.6.0 snippet for illustration:
> {code}
> sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
> sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), 
> COUNT(b)").explain(true)
> {code}
> The above code produces the following resolved plan:
> {noformat}
> == Analyzed Logical Plan ==
> _c0: bigint
> Project [_c0#101L]
> +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
>+- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) 
> AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS 
> aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
>   +- Subquery t
>  +- Project [id#46L AS a#47L,id#46L AS b#48L]
> +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at 
> :26
> {noformat}
> Here we can see that both aggregate expressions in {{ORDER BY}} are extracted 
> into an {{Aggregate}} operator, and both of them are named {{aggOrder}} with 
> different expression IDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12725) SQL generation suffers from name conficts introduced by some analysis rules

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12725:


Assignee: Apache Spark  (was: Xiao Li)

> SQL generation suffers from name conficts introduced by some analysis rules
> ---
>
> Key: SPARK-12725
> URL: https://issues.apache.org/jira/browse/SPARK-12725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> Some analysis rules generate auxiliary attribute references with the same 
> name but different expression IDs. For example, {{ResolveAggregateFunctions}} 
> introduces {{havingCondition}} and {{aggOrder}}, and 
> {{DistinctAggregationRewriter}} introduces {{gid}}.
> This is OK for normal query execution since these attribute references get 
> expression IDs. However, it's troublesome when converting resolved query 
> plans back to SQL query strings since expression IDs are erased.
> Here's an example Spark 1.6.0 snippet for illustration:
> {code}
> sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
> sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), 
> COUNT(b)").explain(true)
> {code}
> The above code produces the following resolved plan:
> {noformat}
> == Analyzed Logical Plan ==
> _c0: bigint
> Project [_c0#101L]
> +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
>+- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) 
> AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS 
> aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
>   +- Subquery t
>  +- Project [id#46L AS a#47L,id#46L AS b#48L]
> +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at 
> :26
> {noformat}
> Here we can see that both aggregate expressions in {{ORDER BY}} are extracted 
> into an {{Aggregate}} operator, and both of them are named {{aggOrder}} with 
> different expression IDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12982) SQLContext: temporary table registration does not accept valid identifier

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130489#comment-15130489
 ] 

Apache Spark commented on SPARK-12982:
--

User 'jayadevanmurali' has created a pull request for this issue:
https://github.com/apache/spark/pull/11051

> SQLContext: temporary table registration does not accept valid identifier
> -
>
> Key: SPARK-12982
> URL: https://issues.apache.org/jira/browse/SPARK-12982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Minor
>  Labels: sql
>
> We have encountered very strange behavior in Spark SQL temporary table 
> registration.
> Which identifiers should be valid for a temporary table?
> Alphanumeric characters plus '_', with at least one non-digit?
> Valid identifiers:
> df
> 674123a
> 674123_
> a0e97c59_4445_479d_a7ef_d770e3874123
> 1ae97c59_4445_479d_a7ef_d770e3874123
> Invalid identifier:
> 10e97c59_4445_479d_a7ef_d770e3874123
> Stack trace:
> {code:xml}
> java.lang.RuntimeException: [1.1] failure: identifier expected
> 10e97c59_4445_479d_a7ef_d770e3874123
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>   at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
>   at 
> SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:9)
>   at 
> SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:42)
>   at 
> SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at sbt.Run.invokeMain(Run.scala:67)
>   at sbt.Run.run0(Run.scala:61)
>   at sbt.Run.sbt$Run$$execute$1(Run.scala:51)
>   at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55)
>   at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
>   at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
>   at sbt.Logger$$anon$4.apply(Logger.scala:85)
>   at sbt.TrapExit$App.run(TrapExit.scala:248)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Code to reproduce this bug:
> https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13159) External shuffle service broken w/ Mesos

2016-02-03 Thread Iulian Dragos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iulian Dragos closed SPARK-13159.
-
Resolution: Duplicate

> External shuffle service broken w/ Mesos
> 
>
> Key: SPARK-13159
> URL: https://issues.apache.org/jira/browse/SPARK-13159
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Iulian Dragos
>
> Dynamic allocation and the external shuffle service won't work together on 
> Mesos for applications that run longer than {{spark.network.timeout}}.
> After two minutes (the default value of {{spark.network.timeout}}), I see a 
> lot of FileNotFoundExceptions and Spark jobs just fail.
> {code}
> 16/02/03 15:26:51 WARN TaskSetManager: Lost task 728.0 in stage 3.0 (TID 
> 2755, 10.0.1.208): java.io.FileNotFoundException: 
> /tmp/blockmgr-ea5b2392-626a-4278-8ae3-fb2c4262d758/02/shuffle_1_728_0.data.57efd66e-7662-4810-a5b1-56d7e2d7a9f0
>  (No such file or directory)
>   at java.io.FileOutputStream.open(Native Method)
>   at java.io.FileOutputStream.(FileOutputStream.java:221)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:181)
>   at 
> org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:56)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:661)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
> ...
> {code}
> h3. Analysis
> The Mesos external shuffle service needs a way to know when it's safe to 
> delete shuffle files for a given application. The current solution (which 
> seemed to work fine while the RPC transport was based on Akka) was to open a 
> TCP connection between the driver and each external shuffle service. Once the 
> driver went down (gracefully or by crashing), the shuffle service would 
> eventually get a notification from the network layer and delete the 
> corresponding files.
> This solution stopped working because it relies on an idle connection, and 
> the new Netty-based RPC layer closes idle connections after 
> {{spark.network.timeout}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13160) PySpark CDH 5

2016-02-03 Thread David Vega (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Vega updated SPARK-13160:
---
Attachment: workflow.xml
wordcount.py
job.properties

> PySpark CDH 5
> -
>
> Key: SPARK-13160
> URL: https://issues.apache.org/jira/browse/SPARK-13160
> Project: Spark
>  Issue Type: Question
>  Components: Deploy, PySpark
>Affects Versions: 1.3.0
>Reporter: David Vega
> Attachments: job.properties, wordcount.py, workflow.xml
>
>
> Hi,
> I am trying to deploy my simple PySpark job on CDH5 and it is almost impossible.
> I have tried a lot of Oozie configurations, and it is difficult to find any 
> correct documentation.
> I can't attach the configuration, so I paste it here:
> * wordcount.py
> import sys
> from operator import add
> from pyspark import SparkContext
> if __name__ == "__main__":
> if len(sys.argv) != 2:
> print >> sys.stderr, "Usage: wordcount "
> exit(-1)
> sc = SparkContext(appName="PythonWordCount")
> lines = sc.textFile(sys.argv[1], 1)
> counts = lines.flatMap(lambda x: x.split(' ')) \
>  .map(lambda x: (x, 1)) \
>  .reduceByKey(add)
> output = counts.collect()
> for (word, count) in output:
> print "%s: %i" % (word, count)
> sc.stop()
> * workflow oozie
> 
> 
> ${jobTracker}
> ${nameNode}
> 
> 
> startDate
> 
> ${firstNotNull(wf:conf("initial-date"),firstNotNull(wf:conf("dateFromFile"),"sysdate"))}
> 
> 
> 
>  
>   
>   
> ${jobTracker}
> ${nameNode}
> yarn
> cluster
> ${spark_job_name}
> ${spark_code_path_jar_or_py}
> --executor-memory 256m --driver-memory 256m 
> --executor-cores 1 --num-executors 1 --conf 
> spark.yarn.queue=default
> ${nameNode}/group/saludar.txt
> 
> 
> 
> 
> 
> Hello World failed, error 
> message[${wf:errorMessage(wf:lastErrorNode())}]
> 
> 
> 
> I can't attach the state of my jobs, so I paste it here:
> Summary Metrics
> No tasks have started yet
> Tasks
> No tasks have started yet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13159) External shuffle service broken w/ Mesos

2016-02-03 Thread Iulian Dragos (JIRA)
Iulian Dragos created SPARK-13159:
-

 Summary: External shuffle service broken w/ Mesos
 Key: SPARK-13159
 URL: https://issues.apache.org/jira/browse/SPARK-13159
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.0.0
Reporter: Iulian Dragos


Dynamic allocation and the external shuffle service won't work together on Mesos 
for applications that run longer than {{spark.network.timeout}}.

After two minutes (the default value of {{spark.network.timeout}}), I see a lot of 
FileNotFoundExceptions and Spark jobs just fail.

{code}
16/02/03 15:26:51 WARN TaskSetManager: Lost task 728.0 in stage 3.0 (TID 2755, 
10.0.1.208): java.io.FileNotFoundException: 
/tmp/blockmgr-ea5b2392-626a-4278-8ae3-fb2c4262d758/02/shuffle_1_728_0.data.57efd66e-7662-4810-a5b1-56d7e2d7a9f0
 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:221)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:181)
at 
org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:56)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:661)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
...
{code}

h3. Analysis

The Mesos external shuffle service needs a way to know when it's safe to delete 
shuffle files for a given application. The current solution (which seemed to 
work fine while the RPC transport was based on Akka) was to open a TCP 
connection between the driver and each external shuffle service. Once the 
driver went down (gracefully or by crashing), the shuffle service would eventually 
get a notification from the network layer and delete the corresponding files.

This solution stopped working because it relies on an idle connection, and the 
new Netty-based RPC layer closes idle connections after 
{{spark.network.timeout}}.
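For reference, a sketch of the configuration combination involved (an illustration, not a fix; raising the timeout only postpones the premature cleanup):

{code}
import org.apache.spark.SparkConf

// Dynamic allocation plus the external shuffle service, with the RPC idle
// timeout raised from its 120s default; the cleanup heuristic still breaks,
// just later.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.network.timeout", "600s")
{code}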



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.

2016-02-03 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130554#comment-15130554
 ] 

Iulian Dragos commented on SPARK-12430:
---

I guess the thinking was that Mesos would clean up those sandboxes. If you're 
experiencing disk space issues, I think it would make sense to tweak Mesos:

{quote}
Garbage collection is scheduled based on the --gc_delay agent flag. By default, 
this is one week since the sandbox was last modified. After the delay, the 
files are deleted.
{quote}

https://mesos.apache.org/documentation/latest/sandbox/

> Temporary folders do not get deleted after Task completes causing problems 
> with disk space.
> ---
>
> Key: SPARK-12430
> URL: https://issues.apache.org/jira/browse/SPARK-12430
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.5.2, 1.6.0
> Environment: Ubuntu server
>Reporter: Fede Bar
>
> We are experiencing an issue with automatic /tmp folder deletion after the 
> framework completes. Completing an M/R job using Spark 1.5.2 (same behavior as 
> Spark 1.5.1) over Mesos will not delete some temporary folders, causing free 
> disk space on the server to be exhausted.
> Behavior of M/R job using Spark 1.4.1 over Mesos cluster:
> - Launched using spark-submit on one cluster node.
> - Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/*  , 
>  */tmp/spark-#/blockmgr-#*
> - When task is completed */tmp/spark-#/* gets deleted along with 
> */tmp/spark-#/blockmgr-#* sub-folder.
> Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job):
> - Launched using spark-submit on one cluster node.
> - Following folders are created: */tmp/mesos/mesos/slaves/id** * , 
> */tmp/spark-***/ *  ,{color:red} /tmp/blockmgr-***{color}
> - When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle 
> container folder {color:red} /tmp/blockmgr-***{color}
> Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several 
> GB depending on the job that ran. Over time this causes disk space to become 
> full with consequences that we all know. 
> Running a shell script would probably work but it is difficult to identify 
> folders in use by a running M/R or stale folders. I did notice similar issues 
> opened by other users marked as "resolved", but none seems to exactly match 
> the above behavior. 
> I really hope someone has insights on how to fix it.
> Thank you very much!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13160) PySpark CDH 5

2016-02-03 Thread David Vega (JIRA)
David Vega created SPARK-13160:
--

 Summary: PySpark CDH 5
 Key: SPARK-13160
 URL: https://issues.apache.org/jira/browse/SPARK-13160
 Project: Spark
  Issue Type: Question
  Components: Deploy, PySpark
Affects Versions: 1.3.0
Reporter: David Vega


Hi,
I am trying to deploy my simple PySpark job on CDH5 and it is almost impossible.
I have tried a lot of Oozie configurations, and it is difficult to find any 
correct documentation.
I can't attach the configuration, so I paste it here:
* wordcount.py
import sys
from operator import add

from pyspark import SparkContext


if __name__ == "__main__":
if len(sys.argv) != 2:
print >> sys.stderr, "Usage: wordcount "
exit(-1)
sc = SparkContext(appName="PythonWordCount")
lines = sc.textFile(sys.argv[1], 1)
counts = lines.flatMap(lambda x: x.split(' ')) \
 .map(lambda x: (x, 1)) \
 .reduceByKey(add)
output = counts.collect()
for (word, count) in output:
print "%s: %i" % (word, count)

sc.stop()

* workflow oozie


${jobTracker}
${nameNode}


startDate

${firstNotNull(wf:conf("initial-date"),firstNotNull(wf:conf("dateFromFile"),"sysdate"))}



 


${jobTracker}
${nameNode}
yarn
cluster
${spark_job_name}
${spark_code_path_jar_or_py}
--executor-memory 256m --driver-memory 256m 
--executor-cores 1 --num-executors 1 --conf 
spark.yarn.queue=default
${nameNode}/group/saludar.txt





Hello World failed, error 
message[${wf:errorMessage(wf:lastErrorNode())}]











I can't attach the state of my jobs, so I paste it here:
Summary Metrics
No tasks have started yet
Tasks
No tasks have started yet




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13167) JDBC data source does not include null value partition columns rows in the result.

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13167:


Assignee: Apache Spark

> JDBC data source does not include null value partition columns rows in the 
> result.
> --
>
> Key: SPARK-13167
> URL: https://issues.apache.org/jira/browse/SPARK-13167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Suresh Thalamati
>Assignee: Apache Spark
>
> Reading from a JDBC data source using a nullable partition column can return 
> an incorrect number of rows if there are rows with a null value in the 
> partition column.
> {code}
> val emp = 
> sqlContext.read.jdbc("jdbc:h2:mem:testdb0;user=testUser;password=testPass", 
> "TEST.EMP", "theid", 0, 4, 3, new Properties)
> emp.count()
> {code}
> The above JDBC read call sets up partitions of the following form; none of 
> them includes a null predicate.
> {code}
> JDBCPartition(THEID < 1,0),JDBCPartition(THEID >= 1 AND THEID < 
> 2,1),JDBCPartition(THEID >= 2,2)
> {code}
> Rows with null values in the partition column are not included in the results 
> because none of the partition predicates includes an IS NULL clause.
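A possible workaround sketch, assuming a {{SQLContext}} named {{sqlContext}} (the predicates below are an assumption, not part of the reported code): pass explicit partition predicates that include an IS NULL clause instead of relying on column-range partitioning.

{code}
import java.util.Properties

// Explicit predicates; the last one also picks up rows with a NULL THEID.
val predicates = Array(
  "THEID < 1",
  "THEID >= 1 AND THEID < 2",
  "THEID >= 2 OR THEID IS NULL")
val emp = sqlContext.read.jdbc(
  "jdbc:h2:mem:testdb0;user=testUser;password=testPass",
  "TEST.EMP", predicates, new Properties)
emp.count()
{code}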



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13167) JDBC data source does not include null value partition columns rows in the result.

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13167:


Assignee: (was: Apache Spark)

> JDBC data source does not include null value partition columns rows in the 
> result.
> --
>
> Key: SPARK-13167
> URL: https://issues.apache.org/jira/browse/SPARK-13167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Suresh Thalamati
>
> Reading from a JDBC data source using a nullable partition column can return 
> an incorrect number of rows if there are rows with a null value in the 
> partition column.
> {code}
> val emp = 
> sqlContext.read.jdbc("jdbc:h2:mem:testdb0;user=testUser;password=testPass", 
> "TEST.EMP", "theid", 0, 4, 3, new Properties)
> emp.count()
> {code}
> The above JDBC read call sets up partitions of the following form; none of 
> them includes a null predicate.
> {code}
> JDBCPartition(THEID < 1,0),JDBCPartition(THEID >= 1 AND THEID < 
> 2,1),JDBCPartition(THEID >= 2,2)
> {code}
> Rows with null values in the partition column are not included in the results 
> because none of the partition predicates includes an IS NULL clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13165) Replace deprecated synchronizedBuffer in streaming

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131383#comment-15131383
 ] 

Apache Spark commented on SPARK-13165:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/11067

> Replace deprecated synchronizedBuffer in streaming
> --
>
> Key: SPARK-13165
> URL: https://issues.apache.org/jira/browse/SPARK-13165
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: holdenk
>Priority: Trivial
>
> See parent for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13165) Replace deprecated synchronizedBuffer in streaming

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13165:


Assignee: (was: Apache Spark)

> Replace deprecated synchronizedBuffer in streaming
> --
>
> Key: SPARK-13165
> URL: https://issues.apache.org/jira/browse/SPARK-13165
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: holdenk
>Priority: Trivial
>
> See parent for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13165) Replace deprecated synchronizedBuffer in streaming

2016-02-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13165:


Assignee: Apache Spark

> Replace deprecated synchronizedBuffer in streaming
> --
>
> Key: SPARK-13165
> URL: https://issues.apache.org/jira/browse/SPARK-13165
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Trivial
>
> See parent for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6715) Eliminate duplicate filters from pushdown predicates

2016-02-03 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6715.
---
Resolution: Won't Fix

I believe that this has been addressed by 
https://issues.apache.org/jira/browse/SPARK-10978

> Eliminate duplicate filters from pushdown predicates
> 
>
> Key: SPARK-6715
> URL: https://issues.apache.org/jira/browse/SPARK-6715
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Currently in {{DataSourceStrategy}}, the pushdown predicates are duplicates of 
> the original {{Filter}} conditions. Thus, some predicates are evaluated both by 
> the data source relation and by the {{Filter}} plan, which is redundant work. 
> Once a predicate has been pushed down and evaluated by the data source 
> relation, it can be eliminated from the outer {{Filter}} plan's condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7376) Python: Add validation functionality to individual Param

2016-02-03 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131401#comment-15131401
 ] 

Seth Hendrickson commented on SPARK-7376:
-

I am seeing this Jira now after several related Jiras have been created. I am 
working on a solution for this. I can add the other Jiras as links or sub-tasks 
of this one.

> Python: Add validation functionality to individual Param
> 
>
> Key: SPARK-7376
> URL: https://issues.apache.org/jira/browse/SPARK-7376
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Same as [SPARK-7176], but for Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13174) Add API and options for csv data sources

2016-02-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13174:
--

 Summary: Add API and options for csv data sources
 Key: SPARK-13174
 URL: https://issues.apache.org/jira/browse/SPARK-13174
 Project: Spark
  Issue Type: New Feature
Reporter: Davies Liu


We should have an API to load a CSV data source (with options passed as 
arguments), similar to json() and jdbc().
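A hypothetical sketch of what such an API could look like (the method and option names here are illustrative, not a committed interface):

{code}
// Load a delimited file through a dedicated csv() entry point, mirroring
// how json() and jdbc() are exposed on DataFrameReader.
val df = sqlContext.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv("/path/to/data.csv")
{code}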



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13170) Investigate replacing SynchronizedQueue as it is deprecated

2016-02-03 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-13170:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-13175

> Investigate replacing SynchronizedQueue as it is deprecated
> ---
>
> Key: SPARK-13170
> URL: https://issues.apache.org/jira/browse/SPARK-13170
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming, Tests
>Reporter: holdenk
>Priority: Trivial
>
> In some of our tests we use SynchronizedQueue to append to the queue after 
> creating a queue stream. SynchronizedQueue is deprecated and we should see if 
> we can replace it. This is a bit tricky since the queue stream API is public, 
> and while it doesn't depend on having a SynchronizedQueue as input 
> (thankfully) it does require a Queue. We could possibly change the tests to 
> not depend on the SynchronizedQueue or change the QueueStream to also work 
> with Iterables.
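One possible direction, sketched under the assumption that the tests keep using {{queueStream}} (which still expects a {{mutable.Queue}}): guard a plain queue with explicit synchronization instead of mixing in {{SynchronizedQueue}}.

{code}
import scala.collection.mutable

// Plain Queue plus explicit synchronization around every append/read.
val rddQueue = new mutable.Queue[org.apache.spark.rdd.RDD[Int]]()
rddQueue.synchronized {
  // rddQueue += someRdd   // append from the test while the stream is running
}
{code}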



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13171) Update promise & future to Promise and Future as the old ones are deprecated

2016-02-03 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-13171:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-13175

> Update promise & future to Promise and Future as the old ones are deprecated
> 
>
> Key: SPARK-13171
> URL: https://issues.apache.org/jira/browse/SPARK-13171
> Project: Spark
>  Issue Type: Sub-task
>Reporter: holdenk
>Priority: Trivial
>
> We use the promise and future functions on the concurrent object, both of 
> which have been deprecated in Scala 2.11. The full traits are present in Scala 
> 2.10 as well, so this should be a safe migration.
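A sketch of the migration (standard library only; nothing Spark-specific is assumed):

{code}
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global

// Use the Promise/Future companion objects instead of the deprecated
// scala.concurrent.promise / scala.concurrent.future helpers.
val p = Promise[Int]()        // was: scala.concurrent.promise[Int]()
val f = Future { 21 * 2 }     // was: scala.concurrent.future { 21 * 2 }
p.success(42)
{code}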



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13178) RRDD faces with concurrency issue in case of rdd.zip(rdd).count()

2016-02-03 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131430#comment-15131430
 ] 

Xusen Yin commented on SPARK-13178:
---

Pinging [~mengxr] [~shivaram] about the concurrency issue. I am working on a 
solution, but it would help to hear more from you, since I am not an expert on 
this kind of bug.

> RRDD faces with concurrency issue in case of rdd.zip(rdd).count()
> -
>
> Key: SPARK-13178
> URL: https://issues.apache.org/jira/browse/SPARK-13178
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Xusen Yin
>
> In the KMeans algorithm, there is a zip operation before taking samples, i.e. 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L210,
>  which can be simplified to the following code:
> {code:title=zip.scala|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> val rdd =  ...
> val rdd2 = rdd.map(x => x)
> rdd.zip(rdd2).count()
> {code}
> However, RRDD fails on this operation with an error of "can only zip rdd with 
> same number of elements" or "stream closed", similar to the JIRA issue: 
> https://issues.apache.org/jira/browse/SPARK-2251
> Inside RRDD, a data stream is used to ingest data from the R side. In the zip 
> operation, zip with self computes each partition twice. So if we zip a 
> HadoopRDD (iris dataset) with itself, we get 
> {code:title=log-from-zip-HadoopRDD|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> we get a pair (6.8, 6.8)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.9, 4.9)
> we get a pair (6.0, 6.0)
> we get a pair (4.7, 4.7)
> we get a pair (5.7, 5.7)
> we get a pair (4.6, 4.6)
> we get a pair (5.5, 5.5)
> we get a pair (5.0, 5.0)
> we get a pair (5.5, 5.5)
> we get a pair (5.4, 5.4)
> we get a pair (5.8, 5.8)
> we get a pair (4.6, 4.6)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (5.4, 5.4)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (4.9, 4.9)
> we get a pair (6.7, 6.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (5.5, 5.5)
> we get a pair (4.3, 4.3)
> we get a pair (5.5, 5.5)
> we get a pair (5.8, 5.8)
> we get a pair (6.1, 6.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.8, 5.8)
> we get a pair (5.4, 5.4)
> we get a pair (5.0, 5.0)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.2, 6.2)
> we get a pair (5.1, 5.1)
> we get a pair (5.1, 5.1)
> we get a pair (4.6, 4.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.8, 5.8)
> we get a pair (5.0, 5.0)
> we get a pair (7.1, 7.1)
> we get a pair (5.0, 5.0)
> we get a pair (6.3, 6.3)
> we get a pair (5.2, 5.2)
> we get a pair (6.5, 6.5)
> we get a pair (5.2, 5.2)
> we get a pair (7.6, 7.6)
> we get a pair (4.7, 4.7)
> we get a pair (4.9, 4.9)
> we get a pair (4.8, 4.8)
> we get a pair (7.3, 7.3)
> we get a pair (5.4, 5.4)
> we get a pair (6.7, 6.7)
> we get a pair (5.2, 5.2)
> we get a pair (7.2, 7.2)
> we get a pair (5.5, 5.5)
> we get a pair (6.5, 6.5)
> we get a pair (4.9, 4.9)
> we get a pair (6.4, 6.4)
> we get a pair (5.0, 5.0)
> we get a pair (6.8, 6.8)
> we get a pair (5.5, 5.5)
> we get a pair (5.7, 5.7)
> we get a pair (4.9, 4.9)
> we get a pair (5.8, 5.8)
> we get a pair (4.4, 4.4)
> we get a pair (6.4, 6.4)
> we get a pair (5.1, 5.1)
> we get a pair (6.5, 6.5)
> we get a pair (5.0, 5.0)
> we get a pair (7.7, 7.7)
> we get a pair (4.5, 4.5)
> we get a pair (7.7, 7.7)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (6.9, 6.9)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (7.7, 7.7)
> we get a pair (6.3, 6.3)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.6, 4.6)
> we get a pair (7.2, 7.2)
> we get a pair (5.3, 5.3)
> we get a pair (6.2, 6.2)
> we get a pair (5.0, 5.0)
> we get a pair (6.1, 6.1)
> we get a pair (7.0, 7.0)
> we get a pair (6.4, 6.4)
> we get a pair (6.4, 6.4)
> we get a pair (7.2, 7.2)
> we get a pair (6.9, 6.9)
> we get a pair (7.4, 7.4)
> we get a pair (5.5, 5.5)
> we get a pair (7.9, 7.9)
> we get a pair (6.5, 6.5)
> we get a pair (6.4, 6.4)
> we get a pair (5.7, 5.7)
> we get a pair (6.3, 6.3)
> we get a pair (6.3, 6.3)
> we get a pair (6.1, 6.1)
> we get a pair (4.9, 4.9)
> we get a pair (7.7, 7.7)
> we get a pair (6.6, 6.6)
> we get a pair (6.3, 6.3)
> we get a pair 

[jira] [Updated] (SPARK-13178) RRDD faces with concurrency issue in case of rdd.zip(rdd).count()

2016-02-03 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-13178:
--
Description: 
In the KMeans algorithm, there is a zip operation before taking samples, i.e. 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L210,
 which can be simplified to the following code:

{code:title=zip.scala|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
val rdd =  ...
val rdd2 = rdd.map(x => x)
rdd.zip(rdd2).count()
{code}

However, RRDD fails on this operation with an error of "can only zip rdd with 
same number of elements" or "stream closed", similar to the JIRA issue: 
https://issues.apache.org/jira/browse/SPARK-2251

Inside RRDD, a data stream is used to ingest data from the R side. In the zip 
operation, zip with self computes each partition twice. So if we zip a 
HadoopRDD (iris dataset) with itself, we get 

{code:title=log-from-zip-HadoopRDD|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
we get a pair (6.8, 6.8)
we get a pair (5.1, 5.1)
we get a pair (6.7, 6.7)
we get a pair (4.9, 4.9)
we get a pair (6.0, 6.0)
we get a pair (4.7, 4.7)
we get a pair (5.7, 5.7)
we get a pair (4.6, 4.6)
we get a pair (5.5, 5.5)
we get a pair (5.0, 5.0)
we get a pair (5.5, 5.5)
we get a pair (5.4, 5.4)
we get a pair (5.8, 5.8)
we get a pair (4.6, 4.6)
we get a pair (6.0, 6.0)
we get a pair (5.0, 5.0)
we get a pair (5.4, 5.4)
we get a pair (4.4, 4.4)
we get a pair (6.0, 6.0)
we get a pair (4.9, 4.9)
we get a pair (6.7, 6.7)
we get a pair (5.4, 5.4)
we get a pair (6.3, 6.3)
we get a pair (4.8, 4.8)
we get a pair (5.6, 5.6)
we get a pair (4.8, 4.8)
we get a pair (5.5, 5.5)
we get a pair (4.3, 4.3)
we get a pair (5.5, 5.5)
we get a pair (5.8, 5.8)
we get a pair (6.1, 6.1)
we get a pair (5.7, 5.7)
we get a pair (5.8, 5.8)
we get a pair (5.4, 5.4)
we get a pair (5.0, 5.0)
we get a pair (5.1, 5.1)
we get a pair (5.6, 5.6)
we get a pair (5.7, 5.7)
we get a pair (5.7, 5.7)
we get a pair (5.1, 5.1)
we get a pair (5.7, 5.7)
we get a pair (5.4, 5.4)
we get a pair (6.2, 6.2)
we get a pair (5.1, 5.1)
we get a pair (5.1, 5.1)
we get a pair (4.6, 4.6)
we get a pair (5.7, 5.7)
we get a pair (5.1, 5.1)
we get a pair (6.3, 6.3)
we get a pair (4.8, 4.8)
we get a pair (5.8, 5.8)
we get a pair (5.0, 5.0)
we get a pair (7.1, 7.1)
we get a pair (5.0, 5.0)
we get a pair (6.3, 6.3)
we get a pair (5.2, 5.2)
we get a pair (6.5, 6.5)
we get a pair (5.2, 5.2)
we get a pair (7.6, 7.6)
we get a pair (4.7, 4.7)
we get a pair (4.9, 4.9)
we get a pair (4.8, 4.8)
we get a pair (7.3, 7.3)
we get a pair (5.4, 5.4)
we get a pair (6.7, 6.7)
we get a pair (5.2, 5.2)
we get a pair (7.2, 7.2)
we get a pair (5.5, 5.5)
we get a pair (6.5, 6.5)
we get a pair (4.9, 4.9)
we get a pair (6.4, 6.4)
we get a pair (5.0, 5.0)
we get a pair (6.8, 6.8)
we get a pair (5.5, 5.5)
we get a pair (5.7, 5.7)
we get a pair (4.9, 4.9)
we get a pair (5.8, 5.8)
we get a pair (4.4, 4.4)
we get a pair (6.4, 6.4)
we get a pair (5.1, 5.1)
we get a pair (6.5, 6.5)
we get a pair (5.0, 5.0)
we get a pair (7.7, 7.7)
we get a pair (4.5, 4.5)
we get a pair (7.7, 7.7)
we get a pair (4.4, 4.4)
we get a pair (6.0, 6.0)
we get a pair (5.0, 5.0)
we get a pair (6.9, 6.9)
we get a pair (5.1, 5.1)
we get a pair (5.6, 5.6)
we get a pair (4.8, 4.8)
we get a pair (7.7, 7.7)
we get a pair (6.3, 6.3)
we get a pair (5.1, 5.1)
we get a pair (6.7, 6.7)
we get a pair (4.6, 4.6)
we get a pair (7.2, 7.2)
we get a pair (5.3, 5.3)
we get a pair (6.2, 6.2)
we get a pair (5.0, 5.0)
we get a pair (6.1, 6.1)
we get a pair (7.0, 7.0)
we get a pair (6.4, 6.4)
we get a pair (6.4, 6.4)
we get a pair (7.2, 7.2)
we get a pair (6.9, 6.9)
we get a pair (7.4, 7.4)
we get a pair (5.5, 5.5)
we get a pair (7.9, 7.9)
we get a pair (6.5, 6.5)
we get a pair (6.4, 6.4)
we get a pair (5.7, 5.7)
we get a pair (6.3, 6.3)
we get a pair (6.3, 6.3)
we get a pair (6.1, 6.1)
we get a pair (4.9, 4.9)
we get a pair (7.7, 7.7)
we get a pair (6.6, 6.6)
we get a pair (6.3, 6.3)
we get a pair (5.2, 5.2)
we get a pair (6.4, 6.4)
we get a pair (5.0, 5.0)
we get a pair (6.0, 6.0)
we get a pair (5.9, 5.9)
we get a pair (6.9, 6.9)
we get a pair (6.0, 6.0)
we get a pair (6.7, 6.7)
we get a pair (6.1, 6.1)
we get a pair (6.9, 6.9)
we get a pair (5.6, 5.6)
we get a pair (5.8, 5.8)
we get a pair (6.7, 6.7)
we get a pair (6.8, 6.8)
we get a pair (5.6, 5.6)
we get a pair (6.7, 6.7)
we get a pair (5.8, 5.8)
we get a pair (6.7, 6.7)
we get a pair (6.2, 6.2)
we get a pair (6.3, 6.3)
we get a pair (5.6, 5.6)
we get a pair (6.5, 6.5)
we get a pair (5.9, 5.9)
we get a pair (6.2, 6.2)
we get a pair (6.1, 6.1)
we get a pair (5.9, 5.9)
we get a pair (6.3, 6.3)
we get a pair (6.1, 6.1)
we get a pair (6.4, 6.4)
we get a pair (6.6, 6.6)
{code}

However, in RRDD with the same setting we get:


[jira] [Commented] (SPARK-13178) RRDD faces with concurrency issue in case of rdd.zip(rdd).count()

2016-02-03 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131436#comment-15131436
 ] 

Shivaram Venkataraman commented on SPARK-13178:
---

Hmm, this is tricky to debug -- a higher-level question: why do we need to 
implement this using RRDD and zip on it? The RRDD class is deprecated and 
going to go away soon. I thought the KMeans effort would only require wrapping 
the Scala algorithm?

> RRDD faces with concurrency issue in case of rdd.zip(rdd).count()
> -
>
> Key: SPARK-13178
> URL: https://issues.apache.org/jira/browse/SPARK-13178
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Xusen Yin
>
> In the KMeans algorithm, there is a zip operation before taking samples, i.e. 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L210,
>  which can be simplified to the following code:
> {code:title=zip.scala|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> val rdd =  ...
> val rdd2 = rdd.map(x => x)
> rdd.zip(rdd2).count()
> {code}
> However, RRDD fails on this operation with an error of "can only zip rdd with 
> same number of elements" or "stream closed", similar to the JIRA issue: 
> https://issues.apache.org/jira/browse/SPARK-2251
> Inside RRDD, a data stream is used to ingest data from the R side. In the zip 
> operation, zip with self computes each partition twice. So if we zip a 
> HadoopRDD (iris dataset) with itself, we get 
> {code:title=log-from-zip-HadoopRDD|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> we get a pair (6.8, 6.8)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.9, 4.9)
> we get a pair (6.0, 6.0)
> we get a pair (4.7, 4.7)
> we get a pair (5.7, 5.7)
> we get a pair (4.6, 4.6)
> we get a pair (5.5, 5.5)
> we get a pair (5.0, 5.0)
> we get a pair (5.5, 5.5)
> we get a pair (5.4, 5.4)
> we get a pair (5.8, 5.8)
> we get a pair (4.6, 4.6)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (5.4, 5.4)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (4.9, 4.9)
> we get a pair (6.7, 6.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (5.5, 5.5)
> we get a pair (4.3, 4.3)
> we get a pair (5.5, 5.5)
> we get a pair (5.8, 5.8)
> we get a pair (6.1, 6.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.8, 5.8)
> we get a pair (5.4, 5.4)
> we get a pair (5.0, 5.0)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.2, 6.2)
> we get a pair (5.1, 5.1)
> we get a pair (5.1, 5.1)
> we get a pair (4.6, 4.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.8, 5.8)
> we get a pair (5.0, 5.0)
> we get a pair (7.1, 7.1)
> we get a pair (5.0, 5.0)
> we get a pair (6.3, 6.3)
> we get a pair (5.2, 5.2)
> we get a pair (6.5, 6.5)
> we get a pair (5.2, 5.2)
> we get a pair (7.6, 7.6)
> we get a pair (4.7, 4.7)
> we get a pair (4.9, 4.9)
> we get a pair (4.8, 4.8)
> we get a pair (7.3, 7.3)
> we get a pair (5.4, 5.4)
> we get a pair (6.7, 6.7)
> we get a pair (5.2, 5.2)
> we get a pair (7.2, 7.2)
> we get a pair (5.5, 5.5)
> we get a pair (6.5, 6.5)
> we get a pair (4.9, 4.9)
> we get a pair (6.4, 6.4)
> we get a pair (5.0, 5.0)
> we get a pair (6.8, 6.8)
> we get a pair (5.5, 5.5)
> we get a pair (5.7, 5.7)
> we get a pair (4.9, 4.9)
> we get a pair (5.8, 5.8)
> we get a pair (4.4, 4.4)
> we get a pair (6.4, 6.4)
> we get a pair (5.1, 5.1)
> we get a pair (6.5, 6.5)
> we get a pair (5.0, 5.0)
> we get a pair (7.7, 7.7)
> we get a pair (4.5, 4.5)
> we get a pair (7.7, 7.7)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (6.9, 6.9)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (7.7, 7.7)
> we get a pair (6.3, 6.3)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.6, 4.6)
> we get a pair (7.2, 7.2)
> we get a pair (5.3, 5.3)
> we get a pair (6.2, 6.2)
> we get a pair (5.0, 5.0)
> we get a pair (6.1, 6.1)
> we get a pair (7.0, 7.0)
> we get a pair (6.4, 6.4)
> we get a pair (6.4, 6.4)
> we get a pair (7.2, 7.2)
> we get a pair (6.9, 6.9)
> we get a pair (7.4, 7.4)
> we get a pair (5.5, 5.5)
> we get a pair (7.9, 7.9)
> we get a pair (6.5, 6.5)
> we get a pair (6.4, 6.4)
> we get a pair (5.7, 5.7)
> we get a pair (6.3, 6.3)
> we get a pair (6.3, 6.3)
> we get a pair (6.1, 6.1)
> we get a pair (4.9, 4.9)
> we 

[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-02-03 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131447#comment-15131447
 ] 

Xiao Li commented on SPARK-12720:
-

CUBE(a, b, c) = GROUPING SETS((a,b,c), (a,b), (a,c), (b,c), (a), (b), (c), ())
ROLLUP(a, b, c) = GROUPING SETS((a,b,c), (a,b), (a), ())
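An illustrative sketch of what SQL generation needs to preserve, assuming a HiveContext bound to {{sqlContext}} and a table {{t(a, b, c)}} (these queries are an example, not test code from the patch):

{code}
// The ROLLUP form and its GROUPING SETS expansion should be equivalent.
sqlContext.sql("SELECT a, b, sum(c) FROM t GROUP BY a, b WITH ROLLUP")
sqlContext.sql(
  "SELECT a, b, sum(c) FROM t GROUP BY a, b GROUPING SETS ((a, b), (a), ())")
{code}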

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13167) JDBC data source does not include null value partition columns rows in the result.

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131285#comment-15131285
 ] 

Apache Spark commented on SPARK-13167:
--

User 'sureshthalamati' has created a pull request for this issue:
https://github.com/apache/spark/pull/11063

> JDBC data source does not include null value partition columns rows in the 
> result.
> --
>
> Key: SPARK-13167
> URL: https://issues.apache.org/jira/browse/SPARK-13167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Suresh Thalamati
>
> Reading from a JDBC data source using a nullable partition column can return 
> an incorrect number of rows if there are rows with a null value in the 
> partition column.
> {code}
> val emp = 
> sqlContext.read.jdbc("jdbc:h2:mem:testdb0;user=testUser;password=testPass", 
> "TEST.EMP", "theid", 0, 4, 3, new Properties)
> emp.count()
> {code}
> The above JDBC read call sets up partitions of the following form; none of 
> them includes a null predicate.
> {code}
> JDBCPartition(THEID < 1,0),JDBCPartition(THEID >= 1 AND THEID < 
> 2,1),JDBCPartition(THEID >= 2,2)
> {code}
> Rows with null values in the partition column are not included in the results 
> because none of the partition predicates includes an IS NULL clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13131) Use best time and average time in micro benchmark

2016-02-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131353#comment-15131353
 ] 

Sean Owen commented on SPARK-13131:
---

Isn't best time in fact the best estimator of what this benchmark seeks to 
measure - how much pure computation X takes? A run time is the true constant 
time taken to compute X plus noise from other things executing concurrently, 
and the 'error' can never be negative, so the least run time is the one with 
the least error. The mean isn't that helpful in comparison. Stability is a 
secondary issue.
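A toy illustration of the two statistics (the numbers are made up):

{code}
// The minimum of the measured run times carries the least scheduling noise,
// while the mean also reflects the spread across runs.
val runsMs = Seq(812L, 790L, 805L, 798L, 901L)
val best = runsMs.min
val mean = runsMs.sum.toDouble / runsMs.size
println(f"best: $best ms, average: $mean%.1f ms")
{code}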

> Use best  time and average time in micro benchmark
> --
>
> Key: SPARK-13131
> URL: https://issues.apache.org/jira/browse/SPARK-13131
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Best time should be more stable than average time in a benchmark; reported 
> together with average time, they show more information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13173) Fail to load CSV file with NPE

2016-02-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13173:
--

 Summary: Fail to load CSV file with NPE
 Key: SPARK-13173
 URL: https://issues.apache.org/jira/browse/SPARK-13173
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


{code}
id|end_date|start_date|location
1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF
2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD
3|2015-10-16 02:30:00|2015-01-14 00:00:00|NY-NY
4|2015-10-17 03:00:20|2015-02-14 00:00:00|NY-NY
5|2015-10-18 04:30:00|2014-04-14 00:00:00|CA-SD
{code}

{code}
adult_df = sqlContext.read.\
format("org.apache.spark.sql.execution.datasources.csv").\
option("header", "false").option("delimiter", "|").\
option("inferSchema", "true").load("/tmp/dataframe_sample.csv")
{code}

{code}
Py4JJavaError: An error occurred while calling o239.load.
: java.lang.NullPointerException
at 
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at 
scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:93)
at 
scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:108)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.inferSchema(CSVRelation.scala:137)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema$lzycompute(CSVRelation.scala:50)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema(CSVRelation.scala:48)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:666)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:665)
at 
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:39)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:115)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:136)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:290)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
{code}
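
A possible workaround sketch, not verified against this bug: since the NPE is 
thrown from CSVRelation.inferSchema, supplying an explicit schema should skip 
the inference path entirely. Shown here in Scala; the column types simply 
mirror the sample data above.

{code}
import org.apache.spark.sql.types._

// Hypothetical workaround: declare the schema up front so inferSchema is never run.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("end_date", TimestampType),
  StructField("start_date", TimestampType),
  StructField("location", StringType)))

val df = sqlContext.read
  .format("org.apache.spark.sql.execution.datasources.csv")
  .option("header", "false")
  .option("delimiter", "|")
  .schema(schema)
  .load("/tmp/dataframe_sample.csv")
{code}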



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13046) Partitioning looks broken in 1.6

2016-02-03 Thread Julien Baley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131418#comment-15131418
 ] 

Julien Baley commented on SPARK-13046:
--

Hi Davies,

I have no other files in the middle of the paths; we store everything after the 
fingerprint.

Could you perhaps try a structure closer to the one I described, with two 
key/value pairs and every file stored below those? The heterogeneity of your 
example may be what makes it work, somehow.

> Partitioning looks broken in 1.6
> 
>
> Key: SPARK-13046
> URL: https://issues.apache.org/jira/browse/SPARK-13046
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Julien Baley
>
> Hello,
> I have a list of files in s3:
> {code}
> s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> {code}
> Until 1.5.2, this all worked well: passing s3://bucket/some_path/ (the same 
> for all three lines) would correctly identify two key/value pairs, one 
> `date_received` and one `fingerprint`.
> From 1.6.0, I get the following exception:
> {code}
> assertion failed: Conflicting directory structures detected. Suspicious paths
> s3://bucket/some_path/date_received=2016-01-13
> s3://bucket/some_path/date_received=2016-01-14
> s3://bucket/some_path/date_received=2016-01-15
> {code}
> That is to say, the partitioning code now fails to identify 
> date_received=2016-01-13 as a key/value pair.
> I can see that there has been some activity on 
> spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
>  recently, so that seems related (especially the commits 
> https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b
>   and 
> https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14
>  ).
> If I read the tests added in those commits correctly:
> - they don't seem to actually test the return value, only that it doesn't crash
> - they only test cases where the s3 path contains a single key/value pair 
> (which would otherwise catch the bug)
> This is problematic for us as we're trying to migrate all of our Spark 
> services to 1.6.0 and this bug is a real blocker. I know it's possible to 
> force a 'union', but I'd rather not do that if the bug can be fixed.
> Any questions, please shoot.
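
One workaround sketch for this layout (an assumption, not a confirmed fix for 
the 1.6.0 regression): point partition discovery at the common root via the 
basePath option, so the date_received=... and fingerprint=... directories are 
treated as partition columns rather than conflicting roots.

{code}
// Hedged workaround sketch; the bucket and path are the ones from the report above.
val df = sqlContext.read
  .option("basePath", "s3://bucket/some_path/")
  .parquet("s3://bucket/some_path/")
{code}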



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!

2016-02-03 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-13176:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-13175

> Ignore deprecation warning for ProcessBuilder lines_!
> -
>
> Key: SPARK-13176
> URL: https://issues.apache.org/jira/browse/SPARK-13176
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> The replacement, stream_! & lineStream_!, is not present in the 2.10 API.
> Note that @SuppressWarnings for deprecation doesn't appear to work 
> (https://issues.scala-lang.org/browse/SI-7934), so suppressing the warnings 
> might involve wrapping or similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13176) Ignore deprecation warning for ProcessBuilder lines_!

2016-02-03 Thread holdenk (JIRA)
holdenk created SPARK-13176:
---

 Summary: Ignore deprecation warning for ProcessBuilder lines_!
 Key: SPARK-13176
 URL: https://issues.apache.org/jira/browse/SPARK-13176
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: holdenk
Priority: Trivial


The replacement, stream_! & lineStream_!, is not present in the 2.10 API.
Note that @SuppressWarnings for deprecation doesn't appear to work 
(https://issues.scala-lang.org/browse/SI-7934), so suppressing the warnings might 
involve wrapping or similar.
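
A minimal sketch of the wrapping idea, assuming the goal is to confine the 
deprecated lines_! call (and its warning) to a single helper; the object and 
method names below are hypothetical, not taken from the Spark codebase:

{code}
import scala.sys.process._

// Hypothetical wrapper: the deprecated `lines_!` is called in exactly one place,
// so the deprecation warning stays confined to this file (Scala 2.10 has no
// `lineStream_!` to migrate to).
object ProcessShim {
  def lines(cmd: Seq[String]): Stream[String] =
    cmd.lines_!(ProcessLogger(_ => ()))
}

// Usage: ProcessShim.lines(Seq("ls", "-la")).foreach(println)
{code}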



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12514) Spark MetricsSystem can fill disks/cause OOMs when using GangliaSink

2016-02-03 Thread Jonathan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131439#comment-15131439
 ] 

Jonathan Kelly commented on SPARK-12514:


As of Spark 1.6.0, there don't seem to be *any* Spark metrics that are not 
prefixed by the YARN application ID, so "filter out application specific 
metrics" basically means "don't use Ganglia", right? I kid, but doesn't this 
make Spark+Ganglia integration pretty useless because Ganglia can't scale to 
the number of unique metrics that Spark is generating?

(I say, "as of Spark 1.6.0" because before Spark 1.6.0 the DAGScheduler metrics 
were not prefixed by the YARN application ID, but I see that this actually 
appears to have been a bug that was fixed in Spark 1.6.0 with 
https://issues.apache.org/jira/browse/SPARK-11828.)

> Spark MetricsSystem can fill disks/cause OOMs when using GangliaSink
> 
>
> Key: SPARK-12514
> URL: https://issues.apache.org/jira/browse/SPARK-12514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Aaron Tokhy
>Priority: Minor
>
> The MetricsSystem implementation in Spark generates unique metric names for 
> each Spark application submitted (to a YARN cluster, for example). This can 
> be problematic for certain metrics environments, like Ganglia.
> For each submitted application, it creates metric names such as:
> application_1450753701508_0001.driver.ExecutorAllocationManager.executors.numberAllExecutors
> On Spark clusters where thousands of applications are submitted, these 
> metrics will eventually cause the Ganglia daemons to hit their memory limits 
> (gmond) or run out of disk space (gmetad), because some existing metrics 
> systems do not expect new metric names to keep appearing over the lifetime of 
> a cluster.
> Ganglia as a Spark metrics sink is one example of where the current 
> implementation runs into problems. Each application's set of metrics 
> introduces a new set of round-robin database (RRD) files in gmetad and new 
> metrics in gmond, which bloats the gmond aggregator's memory usage over time. 
> The RRD files are permanent, so every new set of metrics leaves behind files 
> that are never cleaned up.
> So the MetricsSystem may need to account for metrics sinks that cannot handle 
> a continual stream of new metric names, and buildRegistryName would have to 
> behave differently in that case.
> https://github.com/apache/spark/blob/d83c2f9f0b08d6d5d369d9fae04cdb15448e7f0d/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L126
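
For reference, a simplified sketch of the per-application naming scheme the 
description refers to; this is illustrative only, not the actual 
MetricsSystem.buildRegistryName code:

{code}
// Illustrative only: metric names are prefixed with the (unique) application id
// and instance id, so every submitted application contributes brand-new names.
def registryName(appId: Option[String], instanceId: Option[String], sourceName: String): String =
  (appId.toSeq ++ instanceId.toSeq :+ sourceName).mkString(".")

// registryName(Some("application_1450753701508_0001"), Some("driver"),
//   "ExecutorAllocationManager.executors.numberAllExecutors")
// => "application_1450753701508_0001.driver.ExecutorAllocationManager.executors.numberAllExecutors"
{code}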



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13178) RRDD faces with concurrency issue in case of rdd.zip(rdd).count()

2016-02-03 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131455#comment-15131455
 ] 

Xusen Yin commented on SPARK-13178:
---

I don't zip an RRDD with itself. The bug actually shows up when I call KMeans 
from the R side. I wrote the KMeans wrapper for SparkR in 
https://issues.apache.org/jira/browse/SPARK-13011 with the code below:

{code}
model <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers", "fitKMeans", algorithm,
                     x@sdf, iter.max, centers, "Sepal_Length Sepal_Width Petal_Length Petal_Width")
{code}

On the Spark side, I wrote a fitKMeans in org.apache.spark.ml.api.r.SparkRWrappers:

{code}
def fitKMeans(
    initMode: String,
    df: DataFrame,
    maxIter: Double,
    k: Double,
    columns: String): KMeansModel = {
  val assembler = new VectorAssembler().setInputCols(columns.split(" ")).setOutputCol("features")
  val features = assembler.transform(df).select("features")
  val kMeans = new KMeans()
    .setInitMode(initMode)
    .setMaxIter(maxIter.toInt)
    .setK(k.toInt)
  kMeans.fit(features)
}
{code}

KMeans internally calls rdd.zip(rdd.map(...)), and that rdd is derived from an 
RRDD, so I cannot move on without fixing this.
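
As context, a minimal sketch of the self-zip pattern with one common mitigation 
in Spark, persisting the RDD so each partition is materialized once before the 
zip; whether this actually sidesteps the RRDD "stream closed" failure is an 
assumption, not something verified here:

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Persist the input so zip does not recompute each partition twice from the
// underlying (R-side) data stream. Hypothetical mitigation, not a confirmed fix.
def zipWithSelfCount[T: ClassTag](rdd: RDD[T]): Long = {
  val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
  try cached.zip(cached.map(identity)).count()
  finally cached.unpersist()
}
{code}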

> RRDD faces with concurrency issue in case of rdd.zip(rdd).count()
> -
>
> Key: SPARK-13178
> URL: https://issues.apache.org/jira/browse/SPARK-13178
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Xusen Yin
>
> In the KMeans algorithm, there is a zip operation before taking samples, i.e. 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L210,
> which can be simplified to the following code:
> {code:title=zip.scala|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> val rdd =  ...
> val rdd2 = rdd.map(x => x)
> rdd.zip(rdd2).count()
> {code}
> However, RRDD fails on this operation with an error of "can only zip rdd with 
> same number of elements" or "stream closed", similar to 
> https://issues.apache.org/jira/browse/SPARK-2251.
> Inside RRDD, a data stream is used to ingest data from the R side. In the zip 
> operation, zipping an RDD with itself computes each partition twice. So if we 
> zip a HadoopRDD (the iris dataset) with itself, we get:
> {code:title=log-from-zip-HadoopRDD|theme=FadeToGrey|linenumbers=true|language=scala|firstline=0001|collapse=false}
> we get a pair (6.8, 6.8)
> we get a pair (5.1, 5.1)
> we get a pair (6.7, 6.7)
> we get a pair (4.9, 4.9)
> we get a pair (6.0, 6.0)
> we get a pair (4.7, 4.7)
> we get a pair (5.7, 5.7)
> we get a pair (4.6, 4.6)
> we get a pair (5.5, 5.5)
> we get a pair (5.0, 5.0)
> we get a pair (5.5, 5.5)
> we get a pair (5.4, 5.4)
> we get a pair (5.8, 5.8)
> we get a pair (4.6, 4.6)
> we get a pair (6.0, 6.0)
> we get a pair (5.0, 5.0)
> we get a pair (5.4, 5.4)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair (4.9, 4.9)
> we get a pair (6.7, 6.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.6, 5.6)
> we get a pair (4.8, 4.8)
> we get a pair (5.5, 5.5)
> we get a pair (4.3, 4.3)
> we get a pair (5.5, 5.5)
> we get a pair (5.8, 5.8)
> we get a pair (6.1, 6.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.8, 5.8)
> we get a pair (5.4, 5.4)
> we get a pair (5.0, 5.0)
> we get a pair (5.1, 5.1)
> we get a pair (5.6, 5.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (5.7, 5.7)
> we get a pair (5.4, 5.4)
> we get a pair (6.2, 6.2)
> we get a pair (5.1, 5.1)
> we get a pair (5.1, 5.1)
> we get a pair (4.6, 4.6)
> we get a pair (5.7, 5.7)
> we get a pair (5.1, 5.1)
> we get a pair (6.3, 6.3)
> we get a pair (4.8, 4.8)
> we get a pair (5.8, 5.8)
> we get a pair (5.0, 5.0)
> we get a pair (7.1, 7.1)
> we get a pair (5.0, 5.0)
> we get a pair (6.3, 6.3)
> we get a pair (5.2, 5.2)
> we get a pair (6.5, 6.5)
> we get a pair (5.2, 5.2)
> we get a pair (7.6, 7.6)
> we get a pair (4.7, 4.7)
> we get a pair (4.9, 4.9)
> we get a pair (4.8, 4.8)
> we get a pair (7.3, 7.3)
> we get a pair (5.4, 5.4)
> we get a pair (6.7, 6.7)
> we get a pair (5.2, 5.2)
> we get a pair (7.2, 7.2)
> we get a pair (5.5, 5.5)
> we get a pair (6.5, 6.5)
> we get a pair (4.9, 4.9)
> we get a pair (6.4, 6.4)
> we get a pair (5.0, 5.0)
> we get a pair (6.8, 6.8)
> we get a pair (5.5, 5.5)
> we get a pair (5.7, 5.7)
> we get a pair (4.9, 4.9)
> we get a pair (5.8, 5.8)
> we get a pair (4.4, 4.4)
> we get a pair (6.4, 6.4)
> we get a pair (5.1, 5.1)
> we get a pair (6.5, 6.5)
> we get a pair (5.0, 5.0)
> we get a pair (7.7, 7.7)
> we get a pair (4.5, 4.5)
> we get a pair (7.7, 7.7)
> we get a pair (4.4, 4.4)
> we get a pair (6.0, 6.0)
> we get a pair 

[jira] [Created] (SPARK-13168) Collapse adjacent Repartition operations

2016-02-03 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-13168:
--

 Summary: Collapse adjacent Repartition operations
 Key: SPARK-13168
 URL: https://issues.apache.org/jira/browse/SPARK-13168
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


Spark SQL should collapse adjacent {{Repartition}} operators and only keep the 
last one.
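
A rough sketch of what such an optimizer rule could look like; this is 
illustrative only, not the actual patch, and it assumes the Catalyst 
Repartition(numPartitions, shuffle, child) logical operator and the 
Rule[LogicalPlan] API:

{code}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Repartition}
import org.apache.spark.sql.catalyst.rules.Rule

// When two Repartition operators are adjacent, only the outer (last applied)
// one determines the final partitioning, so drop the inner one.
object CollapseRepartition extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case Repartition(numPartitions, shuffle, Repartition(_, _, child)) =>
      Repartition(numPartitions, shuffle, child)
  }
}
{code}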



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13095) improve performance of hash join with dimension table

2016-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131352#comment-15131352
 ] 

Apache Spark commented on SPARK-13095:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11065

> improve performance of hash join with dimension table
> -
>
> Key: SPARK-13095
> URL: https://issues.apache.org/jira/browse/SPARK-13095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> The join key is usually an integer or a long (a unique primary key), so we 
> could have a specialized HashRelation for those types.
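
As a toy illustration of the idea (not Spark's actual HashedRelation code): 
keying the build side by a primitive long avoids boxing and generic hashing 
when probing against a small dimension table. Everything below is a simplified, 
hypothetical sketch.

{code}
import scala.collection.mutable

// Toy long-keyed build side for a broadcast hash join against a dimension table.
// A real implementation would likely use an open-addressing table over primitive
// arrays; this only illustrates the interface.
final class LongKeyedRelation[V](rows: Iterator[(Long, V)]) {
  private val map = new mutable.LongMap[V]()
  rows.foreach { case (k, v) => map.update(k, v) }

  // Probe with a primitive long key; no boxed-object hashing involved.
  def get(key: Long): Option[V] = map.get(key)
}

// Example: val dim = new LongKeyedRelation(Iterator(1L -> "CA-SF", 2L -> "CA-SD"))
//          dim.get(1L)  // Some("CA-SF")
{code}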



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13149) Add FileStreamSource

2016-02-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13149:
-
Summary: Add FileStreamSource  (was: Add FileStreamSource and a simple 
version of FileStreamSink)

> Add FileStreamSource
> 
>
> Key: SPARK-13149
> URL: https://issues.apache.org/jira/browse/SPARK-13149
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


