[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-05 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572714#comment-14572714
 ] 

Animesh Baranawal edited comment on SPARK-8072 at 6/5/15 8:06 AM:
--

I think this can be done by checking for duplicates in the schema field-name list of 
the created DataFrame before returning it.
[~rxin]?
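
For illustration, a minimal sketch of such a check (the helper name and error wording are assumptions, not the actual Spark implementation; inside the {{org.apache.spark.sql}} package this would throw an {{AnalysisException}} rather than failing a {{require}}):
{code}
import org.apache.spark.sql.types.StructType

// Hedged sketch: scan the schema's field names for duplicates before writing.
// The helper name and the error message are invented for illustration.
def checkColumnNameDuplication(schema: StructType): Unit = {
  val duplicates = schema.fieldNames
    .groupBy(_.toLowerCase)  // column resolution is case-insensitive by default
    .collect { case (name, occurrences) if occurrences.length > 1 => name }
  require(duplicates.isEmpty,
    s"Found duplicate column(s) when writing the DataFrame: ${duplicates.mkString(", ")}")
}
{code}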


was (Author: animeshbaranawal):
I think this can be done by checking duplicates in the schema fieldname list of 
created dataframe before returning it.

> Better AnalysisException for writing DataFrame with identically named columns
> -
>
> Key: SPARK-8072
> URL: https://issues.apache.org/jira/browse/SPARK-8072
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> We should check whether there are duplicate columns and, if so, throw an explicit 
> error message saying that there are duplicate columns. See the current error message 
> below. 
> {code}
> In [3]: df.withColumn('age', df.age)
> Out[3]: DataFrame[age: bigint, name: string, age: bigint]
> In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
> /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
> mode)
> 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
> 351 """
> --> 352 self._jwrite.mode(mode).parquet(path)
> 353 
> 354 @since(1.4)
> /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
>  in __call__(self, *args)
> 535 answer = self.gateway_client.send_command(command)
> 536 return_value = get_return_value(answer, self.gateway_client,
> --> 537 self.target_id, self.name)
> 538 
> 539 for temp_arg in temp_args:
> /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
>  in get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling o35.parquet.
> : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
> be: age#0L, age#3L.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
>   at scala.collection.Iterator$$anon$11

[jira] [Updated] (SPARK-8121) "spark.sql.parquet.output.committer.class" is overridden by "spark.sql.sources.outputCommitterClass"

2015-06-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8121:
--
Description: 
When {{spark.sql.sources.outputCommitterClass}} is configured, 
{{spark.sql.parquet.output.committer.class}} will be overridden. 

For example, if {{spark.sql.parquet.output.committer.class}} is set to 
{{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
{{_common_metadata}} will be written because {{FileOutputCommitter}} overrides 
{{DirectParquetOutputCommitter}}.

  was:
When {{spark.sql.sources.outputCommitterClass}} is configured, 
{{spark.sql.parquet.output.committer.class}} will be overriden. 

For example, if {{spark.sql.parquet.output.committer.class}} is set to 
{{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
{{_common_metadata}} will be written because {{FileOutputCommitter}} overrides 
{{DirectParquetOutputCommitter}}.

The reason is that, in 


> "spark.sql.parquet.output.committer.class" is overriden by 
> "spark.sql.sources.outputCommitterClass"
> ---
>
> Key: SPARK-8121
> URL: https://issues.apache.org/jira/browse/SPARK-8121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> When {{spark.sql.sources.outputCommitterClass}} is configured, 
> {{spark.sql.parquet.output.committer.class}} will be overridden. 
> For example, if {{spark.sql.parquet.output.committer.class}} is set to 
> {{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
> set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
> {{_common_metadata}} will be written because {{FileOutputCommitter}} 
> overrides {{DirectParquetOutputCommitter}}.






[jira] [Updated] (SPARK-8121) "spark.sql.parquet.output.committer.class" is overridden by "spark.sql.sources.outputCommitterClass"

2015-06-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8121:
--
Description: 
When {{spark.sql.sources.outputCommitterClass}} is configured, 
{{spark.sql.parquet.output.committer.class}} will be overridden. 

For example, if {{spark.sql.parquet.output.committer.class}} is set to 
{{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
{{_common_metadata}} will be written because {{FileOutputCommitter}} overrides 
{{DirectParquetOutputCommitter}}.

The reason is that, in 

  was:
When "spark.sql.sources.outputCommitterClass" is configured, 
"spark.sql.parquet.output.committer.class" will be overriden. 

For example, if "spark.sql.parquet.output.committer.class" is set to 
FileOutputCommitter, while "spark.sql.sources.outputCommitterClass" is set to 
DirectParquetOutputCommitter, neither _metadata nor _common_metadata will be 
written because FileOutputCommitter overrides DirectParquetOutputCommitter.


> "spark.sql.parquet.output.committer.class" is overriden by 
> "spark.sql.sources.outputCommitterClass"
> ---
>
> Key: SPARK-8121
> URL: https://issues.apache.org/jira/browse/SPARK-8121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> When {{spark.sql.sources.outputCommitterClass}} is configured, 
> {{spark.sql.parquet.output.committer.class}} will be overridden. 
> For example, if {{spark.sql.parquet.output.committer.class}} is set to 
> {{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
> set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
> {{_common_metadata}} will be written because {{FileOutputCommitter}} 
> overrides {{DirectParquetOutputCommitter}}.
> The reason is that, in 






[jira] [Created] (SPARK-8123) Bucketizer must implement copy

2015-06-05 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8123:


 Summary: Bucketizer must implement copy
 Key: SPARK-8123
 URL: https://issues.apache.org/jira/browse/SPARK-8123
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng









[jira] [Updated] (SPARK-8121) When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is overridden by "spark.sql.sources.outputCommitterClass"

2015-06-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8121:
--
Summary: When using with Hadoop 1.x, 
"spark.sql.parquet.output.committer.class" is overriden by 
"spark.sql.sources.outputCommitterClass"  (was: 
"spark.sql.parquet.output.committer.class" is overriden by 
"spark.sql.sources.outputCommitterClass")

> When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is 
> overridden by "spark.sql.sources.outputCommitterClass"
> ---
>
> Key: SPARK-8121
> URL: https://issues.apache.org/jira/browse/SPARK-8121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> When {{spark.sql.sources.outputCommitterClass}} is configured, 
> {{spark.sql.parquet.output.committer.class}} will be overridden. 
> For example, if {{spark.sql.parquet.output.committer.class}} is set to 
> {{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
> set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
> {{_common_metadata}} will be written because {{FileOutputCommitter}} 
> overrides {{DirectParquetOutputCommitter}}.






[jira] [Commented] (SPARK-8114) Remove wildcard import on TestSQLContext._

2015-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574137#comment-14574137
 ] 

Apache Spark commented on SPARK-8114:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6667

> Remove wildcard import on TestSQLContext._
> --
>
> Key: SPARK-8114
> URL: https://issues.apache.org/jira/browse/SPARK-8114
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> We import TestSQLContext._ in almost all test suites. This import introduces 
> a lot of methods and should be avoided.
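
For illustration, the change amounts to replacing the wildcard with explicit per-suite imports (a hedged sketch; which members a given suite actually needs is an assumption):
{code}
// Before: pulls every member of TestSQLContext into scope.
// import org.apache.spark.sql.test.TestSQLContext._

// After (sketch): import only the handles a suite actually uses.
import org.apache.spark.sql.test.TestSQLContext.{sparkContext, sql}

val df = sql("SELECT 1 AS id")
df.show()
{code}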






[jira] [Updated] (SPARK-8121) When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is overridden by "spark.sql.sources.outputCommitterClass"

2015-06-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8121:
--
Description: 
When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and 
{{spark.sql.sources.outputCommitterClass}} is configured, 
{{spark.sql.parquet.output.committer.class}} will be overridden. 

For example, if {{spark.sql.parquet.output.committer.class}} is set to 
{{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
{{_common_metadata}} will be written because {{FileOutputCommitter}} overrides 
{{DirectParquetOutputCommitter}}.

The reason is that {{InsertIntoHadoopFsRelation}} initializes the 
{{TaskAttemptContext}} before calling 
{{ParquetRelation2.prepareForWriteJob()}}, which sets up the Parquet output 
committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}} 
constructor clones the job configuration, so it doesn't share the job 
configuration passed to {{ParquetRelation2.prepareForWriteJob()}}.

This issue can be fixed by simply [switching these two 
lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].
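
To make the ordering problem concrete, here is a minimal, Spark-independent sketch (hedged: the property name is reused from this ticket, and {{new Configuration(other)}} stands in for the clone taken by Hadoop 1.x's {{TaskAttemptContext}} constructor; this is not the actual {{InsertIntoHadoopFsRelation}} code):
{code}
import org.apache.hadoop.conf.Configuration

// Hedged, Spark-independent sketch of the ordering bug described above:
// a clone taken *before* the committer class is set never sees that setting.
val jobConf = new Configuration()

// Hadoop 1.x's TaskAttemptContext clones the configuration at construction
// time; new Configuration(other) stands in for that clone here.
val clonedConf = new Configuration(jobConf)

// prepareForWriteJob() sets the Parquet committer on the original afterwards.
jobConf.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

// The clone is unaffected, so the default committer wins for the task.
assert(clonedConf.get("spark.sql.parquet.output.committer.class") == null)
{code}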

  was:
When {{spark.sql.sources.outputCommitterClass}} is configured, 
{{spark.sql.parquet.output.committer.class}} will be overriden. 

For example, if {{spark.sql.parquet.output.committer.class}} is set to 
{{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
{{_common_metadata}} will be written because {{FileOutputCommitter}} overrides 
{{DirectParquetOutputCommitter}}.


> When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is 
> overridden by "spark.sql.sources.outputCommitterClass"
> ---
>
> Key: SPARK-8121
> URL: https://issues.apache.org/jira/browse/SPARK-8121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and 
> {{spark.sql.sources.outputCommitterClass}} is configured, 
> {{spark.sql.parquet.output.committer.class}} will be overridden. 
> For example, if {{spark.sql.parquet.output.committer.class}} is set to 
> {{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
> set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
> {{_common_metadata}} will be written because {{FileOutputCommitter}} 
> overrides {{DirectParquetOutputCommitter}}.
> The reason is that {{InsertIntoHadoopFsRelation}} initializes the 
> {{TaskAttemptContext}} before calling 
> {{ParquetRelation2.prepareForWriteJob()}}, which sets up the Parquet output 
> committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}} 
> constructor clones the job configuration, so it doesn't share the job 
> configuration passed to {{ParquetRelation2.prepareForWriteJob()}}.
> This issue can be fixed by simply [switching these two 
> lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].






[jira] [Commented] (SPARK-7596) Let AM's Reporter thread wake up from sleep if new executors are required

2015-06-05 Thread Zoltán Zvara (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574145#comment-14574145
 ] 

Zoltán Zvara commented on SPARK-7596:
-

This has been fixed in 
[SPARK-8059|https://issues.apache.org/jira/browse/SPARK-8059].

> Let AM's Reporter thread wake up from sleep if new executors are required
> 
>
> Key: SPARK-7596
> URL: https://issues.apache.org/jira/browse/SPARK-7596
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Zoltán Zvara
>
> Allow {{ApplicationMaster}}'s {{Reporter}} thread to be interrupted between 
> RM heartbeats when the scheduler requests new executors, so that new 
> allocations start immediately.
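
For illustration, a minimal sketch of the wake-up idea (the object and method names below are invented; this is not the actual {{ApplicationMaster}} code):
{code}
// Hedged sketch: replace an uninterruptible sleep between ResourceManager
// heartbeats with a monitor wait that the allocation path can cut short.
object ReporterWakeup {
  private val lock = new Object

  // Called by the reporter thread between heartbeats.
  def sleepBetweenHeartbeats(intervalMs: Long): Unit = lock.synchronized {
    lock.wait(intervalMs)   // returns early if notified
  }

  // Called when the scheduler requests additional executors.
  def onNewExecutorsRequested(): Unit = lock.synchronized {
    lock.notifyAll()        // wake the reporter so it allocates immediately
  }
}
{code}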






[jira] [Created] (SPARK-8124) Created more examples on SparkR DataFrames

2015-06-05 Thread Daniel Emaasit (JIRA)
Daniel Emaasit created SPARK-8124:
-

 Summary: Created more examples on SparkR DataFrames
 Key: SPARK-8124
 URL: https://issues.apache.org/jira/browse/SPARK-8124
 Project: Spark
  Issue Type: New Feature
Reporter: Daniel Emaasit
Priority: Minor









[jira] [Commented] (SPARK-8124) Created more examples on SparkR DataFrames

2015-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574162#comment-14574162
 ] 

Apache Spark commented on SPARK-8124:
-

User 'Emaasit' has created a pull request for this issue:
https://github.com/apache/spark/pull/6668

> Created more examples on SparkR DataFrames
> --
>
> Key: SPARK-8124
> URL: https://issues.apache.org/jira/browse/SPARK-8124
> Project: Spark
>  Issue Type: New Feature
>Reporter: Daniel Emaasit
>Priority: Minor
>







[jira] [Assigned] (SPARK-8124) Created more examples on SparkR DataFrames

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8124:
---

Assignee: Apache Spark

> Created more examples on SparkR DataFrames
> --
>
> Key: SPARK-8124
> URL: https://issues.apache.org/jira/browse/SPARK-8124
> Project: Spark
>  Issue Type: New Feature
>Reporter: Daniel Emaasit
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Assigned] (SPARK-8124) Created more examples on SparkR DataFrames

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8124:
---

Assignee: (was: Apache Spark)

> Created more examples on SparkR DataFrames
> --
>
> Key: SPARK-8124
> URL: https://issues.apache.org/jira/browse/SPARK-8124
> Project: Spark
>  Issue Type: New Feature
>Reporter: Daniel Emaasit
>Priority: Minor
>







[jira] [Updated] (SPARK-8121) When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is overridden by "spark.sql.sources.outputCommitterClass"

2015-06-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8121:
--
Description: 
When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and 
{{spark.sql.sources.outputCommitterClass}} is configured, 
{{spark.sql.parquet.output.committer.class}} will be overridden. 

For example, if {{spark.sql.parquet.output.committer.class}} is set to 
{{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
{{_common_metadata}} will be written because {{FileOutputCommitter}} overrides 
{{DirectParquetOutputCommitter}}.

The reason is that {{InsertIntoHadoopFsRelation}} initializes the 
{{TaskAttemptContext}} before calling 
{{ParquetRelation2.prepareForWriteJob()}}, which sets up the Parquet output 
committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}} 
constructor clones the job configuration, so it doesn't share the job 
configuration passed to {{ParquetRelation2.prepareForWriteJob()}}.

This issue can be fixed by simply [switching these two 
lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].

Here is a Spark shell snippet for reproducing this issue:
{code}
import sqlContext._

sc.hadoopConfiguration.set(
  "spark.sql.sources.outputCommitterClass",
  "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")

sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

range(0, 1).write.mode("overwrite").parquet("file:///tmp/foo")
{code}
Then check {{/tmp/foo}}, Parquet summary files are missing:
{noformat}
/tmp/foo
├── _SUCCESS
├── part-r-1.gz.parquet
├── part-r-2.gz.parquet
├── part-r-3.gz.parquet
├── part-r-4.gz.parquet
├── part-r-5.gz.parquet
├── part-r-6.gz.parquet
├── part-r-7.gz.parquet
└── part-r-8.gz.parquet
{noformat}

  was:
When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and 
{{spark.sql.sources.outputCommitterClass}} is configured, 
{{spark.sql.parquet.output.committer.class}} will be overriden. 

For example, if {{spark.sql.parquet.output.committer.class}} is set to 
{{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
{{_common_metadata}} will be written because {{FileOutputCommitter}} overrides 
{{DirectParquetOutputCommitter}}.

The reason is that, {{InsertIntoHadoopFsRelation}} initializes the 
{{TaskAttemptContext}} before calling 
{{ParquetRelation2.prepareForWriteJob()}}, which sets up Parquet output 
committer class. In the meanwhile, in Hadoop 1.x, {{TaskAttempContext}} 
constructor clones the job configuration, thus doesn't share the job 
configuration passed to {{ParquetRelation2.prepareForWriteJob()}}.

This issue can be fixed by simply [switching these two 
lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].


> When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is 
> overridden by "spark.sql.sources.outputCommitterClass"
> ---
>
> Key: SPARK-8121
> URL: https://issues.apache.org/jira/browse/SPARK-8121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and 
> {{spark.sql.sources.outputCommitterClass}} is configured, 
> {{spark.sql.parquet.output.committer.class}} will be overridden. 
> For example, if {{spark.sql.parquet.output.committer.class}} is set to 
> {{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
> set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
> {{_common_metadata}} will be written because {{FileOutputCommitter}} 
> overrides {{DirectParquetOutputCommitter}}.
> The reason is that {{InsertIntoHadoopFsRelation}} initializes the 
> {{TaskAttemptContext}} before calling 
> {{ParquetRelation2.prepareForWriteJob()}}, which sets up the Parquet output 
> committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}} 
> constructor clones the job configuration, so it doesn't share the job 
> configuration passed to {{ParquetRelation2.prepareForWriteJob()}}.
> This issue can be fixed by simply [switching these two 
> lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].
> Here is a Spark shell snippet for reproducing this issue:
> {code}
> import sqlContext._
> sc.hadoopConfiguration.set(
>  

[jira] [Assigned] (SPARK-8121) When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is overridden by "spark.sql.sources.outputCommitterClass"

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8121:
---

Assignee: Apache Spark  (was: Cheng Lian)

> When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is 
> overridden by "spark.sql.sources.outputCommitterClass"
> ---
>
> Key: SPARK-8121
> URL: https://issues.apache.org/jira/browse/SPARK-8121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and 
> {{spark.sql.sources.outputCommitterClass}} is configured, 
> {{spark.sql.parquet.output.committer.class}} will be overridden. 
> For example, if {{spark.sql.parquet.output.committer.class}} is set to 
> {{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
> set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
> {{_common_metadata}} will be written because {{FileOutputCommitter}} 
> overrides {{DirectParquetOutputCommitter}}.
> The reason is that {{InsertIntoHadoopFsRelation}} initializes the 
> {{TaskAttemptContext}} before calling 
> {{ParquetRelation2.prepareForWriteJob()}}, which sets up the Parquet output 
> committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}} 
> constructor clones the job configuration, so it doesn't share the job 
> configuration passed to {{ParquetRelation2.prepareForWriteJob()}}.
> This issue can be fixed by simply [switching these two 
> lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].
> Here is a Spark shell snippet for reproducing this issue:
> {code}
> import sqlContext._
> sc.hadoopConfiguration.set(
>   "spark.sql.sources.outputCommitterClass",
>   "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")
> sc.hadoopConfiguration.set(
>   "spark.sql.parquet.output.committer.class",
>   "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
> range(0, 1).write.mode("overwrite").parquet("file:///tmp/foo")
> {code}
> Then check {{/tmp/foo}}, Parquet summary files are missing:
> {noformat}
> /tmp/foo
> ├── _SUCCESS
> ├── part-r-1.gz.parquet
> ├── part-r-2.gz.parquet
> ├── part-r-3.gz.parquet
> ├── part-r-4.gz.parquet
> ├── part-r-5.gz.parquet
> ├── part-r-6.gz.parquet
> ├── part-r-7.gz.parquet
> └── part-r-8.gz.parquet
> {noformat}






[jira] [Assigned] (SPARK-8121) When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is overridden by "spark.sql.sources.outputCommitterClass"

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8121:
---

Assignee: Cheng Lian  (was: Apache Spark)

> When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is 
> overridden by "spark.sql.sources.outputCommitterClass"
> ---
>
> Key: SPARK-8121
> URL: https://issues.apache.org/jira/browse/SPARK-8121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and 
> {{spark.sql.sources.outputCommitterClass}} is configured, 
> {{spark.sql.parquet.output.committer.class}} will be overridden. 
> For example, if {{spark.sql.parquet.output.committer.class}} is set to 
> {{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
> set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
> {{_common_metadata}} will be written because {{FileOutputCommitter}} 
> overrides {{DirectParquetOutputCommitter}}.
> The reason is that {{InsertIntoHadoopFsRelation}} initializes the 
> {{TaskAttemptContext}} before calling 
> {{ParquetRelation2.prepareForWriteJob()}}, which sets up the Parquet output 
> committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}} 
> constructor clones the job configuration, so it doesn't share the job 
> configuration passed to {{ParquetRelation2.prepareForWriteJob()}}.
> This issue can be fixed by simply [switching these two 
> lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].
> Here is a Spark shell snippet for reproducing this issue:
> {code}
> import sqlContext._
> sc.hadoopConfiguration.set(
>   "spark.sql.sources.outputCommitterClass",
>   "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")
> sc.hadoopConfiguration.set(
>   "spark.sql.parquet.output.committer.class",
>   "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
> range(0, 1).write.mode("overwrite").parquet("file:///tmp/foo")
> {code}
> Then check {{/tmp/foo}}, Parquet summary files are missing:
> {noformat}
> /tmp/foo
> ├── _SUCCESS
> ├── part-r-1.gz.parquet
> ├── part-r-2.gz.parquet
> ├── part-r-3.gz.parquet
> ├── part-r-4.gz.parquet
> ├── part-r-5.gz.parquet
> ├── part-r-6.gz.parquet
> ├── part-r-7.gz.parquet
> └── part-r-8.gz.parquet
> {noformat}






[jira] [Commented] (SPARK-8121) When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is overridden by "spark.sql.sources.outputCommitterClass"

2015-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574198#comment-14574198
 ] 

Apache Spark commented on SPARK-8121:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/6669

> When using with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is 
> overridden by "spark.sql.sources.outputCommitterClass"
> ---
>
> Key: SPARK-8121
> URL: https://issues.apache.org/jira/browse/SPARK-8121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and 
> {{spark.sql.sources.outputCommitterClass}} is configured, 
> {{spark.sql.parquet.output.committer.class}} will be overridden. 
> For example, if {{spark.sql.parquet.output.committer.class}} is set to 
> {{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is 
> set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor 
> {{_common_metadata}} will be written because {{FileOutputCommitter}} 
> overrides {{DirectParquetOutputCommitter}}.
> The reason is that {{InsertIntoHadoopFsRelation}} initializes the 
> {{TaskAttemptContext}} before calling 
> {{ParquetRelation2.prepareForWriteJob()}}, which sets up the Parquet output 
> committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}} 
> constructor clones the job configuration, so it doesn't share the job 
> configuration passed to {{ParquetRelation2.prepareForWriteJob()}}.
> This issue can be fixed by simply [switching these two 
> lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].
> Here is a Spark shell snippet for reproducing this issue:
> {code}
> import sqlContext._
> sc.hadoopConfiguration.set(
>   "spark.sql.sources.outputCommitterClass",
>   "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")
> sc.hadoopConfiguration.set(
>   "spark.sql.parquet.output.committer.class",
>   "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
> range(0, 1).write.mode("overwrite").parquet("file:///tmp/foo")
> {code}
> Then check {{/tmp/foo}}, Parquet summary files are missing:
> {noformat}
> /tmp/foo
> ├── _SUCCESS
> ├── part-r-1.gz.parquet
> ├── part-r-2.gz.parquet
> ├── part-r-3.gz.parquet
> ├── part-r-4.gz.parquet
> ├── part-r-5.gz.parquet
> ├── part-r-6.gz.parquet
> ├── part-r-7.gz.parquet
> └── part-r-8.gz.parquet
> {noformat}






[jira] [Commented] (SPARK-7106) Support model save/load in Python's FPGrowth

2015-06-05 Thread Hrishikesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574225#comment-14574225
 ] 

Hrishikesh commented on SPARK-7106:
---

Do we have support for save/load in Scala?

> Support model save/load in Python's FPGrowth
> 
>
> Key: SPARK-7106
> URL: https://issues.apache.org/jira/browse/SPARK-7106
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>







[jira] [Assigned] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8118:
---

Assignee: Cheng Lian  (was: Apache Spark)

> Turn off noisy log output produced by Parquet 1.7.0
> ---
>
> Key: SPARK-8118
> URL: https://issues.apache.org/jira/browse/SPARK-8118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> Parquet 1.7.0 renames its package to "org.apache.parquet", so we need to adjust 
> {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output.
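
For illustration, the java.util.logging logger name follows the package, so the adjustment amounts to targeting the renamed logger (a hedged sketch; the old {{"parquet"}} logger name is an assumption based on the pre-1.7 package name):
{code}
import java.util.logging.Logger

// Parquet <= 1.6.x lived under the "parquet" package; 1.7.0 uses
// "org.apache.parquet", so log forwarding/silencing must attach to the new name.
val oldParquetLogger = Logger.getLogger("parquet")             // no longer effective
val newParquetLogger = Logger.getLogger("org.apache.parquet")  // what 1.7.0 emits under
{code}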






[jira] [Commented] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0

2015-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574231#comment-14574231
 ] 

Apache Spark commented on SPARK-8118:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/6670

> Turn off noisy log output produced by Parquet 1.7.0
> ---
>
> Key: SPARK-8118
> URL: https://issues.apache.org/jira/browse/SPARK-8118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> Parquet 1.7.0 renames its package to "org.apache.parquet", so we need to adjust 
> {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output.






[jira] [Assigned] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8118:
---

Assignee: Apache Spark  (was: Cheng Lian)

> Turn off noisy log output produced by Parquet 1.7.0
> ---
>
> Key: SPARK-8118
> URL: https://issues.apache.org/jira/browse/SPARK-8118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Minor
>
> Parquet 1.7.0 renames its package to "org.apache.parquet", so we need to adjust 
> {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output.






[jira] [Commented] (SPARK-4001) Add FP-growth algorithm to Spark MLlib

2015-06-05 Thread Guangwen Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574259#comment-14574259
 ] 

Guangwen Liu commented on SPARK-4001:
-

Hi, Xiangrui. Thanks for your great talk last time.
As for the FP-growth algorithm in MLlib, I just noticed there isn't any API for 
association rule construction. In real use cases, if no association rules are 
output, it is of limited use. Do you have any plan to add this feature?

Best,
Liu

> Add FP-growth algorithm to Spark MLlib
> --
>
> Key: SPARK-4001
> URL: https://issues.apache.org/jira/browse/SPARK-4001
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jacky Li
>Assignee: Jacky Li
> Fix For: 1.3.0
>
> Attachments: Distributed frequent item mining algorithm based on 
> Spark.pptx
>
>
> Apriori is the classic algorithm for frequent item set mining in a 
> transactional data set. It would be useful if the Apriori algorithm were added to 
> MLlib in Spark.






[jira] [Commented] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers

2015-06-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574287#comment-14574287
 ] 

Cheng Lian commented on SPARK-8122:
---

Just saw this JIRA ticket after opening [PR 
#6670|https://github.com/apache/spark/pull/6670].

I'm not super familiar with java.util.logging facilities, but [as commented in 
{{enableLogForwarding()}}|https://github.com/liancheng/spark/blob/spark-8118/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala],
 and according to [code of Parquet {{Log}} 
class|https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-common/src/main/java/org/apache/parquet/Log.java#L58],
 a logger is already created in the static initialization block for package 
{{org.apache.parquet}}. In {{enableLogForwarding()}}, we are just retrieving a 
reference to an existing logger instead of creating a new logger instance. The 
same thing should also apply to the logger retrieved for 
{{ParquetOutputCommitter}} below.

> ParquetRelation.enableLogForwarding() may fail to configure loggers
> ---
>
> Key: SPARK-8122
> URL: https://issues.apache.org/jira/browse/SPARK-8122
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Konstantin Shaposhnikov
>Priority: Minor
>
> _enableLogForwarding()_ doesn't hold on to the created loggers, which can be 
> garbage collected, so all configuration changes will be gone. From 
> https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html 
> javadocs:  _It is important to note that the Logger returned by one of the 
> getLogger factory methods may be garbage collected at any time if a strong 
> reference to the Logger is not kept._
> All created logger references need to be kept, e.g. in static variables.






[jira] [Comment Edited] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers

2015-06-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574287#comment-14574287
 ] 

Cheng Lian edited comment on SPARK-8122 at 6/5/15 10:39 AM:


Just saw this JIRA ticket after opening [PR 
#6670|https://github.com/apache/spark/pull/6670].

I'm not super familiar with java.util.logging facilities, but [as commented in 
{{enableLogForwarding()}}|https://github.com/apache/spark/blob/da20c8ca37663738112b04657057858ee3e55072/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala#L97-L109],
 and according to [code of Parquet {{Log}} 
class|https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-common/src/main/java/org/apache/parquet/Log.java#L58],
 a logger is already created in the static initialization block for package 
{{org.apache.parquet}}. In {{enableLogForwarding()}}, we are just retrieving a 
reference to an existing logger instead of creating a new logger instance. The 
same thing should also apply to the logger retrieved for 
{{ParquetOutputCommitter}} below.


was (Author: lian cheng):
Just saw this JIRA ticket after opening [PR 
#6670|https://github.com/apache/spark/pull/6670].

I'm not super familiar with java.util.logging facilities, but [as commented in 
{{enableLogForwarding()}}|https://github.com/liancheng/spark/blob/spark-8118/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala],
 and according to [code of Parquet {{Log}} 
class|https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-common/src/main/java/org/apache/parquet/Log.java#L58],
 a logger is already created in the static initialization block for package 
{{org.apache.parquet}}. In {{enableLogForwarding()}}, we are just retrieving a 
reference to an existing logger instead of creating a new logger instance. The 
same thing should also apply to the logger retrieved for 
{{ParquetOutputCommitter}} below.

> ParquetRelation.enableLogForwarding() may fail to configure loggers
> ---
>
> Key: SPARK-8122
> URL: https://issues.apache.org/jira/browse/SPARK-8122
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Konstantin Shaposhnikov
>Priority: Minor
>
> _enableLogForwarding()_ doesn't hold on to the created loggers, which can be 
> garbage collected, so all configuration changes will be gone. From 
> https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html 
> javadocs:  _It is important to note that the Logger returned by one of the 
> getLogger factory methods may be garbage collected at any time if a strong 
> reference to the Logger is not kept._
> All created logger references need to be kept, e.g. in static variables.






[jira] [Commented] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers

2015-06-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574291#comment-14574291
 ] 

Cheng Lian commented on SPARK-8122:
---

Hm, just noticed that the logger created in the static initialization block of 
{{org.apache.parquet.Log}} is also a local variable. I'm not quite sure about the 
lifetime of a local variable that appears in a static initialization block, but 
assuming it behaves like any other local variable, it goes out of scope at the 
end of the static block. It seems this makes both 
{{ParquetRelation.enableLogForwarding()}} and Parquet itself suffer from the GC 
issue?
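
For reference, a minimal sketch of the direction suggested in the description, i.e. keeping a strong, static-like reference to the configured logger (the object and method names are invented; this is not the actual Spark code):
{code}
import java.util.logging.{Level, Logger}

// Hedged sketch: hold the JUL logger in a field of a singleton object so the
// configured instance cannot be garbage collected and lose its settings.
object ParquetLogRedirection {
  private val parquetLogger: Logger = Logger.getLogger("org.apache.parquet")

  def enableLogForwarding(): Unit = {
    // These settings survive because `parquetLogger` keeps the logger alive.
    parquetLogger.setLevel(Level.WARNING)
    parquetLogger.setUseParentHandlers(true)
  }
}
{code}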

> ParquetRelation.enableLogForwarding() may fail to configure loggers
> ---
>
> Key: SPARK-8122
> URL: https://issues.apache.org/jira/browse/SPARK-8122
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Konstantin Shaposhnikov
>Priority: Minor
>
> _enableLogForwarding()_ doesn't hold on to the created loggers, which can be 
> garbage collected, so all configuration changes will be gone. From 
> https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html 
> javadocs:  _It is important to note that the Logger returned by one of the 
> getLogger factory methods may be garbage collected at any time if a strong 
> reference to the Logger is not kept._
> All created logger references need to be kept, e.g. in static variables.






[jira] [Resolved] (SPARK-7596) Let AM's Reporter thread wake up from sleep if new executors are required

2015-06-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7596.
--
Resolution: Duplicate

> Let AM's Reporter thread wake up from sleep if new executors are required
> 
>
> Key: SPARK-7596
> URL: https://issues.apache.org/jira/browse/SPARK-7596
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Zoltán Zvara
>
> Allow {{ApplicationMaster}}'s {{Reporter}} thread to be interrupted between 
> RM heartbeats when the scheduler requests new executors, so that new 
> allocations start immediately.






[jira] [Created] (SPARK-8125) Accelerate ParquetRelation2 metadata discovery

2015-06-05 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-8125:
-

 Summary: Accelerate ParquetRelation2 metadata discovery
 Key: SPARK-8125
 URL: https://issues.apache.org/jira/browse/SPARK-8125
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker


For large Parquet tables (e.g., with thousands of partitions), it can be very 
slow to discover Parquet metadata for schema merging and generating splits for 
Spark jobs. We need to accelerate this process. One possible solution is to 
do the discovery via a distributed Spark job.
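
For illustration, a minimal sketch of the distributed-discovery idea ({{readFooterSchema}} and {{mergeSchemas}} are assumed stand-ins for Parquet footer reading and schema merging; this is not the actual implementation):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.types.StructType

// Hedged sketch: read Parquet footers with a Spark job instead of looping on
// the driver, then merge the per-file schemas locally.
def discoverMergedSchema(
    sc: SparkContext,
    footerPaths: Seq[String],
    readFooterSchema: String => StructType,               // assumed footer reader
    mergeSchemas: (StructType, StructType) => StructType  // assumed schema merger
  ): Option[StructType] = {
  if (footerPaths.isEmpty) {
    None
  } else {
    val perFileSchemas = sc
      .parallelize(footerPaths, math.min(footerPaths.size, 1000))
      .map(readFooterSchema)       // footer reads run on executors in parallel
      .collect()
    Some(perFileSchemas.reduce(mergeSchemas))
  }
}
{code}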






[jira] [Commented] (SPARK-5493) Support proxy users under kerberos

2015-06-05 Thread Kaveen Raajan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574359#comment-14574359
 ] 

Kaveen Raajan commented on SPARK-5493:
--

I'm using *Spark 1.3.1* on a *Windows machine* whose username contains a space (current 
username: "kaveen raajan"). I tried to run the following command:

{code} spark-shell --master yarn-client --proxy-user SYSTEM {code}

I am able to run it successfully for a username without a space (the application also runs as the SYSTEM 
user), but when I try to run it as the user with a space (kaveen raajan) it throws the 
following error.

{code}
15/06/05 16:52:48 INFO spark.SecurityManager: Changing view acls to: SYSTEM
15/06/05 16:52:48 INFO spark.SecurityManager: Changing modify acls to: SYSTEM
15/06/05 16:52:48 INFO spark.SecurityManager: SecurityManager: authentication di
sabled; ui acls disabled; users with view permissions: Set(SYSTEM); users with m
odify permissions: Set(SYSTEM)
15/06/05 16:52:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/06/05 16:52:49 INFO Remoting: Starting remoting
15/06/05 16:52:49 INFO Remoting: Remoting started; listening on addresses :[akka
.tcp://sparkDriver@synclapn3408.CONTOSO:52137]
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'sparkDriver' on
 port 52137.
15/06/05 16:52:49 INFO spark.SparkEnv: Registering MapOutputTracker
15/06/05 16:52:49 INFO spark.SparkEnv: Registering BlockManagerMaster
15/06/05 16:52:49 INFO storage.DiskBlockManager: Created local directory at C:\U
sers\KAVEEN~1\AppData\Local\Temp\spark-d5b43891-274c-457d-aa3a-d79a536fd536\bloc
kmgr-e980101b-4f93-455a-8a05-9185dcab9f8e
15/06/05 16:52:49 INFO storage.MemoryStore: MemoryStore started with capacity 26
5.4 MB
15/06/05 16:52:49 INFO spark.HttpFileServer: HTTP File server directory is C:\Us
ers\KAVEEN~1\AppData\Local\Temp\spark-a35e3f17-641c-4ae3-90f2-51eac901b799\httpd
-ecea93ad-c285-4c62-9222-01a9d6ff24e4
15/06/05 16:52:49 INFO spark.HttpServer: Starting HTTP Server
15/06/05 16:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/06/05 16:52:49 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0
:52138
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'HTTP file serve
r' on port 52138.
15/06/05 16:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/06/05 16:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/06/05 16:52:49 INFO server.AbstractConnector: Started SelectChannelConnector@
0.0.0.0:4040
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'SparkUI' on por
t 4040.
15/06/05 16:52:49 INFO ui.SparkUI: Started SparkUI at http://synclapn3408.CONTOS
O:4040
15/06/05 16:52:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0
:8032

java.lang.NullPointerException
at org.apache.spark.sql.SQLContext.(SQLContext.scala:145)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:49)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruct
orAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingC
onstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:10
27)
at $iwC$$iwC.(:9)
at $iwC.(:18)
at (:20)
at .(:24)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
sorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:
1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:
1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840
)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:8
56)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.sca
la:901)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply
(SparkILoopInit.scala:130)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply
(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoop
Init.scala:122)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)

at 

[jira] [Comment Edited] (SPARK-5493) Support proxy users under kerberos

2015-06-05 Thread Kaveen Raajan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574359#comment-14574359
 ] 

Kaveen Raajan edited comment on SPARK-5493 at 6/5/15 11:43 AM:
---

I'm using *Spark 1.3.1* on a *Windows machine* whose username contains a space (current 
username: "kaveen raajan"). I tried to run the following command:

{code} spark-shell --master yarn-client --proxy-user SYSTEM {code}

I am able to run it successfully for a username without a space (the application also runs as the SYSTEM 
user), but when I try to run it as the user with a space (kaveen raajan) it throws the 
following error.

{code}
15/06/05 16:52:48 INFO spark.SecurityManager: Changing view acls to: SYSTEM
15/06/05 16:52:48 INFO spark.SecurityManager: Changing modify acls to: SYSTEM
15/06/05 16:52:48 INFO spark.SecurityManager: SecurityManager: authentication di
sabled; ui acls disabled; users with view permissions: Set(SYSTEM); users with m
odify permissions: Set(SYSTEM)
15/06/05 16:52:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/06/05 16:52:49 INFO Remoting: Starting remoting
15/06/05 16:52:49 INFO Remoting: Remoting started; listening on addresses :[akka
.tcp://sparkDriver@Master:52137]
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'sparkDriver' on
 port 52137.
15/06/05 16:52:49 INFO spark.SparkEnv: Registering MapOutputTracker
15/06/05 16:52:49 INFO spark.SparkEnv: Registering BlockManagerMaster
15/06/05 16:52:49 INFO storage.DiskBlockManager: Created local directory at C:\U
sers\KAVEEN~1\AppData\Local\Temp\spark-d5b43891-274c-457d-aa3a-d79a536fd536\bloc
kmgr-e980101b-4f93-455a-8a05-9185dcab9f8e
15/06/05 16:52:49 INFO storage.MemoryStore: MemoryStore started with capacity 26
5.4 MB
15/06/05 16:52:49 INFO spark.HttpFileServer: HTTP File server directory is C:\Us
ers\KAVEEN~1\AppData\Local\Temp\spark-a35e3f17-641c-4ae3-90f2-51eac901b799\httpd
-ecea93ad-c285-4c62-9222-01a9d6ff24e4
15/06/05 16:52:49 INFO spark.HttpServer: Starting HTTP Server
15/06/05 16:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/06/05 16:52:49 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0
:52138
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'HTTP file serve
r' on port 52138.
15/06/05 16:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/06/05 16:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/06/05 16:52:49 INFO server.AbstractConnector: Started SelectChannelConnector@
0.0.0.0:4040
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'SparkUI' on por
t 4040.
15/06/05 16:52:49 INFO ui.SparkUI: Started SparkUI at http://Master:4040
15/06/05 16:52:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0
:8032

java.lang.NullPointerException
at org.apache.spark.sql.SQLContext.(SQLContext.scala:145)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:49)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruct
orAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingC
onstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:10
27)
at $iwC$$iwC.(:9)
at $iwC.(:18)
at (:20)
at .(:24)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
sorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:
1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:
1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840
)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:8
56)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.sca
la:901)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply
(SparkILoopInit.scala:130)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply
(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoop
Init.scala:122)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.sc

[jira] [Commented] (SPARK-8016) YARN cluster / client modes have different app names for python

2015-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574396#comment-14574396
 ] 

Apache Spark commented on SPARK-8016:
-

User 'ehnalis' has created a pull request for this issue:
https://github.com/apache/spark/pull/6671

> YARN cluster / client modes have different app names for python
> ---
>
> Key: SPARK-8016
> URL: https://issues.apache.org/jira/browse/SPARK-8016
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 1.4.0
>Reporter: Andrew Or
> Attachments: python.png
>
>
> See screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8016) YARN cluster / client modes have different app names for python

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8016:
---

Assignee: (was: Apache Spark)

> YARN cluster / client modes have different app names for python
> ---
>
> Key: SPARK-8016
> URL: https://issues.apache.org/jira/browse/SPARK-8016
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 1.4.0
>Reporter: Andrew Or
> Attachments: python.png
>
>
> See screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8016) YARN cluster / client modes have different app names for python

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8016:
---

Assignee: Apache Spark

> YARN cluster / client modes have different app names for python
> ---
>
> Key: SPARK-8016
> URL: https://issues.apache.org/jira/browse/SPARK-8016
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Apache Spark
> Attachments: python.png
>
>
> See screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5493) Support proxy users under kerberos

2015-06-05 Thread Kaveen Raajan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574359#comment-14574359
 ] 

Kaveen Raajan edited comment on SPARK-5493 at 6/5/15 12:12 PM:
---

I'm using *Spark 1.3.1* on a *Windows machine* whose username contains a space 
(current username: "kaveen raajan"). I tried to run the following command:

{code} spark-shell --master yarn-client --proxy-user SYSTEM {code}

It runs successfully for a user whose name has no space (and the application 
does run as the SYSTEM user), but when I run it as the spaced user ("kaveen 
raajan") it throws the following error:

{code}
15/06/05 16:52:48 INFO spark.SecurityManager: Changing view acls to: SYSTEM
15/06/05 16:52:48 INFO spark.SecurityManager: Changing modify acls to: SYSTEM
15/06/05 16:52:48 INFO spark.SecurityManager: SecurityManager: authentication di
sabled; ui acls disabled; users with view permissions: Set(SYSTEM); users with m
odify permissions: Set(SYSTEM)
15/06/05 16:52:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/06/05 16:52:49 INFO Remoting: Starting remoting
15/06/05 16:52:49 INFO Remoting: Remoting started; listening on addresses :[akka
.tcp://sparkDriver@Master:52137]
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'sparkDriver' on
 port 52137.
15/06/05 16:52:49 INFO spark.SparkEnv: Registering MapOutputTracker
15/06/05 16:52:49 INFO spark.SparkEnv: Registering BlockManagerMaster
15/06/05 16:52:49 INFO storage.DiskBlockManager: Created local directory at C:\U
sers\KAVEEN~1\AppData\Local\Temp\spark-d5b43891-274c-457d-aa3a-d79a536fd536\bloc
kmgr-e980101b-4f93-455a-8a05-9185dcab9f8e
15/06/05 16:52:49 INFO storage.MemoryStore: MemoryStore started with capacity 26
5.4 MB
15/06/05 16:52:49 INFO spark.HttpFileServer: HTTP File server directory is C:\Us
ers\KAVEEN~1\AppData\Local\Temp\spark-a35e3f17-641c-4ae3-90f2-51eac901b799\httpd
-ecea93ad-c285-4c62-9222-01a9d6ff24e4
15/06/05 16:52:49 INFO spark.HttpServer: Starting HTTP Server
15/06/05 16:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/06/05 16:52:49 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0
:52138
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'HTTP file serve
r' on port 52138.
15/06/05 16:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/06/05 16:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/06/05 16:52:49 INFO server.AbstractConnector: Started SelectChannelConnector@
0.0.0.0:4040
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'SparkUI' on por
t 4040.
15/06/05 16:52:49 INFO ui.SparkUI: Started SparkUI at http://Master:4040
15/06/05 16:52:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0
:8032

java.lang.NullPointerException
at org.apache.spark.sql.SQLContext.(SQLContext.scala:145)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:49)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruct
orAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingC
onstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:10
27)
at $iwC$$iwC.(:9)
at $iwC.(:18)
at (:20)
at .(:24)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
sorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:
1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:
1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840
)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:8
56)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.sca
la:901)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply
(SparkILoopInit.scala:130)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply
(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoop
Init.scala:122)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.sca

[jira] [Commented] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers

2015-06-05 Thread Konstantin Shaposhnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574419#comment-14574419
 ] 

Konstantin Shaposhnikov commented on SPARK-8122:


Parquet itself suffers from this issue too, but it is almost impossible to hit 
there because the static block in Log is most likely called very shortly before 
some Logger instance is strongly referenced from a static LOG field (Log -> 
Logger -> parent Logger). It is very unlikely that a GC happens between these 
two events.

But when there is a bigger gap between the moment a Logger is configured in 
`enableLogForwarding()` and the moment it is actually used to log something, 
the chance of seeing this is much higher.

In one of my applications I used similar code to redirect Parquet logging to 
slf4j, and I once saw the redirect fail to take effect because of GC.

To be honest, I wish Parquet just used slf4j and didn't mess with the logging 
setup ;)
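
For illustration, a minimal sketch of the "keep a strong reference" fix the 
issue asks for (the object and method names here are made up, not Spark code):

{code}
import java.util.logging.{Level, Logger}

object ParquetLogRedirect {
  // Holding the Logger in a field keeps it strongly referenced, so the
  // configuration applied below cannot be silently dropped by a GC.
  private val parquetLogger: Logger = Logger.getLogger("parquet")

  def configure(): Unit = {
    parquetLogger.setLevel(Level.INFO)
    parquetLogger.setUseParentHandlers(false)
    // a handler that forwards to slf4j (or anywhere else) would be attached here
  }
}
{code}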

> ParquetRelation.enableLogForwarding() may fail to configure loggers
> ---
>
> Key: SPARK-8122
> URL: https://issues.apache.org/jira/browse/SPARK-8122
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Konstantin Shaposhnikov
>Priority: Minor
>
> _enableLogForwarding()_ doesn't hold to the created loggers that can be 
> garbage collected and all configuration changes will be gone. From 
> https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html 
> javadocs:  _It is important to note that the Logger returned by one of the 
> getLogger factory methods may be garbage collected at any time if a strong 
> reference to the Logger is not kept._
> All created logger references need to be kept, e.g. in static variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6324) Clean up usage code in command-line scripts

2015-06-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6324.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 5841
[https://github.com/apache/spark/pull/5841]

> Clean up usage code in command-line scripts
> ---
>
> Key: SPARK-6324
> URL: https://issues.apache.org/jira/browse/SPARK-6324
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.5.0
>
>
> With SPARK-4924, most of the logic to launch Spark classes is in a new Java 
> library. Pretty much the only thing left in scripts are the usage strings for 
> each command; that uses some rather ugly and hacky code to handle, since it 
> requires the library communicating back with the scripts that they should 
> print a usage string instead of executing a command.
> The scripts have to process that special command (differently on bash and 
> Windows), and do filtering of the actual output of usage strings to account 
> for different commands.
> Instead, the library itself should handle all this by executing the classes 
> with a "help" argument; and the classes should be able to handle that 
> argument to do the right thing. So this would require both changes in the 
> launcher library, and in all the main entry points to make sure they properly 
> respond to the "help" by printing the correct help message.
> This would make things a lot cleaner and a lot easier to maintain.
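
As a rough illustration only (not the actual launcher code), an entry point 
that handles a help argument itself could look like this:

{code}
object ExampleEntryPoint {
  private val usage = "Usage: example-command [options] ..."

  def main(args: Array[String]): Unit = {
    // The launcher library can simply pass "--help" instead of asking the
    // shell scripts to print usage text on its behalf.
    if (args.contains("--help") || args.contains("-h")) {
      System.err.println(usage)
      System.exit(0)
    }
    // normal startup would go here
  }
}
{code}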



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6324) Clean up usage code in command-line scripts

2015-06-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6324:
-
Assignee: Marcelo Vanzin

> Clean up usage code in command-line scripts
> ---
>
> Key: SPARK-6324
> URL: https://issues.apache.org/jira/browse/SPARK-6324
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.5.0
>
>
> With SPARK-4924, most of the logic to launch Spark classes is in a new Java 
> library. Pretty much the only thing left in scripts are the usage strings for 
> each command; that uses some rather ugly and hacky code to handle, since it 
> requires the library communicating back with the scripts that they should 
> print a usage string instead of executing a command.
> The scripts have to process that special command (differently on bash and 
> Windows), and do filtering of the actual output of usage strings to account 
> for different commands.
> Instead, the library itself should handle all this by executing the classes 
> with a "help" argument; and the classes should be able to handle that 
> argument to do the right thing. So this would require both changes in the 
> launcher library, and in all the main entry points to make sure they properly 
> respond to the "help" by printing the correct help message.
> This would make things a lot cleaner and a lot easier to maintain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1018) take and collect don't work on HadoopRDD

2015-06-05 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574472#comment-14574472
 ] 

Igor Berman commented on SPARK-1018:


Hi Patrick,
We spent some time understanding why data was "corrupted" when working with 
Avro objects that were not copied at our layer. Yes, I've seen the note above 
newAPIHadoopFile, but the note doesn't describe the consequences of not copying 
the objects, at least not for people who come to Spark without deep knowledge 
of Hadoop formats (and I think there are plenty of those).

Do you think that even a read-avro - transform - write-avro chain can be 
corrupted by not copying the Avro objects at the beginning? For example, we've 
seen several objects end up with the same data when they shouldn't; this was 
solved by deep-copying the whole Avro object.

If you permit me to suggest, it would be nice to have a section about working 
with hadoopRDD or newAPIHadoopRDD that advises on best practices and the "do's" 
and "don'ts" of working with Hadoop files.


> take and collect don't work on HadoopRDD
> 
>
> Key: SPARK-1018
> URL: https://issues.apache.org/jira/browse/SPARK-1018
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1
>Reporter: Diana Carroll
>  Labels: hadoop
>
> I am reading a simple text file using hadoopFile as follows:
> var hrdd1 = 
> sc.hadoopFile("/home/training/testdata.txt",classOf[TextInputFormat], 
> classOf[LongWritable], classOf[Text])
> Testing using this simple text file:
> 001 this is line 1
> 002 this is line two
> 003 yet another line
> the data read is correct, as I can tell using println 
> scala> hrdd1.foreach(println):
> (0,001 this is line 1)
> (19,002 this is line two)
> (40,003 yet another line)
> But neither collect nor take work properly.  Take prints out the key (byte 
> offset) of the last (non-existent) line repeatedly:
> scala> hrdd1.take(4):
> res146: Array[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)] 
> = Array((61,), (61,), (61,))
> Collect is even worse: it complains:
> java.io.NotSerializableException: org.apache.hadoop.io.LongWritable at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
> The problem appears to be the LongWritable in both cases, because if I map to 
> a new RDD, converting the values from Text objects to strings, it works:
> scala> hrdd1.map(pair => (pair._1.toString,pair._2.toString)).take(4)
> res148: Array[(java.lang.String, java.lang.String)] = Array((0,001 this is 
> line 1), (19,002 this is line two), (40,003 yet another line))
> Seems to me either rdd.collect and rdd.take ought to handle non-serializable 
> types gracefully, or hadoopFile should return a mapped RDD that converts the 
> hadoop types into the appropriate serializable Java objects.  (Or at very 
> least the docs for the API should indicate that the usual RDD methods don't 
> work on HadoopRDDs).
> BTW, this behavior is the same for both the old and new API versions of 
> hadoopFile.  It also is the same whether the file is from HDFS or a plain old 
> text file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6107) event log file ends with .inprogress should be able to display on webUI for standalone mode

2015-06-05 Thread Octavian Ganea (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574510#comment-14574510
 ] 

Octavian Ganea commented on SPARK-6107:
---

I still see this in 1.3.1. I have spark.eventLog.enabled set to true and 
spark.eventLog.dir set.
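
For reference, a minimal sketch of that configuration (the log directory path 
here is only an example):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("event-log-example")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///user/spark/eventLogs") // example path
val sc = new SparkContext(conf)
{code}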

> event log file ends with .inprogress should be able to display on webUI for 
> standalone mode
> ---
>
> Key: SPARK-6107
> URL: https://issues.apache.org/jira/browse/SPARK-6107
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.1
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
> Fix For: 1.4.0
>
>
> When an application finishes running abnormally (Ctrl + C, for example), the 
> history event log file still ends with the *.inprogress* suffix, and the 
> application state cannot be shown on the web UI; the user just sees "*Application 
> history not found, Application xxx is still in progress*".  
> The user should also be able to see the status of abnormally finished applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed

2015-06-05 Thread Octavian Ganea (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574514#comment-14574514
 ] 

Octavian Ganea commented on SPARK-6950:
---

Happens to me on 1.3.1

> Spark master UI believes some applications are in progress when they are 
> actually completed
> ---
>
> Key: SPARK-6950
> URL: https://issues.apache.org/jira/browse/SPARK-6950
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
> Fix For: 1.3.1
>
>
> In Spark 1.2.x, I was able to set my spark event log directory to be a 
> different location from the default, and after the job finishes, I can replay 
> the UI by clicking on the appropriate link under "Completed Applications".
> Now, on a non-deterministic basis (but seems to happen most of the time), 
> when I click on the link under "Completed Applications", I instead get a 
> webpage that says:
> Application history not found (app-20150415052927-0014)
> Application myApp is still in progress.
> I am able to view the application's UI using the Spark history server, so 
> something regressed in the Spark master code between 1.2 and 1.3, but that 
> regression does not apply in the history server use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4072) Storage UI does not reflect memory usage by streaming blocks

2015-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574691#comment-14574691
 ] 

Apache Spark commented on SPARK-4072:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/6672

> Storage UI does not reflect memory usage by streaming blocks
> 
>
> Key: SPARK-4072
> URL: https://issues.apache.org/jira/browse/SPARK-4072
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Critical
>
> The storage page in the web ui does not show the memory usage of non-RDD, 
> non-Broadcast blocks. In other words, the memory used by data received 
> through Spark Streaming is not shown on the web ui. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-06-05 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574750#comment-14574750
 ] 

Yu Ishikawa commented on SPARK-5992:


[~debasish83] Thank you for your comment. I haven't compared Algebird's LSH 
with Spark's, and I don't know whether adding Algebird as an MLlib dependency 
is OK. However, as you're suggesting, I think we should use it in Spark 
Streaming as well.

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8056) Design an easier way to construct schema for both Scala and Python

2015-06-05 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574847#comment-14574847
 ] 

Ilya Ganelin commented on SPARK-8056:
-

[~rxin] Sounds good :). Where would you suggest adding a test for StructType 
creation? Not sure where it quite fits in the grand scheme of things. 

> Design an easier way to construct schema for both Scala and Python
> --
>
> Key: SPARK-8056
> URL: https://issues.apache.org/jira/browse/SPARK-8056
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> StructType is fairly hard to construct, especially in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8056) Design an easier way to construct schema for both Scala and Python

2015-06-05 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574847#comment-14574847
 ] 

Ilya Ganelin edited comment on SPARK-8056 at 6/5/15 5:17 PM:
-

[~rxin] Sounds good :). Where would you suggest adding a test for StructType 
creation? Not sure where it quite fits in the grand scheme of things. 

With regards to also supporting a string for simple types, I think it's safer 
to enforce usage of DataType since the SQL schema should be strictly typed. 
Were you suggesting that we allow passing "int" or "long" as the type argument 
or for us to infer it automatically by parsing the string? That approach seems 
a little more dangerous.


was (Author: ilganeli):
[~rxin] Sounds good :). Where would you suggest adding a test for StructType 
creation? Not sure where it quite fits in the grand scheme of things. 

> Design an easier way to construct schema for both Scala and Python
> --
>
> Key: SPARK-8056
> URL: https://issues.apache.org/jira/browse/SPARK-8056
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> StructType is fairly hard to construct, especially in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8056) Design an easier way to construct schema for both Scala and Python

2015-06-05 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574847#comment-14574847
 ] 

Ilya Ganelin edited comment on SPARK-8056 at 6/5/15 5:18 PM:
-

[~rxin] Sounds good :). Where would you suggest adding a test for StructType 
creation? Not sure where it quite fits in the grand scheme of things. 

With regards to also supporting a string for simple types, I think it's safer 
to enforce usage of DataType since I think the intent is for the SQL schema to 
be strictly typed. Were you suggesting that we allow passing "int" or "long" as 
the type argument or for us to infer it automatically by parsing the string? 
That approach seems a little more dangerous.


was (Author: ilganeli):
[~rxin] Sounds good :). Where would you suggest adding a test for StructType 
creation? Not sure where it quite fits in the grand scheme of things. 

With regards to also supporting a string for simple types, I think it's safer 
to enforce usage of DataType since the SQL schema should be strictly typed. 
Were you suggesting that we allow passing "int" or "long" as the type argument 
or for us to infer it automatically by parsing the string? That approach seems 
a little more dangerous.

> Design an easier way to construct schema for both Scala and Python
> --
>
> Key: SPARK-8056
> URL: https://issues.apache.org/jira/browse/SPARK-8056
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> StructType is fairly hard to construct, especially in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8085) Pass in user-specified schema in read.df

2015-06-05 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-8085.
--
   Resolution: Fixed
Fix Version/s: 1.4.1
   1.5.0

Issue resolved by pull request 6620
[https://github.com/apache/spark/pull/6620]

> Pass in user-specified schema in read.df
> 
>
> Key: SPARK-8085
> URL: https://issues.apache.org/jira/browse/SPARK-8085
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Shivaram Venkataraman
> Fix For: 1.5.0, 1.4.1
>
>
> This will help cases where we use the CSV reader and want each column to be 
> of a specific type



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-06-05 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574933#comment-14574933
 ] 

Yu Ishikawa commented on SPARK-5992:


h2. Initial version of design doc

https://github.com/yu-iskw/SPARK-5992-LSH-design-doc/blob/master/design-doc.md

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8099) In yarn-cluster mode, "--executor-cores" can't be setted into SparkConf

2015-06-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8099:
-
Affects Version/s: 1.0.0

> In yarn-cluster mode, "--executor-cores" can't be setted into SparkConf
> ---
>
> Key: SPARK-8099
> URL: https://issues.apache.org/jira/browse/SPARK-8099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: meiyoula
>
> While testing dynamic executor allocation function, I set the executor cores 
> with *--executor-cores 4* in spark-submit command. But in 
> *ExecutorAllocationManager*, the *private val tasksPerExecutor 
> =conf.getInt("spark.executor.cores", 1) / conf.getInt("spark.task.cpus", 1)* 
> is still to be 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8056) Design an easier way to construct schema for both Scala and Python

2015-06-05 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574940#comment-14574940
 ] 

Reynold Xin commented on SPARK-8056:


Add it here: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala

We already take strings in DataFrame's cast. I think we can just accept the 
same style of strings (int, long...)
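
To make the comparison concrete: {{df("age").cast("int")}} already accepts a 
type name as a string. A hypothetical helper in the same spirit (not an 
existing Spark API) might look like:

{code}
import org.apache.spark.sql.types._

// Hypothetical: build a StructType from (name, type-string) pairs, reusing the
// same short type names that Column.cast already accepts.
def schemaOf(fields: (String, String)*): StructType = StructType(fields.map {
  case (name, "int")    => StructField(name, IntegerType)
  case (name, "long")   => StructField(name, LongType)
  case (name, "string") => StructField(name, StringType)
  case (name, other)    => sys.error(s"unsupported type string: $other")
})

// schemaOf("id" -> "long", "name" -> "string")
{code}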

> Design an easier way to construct schema for both Scala and Python
> --
>
> Key: SPARK-8056
> URL: https://issues.apache.org/jira/browse/SPARK-8056
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> StructType is fairly hard to construct, especially in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8077) Optimisation of TreeNode for large number of children

2015-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574941#comment-14574941
 ] 

Apache Spark commented on SPARK-8077:
-

User 'MickDavies' has created a pull request for this issue:
https://github.com/apache/spark/pull/6673

> Optimisation of TreeNode for large number of children
> -
>
> Key: SPARK-8077
> URL: https://issues.apache.org/jira/browse/SPARK-8077
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Mick Davies
>Priority: Minor
>
> Large IN clauses are parsed very slowly. For example SQL below (10K items in 
> IN) takes 45-50s. 
> {code}
> s"""SELECT * FROM Person WHERE ForeName IN ('${(1 to 1).map("n" + 
> _).mkString("','")}')"""
> {code}
> This is principally due to TreeNode which repeatedly call contains on 
> children, where children in this case is a List that is 10K long. In effect 
> parsing for large IN clauses is O(N squared).
> A small change that uses a lazily initialised Set based on children for 
> contains reduces parse time to around 2.5s
> I'd like to create PR for change, as we often use IN clauses with a few 
> thousand items.
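
As a rough sketch of the lazily initialised Set idea described above 
(illustrative only, not the actual patch):

{code}
// Repeated calls to children.contains(x) on a 10K-element List are O(N) each;
// a lazily built Set makes each membership test O(1) without paying the cost
// of building it when it is never needed.
class NodeWithManyChildren[A](val children: Seq[A]) {
  private lazy val childSet: Set[A] = children.toSet

  def containsChild(a: A): Boolean = childSet.contains(a)
}
{code}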



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8102) Big performance difference when joining 3 tables in different order

2015-06-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574943#comment-14574943
 ] 

Yin Huai commented on SPARK-8102:
-

Can you post the query plans for these two queries?

You can prefix a query with {{EXPLAIN}} or call {{sqlContext.sql(...).explain()}} 
to get the plan.
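
For example (query abridged from the description; pass {{true}} for the 
extended parsed/analyzed/optimized plans):

{code}
sqlContext.sql(
  """SELECT g.period, c.categoryName, z.regionName
    |FROM t_category c, t_zipcode z, click_meter_site_grouped g
    |WHERE c.refCategoryID = g.category AND z.regionCode = g.region""".stripMargin
).explain(true)
{code}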

> Big performance difference when joining 3 tables in different order
> ---
>
> Key: SPARK-8102
> URL: https://issues.apache.org/jira/browse/SPARK-8102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: spark in local mode
>Reporter: Hao Ren
>
> Given 3 tables loaded from CSV files: 
> ( tables name => size)
> *click_meter_site_grouped* =>10 687 455 bytes
> *t_zipcode* => 2 738 954 bytes
> *t_category* => 2 182 bytes
> When joining the 3 tables, I notice a large performance difference if they 
> are joined in different order.
> Here are the SQL queries to compare:
> {code}
> -- snippet 1
> SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
> FROM t_category c, t_zipcode z, click_meter_site_grouped g
> WHERE c.refCategoryID = g.category AND z.regionCode = g.region
> {code}
> {code}
> -- snippet 2
> SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
> FROM t_category c, click_meter_site_grouped g, t_zipcode z
> WHERE c.refCategoryID = g.category AND z.regionCode = g.region
> {code}
> As you see, the largest table *click_meter_site_grouped* is the last table in 
> FROM clause in the first snippet,  and it is in the middle of table list in 
> second one.
> Snippet 2 runs three times faster than Snippet 1.
> (8 seconds VS 24 seconds)
> As the data is just sampled from a large data set, if we test it on the 
> original data set, it will normally result in a performance issue.
> After checking the log, we found something strange In snippet 1's log:
> 15/06/04 15:32:03 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:04 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:04 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:09 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:09 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:09 INFO HadoopRDD: Input split: 
> file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
> 15/06/04 15:32:09 INFO HadoopRDD: Input split: 
> file:/home/i

[jira] [Commented] (SPARK-8105) sqlContext.table("databaseName.tableName") broke with SPARK-6908

2015-06-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574956#comment-14574956
 ] 

Yin Huai commented on SPARK-8105:
-

Actually, as I said on the mailing list, it is not officially supported. It 
works in 1.3 because it triggered a not-well-defined Hive code path (which may 
introduce bugs in other cases); it was not intentionally supported in 1.3.

I am changing the issue type to New Feature. We will add it to sqlContext.

> sqlContext.table("databaseName.tableName") broke with SPARK-6908
> 
>
> Key: SPARK-8105
> URL: https://issues.apache.org/jira/browse/SPARK-8105
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Spark with Hive
>Reporter: Doug Balog
>Priority: Critical
>
> Since the introduction of Dataframes in Spark 1.3.0 and prior to SPARK-6908 
> landing into master, a user could get a DataFrame to a Hive table using 
> `sqlContext.table("databaseName.tableName")` 
> Since SPARK-6908, the user now receives a NoSuchTableException.
> This amounts to a change in  non experimental sqlContext.table() api and will 
> require user code to be modified to work properly with 1.4.0.
> The only viable work around I could find is
> `sqlContext.sql("select * from databseName.tableName")`
> which seems like a hack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8105) sqlContext.table("databaseName.tableName") broke with SPARK-6908

2015-06-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8105:

Issue Type: New Feature  (was: Bug)

> sqlContext.table("databaseName.tableName") broke with SPARK-6908
> 
>
> Key: SPARK-8105
> URL: https://issues.apache.org/jira/browse/SPARK-8105
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Spark with Hive
>Reporter: Doug Balog
>Priority: Critical
>
> Since the introduction of Dataframes in Spark 1.3.0 and prior to SPARK-6908 
> landing into master, a user could get a DataFrame to a Hive table using 
> `sqlContext.table("databaseName.tableName")` 
> Since SPARK-6908, the user now receives a NoSuchTableException.
> This amounts to a change in  non experimental sqlContext.table() api and will 
> require user code to be modified to work properly with 1.4.0.
> The only viable work around I could find is
> `sqlContext.sql("select * from databseName.tableName")`
> which seems like a hack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8105) sqlContext.table("databaseName.tableName") broke with SPARK-6908

2015-06-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8105:

Target Version/s: 1.4.1, 1.5.0  (was: 1.4.1)

> sqlContext.table("databaseName.tableName") broke with SPARK-6908
> 
>
> Key: SPARK-8105
> URL: https://issues.apache.org/jira/browse/SPARK-8105
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Spark with Hive
>Reporter: Doug Balog
>Priority: Critical
>
> Since the introduction of Dataframes in Spark 1.3.0 and prior to SPARK-6908 
> landing into master, a user could get a DataFrame to a Hive table using 
> `sqlContext.table("databaseName.tableName")` 
> Since SPARK-6908, the user now receives a NoSuchTableException.
> This amounts to a change in  non experimental sqlContext.table() api and will 
> require user code to be modified to work properly with 1.4.0.
> The only viable work around I could find is
> `sqlContext.sql("select * from databseName.tableName")`
> which seems like a hack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8105) sqlContext.table("databaseName.tableName") broke with SPARK-6908

2015-06-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8105:

Target Version/s: 1.5.0  (was: 1.4.1, 1.5.0)

> sqlContext.table("databaseName.tableName") broke with SPARK-6908
> 
>
> Key: SPARK-8105
> URL: https://issues.apache.org/jira/browse/SPARK-8105
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Spark with Hive
>Reporter: Doug Balog
>Priority: Critical
>
> Since the introduction of Dataframes in Spark 1.3.0 and prior to SPARK-6908 
> landing into master, a user could get a DataFrame to a Hive table using 
> `sqlContext.table("databaseName.tableName")` 
> Since SPARK-6908, the user now receives a NoSuchTableException.
> This amounts to a change in  non experimental sqlContext.table() api and will 
> require user code to be modified to work properly with 1.4.0.
> The only viable work around I could find is
> `sqlContext.sql("select * from databseName.tableName")`
> which seems like a hack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8077) Optimisation of TreeNode for large number of children

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8077:
---

Assignee: Apache Spark

> Optimisation of TreeNode for large number of children
> -
>
> Key: SPARK-8077
> URL: https://issues.apache.org/jira/browse/SPARK-8077
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Mick Davies
>Assignee: Apache Spark
>Priority: Minor
>
> Large IN clauses are parsed very slowly. For example SQL below (10K items in 
> IN) takes 45-50s. 
> {code}
> s"""SELECT * FROM Person WHERE ForeName IN ('${(1 to 1).map("n" + 
> _).mkString("','")}')"""
> {code}
> This is principally due to TreeNode which repeatedly call contains on 
> children, where children in this case is a List that is 10K long. In effect 
> parsing for large IN clauses is O(N squared).
> A small change that uses a lazily initialised Set based on children for 
> contains reduces parse time to around 2.5s
> I'd like to create PR for change, as we often use IN clauses with a few 
> thousand items.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8077) Optimisation of TreeNode for large number of children

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8077:
---

Assignee: (was: Apache Spark)

> Optimisation of TreeNode for large number of children
> -
>
> Key: SPARK-8077
> URL: https://issues.apache.org/jira/browse/SPARK-8077
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Mick Davies
>Priority: Minor
>
> Large IN clauses are parsed very slowly. For example SQL below (10K items in 
> IN) takes 45-50s. 
> {code}
> s"""SELECT * FROM Person WHERE ForeName IN ('${(1 to 1).map("n" + 
> _).mkString("','")}')"""
> {code}
> This is principally due to TreeNode which repeatedly call contains on 
> children, where children in this case is a List that is 10K long. In effect 
> parsing for large IN clauses is O(N squared).
> A small change that uses a lazily initialised Set based on children for 
> contains reduces parse time to around 2.5s
> I'd like to create PR for change, as we often use IN clauses with a few 
> thousand items.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8093) Failure to save empty json object as parquet

2015-06-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574952#comment-14574952
 ] 

Yin Huai commented on SPARK-8093:
-

[~rhbutani] Is your test based on RC4?

> Failure to save empty json object as parquet
> 
>
> Key: SPARK-8093
> URL: https://issues.apache.org/jira/browse/SPARK-8093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Harish Butani
> Attachments: t1.json
>
>
> This is similar to SPARK-3365. Sample json is attached. Code to reproduce
> {code}
> var jsonDF = read.json("/tmp/t1.json")
> jsonDF.write.parquet("/tmp/t1.parquet")
> {code}
> The 'integration' object is empty in the json.
> StackTrace:
> {code}
> 
> Caused by: java.io.IOException: Could not read footer: 
> java.lang.IllegalStateException: Cannot build an empty group
>   at 
> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.refresh(newParquet.scala:197)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:134)
>   ... 69 more
> Caused by: java.lang.IllegalStateException: Cannot build an empty group
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8107) sqlContext.table() should be able to take a database name as an additional argument.

2015-06-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-8107.
-
Resolution: Duplicate

> sqlContext.table() should be able to take a database name as an additional 
> argument.
> 
>
> Key: SPARK-8107
> URL: https://issues.apache.org/jira/browse/SPARK-8107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Doug Balog
>
> sqlContext table should take an addition argument to specify a databaseName.
> {code}
> def table(databaseName: String, tableName: String): DataFrame =
>   DataFrame(this, catalog.lookupRelation(Seq(databaseName,tableName)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6987) Node Locality is determined with String Matching instead of Inet Comparison

2015-06-05 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574983#comment-14574983
 ] 

Russell Alexander Spitzer commented on SPARK-6987:
--

Or how about being able to specify an identifier for each Spark worker that 
isn't dependent on the IP?
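
A minimal sketch of what an Inet-based comparison could look like (illustrative 
only; a real fix would need caching and careful handling of resolution failures):

{code}
import java.net.{InetAddress, UnknownHostException}

// Compare hosts by resolved address rather than by raw string, so a hostname
// and its IP are treated as the same node.
def sameHost(a: String, b: String): Boolean =
  try InetAddress.getByName(a) == InetAddress.getByName(b)
  catch { case _: UnknownHostException => a == b }
{code}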

> Node Locality is determined with String Matching instead of Inet Comparison
> ---
>
> Key: SPARK-6987
> URL: https://issues.apache.org/jira/browse/SPARK-6987
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Russell Alexander Spitzer
>
> When determining whether or not a task can be run NodeLocal the 
> TaskSetManager ends up using a direct string comparison between the 
> preferredIp and the executor's bound interface.
> https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L878-L880
> https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L488-L490
> This means that the preferredIp must be a direct string match of the ip the 
> the worker is bound to. This means that apis which are gathering data from 
> other distributed sources must develop their own mapping between the 
> interfaces bound (or exposed) by the external sources and the interface bound 
> by the Spark executor since these may be different. 
> For example, Cassandra exposes a broadcast rpc address which doesn't have to 
> match the address which the service is bound to. This means when adding 
> preferredLocation data we must add both the rpc and the listen address to 
> ensure that we can get a string match (and of course we are out of luck if 
> Spark has been bound on to another interface). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8126) Use temp directory under build dir for unit tests

2015-06-05 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-8126:
-

 Summary: Use temp directory under build dir for unit tests
 Key: SPARK-8126
 URL: https://issues.apache.org/jira/browse/SPARK-8126
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Marcelo Vanzin
Priority: Minor


Spark's unit tests leave a lot of garbage in /tmp after a run, making it hard 
to clean things up. Let's place those files under the build dir so that 
"mvn|sbt|git clean" can do their job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8099) In yarn-cluster mode, "--executor-cores" can't be setted into SparkConf

2015-06-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-8099.
---
   Resolution: Fixed
Fix Version/s: 1.5.0
 Assignee: meiyoula

> In yarn-cluster mode, "--executor-cores" can't be setted into SparkConf
> ---
>
> Key: SPARK-8099
> URL: https://issues.apache.org/jira/browse/SPARK-8099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: meiyoula
>Assignee: meiyoula
> Fix For: 1.5.0
>
>
> While testing dynamic executor allocation function, I set the executor cores 
> with *--executor-cores 4* in spark-submit command. But in 
> *ExecutorAllocationManager*, the *private val tasksPerExecutor 
> =conf.getInt("spark.executor.cores", 1) / conf.getInt("spark.task.cpus", 1)* 
> is still to be 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8099) In yarn-cluster mode, "--executor-cores" can't be setted into SparkConf

2015-06-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-8099:
--
Assignee: (was: meiyoula)

> In yarn-cluster mode, "--executor-cores" can't be setted into SparkConf
> ---
>
> Key: SPARK-8099
> URL: https://issues.apache.org/jira/browse/SPARK-8099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: meiyoula
> Fix For: 1.5.0
>
>
> While testing dynamic executor allocation function, I set the executor cores 
> with *--executor-cores 4* in spark-submit command. But in 
> *ExecutorAllocationManager*, the *private val tasksPerExecutor 
> =conf.getInt("spark.executor.cores", 1) / conf.getInt("spark.task.cpus", 1)* 
> is still to be 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8099) In yarn-cluster mode, "--executor-cores" can't be setted into SparkConf

2015-06-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-8099:
--
Assignee: meiyoula

> In yarn-cluster mode, "--executor-cores" can't be setted into SparkConf
> ---
>
> Key: SPARK-8099
> URL: https://issues.apache.org/jira/browse/SPARK-8099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: meiyoula
>Assignee: meiyoula
> Fix For: 1.5.0
>
>
> While testing dynamic executor allocation function, I set the executor cores 
> with *--executor-cores 4* in spark-submit command. But in 
> *ExecutorAllocationManager*, the *private val tasksPerExecutor 
> =conf.getInt("spark.executor.cores", 1) / conf.getInt("spark.task.cpus", 1)* 
> is still to be 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-06-05 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575002#comment-14575002
 ] 

Karl Higley commented on SPARK-5992:


To make it easier to define a common interface, it might help to restrict 
consideration to methods that produce hash signatures. For cosine similarity, 
sign-random-projection LSH would probably fit the bill. See Section 3 of 
"Similarity Estimation Techniques from Rounding Algorithms":
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
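
For a sense of what that looks like, a small sketch of sign-random-projection 
signatures (illustrative, not a proposed MLlib API): each random hyperplane 
contributes one bit, the sign of the dot product, and similar vectors agree on 
most bits.

{code}
import scala.util.Random

// Random Gaussian hyperplanes, one per signature bit.
def randomPlanes(numBits: Int, dim: Int, seed: Long = 42L): Array[Array[Double]] = {
  val rnd = new Random(seed)
  Array.fill(numBits, dim)(rnd.nextGaussian())
}

// Signature bit i is the sign of the dot product with hyperplane i.
def srpSignature(v: Array[Double], planes: Array[Array[Double]]): Array[Boolean] =
  planes.map { p =>
    val dot = v.zip(p).map { case (x, y) => x * y }.sum
    dot >= 0.0
  }
{code}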

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8126) Use temp directory under build dir for unit tests

2015-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575003#comment-14575003
 ] 

Apache Spark commented on SPARK-8126:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6674

> Use temp directory under build dir for unit tests
> -
>
> Key: SPARK-8126
> URL: https://issues.apache.org/jira/browse/SPARK-8126
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Spark's unit tests leave a lot of garbage in /tmp after a run, making it hard 
> to clean things up. Let's place those files under the build dir so that 
> "mvn|sbt|git clean" can do their job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8126) Use temp directory under build dir for unit tests

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8126:
---

Assignee: (was: Apache Spark)

> Use temp directory under build dir for unit tests
> -
>
> Key: SPARK-8126
> URL: https://issues.apache.org/jira/browse/SPARK-8126
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Spark's unit tests leave a lot of garbage in /tmp after a run, making it hard 
> to clean things up. Let's place those files under the build dir so that 
> "mvn|sbt|git clean" can do their job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8126) Use temp directory under build dir for unit tests

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8126:
---

Assignee: Apache Spark

> Use temp directory under build dir for unit tests
> -
>
> Key: SPARK-8126
> URL: https://issues.apache.org/jira/browse/SPARK-8126
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> Spark's unit tests leave a lot of garbage in /tmp after a run, making it hard 
> to clean things up. Let's place those files under the build dir so that 
> "mvn|sbt|git clean" can do their job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8093) Failure to save empty json object as parquet

2015-06-05 Thread Harish Butani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575013#comment-14575013
 ] 

Harish Butani commented on SPARK-8093:
--

yes

> Failure to save empty json object as parquet
> 
>
> Key: SPARK-8093
> URL: https://issues.apache.org/jira/browse/SPARK-8093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Harish Butani
> Attachments: t1.json
>
>
> This is similar to SPARK-3365. Sample json is attached. Code to reproduce
> {code}
> var jsonDF = read.json("/tmp/t1.json")
> jsonDF.write.parquet("/tmp/t1.parquet")
> {code}
> The 'integration' object is empty in the json.
> StackTrace:
> {code}
> 
> Caused by: java.io.IOException: Could not read footer: 
> java.lang.IllegalStateException: Cannot build an empty group
>   at 
> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.refresh(newParquet.scala:197)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:134)
>   ... 69 more
> Caused by: java.lang.IllegalStateException: Cannot build an empty group
> {code}
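One possible workaround, sketched under the assumption that the empty 'integration' struct is not needed downstream and that a spark-shell style sqlContext is in scope; this works around the symptom rather than fixing the underlying issue:

{code}
// Drop any top-level column whose struct type has no fields before writing,
// since Parquet cannot represent an empty group.
import org.apache.spark.sql.types.StructType

val jsonDF = sqlContext.read.json("/tmp/t1.json")
val emptyStructCols = jsonDF.schema.fields.collect {
  case f if f.dataType.isInstanceOf[StructType] &&
            f.dataType.asInstanceOf[StructType].fields.isEmpty => f.name
}
val cleaned = emptyStructCols.foldLeft(jsonDF)((df, c) => df.drop(c))
cleaned.write.parquet("/tmp/t1.parquet")
{code}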



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8093) Failure to save empty json object as parquet

2015-06-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575023#comment-14575023
 ] 

Yin Huai commented on SPARK-8093:
-

When you get time, can you try it with master? We just bumped Parquet to 1.7. I 
am wondering whether the logic of ParquetFileReader.readAllFootersInParallel has 
been changed to handle this case or not.

> Failure to save empty json object as parquet
> 
>
> Key: SPARK-8093
> URL: https://issues.apache.org/jira/browse/SPARK-8093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Harish Butani
> Attachments: t1.json
>
>
> This is similar to SPARK-3365. Sample json is attached. Code to reproduce
> {code}
> var jsonDF = read.json("/tmp/t1.json")
> jsonDF.write.parquet("/tmp/t1.parquet")
> {code}
> The 'integration' object is empty in the json.
> StackTrace:
> {code}
> 
> Caused by: java.io.IOException: Could not read footer: 
> java.lang.IllegalStateException: Cannot build an empty group
>   at 
> parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.refresh(newParquet.scala:197)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:134)
>   ... 69 more
> Caused by: java.lang.IllegalStateException: Cannot build an empty group
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8064) Upgrade Hive to 1.2

2015-06-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575070#comment-14575070
 ] 

Steve Loughran commented on SPARK-8064:
---

I'm working on this

> Upgrade Hive to 1.2
> ---
>
> Key: SPARK-8064
> URL: https://issues.apache.org/jira/browse/SPARK-8064
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Steve Loughran
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8064) Upgrade Hive to 1.2

2015-06-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8064:
---
Assignee: Steve Loughran

> Upgrade Hive to 1.2
> ---
>
> Key: SPARK-8064
> URL: https://issues.apache.org/jira/browse/SPARK-8064
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Steve Loughran
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7699) Dynamic allocation: initial executors may be canceled before first job

2015-06-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-7699.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

> Dynamic allocation: initial executors may be canceled before first job
> --
>
> Key: SPARK-7699
> URL: https://issues.apache.org/jira/browse/SPARK-7699
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: meiyoula
>Assignee: Saisai Shao
> Fix For: 1.5.0
>
>
> spark.dynamicAllocation.minExecutors 2
> spark.dynamicAllocation.initialExecutors  3
> spark.dynamicAllocation.maxExecutors 4
> Just run the spark-shell with above configurations, the initial executor 
> number is 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-06-05 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575079#comment-14575079
 ] 

Joseph K. Bradley commented on SPARK-5992:
--

I'm just noting here that [~yuu.ishik...@gmail.com] and I discussed the design 
doc.  The current plan is to have these implemented under ml.feature, with 2 
types of UnaryTransformers:
* Vector -> T: hash feature vector to some other type (Vector, Int, etc.)
* Vector -> Double: compare feature vector with a fixed base Vector, and return 
similarity
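As a concrete illustration of the second transformer type, here is the core computation only, written as a plain Scala function; the eventual ml.feature version would presumably wrap something like this in a UnaryTransformer, and none of these names come from the design doc:

{code}
// Similarity of a feature vector to a fixed base vector (cosine similarity here,
// purely as an example of the Vector -> Double shape described above).
def cosineSimilarity(base: Array[Double])(v: Array[Double]): Double = {
  require(base.length == v.length, "vectors must have the same dimension")
  var dot = 0.0; var nb = 0.0; var nv = 0.0
  var i = 0
  while (i < v.length) {
    dot += base(i) * v(i); nb += base(i) * base(i); nv += v(i) * v(i); i += 1
  }
  if (nb == 0.0 || nv == 0.0) 0.0 else dot / (math.sqrt(nb) * math.sqrt(nv))
}
{code}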

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction

2015-06-05 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575083#comment-14575083
 ] 

Joseph K. Bradley commented on SPARK-7334:
--

I won't be able to look at the PR right away, but I will try to before long.  
(It looks very well-documented and tested!)  In the meantime, can you please 
view [~yuu.ishik...@gmail.com]'s design doc on [SPARK-5992] and discuss a good 
common interface?

> Implement RandomProjection for Dimensionality Reduction
> ---
>
> Key: SPARK-7334
> URL: https://issues.apache.org/jira/browse/SPARK-7334
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Sebastian Alfers
>Priority: Minor
>
> Implement RandomProjection (RP) for dimensionality reduction
> RP is a popular approach to reduce the amount of data while preserving a 
> reasonable amount of information (pairwise distance) of your data [1][2]
> - [1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf
> - [2] 
> http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf
> I compared different implementations of that algorithm:
> - https://github.com/sebastian-alfers/random-projection-python
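For reference, a minimal Scala sketch of the basic technique (dense Gaussian random projection); it is only meant to make the description above concrete and is not a proposed MLlib implementation:

{code}
import scala.util.Random

// Project d-dimensional points through a k x d matrix of i.i.d. Gaussian entries
// scaled by 1/sqrt(k); pairwise distances are approximately preserved
// (Johnson-Lindenstrauss).
object RandomProjectionSketch {
  def projectionMatrix(k: Int, d: Int, seed: Long = 0L): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(k, d)(rng.nextGaussian() / math.sqrt(k))
  }

  def project(x: Array[Double], m: Array[Array[Double]]): Array[Double] =
    m.map { row =>
      var dot = 0.0
      var i = 0
      while (i < x.length) { dot += row(i) * x(i); i += 1 }
      dot
    }
}
{code}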



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8112) Received block event count through the StreamingListener can be negative

2015-06-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-8112.
--
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.1

> Received block event count through the StreamingListener can be negative
> 
>
> Key: SPARK-8112
> URL: https://issues.apache.org/jira/browse/SPARK-8112
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 1.4.1, 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8127) KafkaRDD optimize count() take() isEmpty()

2015-06-05 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-8127:
-

 Summary: KafkaRDD optimize count() take() isEmpty()
 Key: SPARK-8127
 URL: https://issues.apache.org/jira/browse/SPARK-8127
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Cody Koeninger
Priority: Minor


KafkaRDD can use offset range to avoid doing extra work

Possibly related to SPARK-7122
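A hypothetical sketch of the idea (the types below are illustrative stand-ins, not the actual KafkaRDD or OffsetRange classes): because each partition covers a known half-open offset range, count() and isEmpty() can be answered from metadata alone, without pulling any messages from Kafka.

{code}
// Illustrative only: per-partition record counts derived from offset ranges.
case class OffsetRangeSketch(fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}

def rddCount(ranges: Seq[OffsetRangeSketch]): Long = ranges.map(_.count).sum
def rddIsEmpty(ranges: Seq[OffsetRangeSketch]): Boolean = rddCount(ranges) == 0L
// take(n) could similarly stop after consuming at most n records across ranges.
{code}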



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8127) KafkaRDD optimize count() take() isEmpty()

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8127:
---

Assignee: (was: Apache Spark)

> KafkaRDD optimize count() take() isEmpty()
> --
>
> Key: SPARK-8127
> URL: https://issues.apache.org/jira/browse/SPARK-8127
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Cody Koeninger
>Priority: Minor
>
> KafkaRDD can use offset range to avoid doing extra work
> Possibly related to SPARK-7122



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8127) KafkaRDD optimize count() take() isEmpty()

2015-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575184#comment-14575184
 ] 

Apache Spark commented on SPARK-8127:
-

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/6632

> KafkaRDD optimize count() take() isEmpty()
> --
>
> Key: SPARK-8127
> URL: https://issues.apache.org/jira/browse/SPARK-8127
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Cody Koeninger
>Priority: Minor
>
> KafkaRDD can use offset range to avoid doing extra work
> Possibly related to SPARK-7122



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8127) KafkaRDD optimize count() take() isEmpty()

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8127:
---

Assignee: Apache Spark

> KafkaRDD optimize count() take() isEmpty()
> --
>
> Key: SPARK-8127
> URL: https://issues.apache.org/jira/browse/SPARK-8127
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Cody Koeninger
>Assignee: Apache Spark
>Priority: Minor
>
> KafkaRDD can use offset range to avoid doing extra work
> Possibly related to SPARK-7122



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8107) sqlContext.table() should be able to take a database name as an additional argument.

2015-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575189#comment-14575189
 ] 

Apache Spark commented on SPARK-8107:
-

User 'dougb' has created a pull request for this issue:
https://github.com/apache/spark/pull/6675

> sqlContext.table() should be able to take a database name as an additional 
> argument.
> 
>
> Key: SPARK-8107
> URL: https://issues.apache.org/jira/browse/SPARK-8107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Doug Balog
>
> sqlContext.table should take an additional argument to specify a databaseName.
> {code}
> def table(databaseName: String, tableName: String): DataFrame =
>   DataFrame(this, catalog.lookupRelation(Seq(databaseName,tableName)))
> {code}
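A hypothetical usage sketch, assuming the overload quoted above were added (the database and table names are made up); today the same effect requires going through SQL text:

{code}
val viaProposedApi = sqlContext.table("my_database", "my_table")           // proposed overload
val viaSqlText     = sqlContext.sql("SELECT * FROM my_database.my_table")  // current workaround
{code}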



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7747) Document spark.sql.planner.externalSort option

2015-06-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575193#comment-14575193
 ] 

Yin Huai edited comment on SPARK-7747 at 6/5/15 8:47 PM:
-

It has been fixed by https://github.com/apache/spark/pull/6272. 

Since it is a doc change, I am setting the fix version as 1.4.0 since we have 
not released 1.4.0 and our doc is not necessarily built based on the last RC.


was (Author: yhuai):
It has been fixed by https://github.com/apache/spark/pull/6272. 

Since it is a doc change, I am setting the fix version as 1.4.0.

> Document spark.sql.planner.externalSort option
> --
>
> Key: SPARK-7747
> URL: https://issues.apache.org/jira/browse/SPARK-7747
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.1
>Reporter: Luca Martinetti
>Priority: Minor
> Fix For: 1.4.0
>
>
> The configuration option *spark.sql.planner.externalSort* introduced in 
> SPARK-4410 is not documented
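For reference, a minimal sketch of how the option can be enabled, assuming a spark-shell style sqlContext; the option name comes from this issue, while the table and column names are illustrative:

{code}
sqlContext.setConf("spark.sql.planner.externalSort", "true")
// Subsequent sorts may now spill to disk instead of holding everything in memory:
sqlContext.sql("SELECT * FROM big_table ORDER BY some_col")
{code}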



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7747) Document spark.sql.planner.externalSort option

2015-06-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-7747.
-
Resolution: Fixed

It has been fixed by https://github.com/apache/spark/pull/6272. 

Since it is a doc change, I am setting the fix version as 1.4.0.

> Document spark.sql.planner.externalSort option
> --
>
> Key: SPARK-7747
> URL: https://issues.apache.org/jira/browse/SPARK-7747
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.1
>Reporter: Luca Martinetti
>Priority: Minor
>
> The configuration option *spark.sql.planner.externalSort* introduced in 
> SPARK-4410 is not documented



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7747) Document spark.sql.planner.externalSort option

2015-06-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-7747:

Fix Version/s: 1.4.0

> Document spark.sql.planner.externalSort option
> --
>
> Key: SPARK-7747
> URL: https://issues.apache.org/jira/browse/SPARK-7747
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.1
>Reporter: Luca Martinetti
>Priority: Minor
> Fix For: 1.4.0
>
>
> The configuration option *spark.sql.planner.externalSort* introduced in 
> SPARK-4410 is not documented



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7991) Python DataFrame: support passing a list into describe

2015-06-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7991.

   Resolution: Fixed
Fix Version/s: 1.4.1
 Assignee: Amey Chaugule

> Python DataFrame: support passing a list into describe
> --
>
> Key: SPARK-7991
> URL: https://issues.apache.org/jira/browse/SPARK-7991
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Amey Chaugule
>  Labels: starter
> Fix For: 1.4.1
>
>
> DataFrame.describe in Python takes a vararg, i.e. it can be invoked this way:
> {code}
> df.describe('col1', 'col2', 'col3')
> {code}
> Most of our DataFrame functions accept a list in addition to varargs. 
> describe should do the same, i.e. it should also accept a Python list:
> {code}
> df.describe(['col1', 'col2', 'col3'])
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7991) Python DataFrame: support passing a list into describe

2015-06-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7991:
---
Fix Version/s: 1.5.0

> Python DataFrame: support passing a list into describe
> --
>
> Key: SPARK-7991
> URL: https://issues.apache.org/jira/browse/SPARK-7991
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Amey Chaugule
>  Labels: starter
> Fix For: 1.4.1, 1.5.0
>
>
> DataFrame.describe in Python takes a vararg, i.e. it can be invoked this way:
> {code}
> df.describe('col1', 'col2', 'col3')
> {code}
> Most of our DataFrame functions accept a list in addition to varargs. 
> describe should do the same, i.e. it should also accept a Python list:
> {code}
> df.describe(['col1', 'col2', 'col3'])
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8128) Dataframe Fails to Recognize Column in Schema

2015-06-05 Thread Brad Willard (JIRA)
Brad Willard created SPARK-8128:
---

 Summary: Dataframe Fails to Recognize Column in Schema
 Key: SPARK-8128
 URL: https://issues.apache.org/jira/browse/SPARK-8128
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.3.1
Reporter: Brad Willard


I'm loading a folder of about 600 parquet files into one dataframe, so schema 
merging is involved. There is some bug with the schema merging: printing the 
schema shows the attribute, but running a query that filters on that attribute 
errors, saying the attribute is not in the schema.

I think this bug could be related to an attribute name being reused in a nested 
object. "mediaProcessingState" appears twice in the schema and is the problem.

sdf = sql_context.parquet('/parquet/big_data_folder')
sdf.printSchema()
root
 |-- _id: string (nullable = true)
 |-- addedOn: string (nullable = true)
 |-- attachment: string (nullable = true)
 ...
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- addedOn: string (nullable = true)
 |    |    |-- authorId: string (nullable = true)
 |    |    |-- mediaProcessingState: long (nullable = true)
 |-- mediaProcessingState: long (nullable = true)
 |-- title: string (nullable = true)
 |-- key: string (nullable = true)

sdf.filter(sdf.mediaProcessingState == 3).count()

causes this exception

Py4JJavaError: An error occurred while calling o67.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1106 
in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in stage 
4.0 (TID 70565, XXX): java.lang.IllegalArgumentException: Column 
[mediaProcessingState] was not found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
at 
parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
at 
parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
at 
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
at 
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExe

[jira] [Updated] (SPARK-8128) Dataframe Fails to Recognize Column in Schema

2015-06-05 Thread Brad Willard (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brad Willard updated SPARK-8128:

Description: 
I'm loading a folder of about 600 parquet files into one dataframe, so schema 
merging is involved. There is some bug with the schema merging: printing the 
schema shows the attribute, but running a query that filters on that attribute 
errors, saying the attribute is not in the schema.

I think this bug could be related to an attribute name being reused in a nested 
object. "mediaProcessingState" appears twice in the schema and is the problem.

sdf = sql_context.parquet('/parquet/big_data_folder')
sdf.printSchema()
root
 |-- _id: string (nullable = true)
 |-- addedOn: string (nullable = true)
 |-- attachment: string (nullable = true)
 ...
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- addedOn: string (nullable = true)
 |    |    |-- authorId: string (nullable = true)
 |    |    |-- mediaProcessingState: long (nullable = true)
 |-- mediaProcessingState: long (nullable = true)
 |-- title: string (nullable = true)
 |-- key: string (nullable = true)

sdf.filter(sdf.mediaProcessingState == 3).count()

causes this exception

Py4JJavaError: An error occurred while calling o67.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1106 
in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in stage 
4.0 (TID 70565, XXX): java.lang.IllegalArgumentException: Column 
[mediaProcessingState] was not found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
at 
parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
at 
parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
at 
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
at 
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


You also g

[jira] [Updated] (SPARK-5784) Add StatsDSink to MetricsSystem

2015-06-05 Thread Vidhya Arvind (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vidhya Arvind updated SPARK-5784:
-
Attachment: statsd.patch

Attaching the patch file. Not sure how to get this JIRA reopened and the patch added.

> Add StatsDSink to MetricsSystem
> ---
>
> Key: SPARK-5784
> URL: https://issues.apache.org/jira/browse/SPARK-5784
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Ryan Williams
>Priority: Minor
> Attachments: statsd.patch
>
>
> [StatsD|https://github.com/etsy/statsd/] is a common wrapper for Graphite; it 
> would be useful to support sending metrics to StatsD in addition to [the 
> existing Graphite 
> support|https://github.com/apache/spark/blob/6a1be026cf37e4c8bf39133dfb4a73f7caedcc26/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala].
> [readytalk/metrics-statsd|https://github.com/readytalk/metrics-statsd] is a 
> StatsD adapter for the 
> [dropwizard/metrics|https://github.com/dropwizard/metrics] library that Spark 
> uses. The Maven repository at http://dl.bintray.com/readytalk/maven/ serves 
> {{metrics-statsd}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7041) Avoid writing empty files in BypassMergeSortShuffleWriter

2015-06-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7041:
--
Description: In BypassMergeSortShuffleWriter, we may end up opening disk 
writers files for empty partitions; this occurs because we manually call 
{{open()}} after creating the writer, causing serialization and compression 
input streams to be created; these streams may write headers to the output 
stream, resulting in non-zero-length files being created for partitions that 
contain no records.  This is unnecessary, though, since the disk object writer 
will automatically open itself when the first write is performed.  Removing 
this eager {{open()}} call and rewriting the consumers to cope with the 
non-existence of empty files results in a large performance benefit for certain 
sparse workloads when using sort-based shuffle.  (was: In ExternalSorter, we 
may end up opening disk writers files for empty partitions; this occurs because 
we manually call {{open()}} after creating the writer, causing serialization 
and compression input streams to be created; these streams may write headers to 
the output stream, resulting in non-zero-length files being created for 
partitions that contain no records.  This is unnecessary, though, since the 
disk object writer will automatically open itself when the first write is 
performed.  Removing this eager {{open()}} call and rewriting the consumers to 
cope with the non-existence of empty files results in a large performance 
benefit for certain sparse workloads when using sort-based shuffle.)

> Avoid writing empty files in BypassMergeSortShuffleWriter
> -
>
> Key: SPARK-7041
> URL: https://issues.apache.org/jira/browse/SPARK-7041
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In BypassMergeSortShuffleWriter, we may end up opening disk writers files for 
> empty partitions; this occurs because we manually call {{open()}} after 
> creating the writer, causing serialization and compression input streams to 
> be created; these streams may write headers to the output stream, resulting 
> in non-zero-length files being created for partitions that contain no 
> records.  This is unnecessary, though, since the disk object writer will 
> automatically open itself when the first write is performed.  Removing this 
> eager {{open()}} call and rewriting the consumers to cope with the 
> non-existence of empty files results in a large performance benefit for 
> certain sparse workloads when using sort-based shuffle.
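An illustrative lazy-open sketch of the pattern described above (not the actual disk object writer code): the file is only created on the first write, so partitions that receive no records never produce a file at all.

{code}
import java.io.{File, FileOutputStream, OutputStream}

class LazyPartitionWriter(file: File) {
  private var out: OutputStream = null

  def write(bytes: Array[Byte]): Unit = {
    if (out == null) out = new FileOutputStream(file)  // opened on the first record only
    out.write(bytes)
  }

  // Returns true only if something was actually written (and hence a file exists).
  def close(): Boolean =
    if (out != null) { out.close(); true } else false
}
{code}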



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7041) Avoid writing empty files in BypassMergeSortShuffleWriter

2015-06-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7041:
--
Summary: Avoid writing empty files in BypassMergeSortShuffleWriter  (was: 
Avoid writing empty files in ExternalSorter)

> Avoid writing empty files in BypassMergeSortShuffleWriter
> -
>
> Key: SPARK-7041
> URL: https://issues.apache.org/jira/browse/SPARK-7041
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In ExternalSorter, we may end up opening disk writers files for empty 
> partitions; this occurs because we manually call {{open()}} after creating 
> the writer, causing serialization and compression input streams to be 
> created; these streams may write headers to the output stream, resulting in 
> non-zero-length files being created for partitions that contain no records.  
> This is unnecessary, though, since the disk object writer will automatically 
> open itself when the first write is performed.  Removing this eager 
> {{open()}} call and rewriting the consumers to cope with the non-existence of 
> empty files results in a large performance benefit for certain sparse 
> workloads when using sort-based shuffle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8129) Securely pass auth secret to executors in standalone cluster mode

2015-06-05 Thread Kan Zhang (JIRA)
Kan Zhang created SPARK-8129:


 Summary: Securely pass auth secret to executors in standalone 
cluster mode
 Key: SPARK-8129
 URL: https://issues.apache.org/jira/browse/SPARK-8129
 Project: Spark
  Issue Type: New Feature
  Components: Deploy, Spark Core
Reporter: Kan Zhang
Priority: Critical


Currently, when authentication is turned on, the Worker passes the auth secret to 
executors (and also to drivers in cluster mode) as Java options on the command line, 
which isn't secure. The passed secret can be seen by anyone running the 'ps' 
command, e.g.,

```
ps -ef

..

  501 94787 94734   0  2:32PM ?? 0:00.78 
/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
-cp 
/Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
 -Xms512M -Xmx512M 
-*Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
-Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
--executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
app-20150605143259- --worker-url 
akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker
``` 
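One commonly suggested direction, sketched here purely as an assumption (it is not necessarily what the eventual fix does): deliver the secret through the child process environment instead of a -D option, so it no longer shows up in `ps` output.

{code}
// Hypothetical executor-side sketch; SPARK_AUTH_SECRET is an assumed variable name.
val secret: Option[String] = sys.env.get("SPARK_AUTH_SECRET")
secret.foreach { s =>
  // Feed it back into the JVM system properties before the SparkConf is built.
  sys.props("spark.authenticate.secret") = s
}
{code}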




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8129) Securely pass auth secret to executors in standalone cluster mode

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8129:
---

Assignee: Apache Spark

> Securely pass auth secret to executors in standalone cluster mode
> -
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Assignee: Apache Spark
>Priority: Critical
>
> Currently, when authentication is turned on, Worker passes auth secret to 
> executors (also drivers in cluster mode) as java options on the command line, 
> which isn't secure. The passed secret can be seen by anyone running 'ps' 
> command, e.g.,
> ```
> ps -ef
> ..
>   501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> -*Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker
> ``` 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8129) Securely pass auth secret to executors in standalone cluster mode

2015-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8129:
---

Assignee: (was: Apache Spark)

> Securely pass auth secret to executors in standalone cluster mode
> -
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Priority: Critical
>
> Currently, when authentication is turned on, Worker passes auth secret to 
> executors (also drivers in cluster mode) as java options on the command line, 
> which isn't secure. The passed secret can be seen by anyone running 'ps' 
> command, e.g.,
> ```
> ps -ef
> ..
>   501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> -*Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker
> ``` 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8129) Securely pass auth secret to executors in standalone cluster mode

2015-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575323#comment-14575323
 ] 

Apache Spark commented on SPARK-8129:
-

User 'kanzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6676

> Securely pass auth secret to executors in standalone cluster mode
> -
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Priority: Critical
>
> Currently, when authentication is turned on, Worker passes auth secret to 
> executors (also drivers in cluster mode) as java options on the command line, 
> which isn't secure. The passed secret can be seen by anyone running 'ps' 
> command, e.g.,
> ```
> ps -ef
> ..
>   501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> -*Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker
> ``` 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8130) spark.files.useFetchCache should be off by default

2015-06-05 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8130:
-

 Summary: spark.files.useFetchCache should be off by default
 Key: SPARK-8130
 URL: https://issues.apache.org/jira/browse/SPARK-8130
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen
Priority: Minor


I noticed that {{spark.files.useFetchCache}} is on by default, but I think that 
we should turn it off by default since it's unlikely to improve performance for 
all but a handful of users and adds some extra IO / complexity (albeit only at 
executor / app startup).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8131) Improve Database support

2015-06-05 Thread Yin Huai (JIRA)
Yin Huai created SPARK-8131:
---

 Summary: Improve Database support
 Key: SPARK-8131
 URL: https://issues.apache.org/jira/browse/SPARK-8131
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Critical


This is the master jira for tracking the improvement on database support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8131) Improve Database support

2015-06-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8131:

Target Version/s: 1.5.0

> Improve Database support
> 
>
> Key: SPARK-8131
> URL: https://issues.apache.org/jira/browse/SPARK-8131
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> This is the master jira for tracking the improvement on database support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7943) saveAsTable in DataFrameWriter can only add table to DataBase “default”

2015-06-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-7943:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-8131

> saveAsTable in DataFrameWriter can only add table to DataBase “default”
> ---
>
> Key: SPARK-7943
> URL: https://issues.apache.org/jira/browse/SPARK-7943
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: baishuo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8105) sqlContext.table("databaseName.tableName") broke with SPARK-6908

2015-06-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8105:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-8131

> sqlContext.table("databaseName.tableName") broke with SPARK-6908
> 
>
> Key: SPARK-8105
> URL: https://issues.apache.org/jira/browse/SPARK-8105
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Spark with Hive
>Reporter: Doug Balog
>Priority: Critical
>
> Since the introduction of Dataframes in Spark 1.3.0 and prior to SPARK-6908 
> landing into master, a user could get a DataFrame to a Hive table using 
> `sqlContext.table("databaseName.tableName")` 
> Since SPARK-6908, the user now receives a NoSuchTableException.
> This amounts to a change in the non-experimental sqlContext.table() API and will 
> require user code to be modified to work properly with 1.4.0.
> The only viable workaround I could find is
> `sqlContext.sql("select * from databaseName.tableName")`
> which seems like a hack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8132) Race condition if task is cancelled with interruption while fetching file dependencies

2015-06-05 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8132:
-

 Summary: Race condition if task is cancelled with interruption 
while fetching file dependencies
 Key: SPARK-8132
 URL: https://issues.apache.org/jira/browse/SPARK-8132
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1, 1.4.0
Reporter: Josh Rosen


This is a borderline impossible-to-reproduce bug:

If {{spark.files.overwrite = false}} (the default) and a Spark executor is 
fetching large file dependencies from the driver _and_ the first task that 
triggered file dependency loading is cancelled after it has started copying / 
moving the downloaded file to its target directory, then the executor may be 
put into a bad state where all subsequent tasks fail with errors about refusing 
to overwrite an existing file because its contents differ from the file being 
fetched.

There are a few ways to mitigate this:

- Set {{spark.files.overwrite = true}}.  We should probably remove or 
deprecate this configuration: the only reason that it was added was to work 
around an obscure Spark 0.8-era bug where Spark would delete files out of the 
driver's CWD when running tasks in local mode.  This concern may have been 
mitigated by other changes.  Regardless, there are many environments where this 
feature can safely be disabled.
- Disable {{spark.files.useFetchCache}}, which should probably be off by 
default (see SPARK-8130); this will shorten the window over which the race can 
occur.
- Catch InterruptedException and perform cleanup in our file moving / copying 
code; this is somewhat tricky to reason about / get right because the right 
behavior differs based on whether we're overwriting or creating a new file.

Given that this can be fixed with conf changes for the cases that I've seen, 
I'm not sure that this needs to be a high-priority fix, although I would be 
glad to review patches to clean up / audit this code to properly fix this issue.
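A sketch of the configuration-level mitigations listed above; whether they are appropriate depends on the deployment, so treat this as an illustration rather than a recommendation:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.files.overwrite", "true")       // overwrite rather than fail on a mismatched partial file
  .set("spark.files.useFetchCache", "false")  // shrink the window for the race (see SPARK-8130)
{code}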



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3542) If spark.authenticate.secret is set it's transferred in plain text

2015-06-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3542.
--
Resolution: Not A Problem

> If spark.authenticate.secret is set it's transferred in plain text
> --
>
> Key: SPARK-3542
> URL: https://issues.apache.org/jira/browse/SPARK-3542
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: James Livingston
>
> It is already noted in the SecurityManager API docs, but when using the Akka 
> communication protocol, SSL is not currently supported and credentials can be 
> (and often are) passed in plaintext.
> Using one of the examples, you can add this and see "password" sent in 
> plaintext via the akka.tcp protocol:
>   conf.set("spark.authenticate", "true")
>   conf.set("spark.authenticate.secret", "password")
> It's obviously known, but worth having a jira to track.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


