[jira] [Comment Edited] (SPARK-17890) scala.ScalaReflectionException

2016-10-13 Thread Khalid Reid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572217#comment-15572217
 ] 

Khalid Reid edited comment on SPARK-17890 at 10/13/16 3:29 PM:
---

Hi Sean,

I've created a small project [here|https://github.com/khalidr/Spark_17890] to 
reproduce the error using spark-submit.  I noticed that things work fine when 
using an RDD but fail when I use a DataFrame.

{code}
object Main extends App{

  val conf = new SparkConf()
  conf.setMaster("local")
  val session = SparkSession.builder()
.config(conf)
.getOrCreate()

  import session.implicits._

  val df = session.sparkContext.parallelize(List(1,2,3)).toDF

  println("flatmapping ...")
  df.flatMap(_ => Seq.empty[Foo])

  println("mapping...")
  df.map(_ => Seq.empty[Foo]) //spark-submit fails here

}
case class Foo(value:String)
{code}

{noformat}
Exception in thread "main" scala.ScalaReflectionException: class Foo not found.
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
at Main$$typecreator3$1.apply(Main.scala:21)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
at Main$.delayedEndpoint$Main$1(Main.scala:21)
at Main$delayedInit$body.apply(Main.scala:5)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Main$.main(Main.scala:5)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}
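
For reference, a minimal sketch of a workaround that is often suggested for this class of 
failure (note the App/delayedInit frames in the trace): drop the {{App}} trait in favour of 
an explicit {{main}} method and keep the case class at the top level. Whether that is the 
root cause here is an assumption; the reproduction above is unchanged.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Defined at the top level, before it is needed by the encoders.
case class Foo(value: String)

object Main {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local")
    val session = SparkSession.builder().config(conf).getOrCreate()
    import session.implicits._

    val df = session.sparkContext.parallelize(List(1, 2, 3)).toDF

    println("flatmapping ...")
    df.flatMap(_ => Seq.empty[Foo])

    println("mapping...")
    df.map(_ => Seq.empty[Foo]) // previously failed here under spark-submit

    session.stop()
  }
}
{code}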



was (Author: kor):
Hi Sean,

I've created a small project [here|https://github.com/khalidr/Spark_17890] to 
reproduce the error using spark-submit.  I noticed that things work fine when 
using an RDD but fail when I use a DataFrame.

{code}
object Main extends App{

  val conf = new SparkConf()
  conf.setMaster("local")
  val session = SparkSession.builder()
.config(conf)
.getOrCreate()

  import session.implicits._

  val df = session.sparkContext.parallelize(List(1,2,3)).toDF

  println("flatmapping ...")
  df.flatMap(_ => Seq.empty[Foo])

  println("mapping...")
  df.map(_ => Seq.empty[Foo]) //spark-submit fails here

}
case class Foo(value:String)
{code}

{noformat}
Exception in thread "main" scala.ScalaReflectionException: class Foo not found.
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
at Main$$typecreator3$1.apply(Main.scala:20)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
at Main$.delayedEndpoint$Main$1(Main.scala:20)
at Main$dela

[jira] [Created] (SPARK-17911) Scheduler does not messageScheduler for ResubmitFailedStages

2016-10-13 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-17911:


 Summary: Scheduler does not messageScheduler for 
ResubmitFailedStages
 Key: SPARK-17911
 URL: https://issues.apache.org/jira/browse/SPARK-17911
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.0.0
Reporter: Imran Rashid


It's not totally clear what the purpose of the {{messageScheduler}} is in 
{{DAGScheduler}}.  Perhaps it can be eliminated completely, or perhaps we 
should just clearly document its purpose.

This comes from a long discussion with [~markhamstra] on an unrelated PR here: 
https://github.com/apache/spark/pull/15335/files/c80ad22a242255cac91cce2c7c537f9b21100f70#diff-6a9ff7fb74fd490a50462d45db2d5e11

But it's tricky, so I'm breaking it out here to archive the discussion.

Note: this issue requires a decision on what to do before a code change, so 
let's just discuss it on JIRA first.






[jira] [Comment Edited] (SPARK-17890) scala.ScalaReflectionException

2016-10-13 Thread Khalid Reid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572217#comment-15572217
 ] 

Khalid Reid edited comment on SPARK-17890 at 10/13/16 3:30 PM:
---

Hi Sean,

I've created a small project [here|https://github.com/khalidr/Spark_17890] to 
reproduce the error using spark-submit.  I noticed that things work fine when 
using an RDD but fail when I use a DataFrame.

{code}
object Main extends App{

  val conf = new SparkConf()
  conf.setMaster("local")
  val session = SparkSession.builder()
.config(conf)
.getOrCreate()

  import session.implicits._

  val df = session.sparkContext.parallelize(List(1,2,3)).toDF   

  println("flatmapping ...")
  df.flatMap(_ => Seq.empty[Foo])

  println("mapping...")
  df.map(_ => Seq.empty[Foo]) //spark-submit fails here. Things work if I 
remove the toDF call

}
case class Foo(value:String)
{code}

{noformat}
Exception in thread "main" scala.ScalaReflectionException: class Foo not found.
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
at Main$$typecreator3$1.apply(Main.scala:21)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
at Main$.delayedEndpoint$Main$1(Main.scala:21)
at Main$delayedInit$body.apply(Main.scala:5)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Main$.main(Main.scala:5)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}



was (Author: kor):
Hi Sean,

I've created a small project [here|https://github.com/khalidr/Spark_17890] to 
reproduce the error using spark-submit.  I noticed that things work fine when 
using an RDD but fail when I use a DataFrame.

{code}
object Main extends App{

  val conf = new SparkConf()
  conf.setMaster("local")
  val session = SparkSession.builder()
.config(conf)
.getOrCreate()

  import session.implicits._

  val df = session.sparkContext.parallelize(List(1,2,3)).toDF

  println("flatmapping ...")
  df.flatMap(_ => Seq.empty[Foo])

  println("mapping...")
  df.map(_ => Seq.empty[Foo]) //spark-submit fails here

}
case class Foo(value:String)
{code}

{noformat}
Exception in thread "main" scala.ScalaReflectionException: class Foo not found.
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
at Main$$typecreator3$1.apply(Main.scala:21)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
at Main$.delayedEndpoint

[jira] [Updated] (SPARK-17911) Scheduler does not need messageScheduler for ResubmitFailedStages

2016-10-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-17911:
-
Summary: Scheduler does not need messageScheduler for ResubmitFailedStages  
(was: Scheduler does not messageScheduler for ResubmitFailedStages)

> Scheduler does not need messageScheduler for ResubmitFailedStages
> -
>
> Key: SPARK-17911
> URL: https://issues.apache.org/jira/browse/SPARK-17911
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> It's not totally clear what the purpose of the {{messageScheduler}} is in 
> {{DAGScheduler}}.  Perhaps it can be eliminated completely, or perhaps we 
> should just clearly document its purpose.
> This comes from a long discussion with [~markhamstra] on an unrelated PR here: 
> https://github.com/apache/spark/pull/15335/files/c80ad22a242255cac91cce2c7c537f9b21100f70#diff-6a9ff7fb74fd490a50462d45db2d5e11
> But it's tricky, so I'm breaking it out here to archive the discussion.
> Note: this issue requires a decision on what to do before a code change, so 
> let's just discuss it on JIRA first.






[jira] [Commented] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572278#comment-15572278
 ] 

Harish commented on SPARK-17908:


Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: [columns];
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have DF say df
> df1= df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 =df.join(df1, ['key1', 'key2', 'key3'])\
>

[jira] [Commented] (SPARK-17911) Scheduler does not need messageScheduler for ResubmitFailedStages

2016-10-13 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572289#comment-15572289
 ] 

Imran Rashid commented on SPARK-17911:
--

Copying the earlier discussion on the PR here

from squito:
bq. I find myself frequently wondering about the purpose of this. It's commented 
very tersely on RESUBMIT_TIMEOUT, but I think it might be nice to add a longer 
comment here. I guess something like
bq. "If we get one fetch-failure, we often get more fetch failures across 
multiple executors. We will get better parallelism when we resubmit the 
mapStage if we can resubmit when we know about as many of those failures as 
possible. So this is a heuristic to add a small delay to see if we gather a few 
more failures before we resubmit."
bq.We do not need the delay to figure out exactly which shuffle-map outputs are 
gone on the executor -- we always mark the executor as lost on a fetch failure, 
which means we mark all its map output as gone. (This is really confusing -- it 
looks like we only remove the one shuffle-map output that was involved in the 
fetch failure, but then the entire removal is buried inside another method a 
few lines further.)
bq.I did some browsing through history, and there used to be this comment
{noformat}
 // Periodically resubmit failed stages if some map output fetches have 
failed and we have
 // waited at least RESUBMIT_TIMEOUT. We wait for this short time because 
when a node fails,
 // tasks on many other nodes are bound to get a fetch failure, and they 
won't all get it at
 // the same time, so we want to make sure we've identified all the reduce 
tasks that depend
 // on the failed node.
{noformat}
bq. at least in the current version, this also sounds like a bad reason to have 
the delay. failedStage won't be resubmitted till mapStage completes anyway, and 
then it'll look to see what tasks it is missing. Adding a tiny delay on top of 
the natural delay for mapStage seems pretty pointless.
bq. I don't even think that the reason I gave in my suggested comment is a good 
one -- do you really expect failures in multiple executors? But it is at least 
logically consistent.

from markhamstra
bq. I don't like "Periodically" in your suggested comment, since this is a 
one-shot action after a delay of RESUBMIT_TIMEOUT milliseconds.
bq.I agree that this delay-before-resubmit logic is suspect. If we are both 
thinking correctly that a 200 ms delay on top of the time to re-run the 
mapStage is all but inconsequential, then removing it in this PR would be fine. 
If there are unanticipated consequences, though, I'd prefer to have that change 
in a separate PR.

from squito
bq. yeah probably a separate PR, sorry this was just an opportunity for me to 
rant :)
bq.And sorry if I worded it poorly, but I was not suggesting the one w/ 
"Periodically" as a better comment -- in fact I think its a bad comment, just 
wanted to mention it was another description which used to be there long ago.
bq. This was my suggestion:
bq. If we get one fetch-failure, we often get more fetch failures across 
multiple executors. We will get better parallelism when we resubmit the 
mapStage if we can resubmit when we know about as many of those failures as 
possible. So this is a heuristic to add a small delay to see if we gather a few 
more failures before we resubmit.

from markhamstra
bq. Ah, sorry for ascribing the prior comment to your preferences. That comment 
actually did make sense a long time ago when the resubmitting of stages really 
was done periodically by an Akka scheduled event that fired every something 
seconds. I'm pretty sure the RESUBMIT_TIMEOUT stuff is also legacy code that 
doesn't make sense and isn't necessary any more.
bq. So, do you want to do the follow-up PR to get rid of it, or shall I?
bq. BTW, nothing wrong with your wording -- but my poor reading can create 
misunderstanding of even the clearest text.
from squito
bq. if you are willing, could you please file the follow up? I am bouncing 
between various things in my backlog -- though that change is small, I have a 
feeling it will merit extra discussion as a risky change; it would be great if 
you could drive it.
bq. Ok, I can get started on that. I believe that leaves this PR ready to merge.
bq. I think that I am going to backtrack on creating a new PR, because I think 
that the RESUBMIT_TIMEOUT actually does still make sense.
bq. If we go way back in DAGScheduler history 
(https://github.com/apache/spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala)
 we'll find that we had an event queue that was polled every 10 millis 
(POLL_TIMEOUT) and that fetch failures didn't produce separate resubmit tasks 
events, but rather we called resubmitFailedStages within the event 
polling/handling loop if 50 millis (RESUBMIT_TIMEOUT) had passed since the last 
time a FetchFailed was receiv

[jira] [Comment Edited] (SPARK-17911) Scheduler does not need messageScheduler for ResubmitFailedStages

2016-10-13 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572289#comment-15572289
 ] 

Imran Rashid edited comment on SPARK-17911 at 10/13/16 3:44 PM:


Copying the earlier discussion on the PR here

from squito:
bq. I find myself frequently wondering about the purpose of this. It's commented 
very tersely on RESUBMIT_TIMEOUT, but I think it might be nice to add a longer 
comment here. I guess something like
bq. "If we get one fetch-failure, we often get more fetch failures across 
multiple executors. We will get better parallelism when we resubmit the 
mapStage if we can resubmit when we know about as many of those failures as 
possible. So this is a heuristic to add a small delay to see if we gather a few 
more failures before we resubmit."
bq.We do not need the delay to figure out exactly which shuffle-map outputs are 
gone on the executor -- we always mark the executor as lost on a fetch failure, 
which means we mark all its map output as gone. (This is really confusing -- it 
looks like we only remove the one shuffle-map output that was involved in the 
fetch failure, but then the entire removal is buried inside another method a 
few lines further.)
bq.I did some browsing through history, and there used to be this comment
{noformat}
 // Periodically resubmit failed stages if some map output fetches have 
failed and we have
 // waited at least RESUBMIT_TIMEOUT. We wait for this short time because 
when a node fails,
 // tasks on many other nodes are bound to get a fetch failure, and they 
won't all get it at
 // the same time, so we want to make sure we've identified all the reduce 
tasks that depend
 // on the failed node.
{noformat}
bq. at least in the current version, this also sounds like a bad reason to have 
the delay. failedStage won't be resubmitted till mapStage completes anyway, and 
then it'll look to see what tasks it is missing. Adding a tiny delay on top of 
the natural delay for mapStage seems pretty pointless.
bq. I don't even think that the reason I gave in my suggested comment is a good 
one -- do you really expect failures in multiple executors? But it is at least 
logically consistent.

from markhamstra
bq. I don't like "Periodically" in your suggested comment, since this is a 
one-shot action after a delay of RESUBMIT_TIMEOUT milliseconds.
bq.I agree that this delay-before-resubmit logic is suspect. If we are both 
thinking correctly that a 200 ms delay on top of the time to re-run the 
mapStage is all but inconsequential, then removing it in this PR would be fine. 
If there are unanticipated consequences, though, I'd prefer to have that change 
in a separate PR.

from squito
bq. yeah probably a separate PR, sorry this was just an opportunity for me to 
rant :)
bq.And sorry if I worded it poorly, but I was not suggesting the one w/ 
"Periodically" as a better comment -- in fact I think its a bad comment, just 
wanted to mention it was another description which used to be there long ago.
bq. This was my suggestion:
bq. If we get one fetch-failure, we often get more fetch failures across 
multiple executors. We will get better parallelism when we resubmit the 
mapStage if we can resubmit when we know about as many of those failures as 
possible. So this is a heuristic to add a small delay to see if we gather a few 
more failures before we resubmit.

from markhamstra
bq. Ah, sorry for ascribing the prior comment to your preferences. That comment 
actually did make sense a long time ago when the resubmitting of stages really 
was done periodically by an Akka scheduled event that fired every something 
seconds. I'm pretty sure the RESUBMIT_TIMEOUT stuff is also legacy code that 
doesn't make sense and isn't necessary any more.
bq. So, do you want to do the follow-up PR to get rid of it, or shall I?
bq. BTW, nothing wrong with your wording -- but my poor reading can create 
misunderstanding of even the clearest text.
from squito
bq. if you are willing, could you please file the follow up? I am bouncing 
between various things in my backlog -- though that change is small, I have a 
feeling it will merit extra discussion as a risky change; it would be great if 
you could drive it.
from markhamstra
bq. Ok, I can get started on that. I believe that leaves this PR ready to merge.
from markhamstra
bq. I think that I am going to backtrack on creating a new PR, because I think 
that the RESUBMIT_TIMEOUT actually does still make sense.
bq. If we go way back in DAGScheduler history 
(https://github.com/apache/spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala)
 we'll find that we had an event queue that was polled every 10 millis 
(POLL_TIMEOUT) and that fetch failures didn't produce separate resubmit tasks 
events, but rather we called resubmitFailedStages within the event 
polling/handling loop if 

[jira] [Created] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch

2016-10-13 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-17912:


 Summary: Refactor code generation to get data for 
ColumnVector/ColumnarBatch
 Key: SPARK-17912
 URL: https://issues.apache.org/jira/browse/SPARK-17912
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
Reporter: Kazuaki Ishizaki


Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is 
becoming pervasive. The code generation part can be reused by multiple 
components (e.g. parquet reader, data cache, and so on).
This JIRA refactors the code generation part as a trait for ease of reuse.
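
For illustration, a rough sketch of the shape such a trait could take; everything below is 
hypothetical and only meant to show the reuse pattern, not the actual change.

{code}
// Hypothetical sketch -- the trait and method names are made up, not taken from
// the actual patch. The idea is a mix-in that centralises generation of the Java
// source used to read a value out of a ColumnVector, so that the Parquet reader,
// data cache, and other consumers can share it instead of duplicating templates.
trait ColumnVectorCodegenSupport {
  // Returns the Java source snippet reading row `rowIdx` of `vectorVar`
  // for the given primitive type name.
  def genGetValue(vectorVar: String, rowIdx: String, dataType: String): String =
    dataType match {
      case "boolean" => s"$vectorVar.getBoolean($rowIdx)"
      case "int"     => s"$vectorVar.getInt($rowIdx)"
      case "long"    => s"$vectorVar.getLong($rowIdx)"
      case "double"  => s"$vectorVar.getDouble($rowIdx)"
      case other     => sys.error(s"unsupported type in this sketch: $other")
    }
}

object CodegenSketch extends ColumnVectorCodegenSupport {
  def main(args: Array[String]): Unit =
    println(genGetValue("colVector", "rowIdx", "int")) // prints colVector.getInt(rowIdx)
}
{code}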






[jira] [Commented] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572330#comment-15572330
 ] 

Sean Owen commented on SPARK-17908:
---

Yes, what's your code? This says one DF has just 'columns' as a column. 

> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have DF say df
> df1= df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 =df.join(df1, ['key1', 'key2', 'key3'])\
>  .withcolumn('newcol', func.col('val')/func.col('total'))
> I am getting key2 is not present in df1, which is not true because df1.show() 
> is showing the data with key2.
> Then I added this code before the join -- df1 = df1.columnRenamed('key2', 'key2') 
> renamed with the same name. Then it works.
> The stack trace says the column is missing, but it is not.






[jira] [Commented] (SPARK-17911) Scheduler does not need messageScheduler for ResubmitFailedStages

2016-10-13 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572344#comment-15572344
 ] 

Imran Rashid commented on SPARK-17911:
--

bq. In other words, handling a ResubmitFailedStages event should be quick, and 
causes failedStages to be cleared, allowing the next ResubmitFailedStages event 
to be posted from the handling of another FetchFailed. If there are the 
expected lot of fetch failures for a single stage, and there is no 
RESUBMIT_TIMEOUT, then it is quite likely that there will be a burst of 
resubmit events (and corresponding log messages) and submitStage calls made in 
rapid succession.

Lemme rephrase your comment to make sure I understand it.

The messageScheduler and delay do *not* affect correctness, or even what 
actually gets resubmitted.  When we resubmit {{mapStage}}, we always resubmit 
all tasks corresponding to shuffle map output on the failed executor.  And when 
we resubmit the {{failedStage}}, there is probably a long enough delay from 
{{mapStage}} that waiting 200ms is relatively inconsequential.

However, it *does* affect the logging.  If the scheduler event queue is 
relatively empty, then as the fetch failures trickle in, for each one we'd post 
a Resubmit event which gets handled relatively quickly.  So each fetch failure 
would trigger another resubmit event and more logging.  That is undesirable, 
both because of the noise in the logs and because it would create an 
unnecessary flood of events on the scheduler event queue.

Is that a fair summary?

I agree with everything said there, but then I'd request we take one of two 
actions:

1) Change the resubmit logic -- in addition to checking failed stages, you can 
also check {{waitingStages}} and {{runningStages}}.  That is what happens 
eventually anyway inside {{resubmitFailedStages}}.  This would actually be even 
better for decreasing noise in the logs etc.
I know this may seem like a small thing to make such a big deal about, but I 
honestly think this is confusing enough that it's worth cleaning up -- 
eliminating an unneeded event queue is, I think, a significant win.

2) If we don't do that, let's at least add a better comment explaining the 
purpose (maybe just a pointer to this JIRA at this point).
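
For reference, a self-contained toy model of the debounce behaviour described above -- a 
burst of fetch failures produces a single delayed resubmit. The constant and field names 
are placeholders echoing the discussion; this is a sketch, not the DAGScheduler code.

{code}
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable

// The first fetch failure in a burst schedules one delayed resubmit action;
// later failures only accumulate in failedStages until it fires.
object ResubmitDebounceSketch {
  private val RESUBMIT_TIMEOUT_MS = 200L
  private val messageScheduler = Executors.newScheduledThreadPool(1)
  private val failedStages = mutable.Set.empty[Int]

  def onFetchFailed(stageId: Int): Unit = synchronized {
    val noResubmitPending = failedStages.isEmpty
    failedStages += stageId
    if (noResubmitPending) {
      messageScheduler.schedule(new Runnable {
        override def run(): Unit = resubmitFailedStages()
      }, RESUBMIT_TIMEOUT_MS, TimeUnit.MILLISECONDS)
    }
  }

  private def resubmitFailedStages(): Unit = synchronized {
    println(s"resubmitting stages: $failedStages") // one log line for the whole burst
    failedStages.clear()
  }

  def main(args: Array[String]): Unit = {
    (1 to 5).foreach(onFetchFailed) // burst of fetch failures -> one resubmit
    Thread.sleep(500)
    messageScheduler.shutdown()
  }
}
{code}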

> Scheduler does not need messageScheduler for ResubmitFailedStages
> -
>
> Key: SPARK-17911
> URL: https://issues.apache.org/jira/browse/SPARK-17911
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> It's not totally clear what the purpose of the {{messageScheduler}} is in 
> {{DAGScheduler}}.  Perhaps it can be eliminated completely, or perhaps we 
> should just clearly document its purpose.
> This comes from a long discussion with [~markhamstra] on an unrelated PR here: 
> https://github.com/apache/spark/pull/15335/files/c80ad22a242255cac91cce2c7c537f9b21100f70#diff-6a9ff7fb74fd490a50462d45db2d5e11
> But it's tricky, so I'm breaking it out here to archive the discussion.
> Note: this issue requires a decision on what to do before a code change, so 
> let's just discuss it on JIRA first.






[jira] [Comment Edited] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572278#comment-15572278
 ] 

Harish edited comment on SPARK-17908 at 10/13/16 4:09 PM:
--

Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: ['key1', 'key2', 'key3', 'total'];
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


was (Author: harishk15):
Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: [columns];
at 
org.apache.spark.sql.ca

[jira] [Comment Edited] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572278#comment-15572278
 ] 

Harish edited comment on SPARK-17908 at 10/13/16 4:13 PM:
--

Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: ['key1', 'key2', 'key3', 'total'  and df coumns];
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


was (Author: harishk15):
Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: ['key1', 'key2', 'key3', 't

[jira] [Comment Edited] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572278#comment-15572278
 ] 

Harish edited comment on SPARK-17908 at 10/13/16 4:12 PM:
--

Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: ['key1', 'key2', 'key3', 'total'  and df1 coumns];
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


was (Author: harishk15):
Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: ['key1', 'key2', 'key3', '

[jira] [Commented] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572373#comment-15572373
 ] 

Harish commented on SPARK-17908:


Sorry, I didn't put the actual column names from my code in the stack trace; I have 
modified the first line of the stack trace to show the column names.

> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have DF say df
> df1= df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 =df.join(df1, ['key1', 'key2', 'key3'])\
>  .withcolumn('newcol', func.col('val')/func.col('total'))
> I am getting key2 is not present in df1, which is not true because df1.show() 
> is showing the data with key2.
> Then I added this code before the join -- df1 = df1.columnRenamed('key2', 'key2') 
> renamed with the same name. Then it works.
> The stack trace says the column is missing, but it is not.






[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572377#comment-15572377
 ] 

Shivaram Venkataraman commented on SPARK-17904:
---

+1, I think this sounds good [~yanboliang]. We could also offer a contract to 
load these packages before running UDFs? Otherwise the UDFs need to use 
`library(Matrix)` in every function.

> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run UDFs on a SparkDataFrame.
> If users use third-party libraries inside the function passed 
> into {{spark.lapply}} or {{dapply}}, they should install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that it succeeded, 
> we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test )
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may load lots 
> of libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} throughout the interactive session.
> Should we provide one API to wrap the work mentioned above, so that users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install the packages on 
> each executor, I think it will not pollute the native R environment. I'd 
> like to know whether this makes sense, and feel free to correct me if there is a 
> misunderstanding.






[jira] [Comment Edited] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572373#comment-15572373
 ] 

Harish edited comment on SPARK-17908 at 10/13/16 4:17 PM:
--

Sorry, I didn't put the actual column names from my code in the stack trace; I have 
modified the first line of the stack trace to show the column names. I am confirming that I 
can see the column name in the columns list, e.g. key2#20202 is reported missing from 
key2#20202, key1#key2#23723, key3#20342, etc.


was (Author: harishk15):
Sorry, I didn't put the actual column names from my code in the stack trace; I have 
modified the first line of the stack trace to show the column names.

> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have DF say df
> df1= df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 =df.join(df1, ['key1', 'key2', 'key3'])\
>  .withcolumn('newcol', func.col('val')/func.col('total'))
> I am getting key2 is not present in df1, which is not true because df1.show() 
> is showing the data with key2.
> Then I added this code before the join -- df1 = df1.columnRenamed('key2', 'key2') 
> renamed with the same name. Then it works.
> The stack trace says the column is missing, but it is not.






[jira] [Commented] (SPARK-17192) Issuing an exception when users specify the partitioning columns without a given schema

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572394#comment-15572394
 ] 

Apache Spark commented on SPARK-17192:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/15467

> Issuing an exception when users specify the partitioning columns without a 
> given schema
> ---
>
> Key: SPARK-17192
> URL: https://issues.apache.org/jira/browse/SPARK-17192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> We need to issue an exception when users specify the partitioning columns 
> without a given schema.






[jira] [Commented] (SPARK-14561) History Server does not see new logs in S3

2016-10-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572401#comment-15572401
 ] 

Steve Loughran commented on SPARK-14561:


To clarify: it's not changes to existing files that aren't showing up, *it is 
new files added to the same destination directory*.

If that's the case, something is up with the scanning:

# set the logging of org.apache.spark.deploy.history.FsHistoryProvider to 
debug
# have a look at the scan interval. Is it too long? 
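
Concretely, that would be something like the following (assuming the stock log4j.properties 
setup and the {{spark.history.fs.update.interval}} setting; adjust to your deployment):

{noformat}
# conf/log4j.properties -- debug logging for the history provider
log4j.logger.org.apache.spark.deploy.history.FsHistoryProvider=DEBUG

# conf/spark-defaults.conf on the history server -- how often the log directory is rescanned
spark.history.fs.update.interval  10s
{noformat}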

> History Server does not see new logs in S3
> --
>
> Key: SPARK-14561
> URL: https://issues.apache.org/jira/browse/SPARK-14561
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Miles Crawford
>
> If you set the Spark history server to use a log directory with an s3a:// 
> url, everything appears to work fine at first, but new log files written by 
> applications are not picked up by the server.






[jira] [Commented] (SPARK-9004) Add s3 bytes read/written metrics

2016-10-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572414#comment-15572414
 ] 

Steve Loughran commented on SPARK-9004:
---

HADOOP-13605 added a whole new set of counters for HDFS, S3 and hopefully soon 
Azure; there's an API call on the FS {{getStorageStatistics()}} to query these.

One problem though: this isn't shipping in Hadoop branch-2 yet, so you can't 
write code that uses it, not unless there's some introspection/plugin 
mechanism. 

All the stats are just {{name: String -> value: Long}}, so something that 
collects a {{Map[String, Long]}} would work. 
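
For example, something along these lines -- a sketch only; the Hadoop-side method names 
(getStorageStatistics, getLongStatistics, getName, getValue) are assumptions based on 
HADOOP-13605 and need to be checked against the final API:

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

object StorageStatsSketch {
  // Flatten whatever per-FS counters the new API exposes into a Map[String, Long].
  // Method names here are assumptions based on HADOOP-13605, not verified signatures.
  def collectStorageStats(fs: FileSystem): Map[String, Long] =
    fs.getStorageStatistics
      .getLongStatistics.asScala
      .map(stat => stat.getName -> stat.getValue)
      .toMap
}
{code}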

> Add s3 bytes read/written metrics
> -
>
> Key: SPARK-9004
> URL: https://issues.apache.org/jira/browse/SPARK-9004
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Abhishek Modi
>Priority: Minor
>
> s3 read/write metrics can be pretty useful in finding the total aggregate 
> data processed






[jira] [Commented] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572421#comment-15572421
 ] 

Sean Owen commented on SPARK-17908:
---

You must be doing something different from what you show, because what you show 
doesn't even quite compile. Here I adapted your example to the Python example in 
the docs and ran it successfully on 2.0.1:

{code}
import pyspark.sql.functions as func
from pyspark.sql.types import *

sc = spark.sparkContext
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: (p[0], p[1].strip()))
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in 
schemaString.split()]
schema = StructType(fields)
df = spark.createDataFrame(people, schema)

df1 = df.groupBy('name', 'age').agg(func.count(func.col('age')).alias('total'))
df3 = df.join(df1, ['name', 'age']).withColumn('newcol', 
func.col('age')/func.col('total'))
{code}


> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have a DataFrame, say df:
> df1 = df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 = df.join(df1, ['key1', 'key2', 'key3'])\
>  .withColumn('newcol', func.col('val')/func.col('total'))
> I get an error saying key2 is not present in df1, which is not true because 
> df1.show() does show the data with key2.
> Then I added this code before the join -- df1 = df1.withColumnRenamed('key2', 
> 'key2') -- i.e. renamed the column to the same name. Then it works.
> The stack trace says a column is missing, but it is not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17714) ClassCircularityError is thrown when using org.apache.spark.util.Utils.classForName 

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572435#comment-15572435
 ] 

Sean Owen commented on SPARK-17714:
---

This is resolved now, right?

> ClassCircularityError is thrown when using 
> org.apache.spark.util.Utils.classForName 
> 
>
> Key: SPARK-17714
> URL: https://issues.apache.org/jira/browse/SPARK-17714
> Project: Spark
>  Issue Type: Bug
>Reporter: Weiqing Yang
>
> This JIRA is a follow-up to 
> [SPARK-15857|https://issues.apache.org/jira/browse/SPARK-15857].
> Task invokes CallerContext.setCurrentContext() to set its callerContext to 
> HDFS. In setCurrentContext(), it tries looking for the class 
> {{org.apache.hadoop.ipc.CallerContext}} by using 
> {{org.apache.spark.util.Utils.classForName}}. This causes 
> ClassCircularityError to be thrown when running ReplSuite in master Maven 
> builds (The same tests pass in the SBT build). A hotfix 
> [SPARK-17710|https://issues.apache.org/jira/browse/SPARK-17710] has been made 
> by using Class.forName instead, but it needs further investigation.
> Error:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.3/2000/testReport/junit/org.apache.spark.repl/ReplSuite/simple_foreach_with_accumulator/
> {code}
> scala> accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, 
> name: None, value: 0)
> scala> org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, localhost): java.lang.ClassCircularityError: 
> io/netty/util/internal/_matchers_/org/apache/spark/network/protocol/MessageMatcher
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:62)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:54)
> at 
> io.netty.util.internal.TypeParameterMatcher.get(TypeParameterMatcher.java:42)
> at 
> io.netty.util.internal.TypeParameterMatcher.find(TypeParameterMatcher.java:78)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.<init>(MessageToMessageEncoder.java:59)
> at 
> org.apache.spark.network.protocol.MessageEncoder.<init>(MessageEncoder.java:34)
> at org.apache.spark.network.TransportContext.<init>(TransportContext.java:78)
> at 
> org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:354)
> at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
> at 
> org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:80)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:62)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:54)
> at 
> io.netty.util.internal.TypeParameterMatcher.get(TypeParameterMatcher.java:42)
> at 
> io.netty.util.internal.TypeParameterMatcher.find(TypeParameterMatcher.java:78)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.<init>(MessageToMessageEncoder.java:59)
> at 
> org.apache.spark.network.protocol.MessageEncoder.<init>(MessageEncoder.java:34)
> at org.apache.spark.network.TransportContext.<init>(TransportContext.java:78)
> at 
> org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:354)
> at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
> at 
> org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:80)
> at java.lang.Cl

[jira] [Commented] (SPARK-12571) AWS credentials not available for read.parquet in SQLContext

2016-10-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572442#comment-15572442
 ] 

Steve Loughran commented on SPARK-12571:


This means the credentials aren't at the far end, either in the XML (or, in later 
Hadoop versions, env vars or IAM role data).

How was the code deployed? Standalone, or via YARN?
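
One way to rule out configuration propagation while debugging (a sketch only, 
assuming s3a and the shell-provided {{sc}}/{{sqlContext}}; the bucket path and 
environment variable names are placeholders) is to push the credentials into the 
Hadoop configuration on the driver before reading:

{code}
// Set s3a credentials explicitly instead of relying on core-site.xml or env vars.
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val df = sqlContext.read.parquet("s3a://some-bucket/some/path/")
{code}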

> AWS credentials not available for read.parquet in SQLContext
> 
>
> Key: SPARK-12571
> URL: https://issues.apache.org/jira/browse/SPARK-12571
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: repeated with s3n and s3a on hadoop 2.6 and hadoop 2.7.1
>Reporter: Kostiantyn Kudriavtsev
>
> com.amazonaws.AmazonClientException: Unable to load AWS credentials from any 
> provider in the chain
> at 
> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
> at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
> at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
> at 
> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:384)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
> at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
> at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8437) Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

2016-10-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572468#comment-15572468
 ] 

Steve Loughran commented on SPARK-8437:
---

Just came across this by way of comments in the source. This *shouldn't* happen; 
the glob code ought to recognise when there is no wildcard and exit early, faster 
than if there was a wildcard. If it's taking longer, then the full listing process 
is making a mess of things, or somehow the result generation is playing up. If 
it was S3 only I'd blame the S3 APIs, but this sounds like S3 just amplifies a 
problem that may already exist.

How big was the directory where this surfaced? Was it deep, wide, or deep & wide?
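
For illustration only (this is not the actual Hadoop code, just the shape of the 
fast path described above): when the supplied path contains no glob 
metacharacters, a single {{getFileStatus}} call should be enough, with the full 
glob expansion reserved for real wildcards.

{code}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Sketch: take the cheap path when there is nothing to expand.
def listMaybeGlob(fs: FileSystem, path: Path): Array[FileStatus] = {
  val hasGlob = path.toString.exists(c => "{}[]*?\\".indexOf(c) >= 0)
  if (hasGlob) fs.globStatus(path)           // full glob machinery
  else Array(fs.getFileStatus(path))         // single round trip, no expansion
}
{code}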

> Using directory path without wildcard for filename slow for large number of 
> files with wholeTextFiles and binaryFiles
> -
>
> Key: SPARK-8437
> URL: https://issues.apache.org/jira/browse/SPARK-8437
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.3.1, 1.4.0
> Environment: Ubuntu 15.04 + local filesystem
> Amazon EMR + S3 + HDFS
>Reporter: Ewan Leith
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.4.1, 1.5.0
>
>
> When calling wholeTextFiles or binaryFiles with a directory path with 10,000s 
> of files in it, Spark hangs for a few minutes before processing the files.
> If you add a * to the end of the path, there is no delay.
> This happens for me on Spark 1.3.1 and 1.4 on the local filesystem, HDFS, and 
> on S3.
> To reproduce, create a directory with 50,000 files in it, then run:
> val a = sc.binaryFiles("file:/path/to/files/")
> a.count()
> val b = sc.binaryFiles("file:/path/to/files/*")
> b.count()
> and monitor the different startup times.
> For example, in the spark-shell these commands are pasted in together, so the 
> delay at f.count() is from 10:11:08 to 10:13:29 to output "Total input paths 
> to process : 4", then until 10:15:42 to begin processing files:
> scala> val f = sc.binaryFiles("file:/home/ewan/large/")
> 15/06/18 10:11:07 INFO MemoryStore: ensureFreeSpace(160616) called with 
> curMem=0, maxMem=278019440
> 15/06/18 10:11:07 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 156.9 KB, free 265.0 MB)
> 15/06/18 10:11:08 INFO MemoryStore: ensureFreeSpace(17282) called with 
> curMem=160616, maxMem=278019440
> 15/06/18 10:11:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.9 KB, free 265.0 MB)
> 15/06/18 10:11:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:40430 (size: 16.9 KB, free: 265.1 MB)
> 15/06/18 10:11:08 INFO SparkContext: Created broadcast 0 from binaryFiles at 
> <console>:21
> f: org.apache.spark.rdd.RDD[(String, 
> org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/ 
> BinaryFileRDD[0] at binaryFiles at <console>:21
> scala> f.count()
> 15/06/18 10:13:29 INFO FileInputFormat: Total input paths to process : 4
> 15/06/18 10:15:42 INFO FileInputFormat: Total input paths to process : 4
> 15/06/18 10:15:42 INFO CombineFileInputFormat: DEBUG: Terminated node 
> allocation with : CompletedNodes: 1, size left: 0
> 15/06/18 10:15:42 INFO SparkContext: Starting job: count at <console>:24
> 15/06/18 10:15:42 INFO DAGScheduler: Got job 0 (count at <console>:24) with 
> 4 output partitions (allowLocal=false)
> 15/06/18 10:15:42 INFO DAGScheduler: Final stage: ResultStage 0(count at 
> <console>:24)
> 15/06/18 10:15:42 INFO DAGScheduler: Parents of final stage: List()
> Adding a * to the end of the path removes the delay:
> scala> val f = sc.binaryFiles("file:/home/ewan/large/*")
> 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(160616) called with 
> curMem=0, maxMem=278019440
> 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 156.9 KB, free 265.0 MB)
> 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(17309) called with 
> curMem=160616, maxMem=278019440
> 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.9 KB, free 265.0 MB)
> 15/06/18 10:08:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:42825 (size: 16.9 KB, free: 265.1 MB)
> 15/06/18 10:08:29 INFO SparkContext: Created broadcast 0 from binaryFiles at 
> <console>:21
> f: org.apache.spark.rdd.RDD[(String, 
> org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/* 
> BinaryFileRDD[0] at binaryFiles at <console>:21
> scala> f.count()
> 15/06/18 10:08:32 INFO FileInputFormat: Total input paths to process : 4
> 15/06/18 10:08:33 INFO FileInputFormat: Total input paths to process : 4
> 15/06/18 10:08:35 INFO CombineFileInputFormat: DEBUG: Terminated node 
> allocation wi

[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors

2016-10-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572472#comment-15572472
 ] 

Shivaram Venkataraman commented on SPARK-17902:
---

Good catch - looks like this was changed in 
https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c 
which is part of 1.6.0. Do you have a small test case that fails?

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems it is completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572478#comment-15572478
 ] 

Harish commented on SPARK-17908:


Yes, your code structure is the same as mine, but I have 70M records with 1000 
columns. It works with simple joins as above, but when you modify the DF 
multiple times this happens. I have been getting this error since 1.6.0 but 
didn't report it because I couldn't prove it with a working use case; it happens 
frequently with my code, so I tried the rename workaround.

Here are my steps:
df = df.select('key1', 'key2', 'key3', 'val', 'total')  # ~70 million records
df = df.withColumn('key2', func.lit('ABC'))
df1 = df.groupBy('key1', 'key2', 
'key3').agg(func.count(func.col('val')).alias('total'))
df1 = df1.withColumnRenamed('key2', 'key2')
df3 = df.join(df1, ['key1', 'key2', 'key3'])\
.withColumn('newcol', func.col('val')/func.col('total'))


I just wanted to see if anyone else has observed this behavior. I will try to find 
a code sample to prove the issue; if not, in another 1-2 days I will mark it 
not reproducible.



> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have a DataFrame, say df:
> df1 = df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 = df.join(df1, ['key1', 'key2', 'key3'])\
>  .withColumn('newcol', func.col('val')/func.col('total'))
> I get an error saying key2 is not present in df1, which is not true because 
> df1.show() does show the data with key2.
> Then I added this code before the join -- df1 = df1.withColumnRenamed('key2', 
> 'key2') -- i.e. renamed the column to the same name. Then it works.
> The stack trace says a column is missing, but it is not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572478#comment-15572478
 ] 

Harish edited comment on SPARK-17908 at 10/13/16 4:58 PM:
--

Yes, your code structure is the same as mine, but I have 70M records with 1000 
columns. It works with simple joins as above, but when you modify the DF 
multiple times this happens. I have been getting this error since 1.6.0 but 
didn't report it because I couldn't prove it with a working use case; it happens 
frequently with my code, so I tried the rename workaround.

Here are my steps:
df = df.select('key1', 'key2', 'key3', 'val', 'total')  # ~70 million records
df = df.withColumn('key2', func.lit('ABC'))
df1 = df.groupBy('key1', 'key2', 
'key3').agg(func.count(func.col('val')).alias('total'))
df1 = df1.withColumnRenamed('key2', 'key2')
df3 = df.join(df1, ['key1', 'key2', 'key3'])\
.withColumn('newcol', func.col('val')/func.col('total'))


I just wanted to see if anyone else has observed this behavior. I will try to find 
a code sample to prove the issue; if not, in another 1-2 days I will mark it 
not reproducible.




was (Author: harishk15):
Yes. You are code structure is same as mine.. But i have 70M records with 1000 
columns. It works with simple joins as above. But when you try to modify the DF 
multiple times this will happen, as i was getting this error from 1.6.0 but i 
didn't raise because i cant prove this with working use case. But it happens 
frequently with my code so i tried with rename

Here my steps:
df = df.select('key1', 'key2', 'key3', 'val','total') -70Million records
df =df.withColumn('key2', 'ABC')
df1= df.groupBy('key1', 'key2', 
'key3').agg(func.count(func.col('val')).alias('total'))
df1 = df1.columnRenamed('key2', 'key2')
df3 =df.join(df1, ['key1', 'key2', 'key3'])\
.withcolumn('newcol', func.col('val')/func.col('total'))


I just wanted to see if any one else observed this behavior, I will try to find 
the code sample to proof this issue. If not in another 1-2 days i will mark it 
not reproducible.  



> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have a DataFrame, say df:
> df1 = df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 = df.join(df1, ['key1', 'key2', 'key3'])\
>  .withColumn('newcol', func.col('val')/func.col('total'))
> I get an error saying key2 is not present in df1, which is not true because 
> df1.show() does show the data with key2.
> Then I added this code before the join -- df1 = df1.withColumnRenamed('key2', 
> 'key2') -- i.e. renamed the column to the same name. Then it works.
> The stack trace says a column is missing, but it is not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572541#comment-15572541
 ] 

Felix Cheung commented on SPARK-17904:
--

I somewhat disagree, actually. In R, it is very common to use package 
management like
https://rstudio.github.io/packrat/

that is not very different from how Python does it.

In fact, Anaconda works with R too:
https://www.continuum.io/blog/developer-blog/anaconda-r-users-sparkr-and-rbokeh


> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run a UDF on a SparkDataFrame.
> If users use third-party libraries inside the function passed into 
> {{spark.lapply}} or {{dapply}}, they need to install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that it succeeded, 
> we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> function(x) install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may call lots 
> of libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API that wraps the work mentioned above, so users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install packages on 
> each executor, I think it will not pollute the native R environment. I'd like 
> to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2016-10-13 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572545#comment-15572545
 ] 

Gayathri Murali commented on SPARK-12664:
-

[~yanboliang] I am not working on this. Please feel free to take it

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572541#comment-15572541
 ] 

Felix Cheung edited comment on SPARK-17904 at 10/13/16 5:09 PM:


I somewhat disagree, actually. In R, it is very common to use package 
management like
https://rstudio.github.io/packrat/

that is not very different from how Python does it.

In fact, Anaconda works with R too:
https://www.continuum.io/blog/developer-blog/anaconda-r-users-sparkr-and-rbokeh

To me, I think we have questions on whether Spark should get into the business 
of package management as [~srowen] has pointed out, or not.

Also there are challenges with how Spark does not have access to all 
nodes/executors (because it is not a cluster manager) and issues with dynamic 
resource allocations and so on as others have pointed out.



was (Author: felixcheung):
I somewhat disagree, actually. In R, it is very common to use package 
management like
https://rstudio.github.io/packrat/

that is not very different from how Python does it.

In fact, Anaconda works with R too:
https://www.continuum.io/blog/developer-blog/anaconda-r-users-sparkr-and-rbokeh


> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run a UDF on a SparkDataFrame.
> If users use third-party libraries inside the function passed into 
> {{spark.lapply}} or {{dapply}}, they need to install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that it succeeded, 
> we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> function(x) install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may call lots 
> of libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API that wraps the work mentioned above, so users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install packages on 
> each executor, I think it will not pollute the native R environment. I'd like 
> to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572541#comment-15572541
 ] 

Felix Cheung edited comment on SPARK-17904 at 10/13/16 5:15 PM:


I somewhat disagree, actually. In R, it is very common to use package 
management like
https://rstudio.github.io/packrat/
that is not very different from how Python does it.

In fact, Anaconda works with R too:
https://www.continuum.io/conda-for-r
https://www.continuum.io/blog/developer-blog/anaconda-r-users-sparkr-and-rbokeh

Other examples:
https://msdn.microsoft.com/en-us/microsoft-r/deployr-admin-r-package-management
http://blog.revolutionanalytics.com/2014/10/introducing-rrt.html

To me, the open question is whether Spark should get into the business of 
package management at all, as [~srowen] has pointed out.

There are also challenges: Spark does not have access to all nodes/executors 
(because it is not a cluster manager), and there are issues with dynamic 
resource allocation and so on, as others have pointed out.



was (Author: felixcheung):
I somewhat disagree, actually. In R, it is very common to use package 
management like
https://rstudio.github.io/packrat/

that is not very different from how Python does it.

In fact, Anaconda works with R too:
https://www.continuum.io/blog/developer-blog/anaconda-r-users-sparkr-and-rbokeh

To me, I think we have questions on whether Spark should get into the business 
of package management as [~srowen] has pointed out, or not.

Also there are challenges with how Spark does not have access to all 
nodes/executors (because it is not a cluster manager) and issues with dynamic 
resource allocations and so on as others have pointed out.


> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run a UDF on a SparkDataFrame.
> If users use third-party libraries inside the function passed into 
> {{spark.lapply}} or {{dapply}}, they need to install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that it succeeded, 
> we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> function(x) install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may call lots 
> of libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API that wraps the work mentioned above, so users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install packages on 
> each executor, I think it will not pollute the native R environment. I'd like 
> to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17714) ClassCircularityError is thrown when using org.apache.spark.util.Utils.classForName 

2016-10-13 Thread Weiqing Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572580#comment-15572580
 ] 

Weiqing Yang commented on SPARK-17714:
--

Not yet; I need to investigate more. Could we pull in people more familiar with 
the REPL classloader code? Thanks.

> ClassCircularityError is thrown when using 
> org.apache.spark.util.Utils.classForName 
> 
>
> Key: SPARK-17714
> URL: https://issues.apache.org/jira/browse/SPARK-17714
> Project: Spark
>  Issue Type: Bug
>Reporter: Weiqing Yang
>
> This JIRA is a follow-up to 
> [SPARK-15857|https://issues.apache.org/jira/browse/SPARK-15857].
> Task invokes CallerContext.setCurrentContext() to set its callerContext to 
> HDFS. In setCurrentContext(), it tries looking for the class 
> {{org.apache.hadoop.ipc.CallerContext}} by using 
> {{org.apache.spark.util.Utils.classForName}}. This causes 
> ClassCircularityError to be thrown when running ReplSuite in master Maven 
> builds (The same tests pass in the SBT build). A hotfix 
> [SPARK-17710|https://issues.apache.org/jira/browse/SPARK-17710] has been made 
> by using Class.forName instead, but it needs further investigation.
> Error:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.3/2000/testReport/junit/org.apache.spark.repl/ReplSuite/simple_foreach_with_accumulator/
> {code}
> scala> accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, 
> name: None, value: 0)
> scala> org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, localhost): java.lang.ClassCircularityError: 
> io/netty/util/internal/_matchers_/org/apache/spark/network/protocol/MessageMatcher
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:62)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:54)
> at 
> io.netty.util.internal.TypeParameterMatcher.get(TypeParameterMatcher.java:42)
> at 
> io.netty.util.internal.TypeParameterMatcher.find(TypeParameterMatcher.java:78)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.<init>(MessageToMessageEncoder.java:59)
> at 
> org.apache.spark.network.protocol.MessageEncoder.<init>(MessageEncoder.java:34)
> at org.apache.spark.network.TransportContext.<init>(TransportContext.java:78)
> at 
> org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:354)
> at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
> at 
> org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:80)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:62)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:54)
> at 
> io.netty.util.internal.TypeParameterMatcher.get(TypeParameterMatcher.java:42)
> at 
> io.netty.util.internal.TypeParameterMatcher.find(TypeParameterMatcher.java:78)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.<init>(MessageToMessageEncoder.java:59)
> at 
> org.apache.spark.network.protocol.MessageEncoder.<init>(MessageEncoder.java:34)
> at org.apache.spark.network.TransportContext.<init>(TransportContext.java:78)
> at 
> org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:354)
> at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
> at 
> org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
> at 
> org.apa

[jira] [Created] (SPARK-17913) Filter/join expressions can return incorrect results when comparing strings to longs

2016-10-13 Thread Ming Beckwith (JIRA)
Ming Beckwith created SPARK-17913:
-

 Summary: Filter/join expressions can return incorrect results when 
comparing strings to longs
 Key: SPARK-17913
 URL: https://issues.apache.org/jira/browse/SPARK-17913
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 1.6.2
Reporter: Ming Beckwith


Reproducer:

{code}
  case class E(subject: Long, predicate: String, objectNode: String)

  def test(sc: SparkContext) = {
val sqlContext: SQLContext = new SQLContext(sc)
import sqlContext.implicits._

val broken = List(
  (19157170390056969L, "right", 19157170390056969L),
  (19157170390056973L, "wrong", 19157170390056971L),
  (19157190254313477L, "wrong", 19157190254313475L),
  (19157180859056133L, "wrong", 19157180859056131L),
  (19157170390056969L, "number", 161),
  (19157170390056971L, "string", "a string"),
  (19157190254313475L, "string", "another string"),
  (19157180859056131L, "number", 191)
)

val brokenDF = sc.parallelize(broken).map(b => E(b._1, b._2, 
b._3.toString)).toDF()
val brokenFilter = brokenDF.filter($"subject" === $"objectNode")
val fixed = brokenDF.filter(brokenDF("subject").cast("string") === 
brokenDF("objectNode"))

println("* incorrect filter results *")
println(brokenFilter.show())
println("* correct filter results *")
println(fixed.show())

println("* both sides cast to double *")
println(brokenFilter.explain())
  }

Broken filter returns:

+-+-+-+
|  subject|predicate|   objectNode|
+-+-+-+
|19157170390056969|right|19157170390056969|
|19157170390056973|wrong|19157170390056971|
|19157190254313477|wrong|19157190254313475|
|19157180859056133|wrong|19157180859056131|
+-+-+-+
{code}

The physical plan shows both sides of the expression are cast to Double 
before evaluation. So while comparing a number to a numeric string appears to 
work in many cases, when the numbers are sufficiently large and close together 
there is enough loss of precision to cause incorrect results. 

{code}
== Physical Plan ==
Filter (cast(subject#0L as double) = cast(objectNode#2 as double))

After casting the left side into strings, the filter returns the expected 
result:

+-+-+-+
|  subject|predicate|   objectNode|
+-+-+-+
|19157170390056969|right|19157170390056969|
+-+-+-+
{code}

Expected behavior in this case is probably to choose one side and cast the 
other (compare string to string or long to long) instead of using a data type 
with less precision. 
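
The collision is easy to confirm directly (values taken from the reproducer 
above): at this magnitude a Double has too few mantissa bits to tell the two 
longs apart, which is why the "wrong" rows match.

{code}
// Both longs from a "wrong" row collapse to the same Double value,
// since a Double carries only 53 bits of mantissa (~15-16 decimal digits).
val collides = 19157170390056973L.toDouble == 19157170390056971L.toDouble  // true
{code}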




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572602#comment-15572602
 ] 

Felix Cheung commented on SPARK-17895:
--

would you like to fix this?

> Improve documentation of "rowsBetween" and "rangeBetween"
> -
>
> Key: SPARK-17895
> URL: https://issues.apache.org/jira/browse/SPARK-17895
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SparkR, SQL
>Reporter: Weiluo Ren
>Priority: Minor
>
> This is an issue found by [~junyangq] when he was fixing SparkR docs.
> In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]).
>  However, the description of "rangeBetween" does not clearly differentiate it 
> from "rowsBetween". Even though in 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109]
>  we have a pretty nice description of "RangeFrame" and "RowFrame", which are 
> used by "rangeBetween" and "rowsBetween", I cannot find them in the online 
> Spark Scala API. 
> We could add small examples to the description of "rangeBetween" and 
> "rowsBetween" like
> {code}
> val df = Seq(1,1,2).toDF("id")
> df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  4|
>  * |  1|  4|
>  * |  2|  2|
>  * +---+---+
> */
> df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  2|
>  * |  1|  3|
>  * |  2|  2|
>  * +---+---+
> */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572606#comment-15572606
 ] 

Felix Cheung commented on SPARK-17904:
--

For reference these are the related PRs for Python for package management that 
have not seen traction:
https://github.com/apache/spark/pull/13599
https://github.com/apache/spark/pull/14180


> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run a UDF on a SparkDataFrame.
> If users use third-party libraries inside the function passed into 
> {{spark.lapply}} or {{dapply}}, they need to install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that it succeeded, 
> we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> function(x) install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may call lots 
> of libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API that wraps the work mentioned above, so users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install packages on 
> each executor, I think it will not pollute the native R environment. I'd like 
> to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Oksana Romankova (JIRA)
Oksana Romankova created SPARK-17914:


 Summary: Spark SQL casting to TimestampType with nanosecond 
results in incorrect timestamp
 Key: SPARK-17914
 URL: https://issues.apache.org/jira/browse/SPARK-17914
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: Oksana Romankova


In some cases when timestamps contain nanoseconds they will be parsed 
incorrectly. 

Examples: 

"2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
"2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"

The issue seems to be happening in DateTimeUtils.stringToTimestamp(), which 
assumes that at most a 6-digit fraction of a second will be passed.

With this being the case, I would suggest either discarding the extra nanosecond 
digits automatically, or throwing an exception prompting users to pre-format 
timestamps to microsecond precision before casting to Timestamp.
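
Until then, one possible workaround (a sketch only; the DataFrame {{df}} and the 
string column {{ts}} are placeholders, and {{regexp_replace}} is the standard 
Spark SQL function) is to trim the fractional part to microsecond precision 
before casting:

{code}
import org.apache.spark.sql.functions.{col, regexp_replace}

// Keep at most six fractional-second digits, then cast the string to a timestamp.
val fixed = df.withColumn("ts_micros",
  regexp_replace(col("ts"), "(\\.\\d{6})\\d+", "$1").cast("timestamp"))
{code}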



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572691#comment-15572691
 ] 

Apache Spark commented on SPARK-17912:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/15467

> Refactor code generation to get data for ColumnVector/ColumnarBatch
> ---
>
> Key: SPARK-17912
> URL: https://issues.apache.org/jira/browse/SPARK-17912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>
> Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is 
> becoming pervasive. The code generation part can be reused by multiple 
> components (e.g. parquet reader, data cache, and so on).
> This JIRA refactors the code generation part as a trait for ease of reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors

2016-10-13 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572689#comment-15572689
 ] 

Hossein Falaki commented on SPARK-17902:


Thanks for the pointer [~shivaram]. I will submit a patch with a regression 
test. The only obvious side effect of this bug is that the collected type will be 
String, while it should have been a Factor. What makes it bad is that it is in 
our documentation and it used to work, so it is a regression.

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems it is completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17912:


Assignee: (was: Apache Spark)

> Refactor code generation to get data for ColumnVector/ColumnarBatch
> ---
>
> Key: SPARK-17912
> URL: https://issues.apache.org/jira/browse/SPARK-17912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>
> Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is 
> becoming pervasive. The code generation part can be reused by multiple 
> components (e.g. parquet reader, data cache, and so on).
> This JIRA refactors the code generation part as a trait for ease of reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17912:


Assignee: Apache Spark

> Refactor code generation to get data for ColumnVector/ColumnarBatch
> ---
>
> Key: SPARK-17912
> URL: https://issues.apache.org/jira/browse/SPARK-17912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is 
> becoming pervasive. The code generation part can be reused by multiple 
> components (e.g. parquet reader, data cache, and so on).
> This JIRA refactors the code generation part as a trait for ease of reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17882) RBackendHandler swallowing errors

2016-10-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-17882.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

Resolved by https://github.com/apache/spark/pull/15375

> RBackendHandler swallowing errors
> -
>
> Key: SPARK-17882
> URL: https://issues.apache.org/jira/browse/SPARK-17882
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: James Shuster
>Assignee: James Shuster
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> RBackendHandler is swallowing general exceptions in handleMethodCall which 
> makes it impossible to debug certain issues that happen when doing an 
> invokeJava call.
> In my case this was the following error
> java.lang.IllegalAccessException: Class 
> org.apache.spark.api.r.RBackendHandler can not access a member of class with 
> modifiers "public final"
> The getCause message that is written back was basically blank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17882) RBackendHandler swallowing errors

2016-10-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-17882:
--
Assignee: James Shuster

> RBackendHandler swallowing errors
> -
>
> Key: SPARK-17882
> URL: https://issues.apache.org/jira/browse/SPARK-17882
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: James Shuster
>Assignee: James Shuster
>Priority: Minor
>
> RBackendHandler is swallowing general exceptions in handleMethodCall which 
> makes it impossible to debug certain issues that happen when doing an 
> invokeJava call.
> In my case this was the following error
> java.lang.IllegalAccessException: Class 
> org.apache.spark.api.r.RBackendHandler can not access a member of class with 
> modifiers "public final"
> The getCause message that is written back was basically blank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17915) Prepare ColumnVector implementation for UnsafeData

2016-10-13 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-17915:


 Summary: Prepare ColumnVector implementation for UnsafeData
 Key: SPARK-17915
 URL: https://issues.apache.org/jira/browse/SPARK-17915
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
Reporter: Kazuaki Ishizaki


The current implementations of {{ColumnVector}} are {{OnHeapColumnVector}} and 
{{OffHeapColumnVector}}, which are optimized for reading data from Parquet. 
If they get an array, a map, or a struct from an {{Unsafe}}-related data 
structure, it is inefficient.
This JIRA prepares a new implementation, {{OnHeapUnsafeColumnVector}}, that is 
optimized for reading data from an {{Unsafe}}-related data structure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"

2016-10-13 Thread Weiluo Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572721#comment-15572721
 ] 

Weiluo Ren commented on SPARK-17895:


Sure. Just want to collect some comments on the example to be added to the 
description here. Or I can first create a PR and get comments when people 
review it.

> Improve documentation of "rowsBetween" and "rangeBetween"
> -
>
> Key: SPARK-17895
> URL: https://issues.apache.org/jira/browse/SPARK-17895
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SparkR, SQL
>Reporter: Weiluo Ren
>Priority: Minor
>
> This is an issue found by [~junyangq] when he was fixing SparkR docs.
> In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]).
>  However, the description of "rangeBetween" does not clearly differentiate it 
> from "rowsBetween". Even though in 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109]
> we have a pretty nice description of "RangeFrame" and "RowFrame", which are 
> used by "rangeBetween" and "rowsBetween", I cannot find them in the online 
> Spark Scala API. 
> We could add small examples to the descriptions of "rangeBetween" and 
> "rowsBetween", like:
> {code}
> val df = Seq(1,1,2).toDF("id")
> df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  4|
>  * |  1|  4|
>  * |  2|  2|
>  * +---+---+
> */
> df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  2|
>  * |  1|  3|
>  * |  2|  2|
>  * +---+---+
> */
> {code}
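For readers puzzling over why the two sums differ: {{rangeBetween}} defines the frame on the ORDER BY value (every row whose id lies within the offset range of the current row's id is included, peers and all), while {{rowsBetween}} defines it on physical row positions. A self-contained sketch of the same example, assuming a local SparkSession named {{spark}}:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local").appName("window-frames").getOrCreate()
import spark.implicits._

val df = Seq(1, 1, 2).toDF("id")

// RANGE frame: for a row with id = 1 the frame is every row whose id falls in
// [1, 1 + 1], i.e. both 1s and the 2, hence sum = 4 for the first two rows.
df.withColumn("sum", sum($"id").over(Window.orderBy($"id").rangeBetween(0, 1))).show()

// ROW frame: the frame is the current row plus at most one following row,
// regardless of the values involved, hence sums 2, 3 and 2.
df.withColumn("sum", sum($"id").over(Window.orderBy($"id").rowsBetween(0, 1))).show()
{code}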






[jira] [Commented] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572730#comment-15572730
 ] 

Sean Owen commented on SPARK-17914:
---

I think this is a duplicate of one of a couple possible issues, like 
https://issues.apache.org/jira/browse/SPARK-14428

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>
> In some cases, when timestamps contain nanoseconds, they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> Given this, I would suggest either discarding nanoseconds automatically, or 
> throwing an exception prompting users to pre-format timestamps to microsecond 
> precision before casting to Timestamp.
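As a stopgap, one way to pre-format such strings to microsecond precision before the cast (an illustrative sketch only, not the fix proposed here; the column name and local session are assumptions):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

val spark = SparkSession.builder().master("local").appName("nanos").getOrCreate()
import spark.implicits._

val df = Seq("2016-05-14T15:12:14.0034567Z").toDF("raw")

// Truncate the fractional seconds to at most 6 digits, so the cast sees
// "...15:12:14.003456Z" instead of silently mis-parsing the 7-digit fraction.
val fixed = df.withColumn("ts",
  regexp_replace($"raw", "(\\.\\d{6})\\d+", "$1").cast("timestamp"))
fixed.show(false)
{code}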






[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572741#comment-15572741
 ] 

Shivaram Venkataraman commented on SPARK-17904:
---

Thanks all - this is a good discussion about the role of executors and package 
management. The way I was thinking about this is that it could be similar to 
the Maven coordinates that we support in, say, `spark-shell`, and how Spark 
ensures those JAR files are loaded before execution.

On that note, would it be better if we took this in as an argument to 
sparkR.session(), as opposed to having a function that can be called at any 
point in the middle of the session?

> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run UDFs on a SparkDataFrame.
> If users use third-party libraries inside the function that is passed into 
> {{spark.lapply}} or {{dapply}}, they should install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that the install 
> succeeded, we can run code like the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L),
>                                 function(part) { install.packages("Matrix") })
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet every time you need a third-party 
> library. Since SparkR is an interactive analytics tool, users may load many 
> libraries during an analytics session; in native R, users can simply run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API to wrap the work mentioned above, so that users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR will install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install packages on 
> each executor, I think it will not pollute the native R environment. I'd 
> like to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.






[jira] [Assigned] (SPARK-17915) Prepare ColumnVector implementation for UnsafeData

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17915:


Assignee: (was: Apache Spark)

> Prepare ColumnVector implementation for UnsafeData
> --
>
> Key: SPARK-17915
> URL: https://issues.apache.org/jira/browse/SPARK-17915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>
> Current implementations of {{ColumnarVector}} are {{OnHeapColumnarVector}} 
> and {{OffHeapColumnarVector}}, which are optimized for reading data from 
> Parquet. If they get an array, a map, or a struct from an {{Unsafe}}-related 
> data structure, it is inefficient.
> This JIRA prepares a new implementation, {{OnHeapUnsafeColumnarVector}}, that 
> is optimized for reading data from an {{Unsafe}}-related data structure.






[jira] [Assigned] (SPARK-17915) Prepare ColumnVector implementation for UnsafeData

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17915:


Assignee: Apache Spark

> Prepare ColumnVector implementation for UnsafeData
> --
>
> Key: SPARK-17915
> URL: https://issues.apache.org/jira/browse/SPARK-17915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> Current implementations of {{ColumnarVector}} are {{OnHeapColumnarVector}} 
> and {{OffHeapColumnarVector}}, which are optimized for reading data from 
> Parquet. If they get an array, a map, or a struct from an {{Unsafe}}-related 
> data structure, it is inefficient.
> This JIRA prepares a new implementation, {{OnHeapUnsafeColumnarVector}}, that 
> is optimized for reading data from an {{Unsafe}}-related data structure.






[jira] [Commented] (SPARK-17915) Prepare ColumnVector implementation for UnsafeData

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572747#comment-15572747
 ] 

Apache Spark commented on SPARK-17915:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/15468

> Prepare ColumnVector implementation for UnsafeData
> --
>
> Key: SPARK-17915
> URL: https://issues.apache.org/jira/browse/SPARK-17915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>
> Current implementations of {{ColumnarVector}} are {{OnHeapColumnarVector}} 
> and {{OffHeapColumnarVector}}, which are optimized for reading data from 
> Parquet. If they get an array, a map, or a struct from an {{Unsafe}}-related 
> data structure, it is inefficient.
> This JIRA prepares a new implementation, {{OnHeapUnsafeColumnarVector}}, that 
> is optimized for reading data from an {{Unsafe}}-related data structure.






[jira] [Commented] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"

2016-10-13 Thread Weiluo Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572726#comment-15572726
 ] 

Weiluo Ren commented on SPARK-17895:


[~junyangq] Could you please help fix the SparkR doc accordingly?

> Improve documentation of "rowsBetween" and "rangeBetween"
> -
>
> Key: SPARK-17895
> URL: https://issues.apache.org/jira/browse/SPARK-17895
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SparkR, SQL
>Reporter: Weiluo Ren
>Priority: Minor
>
> This is an issue found by [~junyangq] when he was fixing SparkR docs.
> In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]).
>  However, the description of "rangeBetween" does not clearly differentiate it 
> from "rowsBetween". Even though in 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109]
> we have a pretty nice description of "RangeFrame" and "RowFrame", which are 
> used by "rangeBetween" and "rowsBetween", I cannot find them in the online 
> Spark Scala API. 
> We could add small examples to the descriptions of "rangeBetween" and 
> "rowsBetween", like:
> {code}
> val df = Seq(1,1,2).toDF("id")
> df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  4|
>  * |  1|  4|
>  * |  2|  2|
>  * +---+---+
> */
> df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  2|
>  * |  1|  3|
>  * |  2|  2|
>  * +---+---+
> */
> {code}






[jira] [Commented] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Oksana Romankova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572758#comment-15572758
 ] 

Oksana Romankova commented on SPARK-17914:
--

You are correct. It is related to what has been proposed in SPARK-14428. 
However, the current behavior is defective. If SPARK-14428 is not going to be 
accepted, then at the very least this defect deserves to be addressed. 

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>
> In some cases, when timestamps contain nanoseconds, they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> Given this, I would suggest either discarding nanoseconds automatically, or 
> throwing an exception prompting users to pre-format timestamps to microsecond 
> precision before casting to Timestamp.






[jira] [Created] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-13 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17916:
--

 Summary: CSV data source treats empty string as null no matter 
what nullValue option is
 Key: SPARK-17916
 URL: https://issues.apache.org/jira/browse/SPARK-17916
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Hossein Falaki


When a user configures {{nullValue}} in the CSV data source, in addition to 
that value, all empty string values are also converted to null.

{code}
data:
col1,col2
1,"-"
2,""
{code}

{code}
spark.read.format("csv").option("nullValue", "-")
{code}

We will find a null in both rows.
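A minimal end-to-end reproduction sketch (the file path, the header option and the local session are assumptions added for illustration, not part of the report):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("csv-null").getOrCreate()

// Assume /tmp/data.csv holds the header plus the two rows shown above.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValue", "-")   // only "-" is supposed to become null...
  .load("/tmp/data.csv")

// ...yet the empty string in the second row also comes back as null,
// which is the behaviour this issue reports.
df.show()
{code}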






[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572759#comment-15572759
 ] 

Alessio commented on SPARK-15565:
-

Same problem happened again in Spark 2.0.1.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508
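For illustration, a hedged sketch of what setting an explicit local-filesystem scheme means from the user side (the path is an arbitrary example, not the proposed default): passing a {{file:}} URI keeps the warehouse from being resolved against {{fs.defaultFS}}, for example an HDFS namenode.

{code}
import org.apache.spark.sql.SparkSession

// Pin the warehouse to an explicit file: URI; without a scheme, the default
// can end up resolved against HDFS when fs.defaultFS points at a namenode.
val spark = SparkSession.builder()
  .master("local")
  .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")
  .getOrCreate()
{code}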






[jira] [Resolved] (SPARK-17827) StatisticsColumnSuite failures on big endian platforms

2016-10-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17827.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> StatisticsColumnSuite failures on big endian platforms
> --
>
> Key: SPARK-17827
> URL: https://issues.apache.org/jira/browse/SPARK-17827
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: big endian
>Reporter: Pete Robbins
>Assignee: Pete Robbins
>  Labels: big-endian
> Fix For: 2.1.0
>
>
> https://issues.apache.org/jira/browse/SPARK-17073
> introduces new tests/functions that fail on big-endian platforms.
> Failing tests:
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> string column
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> binary column
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> columns with different types
>  org.apache.spark.sql.hive.StatisticsSuite.generate column-level statistics 
> and load them from hive metastore
> All fail in checkColStat, e.g.: 
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:156)
>   at 
> org.apache.spark.sql.StatisticsTest$.checkColStat(StatisticsTest.scala:92)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1$$anonfun$apply$mcV$sp$1.apply(StatisticsTest.scala:43)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1$$anonfun$apply$mcV$sp$1.apply(StatisticsTest.scala:40)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1.apply$mcV$sp(StatisticsTest.scala:40)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:168)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite.withTable(StatisticsColumnSuite.scala:30)
>   at 
> org.apache.spark.sql.StatisticsTest$class.checkColStats(StatisticsTest.scala:33)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite.checkColStats(StatisticsColumnSuite.scala:30)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply$mcV$sp(StatisticsColumnSuite.scala:171)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply(StatisticsColumnSuite.scala:160)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply(StatisticsColumnSuite.scala:160)






[jira] [Updated] (SPARK-17827) StatisticsColumnSuite failures on big endian platforms

2016-10-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17827:
--
Assignee: Pete Robbins

> StatisticsColumnSuite failures on big endian platforms
> --
>
> Key: SPARK-17827
> URL: https://issues.apache.org/jira/browse/SPARK-17827
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: big endian
>Reporter: Pete Robbins
>Assignee: Pete Robbins
>  Labels: big-endian
> Fix For: 2.1.0
>
>
> https://issues.apache.org/jira/browse/SPARK-17073
> introduces new tests/functions that fail on big-endian platforms.
> Failing tests:
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> string column
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> binary column
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> columns with different types
>  org.apache.spark.sql.hive.StatisticsSuite.generate column-level statistics 
> and load them from hive metastore
> All fail in checkColStat, e.g.: 
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:156)
>   at 
> org.apache.spark.sql.StatisticsTest$.checkColStat(StatisticsTest.scala:92)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1$$anonfun$apply$mcV$sp$1.apply(StatisticsTest.scala:43)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1$$anonfun$apply$mcV$sp$1.apply(StatisticsTest.scala:40)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1.apply$mcV$sp(StatisticsTest.scala:40)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:168)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite.withTable(StatisticsColumnSuite.scala:30)
>   at 
> org.apache.spark.sql.StatisticsTest$class.checkColStats(StatisticsTest.scala:33)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite.checkColStats(StatisticsColumnSuite.scala:30)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply$mcV$sp(StatisticsColumnSuite.scala:171)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply(StatisticsColumnSuite.scala:160)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply(StatisticsColumnSuite.scala:160)






[jira] [Created] (SPARK-17917) Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event

2016-10-13 Thread Mario Briggs (JIRA)
Mario Briggs created SPARK-17917:


 Summary: Convert 'Initial job has not accepted any resources..' 
logWarning to a SparkListener event
 Key: SPARK-17917
 URL: https://issues.apache.org/jira/browse/SPARK-17917
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Mario Briggs


When supporting Spark on a large, multi-tenant shared cluster with per-tenant 
quotas, a submitted taskSet often might not get executors because quotas have 
been exhausted or resources are unavailable. In these situations, firing a 
SparkListener event instead of just logging the issue (as is done currently at 
https://github.com/apache/spark/blob/9216901d52c9c763bfb908013587dcf5e781f15b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L192)
 would give applications/listeners an opportunity to handle this more 
appropriately as needed.






[jira] [Commented] (SPARK-17917) Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event

2016-10-13 Thread Mario Briggs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572795#comment-15572795
 ] 

Mario Briggs commented on SPARK-17917:
--

I would appreciate it if the Spark devs could comment on whether they see this 
as a bad idea for some reason. 

I basically see adding 2 events to SparkListener, like
  onTaskStarved() and onTaskUnStarved() - the latter fires only if 
onTaskStarved() fired in the first place for a taskSet.
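A rough sketch of how such hooks might look, using the names floated above; these events and methods are hypothetical and do not exist in Spark's SparkListener API today:

{code}
import org.apache.spark.scheduler.SparkListener

// Hypothetical event payloads, mirroring the proposal above.
case class SparkListenerTaskSetStarved(stageId: Int, stageAttemptId: Int)
case class SparkListenerTaskSetUnstarved(stageId: Int, stageAttemptId: Int)

// An application-supplied listener could then react programmatically instead
// of grepping driver logs for "Initial job has not accepted any resources".
class StarvationListener extends SparkListener {
  def onTaskStarved(event: SparkListenerTaskSetStarved): Unit =
    println(s"TaskSet for stage ${event.stageId} is starved for resources")

  def onTaskUnStarved(event: SparkListenerTaskSetUnstarved): Unit =
    println(s"TaskSet for stage ${event.stageId} finally got executors")
}
{code}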

> Convert 'Initial job has not accepted any resources..' logWarning to a 
> SparkListener event
> --
>
> Key: SPARK-17917
> URL: https://issues.apache.org/jira/browse/SPARK-17917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Mario Briggs
>
> When supporting Spark on a multi-tenant shared large cluster with quotas per 
> tenant, often a submitted taskSet might not get executors because quotas have 
> been exhausted (or) resources unavailable. In these situations, firing a 
> SparkListener event instead of just logging the issue (as done currently at 
> https://github.com/apache/spark/blob/9216901d52c9c763bfb908013587dcf5e781f15b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L192),
>  would give applications/listeners an opportunity to handle this more 
> appropriately as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17918) Default Warehause location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)
Alessio created SPARK-17918:
---

 Summary: Default Warehause location apparently in HDFS 
 Key: SPARK-17918
 URL: https://issues.apache.org/jira/browse/SPARK-17918
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1
 Environment: Macintosh
Reporter: Alessio


It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Updated] (SPARK-17918) Default Warehause location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

`16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.`

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehause location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Updated] (SPARK-17918) Default Warehause location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

`16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.`

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehause location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1.
> `16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.`
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Summary: Default Warehouse location apparently in HDFS   (was: Default 
Warehause location apparently in HDFS )

> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Created] (SPARK-17919) Make timeout to RBackend configurable in SparkR

2016-10-13 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17919:
--

 Summary: Make timeout to RBackend configurable in SparkR
 Key: SPARK-17919
 URL: https://issues.apache.org/jira/browse/SPARK-17919
 Project: Spark
  Issue Type: Story
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Hossein Falaki


I am working on a project where {{gapply()}} is being used with a large dataset 
that happens to be extremely skewed. On the skewed partition, the user 
function takes more than 2 hours to return, which turns out to be longer than 
the timeout that we hardcode in SparkR for the backend connection.

{code}
connectBackend <- function(hostname, port, timeout = 6000) 
{code}

Ideally the user should be able to reconfigure Spark and increase the timeout. 
It should be a small fix.






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Environment: Mac OS X 10.11.6  (was: Macintosh)

> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*



  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse*




> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse*



  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse*






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file://spark-warehouse'.*
{color}

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file://spark-warehouse'.*



> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/ FS folder>/spark-warehouse'.*
> {color}






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file://spark-warehouse'.*


  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 
'file:/Users/Purple/Documents/YARNprojects/Spark_K-MEANS/version_postgreSQL/spark-warehouse'.*



> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> Update #1:
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/ FS folder>/spark-warehouse'.*






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 
'file:/Users/Purple/Documents/YARNprojects/Spark_K-MEANS/version_postgreSQL/spark-warehouse'.*


  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*




> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> Update #1:
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 
> 'file:/Users/Purple/Documents/YARNprojects/Spark_K-MEANS/version_postgreSQL/spark-warehouse'.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is also resolved against an HDFS path - 
see the error.

This was fixed in 2.0.0, as previous issues reported, but it appears again in 
2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but only locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:{color}
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*

{color:red}Update #2:{color}
In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
Everything's default.

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is also resolved against an HDFS path - 
see the error.

This was fixed in 2.0.0, as previous issues reported, but it appears again in 
2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but only locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
{color}


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is also resolved against an HDFS path - 
> see the error.
> This was fixed in 2.0.0, as previous issues reported, but it appears again in 
> 2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but only locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:{color}
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
> {color:red}Update #2:{color}
> In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
> Everything's default.
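
For anyone hitting this on a purely local setup, a minimal Scala sketch of a 
possible workaround (illustrative only, not part of this report; the /tmp path 
is just an example) is to make the warehouse location an explicit 
local-filesystem URI:

{code}
// Illustrative workaround sketch (assumption: local-only Spark 2.0.x, no HDFS).
// An explicit file: URI avoids resolving the warehouse path against
// hdfs://localhost:9000/user/hive/warehouse as described above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")
  .getOrCreate()
{code}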



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572885#comment-15572885
 ] 

Sean Owen commented on SPARK-17914:
---

Does the ISO 8601 format even support nanoseconds? I thought we had a 
discussion about it, but I don't recall the conclusion.

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>
> In some cases when timestamps contain nanoseconds they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> With this being the case, I would suggest either discarding nanoseconds 
> automatically or throwing an exception that prompts users to pre-format 
> timestamps to microsecond precision before casting to Timestamp.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17918.
---
Resolution: Duplicate

> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is also resolved against an HDFS path - 
> see the error.
> This was fixed in 2.0.0, as previous issues reported, but it appears again in 
> 2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but only locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:{color}
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
> {color:red}Update #2:{color}
> In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
> Everything's default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572922#comment-15572922
 ] 

Cody Koeninger commented on SPARK-17812:


Sorry, I didn't see this comment until just now.

Going back X offsets per partition is not a reasonable proxy for time when 
you're dealing with a stream that has multiple topics in it. I agree we should 
break that out and focus on defining starting offsets in this ticket.

The concern with startingOffsets naming is that, because auto.offset.reset is 
orthogonal to specifying some offsets, you have a situation like this:

.format("kafka")
.option("subscribePattern", "topic.*")
.option("startingOffset", "latest")
.option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 
}, "topicbar" : { "0": 1234, "1": 4567 }}""")

where startingOffsetForRealzYo stands in for a more reasonable name, one that 
conveys it is specifying starting offsets yet is not confusingly similar to 
startingOffset.

Trying to hack it all into one JSON as an alternative, with a "default" topic, 
means you're going to have to pick a key that isn't a valid topic, or add yet 
another layer of indirection. It also makes it harder to keep the format 
consistent with SPARK-17829 (which does seem like a good thing to keep 
consistent, I agree).
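
For concreteness, a hypothetical sketch of that single-JSON alternative (the 
"default" key and combined payload are invented here, not an actual option 
format), which is exactly the extra indirection described above:

{code}
// Hypothetical only: one startingOffsets JSON with a reserved "default" key.
// "default" would have to be a key that can never collide with a real topic name.
.option("subscribePattern", "topic.*")
.option("startingOffsets", """ { "default" : "latest", "topicfoo" : { "0": 1234, "1": 4567 }, "topicbar" : { "0": 1234, "1": 4567 }} """)
{code}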

Obviously I think you should just change the name, but it's your show.





> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16575) partition calculation mismatch with sc.binaryFiles

2016-10-13 Thread Tarun Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572950#comment-15572950
 ] 

Tarun Kumar commented on SPARK-16575:
-

[~rxin] I have now added support for openCostInBytes, similar to SQL (thanks 
for pointing that out). It now creates an optimized number of partitions. 
Could you please review and suggest changes? Once again, thanks for your 
suggestion; it worked like a charm!
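
For reference, a rough Scala sketch of the kind of estimate an open cost gives 
(hypothetical helper, not the actual patch; the 4 MB and 128 MB constants 
mirror the SQL file-source defaults):

{code}
// Rough sketch: pad every file by a per-file "open cost" so many small files
// still yield a sensible number of partitions instead of collapsing to 2.
def estimatePartitions(fileSizes: Seq[Long],
                       defaultParallelism: Int,
                       openCostInBytes: Long = 4L * 1024 * 1024): Int = {
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / math.max(1, defaultParallelism)
  val maxSplitBytes = math.min(128L * 1024 * 1024, math.max(openCostInBytes, bytesPerCore))
  math.max(1, math.ceil(totalBytes.toDouble / maxSplitBytes).toInt)
}
{code}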

> partition calculation mismatch with sc.binaryFiles
> --
>
> Key: SPARK-16575
> URL: https://issues.apache.org/jira/browse/SPARK-16575
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Suhas
>Priority: Critical
>
> sc.binaryFiles always creates an RDD with only 2 partitions.
> Steps to reproduce: (Tested this bug on databricks community edition)
> 1. Try to create an RDD using sc.binaryFiles. In this example, airlines 
> folder has 1922 files.
>  Ex: {noformat}val binaryRDD = 
> sc.binaryFiles("/databricks-datasets/airlines/*"){noformat}
> 2. check the number of partitions of the above RDD
> - binaryRDD.partitions.size = 2. (expected value is more than 2)
> 3. If the RDD is created using sc.textFile, then the number of partitions is 
> 1921.
> 4. The same sc.binaryFiles call creates 1921 partitions in Spark 1.5.1.
> For explanation with screenshot, please look at the link below,
> http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Oksana Romankova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572957#comment-15572957
 ] 

Oksana Romankova edited comment on SPARK-17914 at 10/13/16 7:42 PM:


Sean, I can't find any evidence of ISO 8601 not supporting nanoseconds. All it 
says is that it supports a fraction of a second, supplied after a comma or 
dot. Different parsing libraries that support ISO 8601 have different 
precision limitations. For instance, in Python, datetime.strptime() only 
supports precision down to microseconds and will throw an exception if 
nanoseconds are supplied in the input string. While that may not be ideal for 
those who need to retain nanosecond precision after parsing, it is acceptable 
behavior. Those who do not need nanosecond precision can catch the exception 
or, preemptively, truncate the input string. Spark SQL's 
DateTimeUtils.stringToTimestamp() doesn't throw and doesn't truncate properly, 
which results in an incorrect timestamp. In the example above, the acceptable 
truncation would be:

"2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.003456"
"2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.000345"



was (Author: oromank...@cardlytics.com):
Sean, I can't find any evidence of ISO 8601 not supporting nanoseconds. All it 
says is that it supports a fraction of a second, supplied after a comma or 
dot. Different parsing libraries that support ISO 8601 have different 
precision limitations. For instance, in Python, datetime.strptime() only 
supports precision down to microseconds and will throw an exception if 
nanoseconds are supplied in the input string. While that may not be ideal for 
those who need to retain nanosecond precision after parsing, it is acceptable 
behavior. Those who do not need nanosecond precision can catch the exception 
or, preemptively, truncate the input string. Spark SQL's 
DateTimeUtils.stringToTimestamp() doesn't throw and doesn't truncate properly, 
which results in an incorrect timestamp. In the example above, the acceptable 
truncation would be:

```
"2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.003456"
"2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.000345"
```

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>
> In some cases when timestamps contain nanoseconds they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> With this being the case, I would suggest either discarding nanoseconds 
> automatically or throwing an exception that prompts users to pre-format 
> timestamps to microsecond precision before casting to Timestamp.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Oksana Romankova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572957#comment-15572957
 ] 

Oksana Romankova commented on SPARK-17914:
--

Sean, I can't find any evidence of ISO 8601 not supporting nanoseconds. All it 
says is that it supports a fraction of a second, supplied after a comma or 
dot. Different parsing libraries that support ISO 8601 have different 
precision limitations. For instance, in Python, datetime.strptime() only 
supports precision down to microseconds and will throw an exception if 
nanoseconds are supplied in the input string. While that may not be ideal for 
those who need to retain nanosecond precision after parsing, it is acceptable 
behavior. Those who do not need nanosecond precision can catch the exception 
or, preemptively, truncate the input string. Spark SQL's 
DateTimeUtils.stringToTimestamp() doesn't throw and doesn't truncate properly, 
which results in an incorrect timestamp. In the example above, the acceptable 
truncation would be:

```
"2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.003456"
"2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.000345"
```
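
For illustration, a minimal Scala sketch of that truncate-to-microseconds 
behaviour (plain string handling, not Spark's DateTimeUtils; the helper name 
is invented):

{code}
// Keep at most 6 fractional-second digits so the value parses as microseconds.
def truncateToMicros(ts: String): String = {
  val FracRe = """(.*\.)(\d+)(Z?)$""".r
  ts match {
    case FracRe(prefix, frac, zone) => prefix + frac.take(6) + zone
    case other => other // no fractional part: leave unchanged
  }
}

// truncateToMicros("2016-05-14T15:12:14.0034567Z") == "2016-05-14T15:12:14.003456Z"
{code}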

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>
> In some cases when timestamps contain nanoseconds they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> With this being the case, I would suggest either discarding nanoseconds 
> automatically or throwing an exception that prompts users to pre-format 
> timestamps to microsecond precision before casting to Timestamp.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2016-10-13 Thread James Norvell (JIRA)
James Norvell created SPARK-17920:
-

 Summary: HiveWriterContainer passes null configuration to 
serde.initialize, causing NullPointerException in AvroSerde when using 
avro.schema.url
 Key: SPARK-17920
 URL: https://issues.apache.org/jira/browse/SPARK-17920
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 1.6.2
 Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
Reporter: James Norvell
Priority: Minor


When HiveWriterContainer initializes a serde it explicitly passes null for the 
Configuration:

https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161

When attempting to write to a table stored as Avro with avro.schema.url set, 
this causes a NullPointerException when it tries to get the FileSystem for the 
URL:

https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
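
As a rough illustration of why that fails (a simplified sketch, not the Hive 
code itself; readSchemaText is an invented helper): resolving avro.schema.url 
needs a FileSystem, and getting a FileSystem needs a non-null Configuration:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def readSchemaText(schemaUrl: String, conf: Configuration): String = {
  // FileSystem.get dereferences the Configuration, so conf == null blows up here,
  // which is essentially what AvroSerdeUtils.getSchemaFromFS runs into.
  val fs = FileSystem.get(new java.net.URI(schemaUrl), conf)
  val in = fs.open(new Path(schemaUrl))
  try scala.io.Source.fromInputStream(in).mkString
  finally in.close()
}
{code}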

Reproduction:
{noformat}
spark-sql> create external table avro_in (a string) stored as avro location 
'/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');

spark-sql> create external table avro_out (a string) stored as avro location 
'/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');

spark-sql> select * from avro_in;
hello
Time taken: 1.986 seconds, Fetched 1 row(s)

spark-sql> insert overwrite table avro_out select * from avro_in;

16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
Returning signal schema to indicate problem
java.lang.NullPointerException
at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
at 
org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.Dataset.(Dataset.scala:186)
at org.apache.spark.sql.Dataset.(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at

[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572961#comment-15572961
 ] 

Sean Owen commented on SPARK-15565:
---

Not quite... this was sort of undone by 
https://issues.apache.org/jira/browse/SPARK-15899 and that was the problem now 
addressed in https://issues.apache.org/jira/browse/SPARK-17810

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508
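
As a side illustration of the point (hypothetical helper, not Spark's actual 
SQLConf code): a scheme-less default is resolved against fs.defaultFS, which 
may be HDFS, while a "file:" prefix pins it to the local filesystem.

{code}
import java.io.File

// Sketch only: how an explicit local scheme differs from the bare user.dir default.
def defaultWarehousePath(explicitLocalScheme: Boolean): String = {
  val base = new File(System.getProperty("user.dir"), "spark-warehouse").getAbsolutePath
  if (explicitLocalScheme) "file:" + base else base
}
{code}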



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2016-10-13 Thread James Norvell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Norvell updated SPARK-17920:
--
Attachment: avro_data
avro.avsc

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Priority: Minor
> Attachments: avro.avsc, avro.avsc, avro_data, avro_data
>
>
> When HiveWriterContainer initializes a serde it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

[jira] [Updated] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2016-10-13 Thread James Norvell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Norvell updated SPARK-17920:
--
Attachment: avro_data
avro.avsc

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Priority: Minor
> Attachments: avro.avsc, avro.avsc, avro_data, avro_data
>
>
> When HiveWriterContainer initializes a serde it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

[jira] [Updated] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2016-10-13 Thread James Norvell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Norvell updated SPARK-17920:
--
Attachment: (was: avro.avsc)

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Priority: Minor
> Attachments: avro.avsc, avro_data
>
>
> When HiveWriterContainer initializes a serde it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.Delegating

[jira] [Updated] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2016-10-13 Thread James Norvell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Norvell updated SPARK-17920:
--
Attachment: (was: avro_data)

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Priority: Minor
> Attachments: avro.avsc, avro_data
>
>
> When HiveWriterContainer initializes a serde it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.Delegating

[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is also resolved against an HDFS path - 
see the error.

This was fixed in 2.0.0, as previous issues have reported, but it appears 
again in 2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such 
errors: Spark 2.0.0 used to create the spark-warehouse folder within the 
current directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but only locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:{color}
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*

{color:red}Update #2:{color}
In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
Everything's default.

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is also resolved against an HDFS path - 
see the error.

This was fixed in 2.0.0, as previous issues reported, but it appears again in 
2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but only locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:{color}
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*

{color:red}Update #2:{color}
In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
Everything's default.


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is also resolved against an HDFS path - 
> see the error.
> This was fixed in 2.0.0, as previous issues have reported, but it appears 
> again in 2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw 
> such errors: Spark 2.0.0 used to create the spark-warehouse folder within the 
> current directory (which was good) and didn't complain about such weird 
> paths, especially since I'm not using Spark through HDFS, but only locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:{color}
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
> {color:red}Update #2:{color}
> In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
> Everything's default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17917) Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event

2016-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17917:
--
Priority: Minor  (was: Major)

Maybe. I suppose it will be a little tricky to define what the event is here, 
since the event is that something didn't happen. Still, whatever is triggering 
the log might reasonably trigger an event. I don't have a strong feeling on 
this, partly because I'm not sure what the action would then be -- kill the 
job?

> Convert 'Initial job has not accepted any resources..' logWarning to a 
> SparkListener event
> --
>
> Key: SPARK-17917
> URL: https://issues.apache.org/jira/browse/SPARK-17917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Mario Briggs
>Priority: Minor
>
> When supporting Spark on a large, multi-tenant shared cluster with per-tenant 
> quotas, a submitted taskSet often does not get executors because quotas have 
> been exhausted or resources are unavailable. In these situations, firing a 
> SparkListener event instead of just logging the issue (as done currently at 
> https://github.com/apache/spark/blob/9216901d52c9c763bfb908013587dcf5e781f15b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L192)
> would give applications/listeners an opportunity to handle this more 
> appropriately as needed.
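
By way of illustration, a hypothetical sketch of what that could look like on 
the application side (the event type and names below are invented; no such 
event exists in Spark today):

{code}
// Hypothetical custom event plus a listener reacting to it.
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

case class SparkListenerTaskSetStarved(stageId: Int, message: String) extends SparkListenerEvent

class StarvationListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case SparkListenerTaskSetStarved(stageId, msg) =>
      // e.g. alert the tenant, or cancel the starved stage/job.
      println(s"Stage $stageId has not been allocated any resources: $msg")
    case _ => // ignore all other events
  }
}
{code}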



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572976#comment-15572976
 ] 

Alessio commented on SPARK-15565:
-

Yes Sean, indeed in my latest issue, SPARK-17918, I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". I did notice, though, that my issue was a duplicate of 
SPARK-17810, and I'm glad this will be fixed.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572976#comment-15572976
 ] 

Alessio edited comment on SPARK-15565 at 10/13/16 7:49 PM:
---

Yes Sean, indeed in my latest issue, SPARK-17918, I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". I did notice, though, that SPARK-17918 was a duplicate of 
SPARK-17810, and I'm glad this will be fixed.


was (Author: purple):
Yes Sean, indeed in my latest issue, SPARK-17918, I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". I did notice, though, that my issue was a duplicate of 
SPARK-17810, and I'm glad this will be fixed.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

2016-10-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-15369.
---
Resolution: Won't Fix

In the spirit of having more explicit accepts/rejects, and given the 
discussions so far (both on this ticket and on the GitHub pull request), I'm 
going to close this as Won't Fix for now. We can still continue to discuss the 
merits here, but the rejection is based on the following:

1. Maintenance cost of supporting another runtime.

2. Jython is years behind Cython (or even PyPy) in terms of features.

3. Jython cannot leverage any of the numeric tools available.

(In hindsight maybe PyPy support was also added prematurely.)



> Investigate selectively using Jython for parts of PySpark
> -
>
> Key: SPARK-15369
> URL: https://issues.apache.org/jira/browse/SPARK-15369
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial 
> bottleneck. While Jython is not suitable for all UDFs or map functions, it 
> may be suitable for some simple ones. We should investigate the option of 
> using Jython to accelerate these small functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2016-10-13 Thread Dmytro Bielievtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573056#comment-15573056
 ] 

Dmytro Bielievtsov commented on SPARK-10872:


[~sowen] Can you please give me some pointers in the source code so I can start 
working towards a pull request?

> Derby error (XSDB6) when creating new HiveContext after restarting 
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Dmytro Bielievtsov
>
> Starting from spark 1.4.0 (works well on 1.3.1), the following code fails 
> with "XSDB6: Another instance of Derby may have already booted the database 
> ~/metastore_db":
> {code:python}
> from pyspark import SparkContext, HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539]; I intend to restart the SparkContext 
> several times for isolated jobs to prevent cache cluttering and GC errors.
> Here's a larger part of the full error trace:
> {noformat}
> Failed to start database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start 
> database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>   at 
> org.apache.hadoop.hive.metastore.Hi

[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573089#comment-15573089
 ] 

Cody Koeninger commented on SPARK-17812:


One other slightly ugly thing...

{noformat}
// starting topicpartitions, no explicit offset
.option("assign", """{"topicfoo": [0, 1],"topicbar": [0, 1]}"""

// do you allow specifying with explicit offsets in the same config option? 
// or force it all into startingOffsetForRealzYo?
.option("assign", """{ "topicfoo" :{ "0": 1234, "1": 4567 }, "topicbar" : { 
"0": 1234, "1": 4567 }}""")
{noformat}

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17900) Mark the following Spark SQL APIs as stable

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17900:


Assignee: Reynold Xin  (was: Apache Spark)

> Mark the following Spark SQL APIs as stable
> ---
>
> Key: SPARK-17900
> URL: https://issues.apache.org/jira/browse/SPARK-17900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Mark the following stable:
> Dataset/DataFrame
> - functions, since 1.3
> - ColumnName, since 1.3
> - DataFrameNaFunctions, since 1.3.1
> - DataFrameStatFunctions, since 1.4
> - UserDefinedFunction, since 1.3
> - UserDefinedAggregateFunction, since 1.5
> - Window and WindowSpec, since 1.4
> Data sources:
> - DataSourceRegister, since 1.5
> - RelationProvider, since 1.3
> - SchemaRelationProvider, since 1.3
> - CreatableRelationProvider, since 1.3
> - BaseRelation, since 1.3
> - TableScan, since 1.3
> - PrunedScan, since 1.3
> - PrunedFilteredScan, since 1.3
> - InsertableRelation, since 1.3
> Keep the following experimental / evolving:
> Data sources:
> - CatalystScan (tied to internal logical plans so it is not stable by 
> definition)
> Structured streaming:
> - all classes (introduced new in 2.0 and will likely change)
> Dataset typed operations (introduced in 1.6 and 2.0 and might change, 
> although probability is low)
> - all typed methods on Dataset
> - KeyValueGroupedDataset
> - o.a.s.sql.expressions.javalang.typed
> - o.a.s.sql.expressions.scalalang.typed
> - methods that return typed Dataset in SparkSession
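For context, a sketch of what "marking stable" can look like at the code level, 
assuming the {{org.apache.spark.annotation.InterfaceStability}} annotations available 
in Spark 2.1+; if the actual change uses a different mechanism, treat this purely as 
illustration. The class names below are placeholders, not Spark source.

{code}
import org.apache.spark.annotation.InterfaceStability

// Stable: the API is frozen; only backwards-compatible changes are expected.
@InterfaceStability.Stable
class MyStableHelper {
  def supported(): String = "kept source- and binary-compatible across releases"
}

// Evolving: the API may still change between minor releases.
@InterfaceStability.Evolving
class MyEvolvingHelper {
  def experimental(): String = "subject to change"
}
{code}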



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17900) Mark the following Spark SQL APIs as stable

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573090#comment-15573090
 ] 

Apache Spark commented on SPARK-17900:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15469

> Mark the following Spark SQL APIs as stable
> ---
>
> Key: SPARK-17900
> URL: https://issues.apache.org/jira/browse/SPARK-17900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Mark the following stable:
> Dataset/DataFrame
> - functions, since 1.3
> - ColumnName, since 1.3
> - DataFrameNaFunctions, since 1.3.1
> - DataFrameStatFunctions, since 1.4
> - UserDefinedFunction, since 1.3
> - UserDefinedAggregateFunction, since 1.5
> - Window and WindowSpec, since 1.4
> Data sources:
> - DataSourceRegister, since 1.5
> - RelationProvider, since 1.3
> - SchemaRelationProvider, since 1.3
> - CreatableRelationProvider, since 1.3
> - BaseRelation, since 1.3
> - TableScan, since 1.3
> - PrunedScan, since 1.3
> - PrunedFilteredScan, since 1.3
> - InsertableRelation, since 1.3
> Keep the following experimental / evolving:
> Data sources:
> - CatalystScan (tied to internal logical plans so it is not stable by 
> definition)
> Structured streaming:
> - all classes (introduced new in 2.0 and will likely change)
> Dataset typed operations (introduced in 1.6 and 2.0 and might change, 
> although probability is low)
> - all typed methods on Dataset
> - KeyValueGroupedDataset
> - o.a.s.sql.expressions.javalang.typed
> - o.a.s.sql.expressions.scalalang.typed
> - methods that return typed Dataset in SparkSession



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17900) Mark the following Spark SQL APIs as stable

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17900:


Assignee: Apache Spark  (was: Reynold Xin)

> Mark the following Spark SQL APIs as stable
> ---
>
> Key: SPARK-17900
> URL: https://issues.apache.org/jira/browse/SPARK-17900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Mark the following stable:
> Dataset/DataFrame
> - functions, since 1.3
> - ColumnName, since 1.3
> - DataFrameNaFunctions, since 1.3.1
> - DataFrameStatFunctions, since 1.4
> - UserDefinedFunction, since 1.3
> - UserDefinedAggregateFunction, since 1.5
> - Window and WindowSpec, since 1.4
> Data sources:
> - DataSourceRegister, since 1.5
> - RelationProvider, since 1.3
> - SchemaRelationProvider, since 1.3
> - CreatableRelationProvider, since 1.3
> - BaseRelation, since 1.3
> - TableScan, since 1.3
> - PrunedScan, since 1.3
> - PrunedFilteredScan, since 1.3
> - InsertableRelation, since 1.3
> Keep the following experimental / evolving:
> Data sources:
> - CatalystScan (tied to internal logical plans so it is not stable by 
> definition)
> Structured streaming:
> - all classes (introduced new in 2.0 and will likely change)
> Dataset typed operations (introduced in 1.6 and 2.0 and might change, 
> although probability is low)
> - all typed methods on Dataset
> - KeyValueGroupedDataset
> - o.a.s.sql.expressions.javalang.typed
> - o.a.s.sql.expressions.scalalang.typed
> - methods that return typed Dataset in SparkSession



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-10-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17834.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

> Fetch the earliest offsets manually in KafkaSource instead of counting on 
> KafkaConsumer
> ---
>
> Key: SPARK-17834
> URL: https://issues.apache.org/jira/browse/SPARK-17834
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.2, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572922#comment-15572922
 ] 

Cody Koeninger edited comment on SPARK-17812 at 10/13/16 8:33 PM:
--

Sorry, I didn't see this comment until just now.

X offsets back per partition is not a reasonable proxy for time when you're 
dealing with a stream that has multiple topics in it.  Agree we should break 
that out, focus on defining starting offsets in this ticket.

The concern with startingOffsets naming is that, because auto.offset.reset is 
orthogonal to specifying some offsets, you have a situation like this:

{noformat}
.format("kafka")
.option("subscribePattern", "topic.*")
.option("startingOffset", "latest")
.option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 
}, "topicbar" : { "0": 1234, "1": 4567 }}""")
{noformat}

where startingOffsetForRealzYo has a more reasonable name that conveys it is 
specifying starting offsets, yet is not confusingly similar to startingOffset

Trying to hack it all into one json as an alternative, with a "default" topic, 
means you're going to have to pick a key that isn't a valid topic, or add yet 
another layer of indirection.  It also makes it harder to make the format 
consistent with SPARK-17829 (which seems like a good thing to keep consistent, 
I agree)

Obviously I think you should just change the name, but it's your show.






was (Author: c...@koeninger.org):
Sorry, I didn't see this comment until just now.

X offsets back per partition is not a reasonable proxy for time when you're 
dealing with a stream that has multiple topics in it.  Agree we should break 
that out, focus on defining starting offsets in this ticket.

The concern with startingOffsets naming is that, because auto.offset.reset is 
orthogonal to specifying some offsets, you have a situation like this:

.format("kafka")
.option("subscribePattern", "topic.*")
.option("startingOffset", "latest")
.option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 
}, "topicbar" : { "0": 1234, "1": 4567 }}""")

where startingOffsetForRealzYo has a more reasonable name that conveys it is 
specifying starting offsets, yet is not confusingly similar to startingOffset

Trying to hack it all into one json as an alternative, with a "default" topic, 
means you're going to have to pick a key that isn't a valid topic, or add yet 
another layer of indirection.  It also makes it harder to make the format 
consistent with SPARK-17829 (which seems like a good thing to keep consistent, 
I agree)

Obviously I think you should just change the name, but it's your show.





> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Ofir Manor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573109#comment-15573109
 ] 

Ofir Manor commented on SPARK-17812:


Regarding (1) - of course it is *all* data in the source, as of query start. Just the 
same as a file system directory or a database table - I'm not sure a disclaimer that 
the directory or table could have had different data in the past adds anything but 
confusion...
Anyway, startingOffset is confusing because it seems you want a different parameter for 
"assign" --> to explicitly specify starting offsets.
For your use case, I would add:
5. Give me nnn messages (not necessarily the last ones). We still do one of the above 
options (trying to go back nnn messages, somehow split between the topic-partitions 
involved), but do not provide a more explicit guarantee like "last nnn". Generally, the 
distribution of messages to partitions doesn't have to be round-robin or uniform; it is 
strongly based on the key (which could be a state, a URL, etc.).
Anyway, I haven't seen a concrete suggestion on how to specify offsets or a timestamp, 
so I think that would be the next step in this ticket (I suggested you could condense 
everything into one option to avoid dependencies between options, but I don't have an 
elegant "stringly" suggestion).

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17731) Metrics for Structured Streaming

2016-10-13 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-17731.
---
  Resolution: Fixed
   Fix Version/s: 2.1.0
Target Version/s: 2.0.2, 2.1.0  (was: 2.1.0)

> Metrics for Structured Streaming
> 
>
> Key: SPARK-17731
> URL: https://issues.apache.org/jira/browse/SPARK-17731
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.1.0
>
>
> Metrics are needed for monitoring structured streaming apps. Here is the 
> design doc for implementing the necessary metrics.
> https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-17921:
---

 Summary: checkpointLocation being set in memory streams fail after 
restart. Should fail fast
 Key: SPARK-17921
 URL: https://issues.apache.org/jira/browse/SPARK-17921
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Affects Versions: 2.0.1, 2.0.0
Reporter: Burak Yavuz


The checkpointLocation option for memory streams in Structured Streaming is not used 
during recovery; however, the stream can use this location if it is set. Then, during 
recovery, if this location was set, we get an exception saying that the location will 
not be used for recovery and should be deleted.

It's better to just fail before the stream is started in the first place.
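A minimal sketch of the setup this describes: a memory-sink query started with an 
explicit checkpointLocation. The source, schema, and paths below are illustrative only; 
per this report the option is ignored on recovery, hence the proposal to reject it up 
front rather than only after a restart.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Any streaming source works; a file source is used here for illustration.
val schema = new StructType().add("value", StringType)
val input = spark.readStream.schema(schema).json("/tmp/stream-input")

val query = input.writeStream
  .format("memory")                                     // in-memory sink
  .queryName("debug_table")
  .option("checkpointLocation", "/tmp/mem-checkpoint")  // set, but not used for recovery
  .start()
{code}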



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573163#comment-15573163
 ] 

Apache Spark commented on SPARK-17921:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15470

> checkpointLocation being set in memory streams fail after restart. Should 
> fail fast
> ---
>
> Key: SPARK-17921
> URL: https://issues.apache.org/jira/browse/SPARK-17921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>
> The checkpointLocation option for memory streams in Structured Streaming is not 
> used during recovery; however, the stream can use this location if it is set. 
> Then, during recovery, if this location was set, we get an exception saying that 
> the location will not be used for recovery and should be deleted. 
> It's better to just fail before the stream is started in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17921:


Assignee: Apache Spark

> checkpointLocation being set in memory streams fail after restart. Should 
> fail fast
> ---
>
> Key: SPARK-17921
> URL: https://issues.apache.org/jira/browse/SPARK-17921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> The checkpointLocation option for memory streams in Structured Streaming is not 
> used during recovery; however, the stream can use this location if it is set. 
> Then, during recovery, if this location was set, we get an exception saying that 
> the location will not be used for recovery and should be deleted. 
> It's better to just fail before the stream is started in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17921:


Assignee: (was: Apache Spark)

> checkpointLocation being set in memory streams fail after restart. Should 
> fail fast
> ---
>
> Key: SPARK-17921
> URL: https://issues.apache.org/jira/browse/SPARK-17921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>
> The checkpointLocation option for memory streams in Structured Streaming is not 
> used during recovery; however, the stream can use this location if it is set. 
> Then, during recovery, if this location was set, we get an exception saying that 
> the location will not be used for recovery and should be deleted. 
> It's better to just fail before the stream is started in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573166#comment-15573166
 ] 

Cody Koeninger commented on SPARK-17812:


Here's my concrete suggestion:

3 mutually exclusive ways of subscribing:

{noformat}
.option("subscribe","topicFoo,topicBar")
.option("subscribePattern","topic.*")
.option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
{noformat}

where assign can only be specified that way, with no inline offsets.

2 non-mutually-exclusive ways of specifying the starting position; explicit 
startingOffsets obviously take priority:

{noformat}
.option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}""")
.option("startingTime", "earliest" | "latest" | long)
{noformat}
where long is a timestamp; work to be done on that later.
Note that even Kafka 0.8 has an API for time (a really crappy one, based on log file 
modification time), so startingTime doesn't necessarily rule out pursuing timestamps 
later.
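
Put together on the reader, the proposal would look something like the sketch below. 
The option names are the ones suggested in this thread (a proposal under discussion, 
not a released API); the broker address, topics, and offsets are illustrative.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  // exactly one of the three subscription modes: subscribe / subscribePattern / assign
  .option("assign", """{"topicfoo": [0, 1], "topicbar": [0, 1]}""")
  // starting position: explicit per-partition offsets take priority over startingTime
  .option("startingOffsets", """{"topicfoo": {"0": 1234, "1": 4567}}""")
  .option("startingTime", "earliest")
  .load()
{code}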



> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


