[jira] [Comment Edited] (SPARK-17890) scala.ScalaReflectionException

2016-10-13 Thread Khalid Reid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572217#comment-15572217
 ] 

Khalid Reid edited comment on SPARK-17890 at 10/13/16 3:29 PM:
---

Hi Sean,

I've created a small project [here|https://github.com/khalidr/Spark_17890] to 
reproduce the error using spark-submit.  I noticed that things work fine when 
using an RDD but fail when I use a DataFrame.

{code}
object Main extends App{

  val conf = new SparkConf()
  conf.setMaster("local")
  val session = SparkSession.builder()
.config(conf)
.getOrCreate()

  import session.implicits._

  val df = session.sparkContext.parallelize(List(1,2,3)).toDF

  println("flatmapping ...")
  df.flatMap(_ => Seq.empty[Foo])

  println("mapping...")
  df.map(_ => Seq.empty[Foo]) //spark-submit fails here

}
case class Foo(value:String)
{code}

{noformat}
Exception in thread "main" scala.ScalaReflectionException: class Foo not found.
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
at Main$$typecreator3$1.apply(Main.scala:21)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
at Main$.delayedEndpoint$Main$1(Main.scala:21)
at Main$delayedInit$body.apply(Main.scala:5)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Main$.main(Main.scala:5)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}
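
For reference, a minimal sketch of a workaround that is often suggested for this class of 
failure (note the App/delayedInit frames in the trace): drop the {{App}} trait in favour of 
an explicit {{main}} method and keep the case class at the top level. Whether that is the 
root cause here is an assumption; the reproduction above is unchanged.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Defined at the top level, before it is needed by the encoders.
case class Foo(value: String)

object Main {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local")
    val session = SparkSession.builder().config(conf).getOrCreate()
    import session.implicits._

    val df = session.sparkContext.parallelize(List(1, 2, 3)).toDF

    println("flatmapping ...")
    df.flatMap(_ => Seq.empty[Foo])

    println("mapping...")
    df.map(_ => Seq.empty[Foo]) // previously failed here under spark-submit

    session.stop()
  }
}
{code}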



was (Author: kor):
Hi Sean,

I've created a small project [here|https://github.com/khalidr/Spark_17890] to 
reproduce the error using spark-submit.  I noticed that things work fine when 
using an RDD but fail when I use a DataFrame.

{code}
object Main extends App{

  val conf = new SparkConf()
  conf.setMaster("local")
  val session = SparkSession.builder()
.config(conf)
.getOrCreate()

  import session.implicits._

  val df = session.sparkContext.parallelize(List(1,2,3)).toDF

  println("flatmapping ...")
  df.flatMap(_ => Seq.empty[Foo])

  println("mapping...")
  df.map(_ => Seq.empty[Foo]) //spark-submit fails here

}
case class Foo(value:String)
{code}

{noformat}
Exception in thread "main" scala.ScalaReflectionException: class Foo not found.
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
at Main$$typecreator3$1.apply(Main.scala:20)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
at Main$.delayedEndpoint$Main$1(Main.scala:20)
at Main$dela

[jira] [Created] (SPARK-17911) Scheduler does not messageScheduler for ResubmitFailedStages

2016-10-13 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-17911:


 Summary: Scheduler does not messageScheduler for 
ResubmitFailedStages
 Key: SPARK-17911
 URL: https://issues.apache.org/jira/browse/SPARK-17911
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.0.0
Reporter: Imran Rashid


It's not totally clear what the purpose of the {{messageScheduler}} is in 
{{DAGScheduler}}.  Perhaps it can be eliminated completely, or perhaps we 
should just clearly document its purpose.

This comes from a long discussion with [~markhamstra] on an unrelated PR here: 
https://github.com/apache/spark/pull/15335/files/c80ad22a242255cac91cce2c7c537f9b21100f70#diff-6a9ff7fb74fd490a50462d45db2d5e11

But it's tricky, so I'm breaking it out here to archive the discussion.

Note: this issue requires a decision on what to do before a code change, so 
let's just discuss it on JIRA first.






[jira] [Comment Edited] (SPARK-17890) scala.ScalaReflectionException

2016-10-13 Thread Khalid Reid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572217#comment-15572217
 ] 

Khalid Reid edited comment on SPARK-17890 at 10/13/16 3:30 PM:
---

Hi Sean,

I've created a small project [here|https://github.com/khalidr/Spark_17890] to 
reproduce the error using spark-submit.  I noticed that things work fine when 
using an RDD but fail when I use a DataFrame.

{code}
object Main extends App{

  val conf = new SparkConf()
  conf.setMaster("local")
  val session = SparkSession.builder()
.config(conf)
.getOrCreate()

  import session.implicits._

  val df = session.sparkContext.parallelize(List(1,2,3)).toDF   

  println("flatmapping ...")
  df.flatMap(_ => Seq.empty[Foo])

  println("mapping...")
  df.map(_ => Seq.empty[Foo]) //spark-submit fails here. Things work if I 
remove the toDF call

}
case class Foo(value:String)
{code}

{noformat}
Exception in thread "main" scala.ScalaReflectionException: class Foo not found.
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
at Main$$typecreator3$1.apply(Main.scala:21)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
at Main$.delayedEndpoint$Main$1(Main.scala:21)
at Main$delayedInit$body.apply(Main.scala:5)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Main$.main(Main.scala:5)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}



was (Author: kor):
Hi Sean,

I've created a small project [here|https://github.com/khalidr/Spark_17890] to 
reproduce the error using spark-submit.  I noticed that things work fine when 
using an RDD but fail when I use a DataFrame.

{code}
object Main extends App{

  val conf = new SparkConf()
  conf.setMaster("local")
  val session = SparkSession.builder()
.config(conf)
.getOrCreate()

  import session.implicits._

  val df = session.sparkContext.parallelize(List(1,2,3)).toDF

  println("flatmapping ...")
  df.flatMap(_ => Seq.empty[Foo])

  println("mapping...")
  df.map(_ => Seq.empty[Foo]) //spark-submit fails here

}
case class Foo(value:String)
{code}

{noformat}
Exception in thread "main" scala.ScalaReflectionException: class Foo not found.
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
at 
scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
at Main$$typecreator3$1.apply(Main.scala:21)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
at Main$.delayedEndpoint

[jira] [Updated] (SPARK-17911) Scheduler does not need messageScheduler for ResubmitFailedStages

2016-10-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-17911:
-
Summary: Scheduler does not need messageScheduler for ResubmitFailedStages  
(was: Scheduler does not messageScheduler for ResubmitFailedStages)

> Scheduler does not need messageScheduler for ResubmitFailedStages
> -
>
> Key: SPARK-17911
> URL: https://issues.apache.org/jira/browse/SPARK-17911
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> It's not totally clear what the purpose of the {{messageScheduler}} is in 
> {{DAGScheduler}}.  Perhaps it can be eliminated completely, or perhaps we 
> should just clearly document its purpose.
> This comes from a long discussion with [~markhamstra] on an unrelated PR here: 
> https://github.com/apache/spark/pull/15335/files/c80ad22a242255cac91cce2c7c537f9b21100f70#diff-6a9ff7fb74fd490a50462d45db2d5e11
> But it's tricky, so I'm breaking it out here to archive the discussion.
> Note: this issue requires a decision on what to do before a code change, so 
> let's just discuss it on JIRA first.






[jira] [Commented] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572278#comment-15572278
 ] 

Harish commented on SPARK-17908:


Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: [columns];
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have DF say df
> df1= df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 =df.join(df1, ['key1', 'key2', 'key3'])\
>

[jira] [Commented] (SPARK-17911) Scheduler does not need messageScheduler for ResubmitFailedStages

2016-10-13 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572289#comment-15572289
 ] 

Imran Rashid commented on SPARK-17911:
--

Copying the earlier discussion on the PR here

from squito:
bq. I find myself frequently wondering about the purpose of this. It's commented 
very tersely on RESUBMIT_TIMEOUT, but I think it might be nice to add a longer 
comment here. I guess something like
bq. "If we get one fetch-failure, we often get more fetch failures across 
multiple executors. We will get better parallelism when we resubmit the 
mapStage if we can resubmit when we know about as many of those failures as 
possible. So this is a heuristic to add a small delay to see if we gather a few 
more failures before we resubmit."
bq.We do not need the delay to figure out exactly which shuffle-map outputs are 
gone on the executor -- we always mark the executor as lost on a fetch failure, 
which means we mark all its map output as gone. (This is really confusing -- it 
looks like we only remove the one shuffle-map output that was involved in the 
fetch failure, but then the entire removal is buried inside another method a 
few lines further.)
bq.I did some browsing through history, and there used to be this comment
{noformat}
 // Periodically resubmit failed stages if some map output fetches have 
failed and we have
 // waited at least RESUBMIT_TIMEOUT. We wait for this short time because 
when a node fails,
 // tasks on many other nodes are bound to get a fetch failure, and they 
won't all get it at
 // the same time, so we want to make sure we've identified all the reduce 
tasks that depend
 // on the failed node.
{noformat}
bq. at least in the current version, this also sounds like a bad reason to have 
the delay. failedStage won't be resubmitted till mapStage completes anyway, and 
then it'll look to see what tasks it is missing. Adding a tiny delay on top of 
the natural delay for mapStage seems pretty pointless.
bq. I don't even think that the reason I gave in my suggested comment is a good 
one -- do you really expect failures in multiple executors? But it is at least 
logically consistent.

from markhamstra
bq. I don't like "Periodically" in your suggested comment, since this is a 
one-shot action after a delay of RESUBMIT_TIMEOUT milliseconds.
bq.I agree that this delay-before-resubmit logic is suspect. If we are both 
thinking correctly that a 200 ms delay on top of the time to re-run the 
mapStage is all but inconsequential, then removing it in this PR would be fine. 
If there are unanticipated consequences, though, I'd prefer to have that change 
in a separate PR.

from squito
bq. yeah probably a separate PR, sorry this was just an opportunity for me to 
rant :)
bq.And sorry if I worded it poorly, but I was not suggesting the one w/ 
"Periodically" as a better comment -- in fact I think its a bad comment, just 
wanted to mention it was another description which used to be there long ago.
bq. This was my suggestion:
bq. If we get one fetch-failure, we often get more fetch failures across 
multiple executors. We will get better parallelism when we resubmit the 
mapStage if we can resubmit when we know about as many of those failures as 
possible. So this is a heuristic to add a small delay to see if we gather a few 
more failures before we resubmit.

from markhamstra
bq. Ah, sorry for ascribing the prior comment to your preferences. That comment 
actually did make sense a long time ago when the resubmitting of stages really 
was done periodically by an Akka scheduled event that fired every something 
seconds. I'm pretty sure the RESUBMIT_TIMEOUT stuff is also legacy code that 
doesn't make sense and isn't necessary any more.
bq. So, do you want to do the follow-up PR to get rid of it, or shall I?
bq. BTW, nothing wrong with your wording -- but my poor reading can create 
misunderstanding of even the clearest text.
from squito
bq. if you are willing, could you please file the follow up? I am bouncing 
between various things in my backlog -- though that change is small, I have a 
feeling it will merit extra discussion as a risky change; it would be great if 
you could drive it.
bq. Ok, I can get started on that. I believe that leaves this PR ready to merge.
bq. I think that I am going to backtrack on creating a new PR, because I think 
that the RESUBMIT_TIMEOUT actually does still make sense.
bq. If we go way back in DAGScheduler history 
(https://github.com/apache/spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala)
 we'll find that we had an event queue that was polled every 10 millis 
(POLL_TIMEOUT) and that fetch failures didn't produce separate resubmit tasks 
events, but rather we called resubmitFailedStages within the event 
polling/handling loop if 50 millis (RESUBMIT_TIMEOUT) had passed since the last 
time a FetchFailed was receiv

[jira] [Comment Edited] (SPARK-17911) Scheduler does not need messageScheduler for ResubmitFailedStages

2016-10-13 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572289#comment-15572289
 ] 

Imran Rashid edited comment on SPARK-17911 at 10/13/16 3:44 PM:


Copying the earlier discussion on the PR here

from squito:
bq. I find myself frequently wondering about the purpose of this. It's commented 
very tersely on RESUBMIT_TIMEOUT, but I think it might be nice to add a longer 
comment here. I guess something like
bq. "If we get one fetch-failure, we often get more fetch failures across 
multiple executors. We will get better parallelism when we resubmit the 
mapStage if we can resubmit when we know about as many of those failures as 
possible. So this is a heuristic to add a small delay to see if we gather a few 
more failures before we resubmit."
bq.We do not need the delay to figure out exactly which shuffle-map outputs are 
gone on the executor -- we always mark the executor as lost on a fetch failure, 
which means we mark all its map output as gone. (This is really confusing -- it 
looks like we only remove the one shuffle-map output that was involved in the 
fetch failure, but then the entire removal is buried inside another method a 
few lines further.)
bq.I did some browsing through history, and there used to be this comment
{noformat}
 // Periodically resubmit failed stages if some map output fetches have 
failed and we have
 // waited at least RESUBMIT_TIMEOUT. We wait for this short time because 
when a node fails,
 // tasks on many other nodes are bound to get a fetch failure, and they 
won't all get it at
 // the same time, so we want to make sure we've identified all the reduce 
tasks that depend
 // on the failed node.
{noformat}
bq. at least in the current version, this also sounds like a bad reason to have 
the delay. failedStage won't be resubmitted till mapStage completes anyway, and 
then it'll look to see what tasks it is missing. Adding a tiny delay on top of 
the natural delay for mapStage seems pretty pointless.
bq. I don't even think that the reason I gave in my suggested comment is a good 
one -- do you really expect failures in multiple executors? But it is at least 
logically consistent.

from markhamstra
bq. I don't like "Periodically" in your suggested comment, since this is a 
one-shot action after a delay of RESUBMIT_TIMEOUT milliseconds.
bq.I agree that this delay-before-resubmit logic is suspect. If we are both 
thinking correctly that a 200 ms delay on top of the time to re-run the 
mapStage is all but inconsequential, then removing it in this PR would be fine. 
If there are unanticipated consequences, though, I'd prefer to have that change 
in a separate PR.

from squito
bq. yeah probably a separate PR, sorry this was just an opportunity for me to 
rant :)
bq.And sorry if I worded it poorly, but I was not suggesting the one w/ 
"Periodically" as a better comment -- in fact I think its a bad comment, just 
wanted to mention it was another description which used to be there long ago.
bq. This was my suggestion:
bq. If we get one fetch-failure, we often get more fetch failures across 
multiple executors. We will get better parallelism when we resubmit the 
mapStage if we can resubmit when we know about as many of those failures as 
possible. So this is a heuristic to add a small delay to see if we gather a few 
more failures before we resubmit.

from markhamstra
bq. Ah, sorry for ascribing the prior comment to your preferences. That comment 
actually did make sense a long time ago when the resubmitting of stages really 
was done periodically by an Akka scheduled event that fired every something 
seconds. I'm pretty sure the RESUBMIT_TIMEOUT stuff is also legacy code that 
doesn't make sense and isn't necessary any more.
bq. So, do you want to do the follow-up PR to get rid of it, or shall I?
bq. BTW, nothing wrong with your wording -- but my poor reading can create 
misunderstanding of even the clearest text.
from squito
bq. if you are willing, could you please file the follow up? I am bouncing 
between various things in my backlog -- though that change is small, I have a 
feeling it will merit extra discussion as a risky change; it would be great if 
you could drive it.
from markhamstra
bq. Ok, I can get started on that. I believe that leaves this PR ready to merge.
from markhamstra
bq. I think that I am going to backtrack on creating a new PR, because I think 
that the RESUBMIT_TIMEOUT actually does still make sense.
bq. If we go way back in DAGScheduler history 
(https://github.com/apache/spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala)
 we'll find that we had an event queue that was polled every 10 millis 
(POLL_TIMEOUT) and that fetch failures didn't produce separate resubmit tasks 
events, but rather we called resubmitFailedStages within the event 
polling/handling loop if 

[jira] [Created] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch

2016-10-13 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-17912:


 Summary: Refactor code generation to get data for 
ColumnVector/ColumnarBatch
 Key: SPARK-17912
 URL: https://issues.apache.org/jira/browse/SPARK-17912
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
Reporter: Kazuaki Ishizaki


Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is 
becoming pervasive. The code generation part can be reused by multiple 
components (e.g. parquet reader, data cache, and so on).
This JIRA refactors the code generation part as a trait for ease of reuse.
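
For illustration, a rough sketch of the shape such a trait could take; everything below is 
hypothetical and only meant to show the reuse pattern, not the actual change.

{code}
// Hypothetical sketch -- the trait and method names are made up, not taken from
// the actual patch. The idea is a mix-in that centralises generation of the Java
// source used to read a value out of a ColumnVector, so that the Parquet reader,
// data cache, and other consumers can share it instead of duplicating templates.
trait ColumnVectorCodegenSupport {
  // Returns the Java source snippet reading row `rowIdx` of `vectorVar`
  // for the given primitive type name.
  def genGetValue(vectorVar: String, rowIdx: String, dataType: String): String =
    dataType match {
      case "boolean" => s"$vectorVar.getBoolean($rowIdx)"
      case "int"     => s"$vectorVar.getInt($rowIdx)"
      case "long"    => s"$vectorVar.getLong($rowIdx)"
      case "double"  => s"$vectorVar.getDouble($rowIdx)"
      case other     => sys.error(s"unsupported type in this sketch: $other")
    }
}

object CodegenSketch extends ColumnVectorCodegenSupport {
  def main(args: Array[String]): Unit =
    println(genGetValue("colVector", "rowIdx", "int")) // prints colVector.getInt(rowIdx)
}
{code}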






[jira] [Commented] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572330#comment-15572330
 ] 

Sean Owen commented on SPARK-17908:
---

Yes, what's your code? This says one DF has just 'columns' as a column. 

> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have DF say df
> df1= df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 =df.join(df1, ['key1', 'key2', 'key3'])\
>  .withcolumn('newcol', func.col('val')/func.col('total'))
> I am getting key2 is not present in df1, which is not true because df1.show() 
> is showing the data with key2.
> Then I added this code before the join -- df1 = df1.columnRenamed('key2', 'key2') 
> renamed with the same name. Then it works.
> The stack trace says the column is missing, but it is not.






[jira] [Commented] (SPARK-17911) Scheduler does not need messageScheduler for ResubmitFailedStages

2016-10-13 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572344#comment-15572344
 ] 

Imran Rashid commented on SPARK-17911:
--

bq. In other words, handling a ResubmitFailedStages event should be quick, and 
causes failedStages to be cleared, allowing the next ResubmitFailedStages event 
to be posted from the handling of another FetchFailed. If there are the 
expected lot of fetch failures for a single stage, and there is no 
RESUBMIT_TIMEOUT, then it is quite likely that there will be a burst of 
resubmit events (and corresponding log messages) and submitStage calls made in 
rapid succession.

Lemme rephrase your comment to make sure I understand it.

The messageScheduler and delay do *not* affect correctness, or even what 
actually gets resubmitted.  When we resubmit {{mapStage}}, we always resubmit 
all tasks corresponding to shuffle map output on the failed executor.  And when 
we resubmit the {{failedStage}}, there is probably a long enough delay from 
{{mapStage}} that waiting 200ms is relatively inconsequential.

However, it *does* affect the logging.  If the scheduler event queue is 
relatively empty, then as the fetch failures trickle in, for each one we'd post 
a Resubmit event which gets handled relatively quickly.  So each fetch failure 
would trigger another resubmit event and more logging.  That is undesirable, 
both because of the noise in the logs and because it would create an 
unnecessary flood of events on the scheduler event queue.

Is that a fair summary?

I agree with everything said there, but then I'd request we take one of two 
actions:

1) Change the resubmit logic -- in addition to checking failed stages, you can 
also check {{waitingStages}} and {{runningStages}}.  That is what happens 
eventually anyway inside {{resubmitFailedStages}}.  This would actually be even 
better for decreasing noise in the logs etc.
I know this may seem like a small thing to make such a big deal about, but I 
honestly think this is confusing enough that it's worth cleaning up -- 
eliminating an unneeded event queue is, I think, a significant win.

2) If we don't do that, let's at least add a better comment explaining the 
purpose (maybe just a pointer to this JIRA at this point).
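
For reference, a self-contained toy model of the debounce behaviour described above -- a 
burst of fetch failures produces a single delayed resubmit. The constant and field names 
are placeholders echoing the discussion; this is a sketch, not the DAGScheduler code.

{code}
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable

// The first fetch failure in a burst schedules one delayed resubmit action;
// later failures only accumulate in failedStages until it fires.
object ResubmitDebounceSketch {
  private val RESUBMIT_TIMEOUT_MS = 200L
  private val messageScheduler = Executors.newScheduledThreadPool(1)
  private val failedStages = mutable.Set.empty[Int]

  def onFetchFailed(stageId: Int): Unit = synchronized {
    val noResubmitPending = failedStages.isEmpty
    failedStages += stageId
    if (noResubmitPending) {
      messageScheduler.schedule(new Runnable {
        override def run(): Unit = resubmitFailedStages()
      }, RESUBMIT_TIMEOUT_MS, TimeUnit.MILLISECONDS)
    }
  }

  private def resubmitFailedStages(): Unit = synchronized {
    println(s"resubmitting stages: $failedStages") // one log line for the whole burst
    failedStages.clear()
  }

  def main(args: Array[String]): Unit = {
    (1 to 5).foreach(onFetchFailed) // burst of fetch failures -> one resubmit
    Thread.sleep(500)
    messageScheduler.shutdown()
  }
}
{code}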

> Scheduler does not need messageScheduler for ResubmitFailedStages
> -
>
> Key: SPARK-17911
> URL: https://issues.apache.org/jira/browse/SPARK-17911
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> It's not totally clear what the purpose of the {{messageScheduler}} is in 
> {{DAGScheduler}}.  Perhaps it can be eliminated completely, or perhaps we 
> should just clearly document its purpose.
> This comes from a long discussion with [~markhamstra] on an unrelated PR here: 
> https://github.com/apache/spark/pull/15335/files/c80ad22a242255cac91cce2c7c537f9b21100f70#diff-6a9ff7fb74fd490a50462d45db2d5e11
> But it's tricky, so I'm breaking it out here to archive the discussion.
> Note: this issue requires a decision on what to do before a code change, so 
> let's just discuss it on JIRA first.






[jira] [Comment Edited] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572278#comment-15572278
 ] 

Harish edited comment on SPARK-17908 at 10/13/16 4:09 PM:
--

Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: ['key1', 'key2', 'key3', 'total'];
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


was (Author: harishk15):
Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: [columns];
at 
org.apache.spark.sql.ca

[jira] [Comment Edited] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572278#comment-15572278
 ] 

Harish edited comment on SPARK-17908 at 10/13/16 4:13 PM:
--

Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: ['key1', 'key2', 'key3', 'total'  and df coumns];
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


was (Author: harishk15):
Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: ['key1', 'key2', 'key3', 't

[jira] [Comment Edited] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572278#comment-15572278
 ] 

Harish edited comment on SPARK-17908 at 10/13/16 4:12 PM:
--

Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: ['key1', 'key2', 'key3', 'total'  and df1 coumns];
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


was (Author: harishk15):
Traceback (most recent call last):
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/home/hpcuser/iri/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o376.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`key2`' given input 
columns: ['key1', 'key2', 'key3', '

[jira] [Commented] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572373#comment-15572373
 ] 

Harish commented on SPARK-17908:


Sorry, I didn't put the actual column names from my code in the stack trace; I have 
modified the first line of the stack trace to show the column names.

> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have DF say df
> df1= df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 =df.join(df1, ['key1', 'key2', 'key3'])\
>  .withcolumn('newcol', func.col('val')/func.col('total'))
> I am getting key2 is not present in df1, which is not true because df1.show() 
> is showing the data with key2.
> Then I added this code before the join -- df1 = df1.columnRenamed('key2', 'key2') 
> renamed with the same name. Then it works.
> The stack trace says the column is missing, but it is not.






[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572377#comment-15572377
 ] 

Shivaram Venkataraman commented on SPARK-17904:
---

+1, I think this sounds good [~yanboliang]. We could also offer a contract to 
load these packages before running UDFs? Otherwise the UDFs need to use 
`library(Matrix)` in every function.

> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run UDFs on a SparkDataFrame.
> If users use third-party libraries inside the function passed 
> into {{spark.lapply}} or {{dapply}}, they should install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that it succeeded, 
> we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test )
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may load lots 
> of libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} throughout the interactive session.
> Should we provide one API to wrap the work mentioned above, so that users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install the packages on 
> each executor, I think it will not pollute the native R environment. I'd 
> like to know whether this makes sense, and feel free to correct me if there is a 
> misunderstanding.






[jira] [Comment Edited] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572373#comment-15572373
 ] 

Harish edited comment on SPARK-17908 at 10/13/16 4:17 PM:
--

Sorry, I didn't put the actual column names from my code in the stack trace; I have 
modified the first line of the stack trace to show the column names. I am confirming that I 
can see the column name in the columns list, e.g. key2#20202 is reported missing from 
key2#20202, key1#key2#23723, key3#20342, etc.


was (Author: harishk15):
Sorry, I didn't put the actual column names from my code in the stack trace; I have 
modified the first line of the stack trace to show the column names.

> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have DF say df
> df1= df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 =df.join(df1, ['key1', 'key2', 'key3'])\
>  .withcolumn('newcol', func.col('val')/func.col('total'))
> I am getting key2 is not present in df1, which is not true because df1.show() 
> is showing the data with key2.
> Then I added this code before the join -- df1 = df1.columnRenamed('key2', 'key2') 
> renamed with the same name. Then it works.
> The stack trace says the column is missing, but it is not.






[jira] [Commented] (SPARK-17192) Issuing an exception when users specify the partitioning columns without a given schema

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572394#comment-15572394
 ] 

Apache Spark commented on SPARK-17192:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/15467

> Issuing an exception when users specify the partitioning columns without a 
> given schema
> ---
>
> Key: SPARK-17192
> URL: https://issues.apache.org/jira/browse/SPARK-17192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> We need to issue an exception when users specify the partitioning columns 
> without a given schema.






[jira] [Commented] (SPARK-14561) History Server does not see new logs in S3

2016-10-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572401#comment-15572401
 ] 

Steve Loughran commented on SPARK-14561:


To clarify: it's not changes to existing files that aren't showing up, *it is 
new files added to the same destination directory*.

If that's the case, something is up with the scanning:

# set the logging of org.apache.spark.deploy.history.FsHistoryProvider to 
debug
# have a look at the scan interval. Is it too long? 
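
Concretely, that would be something like the following (assuming the stock log4j.properties 
setup and the {{spark.history.fs.update.interval}} setting; adjust to your deployment):

{noformat}
# conf/log4j.properties -- debug logging for the history provider
log4j.logger.org.apache.spark.deploy.history.FsHistoryProvider=DEBUG

# conf/spark-defaults.conf on the history server -- how often the log directory is rescanned
spark.history.fs.update.interval  10s
{noformat}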

> History Server does not see new logs in S3
> --
>
> Key: SPARK-14561
> URL: https://issues.apache.org/jira/browse/SPARK-14561
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Miles Crawford
>
> If you set the Spark history server to use a log directory with an s3a:// 
> url, everything appears to work fine at first, but new log files written by 
> applications are not picked up by the server.






[jira] [Commented] (SPARK-9004) Add s3 bytes read/written metrics

2016-10-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572414#comment-15572414
 ] 

Steve Loughran commented on SPARK-9004:
---

HADOOP-13605 added a whole new set of counters for HDFS, S3 and hopefully soon 
Azure; there's an API call on the FS {{getStorageStatistics()}} to query these.

One problem though: this isn't shipping in Hadoop branch-2 yet, so you can't 
write code that uses it, not unless there's some introspection/plugin 
mechanism. 

All the stats are just {{name: String -> value: Long}}, so something that 
collects a {{Map[String, Long]}} would work. 
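
For example, something along these lines -- a sketch only; the Hadoop-side method names 
(getStorageStatistics, getLongStatistics, getName, getValue) are assumptions based on 
HADOOP-13605 and need to be checked against the final API:

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

object StorageStatsSketch {
  // Flatten whatever per-FS counters the new API exposes into a Map[String, Long].
  // Method names here are assumptions based on HADOOP-13605, not verified signatures.
  def collectStorageStats(fs: FileSystem): Map[String, Long] =
    fs.getStorageStatistics
      .getLongStatistics.asScala
      .map(stat => stat.getName -> stat.getValue)
      .toMap
}
{code}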

> Add s3 bytes read/written metrics
> -
>
> Key: SPARK-9004
> URL: https://issues.apache.org/jira/browse/SPARK-9004
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Abhishek Modi
>Priority: Minor
>
> s3 read/write metrics can be pretty useful in finding the total aggregate 
> data processed






[jira] [Commented] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572421#comment-15572421
 ] 

Sean Owen commented on SPARK-17908:
---

You must be doing something different from what you show, because what you show 
doesn't even quite compile. Here I adapted your example to the Python example in 
the docs and ran it successfully on 2.0.1:

{code}
import pyspark.sql.functions as func
from pyspark.sql.types import *

sc = spark.sparkContext
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: (p[0], p[1].strip()))
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in 
schemaString.split()]
schema = StructType(fields)
df = spark.createDataFrame(people, schema)

df1 = df.groupBy('name', 'age').agg(func.count(func.col('age')).alias('total'))
df3 = df.join(df1, ['name', 'age']).withColumn('newcol', 
func.col('age')/func.col('total'))
{code}


> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have a DataFrame, say df:
> df1 = df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 = df.join(df1, ['key1', 'key2', 'key3'])\
>  .withColumn('newcol', func.col('val')/func.col('total'))
> I get an error saying key2 is not present in df1, which is not true because 
> df1.show() does show the data with key2.
> Then I added this code before the join -- df1 = df1.withColumnRenamed('key2', 
> 'key2') -- i.e. renamed the column to the same name. Then it works.
> The stack trace says a column is missing, but it is not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17714) ClassCircularityError is thrown when using org.apache.spark.util.Utils.classForName 

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572435#comment-15572435
 ] 

Sean Owen commented on SPARK-17714:
---

This is resolved now, right?

> ClassCircularityError is thrown when using 
> org.apache.spark.util.Utils.classForName 
> 
>
> Key: SPARK-17714
> URL: https://issues.apache.org/jira/browse/SPARK-17714
> Project: Spark
>  Issue Type: Bug
>Reporter: Weiqing Yang
>
> This JIRA is a follow-up to 
> [SPARK-15857|https://issues.apache.org/jira/browse/SPARK-15857].
> Task invokes CallerContext.setCurrentContext() to set its callerContext to 
> HDFS. In setCurrentContext(), it tries looking for the class 
> {{org.apache.hadoop.ipc.CallerContext}} by using 
> {{org.apache.spark.util.Utils.classForName}}. This causes 
> ClassCircularityError to be thrown when running ReplSuite in master Maven 
> builds (The same tests pass in the SBT build). A hotfix 
> [SPARK-17710|https://issues.apache.org/jira/browse/SPARK-17710] has been made 
> by using Class.forName instead, but it needs further investigation.
> Error:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.3/2000/testReport/junit/org.apache.spark.repl/ReplSuite/simple_foreach_with_accumulator/
> {code}
> scala> accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, 
> name: None, value: 0)
> scala> org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, localhost): java.lang.ClassCircularityError: 
> io/netty/util/internal/_matchers_/org/apache/spark/network/protocol/MessageMatcher
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:62)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:54)
> at 
> io.netty.util.internal.TypeParameterMatcher.get(TypeParameterMatcher.java:42)
> at 
> io.netty.util.internal.TypeParameterMatcher.find(TypeParameterMatcher.java:78)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.<init>(MessageToMessageEncoder.java:59)
> at 
> org.apache.spark.network.protocol.MessageEncoder.<init>(MessageEncoder.java:34)
> at org.apache.spark.network.TransportContext.<init>(TransportContext.java:78)
> at 
> org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:354)
> at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
> at 
> org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:80)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:62)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:54)
> at 
> io.netty.util.internal.TypeParameterMatcher.get(TypeParameterMatcher.java:42)
> at 
> io.netty.util.internal.TypeParameterMatcher.find(TypeParameterMatcher.java:78)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.<init>(MessageToMessageEncoder.java:59)
> at 
> org.apache.spark.network.protocol.MessageEncoder.<init>(MessageEncoder.java:34)
> at org.apache.spark.network.TransportContext.<init>(TransportContext.java:78)
> at 
> org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:354)
> at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
> at 
> org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:80)
> at java.lang.Cl

[jira] [Commented] (SPARK-12571) AWS credentials not available for read.parquet in SQLContext

2016-10-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572442#comment-15572442
 ] 

Steve Loughran commented on SPARK-12571:


This means the credentials aren't at the far end, either in the XML (or, in later 
Hadoop versions, env vars or IAM role data).

How was the code deployed? Standalone, or via YARN?
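
One way to rule out configuration propagation while debugging (a sketch only, 
assuming s3a and the shell-provided {{sc}}/{{sqlContext}}; the bucket path and 
environment variable names are placeholders) is to push the credentials into the 
Hadoop configuration on the driver before reading:

{code}
// Set s3a credentials explicitly instead of relying on core-site.xml or env vars.
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val df = sqlContext.read.parquet("s3a://some-bucket/some/path/")
{code}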

> AWS credentials not available for read.parquet in SQLContext
> 
>
> Key: SPARK-12571
> URL: https://issues.apache.org/jira/browse/SPARK-12571
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: repeated with s3n and s3a on hadoop 2.6 and hadoop 2.7.1
>Reporter: Kostiantyn Kudriavtsev
>
> com.amazonaws.AmazonClientException: Unable to load AWS credentials from any 
> provider in the chain
> at 
> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
> at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
> at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
> at 
> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:384)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
> at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
> at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8437) Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

2016-10-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572468#comment-15572468
 ] 

Steve Loughran commented on SPARK-8437:
---

Just came across this by way of comments in the source. This *shouldn't* happen; 
the glob code ought to recognise when there is no wildcard and exit early, faster 
than if there was a wildcard. If it's taking longer, then the full listing process 
is making a mess of things, or somehow the result generation is playing up. If 
it was S3 only I'd blame the S3 APIs, but this sounds like S3 just amplifies a 
problem that may already exist.

How big was the directory where this surfaced? Was it deep, wide, or deep & wide?
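
For illustration only (this is not the actual Hadoop code, just the shape of the 
fast path described above): when the supplied path contains no glob 
metacharacters, a single {{getFileStatus}} call should be enough, with the full 
glob expansion reserved for real wildcards.

{code}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Sketch: take the cheap path when there is nothing to expand.
def listMaybeGlob(fs: FileSystem, path: Path): Array[FileStatus] = {
  val hasGlob = path.toString.exists(c => "{}[]*?\\".indexOf(c) >= 0)
  if (hasGlob) fs.globStatus(path)           // full glob machinery
  else Array(fs.getFileStatus(path))         // single round trip, no expansion
}
{code}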

> Using directory path without wildcard for filename slow for large number of 
> files with wholeTextFiles and binaryFiles
> -
>
> Key: SPARK-8437
> URL: https://issues.apache.org/jira/browse/SPARK-8437
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.3.1, 1.4.0
> Environment: Ubuntu 15.04 + local filesystem
> Amazon EMR + S3 + HDFS
>Reporter: Ewan Leith
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.4.1, 1.5.0
>
>
> When calling wholeTextFiles or binaryFiles with a directory path with 10,000s 
> of files in it, Spark hangs for a few minutes before processing the files.
> If you add a * to the end of the path, there is no delay.
> This happens for me on Spark 1.3.1 and 1.4 on the local filesystem, HDFS, and 
> on S3.
> To reproduce, create a directory with 50,000 files in it, then run:
> val a = sc.binaryFiles("file:/path/to/files/")
> a.count()
> val b = sc.binaryFiles("file:/path/to/files/*")
> b.count()
> and monitor the different startup times.
> For example, in the spark-shell these commands are pasted in together, so the 
> delay at f.count() is from 10:11:08 to 10:13:29 to output "Total input paths 
> to process : 4", then until 10:15:42 to begin processing files:
> scala> val f = sc.binaryFiles("file:/home/ewan/large/")
> 15/06/18 10:11:07 INFO MemoryStore: ensureFreeSpace(160616) called with 
> curMem=0, maxMem=278019440
> 15/06/18 10:11:07 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 156.9 KB, free 265.0 MB)
> 15/06/18 10:11:08 INFO MemoryStore: ensureFreeSpace(17282) called with 
> curMem=160616, maxMem=278019440
> 15/06/18 10:11:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.9 KB, free 265.0 MB)
> 15/06/18 10:11:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:40430 (size: 16.9 KB, free: 265.1 MB)
> 15/06/18 10:11:08 INFO SparkContext: Created broadcast 0 from binaryFiles at 
> <console>:21
> f: org.apache.spark.rdd.RDD[(String, 
> org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/ 
> BinaryFileRDD[0] at binaryFiles at <console>:21
> scala> f.count()
> 15/06/18 10:13:29 INFO FileInputFormat: Total input paths to process : 4
> 15/06/18 10:15:42 INFO FileInputFormat: Total input paths to process : 4
> 15/06/18 10:15:42 INFO CombineFileInputFormat: DEBUG: Terminated node 
> allocation with : CompletedNodes: 1, size left: 0
> 15/06/18 10:15:42 INFO SparkContext: Starting job: count at <console>:24
> 15/06/18 10:15:42 INFO DAGScheduler: Got job 0 (count at <console>:24) with 
> 4 output partitions (allowLocal=false)
> 15/06/18 10:15:42 INFO DAGScheduler: Final stage: ResultStage 0(count at 
> <console>:24)
> 15/06/18 10:15:42 INFO DAGScheduler: Parents of final stage: List()
> Adding a * to the end of the path removes the delay:
> scala> val f = sc.binaryFiles("file:/home/ewan/large/*")
> 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(160616) called with 
> curMem=0, maxMem=278019440
> 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 156.9 KB, free 265.0 MB)
> 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(17309) called with 
> curMem=160616, maxMem=278019440
> 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 16.9 KB, free 265.0 MB)
> 15/06/18 10:08:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:42825 (size: 16.9 KB, free: 265.1 MB)
> 15/06/18 10:08:29 INFO SparkContext: Created broadcast 0 from binaryFiles at 
> <console>:21
> f: org.apache.spark.rdd.RDD[(String, 
> org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/* 
> BinaryFileRDD[0] at binaryFiles at <console>:21
> scala> f.count()
> 15/06/18 10:08:32 INFO FileInputFormat: Total input paths to process : 4
> 15/06/18 10:08:33 INFO FileInputFormat: Total input paths to process : 4
> 15/06/18 10:08:35 INFO CombineFileInputFormat: DEBUG: Terminated node 
> allocation wi

[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors

2016-10-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572472#comment-15572472
 ] 

Shivaram Venkataraman commented on SPARK-17902:
---

Good catch - looks like this was changed in 
https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c 
which is part of 1.6.0. Do you have a small test case that fails?

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems it is completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572478#comment-15572478
 ] 

Harish commented on SPARK-17908:


Yes, your code structure is the same as mine, but I have 70M records with 1000 
columns. It works with simple joins as above, but when you modify the DF 
multiple times this happens. I have been getting this error since 1.6.0 but 
didn't report it because I couldn't prove it with a working use case; it happens 
frequently with my code, so I tried the rename workaround.

Here are my steps:
df = df.select('key1', 'key2', 'key3', 'val', 'total')  # ~70 million records
df = df.withColumn('key2', func.lit('ABC'))
df1 = df.groupBy('key1', 'key2', 
'key3').agg(func.count(func.col('val')).alias('total'))
df1 = df1.withColumnRenamed('key2', 'key2')
df3 = df.join(df1, ['key1', 'key2', 'key3'])\
.withColumn('newcol', func.col('val')/func.col('total'))


I just wanted to see if anyone else has observed this behavior. I will try to find 
a code sample to prove the issue; if not, in another 1-2 days I will mark it 
not reproducible.



> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have a DataFrame, say df:
> df1 = df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 = df.join(df1, ['key1', 'key2', 'key3'])\
>  .withColumn('newcol', func.col('val')/func.col('total'))
> I get an error saying key2 is not present in df1, which is not true because 
> df1.show() does show the data with key2.
> Then I added this code before the join -- df1 = df1.withColumnRenamed('key2', 
> 'key2') -- i.e. renamed the column to the same name. Then it works.
> The stack trace says a column is missing, but it is not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17908) Column names Corrupted in pysaprk dataframe groupBy

2016-10-13 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572478#comment-15572478
 ] 

Harish edited comment on SPARK-17908 at 10/13/16 4:58 PM:
--

Yes, your code structure is the same as mine, but I have 70M records with 1000 
columns. It works with simple joins as above, but when you modify the DF 
multiple times this happens. I have been getting this error since 1.6.0 but 
didn't report it because I couldn't prove it with a working use case; it happens 
frequently with my code, so I tried the rename workaround.

Here are my steps:
df = df.select('key1', 'key2', 'key3', 'val', 'total')  # ~70 million records
df = df.withColumn('key2', func.lit('ABC'))
df1 = df.groupBy('key1', 'key2', 
'key3').agg(func.count(func.col('val')).alias('total'))
df1 = df1.withColumnRenamed('key2', 'key2')
df3 = df.join(df1, ['key1', 'key2', 'key3'])\
.withColumn('newcol', func.col('val')/func.col('total'))


I just wanted to see if anyone else has observed this behavior. I will try to find 
a code sample to prove the issue; if not, in another 1-2 days I will mark it 
not reproducible.




was (Author: harishk15):
Yes. You are code structure is same as mine.. But i have 70M records with 1000 
columns. It works with simple joins as above. But when you try to modify the DF 
multiple times this will happen, as i was getting this error from 1.6.0 but i 
didn't raise because i cant prove this with working use case. But it happens 
frequently with my code so i tried with rename

Here my steps:
df = df.select('key1', 'key2', 'key3', 'val','total') -70Million records
df =df.withColumn('key2', 'ABC')
df1= df.groupBy('key1', 'key2', 
'key3').agg(func.count(func.col('val')).alias('total'))
df1 = df1.columnRenamed('key2', 'key2')
df3 =df.join(df1, ['key1', 'key2', 'key3'])\
.withcolumn('newcol', func.col('val')/func.col('total'))


I just wanted to see if any one else observed this behavior, I will try to find 
the code sample to proof this issue. If not in another 1-2 days i will mark it 
not reproducible.  



> Column names Corrupted in pysaprk dataframe groupBy
> ---
>
> Key: SPARK-17908
> URL: https://issues.apache.org/jira/browse/SPARK-17908
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Harish
>Priority: Minor
>
> I have a DataFrame, say df:
> df1 = df.groupBy('key1', 'key2', 
> 'key3').agg(func.count(func.col('val')).alias('total'))
> df3 = df.join(df1, ['key1', 'key2', 'key3'])\
>  .withColumn('newcol', func.col('val')/func.col('total'))
> I get an error saying key2 is not present in df1, which is not true because 
> df1.show() does show the data with key2.
> Then I added this code before the join -- df1 = df1.withColumnRenamed('key2', 
> 'key2') -- i.e. renamed the column to the same name. Then it works.
> The stack trace says a column is missing, but it is not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572541#comment-15572541
 ] 

Felix Cheung commented on SPARK-17904:
--

I somewhat disagree, actually. In R, it is very common to use package 
management like
https://rstudio.github.io/packrat/

that is not very different from how Python does it.

In fact, Anaconda works with R too:
https://www.continuum.io/blog/developer-blog/anaconda-r-users-sparkr-and-rbokeh


> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run a UDF on a SparkDataFrame.
> If users use third-party libraries inside the function passed into 
> {{spark.lapply}} or {{dapply}}, they need to install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that it succeeded, 
> we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> function(x) install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may call lots 
> of libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API that wraps the work mentioned above, so users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install packages on 
> each executor, I think it will not pollute the native R environment. I'd like 
> to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2016-10-13 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572545#comment-15572545
 ] 

Gayathri Murali commented on SPARK-12664:
-

[~yanboliang] I am not working on this. Please feel free to take it

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572541#comment-15572541
 ] 

Felix Cheung edited comment on SPARK-17904 at 10/13/16 5:09 PM:


I somewhat disagree, actually. In R, it is very common to use package 
management like
https://rstudio.github.io/packrat/

that is not very different from how Python does it.

In fact, Anaconda works with R too:
https://www.continuum.io/blog/developer-blog/anaconda-r-users-sparkr-and-rbokeh

To me, I think we have questions on whether Spark should get into the business 
of package management as [~srowen] has pointed out, or not.

Also there are challenges with how Spark does not have access to all 
nodes/executors (because it is not a cluster manager) and issues with dynamic 
resource allocations and so on as others have pointed out.



was (Author: felixcheung):
I somewhat disagree, actually. In R, it is very common to use package 
management like
https://rstudio.github.io/packrat/

that is not very different from how Python does it.

In fact, Anaconda works with R too:
https://www.continuum.io/blog/developer-blog/anaconda-r-users-sparkr-and-rbokeh


> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run a UDF on a SparkDataFrame.
> If users use third-party libraries inside the function passed into 
> {{spark.lapply}} or {{dapply}}, they need to install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that it succeeded, 
> we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> function(x) install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may call lots 
> of libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API that wraps the work mentioned above, so users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install packages on 
> each executor, I think it will not pollute the native R environment. I'd like 
> to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572541#comment-15572541
 ] 

Felix Cheung edited comment on SPARK-17904 at 10/13/16 5:15 PM:


I somewhat disagree, actually. In R, it is very common to use package 
management like
https://rstudio.github.io/packrat/
that is not very different from how Python does it.

In fact, Anaconda works with R too:
https://www.continuum.io/conda-for-r
https://www.continuum.io/blog/developer-blog/anaconda-r-users-sparkr-and-rbokeh

Other examples:
https://msdn.microsoft.com/en-us/microsoft-r/deployr-admin-r-package-management
http://blog.revolutionanalytics.com/2014/10/introducing-rrt.html

To me, the open question is whether Spark should get into the business of 
package management at all, as [~srowen] has pointed out.

There are also challenges: Spark does not have access to all nodes/executors 
(because it is not a cluster manager), and there are issues with dynamic 
resource allocation and so on, as others have pointed out.



was (Author: felixcheung):
I somewhat disagree, actually. In R, it is very common to use package 
management like
https://rstudio.github.io/packrat/

that is not very different from how Python does it.

In fact, Anaconda works with R too:
https://www.continuum.io/blog/developer-blog/anaconda-r-users-sparkr-and-rbokeh

To me, I think we have questions on whether Spark should get into the business 
of package management as [~srowen] has pointed out, or not.

Also there are challenges with how Spark does not have access to all 
nodes/executors (because it is not a cluster manager) and issues with dynamic 
resource allocations and so on as others have pointed out.


> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run a UDF on a SparkDataFrame.
> If users use third-party libraries inside the function passed into 
> {{spark.lapply}} or {{dapply}}, they need to install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that it succeeded, 
> we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> function(x) install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may call lots 
> of libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API that wraps the work mentioned above, so users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install packages on 
> each executor, I think it will not pollute the native R environment. I'd like 
> to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17714) ClassCircularityError is thrown when using org.apache.spark.util.Utils.classForName 

2016-10-13 Thread Weiqing Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572580#comment-15572580
 ] 

Weiqing Yang commented on SPARK-17714:
--

Not yet; I need to investigate more. Could we pull in people more familiar with 
the REPL classloader code? Thanks.

> ClassCircularityError is thrown when using 
> org.apache.spark.util.Utils.classForName 
> 
>
> Key: SPARK-17714
> URL: https://issues.apache.org/jira/browse/SPARK-17714
> Project: Spark
>  Issue Type: Bug
>Reporter: Weiqing Yang
>
> This JIRA is a follow-up to 
> [SPARK-15857|https://issues.apache.org/jira/browse/SPARK-15857].
> Task invokes CallerContext.setCurrentContext() to set its callerContext to 
> HDFS. In setCurrentContext(), it tries looking for the class 
> {{org.apache.hadoop.ipc.CallerContext}} by using 
> {{org.apache.spark.util.Utils.classForName}}. This causes 
> ClassCircularityError to be thrown when running ReplSuite in master Maven 
> builds (The same tests pass in the SBT build). A hotfix 
> [SPARK-17710|https://issues.apache.org/jira/browse/SPARK-17710] has been made 
> by using Class.forName instead, but it needs further investigation.
> Error:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.3/2000/testReport/junit/org.apache.spark.repl/ReplSuite/simple_foreach_with_accumulator/
> {code}
> scala> accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, 
> name: None, value: 0)
> scala> org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, localhost): java.lang.ClassCircularityError: 
> io/netty/util/internal/_matchers_/org/apache/spark/network/protocol/MessageMatcher
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:62)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:54)
> at 
> io.netty.util.internal.TypeParameterMatcher.get(TypeParameterMatcher.java:42)
> at 
> io.netty.util.internal.TypeParameterMatcher.find(TypeParameterMatcher.java:78)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.<init>(MessageToMessageEncoder.java:59)
> at 
> org.apache.spark.network.protocol.MessageEncoder.<init>(MessageEncoder.java:34)
> at org.apache.spark.network.TransportContext.<init>(TransportContext.java:78)
> at 
> org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:354)
> at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
> at 
> org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:80)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:62)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:54)
> at 
> io.netty.util.internal.TypeParameterMatcher.get(TypeParameterMatcher.java:42)
> at 
> io.netty.util.internal.TypeParameterMatcher.find(TypeParameterMatcher.java:78)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.<init>(MessageToMessageEncoder.java:59)
> at 
> org.apache.spark.network.protocol.MessageEncoder.<init>(MessageEncoder.java:34)
> at org.apache.spark.network.TransportContext.<init>(TransportContext.java:78)
> at 
> org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:354)
> at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
> at 
> org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
> at 
> org.apa

[jira] [Created] (SPARK-17913) Filter/join expressions can return incorrect results when comparing strings to longs

2016-10-13 Thread Ming Beckwith (JIRA)
Ming Beckwith created SPARK-17913:
-

 Summary: Filter/join expressions can return incorrect results when 
comparing strings to longs
 Key: SPARK-17913
 URL: https://issues.apache.org/jira/browse/SPARK-17913
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 1.6.2
Reporter: Ming Beckwith


Reproducer:

{code}
  case class E(subject: Long, predicate: String, objectNode: String)

  def test(sc: SparkContext) = {
val sqlContext: SQLContext = new SQLContext(sc)
import sqlContext.implicits._

val broken = List(
  (19157170390056969L, "right", 19157170390056969L),
  (19157170390056973L, "wrong", 19157170390056971L),
  (19157190254313477L, "wrong", 19157190254313475L),
  (19157180859056133L, "wrong", 19157180859056131L),
  (19157170390056969L, "number", 161),
  (19157170390056971L, "string", "a string"),
  (19157190254313475L, "string", "another string"),
  (19157180859056131L, "number", 191)
)

val brokenDF = sc.parallelize(broken).map(b => E(b._1, b._2, 
b._3.toString)).toDF()
val brokenFilter = brokenDF.filter($"subject" === $"objectNode")
val fixed = brokenDF.filter(brokenDF("subject").cast("string") === 
brokenDF("objectNode"))

println("* incorrect filter results *")
println(brokenFilter.show())
println("* correct filter results *")
println(fixed.show())

println("* both sides cast to double *")
println(brokenFilter.explain())
  }

Broken filter returns:

+-+-+-+
|  subject|predicate|   objectNode|
+-+-+-+
|19157170390056969|right|19157170390056969|
|19157170390056973|wrong|19157170390056971|
|19157190254313477|wrong|19157190254313475|
|19157180859056133|wrong|19157180859056131|
+-+-+-+
{code}

The physical plan shows both sides of the expression are cast to Double 
before evaluation. So while comparing a number to a numeric string appears to 
work in many cases, when the numbers are sufficiently large and close together 
there is enough loss of precision to cause incorrect results. 

{code}
== Physical Plan ==
Filter (cast(subject#0L as double) = cast(objectNode#2 as double))

After casting the left side into strings, the filter returns the expected 
result:

+-+-+-+
|  subject|predicate|   objectNode|
+-+-+-+
|19157170390056969|right|19157170390056969|
+-+-+-+
{code}

Expected behavior in this case is probably to choose one side and cast the 
other (compare string to string or long to long) instead of using a data type 
with less precision. 
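
The collision is easy to confirm directly (values taken from the reproducer 
above): at this magnitude a Double has too few mantissa bits to tell the two 
longs apart, which is why the "wrong" rows match.

{code}
// Both longs from a "wrong" row collapse to the same Double value,
// since a Double carries only 53 bits of mantissa (~15-16 decimal digits).
val collides = 19157170390056973L.toDouble == 19157170390056971L.toDouble  // true
{code}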




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572602#comment-15572602
 ] 

Felix Cheung commented on SPARK-17895:
--

would you like to fix this?

> Improve documentation of "rowsBetween" and "rangeBetween"
> -
>
> Key: SPARK-17895
> URL: https://issues.apache.org/jira/browse/SPARK-17895
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SparkR, SQL
>Reporter: Weiluo Ren
>Priority: Minor
>
> This is an issue found by [~junyangq] when he was fixing SparkR docs.
> In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]).
>  However, the description of "rangeBetween" does not clearly differentiate it 
> from "rowsBetween". Even though in 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109]
>  we have a pretty nice description of "RangeFrame" and "RowFrame", which are 
> used by "rangeBetween" and "rowsBetween", I cannot find them in the online 
> Spark Scala API. 
> We could add small examples to the description of "rangeBetween" and 
> "rowsBetween" like
> {code}
> val df = Seq(1,1,2).toDF("id")
> df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  4|
>  * |  1|  4|
>  * |  2|  2|
>  * +---+---+
> */
> df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  2|
>  * |  1|  3|
>  * |  2|  2|
>  * +---+---+
> */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572606#comment-15572606
 ] 

Felix Cheung commented on SPARK-17904:
--

For reference these are the related PRs for Python for package management that 
have not seen traction:
https://github.com/apache/spark/pull/13599
https://github.com/apache/spark/pull/14180


> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run a UDF on a SparkDataFrame.
> If users use third-party libraries inside the function passed into 
> {{spark.lapply}} or {{dapply}}, they need to install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that it succeeded, 
> we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> function(x) install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may call lots 
> of libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API that wraps the work mentioned above, so users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install packages on 
> each executor, I think it will not pollute the native R environment. I'd like 
> to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Oksana Romankova (JIRA)
Oksana Romankova created SPARK-17914:


 Summary: Spark SQL casting to TimestampType with nanosecond 
results in incorrect timestamp
 Key: SPARK-17914
 URL: https://issues.apache.org/jira/browse/SPARK-17914
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: Oksana Romankova


In some cases when timestamps contain nanoseconds they will be parsed 
incorrectly. 

Examples: 

"2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
"2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"

The issue seems to be happening in DateTimeUtils.stringToTimestamp(), which 
assumes that at most a 6-digit fraction of a second will be passed.

With this being the case, I would suggest either discarding the extra nanosecond 
digits automatically, or throwing an exception prompting users to pre-format 
timestamps to microsecond precision before casting to Timestamp.
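
Until then, one possible workaround (a sketch only; the DataFrame {{df}} and the 
string column {{ts}} are placeholders, and {{regexp_replace}} is the standard 
Spark SQL function) is to trim the fractional part to microsecond precision 
before casting:

{code}
import org.apache.spark.sql.functions.{col, regexp_replace}

// Keep at most six fractional-second digits, then cast the string to a timestamp.
val fixed = df.withColumn("ts_micros",
  regexp_replace(col("ts"), "(\\.\\d{6})\\d+", "$1").cast("timestamp"))
{code}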



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572691#comment-15572691
 ] 

Apache Spark commented on SPARK-17912:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/15467

> Refactor code generation to get data for ColumnVector/ColumnarBatch
> ---
>
> Key: SPARK-17912
> URL: https://issues.apache.org/jira/browse/SPARK-17912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>
> Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is 
> becoming pervasive. The code generation part can be reused by multiple 
> components (e.g. parquet reader, data cache, and so on).
> This JIRA refactors the code generation part as a trait for ease of reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors

2016-10-13 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572689#comment-15572689
 ] 

Hossein Falaki commented on SPARK-17902:


Thanks for the pointer [~shivaram]. I will submit a patch with a regression 
test. The only obvious side effect of this bug is that the collected type will be 
String, while it should have been a Factor. What makes it bad is that it is in 
our documentation and it used to work, so it is a regression.

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems it is completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17912:


Assignee: (was: Apache Spark)

> Refactor code generation to get data for ColumnVector/ColumnarBatch
> ---
>
> Key: SPARK-17912
> URL: https://issues.apache.org/jira/browse/SPARK-17912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>
> Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is 
> becoming pervasive. The code generation part can be reused by multiple 
> components (e.g. parquet reader, data cache, and so on).
> This JIRA refactors the code generation part as a trait for ease of reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17912:


Assignee: Apache Spark

> Refactor code generation to get data for ColumnVector/ColumnarBatch
> ---
>
> Key: SPARK-17912
> URL: https://issues.apache.org/jira/browse/SPARK-17912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is 
> becoming pervasive. The code generation part can be reused by multiple 
> components (e.g. parquet reader, data cache, and so on).
> This JIRA refactors the code generation part as a trait for ease of reuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17882) RBackendHandler swallowing errors

2016-10-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-17882.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

Resolved by https://github.com/apache/spark/pull/15375

> RBackendHandler swallowing errors
> -
>
> Key: SPARK-17882
> URL: https://issues.apache.org/jira/browse/SPARK-17882
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: James Shuster
>Assignee: James Shuster
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> RBackendHandler is swallowing general exceptions in handleMethodCall which 
> makes it impossible to debug certain issues that happen when doing an 
> invokeJava call.
> In my case this was the following error
> java.lang.IllegalAccessException: Class 
> org.apache.spark.api.r.RBackendHandler can not access a member of class with 
> modifiers "public final"
> The getCause message that is written back was basically blank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17882) RBackendHandler swallowing errors

2016-10-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-17882:
--
Assignee: James Shuster

> RBackendHandler swallowing errors
> -
>
> Key: SPARK-17882
> URL: https://issues.apache.org/jira/browse/SPARK-17882
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: James Shuster
>Assignee: James Shuster
>Priority: Minor
>
> RBackendHandler is swallowing general exceptions in handleMethodCall which 
> makes it impossible to debug certain issues that happen when doing an 
> invokeJava call.
> In my case this was the following error
> java.lang.IllegalAccessException: Class 
> org.apache.spark.api.r.RBackendHandler can not access a member of class with 
> modifiers "public final"
> The getCause message that is written back was basically blank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17915) Prepare ColumnVector implementation for UnsafeData

2016-10-13 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-17915:


 Summary: Prepare ColumnVector implementation for UnsafeData
 Key: SPARK-17915
 URL: https://issues.apache.org/jira/browse/SPARK-17915
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
Reporter: Kazuaki Ishizaki


The current implementations of {{ColumnVector}} are {{OnHeapColumnVector}} and 
{{OffHeapColumnVector}}, which are optimized for reading data from Parquet. 
If they get an array, a map, or a struct from an {{Unsafe}}-related data 
structure, it is inefficient.
This JIRA prepares a new implementation, {{OnHeapUnsafeColumnVector}}, that is 
optimized for reading data from an {{Unsafe}}-related data structure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"

2016-10-13 Thread Weiluo Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572721#comment-15572721
 ] 

Weiluo Ren commented on SPARK-17895:


Sure. Just want to collect some comments on the example to be added to the 
description here. Or I can first create a PR and get comments when people 
review it.

> Improve documentation of "rowsBetween" and "rangeBetween"
> -
>
> Key: SPARK-17895
> URL: https://issues.apache.org/jira/browse/SPARK-17895
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SparkR, SQL
>Reporter: Weiluo Ren
>Priority: Minor
>
> This is an issue found by [~junyangq] when he was fixing SparkR docs.
> In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]).
>  However, the description of "rangeBetween" does not clearly differentiate it 
> from "rowsBetween". Even though in 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109]
> we have a pretty nice description of "RangeFrame" and "RowFrame", which are 
> used by "rangeBetween" and "rowsBetween", I cannot find them in the online 
> Spark Scala API. 
> We could add small examples to the descriptions of "rangeBetween" and 
> "rowsBetween", like:
> {code}
> val df = Seq(1,1,2).toDF("id")
> df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  4|
>  * |  1|  4|
>  * |  2|  2|
>  * +---+---+
> */
> df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  2|
>  * |  1|  3|
>  * |  2|  2|
>  * +---+---+
> */
> {code}
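For readers puzzling over why the two sums differ: {{rangeBetween}} defines the frame on the ORDER BY value (every row whose id lies within the offset range of the current row's id is included, peers and all), while {{rowsBetween}} defines it on physical row positions. A self-contained sketch of the same example, assuming a local SparkSession named {{spark}}:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local").appName("window-frames").getOrCreate()
import spark.implicits._

val df = Seq(1, 1, 2).toDF("id")

// RANGE frame: for a row with id = 1 the frame is every row whose id falls in
// [1, 1 + 1], i.e. both 1s and the 2, hence sum = 4 for the first two rows.
df.withColumn("sum", sum($"id").over(Window.orderBy($"id").rangeBetween(0, 1))).show()

// ROW frame: the frame is the current row plus at most one following row,
// regardless of the values involved, hence sums 2, 3 and 2.
df.withColumn("sum", sum($"id").over(Window.orderBy($"id").rowsBetween(0, 1))).show()
{code}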






[jira] [Commented] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572730#comment-15572730
 ] 

Sean Owen commented on SPARK-17914:
---

I think this is a duplicate of one of a couple possible issues, like 
https://issues.apache.org/jira/browse/SPARK-14428

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>
> In some cases, when timestamps contain nanoseconds, they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> Given this, I would suggest either discarding nanoseconds automatically, or 
> throwing an exception prompting users to pre-format timestamps to microsecond 
> precision before casting to Timestamp.
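As a stopgap, one way to pre-format such strings to microsecond precision before the cast (an illustrative sketch only, not the fix proposed here; the column name and local session are assumptions):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

val spark = SparkSession.builder().master("local").appName("nanos").getOrCreate()
import spark.implicits._

val df = Seq("2016-05-14T15:12:14.0034567Z").toDF("raw")

// Truncate the fractional seconds to at most 6 digits, so the cast sees
// "...15:12:14.003456Z" instead of silently mis-parsing the 7-digit fraction.
val fixed = df.withColumn("ts",
  regexp_replace($"raw", "(\\.\\d{6})\\d+", "$1").cast("timestamp"))
fixed.show(false)
{code}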






[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572741#comment-15572741
 ] 

Shivaram Venkataraman commented on SPARK-17904:
---

Thanks all - this is a good discussion about the role of executors and package 
management. The way I was thinking about this is that it could be similar to 
the Maven coordinates that we support in, say, `spark-shell`, and how Spark 
ensures those JAR files are loaded before execution.

On that note, would it be better if we took this in as an argument to 
sparkR.session(), as opposed to having a function that can be called at any 
point in the middle of the session?

> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run UDFs on a SparkDataFrame.
> If users use third-party libraries inside the function that is passed into 
> {{spark.lapply}} or {{dapply}}, they should install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that the install 
> succeeded, we can run code like the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L),
>                                 function(part) { install.packages("Matrix") })
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet every time you need a third-party 
> library. Since SparkR is an interactive analytics tool, users may load many 
> libraries during an analytics session; in native R, users can simply run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API to wrap the work mentioned above, so that users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR will install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install packages on 
> each executor, I think it will not pollute the native R environment. I'd 
> like to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.






[jira] [Assigned] (SPARK-17915) Prepare ColumnVector implementation for UnsafeData

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17915:


Assignee: (was: Apache Spark)

> Prepare ColumnVector implementation for UnsafeData
> --
>
> Key: SPARK-17915
> URL: https://issues.apache.org/jira/browse/SPARK-17915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>
> Current implementations of {{ColumnarVector}} are {{OnHeapColumnarVector}} 
> and {{OffHeapColumnarVector}}, which are optimized for reading data from 
> Parquet. If they get an array, a map, or a struct from an {{Unsafe}}-related 
> data structure, it is inefficient.
> This JIRA prepares a new implementation, {{OnHeapUnsafeColumnarVector}}, that 
> is optimized for reading data from an {{Unsafe}}-related data structure.






[jira] [Assigned] (SPARK-17915) Prepare ColumnVector implementation for UnsafeData

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17915:


Assignee: Apache Spark

> Prepare ColumnVector implementation for UnsafeData
> --
>
> Key: SPARK-17915
> URL: https://issues.apache.org/jira/browse/SPARK-17915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> Current implementations of {{ColumnarVector}} are {{OnHeapColumnarVector}} 
> and {{OffHeapColumnarVector}}, which are optimized for reading data from 
> Parquet. If they get an array, a map, or a struct from an {{Unsafe}}-related 
> data structure, it is inefficient.
> This JIRA prepares a new implementation, {{OnHeapUnsafeColumnarVector}}, that 
> is optimized for reading data from an {{Unsafe}}-related data structure.






[jira] [Commented] (SPARK-17915) Prepare ColumnVector implementation for UnsafeData

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572747#comment-15572747
 ] 

Apache Spark commented on SPARK-17915:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/15468

> Prepare ColumnVector implementation for UnsafeData
> --
>
> Key: SPARK-17915
> URL: https://issues.apache.org/jira/browse/SPARK-17915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>
> Current implementations of {{ColumnarVector}} are {{OnHeapColumnarVector}} 
> and {{OffHeapColumnarVector}}, which are optimized for reading data from 
> Parquet. If they get an array, a map, or a struct from an {{Unsafe}}-related 
> data structure, it is inefficient.
> This JIRA prepares a new implementation, {{OnHeapUnsafeColumnarVector}}, that 
> is optimized for reading data from an {{Unsafe}}-related data structure.






[jira] [Commented] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"

2016-10-13 Thread Weiluo Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572726#comment-15572726
 ] 

Weiluo Ren commented on SPARK-17895:


[~junyangq] Could you please help fix the SparkR doc accordingly?

> Improve documentation of "rowsBetween" and "rangeBetween"
> -
>
> Key: SPARK-17895
> URL: https://issues.apache.org/jira/browse/SPARK-17895
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SparkR, SQL
>Reporter: Weiluo Ren
>Priority: Minor
>
> This is an issue found by [~junyangq] when he was fixing SparkR docs.
> In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]).
>  However, the description of "rangeBetween" does not clearly differentiate it 
> from "rowsBetween". Even though in 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109]
> we have a pretty nice description of "RangeFrame" and "RowFrame", which are 
> used by "rangeBetween" and "rowsBetween", I cannot find them in the online 
> Spark Scala API. 
> We could add small examples to the descriptions of "rangeBetween" and 
> "rowsBetween", like:
> {code}
> val df = Seq(1,1,2).toDF("id")
> df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  4|
>  * |  1|  4|
>  * |  2|  2|
>  * +---+---+
> */
> df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  2|
>  * |  1|  3|
>  * |  2|  2|
>  * +---+---+
> */
> {code}






[jira] [Commented] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Oksana Romankova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572758#comment-15572758
 ] 

Oksana Romankova commented on SPARK-17914:
--

You are correct. It is related to what has been proposed in SPARK-14428. 
However, the current behavior is defective. If SPARK-14428 is not going to be 
accepted, then at the very least this defect deserves to be addressed. 

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>
> In some cases, when timestamps contain nanoseconds, they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> Given this, I would suggest either discarding nanoseconds automatically, or 
> throwing an exception prompting users to pre-format timestamps to microsecond 
> precision before casting to Timestamp.






[jira] [Created] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-13 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17916:
--

 Summary: CSV data source treats empty string as null no matter 
what nullValue option is
 Key: SPARK-17916
 URL: https://issues.apache.org/jira/browse/SPARK-17916
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Hossein Falaki


When a user configures {{nullValue}} in the CSV data source, in addition to 
that value, all empty string values are also converted to null.

{code}
data:
col1,col2
1,"-"
2,""
{code}

{code}
spark.read.format("csv").option("nullValue", "-")
{code}

We will find a null in both rows.
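A minimal end-to-end reproduction sketch (the file path, the header option and the local session are assumptions added for illustration, not part of the report):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("csv-null").getOrCreate()

// Assume /tmp/data.csv holds the header plus the two rows shown above.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValue", "-")   // only "-" is supposed to become null...
  .load("/tmp/data.csv")

// ...yet the empty string in the second row also comes back as null,
// which is the behaviour this issue reports.
df.show()
{code}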






[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572759#comment-15572759
 ] 

Alessio commented on SPARK-15565:
-

Same problem happened again in Spark 2.0.1.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508
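For illustration, a hedged sketch of what setting an explicit local-filesystem scheme means from the user side (the path is an arbitrary example, not the proposed default): passing a {{file:}} URI keeps the warehouse from being resolved against {{fs.defaultFS}}, for example an HDFS namenode.

{code}
import org.apache.spark.sql.SparkSession

// Pin the warehouse to an explicit file: URI; without a scheme, the default
// can end up resolved against HDFS when fs.defaultFS points at a namenode.
val spark = SparkSession.builder()
  .master("local")
  .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")
  .getOrCreate()
{code}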






[jira] [Resolved] (SPARK-17827) StatisticsColumnSuite failures on big endian platforms

2016-10-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17827.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> StatisticsColumnSuite failures on big endian platforms
> --
>
> Key: SPARK-17827
> URL: https://issues.apache.org/jira/browse/SPARK-17827
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: big endian
>Reporter: Pete Robbins
>Assignee: Pete Robbins
>  Labels: big-endian
> Fix For: 2.1.0
>
>
> https://issues.apache.org/jira/browse/SPARK-17073
> introduces new tests/functions that fail on big-endian platforms.
> Failing tests:
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> string column
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> binary column
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> columns with different types
>  org.apache.spark.sql.hive.StatisticsSuite.generate column-level statistics 
> and load them from hive metastore
> All fail in checkColStat, e.g.: 
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:156)
>   at 
> org.apache.spark.sql.StatisticsTest$.checkColStat(StatisticsTest.scala:92)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1$$anonfun$apply$mcV$sp$1.apply(StatisticsTest.scala:43)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1$$anonfun$apply$mcV$sp$1.apply(StatisticsTest.scala:40)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1.apply$mcV$sp(StatisticsTest.scala:40)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:168)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite.withTable(StatisticsColumnSuite.scala:30)
>   at 
> org.apache.spark.sql.StatisticsTest$class.checkColStats(StatisticsTest.scala:33)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite.checkColStats(StatisticsColumnSuite.scala:30)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply$mcV$sp(StatisticsColumnSuite.scala:171)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply(StatisticsColumnSuite.scala:160)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply(StatisticsColumnSuite.scala:160)






[jira] [Updated] (SPARK-17827) StatisticsColumnSuite failures on big endian platforms

2016-10-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17827:
--
Assignee: Pete Robbins

> StatisticsColumnSuite failures on big endian platforms
> --
>
> Key: SPARK-17827
> URL: https://issues.apache.org/jira/browse/SPARK-17827
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: big endian
>Reporter: Pete Robbins
>Assignee: Pete Robbins
>  Labels: big-endian
> Fix For: 2.1.0
>
>
> https://issues.apache.org/jira/browse/SPARK-17073
> introduces new tests/functions that fail on big-endian platforms.
> Failing tests:
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> string column
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> binary column
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> columns with different types
>  org.apache.spark.sql.hive.StatisticsSuite.generate column-level statistics 
> and load them from hive metastore
> All fail in checkColStat, e.g.: 
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:156)
>   at 
> org.apache.spark.sql.StatisticsTest$.checkColStat(StatisticsTest.scala:92)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1$$anonfun$apply$mcV$sp$1.apply(StatisticsTest.scala:43)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1$$anonfun$apply$mcV$sp$1.apply(StatisticsTest.scala:40)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1.apply$mcV$sp(StatisticsTest.scala:40)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:168)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite.withTable(StatisticsColumnSuite.scala:30)
>   at 
> org.apache.spark.sql.StatisticsTest$class.checkColStats(StatisticsTest.scala:33)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite.checkColStats(StatisticsColumnSuite.scala:30)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply$mcV$sp(StatisticsColumnSuite.scala:171)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply(StatisticsColumnSuite.scala:160)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite$$anonfun$7.apply(StatisticsColumnSuite.scala:160)






[jira] [Created] (SPARK-17917) Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event

2016-10-13 Thread Mario Briggs (JIRA)
Mario Briggs created SPARK-17917:


 Summary: Convert 'Initial job has not accepted any resources..' 
logWarning to a SparkListener event
 Key: SPARK-17917
 URL: https://issues.apache.org/jira/browse/SPARK-17917
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Mario Briggs


When supporting Spark on a large, multi-tenant shared cluster with per-tenant 
quotas, a submitted taskSet often might not get executors because quotas have 
been exhausted or resources are unavailable. In these situations, firing a 
SparkListener event instead of just logging the issue (as is done currently at 
https://github.com/apache/spark/blob/9216901d52c9c763bfb908013587dcf5e781f15b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L192)
 would give applications/listeners an opportunity to handle this more 
appropriately as needed.






[jira] [Commented] (SPARK-17917) Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event

2016-10-13 Thread Mario Briggs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572795#comment-15572795
 ] 

Mario Briggs commented on SPARK-17917:
--

I would appreciate it if the Spark devs could comment on whether they see this 
as a bad idea for some reason. 

I basically see adding 2 events to SparkListener, like
  onTaskStarved() and onTaskUnStarved() - the latter fires only if 
onTaskStarved() fired in the first place for a taskSet.
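A rough sketch of how such hooks might look, using the names floated above; these events and methods are hypothetical and do not exist in Spark's SparkListener API today:

{code}
import org.apache.spark.scheduler.SparkListener

// Hypothetical event payloads, mirroring the proposal above.
case class SparkListenerTaskSetStarved(stageId: Int, stageAttemptId: Int)
case class SparkListenerTaskSetUnstarved(stageId: Int, stageAttemptId: Int)

// An application-supplied listener could then react programmatically instead
// of grepping driver logs for "Initial job has not accepted any resources".
class StarvationListener extends SparkListener {
  def onTaskStarved(event: SparkListenerTaskSetStarved): Unit =
    println(s"TaskSet for stage ${event.stageId} is starved for resources")

  def onTaskUnStarved(event: SparkListenerTaskSetUnstarved): Unit =
    println(s"TaskSet for stage ${event.stageId} finally got executors")
}
{code}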

> Convert 'Initial job has not accepted any resources..' logWarning to a 
> SparkListener event
> --
>
> Key: SPARK-17917
> URL: https://issues.apache.org/jira/browse/SPARK-17917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Mario Briggs
>
> When supporting Spark on a multi-tenant shared large cluster with quotas per 
> tenant, often a submitted taskSet might not get executors because quotas have 
> been exhausted (or) resources unavailable. In these situations, firing a 
> SparkListener event instead of just logging the issue (as done currently at 
> https://github.com/apache/spark/blob/9216901d52c9c763bfb908013587dcf5e781f15b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L192),
>  would give applications/listeners an opportunity to handle this more 
> appropriately as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17918) Default Warehause location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)
Alessio created SPARK-17918:
---

 Summary: Default Warehause location apparently in HDFS 
 Key: SPARK-17918
 URL: https://issues.apache.org/jira/browse/SPARK-17918
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1
 Environment: Macintosh
Reporter: Alessio


It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Updated] (SPARK-17918) Default Warehause location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

`16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.`

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehause location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Updated] (SPARK-17918) Default Warehause location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

`16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.`

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehause location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1.
> `16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.`
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Summary: Default Warehouse location apparently in HDFS   (was: Default 
Warehause location apparently in HDFS )

> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Created] (SPARK-17919) Make timeout to RBackend configurable in SparkR

2016-10-13 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17919:
--

 Summary: Make timeout to RBackend configurable in SparkR
 Key: SPARK-17919
 URL: https://issues.apache.org/jira/browse/SPARK-17919
 Project: Spark
  Issue Type: Story
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Hossein Falaki


I am working on a project where {{gapply()}} is being used with a large dataset 
that happens to be extremely skewed. On the skewed partition, the user 
function takes more than 2 hours to return, which turns out to be longer than 
the timeout that we hardcode in SparkR for the backend connection.

{code}
connectBackend <- function(hostname, port, timeout = 6000) 
{code}

Ideally the user should be able to reconfigure Spark and increase the timeout. 
It should be a small fix.






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Environment: Mac OS X 10.11.6  (was: Macintosh)

> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*



  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse*




> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse*



  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse*






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file://spark-warehouse'.*
{color}

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file://spark-warehouse'.*



> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/ FS folder>/spark-warehouse'.*
> {color}






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file://spark-warehouse'.*


  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 
'file:/Users/Purple/Documents/YARNprojects/Spark_K-MEANS/version_postgreSQL/spark-warehouse'.*



> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> Update #1:
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/ FS folder>/spark-warehouse'.*






[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 
'file:/Users/Purple/Documents/YARNprojects/Spark_K-MEANS/version_postgreSQL/spark-warehouse'.*


  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see first 
INFO - but also such folder is then appended to an HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, even 
because I'm not using Spark though HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*




> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at an inexistent folder in Macintosh systems (/user/hive/warehouse)  - see 
> first INFO - but also such folder is then appended to an HDFS - see the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, even 
> because I'm not using Spark though HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> Update #1:
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 
> 'file:/Users/Purple/Documents/YARNprojects/Spark_K-MEANS/version_postgreSQL/spark-warehouse'.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is also resolved against an HDFS path - 
see the error.

This was fixed in 2.0.0, as previous issues reported, but it appears again in 
2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but only locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:{color}
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*

{color:red}Update #2:{color}
In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
Everything's default.

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is also resolved against an HDFS path - 
see the error.

This was fixed in 2.0.0, as previous issues reported, but it appears again in 
2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but only locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
{color}


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is also resolved against an HDFS path - 
> see the error.
> This was fixed in 2.0.0, as previous issues reported, but it appears again in 
> 2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but only locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:{color}
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
> {color:red}Update #2:{color}
> In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
> Everything's default.
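
For anyone hitting this on a purely local setup, a minimal Scala sketch of a 
possible workaround (illustrative only, not part of this report; the /tmp path 
is just an example) is to make the warehouse location an explicit 
local-filesystem URI:

{code}
// Illustrative workaround sketch (assumption: local-only Spark 2.0.x, no HDFS).
// An explicit file: URI avoids resolving the warehouse path against
// hdfs://localhost:9000/user/hive/warehouse as described above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")
  .getOrCreate()
{code}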



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572885#comment-15572885
 ] 

Sean Owen commented on SPARK-17914:
---

Does the ISO 8601 format even support nanoseconds? I thought we had a 
discussion about it, but I don't recall the conclusion.

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>
> In some cases when timestamps contain nanoseconds they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> With this being the case, I would suggest either discarding nanoseconds 
> automatically or throwing an exception that prompts users to pre-format 
> timestamps to microsecond precision before casting to Timestamp.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17918.
---
Resolution: Duplicate

> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is also resolved against an HDFS path - 
> see the error.
> This was fixed in 2.0.0, as previous issues reported, but it appears again in 
> 2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but only locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:{color}
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
> {color:red}Update #2:{color}
> In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
> Everything's default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572922#comment-15572922
 ] 

Cody Koeninger commented on SPARK-17812:


Sorry, I didn't see this comment until just now.

Going back X offsets per partition is not a reasonable proxy for time when 
you're dealing with a stream that has multiple topics in it. I agree we should 
break that out and focus on defining starting offsets in this ticket.

The concern with startingOffsets naming is that, because auto.offset.reset is 
orthogonal to specifying some offsets, you have a situation like this:

.format("kafka")
.option("subscribePattern", "topic.*")
.option("startingOffset", "latest")
.option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 
}, "topicbar" : { "0": 1234, "1": 4567 }}""")

where startingOffsetForRealzYo stands in for a more reasonable name, one that 
conveys it is specifying starting offsets yet is not confusingly similar to 
startingOffset.

Trying to hack it all into one JSON as an alternative, with a "default" topic, 
means you're going to have to pick a key that isn't a valid topic, or add yet 
another layer of indirection. It also makes it harder to keep the format 
consistent with SPARK-17829 (which does seem like a good thing to keep 
consistent, I agree).
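
For concreteness, a hypothetical sketch of that single-JSON alternative (the 
"default" key and combined payload are invented here, not an actual option 
format), which is exactly the extra indirection described above:

{code}
// Hypothetical only: one startingOffsets JSON with a reserved "default" key.
// "default" would have to be a key that can never collide with a real topic name.
.option("subscribePattern", "topic.*")
.option("startingOffsets", """ { "default" : "latest", "topicfoo" : { "0": 1234, "1": 4567 }, "topicbar" : { "0": 1234, "1": 4567 }} """)
{code}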

Obviously I think you should just change the name, but it's your show.





> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16575) partition calculation mismatch with sc.binaryFiles

2016-10-13 Thread Tarun Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572950#comment-15572950
 ] 

Tarun Kumar commented on SPARK-16575:
-

[~rxin] I have now added support for openCostInBytes, similar to SQL (thanks 
for pointing that out). It now creates an optimized number of partitions. 
Could you please review and suggest changes? Once again, thanks for your 
suggestion; it worked like a charm!
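
For reference, a rough Scala sketch of the kind of estimate an open cost gives 
(hypothetical helper, not the actual patch; the 4 MB and 128 MB constants 
mirror the SQL file-source defaults):

{code}
// Rough sketch: pad every file by a per-file "open cost" so many small files
// still yield a sensible number of partitions instead of collapsing to 2.
def estimatePartitions(fileSizes: Seq[Long],
                       defaultParallelism: Int,
                       openCostInBytes: Long = 4L * 1024 * 1024): Int = {
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / math.max(1, defaultParallelism)
  val maxSplitBytes = math.min(128L * 1024 * 1024, math.max(openCostInBytes, bytesPerCore))
  math.max(1, math.ceil(totalBytes.toDouble / maxSplitBytes).toInt)
}
{code}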

> partition calculation mismatch with sc.binaryFiles
> --
>
> Key: SPARK-16575
> URL: https://issues.apache.org/jira/browse/SPARK-16575
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Suhas
>Priority: Critical
>
> sc.binaryFiles always creates an RDD with only 2 partitions.
> Steps to reproduce: (Tested this bug on databricks community edition)
> 1. Try to create an RDD using sc.binaryFiles. In this example, airlines 
> folder has 1922 files.
>  Ex: {noformat}val binaryRDD = 
> sc.binaryFiles("/databricks-datasets/airlines/*"){noformat}
> 2. check the number of partitions of the above RDD
> - binaryRDD.partitions.size = 2. (expected value is more than 2)
> 3. If the RDD is created using sc.textFile, then the number of partitions is 
> 1921.
> 4. The same sc.binaryFiles call creates 1921 partitions in Spark 1.5.1.
> For explanation with screenshot, please look at the link below,
> http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Oksana Romankova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572957#comment-15572957
 ] 

Oksana Romankova edited comment on SPARK-17914 at 10/13/16 7:42 PM:


Sean, I can't find any evidence of ISO 8601 not supporting nanoseconds. All it 
says is that it supports a fraction of a second, supplied after a comma or 
dot. Different parsing libraries that support ISO 8601 have different 
precision limitations. For instance, in Python, datetime.strptime() only 
supports precision down to microseconds and will throw an exception if 
nanoseconds are supplied in the input string. While that may not be ideal for 
those who need to retain nanosecond precision after parsing, it is acceptable 
behavior. Those who do not need nanosecond precision can catch the exception 
or, preemptively, truncate the input string. Spark SQL's 
DateTimeUtils.stringToTimestamp() doesn't throw and doesn't truncate properly, 
which results in an incorrect timestamp. In the example above, the acceptable 
truncation would be:

"2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.003456"
"2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.000345"



was (Author: oromank...@cardlytics.com):
Sean, I can't find any evidence of ISO 8601 not supporting nanoseconds. All it 
says is that it supports a fraction of a second, supplied after a comma or 
dot. Different parsing libraries that support ISO 8601 have different 
precision limitations. For instance, in Python, datetime.strptime() only 
supports precision down to microseconds and will throw an exception if 
nanoseconds are supplied in the input string. While that may not be ideal for 
those who need to retain nanosecond precision after parsing, it is acceptable 
behavior. Those who do not need nanosecond precision can catch the exception 
or, preemptively, truncate the input string. Spark SQL's 
DateTimeUtils.stringToTimestamp() doesn't throw and doesn't truncate properly, 
which results in an incorrect timestamp. In the example above, the acceptable 
truncation would be:

```
"2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.003456"
"2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.000345"
```

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>
> In some cases when timestamps contain nanoseconds they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> With this being the case, I would suggest either discarding nanoseconds 
> automatically or throwing an exception that prompts users to pre-format 
> timestamps to microsecond precision before casting to Timestamp.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2016-10-13 Thread Oksana Romankova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572957#comment-15572957
 ] 

Oksana Romankova commented on SPARK-17914:
--

Sean, I can't find any evidence of ISO 8601 not supporting nanoseconds. All it 
says is that it supports a fraction of a second, supplied after a comma or 
dot. Different parsing libraries that support ISO 8601 have different 
precision limitations. For instance, in Python, datetime.strptime() only 
supports precision down to microseconds and will throw an exception if 
nanoseconds are supplied in the input string. While that may not be ideal for 
those who need to retain nanosecond precision after parsing, it is acceptable 
behavior. Those who do not need nanosecond precision can catch the exception 
or, preemptively, truncate the input string. Spark SQL's 
DateTimeUtils.stringToTimestamp() doesn't throw and doesn't truncate properly, 
which results in an incorrect timestamp. In the example above, the acceptable 
truncation would be:

```
"2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.003456"
"2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.000345"
```
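
For illustration, a minimal Scala sketch of that truncate-to-microseconds 
behaviour (plain string handling, not Spark's DateTimeUtils; the helper name 
is invented):

{code}
// Keep at most 6 fractional-second digits so the value parses as microseconds.
def truncateToMicros(ts: String): String = {
  val FracRe = """(.*\.)(\d+)(Z?)$""".r
  ts match {
    case FracRe(prefix, frac, zone) => prefix + frac.take(6) + zone
    case other => other // no fractional part: leave unchanged
  }
}

// truncateToMicros("2016-05-14T15:12:14.0034567Z") == "2016-05-14T15:12:14.003456Z"
{code}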

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>
> In some cases when timestamps contain nanoseconds they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> With this being the case, I would suggest either discarding nanoseconds 
> automatically or throwing an exception that prompts users to pre-format 
> timestamps to microsecond precision before casting to Timestamp.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2016-10-13 Thread James Norvell (JIRA)
James Norvell created SPARK-17920:
-

 Summary: HiveWriterContainer passes null configuration to 
serde.initialize, causing NullPointerException in AvroSerde when using 
avro.schema.url
 Key: SPARK-17920
 URL: https://issues.apache.org/jira/browse/SPARK-17920
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 1.6.2
 Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
Reporter: James Norvell
Priority: Minor


When HiveWriterContainer initializes a serde it explicitly passes null for the 
Configuration:

https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161

When attempting to write to a table stored as Avro with avro.schema.url set, 
this causes a NullPointerException when it tries to get the FileSystem for the 
URL:

https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
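
As a rough illustration of why that fails (a simplified sketch, not the Hive 
code itself; readSchemaText is an invented helper): resolving avro.schema.url 
needs a FileSystem, and getting a FileSystem needs a non-null Configuration:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def readSchemaText(schemaUrl: String, conf: Configuration): String = {
  // FileSystem.get dereferences the Configuration, so conf == null blows up here,
  // which is essentially what AvroSerdeUtils.getSchemaFromFS runs into.
  val fs = FileSystem.get(new java.net.URI(schemaUrl), conf)
  val in = fs.open(new Path(schemaUrl))
  try scala.io.Source.fromInputStream(in).mkString
  finally in.close()
}
{code}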

Reproduction:
{noformat}
spark-sql> create external table avro_in (a string) stored as avro location 
'/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');

spark-sql> create external table avro_out (a string) stored as avro location 
'/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');

spark-sql> select * from avro_in;
hello
Time taken: 1.986 seconds, Fetched 1 row(s)

spark-sql> insert overwrite table avro_out select * from avro_in;

16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
Returning signal schema to indicate problem
java.lang.NullPointerException
at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
at 
org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.Dataset.(Dataset.scala:186)
at org.apache.spark.sql.Dataset.(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at

[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572961#comment-15572961
 ] 

Sean Owen commented on SPARK-15565:
---

Not quite... this was sort of undone by 
https://issues.apache.org/jira/browse/SPARK-15899 and that was the problem now 
addressed in https://issues.apache.org/jira/browse/SPARK-17810

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508
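
As a side illustration of the point (hypothetical helper, not Spark's actual 
SQLConf code): a scheme-less default is resolved against fs.defaultFS, which 
may be HDFS, while a "file:" prefix pins it to the local filesystem.

{code}
import java.io.File

// Sketch only: how an explicit local scheme differs from the bare user.dir default.
def defaultWarehousePath(explicitLocalScheme: Boolean): String = {
  val base = new File(System.getProperty("user.dir"), "spark-warehouse").getAbsolutePath
  if (explicitLocalScheme) "file:" + base else base
}
{code}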



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2016-10-13 Thread James Norvell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Norvell updated SPARK-17920:
--
Attachment: avro_data
avro.avsc

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Priority: Minor
> Attachments: avro.avsc, avro.avsc, avro_data, avro_data
>
>
> When HiveWriterContainer initializes a serde it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

[jira] [Updated] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2016-10-13 Thread James Norvell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Norvell updated SPARK-17920:
--
Attachment: avro_data
avro.avsc

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Priority: Minor
> Attachments: avro.avsc, avro.avsc, avro_data, avro_data
>
>
> When HiveWriterContainer initializes a serde it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

[jira] [Updated] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2016-10-13 Thread James Norvell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Norvell updated SPARK-17920:
--
Attachment: (was: avro.avsc)

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Priority: Minor
> Attachments: avro.avsc, avro_data
>
>
> When HiveWriterContainer initializes a serde it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.Delegating

[jira] [Updated] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2016-10-13 Thread James Norvell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Norvell updated SPARK-17920:
--
Attachment: (was: avro_data)

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Priority: Minor
> Attachments: avro.avsc, avro_data
>
>
> When HiveWriterContainer initializes a serde it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.Delegating

[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is also resolved against an HDFS path - 
see the error.

This was fixed in 2.0.0, as previous issues have reported, but it appears 
again in 2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such 
errors: Spark 2.0.0 used to create the spark-warehouse folder within the 
current directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but only locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:{color}
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*

{color:red}Update #2:{color}
In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
Everything's default.

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is also resolved against an HDFS path - 
see the error.

This was fixed in 2.0.0, as previous issues reported, but it appears again in 
2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but only locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:{color}
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*

{color:red}Update #2:{color}
In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
Everything's default.


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is also resolved against an HDFS path - 
> see the error.
> This was fixed in 2.0.0, as previous issues have reported, but it appears 
> again in 2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw 
> such errors: Spark 2.0.0 used to create the spark-warehouse folder within the 
> current directory (which was good) and didn't complain about such weird 
> paths, especially since I'm not using Spark through HDFS, but only locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:{color}
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
> {color:red}Update #2:{color}
> In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
> Everything's default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17917) Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event

2016-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17917:
--
Priority: Minor  (was: Major)

Maybe. I suppose it will be a little tricky to define what the event is here, 
since the event is that something didn't happen. Still, whatever is triggering 
the log might reasonably trigger an event. I don't have a strong feeling on 
this, partly because I'm not sure what the action would then be -- kill the 
job?

> Convert 'Initial job has not accepted any resources..' logWarning to a 
> SparkListener event
> --
>
> Key: SPARK-17917
> URL: https://issues.apache.org/jira/browse/SPARK-17917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Mario Briggs
>Priority: Minor
>
> When supporting Spark on a large, multi-tenant shared cluster with per-tenant 
> quotas, a submitted taskSet often does not get executors because quotas have 
> been exhausted or resources are unavailable. In these situations, firing a 
> SparkListener event instead of just logging the issue (as done currently at 
> https://github.com/apache/spark/blob/9216901d52c9c763bfb908013587dcf5e781f15b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L192)
> would give applications/listeners an opportunity to handle this more 
> appropriately as needed.
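
By way of illustration, a hypothetical sketch of what that could look like on 
the application side (the event type and names below are invented; no such 
event exists in Spark today):

{code}
// Hypothetical custom event plus a listener reacting to it.
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

case class SparkListenerTaskSetStarved(stageId: Int, message: String) extends SparkListenerEvent

class StarvationListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case SparkListenerTaskSetStarved(stageId, msg) =>
      // e.g. alert the tenant, or cancel the starved stage/job.
      println(s"Stage $stageId has not been allocated any resources: $msg")
    case _ => // ignore all other events
  }
}
{code}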



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572976#comment-15572976
 ] 

Alessio commented on SPARK-15565:
-

Yes Sean, indeed in my latest issue, SPARK-17918, I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". I did notice, though, that my issue was a duplicate of 
SPARK-17810, and I'm glad this will be fixed.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572976#comment-15572976
 ] 

Alessio edited comment on SPARK-15565 at 10/13/16 7:49 PM:
---

Yes Sean, indeed in my latest issue, SPARK-17918, I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". I did notice, though, that SPARK-17918 was a duplicate of 
SPARK-17810, and I'm glad this will be fixed.


was (Author: purple):
Yes Sean, indeed in my latest issue, SPARK-17918, I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". I did notice, though, that my issue was a duplicate of 
SPARK-17810, and I'm glad this will be fixed.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

2016-10-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-15369.
---
Resolution: Won't Fix

In the spirit of having more explicit accepts/rejects, and given the 
discussions so far (both on this ticket and on the GitHub pull request), I'm 
going to close this as Won't Fix for now. We can still continue to discuss the 
merits here, but the rejection is based on the following:

1. Maintenance cost of supporting another runtime.

2. Jython is years behind Cython (or even PyPy) in terms of features.

3. Jython cannot leverage any of the numeric tools available.

(In hindsight maybe PyPy support was also added prematurely.)



> Investigate selectively using Jython for parts of PySpark
> -
>
> Key: SPARK-15369
> URL: https://issues.apache.org/jira/browse/SPARK-15369
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial 
> bottleneck. While Jython is not suitable for all UDFs or map functions, it 
> may be suitable for some simple ones. We should investigate the option of 
> using Jython to accelerate these small functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2016-10-13 Thread Dmytro Bielievtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573056#comment-15573056
 ] 

Dmytro Bielievtsov commented on SPARK-10872:


[~sowen] Can you please give me some pointers in the source code so I can start 
working towards a pull request?

> Derby error (XSDB6) when creating new HiveContext after restarting 
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Dmytro Bielievtsov
>
> Starting from spark 1.4.0 (works well on 1.3.1), the following code fails 
> with "XSDB6: Another instance of Derby may have already booted the database 
> ~/metastore_db":
> {code:python}
> from pyspark import SparkContext, HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539]; I intend to restart the SparkContext 
> several times for isolated jobs to prevent cache cluttering and GC errors.
> Here's a larger part of the full error trace:
> {noformat}
> Failed to start database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start 
> database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>   at 
> org.apache.hadoop.hive.metastore.Hi

[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573089#comment-15573089
 ] 

Cody Koeninger commented on SPARK-17812:


One other slightly ugly thing...

{noformat}
// starting topicpartitions, no explicit offset
.option("assign", """{"topicfoo": [0, 1],"topicbar": [0, 1]}"""

// do you allow specifying with explicit offsets in the same config option? 
// or force it all into startingOffsetForRealzYo?
.option("assign", """{ "topicfoo" :{ "0": 1234, "1": 4567 }, "topicbar" : { 
"0": 1234, "1": 4567 }}""")
{noformat}

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17900) Mark the following Spark SQL APIs as stable

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17900:


Assignee: Reynold Xin  (was: Apache Spark)

> Mark the following Spark SQL APIs as stable
> ---
>
> Key: SPARK-17900
> URL: https://issues.apache.org/jira/browse/SPARK-17900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Mark the following stable:
> Dataset/DataFrame
> - functions, since 1.3
> - ColumnName, since 1.3
> - DataFrameNaFunctions, since 1.3.1
> - DataFrameStatFunctions, since 1.4
> - UserDefinedFunction, since 1.3
> - UserDefinedAggregateFunction, since 1.5
> - Window and WindowSpec, since 1.4
> Data sources:
> - DataSourceRegister, since 1.5
> - RelationProvider, since 1.3
> - SchemaRelationProvider, since 1.3
> - CreatableRelationProvider, since 1.3
> - BaseRelation, since 1.3
> - TableScan, since 1.3
> - PrunedScan, since 1.3
> - PrunedFilteredScan, since 1.3
> - InsertableRelation, since 1.3
> Keep the following experimental / evolving:
> Data sources:
> - CatalystScan (tied to internal logical plans so it is not stable by 
> definition)
> Structured streaming:
> - all classes (introduced new in 2.0 and will likely change)
> Dataset typed operations (introduced in 1.6 and 2.0 and might change, 
> although probability is low)
> - all typed methods on Dataset
> - KeyValueGroupedDataset
> - o.a.s.sql.expressions.javalang.typed
> - o.a.s.sql.expressions.scalalang.typed
> - methods that return typed Dataset in SparkSession
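For context, a sketch of what "marking stable" can look like at the code level, 
assuming the {{org.apache.spark.annotation.InterfaceStability}} annotations available 
in Spark 2.1+; if the actual change uses a different mechanism, treat this purely as 
illustration. The class names below are placeholders, not Spark source.

{code}
import org.apache.spark.annotation.InterfaceStability

// Stable: the API is frozen; only backwards-compatible changes are expected.
@InterfaceStability.Stable
class MyStableHelper {
  def supported(): String = "kept source- and binary-compatible across releases"
}

// Evolving: the API may still change between minor releases.
@InterfaceStability.Evolving
class MyEvolvingHelper {
  def experimental(): String = "subject to change"
}
{code}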



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17900) Mark the following Spark SQL APIs as stable

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573090#comment-15573090
 ] 

Apache Spark commented on SPARK-17900:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15469

> Mark the following Spark SQL APIs as stable
> ---
>
> Key: SPARK-17900
> URL: https://issues.apache.org/jira/browse/SPARK-17900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Mark the following stable:
> Dataset/DataFrame
> - functions, since 1.3
> - ColumnName, since 1.3
> - DataFrameNaFunctions, since 1.3.1
> - DataFrameStatFunctions, since 1.4
> - UserDefinedFunction, since 1.3
> - UserDefinedAggregateFunction, since 1.5
> - Window and WindowSpec, since 1.4
> Data sources:
> - DataSourceRegister, since 1.5
> - RelationProvider, since 1.3
> - SchemaRelationProvider, since 1.3
> - CreatableRelationProvider, since 1.3
> - BaseRelation, since 1.3
> - TableScan, since 1.3
> - PrunedScan, since 1.3
> - PrunedFilteredScan, since 1.3
> - InsertableRelation, since 1.3
> Keep the following experimental / evolving:
> Data sources:
> - CatalystScan (tied to internal logical plans so it is not stable by 
> definition)
> Structured streaming:
> - all classes (introduced new in 2.0 and will likely change)
> Dataset typed operations (introduced in 1.6 and 2.0 and might change, 
> although probability is low)
> - all typed methods on Dataset
> - KeyValueGroupedDataset
> - o.a.s.sql.expressions.javalang.typed
> - o.a.s.sql.expressions.scalalang.typed
> - methods that return typed Dataset in SparkSession



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17900) Mark the following Spark SQL APIs as stable

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17900:


Assignee: Apache Spark  (was: Reynold Xin)

> Mark the following Spark SQL APIs as stable
> ---
>
> Key: SPARK-17900
> URL: https://issues.apache.org/jira/browse/SPARK-17900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Mark the following stable:
> Dataset/DataFrame
> - functions, since 1.3
> - ColumnName, since 1.3
> - DataFrameNaFunctions, since 1.3.1
> - DataFrameStatFunctions, since 1.4
> - UserDefinedFunction, since 1.3
> - UserDefinedAggregateFunction, since 1.5
> - Window and WindowSpec, since 1.4
> Data sources:
> - DataSourceRegister, since 1.5
> - RelationProvider, since 1.3
> - SchemaRelationProvider, since 1.3
> - CreatableRelationProvider, since 1.3
> - BaseRelation, since 1.3
> - TableScan, since 1.3
> - PrunedScan, since 1.3
> - PrunedFilteredScan, since 1.3
> - InsertableRelation, since 1.3
> Keep the following experimental / evolving:
> Data sources:
> - CatalystScan (tied to internal logical plans so it is not stable by 
> definition)
> Structured streaming:
> - all classes (introduced new in 2.0 and will likely change)
> Dataset typed operations (introduced in 1.6 and 2.0 and might change, 
> although probability is low)
> - all typed methods on Dataset
> - KeyValueGroupedDataset
> - o.a.s.sql.expressions.javalang.typed
> - o.a.s.sql.expressions.scalalang.typed
> - methods that return typed Dataset in SparkSession



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-10-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17834.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

> Fetch the earliest offsets manually in KafkaSource instead of counting on 
> KafkaConsumer
> ---
>
> Key: SPARK-17834
> URL: https://issues.apache.org/jira/browse/SPARK-17834
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.2, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572922#comment-15572922
 ] 

Cody Koeninger edited comment on SPARK-17812 at 10/13/16 8:33 PM:
--

Sorry, I didn't see this comment until just now.

X offsets back per partition is not a reasonable proxy for time when you're 
dealing with a stream that has multiple topics in it.  Agree we should break 
that out, focus on defining starting offsets in this ticket.

The concern with startingOffsets naming is that, because auto.offset.reset is 
orthogonal to specifying some offsets, you have a situation like this:

{noformat}
.format("kafka")
.option("subscribePattern", "topic.*")
.option("startingOffset", "latest")
.option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 
}, "topicbar" : { "0": 1234, "1": 4567 }}""")
{noformat}

where startingOffsetForRealzYo has a more reasonable name that conveys it is 
specifying starting offsets, yet is not confusingly similar to startingOffset

Trying to hack it all into one json as an alternative, with a "default" topic, 
means you're going to have to pick a key that isn't a valid topic, or add yet 
another layer of indirection.  It also makes it harder to make the format 
consistent with SPARK-17829 (which seems like a good thing to keep consistent, 
I agree)

Obviously I think you should just change the name, but it's your show.






was (Author: c...@koeninger.org):
Sorry, I didn't see this comment until just now.

X offsets back per partition is not a reasonable proxy for time when you're 
dealing with a stream that has multiple topics in it.  Agree we should break 
that out, focus on defining starting offsets in this ticket.

The concern with startingOffsets naming is that, because auto.offset.reset is 
orthogonal to specifying some offsets, you have a situation like this:

.format("kafka")
.option("subscribePattern", "topic.*")
.option("startingOffset", "latest")
.option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 
}, "topicbar" : { "0": 1234, "1": 4567 }}""")

where startingOffsetForRealzYo has a more reasonable name that conveys it is 
specifying starting offsets, yet is not confusingly similar to startingOffset

Trying to hack it all into one json as an alternative, with a "default" topic, 
means you're going to have to pick a key that isn't a valid topic, or add yet 
another layer of indirection.  It also makes it harder to make the format 
consistent with SPARK-17829 (which seems like a good thing to keep consistent, 
I agree)

Obviously I think you should just change the name, but it's your show.





> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Ofir Manor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573109#comment-15573109
 ] 

Ofir Manor commented on SPARK-17812:


Regarding (1) - of course it is *all* data in the source, as of query start. Just the 
same as a file system directory or a database table - I'm not sure a disclaimer that 
the directory or table could have had different data in the past adds anything but 
confusion...
Anyway, startingOffset is confusing because it seems you want a different parameter for 
"assign" --> to explicitly specify starting offsets.
For your use case, I would add:
5. Give me nnn messages (not necessarily the last ones). We still do one of the above 
options (trying to go back nnn messages, somehow split between the topic-partitions 
involved), but do not provide a more explicit guarantee like "last nnn". Generally, the 
distribution of messages to partitions doesn't have to be round-robin or uniform; it is 
strongly based on the key (which could be a state, a URL, etc.).
Anyway, I haven't seen a concrete suggestion on how to specify offsets or a timestamp, 
so I think that would be the next step in this ticket (I suggested you could condense 
everything into one option to avoid dependencies between options, but I don't have an 
elegant "stringly" suggestion).

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17731) Metrics for Structured Streaming

2016-10-13 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-17731.
---
  Resolution: Fixed
   Fix Version/s: 2.1.0
Target Version/s: 2.0.2, 2.1.0  (was: 2.1.0)

> Metrics for Structured Streaming
> 
>
> Key: SPARK-17731
> URL: https://issues.apache.org/jira/browse/SPARK-17731
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.1.0
>
>
> Metrics are needed for monitoring structured streaming apps. Here is the 
> design doc for implementing the necessary metrics.
> https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-17921:
---

 Summary: checkpointLocation being set in memory streams fail after 
restart. Should fail fast
 Key: SPARK-17921
 URL: https://issues.apache.org/jira/browse/SPARK-17921
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Affects Versions: 2.0.1, 2.0.0
Reporter: Burak Yavuz


The checkpointLocation option for memory streams in Structured Streaming is not used 
during recovery; however, the stream can use this location if it is set. Then, during 
recovery, if this location was set, we get an exception saying that the location will 
not be used for recovery and should be deleted.

It's better to just fail before the stream is started in the first place.
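A minimal sketch of the setup this describes: a memory-sink query started with an 
explicit checkpointLocation. The source, schema, and paths below are illustrative only; 
per this report the option is ignored on recovery, hence the proposal to reject it up 
front rather than only after a restart.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Any streaming source works; a file source is used here for illustration.
val schema = new StructType().add("value", StringType)
val input = spark.readStream.schema(schema).json("/tmp/stream-input")

val query = input.writeStream
  .format("memory")                                     // in-memory sink
  .queryName("debug_table")
  .option("checkpointLocation", "/tmp/mem-checkpoint")  // set, but not used for recovery
  .start()
{code}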



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573163#comment-15573163
 ] 

Apache Spark commented on SPARK-17921:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15470

> checkpointLocation being set in memory streams fail after restart. Should 
> fail fast
> ---
>
> Key: SPARK-17921
> URL: https://issues.apache.org/jira/browse/SPARK-17921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>
> The checkpointLocation option for memory streams in Structured Streaming is not 
> used during recovery; however, the stream can use this location if it is set. 
> Then, during recovery, if this location was set, we get an exception saying that 
> the location will not be used for recovery and should be deleted. 
> It's better to just fail before the stream is started in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17921:


Assignee: Apache Spark

> checkpointLocation being set in memory streams fail after restart. Should 
> fail fast
> ---
>
> Key: SPARK-17921
> URL: https://issues.apache.org/jira/browse/SPARK-17921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> The checkpointLocation option for memory streams in Structured Streaming is not 
> used during recovery; however, the stream can use this location if it is set. 
> Then, during recovery, if this location was set, we get an exception saying that 
> the location will not be used for recovery and should be deleted. 
> It's better to just fail before the stream is started in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17921:


Assignee: (was: Apache Spark)

> checkpointLocation being set in memory streams fail after restart. Should 
> fail fast
> ---
>
> Key: SPARK-17921
> URL: https://issues.apache.org/jira/browse/SPARK-17921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>
> The checkpointLocation option for memory streams in Structured Streaming is not 
> used during recovery; however, the stream can use this location if it is set. 
> Then, during recovery, if this location was set, we get an exception saying that 
> the location will not be used for recovery and should be deleted. 
> It's better to just fail before the stream is started in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573166#comment-15573166
 ] 

Cody Koeninger commented on SPARK-17812:


Here's my concrete suggestion:

3 mutually exclusive ways of subscribing:

{noformat}
.option("subscribe","topicFoo,topicBar")
.option("subscribePattern","topic.*")
.option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
{noformat}

where assign can only be specified that way, with no inline offsets.

2 non-mutually-exclusive ways of specifying the starting position; explicit 
startingOffsets obviously take priority:

{noformat}
.option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}""")
.option("startingTime", "earliest" | "latest" | long)
{noformat}
where long is a timestamp; work to be done on that later.
Note that even Kafka 0.8 has an API for time (a really crappy one, based on log file 
modification time), so startingTime doesn't necessarily rule out pursuing timestamps 
later.
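
Put together on the reader, the proposal would look something like the sketch below. 
The option names are the ones suggested in this thread (a proposal under discussion, 
not a released API); the broker address, topics, and offsets are illustrative.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  // exactly one of the three subscription modes: subscribe / subscribePattern / assign
  .option("assign", """{"topicfoo": [0, 1], "topicbar": [0, 1]}""")
  // starting position: explicit per-partition offsets take priority over startingTime
  .option("startingOffsets", """{"topicfoo": {"0": 1234, "1": 4567}}""")
  .option("startingTime", "earliest")
  .load()
{code}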



> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


