Re: Infinite recursion when using SQLContext#createDataFrame(JavaRDD[Row], java.util.List[String])
Definitely a bug. I just checked, and it looks like we don't actually have a function that takes a Scala RDD and a Seq[String]. cc Davies, who added this code a while back.

On Sun, Apr 19, 2015 at 2:56 PM, Justin Uang justin.u...@gmail.com wrote:
> Hi, I have a question regarding SQLContext#createDataFrame(JavaRDD[Row], java.util.List[String]). When I call it, it results in infinite recursion that overflows the stack. I filed it here: https://issues.apache.org/jira/browse/SPARK-6999. What is the best way to fix this? Is the intention that it call a Scala implementation that infers the schema from the datatypes of the Rows and uses the provided column names? Thanks! Justin
Infinite recursion when using SQLContext#createDataFrame(JavaRDD[Row], java.util.List[String])
Hi, I have a question regarding SQLContext#createDataFrame(JavaRDD[Row], java.util.List[String]). When I call it, it results in infinite recursion that overflows the stack. I filed it here: https://issues.apache.org/jira/browse/SPARK-6999. What is the best way to fix this? Is the intention that it call a Scala implementation that infers the schema from the datatypes of the Rows and uses the provided column names? Thanks! Justin
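The failure mode Justin describes can be sketched outside of Spark. The toy classes below are hypothetical, not Spark's actual source: the broken entry point re-dispatches to itself because no matching schema-inferring overload exists on the other side, while the fixed one hands off to a distinct internal method.

```python
# Toy sketch (assumption: not Spark's real code) of the bug class behind
# SPARK-6999: an API entry point that, lacking a matching internal overload,
# keeps re-dispatching to itself until the stack overflows.

class BrokenContext:
    def create_data_frame(self, rows, names):
        # Intended: convert the arguments and hand off to a schema-inferring
        # implementation. Actual: no such implementation exists, so this
        # resolves back to itself and recurses forever.
        return self.create_data_frame(list(rows), list(names))

class FixedContext:
    def create_data_frame(self, rows, names):
        # Fix: delegate to a distinct method that infers the schema from the
        # row values and applies the given column names.
        return self._create_with_schema(rows, self._infer_schema(rows, names))

    def _infer_schema(self, rows, names):
        # Pair each column name with the type of the value in the first row.
        return list(zip(names, (type(v).__name__ for v in rows[0])))

    def _create_with_schema(self, rows, schema):
        return {"schema": schema, "rows": rows}

rows = [(1, "a"), (2, "b")]
try:
    BrokenContext().create_data_frame(rows, ["id", "value"])
except RecursionError as e:
    print("stack overflow:", type(e).__name__)

df = FixedContext().create_data_frame(rows, ["id", "value"])
print(df["schema"])  # [('id', 'int'), ('value', 'str')]
```

The point is only the dispatch shape: the public overload must bottom out in an implementation with a different signature, rather than re-entering itself after converting its arguments.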
Question about recovery from checkpoint exception [SPARK-6892]
Hi, when I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I found that it reuses the previous application id (in my case, application_1428664056212_0016) when writing the Spark event log. But my current application id is application_1428664056212_0017, so writing the event log fails. The stack trace follows:

15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
        at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
        at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
        at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

Can someone help me? The issue is SPARK-6892. Thanks.
Re: dataframe can not find fields after loading from hive
Hi Cesar, can you try 1.3.1 (https://spark.apache.org/releases/spark-release-1-3-1.html) and see if it still shows the error? Thanks, Yin

On Fri, Apr 17, 2015 at 1:58 PM, Reynold Xin r...@databricks.com wrote:
> This is strange. cc the dev list since it might be a bug.
>
> On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores ces...@gmail.com wrote:
>> Never mind. I found the solution:
>>
>> val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, hiveLoadedDataFrame.schema)
>>
>> which converts the data frame to an RDD and back again to a data frame. Not the prettiest solution, but at least it solves my problem. Thanks, Cesar Flores
>>
>> On Thu, Apr 16, 2015 at 11:17 AM, Cesar Flores ces...@gmail.com wrote:
>>> I have a data frame into which I load data from a Hive table, and my issue is that the data frame is missing the columns that I need to query. For example:
>>>
>>> val newdataset = dataset.where(dataset("label") === 1)
>>>
>>> gives me an error like the following:
>>>
>>> ERROR yarn.ApplicationMaster: User class threw exception: resolved attributes label missing from label, user_id, ... (the rest of the fields of my table)
>>> org.apache.spark.sql.AnalysisException: resolved attributes label missing from label, user_id, ... (the rest of the fields of my table)
>>>
>>> where we can see that the label field actually exists. I managed to work around this issue by changing my syntax to:
>>>
>>> val newdataset = dataset.where($"label" === 1)
>>>
>>> which works. However, I cannot use this trick in all my queries. For example, when I try to do a unionAll of two subsets of the same data frame, the error I get is that all my fields are missing. Can someone tell me if I need to do some post-processing after loading from Hive in order to avoid this kind of error? Thanks -- Cesar Flores
>> -- Cesar Flores
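The "resolved attributes ... missing" message suggests that the captured column no longer matches the attributes of the re-analyzed plan. As a hedged illustration only (a toy model, not Spark's actual Catalyst analyzer), the sketch below gives columns unique ids the way Catalyst attributes carry expression ids: a Column object captured from an old frame stops resolving after re-analysis, while a name-based lookup, as with $"label", still succeeds.

```python
# Toy model (assumption: not Spark's source) of why an id-carrying column
# reference can fail to resolve while a name-based one still works.
import itertools

_ids = itertools.count()

class Column:
    """Stand-in for a Catalyst attribute: a name plus a unique expression id."""
    def __init__(self, name):
        self.name = name
        self.id = next(_ids)

class Frame:
    """Stand-in data frame that resolves predicates against its own ids."""
    def __init__(self, names):
        self.columns = [Column(n) for n in names]

    def col(self, name):
        # Name-based lookup (like $"label"): always resolves on *this* frame.
        for c in self.columns:
            if c.name == name:
                return c
        raise KeyError(name)

    def where(self, column):
        # Id-based check: a column captured from a stale plan fails here,
        # mirroring "resolved attributes ... missing".
        if all(c.id != column.id for c in self.columns):
            raise ValueError(
                "resolved attributes %s#%d missing from %s"
                % (column.name, column.id,
                   ", ".join(c.name for c in self.columns)))
        return self

    def reanalyzed(self):
        # Mimic a transformation that assigns fresh attribute ids, e.g.
        # rebuilding the frame via createDataFrame(df.rdd, df.schema).
        return Frame([c.name for c in self.columns])

old = Frame(["label", "user_id"])
stale = old.col("label")        # captured before re-analysis
fresh = old.reanalyzed()

fresh.where(fresh.col("label")) # resolves: looked up by name on the new frame
try:
    fresh.where(stale)          # fails: the old attribute id no longer exists
except ValueError as e:
    print(e)
```

This also suggests why Cesar's round-trip through createDataFrame(rdd, schema) helps: rebuilding the frame makes every subsequent lookup resolve against one consistent set of attributes.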
Re: Question about recovery from checkpoint exception [SPARK-6892]
This is why spark.hadoop.validateOutputSpecs exists, really: https://spark.apache.org/docs/latest/configuration.html

On Mon, Apr 20, 2015 at 3:40 AM, wyphao.2007 wyphao.2...@163.com wrote:
> Hi, when I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I found that it reuses the previous application id (in my case, application_1428664056212_0016) when writing the Spark event log. But my current application id is application_1428664056212_0017, so writing the event log fails. The stack trace follows:
>
> 15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
> java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
>         at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
>         at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
>         at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
>         at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
>         at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
>
> Can someone help me? The issue is SPARK-6892. Thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: Re: Question about recovery from checkpoint exception [SPARK-6892]
Hi Sean Owen, thank you for your attention. I know about spark.hadoop.validateOutputSpecs. When I restarted the job, the application id was application_1428664056212_0017 and it recovered from the checkpoint, but it writes the event log into the application_1428664056212_0016 directory. I think it should write to application_1428664056212_0017, not application_1428664056212_0016.

At 2015-04-20 11:46:12, Sean Owen so...@cloudera.com wrote:
> This is why spark.hadoop.validateOutputSpecs exists, really: https://spark.apache.org/docs/latest/configuration.html
>
> On Mon, Apr 20, 2015 at 3:40 AM, wyphao.2007 wyphao.2...@163.com wrote:
>> Hi, when I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I found that it reuses the previous application id (in my case, application_1428664056212_0016) when writing the Spark event log. But my current application id is application_1428664056212_0017, so writing the event log fails. The stack trace follows:
>>
>> 15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
>> java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
>> [stack trace snipped; quoted in full above]
>>
>> Can someone help me? The issue is SPARK-6892. Thanks.
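Until the root cause (the recovered checkpoint carrying the old application id into the event-log path) is addressed, one possible stopgap, assuming the spark.eventLog.overwrite setting documented in Spark's configuration page applies to this code path, is to let the event-logging listener replace a leftover log with the same name. A sketch of the relevant spark-defaults.conf entries, using the paths from the report above:

```
# Hedged workaround sketch, not a fix for the application-id reuse itself:
# allow the event-log writer to overwrite an existing log of the same name.
spark.eventLog.enabled    true
spark.eventLog.dir        hdfs://mycluster/spark-logs/eventLog
spark.eventLog.overwrite  true
```

Note this only suppresses the "Target log file already exists" failure; the history server would still see the recovered run under the old application id.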