Re: Infinite recursion when using SQLContext#createDataFrame(JavaRDD[Row], java.util.List[String])

2015-04-19 Thread Reynold Xin
Definitely a bug. I just checked and it looks like we don't actually have a
function that takes a Scala RDD and Seq[String].

cc Davies who added this code a while back.
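
For reference, a rough sketch of the kind of Scala-side overload that seems
to be missing (illustrative only, not the actual fix; the StringType
placeholder schema is an assumption, real inference would inspect the row
values):

    // Illustrative sketch: build a schema from the column names and delegate
    // to the schema-based overload, so there is no self-call.
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row, SQLContext}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    def createDataFrame(sqlContext: SQLContext, rowRDD: RDD[Row],
                        columns: Seq[String]): DataFrame = {
      val schema = StructType(columns.map(name =>
        StructField(name, StringType, nullable = true)))
      sqlContext.createDataFrame(rowRDD, schema)
    }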


On Sun, Apr 19, 2015 at 2:56 PM, Justin Uang justin.u...@gmail.com wrote:

 Hi,

 I have a question regarding SQLContext#createDataFrame(JavaRDD[Row],
 java.util.List[String]). It looks like when I try to call it, it results in
 an infinite recursion that overflows the stack. I filed it here:
 https://issues.apache.org/jira/browse/SPARK-6999.

 What is the best way to fix this? Is the intention that it should call a
 Scala implementation that infers the schema from the datatypes of the Rows
 and uses the provided column names?

 Thanks!

 Justin



Infinite recursion when using SQLContext#createDataFrame(JavaRDD[Row], java.util.List[String])

2015-04-19 Thread Justin Uang
Hi,

I have a question regarding SQLContext#createDataFrame(JavaRDD[Row],
java.util.List[String]). It looks like when I try to call it, it results in
an infinite recursion that overflows the stack. I filed it here:
https://issues.apache.org/jira/browse/SPARK-6999.

What is the best way to fix this? Is the intention that it should call a
Scala implementation that infers the schema from the datatypes of the Rows
and uses the provided column names?
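
For reference, a minimal sketch of the call I mean (assumes an existing sc
and sqlContext; the column names and data are just for illustration):

    // Minimal reproducer sketch for the overload in question (SPARK-6999).
    import scala.collection.JavaConverters._
    import org.apache.spark.sql.Row

    val rows = sc.parallelize(Seq(Row("a", 1), Row("b", 2)))
    val javaRDD = rows.toJavaRDD()
    val columns = Seq("name", "count").asJava   // java.util.List[String]
    // This is the call that overflows the stack for me:
    val df = sqlContext.createDataFrame(javaRDD, columns)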

Thanks!

Justin


Question about recovery from checkpoint exception[SPARK-6892]

2015-04-19 Thread wyphao.2007
Hi,
   When I recover from a checkpoint in yarn-cluster mode using Spark Streaming,
I found that it reuses the application id from the previous run (in my case
application_1428664056212_0016) when writing the Spark event log. But my
current application id is application_1428664056212_0017, so writing the
event log fails. The stack trace is as follows:
15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
    at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
    at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
    at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
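
For context, the recovery is set up in the standard way, roughly like this
(the paths and names here are illustrative, not my exact code):

    // Sketch of the checkpoint-recovery setup.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs://mycluster/spark-checkpoints/myApp"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("MyStreamingApp")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... define the streaming computation here ...
      ssc
    }

    // Recovers from the checkpoint if one exists, otherwise builds a new context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
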
Can someone help me? The issue is SPARK-6892.
thanks





Re: dataframe can not find fields after loading from hive

2015-04-19 Thread Yin Huai
Hi Cesar,

Can you try 1.3.1 (
https://spark.apache.org/releases/spark-release-1-3-1.html) and see if it
still shows the error?

Thanks,

Yin

On Fri, Apr 17, 2015 at 1:58 PM, Reynold Xin r...@databricks.com wrote:

 This is strange. cc the dev list since it might be a bug.



 On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores ces...@gmail.com wrote:

 Never mind. I found the solution:

 val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd,
 hiveLoadedDataFrame.schema)

 which translates to converting the data frame to an RDD and back again to a
 data frame. Not the prettiest solution, but at least it solves my problem.
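
 In context, roughly (the table name and context setup here are just for
 illustration):

     // Rebuild the DataFrame from its own RDD and schema so that column
     // references like dataset("label") resolve again.
     val hc = new org.apache.spark.sql.hive.HiveContext(sc)
     val hiveLoadedDataFrame = hc.table("my_table")
     val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd,
       hiveLoadedDataFrame.schema)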


 Thanks,
 Cesar Flores



 On Thu, Apr 16, 2015 at 11:17 AM, Cesar Flores ces...@gmail.com wrote:


 I have a data frame into which I load data from a Hive table, and my issue
 is that the data frame seems to be missing the columns that I need to query.

 For example:

 val newdataset = dataset.where(dataset("label") === 1)

 gives me an error like the following:

 ERROR yarn.ApplicationMaster: User class threw exception: resolved
 attributes label missing from label, user_id, ... (the rest of the fields of
 my table)
 org.apache.spark.sql.AnalysisException: resolved attributes label
 missing from label, user_id, ... (the rest of the fields of my table)

 where we can see that the label field actually exists. I managed to solve
 this issue by updating my syntax to:

 val newdataset = dataset.where($"label" === 1)

 which works. However, I cannot use this trick in all my queries. For
 example, when I try to do a unionAll of two subsets of the same data
 frame, the error I get is that all my fields are missing.
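
 (For completeness, the $"..." syntax needs the context's implicits in scope;
 the context name below is just an example:)

     import hc.implicits._   // hc being the HiveContext/SQLContext in use
     val newdataset = dataset.where($"label" === 1)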

 Can someone tell me if I need to do some post-processing after loading
 from Hive in order to avoid this kind of error?


 Thanks
 --
 Cesar Flores




 --
 Cesar Flores





Re: Question about recovery from checkpoint exception[SPARK-6892]

2015-04-19 Thread Sean Owen
This is why spark.hadoop.validateOutputSpecs exists, really:
https://spark.apache.org/docs/latest/configuration.html
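
Roughly, that means something like this (illustrative; set it before the
context is created):

    // Disables the pre-existing-output check used by saveAsHadoopFile and
    // friends; that check can otherwise fail when output from a previous run
    // already exists, e.g. after recovering from a checkpoint.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("MyStreamingApp")
      .set("spark.hadoop.validateOutputSpecs", "false")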

On Mon, Apr 20, 2015 at 3:40 AM, wyphao.2007 wyphao.2...@163.com wrote:
 Hi,
    When I recover from a checkpoint in yarn-cluster mode using Spark
 Streaming, I found that it reuses the application id from the previous run
 (in my case application_1428664056212_0016) when writing the Spark event
 log. But my current application id is application_1428664056212_0017, so
 writing the event log fails. The stack trace is as follows:
 15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
 java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
     at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
     at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
     at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
     at scala.Option.foreach(Option.scala:236)
     at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
     at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 Can someone help me? The issue is SPARK-6892.
 thanks







Re:Re: Question about recovery from checkpoint exception[SPARK-6892]

2015-04-19 Thread wyphao.2007
Hi Sean Owen, Thank you for your attention.


I know about spark.hadoop.validateOutputSpecs.


I restarted the job; the new application id is application_1428664056212_0017
and it recovered from the checkpoint, but it writes the event log into the
application_1428664056212_0016 directory. I think it should write to
application_1428664056212_0017, not application_1428664056212_0016.
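
As a stopgap I have been considering letting the event log overwrite the old
file, roughly like this (not verified, and it does not fix the wrong
directory; it would only avoid the failure on shutdown):

    // Possible stopgap: allow EventLoggingListener to overwrite an existing
    // log instead of throwing "Target log file already exists".
    val conf = new org.apache.spark.SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs://mycluster/spark-logs/eventLog")
      .set("spark.eventLog.overwrite", "true")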



At 2015-04-20 11:46:12, Sean Owen so...@cloudera.com wrote:
This is why spark.hadoop.validateOutputSpecs exists, really:
https://spark.apache.org/docs/latest/configuration.html

On Mon, Apr 20, 2015 at 3:40 AM, wyphao.2007 wyphao.2...@163.com wrote:
 Hi,
    When I recover from a checkpoint in yarn-cluster mode using Spark
 Streaming, I found that it reuses the application id from the previous run
 (in my case application_1428664056212_0016) when writing the Spark event
 log. But my current application id is application_1428664056212_0017, so
 writing the event log fails. The stack trace is as follows:
 15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
 java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
     at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
     at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
     at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
     at scala.Option.foreach(Option.scala:236)
     at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
     at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 Can someone help me? The issue is SPARK-6892.
 thanks



