master hung after killing the streaming sc

2015-09-18 Thread ZhuGe
Hi there: we recently deployed a streaming application on our standalone cluster, and we found an issue when trying to stop the streaming sc (it had been running for several days) with the kill command in the Spark UI. By kill command, I mean the 'kill' button in the "Submission ID" column of

Notification on Spark Streaming job failure

2015-09-18 Thread Krzysztof Zarzycki
Hi there Spark Community, I would like to ask you for advice: I'm running Spark Streaming jobs in production. Sometimes these jobs fail and I would like to get an email notification about it. Do you know how I can set up Spark to notify me by email if my job fails? Or do I have to use external

Re: parquet error

2015-09-18 Thread Chengi Liu
Hi, I did some digging.. I believe the error is caused by jets3t jar. Essentially these lines locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', 'java/net/URI', 'org/apache/hadoop/conf/Configuration', 'org/apache/hadoop/fs/s3/S3Credentials',

Re: Spark Streaming checkpoint recovery throws Stack Overflow Error

2015-09-18 Thread Saisai Shao
Hi Swetha, The stack overflow problem is that when recovering from checkpoint data, Java deserializes objects recursively, so if you have a deep object graph this recursion can easily lead to a stack overflow. This is caused by Java's deserialization mechanism; you need to
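Saisai's point can be reproduced outside Spark: like the JVM, Python serializes deeply linked object graphs recursively, so a long chain (analogous to a long DStream lineage) exhausts the stack. A minimal Spark-free sketch, using `pickle` as a stand-in for Java deserialization:

```python
import pickle

# Build a deeply linked chain of objects, analogous to a long lineage.
class Node:
    def __init__(self, nxt):
        self.nxt = nxt

head = None
for _ in range(100_000):
    head = Node(head)

# Serializing the chain walks it recursively, so a chain far deeper than
# the recursion limit blows the stack -- the same failure mode as
# recovering a very long checkpointed lineage on the JVM.
try:
    pickle.dumps(head)
    overflowed = False
except RecursionError:
    overflowed = True

print(overflowed)
```

The usual remedies are the same in spirit: shorten the graph (e.g. periodic checkpointing that truncates lineage) or raise the stack/recursion limit.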

What's the best practice to parse JSON using spark

2015-09-18 Thread Cui Lin
Hello, All, Parsing JSON's nested structure is easy using the Java or Python API. Where can I find a similar way to parse JSON files using Spark? Another question: using SparkSQL, how can I easily save the results into a NoSQL DB? Any examples? Thanks a lot! -- Best regards! Lin, Cui
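For context on what "nested structure" means here: in plain Python the nesting is navigated with dict indexing, and Spark SQL exposes the same shape as dotted column paths (e.g. `SELECT address.city FROM people`). A tiny stdlib-only illustration with made-up field names:

```python
import json

# One record, as it might appear one-per-line in a JSON file.
raw = '{"name": "Cui", "address": {"city": "SF", "zip": "94105"}}'
record = json.loads(raw)

# Plain Python: index into the nested dicts.
city = record["address"]["city"]
print(city)
```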

SparkML pipelines and error recovery

2015-09-18 Thread Fatma Ozcan
Trying to understand how Spark ML pipelines work in case of failures. If I have multiple transformers and one of them fails, will the lineage based recovery of rdd's automatically kick in? Thanks, Fatma

Re: What's the best practice to parse JSON using spark

2015-09-18 Thread Ted Yu
For #2, please see: examples/src/main/scala//org/apache/spark/examples/HBaseTest.scala examples/src/main/scala//org/apache/spark/examples/pythonconverters/HBaseConverters.scala In hbase, there is hbase-spark module which is being polished. Should be available in hbase 1.3.0 release. Cheers On

Re: What's the best practice to parse JSON using spark

2015-09-18 Thread Ted Yu
For #1, see this thread: http://search-hadoop.com/m/q3RTti0Thneenne2 For #2, also see: examples//src/main/python/hbase_inputformat.py examples//src/main/python/hbase_outputformat.py Cheers On Fri, Sep 18, 2015 at 5:12 PM, Ted Yu wrote: > For #2, please see: > >

Re: Checkpointing with Kinesis

2015-09-18 Thread Nick Pentreath
Are you doing actual transformations / aggregation in Spark Streaming? Or just using it to bulk write to S3? If the latter, then you could just use your AWS Lambda function to read directly from the Kinesis stream. If the former, then perhaps either look into the WAL option that Aniket mentioned,

Running the deep-learning the application on cluster:

2015-09-18 Thread Angel Angel
Hello,

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-18 Thread Tathagata Das
Also, giving more log4j messages around the error area would be useful. On Thu, Sep 17, 2015 at 1:02 PM, Cody Koeninger wrote: > Is there a particular reason you're calling checkpoint on the stream in > addition to the streaming context? > > On Thu, Sep 17, 2015 at 2:36 PM,

Re: WAL on S3

2015-09-18 Thread Tathagata Das
I don't think it would work with multipart upload either. The file is not visible until the multipart upload is explicitly closed. So even if each write is a separate part upload, none of the parts are visible until the multipart upload is closed. TD On Fri, Sep 18, 2015 at 1:55 AM, Steve Loughran

Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.

2015-09-18 Thread shahab
Hi, Probably I have wrong zeppelin configuration, because I get the following error when I execute spark statements in Zeppelin: org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use

SparkContext declared as object variable

2015-09-18 Thread Priya Ch
Hello All, Instead of declaring the SparkContext in main, I declared it as an object variable, as in: object sparkDemo { val conf = new SparkConf val sc = new SparkContext(conf) def main(args: Array[String]) { val baseRdd = sc.parallelize() . . . } } But this piece of code is giving
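The usual pitfall with this pattern is that referencing members of the enclosing object from inside a closure drags the whole object, SparkContext included, into the serialized task, and SparkContext is not serializable. A Spark-free Python analogy (a thread lock stands in for the unserializable SparkContext; class and field names are invented):

```python
import pickle
import threading

class SparkDemo:
    def __init__(self):
        # Stand-in for SparkContext: a live resource that cannot be pickled.
        self.sc = threading.Lock()

    def work(self, x):
        return x + 1

demo = SparkDemo()

# Shipping the bound method drags the whole instance -- including its
# unserializable member -- into the serialized payload, and fails.
try:
    pickle.dumps(demo.work)
    failed = False
except TypeError:
    failed = True

print(failed)
```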

Re: Python UDF and explode error

2015-09-18 Thread Pavel Burdanov
This is similar to SPARK-10685 and SPARK-9131, but in my case the error reproduces even in local mode with one worker. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-UDF-and-explode-error-tp24736p24740.html Sent from the Apache Spark User List

unsubscribe

2015-09-18 Thread Nambi
unsubscribe

SparkR pca?

2015-09-18 Thread Deborah Siegel
Hi, Can PCA be implemented in a SparkR-MLlib integration? There are perhaps 2 separate issues: 1) Having the methods in SparkRWrapper and RFormula which will send the right input types through the pipeline. MLlib PCA operates either on a RowMatrix or on the feature vector of an RDD[LabeledPoint]. The

Re: spark 1.5, ML Pipeline Decision Tree Dataframe Problem

2015-09-18 Thread Yasemin Kaya
Thanks, I tried it but I can't make it work. JavaPairRDD unlabeledTest, where the vector is a DenseVector. I added import org.apache.spark.sql.SQLContext.implicits$ but there is no toDF() method; I am using Java, not Scala. 2015-09-18 20:02 GMT+03:00 Feynman Liang : > What

Does anyone use ShuffleDependency directly?

2015-09-18 Thread Josh Rosen
Does anyone use ShuffleDependency directly in their Spark code or libraries? If so, how do you use it? Similarly, does anyone use ShuffleHandle

Re: unsubscribe

2015-09-18 Thread Richard Hillegas
To unsubscribe from the user list, please send a message to user-unsubscr...@spark.apache.org as described here: http://spark.apache.org/community.html#mailing-lists. Thanks, -Rick

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-18 Thread Michal Čizmazia
Hi Petr, after Ctrl+C can you see the following message in the logs? Invoking stop(stopGracefully=false) Details: https://github.com/apache/spark/pull/6307 On 18 September 2015 at 10:28, Petr Novak wrote: > It might be connected with my problems with gracefulShutdown in

Re: Spark streaming to database exception handling

2015-09-18 Thread chyeers
I have the same problem. I used Spark Streaming to save data to HBase via the Phoenix JDBC driver. It seems to be OK sometimes, but data goes missing when an exception happens while writing to external storage: org.apache.phoenix.exception.BatchUpdateExecution: ERROR 1106 (XCL06): Exception while

Re: Limiting number of cores per job in multi-threaded driver.

2015-09-18 Thread Adrian Tanase
Reading through the docs it seems that with a combination of FAIR scheduler and maybe pools you can get pretty far. However the smallest unit of scheduled work is the task so probably you need to think about the parallelism of each transformation. I'm guessing that by increasing the level of

Re: Limiting number of cores per job in multi-threaded driver.

2015-09-18 Thread Adrian Tanase
Forgot to mention that you could also restrict the parallelism to 4, essentially using only 4 cores at any given time, however if your job is complex, a stage might be broken into more than 1 task... Sent from my iPhone On 19 Sep 2015, at 08:30, Adrian Tanase

Re: Using Spark for portfolio manager app

2015-09-18 Thread Adrian Tanase
Cool use case! You should definitely be able to model it with Spark. For the first question it's pretty easy - you probably need to keep the user portfolios as state using updateStateByKey. You need to consume 2 event sources - user trades and stock changes. You probably want to Cogroup the
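The core of updateStateByKey is just a fold: Spark keeps one state value per key and applies your update function to each batch's new events for that key. A toy, Spark-free sketch of the idea for portfolios (the field names and event shape are invented for illustration):

```python
# Toy model of updateStateByKey: fold a batch of (key, event) pairs into
# per-key state. Here state is a dict of stock -> shares held.
def update_state(state, events):
    for stock, delta_shares in events:
        state[stock] = state.get(stock, 0) + delta_shares
    return state

portfolio = {}
batch1 = [("AAPL", 10), ("GOOG", 5)]   # first micro-batch of trades
batch2 = [("AAPL", -3)]                 # second micro-batch: partial sell
portfolio = update_state(portfolio, batch1)
portfolio = update_state(portfolio, batch2)
print(portfolio)  # {'AAPL': 7, 'GOOG': 5}
```

In real Spark Streaming the update function receives the new values and the previous state per key, and the cogrouped price stream would be joined in to compute gain/loss.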

Using Spark for portfolio manager app

2015-09-18 Thread Thúy Hằng Lê
Hi all, I am going to build a financial application for a Portfolio Manager, where each portfolio contains a list of stocks, the number of shares purchased, and the purchase price. Another source of information is stock prices from market data. The application needs to calculate real-time gain or

Re: in joins, does one side stream?

2015-09-18 Thread Reynold Xin
Yes for RDD -- both are materialized. No for DataFrame/SQL - one side streams. On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers wrote: > in scalding we join with the smaller side on the left, since the smaller > side will get buffered while the bigger side streams through the

Re: Checkpointing with Kinesis

2015-09-18 Thread Michal Čizmazia
FYI re WAL on S3 http://search-hadoop.com/m/q3RTtFMpd41A7TnH/WAL+S3=WAL+on+S3 On 18 September 2015 at 13:32, Alan Dipert wrote: > Hello, > > Thanks all for considering our problem. We are doing transformations in > Spark Streaming. We have also since learned that WAL to S3

Not able to group by Scala UDF

2015-09-18 Thread Jeff Jones
I’m trying to perform a Spark SQL (1.5) query containing a UDF in the select and group by clauses. From what I’ve been able to find this should be supported. A few examples include https://github.com/spirom/LearningSpark/blob/master/src/main/scala/sql/UDF.scala,

Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-18 Thread Aniket Bhatnagar
I don't think yarn-cluster mode is currently supported. You may want to ask zeppelin community for confirmation though. On Fri, Sep 18, 2015, 5:41 PM shahab wrote: > It works using yarn-client but I want to make it running on cluster. Is > there any way to do so? > >

Constant Spark execution time with different # of slaves

2015-09-18 Thread Warfish
Hi everyone, for research purposes I wanted to see how Spark scales for my algorithm with regards to different cluster sizes. I have a cluster with 10 nodes with 6 cores and 45 GB of RAM each. My algorithm takes approximately 15 minutes to execute on all nodes (as seen in Spark UI, each node was
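One common explanation for flat runtimes as nodes are added is a serial (non-parallelizable) fraction of the job; Amdahl's law quantifies the ceiling. A quick sketch of the arithmetic, offered as one hypothesis rather than a diagnosis of this particular algorithm:

```python
# Amdahl's law: with serial fraction s, speedup on n workers is
# 1 / (s + (1 - s) / n). A large s makes extra nodes nearly useless.
def speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

# If 80% of the work is serial, 10 nodes barely help, and even
# 1000 nodes cannot push past 1.25x:
print(round(speedup(0.8, 10), 2))    # 1.22
print(round(speedup(0.8, 1000), 2))  # 1.25
```

Other usual suspects worth checking: too few partitions to occupy all cores, and a single straggler stage dominating the 15 minutes.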

Re: master hung after killing the streaming sc

2015-09-18 Thread Tathagata Das
Do you have event logs enabled? Streaming + event logs enabled can hang master - https://issues.apache.org/jira/browse/SPARK-6270 On Thu, Sep 17, 2015 at 11:35 PM, ZhuGe wrote: > Hi there: > we recently deploy a streaming application in our stand alone cluster. And > we found

Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-18 Thread shahab
It works using yarn-client but I want to make it running on cluster. Is there any way to do so? best, /Shahab On Fri, Sep 18, 2015 at 12:54 PM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Can you try yarn-client mode? > > On Fri, Sep 18, 2015, 3:38 PM shahab

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-18 Thread Vipul Rai
Hi Nick/Igor, ​​ Any solution for this ? Even I am having the same issue and copying jar to each executor is not feasible if we use lot of jars. Thanks, Vipul

Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-18 Thread Aniket Bhatnagar
Can you try yarn-client mode? On Fri, Sep 18, 2015, 3:38 PM shahab wrote: > Hi, > > Probably I have wrong zeppelin configuration, because I get the following > error when I execute spark statements in Zeppelin: > > org.apache.spark.SparkException: Detected yarn-cluster

Re: application failed on large dataset

2015-09-18 Thread 周千昊
Hi, The issue turns out to be a memory issue. Thanks for the guidance. On Thu, Sep 17, 2015 at 12:39 PM, 周千昊 wrote: > indeed, the operation in this stage is quite memory consuming. > We are trying to enable the printGCDetail option and see what is going on. > > java8964

Re: WAL on S3

2015-09-18 Thread Steve Loughran
> On 17 Sep 2015, at 21:40, Tathagata Das wrote: > > Actually, the current WAL implementation (as of Spark 1.5) does not work with > S3 because S3 does not support flushing. Basically, the current > implementation assumes that after write + flush, the data is immediately

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-18 Thread Petr Novak
It might be connected with my problems with gracefulShutdown in Spark 1.5.0 2.11 https://mail.google.com/mail/#search/petr/14fb6bd5166f9395 Maybe Ctrl+C corrupts checkpoints and breaks gracefulShutdown? Petr On Fri, Sep 18, 2015 at 4:10 PM, Petr Novak wrote: > ...to

Breakpoints not hit with Scalatest + intelliJ

2015-09-18 Thread Michel Lemay
Hi, I'm adding unit tests to some utility functions that are using SparkContext but I'm unable to debug code and hit breakpoints when running under IntelliJ. My test creates a SparkContext from a SparkConf().setMaster("local[*]")... However, when I place a breakpoint in the functions inside a

spark 1.5, ML Pipeline Decision Tree Dataframe Problem

2015-09-18 Thread Yasemin Kaya
Hi, I am using *Spark 1.5, ML Pipeline Decision Tree* to get the tree's probabilities. But I have to convert my data to the DataFrame type. While creating the model there is no problem, but when I use the model on my data there is a

Python UDF and explode error

2015-09-18 Thread Pavel Burdanov
from pyspark.sql.types import * import pyspark.sql.functions as F import re re_expr = re.compile('P([0-9]+)') expr_split = F.udf(lambda x: list(map(int, re_expr.findall(x))), ArrayType(IntegerType())) c = sc.parallelize([(i, 'P1') for i in range(10 ** 6)], 1).toDF(['a', 'b'])
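The UDF body itself can be checked outside Spark before wrapping it in `F.udf`; the regex pulls every integer that follows a 'P'. Standalone, stdlib-only:

```python
import re

re_expr = re.compile('P([0-9]+)')

def extract(x):
    # Same logic as the UDF lambda: all integers following a 'P'.
    return list(map(int, re_expr.findall(x)))

print(extract('P1'))        # [1]
print(extract('P12 P345'))  # [12, 345]
```

Isolating the Python function this way helps tell a logic bug apart from a Spark-side UDF/explode issue like the SPARK-10685 family mentioned in the follow-up.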

Re: Breakpoints not hit with Scalatest + intelliJ

2015-09-18 Thread Stephen Boesch
Hi Michel, please try local[1] and report back if the breakpoint were hit. 2015-09-18 7:37 GMT-07:00 Michel Lemay : > Hi, > > I'm adding unit tests to some utility functions that are using > SparkContext but I'm unable to debug code and hit breakpoints when running > under

Re: Spark Streaming stop gracefully doesn't return to command line after upgrade to 1.4.0 and beyond

2015-09-18 Thread Petr Novak
I removed custom shutdown hook and it still doesn't work. I'm using KafkaDirectStream. I sometimes get java.lang.InterruptedException on Ctrl+C sometimes it goes through fine. I have this code now: ... some stream processing ... ssc.start() ssc.awaitTermination() ssc.stop(stopSparkContext =

Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-18 Thread Ellen Kraffmiller
Thanks for your response. Is there a reason why this thread isn't appearing on the mailing list? So far, I only see my post, with no answers, although I have received 2 answers via email. It would be nice if other people could see these answers as well. On Thu, Sep 17, 2015 at 2:22 AM, Sun,

GraphX to work with Streaming

2015-09-18 Thread Rohit Kumar
Hi I am having a setup where I have the edges coming as a stream I want to create a graph in GraphX which updates its structure after new edge comes. Is there any way to do this using spark streaming and graphx? Regards Rohit

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-18 Thread Petr Novak
This one is generated, I suppose, after Ctrl+C 15/09/18 14:38:25 INFO Worker: Asked to kill executor app-20150918143823-0001/0 15/09/18 14:38:25 INFO Worker: Asked to kill executor app-20150918143823-0001/0 15/09/18 14:38:25 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor]

Re: Caching intermediate results in Spark ML pipeline?

2015-09-18 Thread Jingchu Liu
Thanks buddy I'll try it out in my project. Best, Lewis 2015-09-16 13:29 GMT+08:00 Feynman Liang : > If you're doing hyperparameter grid search, consider using > ml.tuning.CrossValidator which does cache the dataset >

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-18 Thread Petr Novak
I have tried it on Spark 1.3.0 2.10 and it works. The same code doesn't on Spark 1.5.0 2.11. It would be nice if anybody could try on another installation to ensure it is something wrong on my cluster. Many thanks, Petr On Fri, Sep 18, 2015 at 4:07 PM, Petr Novak wrote: >

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-18 Thread Petr Novak
...to ensure it is not something wrong on my cluster. On Fri, Sep 18, 2015 at 4:09 PM, Petr Novak wrote: > I have tried it on Spark 1.3.0 2.10 and it works. The same code doesn't on > Spark 1.5.0 2.11. It would be nice if anybody could try on another > installation to

Re: Spark + Druid

2015-09-18 Thread Harish Butani
Hi, I have just posted a Blog on this: https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani regards, Harish Butani. On Tue, Sep 1, 2015 at 11:46 PM, Paolo Platter wrote: > Fantastic!!! I will look into that and I hope to

Re: spark 1.5, ML Pipeline Decision Tree Dataframe Problem

2015-09-18 Thread Feynman Liang
What is the type of unlabeledTest? SQL should be using the VectorUDT we've defined for Vectors so you should be able to just "import sqlContext.implicits._" and then call

Spark Streaming checkpoint recovery throws Stack Overflow Error

2015-09-18 Thread swetha
Hi, When I try to recover my Spark Streaming job from a checkpoint directory, I get a StackOverFlow Error as shown below. Any idea as to why this is happening? 15/09/18 09:02:20 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped java.lang.StackOverflowError

Re: Spark Streaming checkpoint recovery throws Stack Overflow Error

2015-09-18 Thread Ted Yu
Which version of Java are you using ? And release of Spark, please. Thanks On Fri, Sep 18, 2015 at 9:15 AM, swetha wrote: > Hi, > > When I try to recover my Spark Streaming job from a checkpoint directory, I > get a StackOverFlow Error as shown below. Any idea as to

Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-18 Thread Luciano Resende
I see the thread with all the responses on the bottom at mail-archive : https://www.mail-archive.com/user%40spark.apache.org/msg36882.html On Fri, Sep 18, 2015 at 7:58 AM, Ellen Kraffmiller < ellen.kraffmil...@gmail.com> wrote: > Thanks for your response. Is there a reason why this thread

Re: Spark on YARN / aws - executor lost on node restart

2015-09-18 Thread Adrian Tanase
Hi guys, Digging up this question after spending some more time trying to replicate it. It seems to be an issue with the YARN – spark integration, wondering if there is a bug already tracking this? If I just kill the process on the machine, YARN detects the container is dead and the spark

Re: Null Value in DecimalType column of DataFrame

2015-09-18 Thread Dirceu Semighini Filho
Hi Yin, I got that part. I just think that instead of returning null, throwing an exception would be better. In the exception message we can explain that the DecimalType used can't fit the number being converted, due to the precision and scale values used to create it. It would be easier for

Re: Limiting number of cores per job in multi-threaded driver.

2015-09-18 Thread Philip Weaver
Here's a specific example of what I want to do. My Spark application is running with total-executor-cores=8. A request comes in, it spawns a thread to handle that request, and starts a job. That job should use only 4 cores, not all 8 of the cores available to the cluster. When the first job is
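Within Spark this is normally approached with FAIR scheduler pools (as Adrian suggests) or by capping `spark.cores.max`. The driver-side pattern Philip describes, each request thread holding a fixed quota of a shared core budget, can also be sketched with an ordinary counting semaphore. This is a Spark-free analogy of the capping idea, not how Spark's scheduler actually allocates cores:

```python
import threading

TOTAL_CORES = 8
CORES_PER_JOB = 4

# Each unit of the semaphore stands for one core. A request thread
# acquires its quota up front and releases it when its job finishes,
# so at most TOTAL_CORES // CORES_PER_JOB jobs run at once.
cores = threading.BoundedSemaphore(TOTAL_CORES)

def run_job(job_id, results):
    for _ in range(CORES_PER_JOB):
        cores.acquire()
    try:
        results.append(job_id)  # pretend to run a 4-core Spark job here
    finally:
        for _ in range(CORES_PER_JOB):
            cores.release()

results = []
threads = [threading.Thread(target=run_job, args=(i, results)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1]
```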

Re: Limiting number of cores per job in multi-threaded driver.

2015-09-18 Thread Philip Weaver
(whoops, redundant sentence in that first paragraph) On Fri, Sep 18, 2015 at 8:36 AM, Philip Weaver wrote: > Here's a specific example of what I want to do. My Spark application is > running with total-executor-cores=8. A request comes in, it spawns a thread > to handle

Re: parquet error

2015-09-18 Thread Cheng Lian
Not sure what's happening here, but I guess it's probably a dependency version issue. Could you please give vanilla Apache Spark a try to see whether it's a CDH-specific issue or not? Cheng On 9/17/15 11:44 PM, Chengi Liu wrote: Hi, I did some digging.. I believe the error is caused by

Re: Checkpointing with Kinesis

2015-09-18 Thread Alan Dipert
Hello, Thanks all for considering our problem. We are doing transformations in Spark Streaming. We have also since learned that WAL to S3 on 1.4 is "not reliable" [1] We are just going to wait for EMR to support 1.5 and hopefully this won't be a problem anymore [2]. Alan 1.