Re: How to load partial data from HDFS using Spark SQL

2016-01-02 Thread swetha kasireddy
OK. What should the table be? Suppose I have a bunch of parquet files, do I just specify the directory as the table? On Fri, Jan 1, 2016 at 11:32 PM, UMESH CHAUDHARY wrote: > Ok, so what's wrong in using: > > var df = HiveContext.sql("Select * from table where id = ") >
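For readers landing on this thread later, a minimal sketch of one common approach (Spark 1.x API; the path, table name, and id are placeholders): read the parquet directory into a DataFrame, register it as a temporary table, and filter with SQL so only the matching rows come back.

    import org.apache.spark.sql.hive.HiveContext

    // `sc` is the existing SparkContext; the path and id are placeholders.
    val hiveContext = new HiveContext(sc)
    val df = hiveContext.read.parquet("hdfs:///data/events/")
    df.registerTempTable("events")

    val partial = hiveContext.sql("SELECT * FROM events WHERE id = 42")
    partial.show()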

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-02 Thread Yanbo Liang
Hi Tomasz, The GMM is bound to its peer Java GMM object, so it needs a reference to the SparkContext. Some MLlib (not ML) models are simple objects, such as KMeansModel, LinearRegressionModel, etc., but others refer to the SparkContext. The latter ones and their corresponding member functions should not be called
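A minimal sketch of the point being made, with illustrative names (`gmm` and `points` are assumed to exist): calling the single-point predict inside an RDD closure drags the model, and with it the SparkContext, into the task; the RDD overload keeps the model on the driver.

    import org.apache.spark.mllib.clustering.GaussianMixtureModel
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    def assignClusters(gmm: GaussianMixtureModel, points: RDD[Vector]): RDD[Int] = {
      // points.map(p => gmm.predict(p)) would capture the model (and through it
      // the SparkContext) in the closure and fail at serialization time.
      // The RDD-based predict runs without shipping the model into a closure:
      gmm.predict(points)
    }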

Re: frequent itemsets

2016-01-02 Thread Yanbo Liang
Hi Roberto, Could you share a code snippet so that others can help diagnose your problem? 2016-01-02 7:51 GMT+08:00 Roberto Pagliari : > When using the frequent itemsets APIs, I’m running into a StackOverflow > exception whenever there are too many combinations to

RE: frequent itemsets

2016-01-02 Thread LinChen
Hi Roberto, what minimum support threshold did you set? Could you check in which stage you ran into the StackOverflow exception? Thanks. From: roberto.pagli...@asos.com To: yblia...@gmail.com CC: user@spark.apache.org Subject: Re: frequent itemsets Date: Sat, 2 Jan 2016 12:01:31 + Hi Yanbo,

Does state survive application restart in StatefulNetworkWordCount?

2016-01-02 Thread Rado Buranský
I am trying to understand how state in Spark Streaming works in general. If I run this example program twice, will the second run see state from the first run? https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala It
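Not an authoritative answer to this thread, but a sketch of the pattern that usually decides it (the checkpoint path is a placeholder): state from updateStateByKey survives a restart only when the application checkpoints to reliable storage and recovers the StreamingContext via getOrCreate; a run that simply constructs a new StreamingContext each time starts with empty state.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/stateful-wordcount"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("StatefulNetworkWordCount")
      val ssc = new StreamingContext(conf, Seconds(1))
      ssc.checkpoint(checkpointDir)
      // ... define the socket stream and updateStateByKey transformations here ...
      ssc
    }

    // On a restart this recovers the previous state from the checkpoint
    // instead of building a fresh, empty context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()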

Re: frequent itemsets

2016-01-02 Thread Roberto Pagliari
Hi Yanbo, Unfortunately, I cannot share the data. I am using the code in the tutorial https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html Did you ever try running it when there are hundreds of millions of co-purchases of at least two products? I suspect AR does not handle that
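For context, a condensed sketch along the lines of the linked tutorial (`transactions` is an assumed RDD of item baskets; the thresholds are placeholders). A very low minSupport on a dataset of this size explodes the number of frequent itemsets, which is where the deep recursion and StackOverflow tend to show up, before association rules are even generated.

    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.rdd.RDD

    def mineItemsets(transactions: RDD[Array[String]]): Unit = {
      val fpg = new FPGrowth()
        .setMinSupport(0.01)   // raising this shrinks the itemset lattice
        .setNumPartitions(10)
      val model = fpg.run(transactions)

      model.freqItemsets.take(20).foreach { itemset =>
        println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
      }

      // Association rules are generated only after the frequent itemsets exist:
      model.generateAssociationRules(0.8).take(20).foreach(println)
    }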

Re: feedback on the use of Spark’s gateway hidden REST API (standalone cluster mode) for application submission

2016-01-02 Thread Jim Lohse
There is a lot of interesting info about this API here: https://issues.apache.org/jira/browse/SPARK-5388 I got that from a comment thread on the last link in your PR. Thanks for bringing this up! I knew you could check status via REST per

feedback on the use of Spark’s gateway hidden REST API (standalone cluster mode) for application submission

2016-01-02 Thread HILEM Youcef
Happy new year. I would like to solicit community feedback on the use of Spark’s gateway hidden REST API (standalone cluster mode) for application submission. We already use status checking and cancellation in our Ansible scripts. I also opened a ticket to make this API public
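For anyone unfamiliar with the API under discussion: the standalone master's REST gateway (introduced under SPARK-5388) listens on port 6066 by default and exposes submission, status, and kill endpoints. A minimal status-check sketch, with a hypothetical master host and submission id:

    import scala.io.Source

    val master = "http://spark-master:6066"          // hypothetical host
    val submissionId = "driver-20160102123456-0001"  // hypothetical id

    // GET /v1/submissions/status/<submissionId> returns a JSON status document.
    val status = Source.fromURL(s"$master/v1/submissions/status/$submissionId").mkString
    println(status)

    // Cancellation is a POST to /v1/submissions/kill/<submissionId>, which an
    // HTTP client (or curl in an Ansible task) can issue the same way.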

Re: SparkSQL integration issue with AWS S3a

2016-01-02 Thread KOSTIANTYN Kudriavtsev
Thanks Jerry, it works! Really appreciate your help. Thank you, Konstantin Kudryavtsev On Fri, Jan 1, 2016 at 4:35 PM, Jerry Lam wrote: > Hi Kostiantyn, > > You should be able to use spark.conf to specify s3a keys. > > I don't remember exactly but you can add hadoop
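For the archive, a sketch of what such a configuration typically looks like (property names from the hadoop-aws s3a connector; the keys, bucket, and path are placeholders; `sc` and `sqlContext` are assumed to exist):

    import org.apache.spark.SparkConf

    // Either set the keys on the SparkConf (the spark.hadoop.* prefix is
    // copied into the underlying Hadoop configuration)...
    val conf = new SparkConf()
      .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
      .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")

    // ...or set them directly on an existing SparkContext:
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    val df = sqlContext.read.parquet("s3a://my-bucket/path/")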

Re: Cannot get repartitioning to work

2016-01-02 Thread jimitkr
Thanks. Repartitioning works now. Thread closed :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-get-repartitioning-to-work-tp25852p25858.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Unable to read JSON input in Spark (YARN Cluster)

2016-01-02 Thread Vijay Gharge
Hi, a few suggestions: 1. Try the "memory and disk" storage level, to rule out a heap memory error. 2. Try to copy and read the JSON source file from the local filesystem (i.e. without HDFS), to verify a minimal working setup. 3. It looks like some library issue is causing the LZO-related error. On Saturday
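A minimal sketch of suggestions 1 and 2 (`sqlContext` assumed, path is a placeholder): read the JSON from the local filesystem to take HDFS and the LZO codec out of the picture, and persist with MEMORY_AND_DISK so partitions spill to disk rather than failing on heap.

    import org.apache.spark.storage.StorageLevel

    val df = sqlContext.read.json("file:///tmp/sample.json")
    df.persist(StorageLevel.MEMORY_AND_DISK)
    println(df.count())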

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-02 Thread Prabhu Joseph
The attached image just has thread states, and WAITING threads need not be the issue. We need to take thread stack traces and identify in which area of code the threads are spending a lot of time. Use jstack -l <pid> or kill -3 <pid>, where pid is the process id of the executor process. Take jstack stack trace

Re: how to extend java transformer from Scala UnaryTransformer ?

2016-01-02 Thread Chris Fregly
Looks like you're not registering the input param correctly. Below are examples from the Spark Java source that show how to build a custom transformer. Note that a Model is a Transformer. Also, that chimpler/wordpress/naive bayes example is a bit dated. I tried to implement it a while ago, but
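The original attachment is not reproduced in this digest, but a minimal Scala sketch of a custom UnaryTransformer (names are illustrative) looks roughly like this; a Java version follows the same structure with explicit param registration.

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

    // Illustrative tokenizer-style transformer: splits a string column into words.
    class SimpleTokenizer(override val uid: String)
      extends UnaryTransformer[String, Seq[String], SimpleTokenizer] {

      def this() = this(Identifiable.randomUID("simpleTok"))

      override protected def createTransformFunc: String => Seq[String] =
        _.toLowerCase.split("\\s+").toSeq

      override protected def validateInputType(inputType: DataType): Unit =
        require(inputType == StringType, s"Input type must be StringType but got $inputType")

      override protected def outputDataType: DataType = new ArrayType(StringType, true)
    }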

Re: How to specify the numFeatures in HashingTF

2016-01-02 Thread Chris Fregly
You can use CrossValidator/TrainValidationSplit with ParamGridBuilder and an Evaluator to empirically choose the model hyperparameters (i.e. numFeatures) per the following: http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
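A condensed sketch of the cross-validation example from that page (`training` is an assumed DataFrame with "text" and "label" columns): because hashingTF.numFeatures is in the grid, the fitted CrossValidatorModel effectively chooses it.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Try several numFeatures values (and regParam) and keep the best by CV.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    val cvModel = cv.fit(training)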

Re: frequent itemsets

2016-01-02 Thread Roberto Pagliari
Hi Lin, From 1e-5 and below it crashes for me. I also developed my own program in C++ (single machine, no Spark) and I was able to compute all itemsets, that is, support = 0. The stack overflow definitely occurs when computing the frequent itemsets, before association rule generation even starts. If you want,

RE: frequent itemsets

2016-01-02 Thread LinChen
Hi Roberto, I have done some experiments on a dataset with 3196 transactions and 289154813 frequent itemsets. FPGrowth can finish the computation within 10 minutes. I can have a try if you could share the artificial dataset. From: roberto.pagli...@asos.com To: m2linc...@outlook.com CC: