How to clear Kafka offset in Spark streaming?

2015-09-13 Thread Bin Wang
Hi, I'm using Spark Streaming with Kafka and I need to clear the offsets and recompute everything. I deleted the checkpoint directory in HDFS and reset the Kafka offsets with "kafka-run-class kafka.tools.ImportZkOffsets". I can confirm the offset is set to 0 in Kafka: ~ > kafka-run-class kafka.tools.Consu
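For reference, a minimal sketch of the direct-stream variant of a full replay (broker and topic names here are hypothetical): once the checkpoint directory is deleted and no offsets are stored for the group, setting "auto.offset.reset" to "smallest" starts each partition from offset 0.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("replay"), Seconds(10))

    // With the checkpoint deleted and no stored offsets, "smallest" replays
    // each partition from the beginning.
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",  // hypothetical broker
      "auto.offset.reset"    -> "smallest")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))         // hypothetical topic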

Re: Is there any Spark SQL reference manual?

2015-09-13 Thread vivek bhaskar
Thanks Richard, Ted. Hope we have some reference available soon. Peymen, I had a look at this link before, but was looking for something with broader coverage. PS: Richard, kindly advise me on generating a BNF description of the grammar via the Derby build script. Since this may not be of spar

Re: Training the MultilayerPerceptronClassifier

2015-09-13 Thread Feynman Liang
AFAIK no, we have a TODO item to implement more rigorous correctness tests (e.g. referenced against Weka). If you're interested, go ahead and comment the J

Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-13 Thread Todd
Thanks Davies for the explanation. When I turn off the following options, I still see that Spark 1.5 is much slower than 1.4.1. I am wondering how I can configure Spark 1.5 to get performance similar to 1.4.1 for this particular query. --conf spark.sql.planner.sortMergeJoin=false
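For reference, the same option can also be flipped from code rather than on the command line; a minimal sketch (only the flag named in the thread is assumed):

    // Equivalent to passing --conf spark.sql.planner.sortMergeJoin=false
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "false")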

Re: Re: java.util.NoSuchElementException: key not found

2015-09-13 Thread guoqing0...@yahoo.com.hk
Thank you very much! When will Spark 1.5.1 come out? guoqing0...@yahoo.com.hk From: Yin Huai Date: 2015-09-12 04:49 To: guoqing0...@yahoo.com.hk CC: user Subject: Re: java.util.NoSuchElementException: key not found Looks like you hit https://issues.apache.org/jira/browse/SPARK-10422, it h

Re: Problem to persist Hibernate entity from Spark job

2015-09-13 Thread Zoran Jeremic
Hi guys, I'm still trying to solve the issue with saving Hibernate entities from Spark. After several attempts to redesign my own code, I ended up with a HelloWorld example which clearly demonstrates that the problem is not the complexity of my code or sessions being mixed across threads. The code given bel
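For reference, the usual pattern for persisting entities from Spark is to build the session factory inside foreachPartition so nothing non-serializable is captured by the closure. A minimal sketch, assuming a hypothetical mapped User entity and a hibernate.cfg.xml on the executor classpath:

    import org.hibernate.cfg.Configuration

    userRdd.foreachPartition { users =>
      // Built per partition, on the executor, so it is never serialized.
      val factory = new Configuration().configure().buildSessionFactory()
      val session = factory.openSession()
      val tx = session.beginTransaction()
      users.foreach(u => session.save(u))
      tx.commit()
      session.close()
      factory.close()
    }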

Replacing Esper with Spark Streaming?

2015-09-13 Thread Otis Gospodnetić
Hi, I'm wondering if anyone has attempted to replace Esper with Spark Streaming or if anyone thinks Spark Streaming is/isn't a good tool for the (CEP) job? We are considering Akka or Spark Streaming as possible Esper replacements and would appreciate any input from people who tried to do that wit

Re: Stopping SparkContext and HiveContext

2015-09-13 Thread Ted Yu
Please also see this thread: http://search-hadoop.com/m/q3RTtGpLeLyv97B1 On Sun, Sep 13, 2015 at 9:49 AM, Ted Yu wrote: > For #1, there is the following method: > > @DeveloperApi > def getExecutorStorageStatus: Array[StorageStatus] = { > assertNotStopped() > > You can wrap the call in tr

Best way to merge final output part files created by Spark job

2015-09-13 Thread unk1102
Hi, I have a Spark job which creates around 500 part files inside each directory I process, and I have thousands of such directories, so I need to merge these 500 small part files. I am using spark.sql.shuffle.partitions = 500 and my final small files are ORC files. Is there a way to merge orc
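One way to do this (a sketch, not the only option) is a follow-up job that reads each directory of small ORC files back and rewrites it with fewer partitions. Paths and the target partition count below are hypothetical, and ORC in 1.5 requires a HiveContext:

    // Rewrite ~500 small part files as ~16 larger ones.
    val df = sqlContext.read.format("orc").load("/data/in/dir1")
    df.coalesce(16)
      .write.format("orc")
      .save("/data/merged/dir1")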

Re: CREATE TABLE ignores database when using PARQUET option

2015-09-13 Thread hbogert
I'm having the same problem, did you solve this?

Re: RDD transformation and action running out of memory

2015-09-13 Thread Utkarsh Sengar
Yup, that was the problem. Changing the default "mongo.input.split_size" from 8 MB to 100 MB did the trick. Config reference: https://github.com/mongodb/mongo-hadoop/wiki/Configuration-Reference Thanks! On Sat, Sep 12, 2015 at 3:15 PM, Richard Eggert wrote: > Hmm... The count() method invokes t
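For anyone hitting the same thing, a sketch of where that setting lives when building the RDD (key names follow the mongo-hadoop configuration reference linked above; the URI is hypothetical):

    import org.apache.hadoop.conf.Configuration
    import com.mongodb.hadoop.MongoInputFormat
    import org.bson.BSONObject

    val mongoConf = new Configuration()
    mongoConf.set("mongo.input.uri", "mongodb://host:27017/db.collection") // hypothetical
    mongoConf.set("mongo.input.split_size", "100") // split size in MB

    val rdd = sc.newAPIHadoopRDD(mongoConf,
      classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])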

Re: Parquet partitioning performance issue

2015-09-13 Thread Dean Wampler
One general technique is to perform a second pass over the files later, for example the next day or once a week, to concatenate smaller files into larger ones. This can be done for all file types and allows you to make recent data available to analysis tools while avoiding a large build-up of small file
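A minimal sketch of that second pass for Parquet (the paths, date layout, and target partition count are assumptions):

    // Nightly compaction: read yesterday's small files, rewrite as a few large ones.
    val day = sqlContext.read.parquet("/warehouse/events/date=2015-09-12")
    day.coalesce(4)
       .write.mode("overwrite")
       .parquet("/warehouse/events_compacted/date=2015-09-12")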

Parquet partitioning performance issue

2015-09-13 Thread sonal sharma
Hi Team, We have scheduled jobs that read new records from a MySQL database every hour and write (append) them to Parquet. For each append operation, Spark creates 10 new partitions in the Parquet file. Some of these partitions are fairly small (20-40 KB), leading to a high number of smaller parti

Re: Stopping SparkContext and HiveContext

2015-09-13 Thread Ted Yu
For #1, there is the following method: @DeveloperApi def getExecutorStorageStatus: Array[StorageStatus] = { assertNotStopped() You can wrap the call in try block catching IllegalStateException. Of course, this is just a workaround. FYI On Sun, Sep 13, 2015 at 1:48 AM, Ophir Cohen wrote
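A sketch of that workaround:

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageStatus

    // assertNotStopped() throws IllegalStateException once the context has been
    // stopped, so treat that exception as "context is no longer alive".
    def executorStatusIfAlive(sc: SparkContext): Option[Array[StorageStatus]] =
      try {
        Some(sc.getExecutorStorageStatus)
      } catch {
        case _: IllegalStateException => None
      }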

Re: Data lost in spark streaming

2015-09-13 Thread Ted Yu
Can you retrieve the log for appattempt_1440495451668_0258_01 and see if there is some clue there? Cheers On Sun, Sep 13, 2015 at 3:28 AM, Bin Wang wrote: > There is some error logs in the executor and I don't know if it is related: > > 15/09/11 10:54:05 WARN ipc.Client: Exception encountered

Re: change the spark version

2015-09-13 Thread Steve Loughran
> On 12 Sep 2015, at 09:14, Sean Owen wrote: > > This is a question for the CDH list. CDH 5.4 has Spark 1.3, and 5.5 > has 1.5. The best thing is to update CDH as a whole if you can. > > However it's pretty simple to just run a newer Spark assembly as a > YARN app. Don't remove anything in the

Re: Data lost in spark streaming

2015-09-13 Thread Bin Wang
There are some error logs in the executor and I don't know if they are related: 15/09/11 10:54:05 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattem

Re: selecting columns with the same name in a join

2015-09-13 Thread Evert Lammerts
Thanks Michael, we'll update then. Evert On Sep 11, 2015 20:59, "Michael Armbrust" wrote: > Here is what I get on branch-1.5: > > x = sc.parallelize([dict(k=1, v="Evert"), dict(k=2, v="Erik")]).toDF() > y = sc.parallelize([dict(k=1, v="Ruud"), dict(k=3, v="Vincent")]).toDF() > x.registerTempTabl
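For the archives, the Scala equivalent of disambiguating same-named columns is to alias each side of the join and qualify the selection; a minimal sketch on 1.5, reusing the column names from Michael's example:

    import sqlContext.implicits._

    val joined = x.as("x").join(y.as("y"), $"x.k" === $"y.k")

    // Qualify by alias to pick each side's "v" without ambiguity.
    joined.select($"x.v", $"y.v").show()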

Re: UDAF and UDT with SparkSQL 1.5.0

2015-09-13 Thread jussipekkap
A small update: I was able to find a solution with good performance, using the Brickhouse collect (Hive UDAF). It also accepts structs as input, which is an OK workaround, though still not perfect (support for UDTs would be better). The built-in Hive 'collect_list' seems to have a check for input par
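For anyone searching later, registering and using the Brickhouse UDAF looks roughly like this (assuming the brickhouse jar is on the classpath; the table and column names are hypothetical):

    hiveContext.sql(
      "CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF'")

    // Aggregate struct-typed rows per key; named_struct is a Hive built-in.
    val grouped = hiveContext.sql(
      "SELECT k, collect(named_struct('a', a, 'b', b)) AS items FROM t GROUP BY k")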

How to Hive UDF in Spark DataFrame?

2015-09-13 Thread unk1102
Hi, I am using a UDF in a hiveContext.sql("") query. Inside, it uses GROUP BY, which forces a huge shuffle read of around 30 GB. I am thinking of converting the above query to the DataFrame API so that I can avoid the GROUP BY. How do we use a Hive UDF with a Spark DataFrame? Please guide. Thanks much.
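On the question itself: in 1.5 a Hive UDF registered with the HiveContext can be called from the DataFrame API without writing the whole query in SQL; a minimal sketch (the UDF class and column names are hypothetical):

    import org.apache.spark.sql.functions.callUDF

    hiveContext.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF'")

    // Either call it by name from the DataFrame API...
    val out1 = df.withColumn("x", callUDF("my_udf", df("col1")))

    // ...or embed it in a SQL expression on the DataFrame.
    val out2 = df.selectExpr("my_udf(col1) AS x")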

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-13 Thread Nick Pentreath
I should point out that I'm not sure what the performance of that project is. I'd expect that native DataFrames in PySpark will be significantly more efficient than their DictRDD. It would be interesting to see a performance comparison for the pipelines relative to native Spark ML pipeli

Re: What happens when cache is full?

2015-09-13 Thread Mailinglisten
Hello Jeff, I'm quite new to Spark, but as far as I understand caching, Spark uses the available memory, and if more memory is requested, cached RDDs are evicted in an LRU manner and recomputed when needed. Please correct me if I'm wrong. Regards Lars > Am 13.09.2015 um 05:1
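If recomputation is expensive, one mitigation (a sketch) is to spill evicted blocks to disk instead of dropping them:

    import org.apache.spark.storage.StorageLevel

    // Blocks that do not fit in memory are written to local disk and read back,
    // rather than being recomputed from lineage when needed again.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)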

Stopping SparkContext and HiveContext

2015-09-13 Thread Ophir Cohen
Hi, I'm working on my company's system, which is built out of Spark, Zeppelin, Hive and some other technologies, and I have a question about the ability to stop contexts. While working on the test framework for the system, when running tests I would sometimes like to create a new SparkContext in order to run the tests on
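A minimal sketch of the stop-and-recreate cycle between test suites (assuming one active context per JVM at a time):

    import org.apache.spark.{SparkConf, SparkContext}

    var sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("suite-1"))
    // ... run one suite's tests ...
    sc.stop()

    // Only one active SparkContext is allowed per JVM, so stop() must come first.
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("suite-2"))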

Re: Data lost in spark streaming

2015-09-13 Thread Tathagata Das
Maybe the driver got restarted. See the log4j logs of the driver before it restarted. On Thu, Sep 10, 2015 at 11:32 PM, Bin Wang wrote: > I'm using spark streaming 1.4.0 and have a DStream that have all the data > it received. But today the history data in the DStream seems to be lost > suddenly

Re: [Question] ORC - EMRFS Problem

2015-09-13 Thread Cazen Lee
Hi Owen, thank you for the reply. I heard that some people say ORC is Owen's RC file, haha ;) Also, some people told me after posting that this is an already-known issue with AWS EMR 4.0.0. They said it might be a Hive 0.13.1 and Spark 1.4.1 compatibility issue, so AWS will launch EMR 4.1.0 in a couple of

Re: [Question] ORC - EMRFS Problem

2015-09-13 Thread Cazen
The stack trace is below, but someone told me that it's a known issue and will be patched in a couple of weeks (EMR 4.1). So, don't mind that; I'll wait until it's patched. scala> val ORCFile = sqlContext.read.format("orc").load("s3n://S3bucketName/S3serviceCode/yymmdd=20150801/country=eu/75e91