RE: HBase HTable constructor hangs

2015-04-29 Thread Tridib Samanta
I turned on TRACE and I see a lot of the following exception: java.lang.IllegalAccessError: com/google/protobuf/ZeroCopyLiteralByteString at org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:897) at

spark with standalone HBase

2015-04-29 Thread Saurabh Gupta
Hi, I am working with standalone HBase and I want to execute HBaseTest.scala (in the Scala examples). I have created a test table with three rows and I just want to get the count using HBaseTest.scala. I am getting this issue: 15/04/29 11:17:10 INFO BlockManagerMaster: Registered BlockManager

Re: java.io.IOException: No space left on device

2015-04-29 Thread Dean Wampler
Or multiple volumes. The LOCAL_DIRS (YARN) and SPARK_LOCAL_DIRS (Mesos, Standalone) environment variables and the spark.local.dir property control where temporary data is written. The default is /tmp. See http://spark.apache.org/docs/latest/configuration.html#runtime-environment for more details.
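
For illustration, a minimal sketch of pointing spark.local.dir at several volumes; the mount points are assumptions, not paths from this thread:

    import org.apache.spark.SparkConf

    // Comma-separated list spreads shuffle/spill scratch files over several disks.
    val conf = new SparkConf()
      .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")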

Re: java.io.IOException: No space left on device

2015-04-29 Thread Dean Wampler
Makes sense. / is where /tmp would be. However, 230G should be plenty of space. If you have INFO logging turned on (set in $SPARK_HOME/conf/log4j.properties), you'll see messages about saving data to disk that will list sizes. The web console also has some summary information about this. dean

How Spark SQL supports primary and secondary indexes

2015-04-29 Thread Nikolay Tikhonov
Hi all, I execute a simple SQL query and get unacceptable performance. I do the following steps: 1. Apply a schema to an RDD and register the table. 2. Run a SQL query which returns several entries. The running time for this query is 0.2s (the table contains 10 entries). I think that Spark SQL

java.io.IOException: No space left on device

2015-04-29 Thread Selim Namsi
Hi, I'm using Spark (1.3.1) MLlib to run the random forest algorithm on tf-idf output; the training data is a file containing 156060 (size 8.1M). The problem is that when trying to persist a partition into memory and there is not enough memory, the partition is persisted on disk, and despite having

How to group multiple row data ?

2015-04-29 Thread bipin
Hi, I have a ddf with schema (CustomerID, SupplierID, ProductID, Event, CreatedOn); the first 3 are Long ints, Event can only be 1, 2 or 3, and CreatedOn is a timestamp. How can I make a group triplet/doublet/singlet out of them such that I can infer that the Customer registered an event from 1 to 2, and if

Dataframe filter based on another Dataframe

2015-04-29 Thread Olivier Girardot
Hi everyone, what is the most efficient way to filter a DataFrame on a column from another DataFrame's column? The best idea I had was to join the two dataframes: val df1: DataFrame val df2: DataFrame df1.join(df2, df1("id") === df2("id"), "inner") But I end up (obviously) with the id column

Re: Driver memory leak?

2015-04-29 Thread Sean Owen
Please use user@, not dev@. This message does not appear to be from your driver. It also doesn't say you ran out of memory. It says you didn't tell YARN to let it use the memory you want. Look at the memory overhead param and please search first for related discussions. On Apr 29, 2015 11:43 AM,

Re: A problem of using spark streaming to capture network packets

2015-04-29 Thread Dean Wampler
I would use the ps command on each machine while the job is running to confirm that every process involved is running as root. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler

Re: java.io.IOException: No space left on device

2015-04-29 Thread Anshul Singhle
Do you have multiple disks? Maybe your work directory is not on the right disk? On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi selim.na...@gmail.com wrote: Hi, I'm using spark (1.3.1) MLlib to run random forest algorithm on tfidf output,the training data is a file containing 156060 (size

Re: java.io.IOException: No space left on device

2015-04-29 Thread selim namsi
This is the output of df -h, so as you can see I'm using only one disk, mounted on /:

df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda8   276G   34G   229G   13%  /
none        4.0K     0   4.0K    0%  /sys/fs/cgroup
udev        7.8G  4.0K   7.8G    1%  /dev
tmpfs       1.6G

MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Hi Sparkers, I am trying to run MLlib KMeans on a large dataset (50+ GB of data) with a large K, but I've encountered the following issues: - The Spark driver runs out of memory and dies because collect gets called as part of KMeans, which loads all data back into the driver's memory. - At the

Re: MLib KMeans on large dataset issues

2015-04-29 Thread Jeetendra Gangele
How are you passing the feature vector to K-means? Is it in 2-D space or a 1-D array? Did you try using Streaming KMeans? Will you be able to paste code here? On 29 April 2015 at 17:23, Sam Stoelinga sammiest...@gmail.com wrote: Hi Sparkers, I am trying to run MLib kmeans on a large dataset(50+Gb

Re: MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
I'm mostly using example code, see here: http://paste.openstack.org/show/211966/ The data has 799305 dimensions and is separated by spaces. Please note that the issues I'm seeing are, in my opinion, in the Scala implementation, as they also happen when using the Python wrappers. On Wed, Apr 29, 2015 at 8:00

Re: Driver memory leak?

2015-04-29 Thread Serega Sheypak
The memory leak could be related to this https://issues.apache.org/jira/browse/SPARK-5967 defect that was resolved in Spark 1.2.2 and 1.3.0. @Sean Will it be backported to CDH? I didn't find that bug in the CDH 5.4 release notes. 2015-04-29 14:51 GMT+02:00 Conor Fennell conor.fenn...@altocloud.com:

Re: How to group multiple row data ?

2015-04-29 Thread Manoj Awasthi
Sorry but I didn't fully understand the grouping. This line: "The group must only take the closest previous trigger. The first one hence shows alone." Can you please explain further? On Wed, Apr 29, 2015 at 4:42 PM, bipin bipin@gmail.com wrote: Hi, I have a ddf with schema (CustomerID,

Re: java.io.IOException: No space left on device

2015-04-29 Thread selim namsi
Sorry I put the log messages when creating the thread in http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-td22702.html but I forgot that raw messages will not be sent in emails. So this is the log related to the error : 15/04/29 02:48:50 INFO

Re: Dataframe filter based on another Dataframe

2015-04-29 Thread ayan guha
You can use .select to project only the columns you need. On Wed, Apr 29, 2015 at 9:23 PM, Olivier Girardot ssab...@gmail.com wrote: Hi everyone, what is the most efficient way to filter a DataFrame on a column from another Dataframe's column. The best idea I had, was to join the two dataframes :
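
A minimal sketch of the join-then-select approach suggested here, assuming only df1's columns are wanted and reusing the df1/df2/id names from the question:

    val filtered = df1
      .join(df2, df1("id") === df2("id"), "inner")
      .select(df1.columns.map(df1(_)): _*) // project back to df1's columns, dropping df2's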

Re: MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Guys, great feedback by pointing out my stupidity :D Rows and columns got intermixed, hence the weird results I was seeing. Ignore my previous issues; I will reformat my data first. On Wed, Apr 29, 2015 at 8:47 PM, Sam Stoelinga sammiest...@gmail.com wrote: I'm mostly using example code, see here:

DataFrame filter referencing error

2015-04-29 Thread Francesco Bigarella
Hi all, I was testing the DataFrame filter functionality and I found what I think is a strange behaviour. My dataframe testDF, obtained by loading a MySQL table via jdbc, has the following schema:

root
 |-- id: long (nullable = false)
 |-- title: string (nullable = true)
 |-- value: string

Re: Driver memory leak?

2015-04-29 Thread Conor Fennell
The memory leak could be related to this https://issues.apache.org/jira/browse/SPARK-5967 defect that was resolved in Spark 1.2.2 and 1.3.0. It also was a HashMap causing the issue. -Conor On Wed, Apr 29, 2015 at 12:01 PM, Sean Owen so...@cloudera.com wrote: Please use user@, not dev@

Re: DataFrame filter referencing error

2015-04-29 Thread ayan guha
Looks like your DF is based on a MySQL DB using JDBC, and the error is thrown from MySQL. Can you see what SQL is finally getting fired in MySQL? Spark is pushing the predicate down to MySQL, so it's not a Spark problem per se. On Wed, Apr 29, 2015 at 9:56 PM, Francesco Bigarella

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Yes, and Kafka topics are basically queues. So perhaps what's needed is just KafkaRDD with the starting offset being 0 and the finish offset being a very large number... Sent from my iPhone On Apr 29, 2015, at 1:52 AM, ayan guha guha.a...@gmail.com wrote: I guess what you mean is not streaming.

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-29 Thread bit1...@163.com
Correcting myself: for SparkContext#wholeTextFiles, the RDD's elements are key-value pairs; the key is the file path and the value is the file content. So, for SparkContext#wholeTextFiles, the RDD already carries the file information. bit1...@163.com From: Saisai Shao Date: 2015-04-29 15:50

Re: ReduceByKey and sorting within partitions

2015-04-29 Thread Marco
On 04/27/2015 06:00 PM, Ganelin, Ilya wrote: Marco - why do you want data sorted both within and across partitions? If you need to take an ordered sequence across all your data you need to either aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an ordered index

How to set DEBUG level log of spark executor on Standalone deploy mode

2015-04-29 Thread eric wong
Hi, I want to check the DEBUG log of the Spark executor in Standalone deploy mode. But: 1. Setting log4j.properties in the spark/conf folder on the master node and restarting the cluster does not work. 2. Using spark-submit --properties-file log4j just prints the debug log to the screen, but the executor log still seems to

Equal Height and Depth Binning in Spark

2015-04-29 Thread kundan kumar
Hi, I am trying to implement equal depth and equal height binning methods in spark. Any insights, existing code for this would be really helpful. Thanks, Kundan

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-29 Thread Saisai Shao
Yes, it looks like a solution but quite tricky. You have to parse the debug string to get the file name; it also relies on HadoopRDD to get the file name :) 2015-04-29 14:52 GMT+08:00 Akhil Das ak...@sigmoidanalytics.com: It is possible to access the filename, its a bit tricky though. val fstream

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-29 Thread Akhil Das
It is possible to access the filename; it's a bit tricky though. val fstream = ssc.fileStream[LongWritable, IntWritable, SequenceFileInputFormat[LongWritable, IntWritable]](/home/akhld/input/) fstream.foreach(x ={ //You can get it with this object.

Re: Question about Memory Used and VCores Used

2015-04-29 Thread Sandy Ryza
Hi, Good question. The extra memory comes from spark.yarn.executor.memoryOverhead, the space used for the application master, and the way YARN rounds requests up. This explains it in a little more detail: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
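
A small sketch of setting the overhead explicitly; the memory values are illustrative, not recommendations from this thread:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")               // the executor heap you request
      .set("spark.yarn.executor.memoryOverhead", "512") // extra MB YARN adds to the container on top of the heap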

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread ayan guha
I guess what you mean is not streaming. If you create a stream context at time t, you will receive data coming through starting at time t++, not before time t. Looks like you want a queue. Let Kafka write to a queue, consume messages from the queue and stop when the queue is empty. On 29 Apr 2015 14:35,

Re: Multiclass classification using Ml logisticRegression

2015-04-29 Thread selim namsi
Thank you for your Answer! Yes I would like to work on it. Selim On Mon, Apr 27, 2015 at 5:23 AM Joseph Bradley jos...@databricks.com wrote: Unfortunately, the Pipelines API doesn't have multiclass logistic regression yet, only binary. It's really a matter of modifying the current

Re: How to setup this false streaming problem

2015-04-29 Thread Ignacio Blasco
Hi Toni. Given there is more than one measure per (user, hour), what is the measure you want to keep? The sum? The mean? The most recent measure? For the sum or the mean you don't need to care about the timing. And if you want to have the most recent, then you can include the timestamp in the

Re: Re: Question about Memory Used and VCores Used

2015-04-29 Thread bit1...@163.com
Thanks Sandy, it is very useful! bit1...@163.com From: Sandy Ryza Date: 2015-04-29 15:24 To: bit1...@163.com CC: user Subject: Re: Question about Memory Used and VCores Used Hi, Good question. The extra memory comes from spark.yarn.executor.memoryOverhead, the space used for the

Re: sparksql - HiveConf not found during task deserialization

2015-04-29 Thread Manku Timma
The issue is solved. There was a problem in my hive codebase. Once that was fixed, -Phive-provided spark is working fine against my hive jars. On 27 April 2015 at 08:00, Manku Timma manku.tim...@gmail.com wrote: Made some progress on this. Adding hive jars to the system classpath is needed.

Re: Parquet error reading data that contains array of structs

2015-04-29 Thread Cheng Lian
Thanks for the detailed information! Now I can confirm that this is a backwards-compatibility issue. The data written by parquet 1.6rc7 follows the standard LIST structure. However, Spark SQL still uses old parquet-avro style two-level structures, which causes the problem. Cheng On 4/27/15

Re: Multiclass classification using Ml logisticRegression

2015-04-29 Thread DB Tsai
Wrapping the old LogisticRegressionWithLBFGS could be a quick solution for 1.4, and it's not too hard, so it could potentially get into 1.4. In the long term, I would like to implement a new version like https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef which handles the

HOw can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread Wang, Ningjun (LNG-NPV)
I have multiple DataFrame objects, each stored in a parquet file. The DataFrame just contains 3 columns (id, value, timeStamp). I need to union all the DataFrame objects together, but for duplicated ids only keep the record with the latest timestamp. How can I do that? I can do this for RDDs

Re: Spark on Cassandra

2015-04-29 Thread Cody Koeninger
Hadoop version doesn't matter if you're just using cassandra. On Wed, Apr 29, 2015 at 12:08 PM, Matthew Johnson matt.john...@algomi.com wrote: Hi all, I am new to Spark, but excited to use it with our Cassandra cluster. I have read in a few places that Spark can interact directly with

Re: Driver memory leak?

2015-04-29 Thread Sean Owen
Not sure what you mean. It's already in CDH, since 5.4 ships 1.3.0. (This isn't the place to ask about CDH.) I also don't think that's the problem. The process did not run out of memory. On Wed, Apr 29, 2015 at 2:08 PM, Serega Sheypak serega.shey...@gmail.com wrote: The memory leak could be related to

Spark on Cassandra

2015-04-29 Thread Matthew Johnson
Hi all, I am new to Spark, but excited to use it with our Cassandra cluster. I have read in a few places that Spark can interact directly with Cassandra now, so I decided to download it and have a play; I am happy to run it in standalone cluster mode initially. When I go to download it (

Re: How Spark SQL supports primary and secondary indexes

2015-04-29 Thread Nikolay Tikhonov
I'm running this query with different parameters on the same RDD and got 0.2s for each query. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-Spark-SQL-supports-primary-and-secondary-indexes-tp22700p22706.html Sent from the Apache Spark User List

Re: indexing an RDD [Python]

2015-04-29 Thread Sven Krasser
Hey Roberto, you will likely want to use a cogroup() then, but it all hinges on how your data looks, i.e. whether you have the index in the key. Here's an example: http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#cogroup . Clone: RDDs are immutable, so if you need to make changes

Compute pairwise distance

2015-04-29 Thread Driesprong, Fokko
Dear Sparkers, I am working on an algorithm which requires the pairwise distance between all points (e.g. DBSCAN, LOF, etc.). Computing this for n points will produce an n^2 matrix. If the distance measure is symmetric, this can be reduced to (n^2)/2. What would be the optimal way of

Re: Slower performance when bigger memory?

2015-04-29 Thread Sven Krasser
On Mon, Apr 27, 2015 at 7:36 AM, Shuai Zheng szheng.c...@gmail.com wrote: Thanks. So may I know what is your configuration for more/smaller executors on r3.8xlarge, how big of the memory that you eventually decide to give one executor without impact performance (for example: 64g? ). We're

Sort (order by) of the big dataset

2015-04-29 Thread Ulanov, Alexander
Hi, I have a 2-billion-record dataset with schema eventId: String, time: Double, value: Double. It is stored in Parquet format in HDFS, size 23GB. Specs: Spark 1.3, Hadoop 1.2.1, 8 nodes with Xeon, 16GB RAM, 1TB disk space; each node has 3 workers with 3GB memory. I keep failing to sort the

Performance advantage by loading data from local node over S3.

2015-04-29 Thread Nisrina Luthfiyati
Hi all, I'm new to Spark so I'm sorry if the question is too vague. I'm currently trying to deploy a Spark cluster using YARN on an Amazon EMR cluster. For the data storage I'm currently using S3, but would loading the data into HDFS from a local node give a considerable performance advantage over

RE: Spark on Cassandra

2015-04-29 Thread Huang, Roger
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/ http://planetcassandra.org/blog/holy-momentum-batman-spark-and-cassandra-circa-2015-w-datastax-connector-and-java/ https://github.com/datastax/spark-cassandra-connector From: Cody Koeninger [mailto:c...@koeninger.org]

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
Thanks for the responses. "Try removing toDebugString and see what happens." The toDebugString is performed after [d] (the action), as [e]; by then all stages are already executed. -- View this message in context:

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Richard Marscher
I'm not sure, but I wonder if because you are using the Spark REPL that it may not be representing what a normal runtime execution would look like and is possibly eagerly running a partial DAG once you define an operation that would cause a shuffle. What happens if you setup your same set of

Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
Hi, I am trying to see exactly what happens under the hood of Spark when performing a simple sortByKey. So far I've already discovered the fetch files and both the temp-shuffle and shuffle files being written to disk, but there is still an extra stage that keeps puzzling me. This is the

Hardware provisioning for Spark SQl

2015-04-29 Thread Pietro Gentile
Hi all, I have to estimate resource requirements for my Hadoop/Spark cluster. In particular, I have to query about 100 TB of HBase table data to do aggregation with Spark SQL. What is, approximately, the most suitable cluster configuration for my use case, in order to query data quickly? At last

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
Thanks for the suggestion. I ran the command and the limit is 1024. Based on my understanding, the connector to Kafka should not open so many files. Do you think there is a possible socket leakage? BTW, in every batch, which is 5 seconds, I output some results to MySQL: def ingestToMysql(data:

multiple programs compilation by sbt.

2015-04-29 Thread Dan Dong
Hi, Following the Quick Start guide: https://spark.apache.org/docs/latest/quick-start.html I could compile and run a Spark program successfully, now my question is how to compile multiple programs with sbt in a bunch. E.g, two programs as: ./src ./src/main ./src/main/scala

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Ted Yu
Can you run the command 'ulimit -n' to see the current limit ? To configure ulimit settings on Ubuntu, edit */etc/security/limits.conf* Cheers On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am using the direct approach to receive real-time data from

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Ted Yu
Maybe add statement.close() in finally block ? Streaming / Kafka experts may have better insight. On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Thanks for the suggestion. I ran the command and the limit is 1024. Based on my understanding, the connector to Kafka
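
A minimal sketch of the close-in-finally idea; the JDBC URL, credentials, table and the shape of ingestToMysql are placeholders rather than the poster's code:

    import java.sql.DriverManager

    def ingestToMysql(rows: Seq[String]): Unit = {
      val conn = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "password")
      val statement = conn.prepareStatement("INSERT INTO results(value) VALUES (?)")
      try {
        rows.foreach { r =>
          statement.setString(1, r)
          statement.executeUpdate()
        }
      } finally {
        statement.close() // runs even if an insert throws, so handles are not leaked
        conn.close()
      }
    }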

Re: HOw can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread ayan guha
It's no different; you would use group by and an aggregate function to do so. On 30 Apr 2015 02:15, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I have multiple DataFrame objects each stored in a parquet file. The DataFrame just contains 3 columns (id, value, timeStamp). I need to
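
A possible sketch of that group-by approach for the DataFrames described above (column names id, value, timeStamp come from the question; df1 and df2 stand for any two of the parquet-backed frames):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.max

    def keepLatest(df1: DataFrame, df2: DataFrame): DataFrame = {
      val all = df1.unionAll(df2)
      // Latest timestamp per id, with a renamed key column to keep the self-join unambiguous.
      val latest = all.groupBy("id").agg(max("timeStamp").as("maxTs"))
                      .withColumnRenamed("id", "maxId")
      all.join(latest, all("id") === latest("maxId") && all("timeStamp") === latest("maxTs"))
         .select(all("id"), all("value"), all("timeStamp"))
    }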

Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
Hi all, I am using the direct approach to receive real-time data from Kafka in the following link: https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html My code follows the word count direct example:

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
I'm not sure, but I wonder if because you are using the Spark REPL that it may not be representing what a normal runtime execution would look like and is possibly eagerly running a partial DAG once you define an operation that would cause a shuffle. What happens if you setup your same set of

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Tathagata Das
Is the function ingestToMysql running on the driver or on the executors? Accordingly you can try debugging while running in a distributed manner, with and without calling the function. If you don't get too many open files without calling ingestToMysql(), the problem is likely to be in

Re: Join between Streaming data vs Historical Data in spark

2015-04-29 Thread Tathagata Das
Have you taken a look at the join section in the streaming programming guide? http://spark.apache.org/docs/latest/streaming-programming-guide.html#stream-dataset-joins On Wed, Apr 29, 2015 at 7:11 AM, Rendy Bambang Junior rendy.b.jun...@gmail.com wrote: Let say I have transaction data and

Kryo serialization of classes in additional jars

2015-04-29 Thread Akshat Aranya
Hi, Is it possible to register kryo serialization for classes contained in jars that are added with spark.jars? In my experiment it doesn't seem to work, likely because the class registration happens before the jar is shipped to the executor and added to the classloader. Here's the general idea
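
One approach sometimes tried is a custom KryoRegistrator compiled into the same jar as the classes it registers; the sketch below is hypothetical (MyRecord is made up) and assumes that jar is visible on the executor classpath (e.g. via spark.executor.extraClassPath) when the serializer is created:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    // Stand-in for a class that lives in the extra jar.
    case class MyRecord(id: Long, payload: String)

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[MyRecord])
      }
    }

The registrator would then be selected by setting spark.kryo.registrator to its fully qualified class name.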

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
This function is called in foreachRDD. I think it should be running in the executors. I add the statement.close() in the code and it is running. I will let you know if this fixes the issue. On Wed, Apr 29, 2015 at 4:06 PM, Tathagata Das t...@databricks.com wrote: Is the function ingestToMysql

Re: Compute pairwise distance

2015-04-29 Thread ayan guha
This is my first thought, please suggest any further improvements: 1. Create an RDD of your dataset. 2. Do a cross join to generate pairs. 3. Apply reduceByKey and compute the distance. You will get an RDD with key pairs and distances. Best Ayan On 30 Apr 2015 06:11, Driesprong, Fokko fo...@driesprong.frl
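
A brute-force sketch of those three steps; the toy points and the euclidean() helper are illustrative only, and here the distance is computed with a map over the generated pairs rather than reduceByKey:

    import org.apache.spark.SparkContext

    // Hypothetical helper: plain Euclidean distance between two dense points.
    def euclidean(a: Array[Double], b: Array[Double]): Double =
      math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

    def pairwiseDistances(sc: SparkContext): Unit = {
      // Step 1: an RDD of (id, point) pairs, toy data only.
      val points = sc.parallelize(Seq(
        (1L, Array(0.0, 0.0)), (2L, Array(3.0, 4.0)), (3L, Array(6.0, 8.0))))
      // Step 2: cross join, keeping each unordered pair once since the measure is symmetric.
      val pairs = points.cartesian(points).filter { case ((i, _), (j, _)) => i < j }
      // Step 3: one distance per key pair.
      val distances = pairs.map { case ((i, a), (j, b)) => ((i, j), euclidean(a, b)) }
      distances.collect().foreach(println)
    }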

Re: multiple programs compilation by sbt.

2015-04-29 Thread Dan Dong
Hi Ted, I will have a look at it, thanks a lot. Cheers, Dan On April 29, 2015 at 5:00 PM, Ted Yu yuzhih...@gmail.com wrote: Have you looked at http://www.scala-sbt.org/0.13/tutorial/Multi-Project.html ? Cheers On Wed, Apr 29, 2015 at 2:45 PM, Dan Dong dongda...@gmail.com wrote: Hi,

Re: implicit function in SparkStreaming

2015-04-29 Thread Tathagata Das
I believe that the implicit def is pulling in the enclosing class (in which the def is defined) in the closure which is not serializable. On Wed, Apr 29, 2015 at 4:20 AM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Hi guys, I`m puzzled why i cant use the implicit function in

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Tathagata Das
Also cc'ing Cody. @Cody maybe there is a reason for doing connection pooling even if there is no performance difference. TD On Wed, Apr 29, 2015 at 4:06 PM, Tathagata Das t...@databricks.com wrote: Is the function ingestToMysql running on the driver or on the executors? Accordingly you can

Re: Driver memory leak?

2015-04-29 Thread Tathagata Das
It could be related to this. https://issues.apache.org/jira/browse/SPARK-6737 This was fixed in Spark 1.3.1. On Wed, Apr 29, 2015 at 8:38 AM, Sean Owen so...@cloudera.com wrote: Not sure what you mean. It's already in CDH since 5.4 = 1.3.0 (This isn't the place to ask about CDH) I also

Re: multiple programs compilation by sbt.

2015-04-29 Thread Ted Yu
Have you looked at http://www.scala-sbt.org/0.13/tutorial/Multi-Project.html ? Cheers On Wed, Apr 29, 2015 at 2:45 PM, Dan Dong dongda...@gmail.com wrote: Hi, Following the Quick Start guide: https://spark.apache.org/docs/latest/quick-start.html I could compile and run a Spark program

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Part of the issue is that when you read messages in a topic, the messages are peeked, not polled, so there is no "when the queue is empty", as I understand it. So it would seem I'd want to do KafkaUtils.createRDD, which takes an array of OffsetRanges. Each OffsetRange is characterized by topic,

Re: HBase HTable constructor hangs

2015-04-29 Thread Ted Yu
Can you verify whether the HBase release you're using has the following fix? HBASE-8 non environment variable solution for IllegalAccessError Cheers On Tue, Apr 28, 2015 at 10:47 PM, Tridib Samanta tridib.sama...@live.com wrote: I turned on the TRACE and I see lot of following exception:

Join between Streaming data vs Historical Data in spark

2015-04-29 Thread Rendy Bambang Junior
Let's say I have transaction data and visit data:

visit
| userId | Visit source | Timestamp |
| A | google ads | 1 |
| A | facebook ads | 2 |

transaction
| userId | total price | timestamp |
| A | 100 | 248384 |
| B | 200 | 43298739 |

I

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Cody Koeninger
The idea of peek vs poll doesn't apply to Kafka, because Kafka is not a queue. There are two ways of doing what you want, either using KafkaRDD or a direct stream. The KafkaRDD approach would require you to find the beginning and ending offsets for each partition. For an example of this, see
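
A sketch of the KafkaRDD route on the Spark 1.3-era API; the topic name, partition count, offsets and broker address are all made-up placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    def readTopicOnce(sc: SparkContext): Unit = {
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      // One range per partition: (topic, partition, fromOffset, untilOffset).
      val offsetRanges = Array(
        OffsetRange("mytopic", 0, 0L, 1000L),
        OffsetRange("mytopic", 1, 0L, 1000L))
      val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
        sc, kafkaParams, offsetRanges)
      println(rdd.count())
    }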

Re: How to group multiple row data ?

2015-04-29 Thread ayan guha
looks like you need this: lst = [[10001, 132, 2002, 1, 2012-11-23], [10001, 132, 2002, 1, 2012-11-24], [10031, 102, 223, 2, 2012-11-24], [10001, 132, 2002, 2, 2012-11-25], [10001, 132, 2002, 3, 2012-11-26]] base = sc.parallelize(lst,1).map(lambda x:

RE: How to group multiple row data ?

2015-04-29 Thread Silvio Fiorito
I think you'd probably want to look at combineByKey. I'm on my phone so can't give you an example, but that's one solution I would try. You would then take the resulting RDD and go back to a DF if needed. From: bipin bipin@gmail.com Sent: 4/29/2015

Re: Re: solr in spark

2015-04-29 Thread Costin Leau
# Disclaimer: I'm an employee of Elastic (the company behind Elasticsearch) and lead of the Elasticsearch Hadoop integration. Some things to clarify on the Elasticsearch side: 1. Elasticsearch is a distributed, real-time search and analytics engine. Search is just one aspect of it, and it can work

Re: Dataframe filter based on another Dataframe

2015-04-29 Thread Olivier Girardot
You mean after joining? Sure, my question was more whether there is any best practice preferable to joining the other dataframe for filtering. Regards, Olivier. On Wed, Apr 29, 2015 at 1:23 PM, Olivier Girardot ssab...@gmail.com wrote: Hi everyone, what is the most efficient way to filter a

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread Tathagata Das
Could you put the implicit def in an object? That should work, as objects are never serialized. On Wed, Apr 29, 2015 at 6:28 PM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Thank you for your pointers , it`s very helpful to me , in this scenario how can i use the implicit def in
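
A hypothetical sketch of that suggestion; the conversion itself is invented, the point is only that the implicit lives in a standalone object rather than the enclosing class:

    object Conversions {
      // Made-up example conversion; no outer class is captured by closures that use it.
      implicit def stringToWords(s: String): Array[String] = s.split(" ")
    }

The streaming code would then import Conversions._ wherever the conversion is needed.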

Re: [Spark SQL] Problems creating a table in specified schema/database

2015-04-29 Thread Michael Armbrust
No, sorry this is not supported. Support for more than one database is lacking in several areas (though mostly works for hive tables). I'd like to fix this in Spark 1.5. On Tue, Apr 28, 2015 at 1:54 AM, James Aley james.a...@swiftkey.com wrote: Hey all, I'm trying to create tables from

RE: Sort (order by) of the big dataset

2015-04-29 Thread Ulanov, Alexander
After a day of debugging (actually, more), I can answer my own question: the problem is that the default value 200 of spark.sql.shuffle.partitions is too small for sorting 2B rows. It was hard to realize because Spark executors just crash with various exceptions one by one. The other takeaway is that
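
For reference, a one-liner sketch of raising that setting through the SQLContext; 2000 is an illustrative value, not a recommendation:

    import org.apache.spark.sql.SQLContext

    def raiseSqlShufflePartitions(sqlContext: SQLContext): Unit =
      sqlContext.setConf("spark.sql.shuffle.partitions", "2000") // default is 200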

Re: Compute pairwise distance

2015-04-29 Thread Debasish Das
The cross-join shuffle space might not be needed, since most likely you can cut it through application-specific logic (topK etc.)... Also, the brute-force approach will most likely serve as a benchmark tool to see how much better your clustering-based KNN solution is, since there are several ways you

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread guoqing0...@yahoo.com.hk
Thank you for your pointers, it's very helpful to me. In this scenario, how can I use the implicit def in the enclosing class? From: Tathagata Das Date: 2015-04-30 07:00 To: guoqing0...@yahoo.com.hk CC: user Subject: Re: implicit function in SparkStreaming I believe that the implicit def is

spark kryo serialization question

2015-04-29 Thread 邓刚 [技术中心]
Hi all, we know that Spark supports Kryo serialization. Suppose there is a map function which maps C to (K, V) (here C, K, V are instances of classes C, K, V); when we register Kryo serialization, should I register all three of these classes? Best Wishes Triones Deng
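
A short sketch of registering all three classes up front, with stand-in case classes for the C, K and V of the question:

    import org.apache.spark.SparkConf

    // Stand-ins for the question's classes.
    case class C(raw: String)
    case class K(key: String)
    case class V(count: Int)

    // Register everything Kryo will have to serialize, i.e. all three classes.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[C], classOf[K], classOf[V]))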

Re: How to install spark in spark on yarn mode

2015-04-29 Thread madhvi
Hi, follow the instructions to install at the following link: http://mbonaci.github.io/mbo-spark/ You don't need to install Spark on every node. Just install it on one node, or you can install it on a remote system as well and make a Spark cluster. Thanks Madhvi On Thursday 30 April 2015 09:31 AM,

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread Lin Hao Xu
For you question, I think the discussion in this link can help. http://apache-spark-user-list.1001560.n3.nabble.com/Error-related-to-serialisation-in-spark-streaming-td6801.html Best regards, Lin Hao XU IBM Research China Email: xulin...@cn.ibm.com My Flickr:

Event generator for SPARK-Streaming from csv

2015-04-29 Thread anshu shukla
I have the real DEBS taxi data in a CSV file; in order to operate over it, how do I simulate a Spout-like event generator using the timestamps in the CSV file? -- SERC-IISC Thanks Regards, Anshu Shukla

RE: HOw can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread Wang, Ningjun (LNG-NPV)
As I understand from SQL, group by allows you to do sum(), average(), max(), min(). But how do I select the entire row in the group with the maximum timeStamp column? For example:

id1, value1, 2015-01-01
id1, value2, 2015-01-02
id2, value3, 2015-01-01
id2, value4, 2015-01-02

I want to return id1,

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Cody Koeninger
Use lsof to see what files are actually being held open. That stacktrace looks to me like it's from the driver, not executors. Where in foreach is it being called? The outermost portion of foreachRDD runs in the driver, the innermost portion runs in the executors. From the docs:
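
A small sketch of the driver/executor split described here; messages stands for whatever DStream the job builds:

    import org.apache.spark.streaming.dstream.DStream

    def wire(messages: DStream[String]): Unit = {
      messages.foreachRDD { rdd =>
        // Everything directly in this block runs on the driver, once per batch.
        rdd.foreachPartition { records =>
          // This block runs on an executor: open a connection here, use it, close it here.
          records.foreach(println)
        }
      }
    }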

How to install spark in spark on yarn mode

2015-04-29 Thread xiaohe lan
Hi experts, I see Spark on YARN has yarn-client and yarn-cluster modes. I also have a 5-node hadoop cluster (hadoop 2.4). How do I install Spark if I want to try the Spark on YARN mode? Do I need to install Spark on each node of the hadoop cluster? Thanks, Xiaohe

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread guoqing0...@yahoo.com.hk
Appreciate your help, it works. I'm curious why the enclosing class cannot be serialized; does it need to extend java.io.Serializable? If an object is never serialized, how does it work in the task? Is there any association with spark.closure.serializer? guoqing0...@yahoo.com.hk

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Thanks for the comments, Cody. Granted, Kafka topics aren't queues. I was merely wishing that Kafka's topics had some queue behaviors supported, because often that is exactly what one wants. The ability to poll messages off a topic seems like something lots of use-cases would want. I'll explore both

Re: solr in spark

2015-04-29 Thread Costin Leau
On 4/29/15 6:02 PM, Jeetendra Gangele wrote: Thanks for the detailed explanation. My only worry is that searching all the combinations of company names through ES looks hard. I'm not sure what makes you think ES looks hard. Have you tried browsing the Elasticsearch reference or the definitive guide?