Re: Storing spark processed output to Database asynchronously.

2015-05-21 Thread Gautam Bajaj
That is completely alright, as the system will make sure the work gets done. My major concern is the data drop. Will using async stop data loss? On Thu, May 21, 2015 at 4:55 PM, Tathagata Das t...@databricks.com wrote: If you cannot push data as fast as you are generating it, then async isn't

Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2015-05-21 Thread Tathagata Das
Thanks for the JIRA. I will look into this issue. TD On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I ran into one of the issues that are potentially caused because of this and have logged a JIRA bug - https://issues.apache.org/jira/browse/SPARK-7788

Re: spark mllib kmeans

2015-05-21 Thread Pa Rö
I want to evaluate some different distance measures for time-space clustering, so I need an API to implement my own function in Java. 2015-05-19 22:08 GMT+02:00 Xiangrui Meng men...@gmail.com: Just curious, what distance measure do you need? -Xiangrui On Mon, May 11, 2015 at 8:28 AM, Jaonary

Re: java program Get Stuck at broadcasting

2015-05-21 Thread Akhil Das
Can you share the code? Maybe I/someone can help you out. Thanks Best Regards On Thu, May 21, 2015 at 1:45 PM, Allan Jie allanmcgr...@gmail.com wrote: Hi, Just checked the logs of the datanode; it looks like this: 2015-05-20 11:42:14,605 INFO

Re: Storing spark processed output to Database asynchronously.

2015-05-21 Thread Tathagata Das
If you cannot push data as fast as you are generating it, then async isn't going to help either. The work is just going to keep piling up as many, many async jobs, even though your batch processing times will be low, as that processing time is not going to reflect how much of the overall work is pending

Question about Serialization in Storage Level

2015-05-21 Thread Jiang, Zhipeng
Hi there, This question may seem kind of naïve, but what's the difference between MEMORY_AND_DISK and MEMORY_AND_DISK_SER? If I call rdd.persist(StorageLevel.MEMORY_AND_DISK), the BlockManager won't serialize the rdd? Thanks, Zhipeng

Re: Spark and Flink

2015-05-21 Thread Pa Rö
Thanks a lot for your help; now that I have split my project, it works. 2015-05-19 15:44 GMT+02:00 Alexander Alexandrov alexander.s.alexand...@gmail.com: Sorry, we're using a forked version which changed the groupID. 2015-05-19 15:15 GMT+02:00 Till Rohrmann trohrm...@apache.org: I guess it's a typo:

Re: How to process data in chronological order

2015-05-21 Thread Sonal Goyal
Would partitioning your data based on the key and then running mapPartitions help? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, May 21, 2015 at 4:33 AM, roy rp...@njit.edu wrote: I have a key-value RDD, key is a timestamp
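A minimal Scala sketch of that partition-then-mapPartitions idea (the timestamped pairs below are made up for illustration; in the original question the key is a timestamp):

    import org.apache.spark.HashPartitioner

    // (timestamp, value) pairs -- illustrative data only
    val events = sc.parallelize(Seq((3L, "c"), (1L, "a"), (2L, "b")), 2)
    val ordered = events
      .partitionBy(new HashPartitioner(2))                       // co-locate records that share a key
      .mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator)   // process each partition in timestamp order
    ordered.collect()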

Unable to use hive queries with constants in predicates

2015-05-21 Thread Devarajan Srinivasan
Hi, I was testing Spark reading data from Hive using HiveContext. I got the following error when I used a simple query with constants in predicates. I am using Spark 1.3. Has anyone encountered an error like this? Error: Exception in thread main org.apache.spark.sql.AnalysisException:

Re: How to set the file size for parquet Part

2015-05-21 Thread Akhil Das
How many part files do you have? Did you try re-partitioning to a smaller number of partitions so that you get a smaller number of bigger files? Thanks Best Regards On Wed, May 20, 2015 at 3:06 AM, Richard Grossman richie...@gmail.com wrote: Hi I'm using spark 1.3.1 and now I can't set the size of
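A hedged sketch of what that re-partitioning might look like with the Spark 1.3 Scala API (the paths, the partition count, and the sqlContext name are placeholders/assumptions):

    val df = sqlContext.parquetFile("/path/to/input")     // Spark 1.3 API; sqlContext assumed to exist
    // fewer partitions => fewer, larger part- files in the output
    df.repartition(16).saveAsParquetFile("/path/to/output")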

Re: Read multiple files from S3

2015-05-21 Thread Akhil Das
textFile does read all files in a directory. We have modified the sparkstreaming code base to read nested files from S3; you can check this function

Re: rdd.saveAsTextFile problem

2015-05-21 Thread Keerthi
Hi, I had tried the workaround shared here, but I am still facing the same issue... Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/rdd-saveAsTextFile-problem-tp176p22970.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: FP Growth saveAsTextFile

2015-05-21 Thread Xiangrui Meng
+user If this was in cluster mode, you should provide a path on a shared file system, e.g., HDFS, instead of a local path. If this is in local mode, I'm not sure what went wrong. On Wed, May 20, 2015 at 2:09 PM, Eric Tanner eric.tan...@justenough.com wrote: Here is the stack trace. Thanks for
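For illustration, a small Scala sketch of writing FP-Growth results to a shared filesystem rather than a local path (the transactions are made up and the HDFS URI is a placeholder):

    import org.apache.spark.mllib.fpm.FPGrowth

    val transactions = sc.parallelize(Seq(Array("a", "b"), Array("a", "c"), Array("a", "b", "c")))
    val model = new FPGrowth().setMinSupport(0.5).run(transactions)
    model.freqItemsets
      .map(fi => fi.items.mkString("[", ",", "]") + ": " + fi.freq)
      .saveAsTextFile("hdfs://namenode:8020/tmp/fpgrowth-output")   // shared path visible to all executors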

Re: rdd.saveAsTextFile problem

2015-05-21 Thread Akhil Das
This thread is from a year back; can you please share what issue you are facing? Which version of Spark are you using? What is your system environment? Exception stack-trace? Thanks Best Regards On Thu, May 21, 2015 at 12:19 PM, Keerthi keerthi.reddy1...@gmail.com wrote: Hi , I had tried

Re: java program Get Stuck at broadcasting

2015-05-21 Thread Allan Jie
Sure, the code is very simple. I think you guys can understand it from the main function. public class Test1 { public static double[][] createBroadcastPoints(String localPointPath, int row, int col) throws IOException{ BufferedReader br = RAWF.reader(localPointPath); String line = null; int rowIndex

Re: java program got Stuck at broadcasting

2015-05-21 Thread allanjie
Sure, the code is very simple. I think you guys can understand it from the main function. public class Test1 { public static double[][] createBroadcastPoints(String localPointPath, int row, int col) throws IOException{ BufferedReader br = RAWF.reader(localPointPath);

Re: How to use spark to access HBase with Security enabled

2015-05-21 Thread donhoff_h
Hi, Many thanks for the help. My Spark version is 1.3.0 too and I run it on Yarn. According to your advice I have changed the configuration. Now my program can read the hbase-site.xml correctly. And it can also authenticate with zookeeper successfully. But I have run into a new problem, which is that my

Re: Spark Streaming - Design considerations/Knobs

2015-05-21 Thread Hemant Bhanawat
Honestly, given the length of my email, I didn't expect a reply. :-) Thanks for reading and replying. However, I have a follow-up question: I don't think I understand block replication completely. Are the blocks replicated immediately after they are received by the receiver? Or are they

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread Ram Sriharsha
Never mind... I didn't realize you were referring to the first table as df. So you want to do a join between the first table and an RDD? The right way to do it within the data frame construct is to think of it as a join... map the second RDD to a data frame and do an inner join on ip On Thu, May

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-05-21 Thread Tathagata Das
Looks like the file size reported by the FSInputDStream over Tachyon's FileSystem interface is somehow zero. On Mon, May 11, 2015 at 4:38 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Just to follow up this thread further . I was doing some fault tolerant testing

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread ping yan
Thanks. I suspected that, but figured that a df query inside a map sounds so intuitive that I didn't want to just give up. I've tried a join, and even better with a DStream.transform(), and it works! freqs = testips.transform(lambda rdd: rdd.join(kvrdd).map(lambda (x,y): y[1])) Thank you for the help!

RE: rdd.sample() methods very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
I don't need it to be 100% random. How about randomly picking a few partitions and returning all docs in those partitions? Is rdd.mapPartitionsWithIndex() the right method to use to just process a small portion of partitions? Ningjun -Original Message- From: Sean Owen

Pipelining with Spark

2015-05-21 Thread dgoldenberg
From the performance and scalability standpoint, is it better to plug in, say, a multi-threaded pipeliner into a Spark job, or to implement pipelining via Spark's own transformation mechanisms such as e.g. map or filter? I'm seeing some reference architectures where things like 'morphlines' are

Re: Unable to use hive queries with constants in predicates

2015-05-21 Thread Yana Kadiyska
I have not seen this error but have seen another user have weird parser issues before: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3ccag6lhyed_no6qrutwsxeenrbqjuuzvqtbpxwx4z-gndqoj3...@mail.gmail.com%3E I would attach a debugger and see what is going on -- if I'm looking

Re: Question about Serialization in Storage Level

2015-05-21 Thread Todd Nist
From the docs, https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence: Storage Level: MEMORY_ONLY. Meaning: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're
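A tiny Scala sketch of the two levels being compared (illustrative only; note that a given RDD can only be persisted with one level, hence the commented-out line):

    import org.apache.spark.storage.StorageLevel

    val nums = sc.parallelize(1 to 1000)
    nums.persist(StorageLevel.MEMORY_AND_DISK)        // stored as deserialized Java objects
    // nums.persist(StorageLevel.MEMORY_AND_DISK_SER) // would store serialized bytes: more compact, more CPU to read back
    nums.count()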

Re: Spark Streaming on top of Cassandra?

2015-05-21 Thread tshah77
Can someone provide an example of Spark Streaming using Java? I have Cassandra running but did not configure Spark, and would like to create a DStream. Thanks -- View this message in context:

Re: Spark Streaming on top of Cassandra?

2015-05-21 Thread jay vyas
Hi. I have a Spark Streaming - Cassandra application which you can probably borrow pretty easily. You can always rewrite a part of it in Java if you need to, or else you can just use Scala (see the blog post below if you want a Java-style dev workflow w/ Scala using IntelliJ). This application

Re: Connecting to an inmemory database from Spark

2015-05-21 Thread Tathagata Das
Doesn't seem like a Cassandra-specific issue. Could you give us more information (code, errors, stack traces)? On Thu, May 21, 2015 at 1:33 PM, tshah77 tejasrs...@gmail.com wrote: TD, Do you have any example about reading from cassandra using spark streaming in java? I am trying to connect

foreach vs foreachPartitions

2015-05-21 Thread ben
I would like to know if foreachPartitions will result in better performance, due to a higher level of parallelism, compared to the foreach method, considering the case in which I'm flowing through an RDD in order to perform some sums into an accumulator variable. Thank you, Beniamino.
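A rough Scala sketch of the two variants being compared (the data and the sums are illustrative; both accumulators end up with the same total):

    val nums = sc.parallelize(1 to 1000, 4)
    val perElement = sc.accumulator(0L)
    val perPartition = sc.accumulator(0L)
    nums.foreach(n => perElement += n.toLong)                              // one accumulator update per element
    nums.foreachPartition(iter => perPartition += iter.map(_.toLong).sum) // one update per partition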

Re: Connecting to an inmemory database from Spark

2015-05-21 Thread tshah77
TD, Do you have any example of reading from Cassandra using Spark Streaming in Java? I am trying to connect to Cassandra using Spark Streaming and it is throwing an error: could not parse master URL. Thanks Tejas -- View this message in context:

Re: How to use spark to access HBase with Security enabled

2015-05-21 Thread Ted Yu
Are the worker nodes colocated with HBase region servers? Were you running as the hbase super user? You may need to login, using code similar to the following: if (isSecurityEnabled()) { SecurityUtil.login(conf, fileConfKey, principalConfKey, localhost); } SecurityUtil is

foreach plus accumulator Vs mapPartitions performance

2015-05-21 Thread ben
Hi, everybody. There are some cases in which I can obtain the same results by using the mapPartitions and the foreach methods. For example, in a typical MapReduce approach one would perform a reduceByKey immediately after a mapPartitions that transforms the original RDD into a collection of tuple

Spark MOOC - early access

2015-05-21 Thread Marco Shaw
Hi Spark Devs and Users, BerkeleyX and Databricks are currently developing two Spark-related MOOCs on edX (intro https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x, ml https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x), the first of which

Re: How to use spark to access HBase with Security enabled

2015-05-21 Thread Bill Q
What I found with CDH 5.4.1 Spark 1.3 is that the spark.executor.extraClassPath setting is not working. I had to use SPARK_CLASSPATH instead. On Thursday, May 21, 2015, Ted Yu yuzhih...@gmail.com wrote: Are the worker nodes colocated with HBase region servers ? Were you running as hbase super user

Re: Spark HistoryServer not coming up

2015-05-21 Thread roy
This got resolved after cleaning /user/spark/applicationHistory/* -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-HistoryServer-not-coming-up-tp22975p22981.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

running spark on yarn

2015-05-21 Thread Nathan Kronenfeld
Hello, folks. We just recently switched to using Yarn on our cluster (when upgrading to Cloudera 5.4.1). I'm trying to run a Spark job from within a broader application (a web service running on Jetty), so I can't just start it using spark-submit. Does anyone know of an instructions page on how

Re: PySpark Logs location

2015-05-21 Thread Oleg Ruchovets
It doesn't work for me so far; I used the command but got the following output. What should I check to fix the issue? Any configuration parameters ... [root@sdo-hdp-bd-master1 ~]# yarn logs -applicationId application_1426424283508_0048 15/05/21 13:25:09 INFO impl.TimelineClientImpl: Timeline service

DataFrame Column Alias problem

2015-05-21 Thread SLiZn Liu
Hi Spark Users Group, I'm doing groupBy operations on my DataFrame df as follows, to get a count for each value of col1: df.groupBy("col1").agg("col1" -> "count").show // I don't know if I should write it like this. col1 COUNT(col1#347) aaa 2 bbb 4 ccc 4 ... and more... As I'd like to

Re: rdd.saveAsTextFile problem

2015-05-21 Thread Akhil Das
On Thu, May 21, 2015 at 4:17 PM, Howard Yang howardyang2...@gmail.com wrote: I followed http://www.srccodes.com/p/article/38/build-install-configure-run-apache-hadoop-2.2.0-microsoft-windows-os to build the latest version of Hadoop on my Windows machine, and added the environment variable HADOOP_HOME and

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-21 Thread Tomasz Fruboes
Hi, thanks for the answer, I'll open a ticket. In the meantime I have found a workaround. The recipe is the following: 1. Create a new account/group on all machines (let's call it sparkuser). Run spark from this account. 2. Add your user to the group sparkuser. 3. If you decide to write

Re: DataFrame Column Alias problem

2015-05-21 Thread Ram Sriharsha
df.groupBy($"col1").agg(count($"col1").as("c")).show On Thu, May 21, 2015 at 3:09 AM, SLiZn Liu sliznmail...@gmail.com wrote: Hi Spark Users Group, I'm doing groupBy operations on my DataFrame df as follows, to get a count for each value of col1: df.groupBy("col1").agg("col1" -> "count").show // I
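For completeness, a runnable Scala variant of that answer (assuming an existing SQLContext named sqlContext; the sample data and the alias "cnt" are just examples):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.count

    val df = sc.parallelize(Seq(Tuple1("aaa"), Tuple1("aaa"), Tuple1("bbb"))).toDF("col1")
    df.groupBy($"col1").agg(count($"col1").as("cnt")).show()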

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-21 Thread Grega Kešpret
Hi, is this fixed in master? Grega On Thu, May 14, 2015 at 7:50 PM, Michael Armbrust mich...@databricks.com wrote: End of the month is the target: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage On Thu, May 14, 2015 at 3:45 AM, Ishwardeep Singh

saveAsTextFile() part- files are missing

2015-05-21 Thread rroxanaioana
Hello! I just started with Spark. I have an application which counts words in a file (1 MB file). The file is stored locally. I loaded the file using native code and then created the RDD from it. JavaRDD<String> rddFromFile = context.parallelize(myFile, 2);

RE: rdd.sample() methods very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
Is there any other way to solve the problem? Let me state the use case. I have an RDD[Document] that contains over 7 million items. The RDD needs to be saved to persistent storage (currently I save it as an object file on disk). Then I need to get a small random sample of Document objects (e.g. 10,000

Re: rdd.sample() methods very slow

2015-05-21 Thread Sean Owen
I guess the fundamental issue is that these aren't stored in a way that allows random access to a Document. Underneath, Hadoop has a concept of a MapFile, which is like a SequenceFile with an index of offsets into the file where records begin. Although Spark doesn't use it, you could maybe create

map reduce ?

2015-05-21 Thread Yasemin Kaya
Hi, I have a JavaPairRDD<String, List<Integer>> and, as an example, this is what I want to get:
user_id cat1 cat2 cat3 cat4
522 0 1 2 0
62 1 0 3 0
661 1 2 0 1
Query: the users who have a number (except 0) in the cat1 and cat3 columns. Answer: cat2 - 522,611 cat3 - 522,62 = user 522. How can I
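One hedged way to express that kind of per-column lookup in Scala (data reproduced from the example above; the name usersPerCat is made up):

    val users = sc.parallelize(Seq(
      ("522", List(0, 1, 2, 0)),
      ("62",  List(1, 0, 3, 0)),
      ("661", List(1, 2, 0, 1))))

    // for every category column, collect the user ids with a non-zero value
    val usersPerCat = users.flatMap { case (id, cats) =>
      cats.zipWithIndex.collect { case (v, i) if v != 0 => ("cat" + (i + 1), id) }
    }.groupByKey()

    usersPerCat.collect().foreach(println)   // e.g. (cat1,CompactBuffer(62, 661)), (cat3,CompactBuffer(522, 62)), ...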

Re: java program Get Stuck at broadcasting

2015-05-21 Thread Allan Jie
Hey, I think I found the problem. It turns out that the file I saved is too large. On 21 May 2015 at 16:44, Akhil Das ak...@sigmoidanalytics.com wrote: Can you share the code, may be i/someone can help you out Thanks Best Regards On Thu, May 21, 2015 at 1:45 PM, Allan Jie

Pandas timezone problems

2015-05-21 Thread Def_Os
After deserialization, something seems to be wrong with my pandas DataFrames. It looks like the timezone information is lost, and subsequent errors ensue. Serializing and deserializing a timezone-aware DataFrame tests just fine, so it must be Spark that somehow changes the data. My program runs

Re: Pandas timezone problems

2015-05-21 Thread Xiangrui Meng
These are relevant: JIRA: https://issues.apache.org/jira/browse/SPARK-6411 PR: https://github.com/apache/spark/pull/6250 On Thu, May 21, 2015 at 3:16 PM, Def_Os njde...@gmail.com wrote: After deserialization, something seems to be wrong with my pandas DataFrames. It looks like the timezone

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2015-05-21 Thread Peng Cheng
I stumbled upon this thread and I conjecture that this may affect restoring a checkpointed RDD as well: http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928 In my case I have 1600+ fragmented

Re: Spark MOOC - early access

2015-05-21 Thread Kartik Mehta
Awesome! Thanks a ton for helping us all and for the futuristic planning. Much appreciated. Regards, Kartik On May 21, 2015 4:41 PM, Marco Shaw marco.s...@gmail.com wrote: Hi Spark Devs and Users, BerkeleyX and Databricks are currently developing two Spark-related MOOCs on edX

Re: [pyspark] Starting workers in a virtualenv

2015-05-21 Thread Davies Liu
Could you try specifying PYSPARK_PYTHON as the path to the Python in your virtualenv, for example PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py On Mon, Apr 20, 2015 at 12:51 AM, Karlson ksonsp...@siberie.de wrote: Hi all, I am running the Python process that communicates with

task all finished, while the stage marked finish long time later problem

2015-05-21 Thread 邓刚 [Technology Center]
Hi all, We are running Spark Streaming with version 1.1.1. Recently we found an odd problem. In stage 44554, all the tasks finished, but the stage was marked finished a long time later. As you can see in the log below, the last task finished at 15/05/21 21:17:36. And also the

Kmeans Labeled Point RDD

2015-05-21 Thread anneywarlord
Hello, New to Spark. I wanted to know if it is possible to use a LabeledPoint RDD in org.apache.spark.mllib.clustering.KMeans. After I cluster my data I would like to be able to identify which observations were grouped with each centroid. Thanks -- View this message in context:

Re: Kmeans Labeled Point RDD

2015-05-21 Thread Krishna Sankar
You can predict and then zip it with the points RDD to get approximately the same thing as an RDD of LabeledPoints. Cheers k/ On Thu, May 21, 2015 at 6:19 PM, anneywarlord anneywarl...@gmail.com wrote: Hello, New to Spark. I wanted to know if it is possible to use a Labeled Point RDD in org.apache.spark.mllib.clustering.KMeans.
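A rough Scala sketch of that zip idea (MLlib KMeans; the points, k, and the iteration count are made up for illustration):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
      Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 9.0))).cache()
    val model = KMeans.train(points, 2, 10)            // k = 2, 10 iterations
    val labeled = model.predict(points).zip(points)    // (clusterId, point) pairs
    labeled.collect().foreach(println)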

Re: foreach plus accumulator Vs mapPartitions performance

2015-05-21 Thread Burak Yavuz
Or you can simply use `reduceByKeyLocally` if you don't want to worry about implementing accumulators and such, and assuming that the reduced values will fit in memory of the driver (which you are assuming by using accumulators). Best, Burak On Thu, May 21, 2015 at 2:46 PM, ben
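And a minimal sketch of that reduceByKeyLocally alternative (Scala; the pairs are illustrative):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    // the reduced map comes back to the driver directly, no accumulator needed
    val totals: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)
    println(totals)   // Map(a -> 4, b -> 2)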

Re: PySpark Logs location

2015-05-21 Thread Ruslan Dautkhanov
https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application When log aggregation isn't turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-05-21 Thread Dibyendu Bhattacharya
Hi Tathagata, Thanks for looking into this. Investigating further, I found that the issue is that Tachyon does not support file append. The streaming receiver, which writes to the WAL, is not able to append to the same WAL file after it fails and is restarted. I raised this with the Tachyon user

LDA prediction on new document

2015-05-21 Thread Dani Qiu
Hi, guys, I'm pretty new to LDA. I notice spark 1.3.0 mllib provides an EM-based LDA implementation. It returns both topics and topic distribution. My question is how can I use these parameters to predict on a new document? And I notice there is an Online LDA implementation in the spark master branch,

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread Ram Sriharsha
Your original code snippet seems incomplete and there isn't enough information to figure out what problem you actually ran into. From your original code snippet, there is an rdd variable which is well defined and a df variable that is not defined in the snippet of code you sent. One way to make

Re: rdd.sample() methods very slow

2015-05-21 Thread Sean Owen
If sampling whole partitions is sufficient (or a part of a partition), sure you could mapPartitionsWithIndex and decide whether to process a partition at all based on its # and skip the rest. That's much faster. On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com
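A small Scala sketch of sampling whole partitions that way (the data, the partition count, and the keep-every-10th rule are illustrative):

    val docs = sc.parallelize(1 to 100000, 50)
    // keep every 10th partition, skip the rest entirely
    val sampled = docs.mapPartitionsWithIndex { (idx, iter) =>
      if (idx % 10 == 0) iter else Iterator.empty
    }
    println(sampled.count())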

Official Docker container for Spark

2015-05-21 Thread tridib
Hi, I am using spark 1.2.0. Can you suggest Docker containers which can be deployed in production? I found a lot of Spark images in https://registry.hub.docker.com/ but could not figure out which one to use. None of them seems like an official image. Does anybody have any recommendation? Thanks

Re: java program got Stuck at broadcasting

2015-05-21 Thread Akhil Das
Can you try commenting out the saveAsTextFile and doing a simple count()? If it's a broadcast issue, then it would throw up the same error. On 21 May 2015 14:21, allanjie allanmcgr...@gmail.com wrote: Sure, the code is very simple. I think you guys can understand it from the main function. public class

Spark HistoryServer not coming up

2015-05-21 Thread roy
Hi, After restarting the Spark HistoryServer, it failed to come up. I checked the logs for the Spark HistoryServer and found the following messages: 2015-05-21 11:38:03,790 WARN org.apache.spark.scheduler.ReplayListenerBus: Log path provided contains no log files. 2015-05-21 11:38:52,319 INFO

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-21 Thread Shixiong Zhu
My 2 cents: As per javadoc: https://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#addShutdownHook(java.lang.Thread) Shutdown hooks should also finish their work quickly. When a program invokes exit the expectation is that the virtual machine will promptly shut down and exit. When the

Re: saveAsTextFile() part- files are missing

2015-05-21 Thread Tomasz Fruboes
Hi, it looks like you are writing to a local filesystem. Could you try writing to a location visible to all nodes (master and workers), e.g. an NFS share? HTH, Tomasz On 21.05.2015 at 17:16, rroxanaioana wrote: Hello! I just started with Spark. I have an application which counts words in a
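For illustration, a Scala sketch of that suggestion (the HDFS URI is a placeholder for any location every node can see):

    val words = sc.parallelize(Seq("spark", "saves", "part-files"))
    // write to a shared path (HDFS/NFS) instead of a worker-local directory
    words.saveAsTextFile("hdfs://namenode:8020/tmp/wordcount-output")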

Re: Spark HistoryServer not coming up

2015-05-21 Thread Marcelo Vanzin
Seems like there might be a mismatch between your Spark jars and your cluster's HDFS version. Make sure you're using the Spark jar that matches the hadoop version of your cluster. On Thu, May 21, 2015 at 8:48 AM, roy rp...@njit.edu wrote: Hi, After restarting Spark HistoryServer, it failed

Query a Dataframe in rdd.map()

2015-05-21 Thread ping yan
I have a dataframe as a reference table for IP frequencies, e.g.:
ip freq
10.226.93.67 1
10.226.93.69 1
161.168.251.101 4
10.236.70.2 1
161.168.251.105 14
All I need is to query the df in a map. rdd = sc.parallelize(['208.51.22.18',

Re: Spark Streaming + Kafka failure recovery

2015-05-21 Thread Bill Jay
Hi Cody, That is clear. Thanks! Bill On Tue, May 19, 2015 at 1:27 PM, Cody Koeninger c...@koeninger.org wrote: If you checkpoint, the job will start from the successfully consumed offsets. If you don't checkpoint, by default it will start from the highest available offset, and you will

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread Holden Karau
So DataFrames, like RDDs, can only be accessed from the driver. If your IP frequency table is small enough, you could collect it and distribute it as a hashmap with broadcast, or you could also join your rdd with the ip frequency table. Hope that helps :) On Thursday, May 21, 2015, ping yan
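A Scala sketch of the broadcast variant described above (the table contents, the sqlContext, and the variable names are all illustrative assumptions):

    import sqlContext.implicits._

    val ipFreqDF = sc.parallelize(Seq(("10.226.93.67", 1), ("161.168.251.101", 4))).toDF("ip", "freq")
    // small reference table: collect it on the driver and broadcast it as a map
    val ipFreq = sc.broadcast(ipFreqDF.collect().map(r => r.getString(0) -> r.getInt(1)).toMap)

    val ips = sc.parallelize(Seq("10.226.93.67", "208.51.22.18"))
    val freqs = ips.map(ip => ipFreq.value.getOrElse(ip, 0))
    freqs.collect()   // Array(1, 0)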