Re: Spark support for Complex Event Processing (CEP)

2016-04-19 Thread Mich Talebzadeh
Thanks a lot Mario. Will have a look. Regards, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 20

Re: Spark support for Complex Event Processing (CEP)

2016-04-19 Thread Mario Ds Briggs
Hi Mich, Info is here - https://issues.apache.org/jira/browse/SPARK-14745 overview is in the pdf - https://issues.apache.org/jira/secure/attachment/12799670/SparkStreamingCEP.pdf Usage examples not in the best place for now (will make it better) -

Re: Re: Why Spark having OutOfMemory Exception?

2016-04-19 Thread Jeff Zhang
It seems to be an OOM on the driver side when fetching task results. You can try to increase spark.driver.memory and spark.driver.maxResultSize. On Tue, Apr 19, 2016 at 4:06 PM, 李明伟 wrote: > Hi Zhan Zhang > > > Please see the exception trace below. It is saying some GC overhead limit >
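A minimal sketch of raising those two limits when building the SparkConf (the 4g/2g values are illustrative placeholders, not recommendations from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("driver-memory-example")
      // spark.driver.memory is normally passed via spark-submit --driver-memory,
      // because the driver JVM is already running by the time application code executes.
      .set("spark.driver.memory", "4g")
      // Caps the total size of serialized task results collected back to the driver.
      .set("spark.driver.maxResultSize", "2g")
    val sc = new SparkContext(conf)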

Re: VectorAssembler handling null values

2016-04-19 Thread Nick Pentreath
Could you provide an example of what your input data looks like? Supporting missing values in a sparse result vector makes sense. On Tue, 19 Apr 2016 at 23:55, Andres Perez wrote: > Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently cannot > handle null

Re:Re: Why very small work load cause GC overhead limit?

2016-04-19 Thread 李明伟
The memory parameters: --executor-memory 8G --driver-memory 4G. Please note that the data size is very small; the total size of the data is less than 10 MB. As for jmap, it is a little hard for me to do so. I am not a Java developer. I will google jmap first, thanks. Regards Mingwei

Re: Why very small work load cause GC overhead limit?

2016-04-19 Thread Ted Yu
Can you tell us the memory parameters you used? If you can capture jmap output before the GC limit was exceeded, that would give us more clues. Thanks > On Apr 19, 2016, at 7:40 PM, "kramer2...@126.com" wrote: > > Hi All > > I use spark doing some calculation. > The situation

Re: How does .jsonFile() work?

2016-04-19 Thread Hyukjin Kwon
Hi, I hope I understood correctly. This is a simplified procedure. Preconditions: - The JSON file is written line by line; each line is a JSON document. - A root array is supported, e.g. [{...}, {...}, {...}]. Procedure: - Schema inference (if a user schema is not given) 1.
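A minimal sketch of both paths in Spark 1.6, assuming a line-delimited JSON file (the path and field names are illustrative):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types._

    val sqlContext = new SQLContext(sc)  // sc is an existing SparkContext

    // Schema inference: Spark scans the documents and merges the fields it observes.
    val inferred = sqlContext.read.json("hdfs:///data/events.json")

    // Supplying a schema up front skips the inference pass.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType)))
    val typed = sqlContext.read.schema(schema).json("hdfs:///data/events.json")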

Why very small work load cause GC overhead limit?

2016-04-19 Thread kramer2...@126.com
Hi All, I use Spark to do some calculation. The situation is: 1. New files will come into a folder periodically. 2. I turn the new files into a data frame and insert it into a previous data frame. The code is like below : # Get the file list in the HDFS directory client =

How does .jsonFile() work?

2016-04-19 Thread resonance
Hi, this is more of a theoretical question and I'm asking it here because I have no idea where to find documentation for this stuff. I am currently working with Spark SQL and am considering using data contained within JSON datasets. I am aware of the .jsonFile() method in Spark SQL. What is the

Re: Any NLP lib could be used on spark?

2016-04-19 Thread Burak Yavuz
A quick search on spark-packages returns: http://spark-packages.org/package/databricks/spark-corenlp. You may need to build it locally and add it to your session by --jars. On Tue, Apr 19, 2016 at 10:47 AM, Gavin Yue wrote: > Hey, > > Want to try the NLP on the spark.

Re: Spark SQL Transaction

2016-04-19 Thread Andrés Ivaldi
I mean a local transaction. We ran a job that writes into SQL Server, then we killed the Spark JVM just for testing purposes, and we realized that SQL Server did a rollback. Regards On Tue, Apr 19, 2016 at 5:27 PM, Mich Talebzadeh wrote: > Hi, > > What do you mean by

Spark 1.6.1 DataFrame write to JDBC

2016-04-19 Thread Jonathan Gray
Hi, I'm trying to write ~60 million rows from a DataFrame to a database over JDBC using Spark 1.6.1, something similar to df.write().jdbc(...). The write does not seem to be performing well. Profiling the application with a master of local[*], it appears there is not much socket write activity and
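One thing worth checking is write parallelism, since each DataFrame partition opens its own JDBC connection. A sketch under that assumption (the URL, table name, credentials and partition count are placeholders, not values from the thread):

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser")        // placeholder credentials
    props.setProperty("password", "secret")

    // More partitions => more concurrent inserts, up to what the database can absorb.
    df.repartition(16)
      .write
      .jdbc("jdbc:postgresql://dbhost:5432/mydb", "target_table", props)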

VectorAssembler handling null values

2016-04-19 Thread Andres Perez
Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently cannot handle null values. This presents a problem for us as we wish to run a decision tree classifier on sometimes sparse data. Is there a particular reason VectorAssembler is implemented in this way, and can anyone recommend the

Re: Spark SQL Transaction

2016-04-19 Thread Mich Talebzadeh
Hi, What do you mean by *without transaction*? do you mean forcing SQL Server to accept a non logged operation? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Spark SQL Transaction

2016-04-19 Thread Andrés Ivaldi
Hello, is it possible to execute a SQL write without a transaction? We don't need transactions to save our data, and they add overhead to SQL Server. Regards. -- Ing. Ivaldi Andres

how to get weights of logistic regression model inside cross validator model?

2016-04-19 Thread Wei Chen
Hi All, I am using the example of model selection via cross-validation from the documentation here: http://spark.apache.org/docs/latest/ml-guide.html. After I get the "cvModel", I would like to see the weights for each feature for the best logistic regression model. I've been looking at the
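A minimal sketch of one way to reach the underlying model, assuming the CrossValidator was built around the Pipeline from the ml-guide example (tokenizer, hashingTF, logistic regression); the variable names are illustrative:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.classification.LogisticRegressionModel

    // cvModel is the fitted CrossValidatorModel.
    val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
    val lrModel = bestPipeline.stages
      .collectFirst { case m: LogisticRegressionModel => m }
      .get

    println(lrModel.coefficients)  // per-feature weights ("weights" in older releases)
    println(lrModel.intercept)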

pyspark split pair rdd to multiple

2016-04-19 Thread pth001
Hi, How can I split a pair RDD [K, V] into [K, Array(V)] efficiently in PySpark? Best, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
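A minimal sketch of the usual approach (written in Scala for consistency with the other examples here; PySpark exposes the same groupByKey/mapValues calls, with mapValues(list) on the Python side):

    // Group all values for each key, then materialise them as an array.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val grouped: org.apache.spark.rdd.RDD[(String, Array[Int])] =
      pairs.groupByKey().mapValues(_.toArray)

    // If the real goal is an aggregate per key, reduceByKey/aggregateByKey
    // avoids shipping and holding the full value list for each key.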

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
Let me be more detailed in my response: Kafka works on “at least once” semantics. Therefore, given your assumption that Kafka "will be operational", we can assume that at least once semantics will hold. At this point, it comes down to designing for consumer (really Spark Executor) resilience.

Any NLP lib could be used on spark?

2016-04-19 Thread Gavin Yue
Hey, I want to try NLP on Spark. Could anyone recommend an easy-to-run open-source NLP lib for Spark? Also, is there any recommended semantic network? Thanks a lot.

Re: prefix column Spark

2016-04-19 Thread Michael Armbrust
A few comments: - Each withColumnRenamed adds a new level to the logical plan. We have optimized this significantly in newer versions of Spark, but it is still not free. - Transforming to an RDD is going to do a fairly expensive conversion back and forth between the internal binary format. -
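A single projection with aliased columns keeps the plan flat instead of stacking one rename per column; a sketch of that idea (not code from the thread):

    import org.apache.spark.sql.DataFrame

    // One select instead of one withColumnRenamed per column.
    def prefixAll(df: DataFrame, prefix: String): DataFrame = {
      val renamed = df.columns.map(c => df(c).as(s"${prefix}_$c"))
      df.select(renamed: _*)
    }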

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
At that scale, it’s best not to do coordination at the application layer. How much of your data is transactional in nature {all, some, none}? By which I mean ACID-compliant. > On Apr 19, 2016, at 10:53 AM, Erwan ALLAIN wrote: > > Cody, you're right that was an

Re: Spark streaming batch time displayed is not current system time but it is processing current messages

2016-04-19 Thread Ted Yu
Using http://www.ruddwire.com/handy-code/date-to-millisecond-calculators/#.VxZh3iMrKuo , 1460823008000 is shown to be 'Sat Apr 16 2016 09:10:08 GMT-0700' Can you clarify the 4 day difference ? bq. 'right now April 14th' The date of your email was Apr 16th. On Sat, Apr 16, 2016 at 9:39 AM,

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
Cody, you're right, that was an example. The target architecture would be 3 DCs :) Good point on ZK, I'll have to check that. About Spark, both instances will run at the same time but on different topics. It would be quite useless to have 2 DCs working on the same set of data. I just want, in case

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
Maybe I'm missing something, but I don't see how you get a quorum in only 2 datacenters (without splitbrain problem, etc). I also don't know how well ZK will work cross-datacenter. As far as the spark side of things goes, if it's idempotent, why not just run both instances all the time. On

Re: Cached Parquet file paths problem

2016-04-19 Thread Piotr Smoliński
Solved it. The anonymous RDDs can be cached in the cacheManager in SQLContext. In order to remove all the cached content use: sqlContext.clearCache() The warning symptom about failed data frame registration is the following entry in the log: 16/04/16 20:18:39 [tp439928219-110] WARN

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
I'm describing a disaster recovery, but it can also be used to take one datacenter offline for an upgrade, for instance. From my point of view, when DC2 crashes: *On Kafka side:* - the kafka cluster will lose one or more brokers (partition leader and replica) - a lost partition leader will be reelected in the

Re: hbaseAdmin tableExists create catalogTracker for every call

2016-04-19 Thread Ted Yu
The CatalogTracker object may not be used by all the methods of HBaseAdmin. Meaning, when HBaseAdmin is constructed, we don't need CatalogTracker. On Tue, Apr 19, 2016 at 6:09 AM, WangYQ wrote: > in hbase 0.98.10, class "HBaseAdmin " > line 303, method

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
Is the main concern uptime or disaster recovery? > On Apr 19, 2016, at 9:12 AM, Cody Koeninger wrote: > > I think the bigger question is what happens to Kafka and your downstream data > store when DC2 crashes. > > From a Spark point of view, starting up a post-crash job in

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
I think the bigger question is what happens to Kafka and your downstream data store when DC2 crashes. From a Spark point of view, starting up a post-crash job in a new data center isn't really different from starting up a post-crash job in the original data center. On Tue, Apr 19, 2016 at 3:32

Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-19 Thread Arkadiusz Bicz
Sorry, I've found one error: If you do NOT need any relational processing of your messages ( basing on historical data, or joining with other messages) and message processing is quite independent Kafka plus Spark Streaming could be overkill. On Tue, Apr 19, 2016 at 1:54 PM, Arkadiusz Bicz

Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-19 Thread Arkadiusz Bicz
The requirements look like my previous project for smart metering. We finally did a custom solution without Spark, Hadoop and Kafka, but that was 4 years ago, when I did not have experience with these technologies (some did not exist or were not mature). If you do need any relational processing of your

Re: Spark streaming batch time displayed is not current system time but it is processing current messages

2016-04-19 Thread Prashant Sharma
This can happen if system time is not in sync. By default, streaming uses SystemClock(it also supports ManualClock) and that relies on System.currentTimeMillis() for determining start time. Prashant Sharma On Sat, Apr 16, 2016 at 10:09 PM, Hemalatha A < hemalatha.amru...@googlemail.com> wrote:

Exceeding spark.akka.frameSize when saving Word2VecModel

2016-04-19 Thread Stefan Falk
Hello Sparklings! I am trying to train a word vector model, but as I call Word2VecModel#save() I am getting an org.apache.spark.SparkException saying that this would exceed the frameSize limit (stackoverflow question [1]). Increasing the frameSize would only help me in this particular case I
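For reference, a sketch of how that limit is usually raised in Spark 1.x, which still routes these messages through Akka (the 512 MB figure is an illustrative value, not a recommendation from the thread):

    import org.apache.spark.SparkConf

    // Value is in MB; messages larger than this are rejected by Akka.
    val conf = new SparkConf()
      .setAppName("word2vec-save")
      .set("spark.akka.frameSize", "512")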

Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-19 Thread Jörn Franke
I do not think there is a simple how-to for this. First you need to be clear about volumes in storage, in transit and in processing. Then you need to be aware of what kind of queries you want to do. Your assumption of milliseconds for the expected data volumes currently seems to be unrealistic.

Re: prefix column Spark

2016-04-19 Thread nihed mbarek
Hi, thank you. It's the first solution, and it took a long time to handle all my fields. Regards, On Tue, Apr 19, 2016 at 11:29 AM, Ndjido Ardo BAR wrote: > > This can help: > > import org.apache.spark.sql.DataFrame > > def prefixDf(dataFrame: DataFrame, prefix: String):

RE: Spark + HDFS

2016-04-19 Thread Ashic Mahtab
Spark will execute as a client for HDFS. In other words, it will contact the HDFS master (the NameNode), which will return the block locations, and the data will then be fetched from the DataNodes. Date: Tue, 19 Apr 2016 14:00:31 +0530 Subject: Spark + HDFS From: chaturvedich...@gmail.com To:

SQL Driver

2016-04-19 Thread AlexModestov
Hello all, I use this string when launching Sparkling Water: "--conf spark.driver.extraClassPath='/SQLDrivers/sqljdbc_4.2/enu/sqljdbc41.jar'" and I get the error: "TypeError Traceback

Re: prefix column Spark

2016-04-19 Thread Ndjido Ardo BAR
This can help:

    import org.apache.spark.sql.DataFrame

    def prefixDf(dataFrame: DataFrame, prefix: String): DataFrame = {
      val colNames = dataFrame.columns
      colNames.foldLeft(dataFrame) { (df, colName) =>
        df.withColumnRenamed(colName, s"${prefix}_${colName}")
      }
    }
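A usage sketch (the input data frame and prefix are illustrative):

    val renamed = prefixDf(df, "t1")   // columns become t1_<originalName>
    renamed.printSchema()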

Re: spark sql on hive

2016-04-19 Thread Mich Talebzadeh
This is not a bug in Hive. Spark uses hive-site.xml to get the location of the Hive metastore. You cannot connect to the Hive metastore and interrogate it directly; you would need to know the metastore schema. hive.metastore.uris = thrift://:9083 (Thrift URI for the remote

prefix column Spark

2016-04-19 Thread nihed mbarek
Hi, I want to prefix a set of dataframes, and I tried two solutions: * a for loop calling withColumnRenamed based on columns() * transforming my DataFrame to an RDD, updating the old schema and recreating the dataframe. Both are working for me; the second one is faster with tables that contain 800

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
Thanks Jason and Cody. I'll try to explain the multi-DC case a bit better. As I mentioned before, I'm planning to use one Kafka cluster and 2 or more distinct Spark clusters. Let's say we have the following DC configuration in a nominal case. Kafka partitions are consumed uniformly by the 2

Spark + HDFS

2016-04-19 Thread Chaturvedi Chola
When I use Spark and HDFS on two different clusters, how do the Spark workers know which block of data is available on which HDFS node? What basically caters to this? Can someone throw some light on this?

Re: Code optimization

2016-04-19 Thread Alonso Isidoro Roman
Hi Angel, how about assigning this to a val and reusing it: k.filter(k("WT_ID") ...? I think you can avoid that, and do not forget to use System.nanoTime to measure the gain... Alonso Isidoro Roman. My favourite quotes (of today): "If debugging is the process of removing software errors, then
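A minimal sketch of the suggestion (the data frame k, the id value and the aggregation are placeholders, not the code from the thread):

    // Filter once, cache, and reuse the result instead of re-filtering for every pass.
    val filtered = k.filter(k("WT_ID") === someId).cache()

    val t0 = System.nanoTime()
    val histogram = filtered.groupBy("bucket").count()   // illustrative aggregation
    histogram.collect()
    println(s"elapsed: ${(System.nanoTime() - t0) / 1e6} ms")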

Re:Re: Why Spark having OutOfMemory Exception?

2016-04-19 Thread 李明伟
Hi Zhan Zhang, please see the exception trace below. It is saying some GC overhead limit error. I am not a Java or Scala developer, so it is hard for me to understand this information. Also, reading a core dump is too difficult for me. I am not sure if the way I am using Spark is correct. I understand

How to know whether I'm in the first batch of spark streaming

2016-04-19 Thread Yu Xie
hi spark users, I'm running a Spark Streaming application with concurrentJobs > 1, so more than one batch may run at the same time. Now I would like to do some init work in the first batch, based on the "time" of the first batch. So even if the second batch runs faster than the first batch, I

Minimum granularity of batch interval in Spark streaming

2016-04-19 Thread Mich Talebzadeh
Hi, What is the minimum granularity of the micro-batch interval in Spark Streaming, if there is such a limit? Is throughput within the window configurable? For example, what are the parameters affecting the volume of streaming data being processed, in other words the throughput per second? Thanks Dr
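For context, the batch interval is simply a Duration passed when the StreamingContext is created, and sub-second intervals can be expressed with Milliseconds; whether such small batches are practical depends on scheduling and processing overhead. A sketch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("interval-example")

    // A 1-second batch interval; Milliseconds(500) would request half-second batches.
    val ssc = new StreamingContext(conf, Seconds(1))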

Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-19 Thread Alex Kozlov
This is too big a topic. For starters, what is the latency between when you obtain the data and when the data is available for analysis? Obviously, if this is < 5 minutes, you probably need a streaming solution. How fast do the "micro batches of seconds" need to be available for analysis? Can the problem

Code optimization

2016-04-19 Thread Angel Angel
Hello, I am writing a Spark application; it runs well but takes a long time to execute. Can anyone help me optimize my query to increase the processing speed? I am writing an application in which I have to construct histograms and compare them in order to find the final

Re: [Spark 1.5.2] Log4j Configuration for executors

2016-04-19 Thread Prashant Sharma
Maybe you can try creating it before running the app.