Re: RDD Location

2016-12-29 Thread Sun Rui
Maybe you can create your own subclass of RDD and override getPreferredLocations() to implement the logic for dynamically changing the locations. > On Dec 30, 2016, at 12:06, Fei Hu wrote: > > Dear all, > > Is there any way to change the host location for a certain
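
A minimal sketch of that suggestion, assuming the locations are keyed by partition index and updated on the driver (class and field names are illustrative):

    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // Wraps a parent RDD and serves preferred locations from a mutable,
    // driver-side map that can be replaced between jobs.
    class LocationAwareRDD[T: ClassTag](
        val parent: RDD[T],
        initialHosts: Map[Int, Seq[String]])
      extends RDD[T](parent) {

      @volatile var hostMap: Map[Int, Seq[String]] = initialHosts

      override def compute(split: Partition, context: TaskContext): Iterator[T] =
        parent.iterator(split, context)

      override protected def getPartitions: Array[Partition] = parent.partitions

      override protected def getPreferredLocations(split: Partition): Seq[String] =
        hostMap.getOrElse(split.index, Nil)
    }

The scheduler consults getPreferredLocations when it submits a job, so changes to the map take effect for subsequent jobs, not for tasks already running.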

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread ayan guha
"If data ingestion speed is faster than data production speed, then eventually the entire database will be harvested and those workers will start to "tail" the database for new data streams and the processing becomes real time." This part is really database dependent. So it will be hard to

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread 任弘迪
Why not sync the binlog of MySQL (hopefully the data is immutable and the table is append-only), send the log through Kafka, and then consume it with Spark Streaming? On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust wrote: > We don't support this yet, but I've opened this JIRA
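
A sketch of the consuming side of that pipeline, using the receiver-less direct stream from the spark-streaming-kafka 0-8 integration (broker address and topic name are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("binlog-ingest")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Each record is a (key, value) pair; the value carries the
    // serialized binlog event published by the upstream sync job.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mysql-binlog"))

    stream.foreachRDD { rdd =>
      rdd.map(_._2).foreach(event => println(event)) // replace with real processing
    }

    ssc.start()
    ssc.awaitTermination()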

Re: DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Annabel Melongo
Richard, In the provided documentation, under the paragraph "Schema Merging", you can actually do what you want this way: 1. Create a schema that reads the raw JSON, line by line 2. Create another schema that reads the JSON file and structures it in ("id", "ln", "fn") 3. Merge the two

Re: Best way to process lookup ETL with Dataframes

2016-12-29 Thread ayan guha
How about this - select a.*, nvl(b.col, nvl(c.col, 'some default')) from driving_table a left outer join lookup1 b on a.id=b.id left outer join lookup2 c on a.id=c.id ? On Fri, Dec 30, 2016 at 9:55 AM, Sesterhenn, Mike wrote: > Hi all, > > > I'm writing an ETL process with
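
The same lookup chain in the DataFrame API, for reference (table and column names are placeholders matching the SQL; coalesce plays the role of the nested nvl):

    import org.apache.spark.sql.functions.{coalesce, lit}

    // Two left outer joins, then take the first non-null lookup value,
    // falling back to a literal default.
    val resolved = driving
      .join(lookup1, driving("id") === lookup1("id"), "left_outer")
      .join(lookup2, driving("id") === lookup2("id"), "left_outer")
      .select(
        driving("id"),
        coalesce(lookup1("col"), lookup2("col"), lit("some default")).as("col"))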

[Spark streaming 1.6.0] Spark streaming with Yarn: executors not fully utilized

2016-12-29 Thread Nishant Kumar
I am running Spark Streaming with Yarn - *spark-submit --master yarn --deploy-mode cluster --num-executors 2 > --executor-memory 8g --driver-memory 2g --executor-cores 8 ..* > I am consuming Kafka through the DirectStream approach (no receiver). I have 2 topics (each with 3 partitions). I

RDD Location

2016-12-29 Thread Fei Hu
Dear all, Is there any way to change the host location for a certain partition of an RDD? "protected def getPreferredLocations(split: Partition)" can be used to initialize the location, but how to change it after the initialization? Thanks, Fei

Re: DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Richard Xin
Thanks, I have seen this, but it doesn't cover my question. What I need is to read JSON and include the raw JSON as part of my DataFrame. On Friday, December 30, 2016 10:23 AM, Annabel Melongo wrote: Richard, Below documentation will show you how to create

Re: DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Annabel Melongo
Richard, Below documentation will show you how to create a SparkSession and how to programmatically load data: Spark SQL and DataFrames - Spark 2.1.0 Documentation On Thursday, December 29, 2016 5:16 PM,

DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Richard Xin
Say I have the following data in a file: {"id":1234,"ln":"Doe","fn":"John","age":25} {"id":1235,"ln":"Doe","fn":"Jane","age":22} Java code snippet: final SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("json_test"); JavaSparkContext ctx = new
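
One possible approach (a Scala sketch against a Spark 2.x SparkSession, not from the thread; the file path is a placeholder): load each line as plain text so the original string survives, then extract the fields with get_json_object.

    import org.apache.spark.sql.functions.get_json_object

    val raw = spark.read.text("people.json").toDF("raw_json")
    import spark.implicits._

    // Each row keeps the untouched JSON string plus the extracted fields.
    val parsed = raw.select(
      get_json_object($"raw_json", "$.id").cast("long").as("id"),
      get_json_object($"raw_json", "$.ln").as("ln"),
      get_json_object($"raw_json", "$.fn").as("fn"),
      $"raw_json")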

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread Michael Armbrust
We don't support this yet, but I've opened this JIRA as it sounds generally useful: https://issues.apache.org/jira/browse/SPARK-19031 In the meantime you could try implementing your own Source, but that is pretty low level and is not yet a stable API. On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe

Best way to process lookup ETL with Dataframes

2016-12-29 Thread Sesterhenn, Mike
Hi all, I'm writing an ETL process with Spark 1.5, and I was wondering the best way to do something. A lot of the fields I am processing require an algorithm similar to this: Join input dataframe to a lookup table. if (that lookup fails (the joined fields are null)) { Lookup into some

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
Yes you're just dividing by the norm of the vector you passed in. You can look at the change on that JIRA and probably see how this was added into the method itself. On Thu, Dec 29, 2016 at 10:34 PM Manish Tripathi wrote: > ok got that. I understand that the ordering won't

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
ok got that. I understand that the ordering won't change. I just wanted to make sure I am getting the right thing, or that I understand what I am getting, since it didn't make sense going by the cosine calculation. One last confirmation, and I appreciate all the time you are spending to reply: In the

Re: Invert large matrix

2016-12-29 Thread Yanwei Wayne Zhang
Thanks for the advice. I figured out a way to solve this problem by avoiding the matrix representation. Wayne From: Sean Owen Sent: Thursday, December 29, 2016 1:52 PM To: Yanwei Wayne Zhang; user Subject: Re: Invert large matrix I think

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
Yes, the vectors are not otherwise normalized. You are basically getting the cosine similarity, but times the norm of the word vector you supplied, because it's not divided through. You could just divide the results yourself. I don't think it will be back-ported because the behavior was
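
A sketch of that workaround for Spark 1.x (assumes a trained mllib Word2VecModel named model; the helper name is illustrative):

    import org.apache.spark.mllib.feature.Word2VecModel

    // Divide each reported score by the norm of the query word's vector
    // to recover a true cosine similarity in [-1, 1].
    def cosineSynonyms(model: Word2VecModel, word: String, num: Int): Array[(String, Double)] = {
      val v = model.transform(word)
      val queryNorm = math.sqrt(v.toArray.map(x => x * x).sum)
      model.findSynonyms(word, num).map { case (w, score) => (w, score / queryNorm) }
    }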

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
Sean, Thanks for the answer. I am using Spark 1.6, so are you saying the output I am getting is cos(A,B)=dot(A,B)/norm(A)? My point with respect to normalization was that if you normalize or don't normalize both vectors A,B, the output would be the same. Since if I normalize A and B, then Cos(A,B)=

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
It should be the cosine similarity, yes. I think this is what was fixed in https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was really just outputting the 'unnormalized' similarity (dot / norm(a) only) but the docs said cosine similarity. Now it's cosine similarity in Spark 2. The

Re: Invert large matrix

2016-12-29 Thread Sean Owen
I think the best advice is: don't do that. If you're trying to solve a linear system, solve the linear system without explicitly constructing a matrix inverse. Is that what you mean? On Thu, Dec 29, 2016 at 2:22 AM Yanwei Wayne Zhang < actuary_zh...@hotmail.com> wrote: > Hi all, > > > I have a
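
For illustration, solving a small system directly with Breeze (which ships with Spark) instead of forming an inverse:

    import breeze.linalg.{DenseMatrix, DenseVector}

    val A = DenseMatrix((4.0, 1.0), (1.0, 3.0))
    val b = DenseVector(1.0, 2.0)
    val x = A \ b  // LU-based solve; no explicit inverse is materialized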

Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
I used the word2vec algorithm of Spark to compute document vectors of a text. I then used the findSynonyms function of the model object to get synonyms of a few words. I see something like this: [screenshot of findSynonyms output omitted] I do not understand why the cosine similarity is being calculated as more than 1. Cosine similarity

Reading specific column family and columns in Hbase table through spark

2016-12-29 Thread Mich Talebzadeh
Hi, I have a routine in Spark that iterates through HBase rows and tries to read columns. My question is how I can read the columns in the correct order. Example: val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
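
A sketch of one way to do this (table, family, and qualifier names are placeholders): restrict the scan to the wanted family and columns before building the RDD, then fetch each cell by family/qualifier instead of relying on iteration order.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table")
    conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "cf")
    conf.set(TableInputFormat.SCAN_COLUMNS, "cf:col1 cf:col2")

    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // getValue fetches a cell by family and qualifier (null if absent),
    // so the physical ordering of cells no longer matters.
    val rows = hBaseRDD.map { case (_, result) =>
      val col1 = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1")))
      val col2 = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col2")))
      (col1, col2)
    }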

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2016-12-29 Thread Marco Mistroni
Hello, no, sorry, I don't have any further insight into that. I have seen similar errors, but for completely different issues, and in most of the cases it had to do with my data or my processing rather than Spark itself. What does your program do? You say it runs for 2-3 hours; what is the logic?

Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hello Jacek, Actually, Reynold is still the release manager and I am just sending this message for him :) Sorry. I should have made it clear in my original email. Thanks, Yin On Thu, Dec 29, 2016 at 10:58 AM, Jacek Laskowski wrote: > Hi Yan, > > I've been surprised the first

Re: Issue with SparkR setup on RStudio

2016-12-29 Thread Felix Cheung
Any reason you are setting HADOOP_HOME? From the error it seems you are running into an issue with the Hive config, likely when trying to load hive-site.xml. Could you try not setting HADOOP_HOME? From: Md. Rezaul Karim Sent:

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2016-12-29 Thread Nicholas Hakobian
If you are using Spark 2.0 (as listed in the Stack Overflow post), why are you using the external CSV module from Databricks? Spark 2.0 includes the functionality from this external module natively, and it's possible you are mixing an older library with a newer Spark, which could explain a crash.
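
The difference in one place (path and options are illustrative):

    // Spark 2.x: CSV support is built in.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/data.csv")

    // Spark 1.x needed the external package instead:
    // sqlContext.read.format("com.databricks.spark.csv")
    //   .option("header", "true").load("/path/to/data.csv")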

Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Jacek Laskowski
Hi Yan, I was surprised the first time I noticed rxin stepped back and a new release manager stepped in. Congrats on your first ANNOUNCE! I can only expect even more great stuff coming in to Spark from the dev team after Reynold spared some time. Can't wait to read the changes...

Re: [PySpark - 1.6] - Avoid object serialization

2016-12-29 Thread Holden Karau
Alternatively, using the broadcast functionality can also help with this. On Thu, Dec 29, 2016 at 3:05 AM Eike von Seggern wrote: > 2016-12-28 20:17 GMT+01:00 Chawla,Sumit : > > Would this work for you? > > def processRDD(rdd): > analyzer =
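
A Scala sketch of the same pattern (the thread is PySpark, but the idea carries over; ShortTextAnalyzer stands in for the poster's own class): build the heavy object once on the driver, broadcast it, and dereference it inside tasks.

    // Shipped to each executor once instead of once per task/closure.
    val analyzerBc = sc.broadcast(new ShortTextAnalyzer(rootDir))
    rdd.foreach(record => analyzerBc.value.analyzeShortTextEvent(record._2))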

Issue with SparkR setup on RStudio

2016-12-29 Thread Md. Rezaul Karim
Dear Spark users, I am trying to set up SparkR on RStudio to perform some basic data manipulations and ML modeling. However, I am getting a strange error while creating a SparkR session or DataFrame that says: java.lang.IllegalArgumentException Error while instantiating

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-29 Thread Yin Huai
Hello, We spent some time preparing artifacts and changes to the website (including the release notes). I just sent out the announcement. 2.1.0 is officially released. Thanks, Yin On Wed, Dec 28, 2016 at 12:42 PM, Justin Miller < justin.mil...@protectwise.com> wrote: > Interesting, because

[ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hi all, Apache Spark 2.1.0 is the second release of the Spark 2.x line. This release makes significant strides in the production readiness of Structured Streaming, with added support for event time watermarks

Re: [PySpark - 1.6] - Avoid object serialization

2016-12-29 Thread Eike von Seggern
2016-12-28 20:17 GMT+01:00 Chawla,Sumit : > Would this work for you? > > def processRDD(rdd): > analyzer = ShortTextAnalyzer(root_dir) > rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1])) > > ssc.union(*streams).filter(lambda x: x[1] !=

send this email to unsubscribe

2016-12-29 Thread John

[Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread Yuanzhe Yang (杨远哲)
Hi all, Thanks a lot for your contributions to bringing us new technologies. I don't want to waste your time, so before I write to you, I googled, checked Stack Overflow and the mailing list archive with keywords "streaming" and "jdbc". But I was not able to get any solution to my use case. I hope I

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2016-12-29 Thread Palash Gupta
Hi Marco, Thanks for your response. Yes, I tested it before and am able to load from the Linux filesystem, and it also sometimes has a similar issue. However, in both cases (either from the Hadoop or Linux file system), this error comes up in some specific scenarios, as per my observations: 1. When two parallel

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2016-12-29 Thread Marco Mistroni
Hi, please try to read a CSV from the filesystem instead of Hadoop. If you can read it successfully, then your Hadoop file is the issue and you can start debugging from there. Hth On 29 Dec 2016 6:26 am, "Palash Gupta" wrote: > Hi Apache Spark User team, > > > >

spark program for dependent cascading operations

2016-12-29 Thread srimugunthan dhandapani
Hi all, Can somebody solve the below problem using Spark? There are two datasets of numbers, Set1 = {4,6,11,14} and Set2 = {5,11,12,3}. I have to subtract each element of Set1 by elements of Set2. But if the element in Set2 is bigger, then the residue left after subtraction is used for the

Re: Jdbc connection from sc.addJar failing

2016-12-29 Thread drtrotter74
Have you found any solution to this? I am having a similar issue where my db2jcc.jar license file is not being found. I was hoping the addJar() method would work, but it does not seem to help. I cannot even get the addJar syntax correct, it seems... Can I just call it inline?:
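
For the syntax question, addJar takes a single path or URI string (file names below are illustrative):

    // Local path on the driver, or an hdfs://, http:// or ftp:// URI.
    sc.addJar("/path/to/db2jcc.jar")
    sc.addJar("/path/to/db2jcc_license_cu.jar")

Note that jars added this way are shipped to executors for tasks, but a JDBC driver registered through DriverManager usually has to be on the driver's classpath at startup, so passing --jars together with --driver-class-path to spark-submit is often the more reliable route.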