Re: Undocumented left join constraint?

2016-05-27 Thread Michael Armbrust
Sounds like: https://issues.apache.org/jira/browse/SPARK-15441, for which a fix is in progress. Please do keep reporting issues though, these are great! Michael On Fri, May 27, 2016 at 1:01 PM, Tim Gautier wrote: > Is it truly impossible to left join a Dataset[T] on the

Re: Pros and Cons

2016-05-27 Thread Teng Qiu
yes, only for the engine, but maybe a newer version has more optimizations from the Tungsten project? at least since Spark 1.6? > -- Forwarded message -- > From: Mich Talebzadeh > Date: 27 May 2016 at 17:09 > Subject: Re: Pros and Cons > To: Teng Qiu

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I'm working around it like this: val testMapped2 = test1.rdd.map(t => t.copy(id = t.id + 1)).toDF.as[Test] testMapped2.as("t1").joinWith(testMapped2.as("t2"), $"t1.id" === $"t2.id").show Switching to an RDD, mapping, then going back to a DS seemed to avoid the issue. On Fri, May 27, 2016 at

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Koert Kuipers
i am glad to see this, i think we ran into this as well (in 2.0.0-SNAPSHOT) but i couldn't reproduce it nicely. my observation was that joins of 2 datasets that were derived from the same datasource gave this kind of trouble. i changed my datasource from val to def (so it got created twice) as a

Spark Streaming - Is window() caching DStreams?

2016-05-27 Thread Marco1982
Dear all, Can someone please explain to me how Spark Streaming executes the window() operation? From the Spark 1.6.1 documentation, it seems that windowed batches are automatically cached in memory, but looking at the web UI it seems that operations already executed in previous batches are executed
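For reference, the streaming guide does state that DStreams generated by window-based operations are persisted automatically, though persistence can also be set explicitly. A minimal sketch, assuming an existing DStream named stream (durations and storage level are illustrative):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds

    // windowed batches over the last 30 seconds, recomputed every 10 seconds
    val windowed = stream.window(Seconds(30), Seconds(10))
    // optional: override the automatic persistence level
    windowed.persist(StorageLevel.MEMORY_ONLY_SER)
    windowed.count().print()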

Re: Undocumented left join constraint?

2016-05-27 Thread Tim Gautier
When I run it in 1.6.1 I get this: java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: Null value appeared in non-nullable field: - field (class: "scala.Int", name: "id") - root class: "$iwC.$iwC.Test" If the schema is inferred from a Scala tuple/case class, or a Java

Re: Undocumented left join constraint?

2016-05-27 Thread Tim Gautier
Interesting, I did that on 1.6.1, Scala 2.10 On Fri, May 27, 2016 at 2:41 PM Ted Yu wrote: > Which release did you use ? > > I tried your example in master branch: > > scala> val test2 = Seq(Test(2), Test(3), Test(4)).toDS > test2: org.apache.spark.sql.Dataset[Test] = [id:

Re: Undocumented left join constraint?

2016-05-27 Thread Ted Yu
Which release did you use ? I tried your example in master branch: scala> val test2 = Seq(Test(2), Test(3), Test(4)).toDS test2: org.apache.spark.sql.Dataset[Test] = [id: int] scala> test1.as("t1").joinWith(test2.as("t2"), $"t1.id" === $"t2.id", "left_outer").show +---+--+ | _1|_2|

Range partition for parquet file?

2016-05-27 Thread Rex Xiong
Hi, I have a Spark job that outputs a DataFrame containing a column named Id, which is a GUID string. We will use Id to filter data in another Spark application, so it should be a partition key. I found these two methods on the Internet: 1. DataFrame.write.save("Id") method will help, but the possible
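For what it's worth, a minimal sketch of partitioned output with DataFrameWriter.partitionBy (path is hypothetical, and sqlContext.implicits._ is assumed to be imported); note that partitioning on a high-cardinality GUID column creates one directory per distinct Id, which may be impractical:

    // write one directory per distinct Id value
    df.write.partitionBy("Id").parquet("/data/output")

    // later reads can prune partitions when filtering on Id
    val subset = sqlContext.read.parquet("/data/output")
      .where($"Id" === "some-guid-value")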

Undocumented left join constraint?

2016-05-27 Thread Tim Gautier
Is it truly impossible to left join a Dataset[T] on the right if T has any non-option fields? It seems Spark tries to create Ts with null values in all fields when left joining, which results in null pointer exceptions. In fact, I haven't found any other way to get around this issue without making
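For reference, the workaround alluded to above is wrapping the fields in Option so the decoder can represent the missing right side as None; a minimal sketch on 1.6.x (assuming sqlContext.implicits._ is imported):

    case class Test(id: Option[Int])
    val left  = Seq(Test(Some(1)), Test(Some(2))).toDS
    val right = Seq(Test(Some(2)), Test(Some(3))).toDS
    // with Option fields the unmatched right side can decode to Test(None)
    // rather than failing on a null value in a non-nullable Int field
    left.as("l").joinWith(right.as("r"), $"l.id" === $"r.id", "left_outer").show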

Re: Can not set spark dynamic resource allocation

2016-05-27 Thread Cui, Weifeng
Sorry to reply this late. yarn.nodemanager.log-dirs /local/output/logs/nm-log-dir We do not use file:// in the settings, so that should not be the problem. Any other guesses? Weifeng On 5/20/16, 2:40 PM, "David Newberger" wrote: >Hi All,

Re: save RDD of Avro GenericRecord as parquet throws UnsupportedOperationException

2016-05-27 Thread Govindasamy, Nagarajan
Hi Ted, The link is useful. Still I could not figure out a way to convert the RDD[GenericRecord] into a DF. I tried to create the Spark SQL schema from the Avro schema. val json = """{"type":"record","name":"Profile","fields": [{"name":"userid","type":"string"},
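In case it helps, a minimal sketch of the usual manual route: map each GenericRecord to a Row against a hand-built StructType and call createDataFrame. Only "userid" appears in the snippet above; the "age" field is a hypothetical extra column for illustration:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val sparkSchema = StructType(Seq(
      StructField("userid", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    // records: RDD[GenericRecord], assumed
    val rows = records.map { r =>
      Row(Option(r.get("userid")).map(_.toString).orNull,   // Avro strings are Utf8, so convert
          r.get("age").asInstanceOf[Integer])
    }
    val df = sqlContext.createDataFrame(rows, sparkSchema)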

RE: Not able to write output to local filsystem from Standalone mode.

2016-05-27 Thread Yong Zhang
I am not familiar with that particular piece of code. But Spark's concurrency comes from multi-threading. One executor will use multiple threads to process tasks, and these tasks share the JVM memory of the executor. So it isn't surprising that Spark needs some blocking/sync for the memory some

Re: Not able to write output to local filsystem from Standalone mode.

2016-05-27 Thread Jacek Laskowski
Hi Yong, It makes sense...almost. :) I'm not sure how relevant it is, but just today I was reviewing the BlockInfoManager code with the locks for reading and writing, and my understanding of the code is that Spark is fine when there are multiple attempts for writes of new memory blocks (pages) with

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Ted Yu
I tried master branch : scala> val testMapped = test.map(t => t.copy(id = t.id + 1)) testMapped: org.apache.spark.sql.Dataset[Test] = [id: int] scala> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show org.apache.spark.sql.AnalysisException: cannot resolve '`t1.id`'

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
Oops, screwed up my example. This is what it should be: case class Test(id: Int) val test = Seq( Test(1), Test(2), Test(3) ).toDS test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show val testMapped = test.map(t => t.copy(id = t.id + 1))

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I figured out the trigger. Turns out it wasn't because I loaded it from the database, it was because the first thing I do after loading is to lower case all the strings. After a Dataset has been mapped, the resulting Dataset can't be self joined. Here's a test case that illustrates the issue:
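A minimal form of that test case, matching the corrected example quoted elsewhere in this thread (assuming sqlContext.implicits._ is imported):

    case class Test(id: Int)
    val test = Seq(Test(1), Test(2), Test(3)).toDS

    // self join on the original Dataset resolves fine
    test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show

    // self join on the mapped Dataset fails with "cannot resolve 't1.id'"
    val testMapped = test.map(t => t.copy(id = t.id + 1))
    testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show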

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I stand corrected. I just created a test table with a single int field to test with and the Dataset loaded from that works with no issues. I'll see if I can track down exactly what the difference might be. On Fri, May 27, 2016 at 10:29 AM Tim Gautier wrote: > I'm using

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I'm using 1.6.1. I'm not sure what good fake data would do since it doesn't seem to have anything to do with the data itself. It has to do with how the Dataset was created. Both datasets have exactly the same data in them, but the one created from a sql query fails where the one created from a

Spark_API_Copy_From_Edgenode

2016-05-27 Thread Ajay Chander
Hi Everyone, I have some data located on the EdgeNode. Right now, the process I follow to copy the data from the EdgeNode to HDFS is through a shell script which resides on the EdgeNode. In Oozie I am using an SSH action to execute the shell script on the EdgeNode which copies the

Re: JDBC Create Table

2016-05-27 Thread Mich Talebzadeh
Are you using JDBC in the Spark shell? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 27 May 2016 at

Re: JDBC Create Table

2016-05-27 Thread Anthony May
Hi Andrés, What error are you seeing? Can you paste the stack trace? Anthony On Fri, 27 May 2016 at 08:37 Andrés Ivaldi wrote: > Hello, yesterday I updated Spark 1.6.0 to 1.6.1 and my tests starts to > fail because is not possible create new tables in SQLServer, I'm using

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Ted Yu
Which release of Spark are you using ? Is it possible to come up with fake data that shows what you described ? Thanks On Fri, May 27, 2016 at 8:24 AM, Tim Gautier wrote: > Unfortunately I can't show exactly the data I'm using, but this is what > I'm seeing: > > I have

Fwd: Pros and Cons

2016-05-27 Thread Mich Talebzadeh
Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com -- Forwarded message -- From: Mich

Re: Pros and Cons

2016-05-27 Thread Teng Qiu
tried spark 2.0.0 preview, but no assembly jar there... then just gave up... :p 2016-05-27 17:39 GMT+02:00 Ted Yu : > Teng: > Why not try out the 2.0 SNAPSHOT build ? > > Thanks > >> On May 27, 2016, at 7:44 AM, Teng Qiu wrote: >> >> ah, yes, the version

Re: Pros and Cons

2016-05-27 Thread Mich Talebzadeh
Hi Ted, do you mean Hive 2 with a Spark 2 snapshot build as the execution engine, just binaries for the snapshot (all ok)? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Pros and Cons

2016-05-27 Thread Ted Yu
Teng: Why not try out the 2.0 SNAPSHOT build ? Thanks > On May 27, 2016, at 7:44 AM, Teng Qiu wrote: > > ah, yes, the version is another mess!... no vendor's product > > i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1, doesn't work. > > hadoop 2.6.2, hive 2.0.1 with

I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
Unfortunately I can't show exactly the data I'm using, but this is what I'm seeing: I have a case class 'Product' that represents a table in our database. I load that data via sqlContext.read.format("jdbc").options(...).load.as[Product] and register it in a temp table 'product'. For testing, I
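For context, a minimal sketch of the load described above (field names and connection options are hypothetical; sqlContext.implicits._ is assumed to be imported):

    case class Product(id: Int, name: String)  // illustrative fields

    val productDF = sqlContext.read.format("jdbc")
      .options(Map(
        "url"      -> "jdbc:postgresql://dbhost:5432/mydb",  // hypothetical
        "dbtable"  -> "product",
        "user"     -> "...",
        "password" -> "..."))
      .load()

    productDF.registerTempTable("product")
    val products = productDF.as[Product]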

Re: Pros and Cons

2016-05-27 Thread Teng Qiu
ah, yes, the version is another mess!... no vendor's product i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1, doesn't work. hadoop 2.6.2, hive 2.0.1 with spark 1.6.1, works, but need to fix this from hive side https://issues.apache.org/jira/browse/HIVE-13301 the jackson-databind lib from

Re: Possible bug involving Vectors with a single element

2016-05-27 Thread Yanbo Liang
Spark MLlib Vector only supports data of double type, so it's reasonable to throw an exception when creating a Vector with an element of unicode type. 2016-05-24 7:27 GMT-07:00 flyinggip : > Hi there, > > I notice that there might be a bug in pyspark.mllib.linalg.Vectors when

JDBC Create Table

2016-05-27 Thread Andrés Ivaldi
Hello, yesterday I updated Spark 1.6.0 to 1.6.1 and my tests started to fail because it is not possible to create new tables in SQL Server. I'm using SaveMode.Overwrite as in version 1.6.0. Any ideas? Regards -- Ing. Ivaldi Andres
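For reference, a minimal sketch of the write path in question (connection details are hypothetical, df is assumed), in case it helps narrow down where 1.6.1 behaves differently:

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val props = new Properties()
    props.setProperty("user", "...")      // hypothetical
    props.setProperty("password", "...")  // hypothetical

    // Overwrite drops and recreates the target table before inserting
    df.write
      .mode(SaveMode.Overwrite)
      .jdbc("jdbc:sqlserver://dbhost;databaseName=mydb", "dbo.my_table", props)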

Re: Pros and Cons

2016-05-27 Thread Mich Talebzadeh
Hi Teng, what version of Spark are you using as the execution engine? Are you using a vendor's product here? thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Python memory included YARN-monitored memory?

2016-05-27 Thread Mike Sukmanowsky
Hi everyone, More of a YARN/OS question than a Spark one, but would be good to clarify this on the docs somewhere once I get an answer. We use PySpark for all our Spark applications running on EMR. Like many users, we're accustomed to seeing the occasional ExecutorLostFailure after YARN kills a

Re: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-27 Thread Aaron Ilovici
Mohammed, The Spark Connector for Vertica is still in Beta and while that is still an option I would prefer native support from Spark. Considering all data types seem to map with the aggregated dialect except for NULL types, I imagine the work involved would be relatively minimal. I would be

Re: Pros and Cons

2016-05-27 Thread Teng Qiu
I agree with Koert and Reynold, Spark works well with large datasets now. Back to the original discussion, comparing SparkSQL vs Hive in Spark vs the Spark API: for SparkSQL vs the Spark API, you can simply imagine you are in the RDBMS world; SparkSQL is pure SQL, and the Spark API is a language for writing stored

pyspark.GroupedData.agg works incorrectly when one column is aggregated twice?

2016-05-27 Thread Andrew Vykhodtsev
Dear list, I am trying to calculate sum and count on the same column: user_id_books_clicks = (sqlContext.read.parquet('hdfs:///projects/kaggle-expedia/input/train.parquet') .groupby('user_id') .agg({'is_booking':'count',

RE: GraphX Java API

2016-05-27 Thread Santoshakhilesh
GraphX APIs are available only in Scala. If you need to use GraphX you need to switch to Scala. From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com] Sent: 27 May 2016 19:59 To: user@spark.apache.org Subject: GraphX Java API Hi, We are trying to consume the Java API for

GraphX Java API

2016-05-27 Thread Kumar, Abhishek (US - Bengaluru)
Hi, We are trying to consume the Java API for GraphX, but there is no documentation available online on the usage or examples. It would be great if we could get some examples in Java. Thanks and regards, Abhishek Kumar Products & Services | iLab Deloitte Consulting LLP Block ‘C’, Divyasree

DIMSUM among 550k objects on AWS Elastic Map Reduce fails with OOM errors

2016-05-27 Thread nmoretto
Hello everyone, I am trying to compute the similarity between 550k objects using the DIMSUM algorithm available in Spark 1.6. The cluster runs on AWS Elastic Map Reduce and consists of 6 r3.2xlarge instances (one master and five cores), having 8 vCPU and 61 GiB of RAM each. My input data is a
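For reference, DIMSUM in MLlib is exposed as RowMatrix.columnSimilarities with a threshold; a minimal sketch (threshold value and input RDD are illustrative), noting that higher thresholds reduce the sampling and shuffle cost at the expense of missing low-similarity pairs:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // rowVectors: RDD[org.apache.spark.mllib.linalg.Vector], assumed;
    // each object to compare is a column of the matrix
    val mat = new RowMatrix(rowVectors)
    // DIMSUM sampling applies when the threshold is > 0
    val similarities = mat.columnSimilarities(0.5)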

Re: problem about RDD map and then saveAsTextFile

2016-05-27 Thread Christian Hellström
Internally, saveAsTextFile uses saveAsHadoopFile: https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala . The final bit in the method first creates the output path and then saves the data set. However, if
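A minimal sketch of one way to cope with the behaviour described, assuming a leftover output directory from a failed attempt should simply be discarded before re-running (variable names follow the question below):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val out = new Path(hdfsAddress)
    val fs = out.getFileSystem(sc.hadoopConfiguration)
    // remove whatever a previous failed attempt left behind so the save
    // does not abort with a misleading "path already exists" error
    if (fs.exists(out)) fs.delete(out, true)

    result.map(transform).saveAsTextFile(hdfsAddress)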

unsubscribe

2016-05-27 Thread Rakesh H (Marketing Platform-BLR)

problem about RDD map and then saveAsTextFile

2016-05-27 Thread Reminia Scarlet
Hi all: I've tried to execute something as below: result.map(transform).saveAsTextFile(hdfsAddress) Result is an RDD calculated from an MLlib algorithm. I submit this to YARN, and after two attempts the application failed. But the exception in the log is very misleading. It said hdfsAddress

unsubscribe

2016-05-27 Thread Vinoth Sankar
unsubscribe

Re: Logistic Regression in Spark Streaming

2016-05-27 Thread kundan kumar
Agreed, we have a logistic regression example. I was looking for a counterpart to "StreamingLinearRegressionWithSGD". On Fri, May 27, 2016 at 1:16 PM, Alonso Isidoro Roman wrote: > I do not have any experience using LR in spark, but you can see that LR is > already

Re: Logistic Regression in Spark Streaming

2016-05-27 Thread Alonso Isidoro Roman
I do not have any experience using LR in spark, but you can see that LR is already implemented in mllib. http://spark.apache.org/docs/latest/mllib-linear-methods.html Alonso Isidoro Roman https://about.me/alonso.isidoro.roman

Logistic Regression in Spark Streaming

2016-05-27 Thread kundan kumar
Hi, Do we have a streaming version of Logistic Regression in Spark? I can see it's there for Linear Regression. Has anyone used logistic regression on streaming data? It would be really helpful if you share your insights on how to train on the incoming data. In my use case I am trying to use
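Unless I'm mistaken, MLlib does ship a streaming counterpart, StreamingLogisticRegressionWithSGD, which mirrors the linear-regression API; a minimal sketch (the feature count and the input DStreams are assumptions):

    import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors

    val numFeatures = 10  // illustrative
    val model = new StreamingLogisticRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))

    // trainingStream, testStream: DStream[LabeledPoint], assumed
    model.trainOn(trainingStream)
    model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()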

submitMissingTasks - serialize throws StackOverflow exception

2016-05-27 Thread Michel Hubert
Hi, My Spark application throws stack overflow exceptions after a while. The DAGScheduler function submitMissingTasks tries to serialize a tuple (MapPartitionsRDD, EsSpark..saveToEs), which is handled with a recursive algorithm. The recursion is too deep and results in a stackoverflow
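If the depth comes from a long RDD lineage rather than from the closure itself, a common mitigation is periodic checkpointing so the object graph that gets serialized stays shallow; a minimal sketch (directory, interval and the iterative job shape are all assumptions):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical directory

    var rdd = initialRdd                            // assumed starting RDD
    for (i <- 1 to numIterations) {
      rdd = rdd.map(step)                           // `step` is a placeholder transformation
      if (i % 10 == 0) {
        rdd.checkpoint()
        rdd.count()  // force materialization so the lineage is actually truncated
      }
    }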