Re: Bintray replacement for spark-packages.org

2024-02-25 Thread Richard Eggert
I've been trying to obtain clarification on the terms of use regarding repo.spark-packages.org. I emailed feedb...@spark-packages.org two weeks ago, but have not heard back. Whom should I contact? On Mon, Apr 26, 2021 at 8:13 AM Bo Zhang wrote: > Hi Apache Spark users, > > As you might know,

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Richard Eggert
One advantage of RDDs over DataFrames is that RDDs allow you to use your own data types, whereas DataFrames are backed by RDDs of Row objects, which are pretty flexible but don't give you much in the way of compile-time type checking. If you have an RDD of case class elements or JSON, then
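
A minimal sketch of the contrast, assuming a Spark 1.x shell with sc and sqlContext in scope (the Person class and column names are illustrative):

    import org.apache.spark.rdd.RDD

    case class Person(name: String, age: Int)

    // The RDD keeps the element type, so this filter is checked at compile time
    val people: RDD[Person] = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 17)))
    val adults: RDD[Person] = people.filter(_.age >= 18)

    // DataFrame columns are only resolved at runtime; a typo in "age"
    // would not be caught until the query runs
    val df = sqlContext.createDataFrame(people)
    val adultsDf = df.filter(df("age") >= 18)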

Re: Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
Yes, I know, but it would be nice to be able to test things myself before I push commits. On Sun, Oct 25, 2015 at 3:50 PM, Ted Yu <yuzhih...@gmail.com> wrote: > If you have a pull request, Jenkins can test your change for you. > > FYI > > On Oct 25, 2015, at 12:43 PM, Richar

Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
When I try to start up sbt for the Spark build, or if I try to import it in IntelliJ IDEA as an sbt project, it fails with a "No such file or directory" error when it attempts to "git clone" sbt-pom-reader into .sbt/0.13/staging/some-sha1-hash. If I manually create the expected directory before

Re: Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
Also, if I run the Maven build on Windows or Linux without setting -DskipTests=true, it hangs indefinitely when it gets to org.apache.spark.JavaAPISuite. It's hard to test patches when the build doesn't work. :-/ On Sun, Oct 25, 2015 at 3:41 PM, Richard Eggert <richard.egg...@gmail.com>

Re: Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
On Sun, Oct 25, 2015 at 3:38 PM, Richard Eggert <richard.egg...@gmail.com> wrote: > When I try to start up sbt for the Spark build, or if I try to import it > in IntelliJ IDEA as an sbt project, it fails with a "No such file or > directory" error when it attempts to "git clone"

Re: Pass spark partition explicitly ?

2015-10-18 Thread Richard Eggert
If you want to override the default partitioning behavior, you have to do so in your code where you create each RDD. Different RDDs usually have different numbers of partitions (except when one RDD is directly derived from another without shuffling) because they usually have different sizes, so
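
For illustration, a hedged sketch of where that control lives (the path and partition counts are made up):

    // Partitioning is chosen where the RDD is created or explicitly reshaped
    val raw = sc.textFile("hdfs:///data/input", 64) // minimum-partition hint at load time
    val rebalanced = raw.repartition(128)           // full shuffle into 128 partitions
    val narrowed = rebalanced.coalesce(16)          // merge partitions without a shuffle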

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Richard Eggert
I think the problem may be that callUDF takes a DataType indicating the return type of the UDF as its second argument. On Oct 12, 2015 9:27 AM, "Umesh Kacha" wrote: > Hi if you can help it would be great as I am stuck and don't know how to > remove compilation error in callUdf
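
A sketch of what that overload looked like in the Spark 1.3-1.5 functions API (the DataFrame df and its "value" column are hypothetical; double-check the exact signature against your Spark version):

    import org.apache.spark.sql.functions.callUDF
    import org.apache.spark.sql.types.DoubleType

    // callUDF(function, returnType, column): the DataType is the second argument
    val doubled = df.select(callUDF((x: Double) => x * 2.0, DoubleType, df("value")))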

Re: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Richard Eggert
It's the same as joining 2. Join two together, and then join the third one to the result of that. On Oct 11, 2015 2:57 PM, "Subhajit Purkayastha" wrote: > Can I join 3 different RDDs together in a Spark SQL DF? I can find > examples for 2 RDDs but not 3. > > > > Thanks > > >
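
A minimal sketch of the chaining, with hypothetical DataFrames df1, df2, df3 sharing an "id" key:

    val joined = df1
      .join(df2, df1("id") === df2("id"))
      .join(df3, df1("id") === df3("id"))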

Re: Create hashmap using two RDD's

2015-10-10 Thread Richard Eggert
Do you need the HashMap for anything else besides writing out to a file? If not, there is really no need to create one at all. You could just keep everything as RDDs. On Oct 10, 2015 11:31 AM, "kali.tumm...@gmail.com" wrote: > Got it ..., created hashmap and saved it to

Re: Create hashmap using two RDD's

2015-10-10 Thread Richard Eggert
requirement is to get latest records using a key I think hash map is a > good choice for this task. > As of now data comes from third party and we are not sure what the > latest record is so hash map is chosen. > Is there anything better than hash map please let me know. > > Th
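
Sticking with RDDs, the latest-record-per-key requirement can be met with reduceByKey rather than a local HashMap; a sketch with made-up field names:

    case class Record(key: String, timestamp: Long, payload: String)

    // records: RDD[Record]; keep only the newest record for each key,
    // computed in parallel across the cluster
    val latest = records
      .map(r => (r.key, r))
      .reduceByKey((a, b) => if (a.timestamp >= b.timestamp) a else b)
      .values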

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread Richard Eggert
Since the Python API is built on top of the Scala implementation, its performance can be at best roughly the same as that of the Scala API (as in the case of DataFrames and SQL) and at worst several orders of magnitude slower. Likewise, since the Scala implementation of new features

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread Richard Eggert
That should have read "a lot of neat tricks", not "a lot of nest tricks". That's what I get for sending emails on my phone. On Oct 6, 2015 8:32 PM, "Richard Eggert" <richard.egg...@gmail.com> wrote: > Since the Python API is built on top of the Scala

RE: Why is 1 executor overworked and others sit idle?

2015-09-23 Thread Richard Eggert
1) Create a Cassandra RDD > > 2) Cache this RDD > > 3) Map it to CSV > > 4) Coalesce (because I need a single output file) > > 5) Write to file on local file system > > > > This makes sense. > > > > Thanks, > > > > Chirag >

Re: Why is 1 executor overworked and others sit idle?

2015-09-22 Thread Richard Eggert
If there's only one partition, by definition it will only be handled by one executor. Repartition to divide the work up. Note, however, that this will also result in multiple output files. If you absolutely need them to be combined into a single file, I suggest using the Unix/Linux 'cat' command
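
A sketch of that approach (paths and counts are illustrative): write in parallel, then concatenate the part files outside Spark rather than forcing everything through one executor with coalesce(1).

    val csv = data.repartition(32).map(recordToCsvLine) // recordToCsvLine is a placeholder
    csv.saveAsTextFile("hdfs:///out/csv")               // produces part-00000 ... part-00031
    // afterwards, outside Spark: cat part-* > combined.csv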

Re: Partitions on RDDs

2015-09-22 Thread Richard Eggert
In general, RDDs get partitioned automatically without programmer intervention. You generally don't need to worry about them unless you need to adjust the size/number of partitions or the partitioning scheme according to the needs of your application. Partitions get redistributed among nodes

Re: Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread Richard Eggert
Maybe it's just my phone, but I don't see any code. On Sep 22, 2015 11:46 AM, "juljoin" wrote: > Hello, > > I am trying to figure Spark out and I still have some problems with its > speed that I can't figure out. In short, I wrote two programs that loop > through a

Re: PrunedFilteredScan does not work for UDTs and Struct fields

2015-09-20 Thread Richard Eggert
GreaterThan(C, X). You then can > programmatically convert C to a.c. Note that in the buildScan required > columns would also have an extra column C you need to return in the > buildScan RDD. > > > It looks complicated, but I think it would work. > > > Thanks. > >

PrunedFilteredScan does not work for UDTs and Struct fields

2015-09-19 Thread Richard Eggert
I defined my own relation (extending BaseRelation) and implemented the PrunedFilteredScan interface, but discovered that if the column referenced in a WHERE = clause is a user-defined type or a field of a struct column, then Spark SQL passes NO filters to the PrunedFilteredScan.buildScan method,
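
For reference, a skeletal version of the kind of relation being described, against the Spark 1.5-era sources API (the schema and scan body are placeholders, not the actual code behind this report):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    class MyRelation(override val sqlContext: SQLContext)
        extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType =
        StructType(Seq(StructField("c", StringType)))

      // Per the report above, when "c" is a UDT or a struct field, the
      // filters array arrives empty instead of carrying the WHERE condition
      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] =
        sqlContext.sparkContext.parallelize(Seq(Row("example")))
    }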

Re: RDD transformation and action running out of memory

2015-09-12 Thread Richard Eggert
Hmm... The count() method invokes this:

    def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
      runJob(rdd, func, 0 until rdd.partitions.length)
    }

It appears that you're running out of memory while trying to compute (within the driver) the number of partitions that will

Re: Multithreaded vs Spark Executor

2015-09-12 Thread Richard Eggert
Parallel processing is what Spark was made for. Let it do its job. Spawning your own threads independently of what Spark is doing seems like you'd just be asking for trouble. I think you can accomplish what you want by taking the cartesian product of the data element RDD and the feature list RDD
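
A hedged sketch of the cartesian-product idea, with toy data and a placeholder scoring function:

    def score(element: String, feature: Int): Double =
      element.length.toDouble * feature // placeholder logic

    val elements = sc.parallelize(Seq("a", "bb", "ccc"))
    val features = sc.parallelize(Seq(1, 2, 3))

    // Spark distributes every (element, feature) pair itself; no hand-rolled threads
    val scores = elements.cartesian(features)
      .map { case (e, f) => (e, f, score(e, f)) }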

UserDefinedTypes

2015-09-11 Thread Richard Eggert
Greetings, I have recently started using Spark SQL and ran up against two rather odd limitations related to UserDefinedTypes. The first is that there appears to be no way to register a UserDefinedType other than by adding the @SQLUserDefinedType annotation to the class being mapped. This makes
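
The annotation-based registration looks roughly like this (a simplified sketch; the serialize/deserialize bodies gloss over the internal row format, which differs between Spark versions):

    import org.apache.spark.sql.types._

    @SQLUserDefinedType(udt = classOf[PointUDT])
    class Point(val x: Double, val y: Double)

    class PointUDT extends UserDefinedType[Point] {
      override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
      override def serialize(obj: Any): Any = obj match {
        case p: Point => Seq(p.x, p.y)
      }
      override def deserialize(datum: Any): Point = datum match {
        case Seq(x: Double, y: Double) => new Point(x, y)
      }
      override def userClass: Class[Point] = classOf[Point]
    }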

Re: Implement "LIKE" in SparkSQL

2015-09-11 Thread Richard Eggert
concat and locate are available as of version 1.5.0, according to the Scaladocs. For earlier versions of Spark, and for the operations that are still not supported, it's pretty straightforward to define your own UserDefinedFunctions in either Scala or Java (I don't know about other languages).
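
A sketch of both routes, with hypothetical DataFrame and table names:

    import org.apache.spark.sql.functions.{lit, udf}

    // as a Column-level UDF
    val containsUdf = udf((s: String, sub: String) => s != null && s.contains(sub))
    val matched = df.filter(containsUdf(df("name"), lit("spark")))

    // or registered for use inside SQL text
    sqlContext.udf.register("contains2", (s: String, sub: String) => s.contains(sub))
    sqlContext.sql("SELECT * FROM people WHERE contains2(name, 'spark')")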