[apache-spark] documentation on File Metadata _metadata struct

2024-01-10 Thread Jason Horner
All, the only documentation about the File Metadata (hidden _metadata struct) I can seem to find is on the Databricks website https://docs.databricks.com/en/ingestion/file-metadata-column.html#file-metadata-column. For reference, here is the struct: _metadata: struct (nullable = false) |-- file_path:
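
For reference, a minimal sketch of reading the hidden column (assuming Spark 3.3+, a file-based source, and an existing SparkSession named spark; the path is illustrative):

    import org.apache.spark.sql.functions.col

    // The _metadata column is hidden until it is selected explicitly.
    val df = spark.read.parquet("/data/events")

    df.select(col("*"), col("_metadata")).show(truncate = false)

    // Or pick individual fields out of the struct:
    df.select(col("_metadata.file_path"), col("_metadata.file_size"))
      .show(truncate = false)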

Spark 3 migration question

2022-05-18 Thread Jason Xu
Hi Spark user group, Spark 2.4 to 3 migration for existing Spark jobs seems like a big challenge given the long list of changes in the migration guide; they could introduce failures or output changes related to

[Announcement] Analytics Zoo 0.11.0 release

2021-07-21 Thread Jason Dai
etc.) - Enhancements to Orca (scaling TF/PyTorch models to distributed Big Data) for end-to-end computer vision pipelines (distributed image preprocessing, training and inference); for more information, please see our CVPR 2021 tutorial <https://jason-dai.github.io/cvpr2021/>

[Announcement] Analytics Zoo 0.10.0 release

2021-04-28 Thread Jason Dai
nmodified Apache Spark/Flink and TF/PyTorch/BigDL/OpenVINO in a secure fashion on cloud Thanks, -Jason

Fwd: [Announcement] Analytics Zoo 0.8 release

2020-04-27 Thread Jason Dai
FYI :-) -- Forwarded message - From: Jason Dai Date: Tue, Apr 28, 2020 at 10:31 AM Subject: [Announcement] Analytics Zoo 0.8 release To: BigDL User Group Hi all, We are happy to announce the 0.8 release of Analytics Zoo <https://github.com/intel-analytics/analytics-

[Announcement] Analytics Zoo 0.8 release

2020-04-27 Thread Jason Dai
nalytics-zoo.github.io/0.8.1/#gettingstarted/> page. Thanks, -Jason

Re: [Announcement] Analytics Zoo 0.7.0 release

2020-01-20 Thread Jason Dai
Fixed one typo below: should be TensorFlow 1.15 support in tfpark :-) On Tue, Jan 21, 2020 at 7:52 AM Jason Dai wrote: > Hi all, > > > > We are happy to announce the 0.7.0 release of Analytics Zoo > <https://github.com/intel-analytics/analytics-zoo/>, a unified Data >

[Announcement] Analytics Zoo 0.7.0 release

2020-01-20 Thread Jason Dai
el-analytics/analytics-zoo/ Thanks, -Jason

Re: Why Apache Spark doesn't use Calcite?

2020-01-13 Thread Jason Nerothin
-- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Thanks, Jason

alternatives to shading

2019-12-17 Thread Jason Nerothin
are considering implementing our own ClassLoaders and/or rebuilding and shading the spark distribution. Are there better alternatives? -- Thanks, Jason
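
One hedged sketch of the conventional alternative (shading the application's own assembly rather than rebuilding Spark), assuming sbt with the sbt-assembly plugin; the guava package is only an illustration of a typical conflict, not taken from the thread:

    // build.sbt
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.google.common.**" -> "myshaded.guava.@1").inAll
    )

    // Spark stays "provided" so the cluster's copy is used at runtime.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4" % "provided"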

[Announcement] Analytics Zoo 0.5 release

2019-06-17 Thread Jason Dai
releases. For more details, you may refer to the project website at https://github.com/intel-analytics/analytics-zoo/ Thanks, -Jason

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
(current behavior). By analogy to the halting problem <https://en.wikipedia.org/wiki/Chaitin%27s_constant#Relationship_to_the_halting_problem>, I believe that expecting a program to handle all possible exceptional states is unreasonable. Jm2c Jason On Tue, May 21, 2019 at 9:30 AM bsi

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
Correction: The Driver manages the Tasks, the resource manager serves up resources to the Driver or Task. On Tue, May 21, 2019 at 9:11 AM Jason Nerothin wrote: > The behavior is a deliberate design decision by the Spark team. > > If Spark were to "fail fast", it would prev

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
You might also look at the way the Apache Livy <https://livy.incubator.apache.org/> team is implementing their solution. HTH Jason On Tue, May 21, 2019 at 6:04 AM bsikander wrote: > Ok, I found the reason. > > In my QueueStream example, I have a while(true) which keeps on adding t

Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-15 Thread Jason Nerothin
rs) - to provide a single abstraction that works across Scala, Java, Python, and R. But what they came up with required the APIs you list to make it work.) Think carefully about what new things you're trying to provide and what things you're trying to hide beneath your abstraction. HTH Jason On W

Re: BigDL and Analytics Zoo talks at upcoming Spark+AI Summit and Strata London

2019-05-14 Thread Jason Dai
The slides for the talks have been uploaded to https://analytics-zoo.github.io/master/#presentations/. Thanks, -Jason On Fri, Apr 19, 2019 at 9:55 PM Khare, Ankit wrote: > Thanks for sharing. > > Sent from my iPhone > > On 19. Apr 2019, at 01:35, Jason Dai wrote: > > Hi

Re: Streaming job, catch exceptions

2019-05-12 Thread Jason Nerothin
ubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Thanks, Jason

Re: Deep Learning with Spark, what is your experience?

2019-05-05 Thread Jason Dai
of Keras 1.2.2 on Spark (using BigDL). Thanks, -Jason On Mon, May 6, 2019 at 5:37 AM Riccardo Ferrari wrote: > Thanks everyone, I really appreciate your contributions here. > > @Jason, thanks for the references I'll take a look. Quickly checking > github: > https://github.com/intel-anal

Re: Deep Learning with Spark, what is your experience?

2019-05-05 Thread Jason Dai
tps://software.intel.com/en-us/articles/talroo-uses-analytics-zoo-and-aws-to-leverage-deep-learning-for-job-recommendations> Thanks, -Jason On Sun, May 5, 2019 at 6:29 AM Riccardo Ferrari wrote: > Thank you for your answers! > > While it is clear each DL framework can solve the di

Re: Handle Null Columns in Spark Structured Streaming Kafka

2019-04-29 Thread Jason Nerothin
See also here: https://stackoverflow.com/questions/44671597/how-to-replace-null-values-with-a-specific-value-in-dataframe-using-spark-in-jav On Mon, Apr 29, 2019 at 5:27 PM Jason Nerothin wrote: > Spark SQL has had an na.fill function on it since at least 2.1. Would that > work f
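
A minimal sketch of the na.fill approach mentioned above (df, the column names, and the replacement values are illustrative assumptions):

    val cleaned = df
      .na.fill("unknown", Seq("category"))            // string columns
      .na.fill(0L, Seq("quantity"))                   // numeric columns
      .na.fill(Map("price" -> 0.0, "note" -> "n/a"))  // per-column values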

Re: Handle Null Columns in Spark Structured Streaming Kafka

2019-04-29 Thread Jason Nerothin
t; Is there any way to override this , other than using na.fill functions >> >> Regards, >> Snehasish >> > -- Thanks, Jason

Re: Update / Delete records in Parquet

2019-04-22 Thread Jason Nerothin
Hi Chetan, Do you have to use Parquet? It just feels like it might be the wrong sink for a high-frequency change scenario. What are you trying to accomplish? Thanks, Jason On Mon, Apr 22, 2019 at 2:09 PM Chetan Khatri wrote: > Hello All, > > If I am doing incremental load / delta

Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread Jason Nerothin
$SPARK_HOME/jars and usually you're implementing against just a few of the essential ones like spark-sql. --jars is what to use for spark-shell. Final related tidbit: If you're implementing in Scala, make sure your jars are version-compatible with the Scala compiler version (2.11 as of Spark 2.4

BigDL and Analytics Zoo talks at upcoming Spark+AI Summit and Strata London

2019-04-18 Thread Jason Dai
ark and BigDL <https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/74077> at *Strata Data Conference in London* (16:35-17:15pm, Wednesday, 1 May 2019) If you plan to attend these events, please drop by and talk to the speakers :-) Thanks, -Jason

Re: Spark2: Deciphering saving text file name

2019-04-09 Thread Jason Nerothin
each write action completes. HTH, Jason On Mon, Apr 8, 2019 at 19:55 Subash Prabakar wrote: > Hi, > While saving in Spark2 as text file - I see encoded/hash value attached in > the part files along with part number. I am curious to know what is that > value is about ?

Re: Structured streaming flatMapGroupWithState results out of order messages when reading from Kafka

2019-04-09 Thread Jason Nerothin
pWithState with Kafka > source? > > This is my pipeline: > > Kafka => GroupByKey(key from Kafka schema) => flatMapGroupWithState => > parquet > > When I printed out the Kafka offset for each key inside my state update > function they are not in order. I am using spark 2.3.3. > > Thanks & Regards, > Akila > > > > -- Thanks, Jason

Re: Checking if cascading graph computation is possible in Spark

2019-04-05 Thread Jason Nerothin
am not sure how that helps here > > On Fri, 5 Apr 2019, 10:13 pm Jason Nerothin, > wrote: > >> Have you looked at Arbitrary Stateful Streaming and Broadcast >> Accumulators? >> >> On Fri, Apr 5, 2019 at 10:55 AM Basavaraj wrote: >> >>> Hi

Re: combineByKey

2019-04-05 Thread Jason Nerothin
(x.Id, x.value))).aggregateByKey(Set[String]())( > (aggr, value) => aggr ++ Set(value._2), > (aggr1, aggr2) => aggr1 ++ aggr2).collect().toMap > > print(result) > > Map(0-d1 -> Set(t1, t2, t3, t4), 0-d2 -> Set(t1, t5, t6, t2), 0-d3 -> > Set(t1, t2

Re: Checking if cascading graph computation is possible in Spark

2019-04-05 Thread Jason Nerothin
y pointers, constructive criticism > > Regards > Basav > - To > unsubscribe e-mail: user-unsubscr...@spark.apache.org -- Thanks, Jason

Re: combineByKey

2019-04-05 Thread Jason Nerothin
et[String]) => accumulator1 ++ > accumulator2 > > sc.parallelize(messages).map(x => (x.timeStamp+"-"+x.id, (x.id, > x.value))).combineByKey(createCombiner, mergeValue, mergeCombiner) > > *Compile Error:-* > found : (String, String) => scala.collection.immutable.Set[String] > required: ((String, String)) => ? > sc.parallelize(messages).map(x => (x.timeStamp+"-"+x.id, (x.id, > x.value))).combineByKey(createCombiner, mergeValue, mergeCombiner) > > Regards, > Rajesh > > -- Thanks, Jason
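
The compile error above comes from combineByKey's first argument taking the value type as a single parameter; since the value here is a tuple (id, value), createCombiner must accept one tuple rather than two Strings. A sketch of the corrected signatures (the record type, field names, and the spark-shell SparkContext sc are assumptions):

    case class Msg(timeStamp: String, id: String, value: String)

    val messages = Seq(Msg("0", "d1", "t1"), Msg("0", "d1", "t2"), Msg("0", "d2", "t1"))

    // createCombiner: V => C, where V is the tuple (id, value)
    val createCombiner = (v: (String, String)) => Set(v._2)
    val mergeValue     = (acc: Set[String], v: (String, String)) => acc + v._2
    val mergeCombiners = (a: Set[String], b: Set[String]) => a ++ b

    val result = sc.parallelize(messages)
      .map(x => (x.timeStamp + "-" + x.id, (x.id, x.value)))
      .combineByKey(createCombiner, mergeValue, mergeCombiners)
      .collect()
      .toMap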

Re: reporting use case

2019-04-04 Thread Jason Nerothin
lcome. > > > Thanks, > Prasad > > > > Thanks, > Prasad > -- Thanks, Jason

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Jason Nerothin
My thinking is that if you run everything in one partition - say 12 GB - then you don't experience the partitioning problem - one partition will have all duplicates. If that's not the case, there are other options, but would probably require a design change. On Thu, Apr 4, 2019 at 8:46 AM Jason

Re: Question about relationship between number of files and initial tasks(partitions)

2019-04-04 Thread Jason Nerothin
immediately > by phone or email and permanently delete this Communication from your > computer without making a copy. Thank you. -- Thanks, Jason

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Jason Nerothin
lter(df['update_time'] == df['wanted_time']) >>>>>> .drop('wanted_time').dropDuplicates('invoice_id', 'update_time') >>>>>> >>>>>> The min() is faster than doing an orderBy() and a row_number(). >>>>>> And the dropDuplicates at the end ensures records with two values for >>>>>> the same 'update_time' don't cause issues. >>>>>> >>>>>> >>>>>> On Thu, Apr 4, 2019 at 10:22 AM Chetan Khatri < >>>>>> chetan.opensou...@gmail.com> wrote: >>>>>> >>>>>>> Hello Dear Spark Users, >>>>>>> >>>>>>> I am using dropDuplicate on a DataFrame generated from large parquet >>>>>>> file from(HDFS) and doing dropDuplicate based on timestamp based column, >>>>>>> every time I run it drops different - different rows based on same >>>>>>> timestamp. >>>>>>> >>>>>>> What I tried and worked >>>>>>> >>>>>>> val wSpec = Window.partitionBy($"invoice_ >>>>>>> id").orderBy($"update_time".desc) >>>>>>> >>>>>>> val irqDistinctDF = irqFilteredDF.withColumn("rn", >>>>>>> row_number.over(wSpec)).where($"rn" === 1) >>>>>>> .drop("rn").drop("update_time") >>>>>>> >>>>>>> But this is damn slow... >>>>>>> >>>>>>> Can someone please throw a light. >>>>>>> >>>>>>> Thanks >>>>>>> >>>>>>> -- Thanks, Jason
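
A cleaned-up sketch of the two approaches quoted in this thread (column and DataFrame names follow the quoted code; it assumes a SparkSession named spark, and whether you aggregate max or min depends on which record you want to keep):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import spark.implicits._

    // Window approach: keep one row per invoice_id, ordered by update_time.
    val wSpec = Window.partitionBy($"invoice_id").orderBy($"update_time".desc)

    val distinctDF = irqFilteredDF
      .withColumn("rn", row_number().over(wSpec))
      .where($"rn" === 1)
      .drop("rn")

    // Aggregation alternative: compute the wanted update_time per invoice_id,
    // filter on it, then dropDuplicates to break ties.
    val wanted = irqFilteredDF
      .groupBy($"invoice_id")
      .agg(max($"update_time").as("wanted_time"))

    val deduped = irqFilteredDF
      .join(wanted, Seq("invoice_id"))
      .where($"update_time" === $"wanted_time")
      .drop("wanted_time")
      .dropDuplicates("invoice_id", "update_time")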

Re: Upcoming talks on BigDL and Analytics Zoo this week

2019-04-03 Thread Jason Dai
The slides of the two technical talks for BigDL and Analytics Zoo ( https://github.com/intel-analytics/analytics-zoo/) at Strata Data Conference have been uploaded to https://analytics-zoo.github.io/master/#presentations/. Thanks, -Jason On Sun, Mar 24, 2019 at 9:03 PM Jason Dai wrote: >

Re: How to extract data in parallel from RDBMS tables

2019-04-02 Thread Jason Nerothin
er of tables. > > > On Fri, Mar 29, 2019 at 5:04 AM Jason Nerothin > wrote: > >> How many tables? What DB? >> >> On Fri, Mar 29, 2019 at 00:50 Surendra , Manchikanti < >> surendra.manchika...@gmail.com> wrote: >> >>> Hi Jason, >>> &g

Re: Spark SQL API taking longer time than DF API.

2019-03-30 Thread Jason Nerothin
-- > > > As per my understanding, both Spark SQL and DataFrame API generate the > same code under the hood and execution time has to be similar. > > > Regards, > > Neeraj > > > -- Thanks, Jason

Re: How to extract data in parallel from RDBMS tables

2019-03-29 Thread Jason Nerothin
How many tables? What DB? On Fri, Mar 29, 2019 at 00:50 Surendra , Manchikanti < surendra.manchika...@gmail.com> wrote: > Hi Jason, > > Thanks for your reply, But I am looking for a way to parallelly extract > all the tables in a Database. > > > On Thu, Mar 28, 201

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Jason Nerothin
Meant this one: https://docs.databricks.com/api/latest/jobs.html On Thu, Mar 28, 2019 at 5:06 PM Pat Ferrel wrote: > Thanks, are you referring to > https://github.com/spark-jobserver/spark-jobserver or the undocumented > REST job server included in Spark? > > > From: Jason

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Jason Nerothin
with 2 executors. The link for running > application “name” goes back to my server, the machine that launched the > job. > > > > This is spark.submit.deployMode = “client” according to the docs. I set > the Driver to run on the cluster but it runs on the client, ignoring the > spark.submit.deployMode. > > > > Is this as expected? It is documented nowhere I can find. > > > > > -- > Marcelo > > -- Thanks, Jason

Re: How to extract data in parallel from RDBMS tables

2019-03-28 Thread Jason Nerothin
wrote: > Hi All, > > Is there any way to copy all the tables in parallel from RDBMS using > Spark? We are looking for a functionality similar to Sqoop. > > Thanks, > Surendra > > -- Thanks, Jason
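
A sketch of one way to do this with plain Spark JDBC (connection details, table names, partition column, and bounds are illustrative assumptions): parallelism within a table comes from JDBC partitioning, and parallelism across tables from submitting one job per table.

    import java.util.Properties

    val url = "jdbc:postgresql://dbhost:5432/mydb"
    val props = new Properties()
    props.setProperty("user", "etl")
    props.setProperty("password", "secret")

    val tables = Seq("orders", "customers", "payments")

    // .par (Scala parallel collections) submits the per-table jobs concurrently.
    tables.par.foreach { t =>
      spark.read
        .jdbc(url, t, "id", 1L, 10000000L, 8, props) // partition column, bounds, numPartitions
        .write.mode("overwrite").parquet(s"/warehouse/raw/$t")
    }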

streaming - absolute maximum

2019-03-25 Thread Jason Nerothin
during streaming operation. Do I write two flavors of the query - one as a static Dataset for initiation and another for realtime? Is my logic incorrect? Thanks, Jason -- Thanks, Jason

Upcoming talks on BigDL and Analytics Zoo this week

2019-03-24 Thread Jason Dai
rata/strata-ca/public/schedule/detail/72802> at *Strata Data Conference in San Francisco* (March 28, 3:50–4:30pm) If you plan to attend these events, please drop by and talk to the speakers :-) Thanks, -Jason

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-14 Thread Jason Boorn
Ok great I’ll give that a shot - Thanks for all the help > On Apr 14, 2018, at 12:08 PM, Gene Pang <gene.p...@gmail.com> wrote: > > Yes, I think that is the case. I haven't tried that before, but it should > work. > > Thanks, > Gene > > On Fri, Apr 13, 2

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
on this - I feel like my particular use case is clouding what is already a tricky issue. > On Apr 13, 2018, at 2:26 PM, Gene Pang <gene.p...@gmail.com> wrote: > > Hi Jason, > > Alluxio does work with Spark in master=local mode. This is because both > spark-submit and spark

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
Ok thanks - I was basing my design on this: https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html Wherein it says: Once the SparkSession is instantiated, you can

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
it is not. Once again - I’m trying to solve the use case for master=local, NOT for a cluster and NOT with spark-submit. > On Apr 13, 2018, at 12:47 PM, yohann jardin <yohannjar...@hotmail.com> wrote: > > Hey Jason, > Might be related to what is behind your variable AL

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
I have setup some options using `.config("Option", > "value")` when creating the spark session, and then other runtime options as > you describe above with `spark.conf.set`. At this point though I've just > moved everything out into a `spark-submit` script. > > O

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
Hi Geoff - Appreciate the help here - I do understand what you’re saying below. And I am able to get this working when I submit a job to a local cluster. I think part of the issue here is that there’s ambiguity in the terminology. When I say “LOCAL” spark, I mean an instance of spark that is

[Spark SQL] How to run a custom meta query for `ANALYZE TABLE`

2018-01-02 Thread Jason Heo
manually. Is there any API exposed to handle metastore_db (e.g. insert/delete meta db)? Regards, Jason

spark.pyspark.python is ignored?

2017-06-29 Thread Jason White
According to the documentation, `spark.pyspark.python` configures which python executable is run on the workers. It seems to be ignored in my simple test case. I'm running on a pip-installed PySpark 2.1.1, completely stock. The only customization at this point is my Hadoop configuration directory.

Create dataset from dataframe with missing columns

2017-06-14 Thread Tokayer, Jason M.
Is it possible to concisely create a dataset from a dataframe with missing columns? Specifically, suppose I create a dataframe with: val df: DataFrame = Seq(("v1"),("v2")).toDF("f1") Then, I have a case class for a dataset defined as: case class CC(f1: String, f2: Option[String] = None) I’d
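
One concise option (an assumption, not necessarily the thread's resolution) is to add the missing column as a typed null before converting to a Dataset; a sketch assuming a SparkSession named spark:

    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.StringType
    import spark.implicits._

    case class CC(f1: String, f2: Option[String] = None)

    val df = Seq("v1", "v2").toDF("f1")

    // The missing f2 becomes a null string column, which maps to None in the case class.
    val ds = df.withColumn("f2", lit(null).cast(StringType)).as[CC]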

Case class with POJO - encoder issues

2017-02-11 Thread Jason White
I'd like to create a Dataset using some classes from Geotools to do some geospatial analysis. In particular, I'm trying to use Spark to distribute the work based on ID and label fields that I extract from the polygon data. My simplified case class looks like this: implicit val geometryEncoder:
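
One common workaround (an assumption about where the thread might have landed) is to fall back to a Kryo encoder for the whole record; a minimal sketch, with a placeholder class standing in for the GeoTools geometry type:

    import org.apache.spark.sql.{Encoder, Encoders}

    // Placeholder for the real GeoTools geometry class.
    class Geometry(val wkt: String) extends Serializable

    case class LabeledShape(id: Long, label: String, shape: Geometry)

    // Kryo-encoding the whole record sidesteps the product-encoder problem
    // (Spark cannot derive a field encoder for an arbitrary POJO), at the
    // cost of storing each row as a single binary column.
    implicit val shapeEncoder: Encoder[LabeledShape] = Encoders.kryo[LabeledShape]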

[ML] Converting ml.DenseVector to mllib.Vector

2016-12-30 Thread Jason Wolosonovich
with `Statistics`? Thanks! -Jason - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

UDAF collect_list: Hive Query or spark sql expression

2016-09-23 Thread Jason Mop
Hi Spark Team, I see most Hive functions have been implemented as Spark SQL expressions, but collect_list is still using a Hive query; will it also be implemented as an Expression in the future? Any update? Cheers, Ming

Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Jason
to look into custom DataSet encoders though I'm not sure what kind of gain (if any) you'd get with that approach. Jason On Fri, Jun 17, 2016, 12:38 PM Everett Anderson <ever...@nuna.com.invalid> wrote: > Hi, > > I have a system with files in a variety of non-standard input fo

Re: DeepSpark: where to start

2016-05-05 Thread Jason Nerothin
Just so that there is no confusion, there is a Spark user interface project called DeepSense that is actually useful: http://deepsense.io I am not affiliated with them in any way... On Thu, May 5, 2016 at 9:42 AM, Joice Joy wrote: > What the heck, I was already beginning

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
s to restart from the last > commited offset > > I understand that starting up a post crash job would work. > > Question is: how can we detect when DC2 crashes to start a new job ? > > dynamic topic partition (at each kafkaRDD creation for instance) + topic > subscr

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
will allow us to restart from the last > commited offset > > I understand that starting up a post crash job would work. > > Question is: how can we detect when DC2 crashes to start a new job ? > > dynamic topic partition (at each kafkaRDD creation for instance) +

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
ting up a post-crash job in a new data center > isn't really different from starting up a post-crash job in the original data > center. > > On Tue, Apr 19, 2016 at 3:32 AM, Erwan ALLAIN <eallain.po...@gmail.com > <mailto:eallain.po...@gmail.com>> wrote: > Thanks Jaso

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Jason Nerothin
Hi Erwan, You might consider InsightEdge: http://insightedge.io <http://insightedge.io/> . It has the capability of doing WAN between data grids and would save you the work of having to re-invent the wheel. Additionally, RDDs can be shared between developers in the same DC. Thanks,

Re: drools on spark, how to reload rule file?

2016-04-18 Thread Jason Nerothin
Could you post some pseudo-code? val data = rdd.whatever(…) val facts: Array[DroolsCompatibleType] = convert(data) facts.map{ f => ksession.insert( f ) } > On Apr 18, 2016, at 9:20 AM, yaoxiaohua <yaoxiao...@outlook.com> wrote: > > Thanks for your reply, Jason, > I can

Re: drools on spark, how to reload rule file?

2016-04-18 Thread Jason Nerothin
with a unique id. To get the existing rules to be evaluated when we wanted, we kept a property on each fact that we mutated each time. Hackery, but it worked. I recommend you try hard to use a stateless KB, if it is possible. Thank you. Jason // brevity and poor typing by iPhone > On Apr 18, 2016, at

Re: local class incompatible: stream classdesc serialVersionUID

2016-01-29 Thread Jason Plurad
UID. > > Cheers > > On Thu, Jan 28, 2016 at 1:38 PM, Jason Plurad <plur...@gmail.com> wrote: > >> I've searched through the mailing list archive. It seems that if you try >> to run, for example, a Spark 1.5.2 program against a Spark 1.5.1 standalone >

local class incompatible: stream classdesc serialVersionUID

2016-01-28 Thread Jason Plurad
I've searched through the mailing list archive. It seems that if you try to run, for example, a Spark 1.5.2 program against a Spark 1.5.1 standalone server, you will run into an exception like this: WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0,
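
A common way to avoid this class of mismatch (a general sketch, not the thread's resolution) is to pin the application's Spark dependency to the exact cluster version and mark it provided:

    // build.sbt: compile against the cluster's exact Spark version and let
    // the cluster supply the jars at runtime.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"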

Efficient join multiple times

2016-01-08 Thread Jason White
I'm trying to join a constant large-ish RDD to each RDD in a DStream, and I'm trying to keep the join as efficient as possible so each batch finishes within the batch window. I'm using PySpark on 1.6. I've tried the trick of keying the large RDD into (k, v) pairs and using

Re: Upgrading Spark in EC2 clusters

2015-11-12 Thread Jason Rubenstein
wrote one so we could upgrade to a specific version of Spark (via commit-hash) and used it to upgrade from 1.4.1. to 1.5.0 best, Jason On Thu, Nov 12, 2015 at 9:49 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > spark-ec2 does not offer a way to upgrade an existing cluster

Re: PySpark + Streaming + DataFrames

2015-11-02 Thread Jason White
o, for other folks who may read this, could reply back > with the trusted conversion that worked for you (for a clear solution)? > > TD > > > On Mon, Oct 19, 2015 at 3:08 PM, Jason White <jason.wh...@shopify.com> > wrote: > >> Ah, that makes sense then, thanks

Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Jason White
Ah, that makes sense then, thanks TD. The conversion from RDD -> DF involves a `.take(10)` in PySpark, even if you provide the schema, so I was avoiding back-and-forth conversions. I’ll see if I can create a ‘trusted’ conversion that doesn’t involve the `take`. --  Jason On October 19, 2

PySpark + Streaming + DataFrames

2015-10-16 Thread Jason White
I'm trying to create a DStream of DataFrames using PySpark. I receive data from Kafka in the form of a JSON string, and I'm parsing these RDDs of Strings into DataFrames. My code is: I get the following error at pyspark/streaming/util.py, line 64: I've verified that the sqlContext is properly

Re: PySpark + Streaming + DataFrames

2015-10-16 Thread Jason White
Hi Ken, thanks for replying. Unless I'm misunderstanding something, I don't believe that's correct. Dstream.transform() accepts a single argument, func. func should be a function that accepts a single RDD, and returns a single RDD. That's what transform_to_df does, except the RDD it returns is a

PySpark Checkpoints with Broadcast Variables

2015-09-29 Thread Jason White
I'm having trouble loading a streaming job from a checkpoint when a broadcast variable is defined. I've seen the solution by TD in Scala ( https://issues.apache.org/jira/browse/SPARK-5206) that uses a singleton to get/create an accumulator, but I can't seem to get it to work in PySpark with a

Re: Dynamic lookup table

2015-08-28 Thread Jason
your updates are atomic (transactions or CAS semantics) or you could run into race condition problems. Jason On Fri, Aug 28, 2015 at 11:39 AM N B nb.nos...@gmail.com wrote: Hi all, I have the following use case that I wanted to get some insight on how to go about doing in Spark Streaming

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Jason
require more than 1000 reducers on my 10TB memory cluster (~7GB of spill per reducer). I'm now wondering if my shuffle partitions are uneven and I should use a custom partitioner, is there a way to get stats on the partition sizes from Spark ? On Fri, Aug 28, 2015 at 12:46 PM, Jason ja

Re: Getting number of physical machines in Spark

2015-08-28 Thread Jason
I've wanted similar functionality too: when network IO bound (for me I was trying to pull things from s3 to hdfs) I wish there was a `.mapMachines` api where I wouldn't have to try guess at the proper partitioning of a 'driver' RDD for `sc.parallelize(1 to N, N).map( i= pull the i'th chunk from S3

Re: Alternative to Large Broadcast Variables

2015-08-28 Thread Jason
You could try using an external key value store (like HBase, Redis) and perform lookups/updates inside of your mappers (you'd need to create the connection within a mapPartitions code block to avoid the connection setup/teardown overhead)? I haven't done this myself though, so I'm just throwing
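
A sketch of the pattern described above; Rec, records (an RDD[Rec]), KeyValueClient, and the host are hypothetical stand-ins for the real record type and an HBase/Redis client:

    case class Rec(key: String, payload: String)

    val enriched = records.mapPartitions { iter =>
      val client = KeyValueClient.connect("kv-host:6379") // one connection per partition
      try {
        // Materialize before closing so no lookup runs against a closed
        // client; this buffers one partition in memory.
        iter.map(rec => (rec, client.get(rec.key))).toList.iterator
      } finally {
        client.close()
      }
    }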

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Jason
I had similar problems to this (reduce side failures for large joins (25bn rows with 9bn)), and found the answer was to push spark.sql.shuffle.partitions up beyond 1000. In my case, 16k partitions worked for me, but your tables look a little denser, so you may want to go even higher. On Thu, Aug
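
For reference, a one-line sketch of the setting being discussed (the value is illustrative; sqlContext is the Spark 1.x entry point, spark.conf.set the Spark 2+ equivalent):

    sqlContext.setConf("spark.sql.shuffle.partitions", "16000")
    // Spark 2+: spark.conf.set("spark.sql.shuffle.partitions", 16000)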

Persisting sorted parquet tables for future sort merge joins

2015-08-26 Thread Jason
, Jason

Persisting sorted parquet tables for future sort merge joins

2015-08-25 Thread Jason
] [ TungstenExchange hashpartitioning(CHROM#28399,pos#28332)] [ConvertToUnsafe] [ Scan ParquetRelation[file:exploded_sorted.parquet][pos#2.399]] Thanks, Jason

Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
DataFrame API, and will provide a stable version for people to try soon. Thanks, -Jason On Wed, Mar 11, 2015 at 9:12 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao hao.ch...@intel.com wrote: Intel has a prototype for doing this, SaiSai and Jason

Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
Sorry typo; should be https://github.com/intel-spark/stream-sql Thanks, -Jason On Wed, Mar 11, 2015 at 10:19 PM, Irfan Ahmad ir...@cloudphysics.com wrote: Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics* http

Re: Speed Benchmark

2015-02-27 Thread Jason Bell
How many machines are on the cluster? And what is the configuration of those machines (Cores/RAM)? "Small cluster" is a very subjective statement. Guillaume Guy wrote: Dear Spark users: I want to see if anyone has an idea of the performance for a small cluster.

Re: Running Example Spark Program

2015-02-22 Thread Jason Bell
If you would like a more detailed walkthrough, I wrote one recently. https://dataissexy.wordpress.com/2015/02/03/apache-spark-standalone-clusters-bigdata-hadoop-spark/ Regards Jason Bell On 22 Feb 2015 14:16, VISHNU SUBRAMANIAN johnfedrickena...@gmail.com wrote: Try restarting your Spark

Is it possible to store graph directly into HDFS?

2014-12-30 Thread Jason Hong
suggestion. Best regards. Jason Hong -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-store-graph-directly-into-HDFS-tp20908.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Is it possible to store graph directly into HDFS?

2014-12-30 Thread Jason Hong
Thanks for your answer, Xuefeng Wu. But, I don't understand how to save a graph as object. :( Do you have any sample codes? 2014-12-31 13:27 GMT+09:00 Jason Hong begger3...@gmail.com: Thanks for your answer, Xuefeng Wu. But, I don't understand how to save a graph as object. :( Do you have
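
A minimal sketch of one way to do it (not necessarily the answer given in the thread): persist the graph's two underlying RDDs as object files and rebuild the Graph on load. The attribute types (String, Int), the HDFS paths, and the spark-shell sc are assumptions.

    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    // graph: Graph[String, Int]
    graph.vertices.saveAsObjectFile("hdfs:///graphs/g1/vertices")
    graph.edges.saveAsObjectFile("hdfs:///graphs/g1/edges")

    // Reload later and reassemble:
    val vertices = sc.objectFile[(VertexId, String)]("hdfs:///graphs/g1/vertices")
    val edges    = sc.objectFile[Edge[Int]]("hdfs:///graphs/g1/edges")
    val restored = Graph(vertices, edges)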

Applications status missing when Spark HA(zookeeper) enabled

2014-09-11 Thread jason chen
Hi guys, I configured Spark with the configuration in spark-env.sh: export SPARK_DAEMON_JAVA_OPTS=-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=host1:2181,host2:2181,host3:2181 -Dspark.deploy.zookeeper.dir=/spark And I started spark-shell on one master host1(active):

RE: EC2 Cluster script. Shark install fails

2014-07-11 Thread Jason H
-on-spark.html http://spark.apache.org/docs/latest/sql-programming-guide.html On Thu, Jul 10, 2014 at 5:51 AM, Jason H jas...@developer.net.nz wrote: Hi Just going through the process of installing Spark 1.0.0 on EC2 and noticed that the script throws an error when installing shark. Unpacking