Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Jay Han
+1. It sounds awesome! Kiran Kumar Dusi wrote on Thu, 21 Mar 2024 at 14:16: > +1 > > On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri < > farsheed.asho...@gmail.com> wrote: >> +1 >> >> On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, >> wrote: >>> Some of you may be aware that Databricks community Home |

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
wrote: >> >>> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is >>> supported on GCS? IIRC it wasn't, but you could check with GCP support >>> >>> >>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev >>> wr

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch. The definitions for these flags are available here - https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote: > No, I am using Spark
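
A minimal sketch of how these connector properties could be passed through a Spark session; the property names come from the linked CONFIGURATION.md, while the values and app name are illustrative assumptions rather than recommendations from the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("gcs-committer-tuning")                              // placeholder name
      .config("spark.hadoop.fs.gs.batch.threads", "32")             // forwarded to the GCS connector
      .config("spark.hadoop.fs.gs.max.requests.per.batch", "30")    // tune to your workload
      .getOrCreate()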

Re: Spark 3 + Delta 0.7.0 Hive Metastore Integration Question

2020-12-20 Thread Jay
I think I found the issue: Hive metastore 2.3.6 doesn't have the necessary support. After upgrading to Hive 3.1.2 I was able to run the select query. On Sun, 20 Dec 2020 at 12:00, Jay wrote: > Thanks Matt. > > I have set the two configs in my sparkConfig as below >
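
A sketch of how a Spark 3 session can be pointed at an external Hive 3.1.2 metastore; the thrift URI and jar setting are placeholders and not the exact configuration used in the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.sql.hive.metastore.version", "3.1.2")
      .config("spark.sql.hive.metastore.jars", "maven")             // or a path to the 3.1.2 client jars
      .config("spark.hadoop.hive.metastore.uris", "thrift://metastore-host:9083")  // placeholder URI
      .enableHiveSupport()
      .getOrCreate()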

Re: Spark 3 + Delta 0.7.0 Hive Metastore Integration Question

2020-12-19 Thread Jay
"SELECT * FROM $tableName") res2: org.apache.spark.sql.DataFrame = [col1: int] But when I try to do .show() it returns an error scala> spark.sql(s"SELECT * FROM $tableName").show() org.apache.spark.sql.AnalysisException: Table does not support reads: default.tblname_3;

Spark 3 + Delta 0.7.0 Hive Metastore Integration Question

2020-12-19 Thread Jay
Hi All - I have currently set up a Spark 3.0.1 cluster with delta version 0.7.0 which is connected to an external hive metastore. I run the below set of commands: val tableName = "tblname_2" spark.sql(s"CREATE TABLE $tableName(col1 INTEGER) USING delta options(path='GCS_PATH')") *20/12/19

Spark-hive integration on HDInsight

2019-02-20 Thread Jay Singh
I am trying to integrate spark with hive on HDInsight spark cluster . I copied hive-site.xml in spark/conf directory. In addition I added hive metastore properties like jdbc connection info on Ambari as well. But still the database and tables created using spark-sql are not visible in hive.
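
A small sanity check, sketched from a generic spark-shell session, to confirm which metastore Spark is actually talking to; nothing here is specific to the original HDInsight setup:

    // Run inside spark-shell, where `spark` already exists.
    // An empty/missing value usually means Spark fell back to a local metastore,
    // which would explain tables not showing up in Hive.
    println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.uris"))
    spark.sql("SHOW DATABASES").show()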

Re: Reg:- Py4JError in Windows 10 with Spark

2018-06-06 Thread Jay
Are you running this in local mode or cluster mode ? If you are running in cluster mode have you ensured that numpy is present on all nodes ? On Tue 5 Jun, 2018, 2:43 AM @Nandan@, wrote: > Hi , > I am getting error :- > >

Re: Dataframe from 1.5G json (non JSONL)

2018-06-06 Thread Jay
I might have missed it but can you tell if the OOM is happening in driver or executor ? Also it would be good if you can post the actual exception. On Tue 5 Jun, 2018, 1:55 PM Nicolas Paris, wrote: > IMO your json cannot be read in parallell at all then spark only offers > you > to play again
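
For context, a non-JSONL file has to be read as a single multi-line document, which is why it cannot be split across tasks. A minimal sketch, with a placeholder path (the multiLine option exists in Spark 2.2+):

    // One task reads the whole file, so that executor (or the driver, if collected)
    // needs enough memory for the full document.
    val df = spark.read
      .option("multiLine", "true")
      .json("/data/big-file.json")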

Re: spark partitionBy with partitioned column in json output

2018-06-04 Thread Jay
The partitionBy clause is used to create hive folders so that you can point a hive partitioned table on the data . What are you using the partitionBy for ? What is the use case ? On Mon 4 Jun, 2018, 4:59 PM purna pradeep, wrote: > im reading below json in spark > > {"bucket": "B01",
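
A short sketch of what partitionBy does on write, loosely based on the "bucket" field from the quoted JSON; the output path is a placeholder:

    // Produces hive-style directories such as .../bucket=B01/part-*.json
    df.write
      .partitionBy("bucket")
      .json("s3a://some-bucket/output")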

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Jay
Can you tell us what version of Spark you are using and if Dynamic Allocation is enabled ? Also, how are the files being read ? Is it a single read of all files using a file matching regex or are you running different threads in the same pyspark job? On Mon 4 Jun, 2018, 1:27 PM Shuporno

Re: Append In-Place to S3

2018-06-02 Thread Jay
Benjamin, The append will append the "new" data to the existing data without removing the duplicates. You would need to overwrite the file every time if you need unique values. Thanks, Jayadeep On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim wrote: > I have a situation where I am trying to add only new
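
A sketch of the overwrite-with-deduplication approach being suggested; the paths, the key column, and `newData` are illustrative, and writing to a separate location avoids overwriting the data while it is still being read:

    val existing = spark.read.parquet("s3a://bucket/table")
    val merged   = existing.union(newData).dropDuplicates("id")   // "id" is a placeholder key
    merged.write.mode("overwrite").parquet("s3a://bucket/table_new")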

Re: How does partitioning happen for binary files in spark ?

2017-04-06 Thread Jay
The code that you see in github is for version 2.1. For versions below that the default partitions for binary files is set to 2 which you can change by using the minPartitions value. I am not sure starting 2.1 how the minPartitions column will work because as you said the field is completely
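
A minimal sketch of the minPartitions hint mentioned above; the path is a placeholder:

    // On the versions discussed, binary files defaulted to 2 partitions unless overridden.
    val blobs = sc.binaryFiles("hdfs:///data/blobs", minPartitions = 16)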

Re: spark 2.0 in intellij

2016-08-09 Thread Michael Jay
Hi, The problem has been solved simply by updating the Scala SDK version from the incompatible 2.10.x to the correct 2.11.x. From: Michael Jay <mich@outlook.com> Sent: Tuesday, August 9, 2016 10:11:12 PM To: user@spark.apache.org Subject: spa

spark 2.0 in intellij

2016-08-09 Thread Michael Jay
Dear all, I am a newbie to Spark. Currently I am trying to import the source code of Spark 2.0 as a Module to an existing client project. I have imported Spark-core, Spark-sql and Spark-catalyst as maven dependencies in this client project. During compilation errors as missing

JavaSparkContext: dependency on ui/

2016-06-27 Thread jay vyas
arkContext.(JavaSparkContext.scala:58) -- jay vyas

Spark jobs without a login

2016-06-16 Thread jay vyas
not too worried about this - but it seems like it might be nice if maybe we could specify a user name as part of Spark's context or as part of an external parameter rather than having to use the Java-based user/group extractor. -- jay vyas

Re: Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread Jay Luan
Thank you, that helps a lot. On Mon, Feb 22, 2016 at 6:01 PM, Takeshi Yamamuro <linguin@gmail.com> wrote: > You're correct, reduceByKey is just an example. > > On Tue, Feb 23, 2016 at 10:57 AM, Jay Luan <jaylu...@gmail.com> wrote: > >> Could you elaborate on

Re: Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread Jay Luan
Could you elaborate on how this would work? So from what I can tell, this maps a key to a tuple which always has a 0 as the second element. From there the hash widely changes because we now hash something like ((1,4), 0) and ((1,3), 0). Thus mapping this would create more even partitions. Why
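
A sketch of the technique being discussed: pairing the whole (key, value) entry with a dummy value so the HashPartitioner hashes the entire entry rather than just the key. The data here is made up:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq((1, 4), (1, 3), (2, 7)))
    val spread = pairs
      .map(kv => (kv, 0))                     // ((1,4), 0), ((1,3), 0), ...
      .partitionBy(new HashPartitioner(8))
      .map(_._1)                              // drop the dummy value afterwards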

RE: [MLLIB] Best way to extract RandomForest decision splits

2016-02-10 Thread Jay Luan
Thanks for the reply, I'd like to export the decision splits for each tree out to an external file which is read elsewhere not using spark. As far as I know, saving a model to a path will save a bunch of binary files which can be loaded back into spark. Is this correct? On Feb 10, 2016 7:21 PM,
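
One possible way (not necessarily what the thread settled on) to get the splits out in plain text is MLlib's toDebugString, written to a local file on the driver; the `model` variable and output path are assumptions:

    import java.io.PrintWriter

    val pw = new PrintWriter("/tmp/forest_splits.txt")
    pw.write(model.toDebugString)   // human-readable split conditions for every tree
    pw.close()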

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-08 Thread Shipper, Jay [USA]
determine the root cause, as I cannot replicate this issue. Thanks for your help. From: Ted Yu <yuzhih...@gmail.com> Date: Friday, February 5, 2016 at 5:40 PM To: Jay Shipper <shipper_...@bah.com> Cc: "user

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
could confirm they’re having this issue with Spark 1.6.0. Ideally, we should also have some simple proof of concept that can be posted with the bug. From: Ted Yu <yuzhih...@gmail.com> Date: Wednesday, February 3, 2016 at 3:57 PM To: Jay Shipper <ship

Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
to match what Spark 1.6.0 uses (2.6.0-cdh5.7.0-SNAPSHOT). Thanks, Jay

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
016 at 12:04 PM To: Jay Shipper <shipper_...@bah.com> Cc: "user@spark.apache.org" <user@spark.apache.org> Subject: [External] Re: Spark 1.6.0 HiveContext NPE Looks li

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
One quick update on this: The NPE is not happening with Spark 1.5.2, so this problem seems specific to Spark 1.6.0. From: Jay Shipper <shipper_...@bah.com> Date: Wednesday, February 3, 2016 at 12:06 PM To: "user@spark.apache.org

Re: How to run two operations on the same RDD simultaneously

2015-11-25 Thread Jay Luan
Ah, thank you so much, this is perfect On Fri, Nov 20, 2015 at 3:48 PM, Ali Tajeldin EDU wrote: > You can try to use an Accumulator ( > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulator) > to keep count in map1. Note that the final
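
A sketch of the accumulator suggestion, using the newer longAccumulator API (Spark 2.0+) rather than the 1.x Accumulator class linked in the reply; the transformation and output path are placeholders:

    val counter = sc.longAccumulator("map1-count")
    val mapped = rdd.map { x =>
      counter.add(1)        // counted as a side effect of the first operation
      x
    }
    mapped.saveAsTextFile("/tmp/out")            // an action must run before the value is meaningful
    println(s"records seen: ${counter.value}")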

Re: insert overwrite table phonesall in spark-sql resulted in java.io.StreamCorruptedException

2015-08-20 Thread John Jay
The answer is that my table was not serialized by kyro,but I started spark-sql shell with kyro,so the data could not be deserialized。 -- View this message in context:

Re: Unit Testing

2015-08-13 Thread jay vyas
a producer and a consumer, so that you don't get a starvation scenario. On Wed, Aug 12, 2015 at 7:31 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a way to run spark streaming methods in standalone eclipse environment to test out the functionality? -- jay vyas

Re: Amazon DynamoDB Spark

2015-08-07 Thread Jay Vyas
In general the simplest way is that you can use the Dynamo Java API as is and call it inside a map(), and use the asynchronous put() Dynamo api call . On Aug 7, 2015, at 9:08 AM, Yasemin Kaya godo...@gmail.com wrote: Hi, Is there a way using DynamoDB in spark application? I have to
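
A rough sketch of that pattern with the AWS SDK v1 DynamoDB client, using foreachPartition so one client is created per partition rather than per record; the table name, attribute, and client construction are assumptions, and `rdd` is assumed to hold strings:

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
    import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
    import scala.collection.JavaConverters._

    rdd.foreachPartition { records =>
      val client = new AmazonDynamoDBClient()   // picks up credentials from the default chain
      records.foreach { value =>
        val item = Map("id" -> new AttributeValue(value)).asJava
        client.putItem(new PutItemRequest("my-table", item))
      }
    }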

Re: How to build Spark with my own version of Hadoop?

2015-07-22 Thread jay vyas
, 2015 at 11:11 PM, Dogtail Ray spark.ru...@gmail.com wrote: Hi, I have modified some Hadoop code, and want to build Spark with the modified version of Hadoop. Do I need to change the compilation dependency files? How to then? Great thanks! -- jay vyas

insert overwrite table phonesall in spark-sql resulted in java.io.StreamCorruptedException

2015-07-02 Thread John Jay
My spark-sql command: spark-sql --driver-memory 2g --master spark://hadoop04.xx.xx.com:8241 --conf spark.driver.cores=20 --conf spark.cores.max=20 --conf spark.executor.memory=2g --conf spark.driver.memory=2g --conf spark.akka.frameSize=500 --conf spark.eventLog.enabled=true --conf

Re: Spark Streaming on top of Cassandra?

2015-05-21 Thread jay vyas
For additional commands, e-mail: user-h...@spark.apache.org -- jay vyas

Re: Spark Streaming + Kafka failure recovery

2015-05-21 Thread Bill Jay
/StreamingContext.html. The information on consumed offset can be recovered from the checkpoint. On Tue, May 19, 2015 at 2:38 PM, Bill Jay bill.jaypeter...@gmail.com wrote: If a Spark streaming job stops at 12:01 and I resume the job at 12:02. Will it still start to consume the data that were
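
A sketch of the checkpoint-based recovery being referenced, using StreamingContext.getOrCreate; the checkpoint path, batch interval, and stream setup are placeholders, and `sparkConf` is assumed to be defined elsewhere:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(sparkConf, Seconds(60))
      ssc.checkpoint("hdfs:///checkpoints/app")
      // build the Kafka direct stream and output operations here
      ssc
    }

    // On restart this rebuilds the context (and consumed offsets) from the checkpoint if one exists.
    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/app", createContext _)
    ssc.start()
    ssc.awaitTermination()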

Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
Hi all, I am currently using Spark streaming to consume and save logs every hour in our production pipeline. The current setting is to run a crontab job to check every minute whether the job is still there and if not resubmit a Spark streaming job. I am currently using the direct approach for

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
19, 2015 at 12:42 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am currently using Spark streaming to consume and save logs every hour in our production pipeline. The current setting is to run a crontab job to check every minute whether the job is still there and if not resubmit

Partition number of Spark Streaming Kafka receiver-based approach

2015-05-18 Thread Bill Jay
Hi all, I am reading the docs of receiver-based Kafka consumer. The last parameters of KafkaUtils.createStream is per topic number of Kafka partitions to consume. My question is, does the number of partitions for topic in this parameter need to match the number of partitions in Kafka. For
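
For reference, a sketch of the receiver-based call in question; per the API docs the Map value is the number of consumer threads for that topic inside the receiver, not the number of Kafka partitions (hosts and names are placeholders):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val stream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",          // ZooKeeper quorum
      "my-consumer-group",
      Map("my-topic" -> 2))    // threads used to consume this topic in this receiver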

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
to do with it. On Wed, Apr 29, 2015 at 6:50 PM, Bill Jay bill.jaypeter...@gmail.com wrote: This function is called in foreachRDD. I think it should be running in the executors. I add the statement.close() in the code and it is running. I will let you know if this fixes the issue. On Wed

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
each batch. On Thu, Apr 30, 2015 at 11:15 AM, Cody Koeninger c...@koeninger.org wrote: Did you use lsof to see what files were opened during the job? On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote: The data ingestion is in outermost portion in foreachRDD block

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
:07 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am using the direct approach to receive real-time data from Kafka in the following link: https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html My code follows the word count direct example: https://github.com

Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
Hi all, I am using the direct approach to receive real-time data from Kafka in the following link: https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html My code follows the word count direct example:

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
...@gmail.com wrote: Maybe add statement.close() in finally block ? Streaming / Kafka experts may have better insight. On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Thanks for the suggestion. I ran the command and the limit is 1024. Based on my understanding
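
A sketch of the close-in-finally advice, with the JDBC resources opened per partition inside foreachRDD; the connection details and SQL are placeholders:

    import java.sql.DriverManager

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = DriverManager.getConnection(jdbcUrl, user, password)  // placeholders
        val stmt = conn.createStatement()
        try {
          records.foreach(r => stmt.executeUpdate(s"INSERT INTO counts VALUES ('$r')"))
        } finally {
          stmt.close()   // always release, even on failure, to avoid leaking file handles
          conn.close()
        }
      }
    }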

Re: Re: spark streaming printing no output

2015-04-16 Thread jay vyas
. Please let me know If I am doing something wrong. -- jay vyas

Re: org.apache.spark.ml.recommendation.ALS

2015-04-13 Thread Jay Katukuri
2.11.2. Could it be that spark-1.3.0-bin-hadoop2.4.tgz requires a different version of scala ? Thanks, Jay On Apr 9, 2015, at 4:38 PM, Xiangrui Meng men...@gmail.com wrote: Could you share ALSNew.scala? Which Scala version did you use? -Xiangrui On Wed, Apr 8, 2015 at 4:09 PM, Jay

Re: org.apache.spark.ml.recommendation.ALS

2015-04-08 Thread Jay Katukuri
-submit from my downloaded spark-1.3.0. On Apr 6, 2015, at 1:37 PM, Jay Katukuri jkatuk...@apple.com wrote: Here is the command that I have used : spark-submit --class packagename.ALSNew --num-executors 100 --master yarn ALSNew.jar -jar spark-sql_2.11-1.3.0.jar hdfs://input_path Btw - I

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
-Xiangrui On Mon, Apr 6, 2015 at 12:27 PM, Jay Katukuri jkatuk...@apple.com wrote: Hi, Here is the stack trace: Exception in thread main java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror; at ALSNew

org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
= purchase.map ( line => line.split(',') match { case Array(user, item, rate) => (user.toInt, item.toInt, rate.toFloat) }).toDF() Any help is appreciated ! I have tried passing the spark-sql jar using the -jar spark-sql_2.11-1.3.0.jar Thanks, Jay On Mar 17, 2015, at 12:50 PM

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Thanks, Jay On Apr 6, 2015, at 12:24 PM, Xiangrui Meng men...@gmail.com wrote: Please attach the full stack trace. -Xiangrui On Mon, Apr 6, 2015 at 12:06 PM, Jay

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-02 Thread jay vyas
. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- jay vyas

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread jay vyas
Just the same as spark was disrupting the hadoop ecosystem by changing the assumption that you can't rely on memory in distributed analytics...now maybe we are challenging the assumption that big data analytics need to be distributed? I've been asking the same question lately and seen similarly that

Re: Untangling dependency issues in spark streaming

2015-03-29 Thread jay vyas
(PoolingHttpClientConnectionManager.java:114) -- jay vyas

Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Jay Vyas
-https://wiki.apache.org/incubator/IgniteProposal has, I think, been updated recently and has a good comparison. - Although GridGain has been around since the spark days, Apache Ignite is quite new and just getting started I think so - you will probably want to reach out to the developers

Re: Spark Streaming and message ordering

2015-02-18 Thread jay vyas
! -- jay vyas

Re: Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
Ah, nevermind, I just saw http://spark.apache.org/docs/1.2.0/sql-programming-guide.html (language integrated queries) which looks quite similar to what i was thinking about. I'll give that a whirl... On Wed, Feb 11, 2015 at 7:40 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi spark

Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
).by(product,meta=product.id=meta.id). toSchemaRDD ? I know the above snippet is totally wacky but, you get the idea :) -- jay vyas

SparkSQL DateTime

2015-02-09 Thread jay vyas
just for dealing with time stamps. What's the simplest and cleanest way to map non-Spark time values into SparkSQL-friendly time values? - One option could be a custom SparkSQL type, I guess? - Any plan to have native spark sql support for Joda Time or (yikes) java.util.Calendar? -- jay vyas
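
One common workaround from that era, sketched here, was converting Joda (or Calendar) values to java.sql.Timestamp before handing rows to Spark SQL; the case class, field names, and `raw` RDD are made up:

    import java.sql.Timestamp
    import org.joda.time.DateTime

    case class Event(id: Int, ts: Timestamp)
    // `raw` is assumed to be an RDD of (Int, DateTime) pairs
    val events = raw.map { case (id, dt: DateTime) => Event(id, new Timestamp(dt.getMillis)) }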

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-02-03 Thread Jay Hutfles
performing any operations on it. That way, the EdgeRDDImpl class doesn't have to use the default partitioner. Hope this helps! Jay On Tue Feb 03 2015 at 12:35:14 AM NicolasC nicolas.ch...@inria.fr wrote: On 01/29/2015 08:31 PM, Ankur Dave wrote: Thanks for the reminder. I just created a PR

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-29 Thread Jay Hutfles
Just curious, is this set to be merged at some point? On Thu Jan 22 2015 at 4:34:46 PM Ankur Dave ankurd...@gmail.com wrote: At 2015-01-22 02:06:37 -0800, NicolasC nicolas.ch...@inria.fr wrote: I try to execute a simple program that runs the ShortestPaths algorithm

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Jay Vyas
It's a very valid idea indeed, but... It's a tricky subject since the entire ASF is run on mailing lists, hence there are so many different but equally sound ways of looking at this idea, which conflict with one another. On Jan 21, 2015, at 7:03 AM, btiernay btier...@hotmail.com wrote: I

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Jay Vyas
I find importing a working SBT project into IntelliJ is the way to go. How did you load the project into intellij? On Jan 13, 2015, at 4:45 PM, Enno Shioji eshi...@gmail.com wrote: Had the same issue. I can't remember what the issue was but this works: libraryDependencies ++= {

Re: Spark Streaming Threading Model

2014-12-19 Thread jay vyas
it just process them? Asim -- jay vyas

spark-shell bug with RDD distinct?

2014-12-19 Thread Jay Hutfles
. Is this something I just have to live with when using the REPL? Or is this covered by something bigger that's being addressed? Thanks in advance -Jay

spark-shell bug with RDDs and case classes?

2014-12-19 Thread Jay Hutfles
Found a problem in the spark-shell, but can't confirm that it's related to open issues on Spark's JIRA page. I was wondering if anyone could help identify if this is an issue or if it's already being addressed. Test: (in spark-shell) case class Person(name: String, age: Int) val peopleList =

Re: Unit testing and Spark Streaming

2014-12-12 Thread Jay Vyas
https://github.com/jayunit100/SparkStreamingCassandraDemo On this note, I've built a framework which is mostly pure so that functional unit tests can be run composing mock data for Twitter statuses, with just regular junit... That might be relevant also. I think at some point we should come

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Jay Vyas
Here's an example of a Cassandra ETL that you can follow which should exit on its own. I'm using it as a blueprint for revolving spark streaming apps on top of. For me, I kill the streaming app with System.exit after a sufficient amount of data is collected. That seems to work for most any

GraphX Pregel halting condition

2014-12-03 Thread Jay Hutfles
I'm trying to implement a graph algorithm that does a form of path searching. Once a certain criteria is met on any path in the graph, I wanted to halt the rest of the iterations. But I can't see how to do that with the Pregel API, since any vertex isn't able to know the state of other arbitrary

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Bill Jay
is a must. (And soon we will have that on the metrics report as well!! :-) ) -kr, Gerard. On Thu, Nov 27, 2014 at 8:14 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi TD, I am using Spark Streaming to consume data from Kafka and do some aggregation and ingest the results into RDS. I

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
Just to add one more point. If Spark streaming knows when the RDD will not be used any more, I believe Spark will not try to retrieve data it will not use any more. However, in practice, I often encounter the "cannot compute split" error. Based on my understanding, this is because Spark cleared out

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
this, is by asking Spark Streaming to remember stuff for longer, by using streamingContext.remember(duration). This will ensure that Spark Streaming will keep around all the stuff for at least that duration. Hope this helps. TD On Wed, Nov 26, 2014 at 5:07 PM, Bill Jay bill.jaypeter...@gmail.com
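
The remember() call mentioned above, as a one-line sketch (the interval is a placeholder):

    import org.apache.spark.streaming.Minutes

    ssc.remember(Minutes(5))   // keep generated RDDs around for at least 5 minutes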

Re: How to execute a custom python library on spark

2014-11-25 Thread jay vyas
if one can point an example library and how to run it :) Thanks -- jay vyas

Re: Error when Spark streaming consumes from Kafka

2014-11-23 Thread Bill Jay
On Sun, Nov 23, 2014 at 2:13 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am using Spark to consume from Kafka. However, after the job has run for several hours, I saw the following failure of an executor: kafka.common.ConsumerRebalanceFailedException: group-1416624735998_ip-172-31

Error when Spark streaming consumes from Kafka

2014-11-22 Thread Bill Jay
Hi all, I am using Spark to consume from Kafka. However, after the job has run for several hours, I saw the following failure of an executor: kafka.common.ConsumerRebalanceFailedException: group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31 can't rebalance after 4 retries

Re: Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Jay Vyas
This seems pretty standard: your IntelliJ classpath isn't matched to the correct ones that are used in spark shell Are you using the SBT plugin? If not how are you putting deps into IntelliJ? On Nov 20, 2014, at 7:35 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote:

Re: Spark streaming cannot receive any message from Kafka

2014-11-18 Thread Bill Jay
there’s a parameter in KafkaUtils.createStream where you can specify the Spark parallelism; also, what are the exception stacks? Thanks Jerry From: Bill Jay [mailto:bill.jaypeter...@gmail.com] Sent: Tuesday, November 18, 2014 2:47 AM To: Helena Edelson Cc: Jay Vyas; u

Spark streaming: java.io.IOException: Version Mismatch (Expected: 28, Received: 18245 )

2014-11-18 Thread Bill Jay
Hi all, I am running a Spark Streaming job. It was able to produce the correct results up to some time. Later on, the job was still running but producing no result. I checked the Spark streaming UI and found that 4 tasks of a stage failed. The error messages showed that Job aborted due to stage

Re: Spark streaming cannot receive any message from Kafka

2014-11-17 Thread Bill Jay
On Thu, Nov 13, 2014 at 5:00 AM, Helena Edelson helena.edel...@datastax.com wrote: I encounter no issues with streaming from kafka to spark in 1.1.0. Do you perhaps have a version conflict? Helena On Nov 13, 2014 12:55 AM, Jay Vyas jayunit100.apa...@gmail.com wrote: Yup , very important

Re: Does Spark Streaming calculate during a batch?

2014-11-13 Thread jay vyas
that only after all the batch data was in? Thanks -- jay vyas

Re: Streaming: getting total count over all windows

2014-11-13 Thread jay vyas
...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- jay vyas

Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Hi all, I have a Spark streaming job which constantly receives messages from Kafka. I was using Spark 1.0.2 and the job had been running for a month. However, now that I am using Spark 1.1.0, the Spark streaming job cannot receive any messages from Kafka. I have not made any change to the

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
in Spark Streaming example, you can try that. I’ve tested with latest master, it’s OK. Thanks Jerry From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Thursday, November 13, 2014 8:45 AM To: Bill Jay Cc: u...@spark.incubator.apache.org Subject: Re: Spark streaming cannot

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Jay Vyas
be local[n], n > 1 for local mode. Besides, there’s a Kafka wordcount example in Spark Streaming example, you can try that. I’ve tested with latest master, it’s OK. Thanks Jerry From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Thursday, November 13, 2014 8:45 AM To: Bill Jay Cc: u

Re: random shuffle streaming RDDs?

2014-11-03 Thread Jay Vyas
A use case would be helpful? Batches of RDDs from Streams are going to have temporal ordering in terms of when they are processed in a typical application ... , but maybe you could shuffle the way batch iterations work On Nov 3, 2014, at 11:59 AM, Josh J joshjd...@gmail.com wrote: When

Spark streaming job failed due to java.util.concurrent.TimeoutException

2014-11-03 Thread Bill Jay
Hi all, I have a spark streaming job that consumes data from Kafka and produces some simple operations on the data. This job is run in an EMR cluster with 10 nodes. The batch size I use is 1 minute and it takes around 10 seconds to generate the results that are inserted to a MySQL database.

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread jay vyas
of filtering the data collection times the #buckets? thanks, Gerard. -- jay vyas

Re: real-time streaming

2014-10-28 Thread jay vyas
-mail: user-h...@spark.apache.org -- jay vyas

Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
Hi Spark ! I found out why my RDDs weren't coming through in my spark stream. It turns out that onStart() needs to return, it seems - i.e. you need to launch the worker part of your start process in a thread. For example def onStartMock(): Unit = { val future = new Thread(new
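
A fuller sketch of that pattern for a custom receiver: onStart() returns immediately and the receive loop runs on its own thread; the data being stored is a stand-in:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MockReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      override def onStart(): Unit = {
        new Thread("mock-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store("some record")   // push data into the stream
              Thread.sleep(100)
            }
          }
        }.start()                    // onStart returns right away; work happens on the thread
      }
      override def onStop(): Unit = {}
    }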

Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
, Oct 21, 2014 at 11:02 AM, jay vyas jayunit100.apa...@gmail.com wrote: Hi Spark ! I found out why my RDDs weren't coming through in my spark stream. It turns out that onStart() needs to return, it seems - i.e. you need to launch the worker part of your start process in a thread

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread jay vyas
be very useful. -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com -- jay vyas

Streams: How do RDDs get Aggregated?

2014-10-11 Thread jay vyas
Hi spark ! I don't quite understand the semantics of RDDs in a streaming context very well yet. Are there any examples of how to implement CustomInputDStreams, with corresponding Receivers, in the docs? I've hacked together a custom stream, which is being opened and is consuming data

Re: Does Ipython notebook work with spark? trivial example does not work. Re: bug with IPython notebook?

2014-10-10 Thread jay vyas
=pyStreamingSparkRDDPipe") data = [1, 2, 3, 4, 5] rdd = sc.parallelize(data) def echo(data): print "python received: %s" % (data) # output winds up in the shell console in my cluster (i.e. the machine I launched pyspark from) rdd.foreach(echo) print "we are done" -- jay vyas

Re: Spark inside Eclipse

2014-10-03 Thread jay vyas
AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io -- jay vyas

Re: questions about MLLib recommendation models

2014-08-08 Thread Jay Hutfles
suggested: val predictions = model.predict(data.map(x => (x.user, x.product))) Best, Xiangrui On Thu, Aug 7, 2014 at 1:20 PM, Burak Yavuz bya...@stanford.edu wrote: Hi Jay, I've had the same problem you've been having in Question 1 with a synthetic dataset. I thought I wasn't producing

questions about MLLib recommendation models

2014-08-07 Thread Jay Hutfles
for the long ramble. Jay

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread jay vyas
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unit-Testing-JUnit-with-Spark-tp10861.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- jay vyas

Re: combineByKey at ShuffledDStream.scala

2014-07-23 Thread Bill Jay
, 2014 at 10:05 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you give an idea of the streaming program? Rest of the transformation you are doing on the input streams? On Tue, Jul 22, 2014 at 11:05 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am currently running

Re: Spark Streaming: no job has started yet

2014-07-23 Thread Bill Jay
ak...@sigmoidanalytics.com wrote: Can you paste the piece of code? Thanks Best Regards On Wed, Jul 23, 2014 at 1:22 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am running a spark streaming job. The job hangs on one stage, which shows as follows: Details for Stage 4

Get Spark Streaming timestamp

2014-07-23 Thread Bill Jay
Hi all, I have a question regarding Spark streaming. When we use the saveAsTextFiles function and my batch is 60 seconds, Spark will generate a series of files such as: result-140614896, result-140614802, result-140614808, etc. I think this is the timestamp for the beginning of each
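
A sketch of one way to get the batch timestamp directly instead of parsing file names: the two-argument form of foreachRDD passes in the batch Time (the stream name and output path are placeholders):

    stream.foreachRDD { (rdd, time) =>
      val batchMillis = time.milliseconds            // start time of this batch
      rdd.saveAsTextFile(s"hdfs:///results/result-$batchMillis")
    }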

Re: Get Spark Streaming timestamp

2014-07-23 Thread Bill Jay
:39 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a question regarding Spark streaming. When we use the saveAsTextFiles function and my batch is 60 seconds, Spark will generate a series of files such as: result-140614896, result-140614802, result-140614808, etc
