question about spark streaming
Hi guys, I have a question about Spark Streaming.

An application keeps sending transaction records into a Spark stream at about 50k TPS. Each record represents a sale and contains customer id, product id, time, and price columns. The application must monitor the price movement of each product: for example, if the price of a product increases 10% within 3 minutes, it sends an alert to the end user. The batch interval is required to be 1 second, and the window is somewhere between 180 and 300 seconds.

The issue is that I have to compare the price of each incoming transaction (about 10k distinct products in total) against the lowest/highest price for the same product over the entire past 180 seconds. That means every single second I have to loop through 50k transactions and compare prices for the same product across the whole 180-second window.

So it seems I have to partition the calculation by product id, so that each worker only processes a certain subset of products. If I can make sure the same product id always goes to the same worker, there is no need to shuffle data between workers for each comparison. Otherwise, if each transaction has to be compared against RDDs spread across multiple workers, I suspect it won't be fast enough for the requirement.

Does anyone know how to route each transaction record to a specific worker node based on its product id, in order to avoid a massive shuffle? If I simply make the product id the key and the price the value, reduceByKeyAndWindow may cause a massive shuffle and slow down the whole throughput. Am I correct?

Thanks
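To make the question concrete, here is a minimal untested sketch of the direction I'm thinking of (the 32-partition count, the DStream name, and the alert logic are my own placeholders). Passing the same Partitioner to reduceByKeyAndWindow should keep a given product id on a stable partition across batches, so the per-batch reduced RDDs are co-partitioned and the window combine should not need to move the whole window's data around. Note that min/max have no inverse, so the more efficient invertible overload of reduceByKeyAndWindow does not apply here:

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// transactions: (productId, price) pairs arriving at ~50k/sec (name is a placeholder)
def monitorPrices(transactions: DStream[(String, Double)]): Unit = {
  // Assumption: 32 partitions; reusing one fixed partitioner keeps each
  // product id on the same partition from batch to batch.
  val partitioner = new HashPartitioner(32)

  // Track (min, max) per product over a 180s window sliding every 1s.
  val minMax = transactions
    .mapValues(p => (p, p))
    .reduceByKeyAndWindow(
      (a: (Double, Double), b: (Double, Double)) =>
        (math.min(a._1, b._1), math.max(a._2, b._2)),
      Seconds(180), Seconds(1), partitioner)

  // Alert when the window's high is at least 10% above its low.
  minMax
    .filter { case (_, (lo, hi)) => hi >= lo * 1.10 }
    .foreachRDD(_.foreach { case (id, (lo, hi)) =>
      println(s"ALERT: product $id moved from $lo to $hi")
    })
}

Is this partitioner-based approach the right way to pin keys to workers, or is there a better mechanism?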
question about sparksql caching
Hi all,

We are planning to use Spark SQL in a DW system, and we have a question about its caching mechanism. For example, given a statement like

sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key = T2.key group by c1").cache()

does it cache the final result or the raw data of each table used in the SQL? Since users may run a wide variety of SQLs against those tables, if only the final result is cached, a brand-new SQL may still take a very long time to scan the entire tables. If that is the case, is there a better way to cache the base tables instead of the final result?

Thanks
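For context, here is a minimal sketch of what I mean by caching the base tables, using sqlContext.cacheTable (assuming T1 and T2 are already registered as tables):

// Cache the base tables in Spark SQL's in-memory columnar store,
// so any subsequent query over T1/T2 reads the cached data.
sqlContext.cacheTable("T1")
sqlContext.cacheTable("T2")

// .cache() on the result would only cache this one query's output;
// with the tables cached above, even a brand-new SQL avoids a full rescan.
val agg = sqlContext.sql(
  "select c1, sum(c2) from T1, T2 where T1.key = T2.key group by c1")

Is this the recommended approach, or is there something better for a DW workload?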
what is the best way to transfer data from RDBMS to spark?
If I run Spark in standalone mode (not YARN mode), is there any tool like Sqoop that is able to transfer data from an RDBMS into Spark storage?

Thanks
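One built-in option I have found is Spark's JDBC data source (there is also the lower-level JdbcRDD). A minimal sketch, where the MySQL URL, credentials, and table name are placeholders and the JDBC driver jar is assumed to be on the classpath:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // assumes an existing SparkContext sc

// Pull a table from the RDBMS over JDBC into a DataFrame.
val sales = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:mysql://dbhost:3306/dw?user=etl&password=secret",
  "dbtable" -> "sales"))

sales.registerTempTable("sales")
sqlContext.cacheTable("sales") // optionally pin it in memory for reuse

Is there anything more Sqoop-like for bulk transfer, or is JDBC the usual route?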
how to specify multiple masters in sbin/start-slaves.sh script?
Hey guys,

Not sure if I'm the only one who hit this. We are building a highly-available standalone Spark environment, using ZooKeeper with 3 masters in the cluster. However, sbin/start-slaves.sh calls start-slave.sh for each member of the conf/slaves file and specifies the master using $SPARK_MASTER_IP and $SPARK_MASTER_PORT:

exec $sbin/slaves.sh cd $SPARK_HOME \; $sbin/start-slave.sh 1 spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT

But to specify more than one master node, I have to use the format:

spark://host1:port1,host2:port2,host3:port3

In this case, it seems the original sbin/start-slaves.sh can't do the trick. Does everyone need to modify the script in order to build an HA cluster, or is there something I missed?

Thanks
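As a stop-gap I am considering bypassing start-slaves.sh and launching each worker directly with the full multi-master URL that start-slave.sh already accepts (untested; host names below are placeholders):

# On each worker host: pass the comma-separated HA master list directly,
# instead of relying on the single SPARK_MASTER_IP/SPARK_MASTER_PORT pair.
$SPARK_HOME/sbin/start-slave.sh 1 spark://host1:7077,host2:7077,host3:7077

Is there a supported way to get start-slaves.sh itself to do this?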
sparksql native jdbc driver
Hey guys,

In my understanding, Spark SQL only supports JDBC connections through the Hive Thrift server. Is this correct?

Thanks
building all modules in spark by mvn
Guys, is there an easier way to build all modules with mvn? Right now, if I run "mvn package" in the Spark root directory, I get:

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ................... SUCCESS [ 8.327 s]
[INFO] Spark Project Networking ................... SKIPPED
[INFO] Spark Project Shuffle Streaming Service .... SKIPPED
[INFO] Spark Project Core ......................... SKIPPED
[INFO] Spark Project Bagel ........................ SKIPPED
[INFO] Spark Project GraphX ....................... SKIPPED
[INFO] Spark Project Streaming .................... SKIPPED
[INFO] Spark Project Catalyst ..................... SKIPPED
[INFO] Spark Project SQL .......................... SKIPPED
[INFO] Spark Project ML Library ................... SKIPPED
…

Apparently only the parent project is built and all the child projects are skipped. I can get the SQL/streaming projects built via sbt/sbt, but if I'd like to use mvn and do not want to build each dependent module separately, is there a good way to do it?

Thanks
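For reference, the command I have seen in the Spark building guide for a full Maven build is the following (the memory settings come from those docs; I have not yet confirmed whether this cures the SKIPPED reactor above):

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m"
mvn -DskipTests clean package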
Unable to stop Worker in standalone mode by sbin/stop-all.sh
Checking the script, it seems spark-daemon.sh is unable to stop the worker:

$ ./spark-daemon.sh stop org.apache.spark.deploy.worker.Worker 1
no org.apache.spark.deploy.worker.Worker to stop

$ ps -elf | grep spark
0 S taoewang 24922 1 0 80 0 - 733878 futex_ Mar12 ? 00:08:54 java -cp /data/sequoiadb-driver-1.10.jar,/data/spark-sequoiadb-0.0.1-SNAPSHOT.jar::/data/spark/conf:/data/spark/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar -XX:MaxPermSize=128m -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=centos-151:2181,centos-152:2181,centos-153:2181 -Dspark.deploy.zookeeper.dir=/data/zookeeper -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://centos-151:7077,centos-152:7077,centos-153:7077

In the spark-daemon script it tries to find $pid in /tmp:

pid=$SPARK_PID_DIR/spark-$SPARK_IDENT_STRING-$command-$instance.pid

In my case the pid file is supposed to be /tmp/spark-taoewang-org.apache.spark.deploy.worker.Worker-1.pid. However, when I go through the files in the /tmp directory, no such file exists. I have 777 on /tmp and also successfully touched a file there with my current account, so it shouldn't be a permission issue:

$ ls -la / | grep tmp
drwxrwxrwx. 6 root root 4096 Mar 13 08:19 tmp

Does anyone have any idea why the pid file didn't show up?

Thanks
TW
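One suspicion on my side (an assumption, not a confirmed diagnosis): many distros periodically purge old files from /tmp via tmpwatch or a cron job, which would delete the pid file of a long-running worker. Since spark-daemon.sh honors SPARK_PID_DIR, pointing it at a persistent directory should sidestep that:

# In conf/spark-env.sh on every node (the path is a placeholder):
export SPARK_PID_DIR=/var/run/spark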
Re: Unable to stop Worker in standalone mode by sbin/stop-all.sh
Nope, I can see the master's pid file but not the worker's:

$ ls
bitrock_installer.log  hsperfdata_root  hsperfdata_taoewang  omatmp  sbt2435921113715137753.log  spark-taoewang-org.apache.spark.deploy.master.Master-1.pid

On Mar 13, 2015, at 9:34 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Does the machine have a cron job that periodically cleans up the /tmp dir?
> Cheers